Comparison of Image Classification Models on Varying Dataset Sizes


Hasso Plattner Institute

Master’s Thesis

Comparison of Image Classification Models on Varying Dataset Sizes

Author:

Timur Pratama Wiradarma

First Supervisor:

Dr. Harald Sack

Second Supervisor:

Christian Hentschel M.Sc.

A thesis submitted in fulfilment of the requirements

for the degree of Master of IT Systems Engineering

in the

Semantic Web Technology

Internet Technology and System

September 2015

Declaration of Authorship

I, Timur Pratama Wiradarma, declare that this thesis titled, 'Comparison of Image Classification Models on Varying Dataset Sizes' and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:


HASSO PLATTNER INSTITUTE

Abstract

Faculty Name

Internet Technology and System

Master of IT Systems Engineering

Comparison of Image Classification Models on Varying Dataset Sizes

by Timur Pratama Wiradarma

This thesis aims to compare two competing approaches for image classification, namely Bag-of-Visual-Words (BoVW) and Convolutional Neural Networks (CNNs). Recent works have shown that CNNs have surpassed hand-crafted feature extraction techniques in image classification problems. The success of CNNs can be mainly attributed to two factors: recent advances in GPU-supported computation (e.g., shorter training time and parallelization) and the availability of large training datasets for selected application scenarios. However, these factors come with a series of drawbacks: access to powerful GPUs may be limited, and assembling a large set of manually annotated training images requires a lot of resources and time. This thesis focuses on the latter aspect by analyzing the impact of the available training data on the classification accuracy obtained by BoVW and CNN. It is assumed that CNNs benefit from growing datasets while BoVW-based classifiers outperform CNNs when limited data is available. Evidence is given by experiments which utilize gradually increasing training data and visualizations of the classification models.


Contents

Declaration of Authorship i

Abstract ii

Contents iii

List of Figures v

List of Tables vi

Acronyms vii

1 Introduction 1

1.1 Problem Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Traditional Image Classification 7

2.1 Building Local Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Building Visual Vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Global Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4.1 Linear Support Vector Machine . . . . . . . . . . . . . . . . . . . . 10

2.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Convolutional Neural Networks 14

3.1 Artificial Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.1.1 Models of a Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.1.2 MultiLayer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1.3 Learning in Neural Network . . . . . . . . . . . . . . . . . . . . . . 16

3.2 Motivation of Convolutional Neural Networks . . . . . . . . . . . . . . . . 18

3.2.1 Convolution and Cross-Correlation . . . . . . . . . . . . . . . . . . 19

3.3 Types of Layer in Convolution Neural Networks . . . . . . . . . . . . . . . 20

3.3.1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3.2 Pooling Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3.3 Local Response Normalization . . . . . . . . . . . . . . . . . . . . 22


3.3.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.3.5 Softmax Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4 Backpropagation in Convolutional Neural Networks . . . . . . . . . . . . . 23

3.4.1 Gradients in Convolution Layer . . . . . . . . . . . . . . . . . . . . 24

3.4.2 Gradients in Max Pooling . . . . . . . . . . . . . . . . . . . . . . . 25

3.4.3 Gradients in Local Response Normalization . . . . . . . . . . . . . 25

3.4.4 Backpropagation Trick . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.5 Example Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4 Experimental Setup 28

4.1 ImageNet Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.1.1 Imbalanced Dataset with CNN . . . . . . . . . . . . . . . . . . . . 29

4.2 Wikipainting Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.3 Algorithm Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.3.1 Caffe Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . 35

4.3.3 Fisher Vector Encodings . . . . . . . . . . . . . . . . . . . . . . . . 36

5 Evaluation and Discussion of Results 38

5.1 Evaluation on Imagenet Dataset . . . . . . . . . . . . . . . . . . . . . . . 38

5.1.1 Visualizing Model Predictions . . . . . . . . . . . . . . . . . . . . . 40

5.1.2 Evaluation on Imbalanced Dataset . . . . . . . . . . . . . . . . . . 43

5.2 Evaluation on Wikipainting Dataset . . . . . . . . . . . . . . . . . . . . . 45

6 Conclusion and Future Work 48

6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

A Imagenet Evaluation Results 51

B Wikipainting Evaluation Results 53

Bibliography 55


List of Figures

1.1 The graph of daily numbers of uploaded and shared photos taken from 2005-2014 on 4 social network platforms . . . . . . . . . . . . . . . . . . . 2

1.2 An example of semantic gap . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 The winners of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) from 2010 to 2014 . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 A bag of visual words pipeline adapted from the similar diagram in [1]. . . 7

2.2 Building a local gradient histogram in scale-invariant feature transform . 8

2.3 A support vector machine algorithm that separates two classes . . . . . . 10

2.4 A confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.5 The precision and recall curve . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 A basic non-linear model of a neuron adapted from Haykin . . . . . . . . 15

3.2 A fully connected three-layer feed-forward neural network . . . . . . . . . 16

3.3 Neural networks in 2D spatial domain . . . . . . . . . . . . . . . . . . . . 18

3.4 A visualization of a convolutional layer . . . . . . . . . . . . . . . . . . . . 20

3.5 A rectified linear unit function removes values lower than zero . . . . . . . 21

3.6 Description of a dropout technique . . . . . . . . . . . . . . . . . . . . . . 23

3.7 An example of a convolutional neural network on a 3-channel RGB image 26

4.1 The Wikipainting dataset, a collection of painting images. . . . . . . . . . 33

4.2 Caffe Convolutional Neural Network (CNN) architecture . . . . . . . . . . 37

5.1 A comparison of Fisher Vectors (FV) and CNN Mean Average Precision (MAP) scores on the incrementing ImageNet datasets . . . . . . . . . . . 39

5.2 A comparison of FV and CNN Average Precision (AP) scores per class from their best performing model . . . . . . . . . . . . . . . . . . . . . . . 39

5.3 An image in the ImageNet muzzle class, with examples of some regions set to zero in order to visualize heatmaps. . . . . . . . . . . . . . . . . . . 41

5.4 Heatmaps of a true positive image in the muzzle category displayed using all models trained from the CNN full sets (i.e., 100%) . . . . . . . . . . . 42

5.6 A comparison of CNN default setting and CNN imbalanced setting MAP scores with different sizes of imbalanced data . . . . . . . . . . . . . . . . 43

5.5 Heatmaps computed for some of the easier classes and harder classes . . . 44

5.7 A comparison of FV, CNN trained from scratch and pre-trained CNN MAP scores on the incrementing Wikipainting datasets. . . . . . . . . . . 45

5.8 Contrasting the hardest classes and the easiest classes in the Wikipainting dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.9 A Wikipainting confusion matrix based on prediction accuracies from the CNN pre-trained model on 22 different classes. . . . . . . . . . . . . . . . 47


List of Tables

4.1 Training sets generated by randomly sub-selecting images from the 2012 ILSVRC training set according to the denoted ratio. . . . . . . . . . . . . 29

4.2 Training sets generated by randomly sub-selecting images from the October 2013 Wikipainting website according to the denoted ratio. . . . . . . 29

4.3 Easiest and hardest categories to classify based on evaluation of the mean error of the top 5 predictions from all submissions to the 2012 ILSVRC. . 30

4.4 Imbalanced training sets generated by randomly sub-selecting negative classes in addition to the 10 positive classes . . . . . . . . . . . . . . . . . 31

4.5 A list of Caffe library dependencies applied in this experiment . . . . . . . 34

4.6 The training time required for training the 7 subset splits (i.e., 5, 10, 20, 40, 60, 80 and 100% ratios) in the ImageNet and Wikipainting scenarios. . 36

A.1 Each class result of FV AP scores on the incrementing ImageNet datasets 51

A.2 Each class result of CNN AP scores on the incrementing ImageNet datasets 52

A.3 Each class result of CNN AP scores on the imbalanced ImageNet datasets . 52

B.1 Each class result of CNN and FV AP scores on the incrementing Wikipainting datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


Acronyms

ANN Artificial Neural Network.

AP Average Precision.

BoVW Bag-of-Visual-Words.

CNN Convolutional Neural Network.

DOG Difference of Gaussian.

FV Fisher Vectors.

GMM Gaussian Mixture Model.

HOG Histogram of Oriented Gradients.

IFV Improved Fisher Vectors.

ILSVRC ImageNet Large Scale Visual Recognition Challenge.

LRN Local Response Normalization.

MAP Mean Average Precision.

MLP Multi Layer Perceptron.

MSE Mean Squared Error.

PCA Principal Component Analysis.

ReLU Rectified Linear Unit.

SGD Stochastic Gradient Descent.

SIFT Scale-Invariant Feature Transform.

SVM Support Vector Machine.


Chapter 1

Introduction

Content-based image classification is an important means to address challenges of search and retrieval in large image datasets such as community and stock photo galleries. Image classification is the task of assigning images into different categories; content-based means that the classification targets intrinsic image features (e.g. intensity, color, and texture) rather than its surrounding metadata, such as keywords, tags or descriptions associated with the image. Image classification was first introduced in 1979 at the Database Techniques for Pictorial Application Conference in Florence [2]. At that time, images were manually tagged and stored in a database, to be retrieved later using keywords tagged on the images. Since that event, image classification has been used more prevalently in many application domains. On social networking platforms, such as Facebook1 and Twitter2, users tag their photos as a way to share their experiences with their close friends or even to expand their social network by inviting people who have similar interests. In the field of biomedicine, for example, image classification has been carried out as a means to aid doctors in improving diagnosis by identifying similar past cases [3]. In e-commerce, image classification improves users' shopping experience by recommending similar items.

Despite the importance of image search and retrieval, in which image classification is applied, in practice its implementation is still highly dependent on manual annotation. This approach, however, suffers from a series of drawbacks [4]. Firstly, manual annotation is a labor-intensive process that is slow and not scalable when data grows. Moreover, with today's growth rate of social media (Figure 1.1), one can imagine that manually annotating that amount of data is not a feasible task. Secondly, in some cases, categorizing images into correct classes requires specific domain knowledge; for instance, it may require a painting curator to group a collection of paintings into different art periods. Another potential case, such as listing a product image catalog, requires employees that understand company products thoroughly. Thirdly, manual annotation is a highly subjective task which may lead to different interpretations of an ambiguous image.

1 www.facebook.com
2 www.twitter.com



Figure 1.1: The graph of daily numbers of uploaded and shared photos taken from 2005-2014 on 4 social network platforms (the numbers are in millions).3

Given the problems described above, having a machine to classify images automatically is perceived to be an attractive alternative. Hence, many research efforts have been dedicated to creating good artificial intelligence that could replace manual image classification. However, as a matter of course, having a machine replace human tasks has always been a challenging research topic.

Moran [5] describes two factors that make classifying images automatically such a difficult task. The first is the variability of objects in an image. Machines need to extract image representations that are robust to confounding factors, such as occlusions, illumination, scale, viewpoint and translation. For instance, the same object in different images may look different due to the presence of occlusions or if it is captured from different angles. Secondly, while an object in an image may be easily detected by a human, that task may not be straightforward for computers. For example, one may easily recognize Figure 1.2 as a picture of a dog swimming in the water, but machines understand it as a matrix of pixels in a spatial dimension; therefore, interpreting these low-level representations (i.e., pixels, colors) into their higher-level semantic concepts is an intricate task. This problem is called the semantic gap. The formal definition of the semantic gap is given by Smeulders [6]:

“The lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation.”

A lot of research effort has been dedicated to bridging this gap. In this thesis, two cutting-edge approaches that have been widely used in the image classification domain are investigated: Bag-of-Visual-Words (BoVW) and Convolutional Neural Networks (CNNs). BoVW is an adaptation of the Bag of Words model that has been successfully applied for text classification and retrieval and hence was extended to visual features for image classification. On the other hand, CNNs are an extension of neural networks, which are inspired by biological neural networks.

3 Source: KPCB (Kleiner Perkins Caufield and Byers) estimates based on publicly disclosed company data, 2014 year to date data per latest as of 5/14.


Figure 1.2: An example of the semantic gap; the challenge is to bridge the concepts and the actual information seen by a computer4. Adapted from a similar figure in [5]

1.1 Problem Context

The major difference between BoVW and CNNs is how they select the correct features from an image. BoVW relies on hand-crafted features (e.g., Scale-Invariant Feature Transform (SIFT) [7] or Histogram of Oriented Gradients (HOG) [8]) to generate a vocabulary of visual words. Hence, an image can be represented by occurrences of unique visual words. This type of feature description requires prior knowledge to select which attributes best represent an image (e.g., local histograms of orientation gradients in SIFT). On the other hand, CNNs discover their features through the process of learning directly from raw input data (e.g., 2D image pixels). Therefore, instead of deciding which features should be used to represent an image, CNNs choose the important features by themselves.

BoVW [9] represented the state-of-the-art in image classification for a couple of years, until it was outperformed by CNNs [10]. While the general idea of using CNNs in image classification is not new [11], only recently has the availability of large-scale computing power as well as large training datasets – assembled using crowd sourcing – made CNNs successful. This provision of training data and shorter training time enables CNNs to learn robust features that lead to a more accurate model. In practice, a CNN model is trained on a Graphics Processing Unit (GPU)5 [10, 12, 13], which offers a high degree of parallelization and enables the model to learn from thousands of hand-labeled training images. One could argue that the necessity for powerful computing hardware will less likely be a constraining factor in the future due to technological progress; however, the effort of providing large amounts of manually annotated training data cannot be easily substituted. For many classification scenarios it is simply impossible to provide a sufficiently large amount of labeled images as it is a tedious and time-consuming

4 The picture is taken from the ImageNet 2012 dog class
5 Nvidia created a parallel computing platform and programming model called CUDA (Compute Unified Device Architecture, http://www.nvidia.com/object/cuda_home_new.html) meant to be a general purpose architecture not limited to computer graphics.


task. Therefore, in scenarios with limited amounts of training data, hand-crafted feature extractors may provide better results, since CNNs will likely fail to extract good features.

As its main contribution, this thesis is intended to help researchers and practitioners decide which approach best suits certain use cases. Therefore, the two aforementioned approaches (BoVW and CNNs) are compared with respect to the classification accuracy obtained for varying amounts of training data. Different training set sizes are evaluated in order to identify the threshold where CNNs outperform BoVW approaches. Moreover, a deeper analysis of the results will be presented by means of visualization.

1.2 Related Work

At the beginning of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)6, BoVW-based approaches had been the most commonly used methods (Figure 1.3). Their variations, such as Local Coordinate Coding [14] and Compressed Fisher Vectors [15], managed to win the competition for two consecutive years (i.e., 2010 and 2011 respectively). However, from 2012 onwards, since Krizhevsky et al. [10] proposed a CNN-based approach that outperformed other competitors by a large margin, almost all of the participants have adopted CNN-based approaches. The latest result was reported by the Google team at ILSVRC 2014 [16], where they reached a top-5 error rate7 of 7%. Impressively, this is only a 2% difference from human-level performance (i.e., 5.1% as reported in [17]).

Hence, the success of CNNs was partly attributed to the fact that the ILSVRC authors were able to provide massive amounts of training data (1.2 million images were manually assigned to more than 1000 categories) assembled in a huge crowd-sourcing effort [17]. This amount of training data could attenuate the problem of overfitting that occurs in a deep CNN model (e.g., the CNN model proposed by [10] contains more than 60 million parameters). However, this gives rise to the question of how to handle classification tasks in which these huge amounts of training data are not available and where it would be too costly and time-consuming to provide such datasets. Furthermore, the authors in [10] stated that training the proposed CNN model took between 5 and 6 days on two GTX 580 3GB GPUs, hardware resources few researchers have access to, even today. Therefore, many research works have been focusing on alleviating both issues – the necessity for large-scale training sets as well as highly optimized hardware resources.

One method has been proposed by Levi and Hassner [18], who apply a lean CNN architecture (i.e., with fewer model parameters) to a limited amount of data. They trained their model on fine-grained face datasets (i.e., based on age and gender), on which their CNN model managed to outperform the state-of-the-art hand-crafted features. They suggest using fewer layers and fewer parameters to reduce the risk of overfitting a CNN model in situations where a limited amount of data is available.

However, the most promising idea comes from CNNs pre-trained on a large dataset (e.g., the ImageNet dataset). There are currently two approaches to using pre-trained models for classifying a new dataset. One approach uses the penultimate layer of pre-trained CNN models as a powerful image descriptor and then applies machine

6 ILSVRC 2012 – http://www.image-net.org/challenges/LSVRC/2012/
7 The top-5 error rate is the error fraction for which the correct label is not among the top 5 guesses predicted by a classification model [10].


Figure 1.3: The winners of ILSVRC from 2010-2014, with each team name being displayed. Blue bars denote the BoVW-based models that won the competition in 2010 and 2011, until CNNs (red bars) started to dominate from 2012 onwards.

learning (e.g. Support Vector Machine (SVM)) to train new target models (e.g., [19, 20]). Another approach is to fine-tune the pre-trained CNN model on the new target outputs [12, 21, 22]. The latter approach has been reported to give a better accuracy than directly applying SVM to the penultimate layer.

Wei et al. [22] reported the superiority of pre-trained CNN models over BoVW based on Fisher Vector encodings in a multi-label classification scenario conducted on the PASCAL VOC-2007 dataset [23]. Similarly, Chatfield et al. [12] concluded that pre-trained CNNs outperformed Fisher Vectors (FV) models on the VOC and Caltech [24] datasets. However, both datasets – VOC and Caltech – contain highly overlapping object categories and similar visual characteristics to the ImageNet dataset, which is used to pre-train the CNN model. This similarity might explain the success of the achieved results.

Moreover, using CNNs without any pre-training on a small dataset, such as the Caltech dataset, exhibited classification performance worse than a BoVW-based approach, as reported in [20] and [21], which reaffirms the necessity of large training data for training a CNN model. This also shows that BoVW-based approaches could be an alternative for scenarios with limited amounts of training data.

In addition, much research has been done on the impact of increasing training data on different classification algorithms. Brill [25] suggests the necessity of large training data in order to improve the classification performance of models. He evaluated different models on a natural language disambiguation task, where the simpler model (i.e., a memory-based model) performed better with less training data, until it was eventually outperformed by the more complex models (i.e., Naive Bayes, Winnow) as the amount of training data increased. Similarly, Zeng et al. [26] showed that Clustering Based Classification (CBC) could perform better than a discriminative


algorithm (i.e., SVM) with limited labeled data in a semi-supervised text classification scenario. Although both works reported results in the field of Natural Language Processing (NLP), the same observation may also apply to image classification.

1.3 Motivation

Despite the aforementioned superiority of CNNs over BoVW, the question of which algorithm to choose for various dataset types and sizes is still significant, especially when the dataset has visual characteristics different from the ImageNet dataset, where using pre-trained CNN models might not be worthwhile.

Therefore, in this thesis, the impact of varying training set sizes on the classification performance of BoVW and CNNs will be evaluated. Both models, CNNs and BoVW, will be trained on increasing amounts of training data in order to find a decision threshold at which CNNs overtake BoVW. This threshold point could give users a rough estimate of which model to favor for different dataset sizes. Furthermore, the CNN model will be trained without first being pre-trained, to avoid biases induced by similar datasets.

Moreover, a comparison of pre-trained CNN models with BoVW on the Wikipainting dataset8 used by Karayev et al. [27] will be investigated. The painting images in the Wikipainting dataset may have different visual characteristics from the classes in the ImageNet dataset, which is typically used to pre-train a CNN model. This thesis hypothesizes that on a dataset where CNNs need to learn new features other than the features in the pre-training dataset, BoVW with pre-engineered features can be a competitive candidate. This work differs from the implementations in [27], since a different hand-crafted feature approach (i.e., BoVW with FV encodings) and a different CNN model training (i.e., with fine-tuning instead of using the penultimate layer) will be employed.

The hypothesis is that in either scenario, BoVW will perform better than CNNs on smaller training sets, since a CNN model first needs to learn the respective feature descriptors, whereas BoVW uses pre-engineered features. Moreover, a deeper analysis by means of visualization will be employed. By investigating both approaches in a wide variety of settings and datasets, the reported results can give researchers and practitioners insights into which approach is best suited for different scenarios.

1.4 Thesis Structure

This thesis is structured as follows: Chapter 2 presents the traditional image classification approach that employs the Bag of Visual Words method. Chapter 3 discusses the transition from hand-crafted features to learned features, including the theoretical foundations of CNNs. Chapter 4 explains the datasets and the training setups used for the experiments in this thesis. Chapter 5 shows the results and evaluations of both approaches, including plots and graphs of the reported results. Finally, Chapter 6 gives the conclusion of the entire work, followed by possibilities for future work.

8 WikiArt.org is an online visual art encyclopedia, containing a collection of paintings from different eras

Chapter 2

Traditional Image Classification

The success of hand-crafted features is ascribed to Bag-of-Visual-Words (BoVW), which had been proven as the state-of-the-art for image classification in the past [28]. Bag-of-words models were first employed in the text classification domain, representing a text or article as a collection of distinct words. The same idea is adapted to image classification by exploiting low-level features, such as color, texture or local gradients, to build visual word representations [9]. Thus, in relation to image classification, an object can be represented by the occurrences of its unique visual words. For example, a building may have visual words with strong horizontal and vertical gradients, as opposed to scenery-type images, where homogeneous areas of the same color are more prominent.

In general, there are four main steps in building a BoVW model: i) building local descriptors, ii) quantizing the descriptors into visual vocabularies, iii) representing images with a global encoding, and iv) model training for visual object recognition (Figure 2.1).

2.1 Building Local Descriptor

A local descriptor is information that best represents local regions in an image; one example is local image properties, such as color histograms, pixel intensities and

Figure 2.1: A BoVW pipeline adapted from the diagram by Farquhar et al. [1].



Figure 2.2: SIFT computes gradients around a keypoint and each gradient is weighted by a Gaussian function (i.e., pixels located near the center are weighted more than pixels around the edges). Subsequently, the local neighbourhood gradients are combined into an 8-orientation gradient histogram that is stored in a local bin. This example shows 8 × 8 neighbourhood gradients and 2 × 2 keypoint descriptors, whereas in practice SIFT uses 16 × 16 neighbourhood gradients and 4 × 4 keypoint descriptors. The figure is taken from the original paper of the SIFT descriptor [7].

local gradients. A good local descriptor should be invariant to many image transformations (e.g., rotation, scale and translation). A paper by Mikolajczyk and Schmid [29] shows that the SIFT local descriptor performs best when compared to 10 other descriptors. Furthermore, it has been widely used in many vision applications, such as object recognition, 3D modelling and gesture recognition.

A SIFT descriptor is a 3-D spatial histogram of the local image gradients used to characterize the appearance of a keypoint – keypoints are salient image patches that contain rich local information of an image. The gradient at each pixel is regarded as a sample of a three-dimensional feature vector, formed by the x and y pixel locations and the gradient orientations [7]. The gradients are weighted by a Gaussian function, with pixels located near the center being weighted more than pixels around the edges. SIFT aggregates 16 × 16 neighbourhood gradients around a keypoint into 4 × 4 keypoint descriptors, composed of 8 orientation gradients per bin (Figure 2.2). Thus, each keypoint location will be represented by a 4 × 4 × 8 = 128 dimensional vector. Additionally, neighbourhood gradients are rotated relative to the strongest gradient in order to be invariant to rotation. SIFT descriptors are extracted at different image scales in order to obtain scale invariance.

In the original SIFT paper [7], Lowe proposes Difference of Gaussian (DOG) as the SIFT keypoint detector. However, in this thesis, SIFT descriptors were sampled from overlapping dense regions, which according to the authors in [30] is best suited for image classification scenarios.
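To make the dense sampling step concrete, the following is a minimal sketch of extracting SIFT descriptors on a regular grid using OpenCV; this is not necessarily the toolchain used for the thesis experiments, and the grid step and patch size are arbitrary illustrative values.

# A minimal sketch of dense SIFT extraction, assuming OpenCV (cv2) with SIFT
# available; grid step and patch size are illustrative choices, not thesis settings.
import cv2
import numpy as np

def dense_sift(image_path, step=8, size=16):
    """Compute SIFT descriptors on a dense grid instead of DoG keypoints."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    # Place keypoints on a regular grid covering the image.
    keypoints = [cv2.KeyPoint(float(x), float(y), size)
                 for y in range(0, img.shape[0], step)
                 for x in range(0, img.shape[1], step)]
    _, descriptors = sift.compute(img, keypoints)
    return descriptors  # shape: (num_keypoints, 128)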

2.2 Building Visual Vocabulary

An image can be represented by a set of keypoint descriptors (e.g., SIFT); however, the number of descriptors can vary per image. This creates difficulties for a learning algorithm (e.g., SVM) that requires a fixed feature dimension as input data. Therefore, a global distribution (i.e., a visual vocabulary) is constructed from the local descriptors of all


images. This visual vocabulary is employed to represent each image with a fixed-size vector of visual word occurrences that serves as input to a learning algorithm.

The BoVW model uses K-means or a Gaussian Mixture Model (GMM) to approximate the distribution of local features with a fixed number of components or clusters, the latter being preferable since it offers a better (soft) assignment of each descriptor with respect to the entire distribution. Each component in the GMM or K-means is analogous to a word in the Bag of Words model for text document analysis.

These clustering techniques are conducted in an unsupervised setting, with the number of clusters decided in advance. All local descriptor vectors from all images in the training set are stacked together, forming a matrix of descriptors, with the column size representing the dimension of a local descriptor (e.g., 128 for SIFT descriptors) and the row size representing the total number of local descriptors. Subsequently, an estimation technique such as Maximum Likelihood (ML) is carried out to approximate the cluster parameters that best fit the matrix. In the case of a large dataset, local descriptors are randomly sampled in order to reduce the size of the matrix and speed up the estimation.
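As an illustrative sketch of the vocabulary-building step (assuming scikit-learn is available; the component count and sub-sampling size are arbitrary examples, not the thesis settings):

# A minimal sketch of building a visual vocabulary with a GMM, assuming scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

def build_vocabulary(descriptor_list, n_components=64, max_samples=200_000):
    """descriptor_list: list of (n_i, 128) SIFT descriptor arrays, one per image."""
    stacked = np.vstack(descriptor_list)            # rows = descriptors, cols = 128 dims
    if stacked.shape[0] > max_samples:              # random sub-sampling for tractability
        idx = np.random.choice(stacked.shape[0], max_samples, replace=False)
        stacked = stacked[idx]
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm.fit(stacked)                                # ML estimation of the mixture parameters
    return gmm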

2.3 Global Encoding

Mid-level descriptors are extracted by assigning the local descriptors of an image to clusters with certain probabilities (i.e., soft assignment). Hence, this approach creates a histogram of visual words for a given image. Intuitively, each dimension in the histogram encodes occurrences of similar local descriptors in an image, which is similar to TF (Term Frequency) in text classification. In the extended version of BoVW, instead of a histogram of local descriptors (a zero-order statistic), the mid-level descriptors are represented as the gradients with respect to the means and standard deviations (first- and second-order statistics respectively) of the local descriptors relative to each component in the GMM distribution. This generates a higher-dimensional vector called the FV. The advantage of the FV approach is that it has more compact vocabularies than the traditional BoVW. Thus, with the same vocabulary size, FV can encode more image information than a histogram of visual word occurrences. However, there is a drawback, as a bigger memory footprint is required due to the higher-dimensional vector [31]. From this point onward, the term FV will be used to refer to BoVW models combined with Fisher Vector encoding.
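The following is a simplified sketch of the Fisher Vector encoding described above, computing only the first- and second-order statistics for a diagonal-covariance GMM (as fitted in the previous sketch); the power- and L2-normalization steps of the Improved Fisher Vectors are omitted, so this is an illustration rather than the exact encoding used in the experiments.

# A simplified Fisher Vector encoding sketch for a fitted sklearn GaussianMixture.
import numpy as np

def fisher_vector(descriptors, gmm):
    """descriptors: (N, D) local descriptors of one image; gmm: fitted diagonal GMM."""
    N, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)           # (N, K) soft assignments
    mu = gmm.means_                                   # (K, D)
    sigma = np.sqrt(gmm.covariances_)                 # (K, D), diagonal covariance
    w = gmm.weights_                                  # (K,)
    diff = (descriptors[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (N, K, D)
    # First-order statistics: gradients w.r.t. the means.
    fv_mu = (gamma[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    # Second-order statistics: gradients w.r.t. the standard deviations.
    fv_sigma = (gamma[:, :, None] * (diff**2 - 1)).sum(axis=0) / (N * np.sqrt(2 * w)[:, None])
    return np.concatenate([fv_mu.ravel(), fv_sigma.ravel()])   # length 2 * K * D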

2.4 Classification

The last step is to associate the extracted mid-level features with higher-level concepts (e.g., cars, cats, dogs) [9]. Classification is the task of learning a function that can map input data to discrete output data. An input can be represented by a fixed-dimension feature vector. For this reason, FV descriptors can be considered as feature vectors for training a classification model.

A supervised setting is the typical scheme to train a classification model. Normally, two datasets are provided in this scheme, a training and a testing set, with each dataset containing pairs of inputs and outputs. A model will learn an approximation from the training set, which is then used to predict unseen data in the testing set. In


Figure 2.3: A support vector machine algorithm that separates two classes by maximizing the margin between support vectors from different classes.

the next subsection, a discriminative model is introduced that has been widely employed together with FV [12, 31, 32].

2.4.1 Linear Support Vector Machine

Support Vector Machines (SVMs) are linear classifiers that separate different classes by maximizing the margin around a linear decision surface [33]. SVMs use the principle of structural risk minimization, which minimizes the probability of misclassifying unseen samples (i.e., test data) from the unknown probability distribution underlying the training data. Thus, the probability of misclassification on unseen samples is minimized when the decision margin between two classes in the training data is maximized. This principle helps to alleviate the problem of overfitting when the training data is small.

Assume a binary classification scenario with $y_i \in \{-1, 1\}$ and $x_i \in \mathbb{R}^d$ for a training set $(x_i, y_i)$, $i = 1 \ldots N$. The margin can be regarded as the distance between the hyperplane $w^T x + b = 0$ and the outer margins of the positive class $w^T x + b = 1$ and the negative class $w^T x + b = -1$ (Figure 2.3). Maximizing the margin $\frac{2}{\|w\|}$ is equal to minimizing the weight $\|w\|$; learning the SVM can therefore be formulated as an optimization:

\[ \min_{w} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1, \quad i = 1, 2, \ldots, N \tag{2.1} \]

In reality, however, samples are usually not linearly separable. Thus, the previous equation can be generalized by introducing non-negative slack variables $\xi_i$, which penalize data points that violate the margin requirements. If $\xi$ is sufficiently large, every constraint can be satisfied. Adding the slack variables to equation 2.1, it is then written as:


\[ \min_{w, \xi_i} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, N \tag{2.2} \]

C is a regularization parameter: a large C indicates low tolerance to misclassified points (i.e., a narrow margin), while a small C allows misclassified points to be easily ignored (i.e., a large margin). Hence, setting the parameter C is crucial to finding the right margin.
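As a hedged illustration of how the C parameter is typically chosen, the sketch below trains a linear SVM on FV descriptors with scikit-learn and selects C by cross-validation; the candidate values and fold count are arbitrary examples, not the settings used in this thesis.

# A minimal sketch of training a linear SVM on Fisher Vector descriptors, assuming scikit-learn.
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def train_linear_svm(fisher_vectors, labels):
    """fisher_vectors: (n_images, dim) array; labels: (n_images,) class labels."""
    search = GridSearchCV(
        LinearSVC(),                                   # linear SVM; C trades margin width against slack, eq. (2.2)
        param_grid={'C': [0.01, 0.1, 1, 10, 100]},     # illustrative grid of C values
        cv=5)                                          # 5-fold cross-validation
    search.fit(fisher_vectors, labels)
    return search.best_estimator_                      # model refit with the best C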

2.5 Performance Evaluation

In order to evaluate a learned model, an evaluation test is carried out. The goal of the evaluation is to see how well the model classifies unseen data in a test set. A test set in a classification scenario has the same setup as the one used for training. A basic assumption is that a model trained on a training set will be able to correctly predict unseen data in the test set. These two separate sets are needed, since a model may correctly classify a training set but fail to generalize to unseen data, which would mean that the model has overfit the training data.

The simplest way to measure the model performance is to calculate the accuracy on the unseen data. Accuracy is the number of correctly predicted outputs divided by the number of samples in the test data. Nevertheless, this approach has a shortcoming. Assuming a scenario of imbalanced datasets, a model can achieve a high accuracy rate by simply choosing the class with more samples. For example, if there are 10 samples of class A and 90 samples of class B, the model could reach 90% accuracy by simply predicting class B for every sample.

                          Prediction outcome
                          p                       n
Actual value   p′         True Positive (TP)      False Negative (FN)
               n′         False Positive (FP)     True Negative (TN)

Figure 2.4: A confusion matrix

A confusion matrix (Figure 2.4) visualizes the statistics of prediction results in a classification scenario. It captures the differences between the actual outcome and the predicted outcome. A True Positive means that the model correctly identifies a positive sample as positive, whereas a False Negative means that the model has mistakenly marked a positive sample as negative. The same definitions apply to True Negative and False Positive, but


Figure 2.5: The precision-recall curve; the accumulated area under the curve is the average precision score (in this example, AP = 0.76).

instead of positive samples, negative samples are evaluated. Furthermore, those metrics can be combined to generate better evaluation metrics (the following metrics are taken from [34]):

Precision (P) is the fraction of predicted positive instances that are actually positive:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2.3} \]

Recall (R) is the fraction of actual positive instances that are correctly classified:

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2.4} \]

F1-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.5} \]

Precision and Recall are set-based measures used to evaluate independent samples only. In image classification, rank-based measures are preferable, since they give a more complete overview by measuring each individual sample with respect to the entire dataset. Average Precision (AP) is a rank-based measure, which works by ordering the confidence values output by a classifier in descending order. Figure 2.5 is an example of precision and recall plotted for every rank position, where AP is the area under the curve. It sums up precision values for every recall position in the ranked sequence.

Assuming the set of relevant images is $\{d_1, \ldots, d_m\}$ and $R_k$ is the set of ranked results (by confidence value) from the top result down to image $d_k$, then:


\[ AP = \frac{1}{m} \sum_{k=1}^{m} \text{Precision}(R_k) \tag{2.6} \]

An individual AP score measures how well a model identifies one class. In order to evaluate the performance of the model over all C classes, the Mean Average Precision (MAP) is calculated by averaging all AP scores.

\[ MAP = \frac{1}{C} \sum_{k=1}^{C} AP(k) \tag{2.7} \]
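A small sketch of computing AP and MAP from classifier confidence values, assuming scikit-learn's average_precision_score (which may differ slightly from other AP implementations, e.g., interpolated variants); the toy relevance matrix and scores below are made up for illustration.

# A minimal sketch of AP (eq. 2.6) and MAP (eq. 2.7), assuming scikit-learn.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_scores):
    """y_true: (n_samples, n_classes) binary relevance matrix,
    y_scores: (n_samples, n_classes) confidence values per class."""
    ap_per_class = [average_precision_score(y_true[:, c], y_scores[:, c])
                    for c in range(y_true.shape[1])]
    return float(np.mean(ap_per_class)), ap_per_class

# Toy example: 2 classes, 4 test images.
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
y_scores = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.7]])
map_score, aps = mean_average_precision(y_true, y_scores)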

Chapter 3

Convolutional Neural Networks

In Bag-of-Visual-Words (BoVW), the overall model training is split into smaller building blocks (i.e., feature extraction, global encoding and classification). This approach, however, has a shortcoming, because the output of each building block is based on lossy information generated by the preceding step. For example, SIFT creates a histogram of local gradients from the neighbourhood pixels around a keypoint; thus, some information is lost due to gradient quantization or the keypoint selection step. Moreover, a visual vocabulary is built upon these SIFT descriptors, during which more information is removed by generating a pre-determined number of visual words. Thus, improving one building block may not help the overall BoVW performance much, due to the limited information from the preceding step.

Convolutional Neural Networks (CNNs), meanwhile, merge feature extraction and model training into a single training step, from raw image pixels as initial input to a set of labels as classification output. This architecture can retain the important information that accounts for training a classification model. Moreover, training this model is much easier, since no steps need to be done separately – such as building a visual vocabulary or extracting features in BoVW.

In this chapter, the theoretical foundations of CNNs will be presented. Section 3.1 presents the fundamental Artificial Neural Network (ANN) model, including the Multi Layer Perceptron (MLP) model, from which CNN models are derived. Section 3.2 describes the advantages of the convolution operation in neural networks over its predecessors. Section 3.3 presents the different components of CNNs that are used to form a full-fledged model. Section 3.4 explains in detail how model learning in CNNs is carried out.

3.1 Artificial Neural Network

In this section, ANNs are first presented as the underlying principles necessary in order to understand CNNs. CNNs are inherently a variation of neural networks; therefore, many of the concepts regarding CNNs are taken from ANNs (i.e., back propagation, forward propagation). Haykin [35] has proposed a definition of a neural network seen as an adaptive machine:



Figure 3.1: A basic non-linear model of a neuron adapted from Haykin [35]

A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

1. Knowledge is acquired by the network from its environment through a learning process.

2. Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.

By this definition, neural networks can be viewed as distributed structures which can benefit from modern hardware to run processes in parallel. Moreover, they have the ability to learn from their environment, which can be adapted to solve classification problems.

3.1.1 Models of a Neuron

Figure 3.1 shows the simple structure of a neuron model. $x_j$ is an input signal to synapse $j$ connected to neuron $k$ and is multiplied by the synaptic weight $w_{kj}$ (interconnecting input signals to target neurons); the weighted inputs are then passed through an activation function $h(\cdot)$, which is used to limit the output signal to a certain finite value. The model also includes a bias, denoted by $b_k$. The bias has the effect of shifting the net input of the activation function; thus, it helps to produce a larger variety of output values. In mathematical terms, Figure 3.1 may be written as

\[ v_k = \sum_{j=1}^{m} w_{kj} x_j + b_k \tag{3.1} \]

\[ y_k = h(v_k) \tag{3.2} \]


Figure 3.2: A fully connected three-layer feed-forward neural network with i, j and o neurons in the input, hidden, and output layers respectively. In this model, every neuron is connected with the others in its neighboring layers.

where $x_1, x_2, \ldots, x_m$ are the input signals; $w_{k1}, w_{k2}, \ldots, w_{km}$ are the synaptic weights of neuron $k$; $v_k$ is the activation potential, a linear combination of the input signals and the bias $b_k$; $h(\cdot)$ is the activation function; and $y_k$ is the output signal of the neuron.
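A minimal sketch of equations (3.1) and (3.2) for a single neuron, assuming a sigmoid activation; the weights, bias and inputs are arbitrary example values.

# A minimal sketch of the neuron model of equations (3.1) and (3.2).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def neuron(x, w, b, h=sigmoid):
    v = np.dot(w, x) + b      # activation potential, eq. (3.1)
    return h(v)               # output signal, eq. (3.2)

x = np.array([0.5, -1.2, 3.0])     # input signals x_1 ... x_m
w = np.array([0.4, 0.1, -0.7])     # synaptic weights w_k1 ... w_km
y = neuron(x, w, b=0.2)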

3.1.2 MultiLayer Perceptron

A Multi Layer Perceptron (MLP) is a type of multilayer feedforward network consisting of an input layer, one or more hidden layers and an output layer (Figure 3.2). Feedforward means that the input signal propagates through the network in a forward direction, on a layer-by-layer basis. This type of neural network is able to solve more complex tasks that cannot be solved by a single-layer neural network, which is restricted to linear computations. In general, an MLP has three distinctive characteristics [35]:

1. The model of each neuron in the network includes a nonlinear activation function. A commonly used form of nonlinearity is the sigmoidal nonlinearity defined by the logistic function

\[ f(x) = \frac{1}{1 + e^{-x}} \tag{3.3} \]

2. The network contains one or more hidden layers, which are not the first or last layers, making an MLP model have a minimum of three connecting layers.

3. In the network, each neuron is connected to all neurons in the preceding and succeeding layer. Thus, due to its high degree of connectivity, this model is sometimes called a fully connected network.

3.1.3 Learning in Neural Network

Typically a neural network is trained in a supervised setting through an iterative process of adjustments applied to its weights and biases. One of the simplest methods


to train a neural network is to employ an iterative algorithm called gradient descent to minimize the loss function. Assume a weight vector w which minimizes the error function E(w); the error function is minimal when the gradient is equal to zero (∇E(w) = 0). In order to approach this minimum, the gradient descent method makes a small step in the direction of −∇E(w) and thereby further reduces the error.

\[ w^{(\tau+1)} = w^{(\tau)} - \eta \sum_{n=1}^{N} \nabla E_n(w^{(\tau)}) \tag{3.4} \]

where τ labels the iteration step; the weight vector w is updated in a succession of such steps. The value of ∇E(w) is then evaluated at the new weight vector w(τ+1), where the parameter η > 0 is known as the learning rate. After each such update, the gradient is re-evaluated for the new weight vector and the process is repeated. The error value is obtained by accumulating over all N samples before making a minimization step. This type of method is called Batch Gradient Descent. One may notice that each step requires the algorithm to see the entire dataset, which can hurt the learning time if the dataset is huge.
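The following sketch illustrates the batch update of equation (3.4) on a toy least-squares problem; the data, learning rate and number of steps are arbitrary illustrative choices, not values used elsewhere in this thesis.

# A minimal sketch of batch gradient descent, eq. (3.4), on a least-squares problem.
import numpy as np

def batch_gradient_descent(X, y, eta=0.05, steps=1000):
    """Minimize E(w) = 1/2 * sum_n (w^T x_n - y_n)^2 over the whole dataset."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y)     # sum of per-sample gradients over all N samples
        w = w - eta * grad           # one step in the direction of -grad E(w)
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # toy design matrix
y = np.array([1.0, 2.0, 3.0])
w = batch_gradient_descent(X, y)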

Backpropagation or error back-propagation learning is a method of calculating gradients in chain-like models, by propagating gradients from the parent nodes (i.e., neurons) to the child nodes. In general, error back-propagation learning consists of two passes: a forward pass and a backward pass. In the forward pass, an input vector is applied and propagates through the network layer by layer until it reaches the output nodes. During the forward pass, the synaptic weights of the network are all fixed. The backward pass, on the other hand, runs in the opposite direction to the forward pass, from the output to the first layer, while adjusting the synaptic weights according to the error measured by the gradient. Below is the pseudocode of the forward and backward propagation adapted from [36], with simplifications and without regularization parameters.

Algorithm 1 Forward Propagation. The variable x holds the input signals of a neuron; it is either an input value x(0) or the output of an activation function h(·) (e.g., a sigmoid function). Forward propagation starts from the first layer k = 1 and propagates to the last layer m. The index i denotes the ith neuron at the (k−1)th layer and j the jth neuron at the kth layer; thus wij is the weight that connects the two neurons i and j, and bj is the bias connected to the output neuron j. In fully connected neural networks, all neurons in the preceding layer are connected to every neuron in the current layer. The combination of all input signals from the preceding layer k−1 with their synaptic weights produces an activation potential v, which is then passed through the activation function. Lastly, the final output is a vector ŷ that contributes to the loss function L with respect to the ground truth y.

1: procedure Forward Propagation
2:   x(0) = input
3:   for k = 1 ... m do
4:     v_j^(k) = Σ_i w_ij^(k) x_i^(k−1) + b_j^(k)
5:     x_j^(k) = h(v_j^(k))
6:   ŷ = x(m)
7:   E = L(ŷ, y)


Algorithm 2 Backward Propagation. At the beginning of the backward propagation, the gradient g is calculated by computing the derivative of the loss function L with respect to ŷ. The delta is an error value corresponding to a single neuron. It is calculated by multiplying the gradient from the succeeding layer with the partial derivative of the activation function with respect to its activation potential, h′(v). The multiplication of the delta value and the output of a neuron in the layer below, x(k−1), produces the gradient of a weight with respect to the error function, i.e., ∇_{w_ij^(k)} E, and similarly ∇_{b_j^(k)} E for the biases. The last step is to pass the gradients to the layer below, computed from the weights and delta values. This step is repeated until the neurons at the first layer are reached. The weight gradients are then used to update the existing weights (cf. equation 3.4), which are used for the next forward propagation.

1: procedure Backward Propagation
2:   g ← ∇_ŷ E = ∇_ŷ L(ŷ, y)
3:   for k = m down to 1 do
4:     calculate the delta of the respective layer:
5:       g ← ∇_{v^(k)} E = g ⊙ h′(v^(k))
6:     compute the gradients of the weights and biases from the deltas and the activations of the preceding layer:
7:       ∇_{w_ij^(k)} E = g_j x_i^(k−1)
8:       ∇_{b_j^(k)} E = g_j
9:     propagate the gradients w.r.t. the next lower layer's activations:
10:      g ← ∇_{x^(k−1)} E, with ∇_{x_i^(k−1)} E = Σ_j w_ij^(k) g_j
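To make Algorithms 1 and 2 concrete, below is a minimal numpy sketch of forward and backward propagation for a fully connected network with sigmoid activations and a squared-error loss; the layer sizes, learning rate and data are arbitrary illustrative values, and the code is a simplification rather than the training procedure used in this thesis.

# A minimal numpy sketch of forward and backward propagation (Algorithms 1 and 2).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

def forward(x0, weights, biases):
    """Forward propagation: returns per-layer potentials v and activations x."""
    xs, vs = [x0], []
    for W, b in zip(weights, biases):
        v = W @ xs[-1] + b          # v_j = sum_i w_ij * x_i + b_j
        vs.append(v)
        xs.append(sigmoid(v))       # x = h(v)
    return vs, xs

def backward(vs, xs, y, weights):
    """Backward propagation: gradients of E = 1/2 * ||y_hat - y||^2."""
    grads_W, grads_b = [], []
    g = xs[-1] - y                                  # dE/dy_hat for the squared error
    for k in reversed(range(len(weights))):
        g = g * sigmoid_prime(vs[k])                # delta of layer k
        grads_W.insert(0, np.outer(g, xs[k]))       # dE/dW^(k)
        grads_b.insert(0, g)                        # dE/db^(k)
        g = weights[k].T @ g                        # propagate to layer k-1 activations
    return grads_W, grads_b

# Toy usage: a 3-4-2 network updated once on a single example.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
x0, y = np.array([0.2, -0.5, 1.0]), np.array([1.0, 0.0])
vs, xs = forward(x0, weights, biases)
gW, gb = backward(vs, xs, y, weights)
weights = [W - 0.1 * dW for W, dW in zip(weights, gW)]
biases = [b - 0.1 * db for b, db in zip(biases, gb)]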

3.2 Motivation of Convolutional Neural Networks

Figure 3.3: Neural networks in the 2D spatial domain. Figure a) is a fully connected network (i.e., an MLP), and figures b) and c) are CNNs with their respective properties.

Multi Layer Perceptrons (MLPs) were used in the past in many image classification applications, including face detection and handwriting recognition. A neuron in 2D spatial dimensions can be regarded as an image pixel. In a fully-connected network, a pixel is connected to every other pixel in the preceding or succeeding layer by synaptic weights (Figure 3.3a). One can notice that even with small images this model requires a large number of weight parameters due to its high connectivity, resulting in a large memory footprint to store the weight parameters and high time complexity to compute the many neuron connections. Moreover, according to LeCun et al. [37], MLPs are not invariant to shift or distortion of an object. In fact, the usual practice is to put


the desired object in the center of an image before feeding it to the model; such a practice is impractical.

CNNs are a variation of MLPs that is more robust to the aforementioned problems. Two properties contribute to this: sparse connectivity and weight sharing. By means of sparse connectivity, a neuron reduces its connectivity to only a limited number of local neighborhood neurons (Figure 3.3b). This property reduces the propagation time and the size of the memory footprint, since fewer parameters need to be computed and stored respectively. Moreover, in contrast to traditional neural networks where a weight is unique to each neuron connection (i.e., a pair of an input and an output neuron), CNNs share the same weights across an input image. Figure 3.3b shows that the same weights are reused to connect to different output neurons – the same line colors denote the same weight parameters.

These two properties act as a local receptive field that detects local features in an image, such as edges, corners or blobs. Different local receptive fields are associated with different types of features; for example, one local receptive field may detect strong horizontal gradients and others vertical or diagonal gradients. Hence, these low-level features can further be combined in the deeper layers to detect high-level features.

Therefore, given the above advantages – faster propagation time, smaller memory footprint and an ability to learn high-level features – CNNs have become a more desirable choice for building a neural network model for image classification purposes.

3.2.1 Convolution and Cross-Correlation

As their name suggests, CNNs employ a mathematical operation called convolution. Convolution is a mathematical operator that combines two functions f and g and produces a third function that, in a sense, represents the amount of overlap between f and a reversed, translated version of g.

\[ (f * g)(t) \overset{\mathrm{def}}{=} \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \tag{3.5} \]

Convolution is denoted by an asterisk (∗) symbol; equation (3.5) shows the convolution of functions f and g in the domain t (e.g., in signal processing t can represent the time domain). The convolution formula can be described as a weighted average of the function f(τ) at the moment t, where the weighting is given by g(−τ) simply shifted by the amount t. In a 2-dimensional spatial domain, the discrete form of the convolution formula is used.

\[ (f * g)(i, j) \overset{\mathrm{def}}{=} \sum_{m} \sum_{n} f[m, n]\, g[i - m, j - n] \tag{3.6} \]

Equation (3.6) shows the discrete convolution formula in a two-dimensional domain; the variables i and j represent pixel locations in an image, and g[i − m, j − n] is the flipped kernel that convolves the function f.

In convolutional network terminology, the first argument f in equation (3.6) is referred to as the input and the second argument g – which often has a smaller size than


Figure 3.4: A visualization of a convolutional layer in CNNs. The input is 2 channel feature maps of 5 × 5, which are convolved by 3 × 3 kernels into two output feature maps of 3 × 3. Colors represent the different weight values; the same color is applied to input sources associated with a particular feature map. Note, the kernels are not flipped for simplicity1.

f – is the kernel. This operation produces an output referred to as the feature map. This formula is the core operation in CNNs.

The opposite form of convolution is cross-correlation, symbolized by a star (⋆) symbol (equation 3.7). It is similar to the convolution operator except that the kernel is not flipped: g[i + m, j + n]. This operator is introduced in order to calculate backpropagation in CNNs in the following section.

\[ (f \star g)(i, j) \overset{\mathrm{def}}{=} \sum_{m} \sum_{n} f[m, n]\, g[i + m, j + n] \tag{3.7} \]
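The sketch below contrasts equations (3.6) and (3.7) on a toy input: convolution is simply cross-correlation with the kernel flipped in both axes, and the 'valid' output of an H × W input with a kh × kw kernel has size (H − kh + 1) × (W − kw + 1). Shapes and values are illustrative only.

# A minimal sketch of discrete 2D cross-correlation (eq. 3.7) and convolution (eq. 3.6).
import numpy as np

def cross_correlate2d(f, g):
    """Slide the kernel g over the input f without flipping (eq. 3.7)."""
    H, W = f.shape
    kh, kw = g.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i + kh, j:j + kw] * g)
    return out

def convolve2d(f, g):
    """Convolution (eq. 3.6) = cross-correlation with the kernel flipped in both axes."""
    return cross_correlate2d(f, np.flip(g))

f = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
g = np.array([[1.0, 0.0], [0.0, -1.0]])        # toy 2x2 kernel
assert convolve2d(f, g).shape == (4, 4)        # valid output: (5-2+1) x (5-2+1)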

3.3 Types of Layer in Convolution Neural Networks

Even though the name convolutional neural network is attributed to the fact that these networks mainly use convolutional layers in their construction, in practice CNNs combine many different types of layers that contribute to creating a more robust model. Complementary layers incorporated in a full-fledged CNN model are pooling layers, normalization layers, and fully connected layers.

1 Adapted from a similar figure in http://cs231n.github.io/convolutional-networks/


Figure 3.5: A ReLU function removes values lower than zero. This activation function enables CNNs to converge faster than a sigmoid function in model training.

3.3.1 Convolutional Layer

A convolutional layer employs convolution operators in 2D neural networks by convolving a set of kernels with the input feature maps of the preceding layer, resulting in output feature maps in the succeeding layer (Figure 3.4). Hence, the kernels in this network operate as synaptic weights that connect two layers together.

$$ v_j^{(\ell)} = \sum_i \big( x_i^{(\ell-1)} * W_{ij}^{(\ell)} \big) + b_j, \qquad x^{(\ell)} = h\big(v^{(\ell)}\big) \qquad (3.8) $$

Equation (3.8) calculates a feature map from its respective inputs in a convolutional layer. i and j denote the index of a feature map or neuron at the (ℓ−1)th and ℓth layer, respectively. A feature map is calculated by summing all of the convolutions of the feature maps in the preceding layer x^(ℓ−1) with their synaptic weights or kernels W, plus the bias b. In practice, weights are typically represented as 4-D tensors indexed by the destination feature map, the input feature map, and the 2-D size of the kernel. Note, the pixel locations of the source input are not stored, since the same kernel is shared across different locations in a feature map (i.e., the weight sharing property). The bias b can be represented as a vector containing one element for every output source.

As in typical neural network models, the feature map is activated by an activation function h(·). Krizhevsky et al. [10] have applied a Rectified Linear Unit (ReLU) function f(x) = max(0, x) (Figure 3.5) in the convolutional layers instead of the sigmoid function commonly used in MLPs. They have put forward that a CNN model benefits from faster convergence using a ReLU function. This approach makes learning with deeper convolutional networks feasible, which was not the case in the past.
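As an illustration only (not the thesis implementation, which relies on Caffe), the following NumPy sketch evaluates Equation (3.8) for a small layer; like Figure 3.4, it does not flip the kernels, i.e. it uses cross-correlation.

import numpy as np

def relu(x):
    # ReLU activation h(x) = max(0, x), cf. Figure 3.5.
    return np.maximum(0.0, x)

def conv_layer_forward(inputs, kernels, biases):
    # inputs  : (C_in, H, W)            -- feature maps x^(l-1)
    # kernels : (C_out, C_in, kH, kW)   -- shared weights W
    # biases  : (C_out,)                -- one bias per output feature map
    c_out, c_in, kh, kw = kernels.shape
    _, h, w = inputs.shape
    v = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for j in range(c_out):                    # one output map per set of kernels
        for i in range(c_in):                 # sum contributions of all input maps
            for y in range(v.shape[1]):
                for x in range(v.shape[2]):
                    v[j, y, x] += np.sum(inputs[i, y:y + kh, x:x + kw] * kernels[j, i])
        v[j] += biases[j]
    return relu(v)                            # activated feature maps x^(l)

x = np.random.randn(2, 5, 5)                  # 2 input maps of 5 x 5, as in Figure 3.4
W = np.random.randn(2, 2, 3, 3)               # 3 x 3 kernels for 2 output maps
print(conv_layer_forward(x, W, np.zeros(2)).shape)   # (2, 3, 3)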

3.3.2 Pooling Layer

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs [36]. For example, average pooling calculates the average value of a defined region of the input, while max pooling reports the maximum value within the rectangular neighbourhood. There are other types of pooling operations: stochastic pooling, which randomly picks the activation within each pooling region according to a multinomial distribution [38], and pooling based on the L2 norm of a rectangular neighbourhood.


$$ x^{(\ell)} = \max\big( x^{(\ell-1)},\, u(n, n) \big) \qquad (3.9) $$

Equation 3.9 is a max pooling operation, which applies a sliding window function u(x, y) of size n × n – computing the maximum value in the neighbourhood area – to an input feature map x^(ℓ−1). The result of a max pooling operation is an output feature map x^(ℓ) with lower resolution. For example, an N × N feature map pooled with an n × n max pooling window and a 1 pixel stride yields an (N − n + 1) × (N − n + 1) output feature map.

Pooling layers help a convolutional network to be invariant to noise or small translations of the input. Furthermore, pooling layers reduce training complexity by downsampling feature maps to lower resolutions and selecting only the most relevant information from the output units. Therefore, pooling reduces the number of parameters to be learned and the size of the memory footprint.
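A minimal sketch of the max pooling operation of Equation (3.9) on a single feature map, assuming a non-overlapping window (illustrative only):

import numpy as np

def max_pool2d(x, n=2, stride=2):
    # Max pooling over a feature map x of shape (H, W).
    out_h = (x.shape[0] - n) // stride + 1
    out_w = (x.shape[1] - n) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + n, j * stride:j * stride + n].max()
    return out

fmap = np.random.randn(28, 28)
print(max_pool2d(fmap).shape)   # (14, 14), as in the example model of Section 3.5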

3.3.3 Local Response Normalization

Krizhevsky et al. [10] have shown that adding a normalization function on top of a convolutional layer gives a slight (1–2%) improvement.

With $a = x^{(\ell-1)}$ and $b = x^{(\ell)}$:

$$ b_{x,y}^{p} = a_{x,y}^{p} \Bigg/ \Bigg( k + \alpha \sum_{q=\max(0,\,p-n/2)}^{\min(N-1,\,p+n/2)} \big( a_{x,y}^{q} \big)^{2} \Bigg)^{\!\beta} \qquad (3.10) $$

For convenience, a and b are introduced as the input and output layers, respectively. N is the total number of feature maps in the layer, and n is the size of the neighbourhood of feature maps. Local Response Normalization (LRN) works by dividing the pixel value at position (x, y) in the pth feature map by the summation of adjacent pixel values from the neighbouring feature maps q. The constants k, n, α and β are hyper-parameters. In their paper, they set their values to k = 2, n = 5, α = 10⁻⁴ and β = 0.75.
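The following NumPy sketch applies Equation (3.10) across the channel dimension; it is a simplified illustration with the hyper-parameters of [10], not the Caffe layer itself.

import numpy as np

def lrn_forward(x, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # x: feature maps a = x^(l-1) of shape (N_maps, H, W); returns b = x^(l).
    n_maps = x.shape[0]
    b = np.zeros_like(x)
    for p in range(n_maps):
        lo = max(0, p - n // 2)
        hi = min(n_maps - 1, p + n // 2)
        denom = k + alpha * np.sum(x[lo:hi + 1] ** 2, axis=0)   # sum over neighbouring maps
        b[p] = x[p] / denom ** beta
    return b

maps = np.random.randn(96, 27, 27)   # e.g., the output of a convolutional layer
print(lrn_forward(maps).shape)       # (96, 27, 27)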

3.3.4 Dropout

This sort of layer differs from the other layers, as it does not apply any function to transform values. The key idea of a dropout layer is to randomly drop units in the hidden layers, which helps to reduce overfitting. Srivastava et al. [39] introduced this technique, which is computationally more efficient in comparison to other regularization methods such as L1/L2 regularization (e.g., weight decay) or data augmentation. Nevertheless, all of these regularization techniques are usually combined in practice. Intuitively, this approximation generalizes the model to learn important features that are invariant to changes (Figure 3.6). Moreover, the cost of applying this layer is almost negligible and computationally cheap (in practice it is applied by setting a unit's output to zero).


Figure 3.6: Description of the dropout technique: (a) is a standard multi-layer perceptron and (b) is a network after randomly dropping units (picture taken from [39]).

3.3.5 Softmax Loss

As presented in Section 3.1, a neural network model is trained by minimizing a loss function through an iterative algorithm. The most common loss function used for training CNN models in a multi-class setting is the cross entropy loss with a softmax function.

$$ E(y, \hat{y}; \theta) = - \sum_j y_j \log p(c_j \mid y) \qquad (3.11) $$

$$ p(c_k = 1 \mid y) = \frac{e^{x_k}}{\sum_{j=1}^{C} e^{x_j}} \qquad (3.12) $$

Equation 3.11 is the negative log-likelihood of the cross entropy between two distributions2: the true distribution y (i.e., the ground truth) and the approximating distribution p(c|y), which is obtained from the softmax function (i.e., Equation 3.12), with C being the number of classes in multi-class classification and x being the output values of the CNN model. One can notice that in multi-class scenarios the ground truth is a vector summing to one, Σ_j y_j = 1, with the true class equal to one and the rest zero, which represents a probability distribution. The goal of a CNN model is to minimize this function using backpropagation. The detailed explanation of calculating gradients for each layer will be discussed in the following section.
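A minimal sketch of Equations (3.11) and (3.12) for a single sample (illustrative only; the class scores and labels are toy values):

import numpy as np

def softmax(x):
    # Equation 3.12; shifting by max(x) avoids numerical overflow.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy_loss(scores, y_true):
    # Equation 3.11 for a one-hot ground truth vector y_true.
    return -np.sum(y_true * np.log(softmax(scores)))

scores = np.array([2.0, 0.5, -1.0])    # raw outputs of the last layer for 3 classes
y = np.array([1.0, 0.0, 0.0])          # one-hot ground truth
print(cross_entropy_loss(scores, y))   # ~0.24
# The gradient with respect to the scores is simply softmax(scores) - y (Equation 3.13).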

3.4 Backpropagation in Convolutional Neural Networks

Similar to MLPs, CNNs employ backpropagation to update their parameters. The step by step procedure of backpropagation in CNNs can be seen from the pseudo-codes in the neural network section (i.e., Algorithms 1 and 2). Thus, in this section we present the backpropagation formulas with respect to the different layers shown in the previous section.

2Equations are taken from the Deep Learning tutorial slides at CVPR, 23 June 2014, by Marc'Aurelio Ranzato


$$ \frac{\partial E}{\partial y} = p(c \mid y) - y \qquad (3.13) $$

As a matter of course, backpropagation starts by deriving the loss function. Equation 3.13 is the partial derivative of the softmax loss function (i.e., Equation 3.11) with respect to its input. This equation calculates the gradients for the different classes, which are propagated down to the layer below.

3.4.1 Gradients in Convolution Layer

The formulas in this section are mainly taken from [40] with slight modifications (e.g., cross-correlation is applied instead of convolution).

$$ \delta_j^{(\ell)} = \frac{\partial E}{\partial v_j^{(\ell)}} = h'\big(v^{(\ell)}\big) \circ \frac{\partial E}{\partial x_j^{(\ell)}} \qquad (3.14) $$

Equation 3.14 presents an element-wise multiplication (i.e., a Hadamard product ◦) between ∂E/∂x_j^(ℓ), the gradient from the layer above, and the derivative of the activation function h′(·) with respect to its activation potential v^(ℓ) (the derivative of a ReLU is a step function that is one for positive inputs and zero otherwise), resulting in a delta matrix for a single unit δ_j^(ℓ).

$$ \frac{\partial E}{\partial W_{ij}^{(\ell)}} = \mathrm{flip}\big( \delta_j^{(\ell)} \star x_i^{(\ell-1)} \big) \qquad (3.15) $$

The kernel gradients are equal to the cross-correlation (Equation 3.7) of the delta matrix with the input feature map from the layer below x_i^(ℓ−1) (Equation 3.15). The resulting matrix is flipped along its x and y axes to bring it back into the orientation of the initial kernel.

$$ \frac{\partial E}{\partial b_j^{(\ell)}} = \sum_{x,y} \delta_j^{(\ell)} \qquad (3.16) $$

Bias gradients are calculated simply by summing the respective delta of an output feature map δ_j^(ℓ) over the entire x and y axes (each output feature map has one bias value).

$$ \frac{\partial E}{\partial x_i^{(\ell-1)}} = \sum_j \delta_j^{(\ell)} \star W_{ij}^{(\ell)} \qquad (3.17) $$

Equation 3.17 passes gradients to the layer below by summing the cross-correlations of all delta matrices in the ℓth layer with their respective kernels W_ij^(ℓ).
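The thesis states these gradients compactly via cross-correlation and a flip (Equations 3.15–3.17). As a rough illustration (not the Caffe implementation), the loop-based NumPy sketch below computes the same three gradients directly for a single input map, a single kernel and stride 1:

import numpy as np

def conv_backward_single(x, W, delta):
    # x: input map (H, W); W: kernel (kh, kw); delta: dE/dv for the output map.
    # Returns (dW, db, dx): kernel, bias and input gradients.
    kh, kw = W.shape
    dW = np.zeros_like(W)
    dx = np.zeros_like(x)
    for i in range(delta.shape[0]):
        for j in range(delta.shape[1]):
            dW += delta[i, j] * x[i:i + kh, j:j + kw]        # accumulate kernel gradient
            dx[i:i + kh, j:j + kw] += delta[i, j] * W        # scatter gradient to the input
    db = delta.sum()                                          # Equation 3.16: one bias per map
    return dW, db, dx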


3.4.2 Gradients in Max Pooling

In backpropagation, the max pooling layers simply propagate gradients from the layers above to the locations of the maximum values taken by the sliding window in forward propagation. This approach yields sparseness in the gradient distribution of the layer below, which also means that only regions with high activations will learn from the training feedback.

3.4.3 Gradients in Local Response Normalization

Equation 3.18 is the partial derivative of Equation 3.10. Gradients in LRN layers are obtained by inserting the gradient from the layer above and the hyper-parameters into this formula.

$$ C = k + \alpha \sum_{q=\max(0,\,p-n/2)}^{\min(N-1,\,p+n/2)} \big( a_{x,y}^{q} \big)^{2}, \qquad a = x^{(\ell-1)},\; b = x^{(\ell)} $$

$$ \nabla a_{x,y}^{p} = \nabla b_{x,y}^{p}\, C^{-\beta} - 2 \beta\, C^{-\beta-1} \alpha \sum_{q=\max(0,\,p-n/2)}^{\min(N-1,\,p+n/2)} \big[ a_{x,y}^{q}\, \nabla b_{x,y}^{q} \big]\, a_{x,y}^{p} \qquad (3.18) $$

A new variable C is introduced to simplify the equation. The variable ∇b^q is equal to ∂E/∂x^(ℓ), the gradient from the layer above, and ∇a^p is the calculated gradient matrix passed to the layer below, ∂E/∂x^(ℓ−1).

3.4.4 Backpropagation Trick

As noted earlier, batch gradient descent does not scale as a dataset grows, because of the number of samples it needs to see before making a descending step. Therefore, an approximation of gradient descent is usually used, namely Stochastic Gradient Descent (SGD). SGD, or online learning, estimates gradients on the basis of a randomly picked sample. The advantage of SGD over the gradient descent algorithm is that SGD can make a single step after seeing only one random sample (Equation 3.19). Thus, similar to gradient descent, this algorithm slowly converges despite the noise introduced by random sampling.

$$ w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E\big(w^{(\tau)}\big) \qquad (3.19) $$

Another variation of SGD that is more commonly used in large scale learning is mini-batch SGD. As opposed to batch gradient descent or online learning, mini-batch SGD updates its parameters with respect to a number of random training samples. There are two benefits of mini-batch SGD: firstly, it can harness modern computers by utilizing parallelization to accelerate learning; secondly, it can suppress noise and variance by taking more samples in order to better represent the distribution of the training data, which can lead to more stable convergence [41].

$$ w^{(\tau+1)} = w^{(\tau)} - \eta\, \frac{1}{K} \sum_{k=1}^{K} \nabla E_k\big(w^{(\tau)}\big) \qquad (3.20) $$

K is the size of the subset drawn from the entire training set (i.e., N in Equation 3.4). There is no exact rule for the number of samples per batch; Krizhevsky et al. [10], for instance, used 256 samples per batch for training a CNN on the entire ImageNet dataset. The gradient is averaged over the number of samples per batch.

Momentum is a technique to accelerate gradient descent by adding velocity from the previous parameter updates [42].

$$ V^{(\tau+1)} = \alpha V^{(\tau)} - \eta\, \nabla E\big(w^{(\tau)}\big), \qquad w^{(\tau+1)} = w^{(\tau)} - V^{(\tau+1)} \qquad (3.21) $$

Momentum works by adding V^(τ) – the update from previous iterations – to the current weight gradients ∇E(w^(τ)), where α is a hyperparameter (e.g., 0.9). This method helps to accelerate steps while descending into deep minima and also pushes gradients past stationary points so that they keep moving. Stationary points are points where the gradient vanishes.

$$ w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E\big(w^{(\tau)}\big) - \lambda \eta\, w^{(\tau)} \qquad (3.22) $$

Another trick which is also used in practice is weight decay (Equation 3.22). This trick works as an L2 regularization that improves generalization by quenching extreme weights. It prevents weights from growing too much unless it is necessary [43]. λ is the regularization parameter that balances the weight sizes accordingly.
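The sketch below combines the mini-batch update of Equation (3.20) with momentum and weight decay. Note that it uses the additive-velocity form common in practice (e.g., in Caffe); the sign convention differs superficially from Equation (3.21), but the idea is the same.

import numpy as np

def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=5e-4):
    # grad is assumed to be the gradient averaged over one mini-batch.
    velocity = momentum * velocity - lr * (grad + weight_decay * w)   # momentum + weight decay
    return w + velocity, velocity

w = np.random.randn(10)
v = np.zeros_like(w)
for _ in range(5):
    grad = 2 * w                      # gradient of a dummy quadratic loss ||w||^2
    w, v = sgd_step(w, grad, v)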

3.5 Example Model

Figure 3.7: An example of a convolutional neural network on a 3-channel RGB image, consisting of 2 convolutional layers, 2 max pooling layers, and 2 fully connected layers. Note, the architecture uses similar layers as the LeNet-5 architecture [37].


Figure 3.7 is an example of a CNN model that takes a 32 × 32 3-channel RGB image as input and outputs 2 probability values. The CNN model contains 2 convolutional layers (C1, C2), each followed by a max pooling layer (P1, P2), which are then connected to fully connected layers (FC1, FC2). For each convolutional layer, a ReLU activation function is applied.

The first convolutional layer (C1) convolves the 32 × 32 × 3 input image with 4 kernels of size 5 × 5 × 3 and a step size of 1 pixel, resulting in 4 feature maps of size 28 × 28 (i.e., 32 − 5 + 1 = 28). The first pooling layer (P1) applies a max pooling function with a 2 × 2 window size, reducing the size of each feature map to 14 × 14.

The second convolutional layer (C2) takes as input the output of P1 and convolves it with 6 kernels of size 5 × 5 × 4 and a 1 pixel step size, generating 6 feature maps of size 10 × 10 (i.e., 14 − 5 + 1 = 10). The same pooling function is applied to C2, resulting in 6 feature maps of size 5 × 5 (P2). All feature maps of P2 are fully connected to the 5 output neurons (FC1) with 5 kernels of size 5 × 5. Finally, FC1 is connected to the last layer (FC2), which contains 2 output neurons. Typically, the number of neurons in the last layer is set according to the number of predicted classes (e.g., 2 classes in this model). The output values of the last layer are passed through a softmax function to produce probability values.
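The spatial sizes quoted above follow directly from the valid-convolution and pooling arithmetic; a small helper (illustrative only) traces them:

def conv_out(size, kernel, stride=1):
    # Spatial output size of a valid convolution.
    return (size - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    # Spatial output size of a non-overlapping max pooling.
    return (size - window) // stride + 1

s = 32                 # input image: 32 x 32 x 3
s = conv_out(s, 5)     # C1: 4 kernels of 5 x 5 x 3  -> 4 maps of 28 x 28
s = pool_out(s)        # P1: 2 x 2 max pooling       -> 4 maps of 14 x 14
s = conv_out(s, 5)     # C2: 6 kernels of 5 x 5 x 4  -> 6 maps of 10 x 10
s = pool_out(s)        # P2: 2 x 2 max pooling       -> 6 maps of 5 x 5
print(s)               # 5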

Chapter 4

Experimental Setup

In this chapter, the performance of both approaches (i.e., BoVW and CNNs) described in Chapters 2 and 3 is investigated on selected datasets. Moreover, as mentioned in Chapter 2, for the sake of brevity, BoVW with FV encodings will be referred to simply as FV.

Two scenarios were implemented in this thesis. The first scenario was to find a decision threshold at which CNNs would overtake FV. Therefore, a dataset with an incrementing number of samples was required. For this reason, the ImageNet dataset was selected, since it contains enough images to train CNN models.

In the second scenario, CNNs and FV were tested on non-natural images (i.e., the Wikipainting dataset), which have different visual characteristics from the photo-like images in ImageNet. The purpose of this was twofold: firstly, to analyze how CNNs and FV perform on non-natural images, and secondly, to see whether FV can be an alternative for this type of dataset.

This chapter is structured as follows: Sections 4.1 and 4.2 present the datasets prepared for the experiments. Subsequently, Section 4.3 explains how image classification was conducted on those datasets, with the implementation details of the respective approaches in its subsections (i.e., 4.3.2 and 4.3.3).

4.1 ImageNet Dataset

The ImageNet dataset is an image database maintained by groups of researchers mainly from Stanford University and Princeton University, whose objective is to provide researchers with an easy-to-access image database. It contains 1.2 million photo images in over 1000 different categories, making it the largest annotated image database on the internet to date.

For this experiment, the ImageNet 2012 training set was split into subsets of incrementing size. The subsets were drawn from 10 classes: the 5 easiest and the 5 most difficult classes to train, based on the mean error of the top 5 predictions from all submissions to ImageNet 20121 (Table 4.3).

1See http://image-net.org/challenges/LSVRC/2012/ilsvrc2012.pdf for more information.



No. Classes | 5% | 10% | 20% | 40% | 60% | 80% | 100%
10 Classes | 622 | 1,242 | 2,485 | 4,969 | 7,455 | 9,939 | 12,424
100 Classes | 6,404 | 12,808 | 25,615 | 51,230 | 76,846 | 102,461 | 128,076
200 Classes | 12,866 | 25,728 | 51,456 | 102,911 | 154,368 | 205,823 | 257,279

Table 4.1: Training sets generated by randomly sub-selecting images from the 2012 ILSVRC training set according to the denoted ratio.

No. Classes | 20% | 40% | 60% | 80% | 100%
22 Classes | 4,400 | 8,800 | 13,200 | 17,600 | 22,000

Table 4.2: Training sets generated by randomly sub-selecting images from the October 2013 Wikipainting website according to the denoted ratio.

These classes were selected in order to investigate the classification performance of both CNNs and FV on different class characteristics (i.e., the 5 easier and the 5 harder classes). The assumption was that both approaches could easily predict the easier classes using little training data, whereas for the harder classes both models would have difficulty capturing distinctive features to separate the classes.

As one can see from Table 4.3, the 5 easier categories resemble scenery-type images, where backgrounds and foregrounds are important to determine the category. For example, images of yellow lady's slipper share the same patterns, such as a yellow flower centered on a background of green leaves, or odometer images, which contain circular patterns with written numbers and dark backgrounds. Meanwhile, the harder classes are analogous to an object detection problem, where an object is distinct from its background. For instance, a spatula can be held by someone or laid on a table. Therefore, investigating both approaches on classes with different visual characteristics gives interesting insights.

In order to build an incrementing dataset, 7 sets were created for each class. Each set contained a different number of images based on different ratio splits: 5, 10, 20, 40, 60, 80 and 100 percent of randomly picked samples from the ImageNet training set (Table 4.1 – the 10 classes).

In the beginning, the experiment was performed only on the 10-class scenario. However, the number of samples was still considered too small for CNNs to outperform BoVW. Therefore, in order to improve CNN model accuracy, the number of samples was increased by adding 90 and 190 random classes to the existing 10 classes. This step created subsets of 100 and 200 classes with the different split percentages (Table 4.1 – the 100 and 200 classes). In total, 7 × 3 = 21 subsets of the ImageNet dataset were used to train both algorithms.
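The subset generation can be sketched as follows (a hypothetical Python outline; the file layout and function names are assumptions, only the ratios come from the text):

import random

def build_subsets(class_images, ratios=(0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00), seed=0):
    # class_images: dict mapping a class name to a list of its training image paths.
    # Returns a dict: ratio -> list of (class_name, image_path) pairs.
    rng = random.Random(seed)
    subsets = {r: [] for r in ratios}
    for cls, images in class_images.items():
        shuffled = images[:]
        rng.shuffle(shuffled)
        for r in ratios:
            n = max(1, int(round(r * len(shuffled))))
            subsets[r].extend((cls, p) for p in shuffled[:n])   # smaller splits nest in larger ones
    return subsets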

Additionally, a test set was created for performance evaluation. The test set was taken from the 10 classes of the ImageNet validation set. This set contained 50 images per class, making a total of 500 images for the entire test set.

4.1.1 Imbalanced Dataset with CNN

This experiment delved deeper into the CNN algorithm, apart from comparing the different image classification algorithms, by investigating CNN models in an imbalanced dataset scenario.


Easiest | Mean Error
Geyser | 0.001
Odometer | 0.011
Canoe | 0.013
Yellow lady's slipper | 0.015
Website | 0.015

Most Difficult | Mean Error
Ladle | 0.877
Hatchet | 0.857
Spatula | 0.833
Muzzle | 0.832
Hook and claw | 0.805

(The original table additionally shows example graphics for each class.)

Table 4.3: Easiest and hardest categories to classify based on evaluation of the mean error of the top 5 predictions from all submissions to the 2012 ILSVRC.


No. Classes | Negative Samples | Negative Classes | Ratio to a Positive Class
20 Classes | 12,836 | 10 classes | 1 to 10
100 Classes | 109,152 | 90 classes | 1 to 90
200 Classes | 244,855 | 190 classes | 1 to 200
300 Classes | 310,434 | 290 classes | 1 to 260

Table 4.4: Imbalanced training sets generated by randomly sub-selecting negative classes in addition to the 10 positive classes. The first column is the total number of classes (negative and positive). The remaining columns contain information about the total number of negative samples (i.e., the 11th class), including the approximate ratio to a single class from the 10 positive classes (around 1,200 images).

In the previous section, negative samples were increased by adding more classes in order to improve model accuracy. However, in practice, most datasets do not have as many classes as the ImageNet dataset (1000 classes). Therefore, instead of generating more classes, which requires a lot of effort for separating different images, the idea was to pool all negative images into one single class.

In this setting, the experiment was conducted using 11 classes, consisting of the positive classes taken from 100% of the 10-class training set (Table 4.1) and an 11th class that contained all negative samples. The 11th class, or negative class, was generated by randomly selecting some of the 990 remaining classes – the 1000 ImageNet classes without the 10 positive classes. The selected negative classes were pooled into a single class (i.e., the 11th class).

Hence, 4 training sets of incrementing imbalanced data were created: the 20-class, 100-class, 200-class and 300-class training sets (Table 4.4). Each training set consists of the 10 positive classes and several negative classes pooled into the 11th class. The set names denote the total number of classes in a training set; for example, the 100-class training set contained 10 positive classes and 90 randomly picked negative classes. In addition, the ratio of the imbalanced class to a positive class was computed by dividing the total number of negative samples by the average number of samples per class of the 10 positive classes (around 1,200 images).

Based on empirical results, using the default setting, CNNs could quickly overfit the class containing the negative samples (i.e., the 11th class). Therefore, a modification to the loss function was proposed in order to overcome this problem. A coefficient K was introduced to adjust the gradients of the softmax loss function.

$$ \frac{\partial E}{\partial y} = \big( p(c \mid y) - y \big)\, K \qquad (4.1) $$

The variable K is a classification cost employed to slow down learning whenever the true label y is the negative class (i.e., the 11th class). In this experiment, K was set to 0.1 for the 11th class and to 1 for the other 10 classes.
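A minimal sketch of the modified gradient; the actual change was made inside the Caffe softmax loss layer, so the function and variable names here are purely illustrative:

import numpy as np

def weighted_softmax_gradient(probs, y_true, negative_class, k_negative=0.1):
    # probs: softmax outputs p(c|y) for one sample; y_true: one-hot ground truth.
    # The gradient (Equation 4.1) is scaled by K = k_negative for the pooled negative class.
    k = k_negative if y_true[negative_class] == 1 else 1.0
    return (probs - y_true) * k

probs = np.full(11, 1.0 / 11)          # toy 11-class prediction
y = np.zeros(11); y[10] = 1.0          # the sample belongs to the 11th (negative) class
print(weighted_softmax_gradient(probs, y, negative_class=10))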


4.2 Wikipainting Dataset

The Wikipainting dataset (Figure 4.1) contains images that have different visual characteristics from the object categories in the ImageNet dataset. For this reason, pre-trained CNN models might need to learn new features that differ from the image features occurring in the ImageNet dataset, on which the CNN model is typically pre-trained.

The Wikipainting2 dataset contains a collection of painting styles from different eras, ranging from the Renaissance to the Modern Art movements. Figure 4.1 shows sample images from a subset of the Wikipainting classes. As one can see, the images exhibit different visual characteristics from real-world object images, with strong strokes and unique color compositions. Furthermore, recognizing painting styles is a non-trivial task, which usually requires a domain expert (i.e., a painting curator). For instance, art movements such as Color Field (Figure 4.1 B) and Cubism (Figure 4.1 D) typically share a similar visual pattern (i.e., boxes), so classifying these classes is difficult for laypeople.

The main author of [27] has provided a list of image URLs in his GitHub repository3. The image URLs cover the painting collection of the October 2013 Wikipainting art collection. Since an offline version of the dataset was not available, automatic crawling of the Wikipainting website was required in order to gather the dataset (the code to crawl the website can be found in his repository). The detailed implementation is described in his blog4. Each painting comes with information such as the artist name, the art movement (style), the year of creation and the name of the gallery where the painting is exhibited. For this experiment, the classes were taken from the paintings' art movement or style.

For the experiment, the training and test sets were prepared using the same settings as in the ImageNet experiment. Each class was restricted to 1000 images for training and 50 images for testing. Since the number of images in the Wikipainting categories varies a lot (e.g., some categories have fewer than 100 images and others more than 10,000 images), only classes with at least 1050 images were chosen (i.e., 22 classes). Similar to the ImageNet scenario, training sets with an incrementing number of images were generated from different percentage ratios: 5, 10, 20, 40, 60, 80 and 100% (Table 4.2). Three types of approaches were investigated on the Wikipainting dataset: FV encodings, CNNs trained from scratch and pre-trained CNN models with fine-tuning.

4.3 Algorithm Implementations

This section clarifies how the different approaches were implemented on the datasets mentioned above. Both CNNs and FV were evaluated in a multi-class classification scenario. Each approach learned patterns in the training data to predict unseen samples in the test data. As benchmarks, Mean Average Precision (MAP) scores of all approaches on each test set were computed. The results are compared and discussed further in the next chapter.

2www.wikiart.org
3https://github.com/sergeyk/vislab
4http://vislab.berkeleyvision.org/tutorial.html


Figure 4.1: The Wikipainting dataset, a collection of painting images: (a) Baroque, (b) Color Field, (c) Rococo, (d) Cubism, (e) Impressionism, (f) Expressionism, (g) Surrealism, (h) Ukiyo-e.

4.3.1 Caffe Framework

Caffe is a CNN framework developed by the Berkeley Vision and Learning Center (BVLC) research group. It is an open-source library published under the BSD 2-Clause license. Caffe is written in highly optimized C++ with support for the NVIDIA CUDA library, which enables running CNNs with a high degree of parallelization on a GPU machine. Moreover, Caffe allows users to use various pre-trained CNN models from the Caffe Model Zoo, a collection of CNN models with different architectures and training data. These models were collected from individuals and researchers who applied CNNs for different purposes (e.g., speech recognition, simple regression). Another benefit of the Caffe library is the availability of complete documentation and community support, which other CNN libraries such as cuda-convnet5 (the library published by the authors of [10]) or Overfeat6 lack.

The entire CNN experiment was carried out with the Caffe framework [44] on a 6 GB Tesla K20X GPU machine running on the HPI Future SOC Lab7. A CNN model with deep layers usually consists of a large number of parameters (e.g., the CNN proposed by [10] contains more than 60 million parameters); therefore, a GPU with a large amount of memory is necessary to run this type of CNN architecture.

In order to run the Caffe framework on the Future SOC, all dependencies listed in Table 4.5 had to be compiled locally in a home directory. This list helps to reproduce the results reported in this thesis.

5http://code.google.com/p/cuda-convnet/
6http://cilvr.nyu.edu/doku.php?id=code:start
7hpi.de/en/research/future-soc-lab.html
8http://www.nvidia.com/object/cuda_home_new.html
9http://www.boost.org/
10http://opencv.org/
11https://developers.google.com/protocol-buffers/
12http://gflags.github.io/gflags/
13http://code.google.com/p/google-glog/
14http://leveldb.org/
15http://www.numpy.org/


Library | Version | Description
Nvidia CUDA8 | 6.5 | An Nvidia GPU library to run general purpose computing on a GPU machine.
MKL BLAS Library | – | Intel-based Basic Linear Algebra Subprograms (BLAS) library used for vector and matrix multiplications.
Boost9 | 1.56.0 | A C++ general purpose library.
OpenCV10 | 2.4.9 | A computer vision library used for image processing.
Google Protobuf11 | 2.6.0 | For serializing class or object settings (i.e., configuring the CNN model architecture).
Gflags12 | 2.1.1 | A package containing a C++ library that implements command-line flag processing.
Glog13 | 0.3.3 | A Google logging module used for logging purposes in Caffe.
LevelDB14 | 1.17 | An input/output library used mainly to store and fetch raw images.
Python | 2.7 | An interpreted programming language, an alternative to the C++ interface.
Numpy15 | 1.9.0 | Used for speeding up scientific computing in Python.

Table 4.5: A list of Caffe library dependencies applied in this experiment.

Since user access to the root directory was limited – for security reasons in the shared infrastructure – every library had to be compiled locally in the user's home directory, without installing it via a Linux package manager. As a consequence, this approach caused a lot of dependency issues due to mismatched versions among different libraries.

In addition, in order to achieve a training speed similar to the results reported on the Caffe website, images typically need to be transformed into a LevelDB format before being fetched by CNN models. Nevertheless, in this experiment, images could be fetched directly from the raw image sources thanks to the Future SOC cache system, whose size was large enough to hold the entire set of training images. For this reason, the training speed in this experiment could reach the training speed reported on the Caffe website.

4.3.2 Convolutional Neural Networks

In this experiment, the CNN architecture proposed by Krizhevsky et al. for the ImageNet 2012 competition was chosen [10]. A replication of the model architecture is provided by the Caffe framework, with small modifications to the order of the pooling and normalization steps (i.e., pooling is applied before normalization). This setting helps to speed up the forward pass without reducing accuracy.

As shown in Figure 4.2, the Caffe CNN architecture consists of five convolutional layers and three fully connected layers, followed by a softmax layer that calculates a probability for each output. Hence, the number of outputs of the last fully connected layer depends on the number of classes to classify (e.g., 22 for the Wikipainting classes and 1000 for the entire set of ImageNet classes). Each convolutional layer is activated by a ReLU activation function. Max pooling layers are applied to the 1st, 2nd and 5th convolutional layers, and LRN is applied after the 1st and 2nd pooling layers.

Following [10], every image was resized to 256 × 256 pixels and a center crop of 224 × 224 was extracted as input for the model. Additionally, mean subtraction – using the mean obtained by averaging the pixel values over all training images – was applied to each input image. No further data augmentation was applied in this experiment, since it was reported to contribute only slightly to the results.
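The preprocessing can be sketched as follows (an illustration with PIL/NumPy; the file names and the stored mean image are assumptions, the actual pipeline used Caffe's data layers):

import numpy as np
from PIL import Image

def preprocess(path, mean_image, resize=256, crop=224):
    # Resize to 256 x 256, extract the 224 x 224 center crop, subtract the training-set mean.
    img = Image.open(path).convert("RGB").resize((resize, resize))
    arr = np.asarray(img, dtype=np.float32)
    off = (resize - crop) // 2
    arr = arr[off:off + crop, off:off + crop, :]   # center crop
    return arr - mean_image                        # mean subtraction

# Hypothetical usage; 'mean.npy' would hold the pixel-wise mean of the training images.
# x = preprocess("some_image.jpg", np.load("mean.npy"))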

For training the CNNs on the ImageNet dataset, the following training parameters were used: momentum 0.9, weight decay 5·10⁻⁴, and an initial learning rate of 10⁻³ for the 10-class scenario and 10⁻² for the 100- and 200-class scenarios (decreased by a factor of 10 every 20 epochs16). Moreover, for each dataset a total of 90 epochs were trained. Weights and biases were initialized for all convolutional layers before training the models: the weights were drawn from a Gaussian distribution with mean 0 and standard deviation 0.01, and the biases were simply set to 0. The batch size per run (i.e., forward and backward propagation) was set to 128 images for the 100- and 200-class settings and 64 images for the 10-class setting. Based on empirical results, a smaller learning rate and batch size were required to reduce the error rate in the 10-class setting.

In the Wikipainting scenario the same settings were used, with a small modification: an initial learning rate of 10⁻³ and a batch size of 128 images. In addition, a pre-trained CNN model with fine-tuning was employed. The Caffe library provides ready-to-use pre-trained CNN models trained on the entire ImageNet 2012 training set (i.e., 1000 classes). In order to fine-tune a pre-trained CNN model to a target dataset, the last layer (i.e., the 3rd fully connected layer) was replaced with the new target outputs – 22 outputs for the Wikipainting scenario. Additionally, the learning rate of the last layer was multiplied by a factor of 10. This is the setting proposed by the Caffe authors, which enables the weight parameters of the last layer to learn faster. Moreover, based on some test experiments, pre-trained models could converge within a smaller number of epochs (i.e., 30 epochs). The rest of the settings – such as the initial learning rate, momentum and batch size – employed the same setup as the CNNs trained from scratch.

16one epoch means that an entire training set has gone through the learning algorithm once.


Dataset | No. Classes | Epochs | Batch Size | Total Training Time
ImageNet | 10 Classes | 90 | 64 images | 3 hours
ImageNet | 100 Classes | 90 | 128 images | 59 hours
ImageNet | 200 Classes | 90 | 128 images | 126 hours
Wikipainting | 22 Classes | 90 | 128 images | 13 hours

Table 4.6: The training time required for training the 7 subset splits (i.e., 5, 10, 20, 40, 60, 80 and 100% ratios) in the ImageNet and Wikipainting scenarios.

Furthermore, even though model training was much faster on a Future SOC GPU (160 images per second on a K20X GPU, including forward and backward propagation), training all of the CNN models used in this experiment still consumed a lot of time. Table 4.6 shows the training time required to train the entire set of 7 subset splits in the ImageNet and Wikipainting scenarios – excluding the imbalanced setting, which took roughly another 95 hours to train. Fortunately, the HPI Future SOC provided two Tesla K20X GPUs which could be run separately in order to speed up training of all scenarios.

4.3.3 Fisher Vector Encodings

To compute the FV encoding, an internal library provided by the HPI Semantic Technologies research group was used, which is based on the VLFeat implementation [45]. The BoVW pipeline ranges from feature extraction using a Dense SIFT implementation and FV encoding to model training with a linear SVM. Firstly, SIFT features were extracted with a step size of 4 pixels at 7 different scales decreasing by a factor of √2. For each local point, 128D SIFT features were extracted on the gray channel and dimensionally reduced to 80D by means of Principal Component Analysis (PCA). Moreover, the spatially-extended local descriptor modification [46] was employed by appending the image location to the local descriptor, creating an 82D descriptor. A GMM with K = 256 components was applied to soft-quantize the extracted SIFT features, creating a Fisher Vector representation of 2 × 82 × 256 = 41,984D (the factor 2 denotes the gradients of the means and standard deviations of the samples with respect to the GMM components). Finally, the improved version of Fisher Vectors (IFV) applies signed square-rooting to the individual components of the encoding, followed by L2 normalization. As described in Chapter 2, a linear SVM was trained with the IFV descriptions as input data, and a one-versus-rest scenario was implemented for the multi-class setting by minimizing the hinge loss function. As is typical practice, the C parameter should be optimized using cross-validation; however, in order to reduce training time, the C parameter was set to 10 in this implementation.

For the Wikipainting dataset, the same settings were applied with minor modifications. SIFT features of 128 × 3 = 384D were extracted from the 3-channel RGB input and dimensionally reduced to 120D – adding spatial information created a 122D descriptor. Based on empirical results, the Wikipainting dataset benefited more from color information than from single-channel information. Applying the same number of GMM components as above, each image in the dataset was represented by a 2 × 122 × 256 = 62,464D feature vector. Moreover, since some images had a very large resolution, every image was scaled to a fixed size of 512 × 512 pixels.


Figure 4.2: The Caffe CNN architecture: a replication of the CNN architecture proposed by [10], with small modifications to the order of the pooling and normalization steps (i.e., pooling is applied before normalization).

Chapter 5

Evaluation and Discussion of Results

In this chapter, the results of the experiments are evaluated and analyzed. Section 5.1 presents the evaluation results for the ImageNet dataset, followed by visualizations of both FV and CNN models on different datasets, including the evaluation results obtained after training CNNs on the imbalanced datasets. Section 5.2 presents the evaluation results for the Wikipainting dataset.

5.1 Evaluation on the ImageNet Dataset

For evaluation purposes, AP scores per class were computed from the trained models on a separate 10-class test set. Figures 5.1a and 5.1b show the MAP scores for FV and CNNs respectively (for individual AP scores see Tables A.1 and A.2 in the appendices).
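The evaluation metric can be reproduced along the following lines (an illustrative scikit-learn sketch, not the exact evaluation script used in the thesis):

import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    # y_true: binary matrix (samples x classes); y_score: prediction scores of the same shape.
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps)), aps

y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])                    # toy ground truth
y_score = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])   # toy scores
map_score, per_class = mean_average_precision(y_true, y_score)
print(map_score, per_class)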

Both models show similar improvement trends when increasing the number of positive samples (i.e., going from 5% to 100%). The most significant improvements occurred in the range from 5% to 40%. In the 100-class and 200-class settings, FV seems to saturate at around 80%, as opposed to CNNs, whose performance kept growing even at the 100% ratio.

Furthermore, increasing the number of negative samples (i.e., going from 10 to 200 classes) affected the FV performance greatly. FV MAP scores dropped significantly from 77.5% in the 10-class setting to around 70% in the 100- and 200-class settings. The most difficult classes, such as muzzle and hatchet, contributed to this decline in classification performance. The AP score of the muzzle class dropped from 71% in the 10-class setting to 36% in the 100-class setting – both results obtained using the full sets (100%) of each setting. On the other hand, CNNs benefited from increasing the number of classes, with the highest MAP score being 78.6% using the entire set (i.e., the 100% ratio) of the 200-class scenario. As a matter of fact, the best performing FV score (MAP = 77.5%) was achieved using only the initial 10 classes.

Moreover, a threshold can be drawn where CNNs managed to outperform the best performing FV model (Figure 5.1c).



(a) FV   (b) CNN   (c) CNN and FV

Figure 5.1: A comparison of FV and CNN MAP scores on the incrementing ImageNet datasets. Figure 5.1c depicts the highest MAP scores from both models: the CNN trained on the 200-class scenario and the FV trained on the 10-class scenario. 60% was the minimal amount of data required for a CNN model to outperform FV.

(per-class panels: geyser, odometer, canoe, web site, yellow lady's slipper, hook&claw, muzzle, spatula, hatchet, ladle; legend: 200-class CNN, 10-class IFV)

Figure 5.2: A comparison of FV and CNN AP scores per class from their best performing models: FV models trained on the 10-class scenario and CNN models trained on the 200-class scenario.


The figure depicts that the CNN trained on the 200 classes managed to supersede the best performing FV (i.e., trained on the 10 classes) at around 60%, where the two lines intersect.

Analyzing the individual per-class AP scores (Figure 5.2), it seems that both best performing models could easily predict the 5 easier classes, as opposed to the 5 harder classes, where the AP scores were equally low for both CNNs and FV. Therefore, correctly predicting the worst performing classes contributed significantly to the increase of the MAP scores.

Interestingly, using only 5% of the 10 classes, FV could almost perfectly predict the 5 easier classes (i.e., more than 95% AP scores), whereas the CNN approach required 40% of the 200 classes in order to obtain similar scores. This result shows that FV can predict scenery-type images almost perfectly using only little training data. For CNNs, the canoe class seems to progress more slowly than the other 4 easier classes, with a 68% AP score using the 5% ratio of the 200 classes. It might be that, in contrast to the other 4 easier classes, the images in the canoe class are less scenery-like, although similar backgrounds occur often in the canoe class (e.g., a canoe usually comes with a water or grass background).

Although the best performing CNN and FV models gave only slightly different MAP scores (78.6% and 77.5% respectively), important object regions were seen differently by both models – especially in the harder classes, as will be discussed in the model visualization subsection.

These results lead to two conclusions: firstly, FV-based models outperform CNNs on smaller datasets, and secondly, increasing the number of samples (both positive and negative) helps CNN-based models to learn more robust features.

5.1.1 Visualizing Model Predictions

In order to see how the CNN models evolve with the size of the data, a heatmap visualization was employed on the models trained on the full set (i.e., the 100% ratio) of the 10-, 100- and 200-class settings. The method was adopted from the heatmap visualization presented in [20]. It works by occluding parts of a test image stepwise prior to classification, in order to visualize the regions with the highest impact on the overall classification score.

The detailed implementation is as follows: Firstly, two empty matrices of the same size as the test image were prepared. A sliding window of 64 × 64 was moved over the image pane, setting the occluded region to 0 (Figure 5.3). For each step (i.e., 8 pixels), a prediction score for the true class label (e.g., muzzle) was computed by a trained model. This prediction score was added to the first matrix at the same location as the partially occluded region; additionally, the second matrix was used to count how many times each area was occluded.

This procedure was iterated until the last step of the sliding window. Lastly, the aggregated scores in the first matrix were normalized by the counts in the second matrix. The first matrix was then plotted as a heatmap, visualizing the regions with the highest impact.
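The procedure can be summarized by the following sketch (illustrative; predict_fn stands in for the forward pass of a trained model and is an assumption, the window and step sizes are those stated above):

import numpy as np

def occlusion_heatmap(image, predict_fn, true_class, window=64, step=8):
    # image: (H, W, C); returns an (H, W) matrix of averaged true-class scores.
    h, w = image.shape[:2]
    scores = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - window + 1, step):
        for x in range(0, w - window + 1, step):
            occluded = image.copy()
            occluded[y:y + window, x:x + window] = 0       # zero out the occluded region
            s = predict_fn(occluded)[true_class]           # score for the true label
            scores[y:y + window, x:x + window] += s        # first matrix: aggregated scores
            counts[y:y + window, x:x + window] += 1        # second matrix: occlusion counts
    return scores / np.maximum(counts, 1)                  # normalize by the counts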

Deciding the size of the sliding window and the step size was important in generating a heatmap matrix. A small step size led to longer computation times due to more regions being occluded (note that every occluded region needs to be evaluated by the model), whereas a big step size resulted in too coarse regions, creating poor visualizations. In addition, the sliding window needed to be large enough to cover object regions, but also fine-grained enough to pinpoint important regions.

Figure 5.3: An image from the ImageNet muzzle class, with examples of some regions set to zero in order to visualize heatmaps.

Figure 5.4 displays a true positive muzzle image superposed with heatmap matrices generated from all models trained on the full sets (i.e., 100%) of all scenarios. The heatmaps are interpreted as follows: the color palette ranges from blue to red, where strong blue denotes regions with the highest responses, meaning that these regions contributed significantly to the prediction score – as opposed to red regions, which did not contribute as significantly. The prediction scores depicted on the color palettes differ between CNNs and FV. For CNNs, the values are computed by a softmax function; therefore the output is a probability value ranging from 0 to 1. In contrast, for FV, the values are not probabilities, but the distance from the separating hyperplane computed by the SVM.

Interestingly, analyzing Figures 5.4 D–F, all of the CNN output scores were above 0.5, indicating that, even in the presence of occlusions, the CNN models managed to correctly classify the muzzle image. However, the same conclusion cannot be drawn for FV (Figures 5.4 A–C), since in a linear SVM the output scores range from minus infinity to plus infinity.

In the 10-class setting (Figures 5.4 A and D), both models considered grassy regions as important features – it may be that a majority of muzzle images contain grass regions as backgrounds. However, with more training samples (i.e., the 100- and 200-class scenarios), CNNs could better focus on the object region.


(a) 10 classes  (b) 100 classes  (c) 200 classes  (d) 10 classes  (e) 100 classes  (f) 200 classes

Figure 5.4: A true positive image in the muzzle category superposed with heatmap matrices generated from all of the full set models (i.e., 100%). Figures A–C display the heatmaps plotted using FV models; Figures D–F display the heatmaps plotted using CNN models. The color palettes show the confidence scores of each model – probability outputs for CNNs and distances from the decision boundary for FV.


For FV, on the other hand, the muzzle area remained mixed with the background regions. The important areas detected by FV were more spread out and did not focus on the target objects. Similar observations can be made when analyzing heatmaps from other classes (see Figure 5.5).

Figure 5.5 shows visualizations of important regions for both models from the 10-class to the 200-class setting. The figures consist of two images from the 2 easier classes (canoe, Figure 5.5 A, and yellow lady's slipper, Figure 5.5 B) and two images from a harder class (spatula, Figures 5.5 C–D), with the FV models in the top rows and the CNN models in the bottom rows. In the easier classes, the important regions identified by the FV models were more spread out than the important regions in CNNs, whereas CNNs focused more on the object regions as the number of samples increased (e.g., Figure 5.5 A, going from the 10-class to the 100-class scenario).

Both approaches had difficulties locating meaningful regions for both of the spatula images. In the first spatula image (Figure 5.5 C, where the spatula is being held by a man), the important regions were mixed between the spatula and other surrounding objects (e.g., faces, hands). In the other spatula image (Figure 5.5 D), FV could better focus on the spatula region (i.e., the head of the spatula), whereas CNNs focused on the center region, which was mixed with other objects.

Similar to the visualizations of the muzzle class (Figure 5.4), the CNN models could robustly predict images in the presence of occlusions (especially for yellow lady's slipper, where no score was less than 95%). On the contrary, for both spatula images, the CNN models reported low probability scores.

5.1.2 Evaluation on Imbalanced Dataset

(x-axis: number of classes; y-axis: Mean Average Precision; legend: CNN Default Setting, CNN Imbalanced Setting)

Figure 5.6: A comparison of MAP scores for the CNN default setting and the CNN imbalanced setting with different sizes of imbalanced data.

Figure 5.6 shows the MAP scores on the 10-class test set using the models trained on the incrementing imbalanced data (for the detailed scores per class see Table A.3 in the appendices). Two CNN approaches were investigated in this experiment: a CNN trained using the default setting (i.e., without modifying the loss function), and the proposed imbalanced setting (modifying the loss function).

The CNN with the imbalanced setting showed an increasing trend as the number of negative samples increased (going from 20 to 300 classes).


(a) Canoe  (b) Yellow lady's slipper  (c) Spatula 1  (d) Spatula 2

Figure 5.5: Heatmaps computed for some of the easier classes (canoe, Figure 5.5 A, and yellow lady's slipper, Figure 5.5 B) show that CNNs focus better on the depicted object when adding more (negative) data. Example images taken from one of the harder classes (spatula, Figures 5.5 C–D) convey that both approaches had difficulties locating meaningful regions. FV models (top rows) and CNN models (bottom rows) have been trained on the 10-, 100- and 200-class datasets.


The MAP scores increased from 70% (the 20-class setting) to 72.3% (the 100-class setting) and 72.4% (the 200-class setting). On the other hand, the CNN trained using the default setting got worse as the number of negative samples increased, from 70.8% (20-class setting) to 67.8% (200-class setting). Moreover, adding too many negative samples (i.e., the 300-class setting) could hurt the classification performance of both approaches.

Compared to the balanced dataset (i.e., where samples are distributed uniformly among classes) with the same number of classes, the CNN trained on the imbalanced dataset showed worse classification performance than the CNN trained on the balanced dataset. For example, the model trained on the 100-class setting of the balanced dataset obtained a 75.2% MAP score, compared to 72.3% obtained in the imbalanced setting. Nevertheless, the classification performance of the models trained on the 100 and 200 classes using the proposed imbalanced setting (MAP = 72.4%) was higher than that of the model trained on the initial 10 classes only (MAP = 71.5%). This indicates that it is possible to improve an existing CNN model's performance by adding more negative samples without the cost of separating them into different classes.

5.2 Evaluation on Wikipainting Dataset

(x-axis: dataset ratios; y-axis: Mean Average Precision; legend: FV, CNN, CNN fine-tuned)

Figure 5.7: A comparison of MAP scores for FV, CNN trained from scratch and pre-trained CNN on the incrementing Wikipainting datasets.

Figure 5.7 shows the MAP scores computed using the trained models on the separate Wikipainting test set, consisting of 50 images for each of the 22 classes (for detailed scores per class see Table B.1 in the appendices). Similar to the ImageNet scenario, adding more positive samples increased the classification performance for all three approaches. The best results were reported by the pre-trained CNN models, whereas the CNNs trained from scratch performed worse than FV.

In the beginning (5% to 10% ratios), the FV-based models superseded the pre-trained CNN approach, with the biggest gap occurring at the 5% ratio (the MAP scores differed by around 6%). However, the pre-trained models kept improving with an incrementing number of samples, until they started to outperform the FV models using 20% of the training data.

The highest score (i.e., MAP = 55.9%) was achieved by the pre-trained CNN models trained on the full set (i.e., the 100% ratio). Moreover, the FV models converged at 80% and dropped at 100% (from 51.2% to 48.9% MAP). Conversely, both CNN approaches (i.e., pre-trained and trained from scratch) kept improving with an increasing number of samples without showing any sign of decline. A similar characteristic, where FV saturated quite early (i.e., at the 80% ratio), was also reported in the previous experiment on the incrementing ImageNet dataset.

These results demonstrate that pre-trained CNN models with fine-tuning need to learn new features when the target dataset is not similar to the initial pre-training dataset (i.e., ImageNet). Moreover, FV can be a better alternative for this type of dataset when only limited amounts of data are available.

Compared to the results of the previous ImageNet experiment, both approaches had more difficulty recognizing the different styles in the Wikipainting dataset. One may notice that no class obtained a perfect AP score in the Wikipainting dataset, whereas for the 5 easier ImageNet categories both approaches could achieve almost 100% AP for each individual class. Classes such as Surrealism, Realism and Post-Impressionism achieved AP scores lower than 35%. An exception was the Ukiyo-e class – Japanese painting images that contain special character inscriptions – which managed to reach more than 80% AP using all approaches. Figure 5.8 displays the contrast between the Ukiyo-e and Surrealism categories. Classifying the Ukiyo-e category appears to be much easier due to similar visual characteristics, such as vanilla color backgrounds or Japanese inscriptions, in contrast to Surrealism, where each image contains unique visual styles.

A confusion matrix based on accuracy was computed to give an analysis of individual class performance with respect to the other classes (Figure 5.9). Interestingly, the classifier got confused distinguishing similar classes, such as Impressionism and Post-Impressionism. Given that they share similar names (i.e., Impressionism), classifying these two classes might even be difficult for humans. In addition, the model often misclassified Expressionism as Cubism. Even though no clear explanation can be given, they might share similar characteristics due to the fact that they appeared in close time periods (i.e., Modern Art).

Figure 5.8: Contrasting the easiest (Ukiyo-e [top]) and one of the hardest classes (Surrealism [bottom]) in the Wikipainting dataset. Ukiyo-e contains similar visual patterns across different images (e.g., similar background colors, inscription letters), whereas in Surrealism the visual patterns are mostly unique.


(Confusion matrix over the 22 Wikipainting classes: Abstract Expressionism, Art Nouveau (Modern), Baroque, Color Field Painting, Cubism, Early Renaissance, Expressionism, High Renaissance, Impressionism, Mannerism (Late Renaissance), Minimalism, Naive Art (Primitivism), Neoclassicism, Northern Renaissance, Pop Art, Post-Impressionism, Realism, Rococo, Romanticism, Surrealism, Symbolism, Ukiyo-e; cell values range from 0.0 to 0.92.)

Figure 5.9: A Wikipainting confusion matrix based on prediction accuracies from the pre-trained CNN model on the 22 different classes.

Chapter 6

Conclusion and Future Work

This chapter concludes the experiments, discusses the major findings and limitations, and makes recommendations for possible improvements. Section 6.1 summarizes the experiments. Finally, Section 6.2 presents further possible research topics beyond the scope of this thesis.

6.1 Summary

In this thesis, the impact of incrementing datasets on the classification performance of both Convolutional Neural Networks and Bag of Visual Words with Fisher Vector encodings has been evaluated. Several evaluation results and findings have been collected from the conducted experiments.

Firstly, in accordance with the initial hypothesis, both approaches, CNNs and FV, benefited from increasing the number of positive samples (i.e., going from 5% to 100%). However, the FV classification performance converges once a certain amount of data is reached, whereas CNNs exhibited a steady performance increase in line with the number of samples. This behavior was observed on both datasets used in this thesis (i.e., the ImageNet and the Wikipainting dataset).

Secondly, a new insight has been gained: while CNNs were able to learn better features when the number of negative samples was increased (i.e., going from 10 to 200 classes), the precision of FV-based models suffered from the larger class diversity. This observation sheds light on the impact of negative samples on the classification performance of FV-based models and contradicts the typical assumption that increasing the number of samples improves model accuracy.

Furthermore, a decision threshold with respect to a restricted test set (i.e., the 10 ImageNet classes) has been derived. Combining both positive and negative samples, a CNN-based model managed to surpass the best-performing FV-based model at a certain threshold point (i.e., 60% of the 200-class scenario). However, to reach this threshold point, a CNN model had to be trained on a sufficiently large number of samples (i.e., around 155,000 images), which may not be attainable for most datasets (Caltech-101 and Caltech-256, for example, contain only 9,146 and 30,607 images, respectively).


Moreover, analyzing each individual ImageNet class, it appears that FV could easily predict the 5 easier ImageNet classes (i.e., scenery-type images). In fact, even with a small amount of training data (i.e., 5%), FV models obtained almost perfect AP scores. Therefore, for this type of image, it might not be necessary to use a CNN-based approach, especially considering the hardware resources it requires.

Finally, the experiment on the Wikipainting dataset, whose visual characteristics and categories differ from those of the ImageNet dataset, showed that FV could outperform pre-trained CNN models on smaller subsets (i.e., 5% to 20%). This observation suggests that on a dataset that differs substantially from the data a CNN model was pre-trained on, the model first needs to learn new features that are not covered by its pre-training data.

In conclusion, this thesis has given a detailed comparison of both approaches on varying types of datasets. The results show that CNN-based models benefit from a large dataset, whereas the hand-crafted feature approach based on FV encodings remains a competitive candidate in situations where only a limited amount of data is available. Moreover, a decision threshold has been reported, which helps researchers estimate which model to favor for a given number of samples.

6.2 Future Work

Although this work has compared BoVW with FV encodings and CNNs on varying datasets, further research could be developed in a number of ways.

The CNN-based models employed in this thesis were restricted to the CNN architecture proposed by Krizhevsky et al. [10]. An implementation using more advanced CNN models [12, 16, 20] might lead to a lower threshold point due to the better feature descriptors generated by deeper models. Moreover, the models applied in this thesis were trained without any image augmentation step (i.e., transforming images in order to increase the number of samples), which many prior works have shown to be important for improving model accuracy [10, 12]. Analyzing the impact of image augmentation on both approaches with respect to varying data sizes could be a further research study; a sketch of such an augmentation step is given below.
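As a rough illustration of such an augmentation step, the following sketch generates randomly cropped and mirrored variants of a training image (the crop size, number of variants, and helper name are illustrative choices, not the exact procedure used in [10, 12]):

import random
from PIL import Image

def augment(image, crop_size=227, num_crops=5):
    # Produce several randomly cropped and horizontally flipped variants
    # of a single training image, increasing the effective sample count.
    width, height = image.size
    variants = []
    for _ in range(num_crops):
        left = random.randint(0, width - crop_size)
        top = random.randint(0, height - crop_size)
        crop = image.crop((left, top, left + crop_size, top + crop_size))
        if random.random() < 0.5:
            crop = crop.transpose(Image.FLIP_LEFT_RIGHT)
        variants.append(crop)
    return variants

# Example usage: resize first so that random crops fit into the image.
# variants = augment(Image.open("painting.jpg").resize((256, 256)))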

As mentioned in the introduction chapter, providing manually annotated data for training sets is a tedious and time-consuming task, as it involves grouping numerous images into many different classes. For this reason, the initial work on imbalanced datasets in this thesis has demonstrated the possibility of improving a CNN model by treating unclassified images as a single negative class. Nonetheless, a better approach than the classification cost proposed in this thesis should be employed. One way to improve a CNN model in an imbalanced setting is to oversample smaller classes or undersample larger classes, which ensures that every class contributes an equal number of samples (see the sketch below). Another option is to apply a different loss function instead of the default softmax loss implemented by the Caffe framework; the authors of [12] proposed a one-vs-rest hinge loss to train a CNN model in a multi-label scenario.
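A minimal sketch of the oversampling idea mentioned above (the helper and data layout are hypothetical and not the weighting scheme used in this thesis): under-represented classes are re-drawn with replacement until every class matches the size of the largest one.

import random
from collections import defaultdict

def oversample(samples):
    # samples: list of (image_path, label) pairs.
    # Returns a shuffled list in which every label occurs equally often.
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw additional items with replacement for smaller classes.
        balanced.extend(random.choices(items, k=target - len(items)))
    random.shuffle(balanced)
    return balanced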

In relation to FV-based models, the experiments have revealed some limitations of FV with respect to increasing the number of positive samples (i.e., going from 5% to 100%) and negative samples (i.e., going from 10 to 200 classes). Unfortunately, this result is still not fully understood and cannot be explained in this thesis. Therefore, further analysis in this direction may lead to an improvement of the hand-crafted feature approaches.

Moreover, this thesis is limited to a multi-class scenario (i.e., assigning an image to a single true label). A comparison between FV and CNNs on different classification tasks, such as multi-label settings, may give different results. In the end, despite the strong results shown by CNNs, there is still a lack of explanation of the underlying contributing factors. Hence, finding a connection between learned features and hand-crafted features remains an open question that can be explored in further research.

Appendix A

ImageNet Evaluation Results

Model            Ratio   MAP    Geyser  Odometer  Canoe  Website  Yellow lady's slipper  Hook, claw  Muzzle  Spatula  Hatchet  Ladle

FV, 10 classes     5%    0.681  0.993   0.997     0.962  0.999    0.992                  0.411       0.364   0.315    0.379    0.402
                  10%    0.689  0.994   0.996     0.961  1.000    0.993                  0.480       0.511   0.258    0.377    0.325
                  20%    0.731  0.997   1.000     0.984  1.000    0.998                  0.560       0.615   0.267    0.508    0.381
                  40%    0.746  1.000   0.999     0.984  1.000    0.997                  0.612       0.653   0.299    0.519    0.398
                  60%    0.761  1.000   1.000     0.985  1.000    1.000                  0.621       0.700   0.323    0.540    0.440
                  80%    0.755  1.000   0.998     0.988  1.000    1.000                  0.627       0.696   0.310    0.510    0.424
                 100%    0.775  1.000   1.000     0.990  1.000    1.000                  0.636       0.709   0.385    0.549    0.480

FV, 100 classes    5%    0.607  0.998   0.998     0.982  0.991    0.996                  0.377       0.240   0.145    0.170    0.170
                  10%    0.638  1.000   0.999     0.986  0.998    1.000                  0.422       0.257   0.204    0.303    0.215
                  20%    0.659  0.999   1.000     0.995  0.994    0.999                  0.487       0.285   0.239    0.352    0.238
                  40%    0.677  1.000   0.998     0.991  1.000    0.998                  0.530       0.315   0.288    0.366    0.278
                  60%    0.689  1.000   0.999     0.990  1.000    1.000                  0.470       0.369   0.323    0.349    0.389
                  80%    0.701  1.000   0.998     0.980  1.000    1.000                  0.509       0.460   0.290    0.396    0.379
                 100%    0.699  1.000   0.998     0.994  1.000    0.999                  0.500       0.361   0.358    0.369    0.414

FV, 200 classes    5%    0.629  0.997   0.993     0.977  1.000    0.991                  0.377       0.275   0.242    0.251    0.188
                  10%    0.638  0.997   0.999     0.982  1.000    0.996                  0.381       0.347   0.172    0.322    0.186
                  20%    0.639  0.998   0.999     0.996  1.000    0.992                  0.429       0.307   0.212    0.259    0.198
                  40%    0.668  0.999   0.998     0.996  0.999    1.000                  0.491       0.364   0.200    0.362    0.275
                  60%    0.693  1.000   0.999     0.986  1.000    1.000                  0.523       0.404   0.288    0.386    0.344
                  80%    0.709  1.000   0.997     0.998  1.000    1.000                  0.506       0.524   0.328    0.354    0.379
                 100%    0.708  1.000   0.998     0.995  0.999    1.000                  0.534       0.462   0.281    0.454    0.354

Table A.1: Per-class AP scores of the FV models on the incrementally growing ImageNet datasets


Model            Ratio   MAP    Geyser  Odometer  Canoe  Website  Yellow lady's slipper  Hook, claw  Muzzle  Spatula  Hatchet  Ladle

CNN, 10 classes    5%    0.523  0.874   0.739     0.677  0.862    0.981                  0.225       0.338   0.180    0.207    0.147
                  10%    0.534  0.911   0.772     0.652  0.910    0.991                  0.249       0.213   0.227    0.193    0.226
                  20%    0.576  0.922   0.823     0.886  0.939    0.986                  0.270       0.338   0.199    0.231    0.167
                  40%    0.664  0.961   0.990     0.963  0.989    0.993                  0.386       0.437   0.283    0.321    0.313
                  60%    0.705  0.988   0.993     0.966  0.992    0.996                  0.429       0.506   0.340    0.510    0.334
                  80%    0.701  0.984   0.997     0.959  0.991    0.996                  0.474       0.490   0.313    0.429    0.377
                 100%    0.715  0.984   0.996     0.961  0.990    0.997                  0.475       0.471   0.378    0.547    0.353

CNN, 100 classes   5%    0.541  0.898   0.831     0.760  0.878    0.995                  0.199       0.209   0.302    0.180    0.163
                  10%    0.593  0.930   0.934     0.909  0.957    0.991                  0.244       0.308   0.252    0.201    0.202
                  20%    0.641  0.957   0.986     0.948  0.967    0.999                  0.377       0.343   0.341    0.282    0.213
                  40%    0.692  0.990   0.983     0.976  0.978    1.000                  0.389       0.397   0.414    0.486    0.307
                  60%    0.709  0.988   0.988     0.988  0.990    1.000                  0.489       0.474   0.363    0.463    0.346
                  80%    0.741  0.999   0.992     0.984  1.000    1.000                  0.564       0.445   0.438    0.585    0.408
                 100%    0.752  0.999   0.998     1.000  1.000    1.000                  0.574       0.536   0.452    0.580    0.381

CNN, 200 classes   5%    0.537  0.846   0.823     0.680  0.927    0.979                  0.227       0.261   0.213    0.187    0.229
                  10%    0.627  0.943   0.936     0.918  0.958    0.992                  0.332       0.392   0.276    0.257    0.267
                  20%    0.669  0.975   0.975     0.983  0.979    0.996                  0.413       0.378   0.289    0.423    0.277
                  40%    0.711  0.984   0.996     0.999  0.999    1.000                  0.417       0.499   0.419    0.469    0.326
                  60%    0.759  0.994   0.999     1.000  1.000    1.000                  0.594       0.572   0.483    0.547    0.397
                  80%    0.779  1.000   0.999     1.000  1.000    1.000                  0.618       0.636   0.514    0.607    0.421
                 100%    0.786  1.000   0.998     1.000  0.999    1.000                  0.647       0.625   0.538    0.610    0.446

Table A.2: Per-class AP scores of the CNN models on the incrementally growing ImageNet datasets

Model                    No. classes  MAP    Geyser  Odometer  Canoe  Website  Yellow lady's slipper  Hook, claw  Muzzle  Spatula  Hatchet  Ladle

CNN, default setting          20      0.702  0.996   0.997     0.987  0.998    1.000                  0.524       0.401   0.325    0.533    0.260
                             100      0.693  0.995   0.995     0.988  0.996    1.000                  0.337       0.350   0.469    0.515    0.286
                             200      0.681  0.986   0.996     0.981  0.991    1.000                  0.400       0.361   0.313    0.478    0.301
                             300      0.678  0.991   0.999     0.993  1.000    1.000                  0.396       0.404   0.412    0.311    0.274

CNN, imbalanced setting       20      0.706  0.998   0.990     0.976  0.995    1.000                  0.457       0.427   0.342    0.523    0.352
                             100      0.723  0.995   1.000     0.985  1.000    1.000                  0.491       0.451   0.437    0.517    0.356
                             200      0.724  0.998   0.998     0.989  0.998    1.000                  0.510       0.443   0.433    0.483    0.391
                             300      0.708  0.995   0.999     0.984  0.997    1.000                  0.461       0.442   0.394    0.497    0.308

Table A.3: Per-class AP scores of the CNN models on the imbalanced ImageNet datasets

Appendix B

Wikipainting Evaluation Results

Model             Ratio   MAP    Abstract Expressionism  Art Nouveau  Baroque  Color Field  Cubism  Early Renaissance

FV                  5%    0.371  0.325                   0.250        0.306    0.747        0.422   0.471
                   10%    0.400  0.353                   0.294        0.329    0.762        0.548   0.514
                   20%    0.438  0.419                   0.311        0.364    0.780        0.491   0.531
                   40%    0.471  0.415                   0.303        0.380    0.837        0.629   0.570
                   60%    0.495  0.456                   0.431        0.453    0.821        0.613   0.617
                   80%    0.512  0.490                   0.465        0.459    0.818        0.631   0.636
                  100%    0.489  0.437                   0.284        0.476    0.822        0.578   0.684

CNN, scratch        5%    0.125  0.115                   0.071        0.107    0.420        0.122   0.109
                   10%    0.164  0.095                   0.079        0.148    0.638        0.166   0.158
                   20%    0.212  0.160                   0.148        0.176    0.757        0.193   0.159
                   40%    0.278  0.162                   0.130        0.184    0.768        0.187   0.224
                   60%    0.325  0.176                   0.241        0.234    0.751        0.273   0.274
                   80%    0.361  0.252                   0.237        0.269    0.829        0.357   0.386
                  100%    0.393  0.249                   0.256        0.262    0.820        0.374   0.486

CNN, fine-tuned     5%    0.311  0.276                   0.173        0.124    0.839        0.400   0.388
                   10%    0.374  0.339                   0.291        0.219    0.840        0.442   0.512
                   20%    0.443  0.450                   0.332        0.215    0.885        0.568   0.530
                   40%    0.480  0.428                   0.408        0.291    0.870        0.593   0.513
                   60%    0.534  0.449                   0.478        0.362    0.846        0.659   0.598
                   80%    0.553  0.543                   0.456        0.567    0.876        0.695   0.625
                  100%    0.559  0.560                   0.464        0.443    0.894        0.652   0.722


Model             Ratio   Expressionism  High Renaissance  Impressionism  Mannerism  Minimalism  Naive Art  Neoclassicism  Northern Renaissance

FV                  5%    0.134          0.240             0.386          0.258      0.574       0.277      0.625          0.486
                   10%    0.102          0.247             0.405          0.316      0.626       0.316      0.638          0.554
                   20%    0.218          0.263             0.451          0.454      0.721       0.403      0.637          0.646
                   40%    0.145          0.432             0.391          0.466      0.741       0.440      0.677          0.629
                   60%    0.163          0.351             0.458          0.501      0.796       0.459      0.682          0.680
                   80%    0.175          0.457             0.416          0.485      0.810       0.483      0.694          0.697
                  100%    0.169          0.428             0.363          0.518      0.858       0.471      0.629          0.650

CNN, scratch        5%    0.050          0.086             0.115          0.202      0.359       0.113      0.058          0.061
                   10%    0.071          0.091             0.116          0.244      0.516       0.088      0.141          0.080
                   20%    0.122          0.103             0.199          0.243      0.563       0.094      0.284          0.176
                   40%    0.116          0.204             0.269          0.306      0.597       0.207      0.465          0.301
                   60%    0.173          0.172             0.273          0.367      0.677       0.195      0.516          0.376
                   80%    0.171          0.248             0.320          0.372      0.681       0.230      0.560          0.434
                  100%    0.197          0.328             0.368          0.375      0.671       0.242      0.594          0.483

CNN, fine-tuned     5%    0.127          0.179             0.256          0.211      0.755       0.180      0.447          0.264
                   10%    0.125          0.284             0.285          0.218      0.689       0.259      0.548          0.420
                   20%    0.271          0.335             0.372          0.422      0.779       0.329      0.532          0.498
                   40%    0.308          0.419             0.388          0.535      0.780       0.287      0.628          0.538
                   60%    0.228          0.426             0.536          0.576      0.794       0.395      0.653          0.676
                   80%    0.319          0.392             0.413          0.623      0.786       0.404      0.670          0.661
                  100%    0.235          0.439             0.482          0.614      0.856       0.391      0.693          0.628

Model             Ratio   Pop Art  Post-Impressionism  Realism  Rococo  Romanticism  Surrealism  Symbolism  Ukiyo-e

FV                  5%    0.310    0.233               0.174    0.460   0.225        0.133       0.270      0.863
                   10%    0.373    0.230               0.169    0.474   0.278        0.179       0.225      0.860
                   20%    0.433    0.231               0.154    0.431   0.361        0.128       0.351      0.854
                   40%    0.479    0.368               0.162    0.591   0.379        0.153       0.284      0.890
                   60%    0.454    0.242               0.215    0.504   0.457        0.233       0.408      0.902
                   80%    0.521    0.350               0.187    0.599   0.425        0.227       0.350      0.881
                  100%    0.466    0.266               0.191    0.584   0.457        0.211       0.325      0.881

CNN, scratch        5%    0.118    0.088               0.080    0.144   0.102        0.043       0.079      0.099
                   10%    0.169    0.070               0.118    0.175   0.149        0.063       0.079      0.156
                   20%    0.239    0.077               0.141    0.211   0.171        0.060       0.126      0.273
                   40%    0.265    0.159               0.188    0.321   0.257        0.109       0.208      0.478
                   60%    0.399    0.175               0.207    0.320   0.263        0.154       0.256      0.671
                   80%    0.352    0.170               0.243    0.364   0.294        0.144       0.301      0.731
                  100%    0.431    0.219               0.247    0.394   0.296        0.167       0.383      0.808

CNN, fine-tuned     5%    0.337    0.111               0.098    0.382   0.192        0.119       0.274      0.716
                   10%    0.389    0.166               0.197    0.443   0.279        0.134       0.373      0.778
                   20%    0.463    0.247               0.195    0.503   0.368        0.180       0.407      0.871
                   40%    0.479    0.284               0.183    0.572   0.429        0.286       0.433      0.920
                   60%    0.549    0.350               0.331    0.570   0.470        0.345       0.501      0.953
                   80%    0.568    0.336               0.268    0.639   0.592        0.358       0.418      0.956
                  100%    0.677    0.291               0.303    0.639   0.542        0.310       0.513      0.943

Table B.1: Per-class AP scores of the CNN and FV models on the incrementally growing Wikipainting datasets

Bibliography

[1] Jason Farquhar, Sandor Szedmak, Hongying Meng, and John Shawe-Taylor. Improving "bag-of-keypoints" image categorisation: Generative models and pdf-kernels. Technical report, University of Southampton, 2005.

[2] Albrecht Blaser, editor. Data Base Techniques for Pictorial Applications, Florence, Italy, June 20-22, 1979, Proceedings, volume 81 of Lecture Notes in Computer Science, 1980. Springer. ISBN 3-540-09763-5.

[3] Shashikala Tapaswi and Ramesh Chandra Joshi. Classification of bio-medical images using neuro fuzzy approach. In Database Systems for Advanced Applications, 9th International Conference, DASFAA 2004, Jeju Island, Korea, March 17-19, 2004, Proceedings, pages 568–581, 2004.

[4] Amandeep Khokher and Rajneesh Talwar. Content-based image retrieval: Feature extraction techniques and applications. IJCA Proceedings on International Conference on Recent Advances and Future Trends in Information Technology (iRAFIT 2012), iRAFIT(3):9–14, April 2012.

[5] Sean Moran. Automatic image tagging. Master's thesis, School of Informatics, University of Edinburgh, 2009.

[6] H. T. Nguyen, Q. Ji, and A. W. M. Smeulders. Spatio-temporal context for robust multitarget tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):52–64, 2007. URL https://ivi.fnwi.uva.nl/isis/publications/2007/NguyenTPAMI2007.

[7] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, November 2004. ISSN 0920-5691. doi: 10.1023/B:VISI.0000029664.99615.94. URL http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94.

[8] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893, INRIA Rhone-Alpes, ZIRST-655, av. de l'Europe, Montbonnot-38334, June 2005. URL http://lear.inrialpes.fr/pubs/2005/DT05.

[9] Chih-Fong Tsai. Bag-of-words representation in image annotation: A review. ISRN Artificial Intelligence, 2012:19, 2012. doi: 10.5402/2012/376804. URL http://dx.doi.org/10.5402/2012/376804.

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[11] Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[12] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. CoRR, abs/1405.3531, 2014.

[13] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[14] Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, T. Cour, Kai Yu, Liangliang Cao, and T. Huang. Large-scale image classification: Fast feature extraction and svm training. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1689–1696, June 2011. doi: 10.1109/CVPR.2011.5995477.

[15] F. Perronnin, Yan Liu, J. Sanchez, and H. Poirier. Large-scale image retrieval with compressed fisher vectors. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3384–3391, June 2010. doi: 10.1109/CVPR.2010.5540009.

[16] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.

[17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575, 2014.

[18] Gil Levi and Tal Hassner. Age and gender classification using convolutional neural networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) workshops, June 2015. URL http://www.openu.ac.il/home/hassner/projects/cnn_agegender.

[19] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. CoRR, abs/1403.6382, 2014. URL http://arxiv.org/abs/1403.6382.

[20] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer International Publishing, 2014. ISBN 978-3-319-10589-5. doi: 10.1007/978-3-319-10590-1_53.

[21] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 2014. ISBN 978-1-4799-5118-5. doi: 10.1109/CVPR.2014.222.

[22] Yunchao Wei, Wei Xia, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. CNN: single-label to multi-label. CoRR, abs/1406.5726, 2014. URL http://arxiv.org/abs/1406.5726.

[23] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes (VOC) Challenge. Int. Journal of Computer Vision, (2), 2010. ISSN 0920-5691. doi: 10.1007/s11263-009-0275-4. URL http://dx.doi.org/10.1007/s11263-009-0275-4.

[24] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.

[25] Eric Brill. Processing natural language without natural language processing. In Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing'03, pages 360–369, Berlin, Heidelberg, 2003. Springer-Verlag. ISBN 3-540-00532-3. URL http://dl.acm.org/citation.cfm?id=1791562.1791607.

[26] Hua-Jun Zeng, Xuanhui Wang, Zheng Chen, Hongjun Lu, and Wei-Ying Ma. Cbc: Clustering based text classification requiring minimal labeled data. In ICDM, pages 443–450. IEEE Computer Society, 2003. ISBN 0-7695-1978-4. URL http://dblp.uni-trier.de/db/conf/icdm/icdm2003.html#ZengWCLM03.

[27] Sergey Karayev, Matthew Trentacoste, Helen Han, Aseem Agarwala, Trevor Darrell, Aaron Hertzmann, and Holger Winnemoeller. Recognizing image style. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, 2014.

[28] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178, 2006. doi: 10.1109/CVPR.2006.68.

[29] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1615–1630, October 2005. ISSN 0162-8828. doi: 10.1109/TPAMI.2005.188. URL http://dx.doi.org/10.1109/TPAMI.2005.188.

[30] Fei-Fei Li and Pietro Perona. A bayesian hierarchical model for learning natural scene categories. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 2, CVPR '05, pages 524–531, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2372-2. doi: 10.1109/CVPR.2005.16. URL http://dx.doi.org/10.1109/CVPR.2005.16.

[31] Florent Perronnin and Christopher R. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR. IEEE Computer Society, 2007.

[32] Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors. Computer Vision – ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV, volume 6314 of Lecture Notes in Computer Science, 2010. Springer. ISBN 978-3-642-15560-4. doi: 10.1007/978-3-642-15561-1. URL http://dx.doi.org/10.1007/978-3-642-15561-1.

[33] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, September 1995. ISSN 0885-6125. doi: 10.1023/A:1022627411411. URL http://dx.doi.org/10.1023/A:1022627411411.

[34] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.

[35] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 1998. ISBN 0132733501.

[36] Yoshua Bengio, Ian J. Goodfellow, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2015. URL http://www.iro.umontreal.ca/~bengioy/dlbook.

[37] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. ISSN 0018-9219.

[38] Matthew D. Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks. CoRR, abs/1301.3557, 2013. URL http://arxiv.org/abs/1301.3557.

[39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.

[40] Jake Bouvrie. Introduction notes on convolutional neural networks. Technical report, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 2006.

[41] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738.

[42] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.

[43] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems 4, pages 950–957. Morgan Kaufmann, 1992.

[44] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[45] A. Vedaldi and B. Fulkerson. Vlfeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.

[46] Jorge Sánchez, Florent Perronnin, and Teófilo de Campos. Modeling the spatial layout of images beyond spatial pyramids. Pattern Recogn. Lett., 33(16):2216–2223, December 2012. ISSN 0167-8655. doi: 10.1016/j.patrec.2012.07.019. URL http://dx.doi.org/10.1016/j.patrec.2012.07.019.