
Deep Representation Learning for Multimedia Data Analysis

Habilitationsschrift (cumulative)

for the attainment of the academic degree Dr. rer. nat. habil.

submitted to the Digital Engineering Fakultät

of the Universität Potsdam

presented by

Dr. rer. nat. Haojin Yang, born on 02.12.1981 in Henan

For my parents, Wei, Yuetong and Yanqing

“I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing on the seashore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.”

– Isaac Newton

“There is only one heroism in the world: to see the world as it is, and to love it.”

– Romain Rolland

Dean: Prof. Dr. Christoph Meinel, Digital Engineering Fakultät (DEF)

Reviewers: Prof. Dr. Christoph Meinel, Universität Potsdam
Prof. Dr. Wolfgang Effelsberg, Universität Mannheim
Prof. Dr. Ralf Steinmetz, Technische Universität Darmstadt

Examination committee: Prof. Dr. Felix Naumann, Informationssysteme, DEF (chair)
Prof. Dr. Christoph Meinel, Internet-Technologien und -Systeme, DEF
Prof. Dr. Wolfgang Effelsberg, Multimedia Systeme, Uni Mannheim
Prof. Dr. Ralf Steinmetz, Multimedia Kommunikation, TU Darmstadt
Prof. Dr. Andreas Polze, Betriebssysteme und Middleware, DEF
Prof. Dr. Robert Hirschfeld, Software-Architekturen, DEF
Prof. Dr. Patrick Baudisch, Human Computer Interaction, DEF
Prof. Dr. Tobias Friedrich, Algorithm Engineering, DEF
Prof. Dr. Erwin Böttinger, Digital Health Center, DEF
Prof. Dr. Manfred Stede, Angewandte Computerlinguistik, MNF

Submission: Wednesday, 10 October 2018

Colloquium: Title: “Deep Representation Learning for Multimedia Data Analysis”
Date: Friday, 10 May 2019, 14:00, HPI-Hörsaal 2

Trial lecture: Title: “A Concise History of Neural Networks”
Date: Friday, 14 June 2019, 09:00, HPI-Hörsaal 3

Abstract

In the last decade, due to the rapid development of digital devices, Internet bandwidth and social networks, an enormous amount of multimedia data has been created on the World Wide Web (WWW). According to publicly available statistics, more than 400 hours of video are uploaded to YouTube every minute [You18], 350 million photos are uploaded to Facebook every day [Fac18], and by 2021 video traffic is expected to make up more than 82% of all consumer Internet traffic [Cis17]. There is thus a pressing need to develop automated technologies for analyzing and indexing this “big multimedia data” more accurately and efficiently. One of the current approaches is deep learning, which is recognized as a particularly effective machine learning method for multimedia data.

Deep learning (DL) is a sub-field of Machine Learning and Artificial Intelligence, and is based on a set of algorithms that attempt to learn representations of data and model their high-level abstractions. Since 2006, DL has attracted more and more attention in both academia and industry. Recently, DL has produced record-breaking results in a broad range of areas, such as beating humans in strategic games like Go (Google’s AlphaGo [SSS+17]), autonomous driving [BDTD+16], and achieving dermatologist-level classification of skin cancer [EKN+17].

In this Habilitationsschrift, I mainly address the following research problems:

Natural scene text detection and recognition with deep learning. In this work, we developed two automatic scene text recognition systems, SceneTextReg [YWBM16] and SEE [BYM18], following a supervised and a semi-supervised processing scheme, respectively. We designed novel neural network architectures and achieved promising results in both recognition accuracy and efficiency.

Deep representation learning for multimodal data. We studied two sub-topics: visual-textual feature fusion for the multimodal and cross-modal document retrieval task [WYM16a], and visual-language feature learning with image captioning as its use case. The developed captioning model robustly generates novel sentence descriptions for a given image in a very efficient way [WYBM16, WYM18].

Towards lower bit-width neural networks. We developed BMXNet, an open-source Binary Neural Network (BNN) implementation based on the well-known deep learning framework Apache MXNet [CLL+15]. We further conducted an extensive study on the training strategy and execution efficiency of BNNs on the image classification task. We derived meaningful scientific insights and made our models and code publicly available; these can serve as a solid foundation for future research.

The operability and accuracy of all proposed methods have been evaluated on publicly available benchmark datasets. While designing and developing the theoretical algorithms, we also explored how to apply them in practical applications. We investigated two use cases, namely automatic online lecture analysis and medical image segmentation. The results demonstrate that such techniques might significantly impact or even disrupt traditional industries and our daily lives.


Acknowledgments

First, I owe a great debt of gratitude to my family for their unconditional support throughout the years of my doctoral studies and habilitation. I would like to thank my parents: even though we are geographically separated, we have always remained close to each other. Especially to my father: you are my paragon and spiritual support forever. Words fail to express my appreciation to my love Wei, my angel Yuetong and my little rider Yanqing. Without you, I could never have finished this work.

Second, I would like to express my sincere gratitude to Professor Christoph Meinel, my supervisor at the Hasso-Plattner-Institute, for his guidance and inspiration. Professor Meinel, I am much indebted to you for sharing your valuable time, and your level of professionalism never ceases to impress. Without your support, I could not have successfully finished my work at HPI.

I would like to thank the colleagues who have contributed to this Habilitationsschrift: Xiaoyin Che, Cheng Wang, Christian Bartz, Mina Rezaei and Joseph Bethge.

Last but not least, I would like to thank Professor Hasso Plattner. You established HPI, an excellent place for research, and your selfless support allows people to chase great dreams. I am thankful for the wonderful time at HPI. The experiences I gained here have changed my life and motivate me to keep moving forward.


Contents

List of Figures

List of Tables

PART I: Introduction and Fundamentals

1 Introduction
   1.1 Motivation and Scope
      1.1.1 Research Questions
      1.1.2 Research Topics
   1.2 Contribution
   1.3 Publication
   1.4 Outline of the Thesis

2 Deep Learning: The Current Highlighting Approach of Artificial Intelligence
   2.1 Data Representation
      2.1.1 Feature Engineering
      2.1.2 Representation Learning
   2.2 Fundamentals
      2.2.1 Artificial Neural Networks
         2.2.1.1 Neural Network 1.0
         2.2.1.2 Neural Network 2.0
      2.2.2 Neural Network 3.0 - Deep Learning Algorithms
         2.2.2.1 Data Preprocessing and Initialization
         2.2.2.2 Batch Normalization
         2.2.2.3 Regularization
         2.2.2.4 Activation Function
         2.2.2.5 Optimization Algorithms
         2.2.2.6 Loss Function
         2.2.2.7 DNN Architectures
         2.2.2.8 Visualization Tool for Network Development
   2.3 Recent Development in the Age of Deep Learning
      2.3.1 Success Factors
      2.3.2 DL Applications
      2.3.3 DL Frameworks
      2.3.4 Current Research Topics
         2.3.4.1 Region-based CNN
         2.3.4.2 Deep Generative Models
         2.3.4.3 Weakly Supervised Model, e.g., Deep Reinforcement Learning
         2.3.4.4 Interpretable Machine Learning Research
         2.3.4.5 Energy Efficient Models for Low-power Devices
         2.3.4.6 Multitask and Multi-module Learning
         2.3.4.7 Capsule Networks
      2.3.5 Current Limitation of DL
      2.3.6 Applicable Scenario of DL

PART II: Selected Publications

3 Assignment to the Research Questions

4 SceneTextReg
   4.1 Contribution to the Work
   4.2 Manuscript

5 SEE
   5.1 Contribution to the Work
   5.2 Manuscript
   5.3 Additional Experimental Results
      5.3.1 Experimental Setup
         5.3.1.1 Localization Network
         5.3.1.2 Recognition Network
         5.3.1.3 Implementation
      5.3.2 Experiments on Robust Reading Datasets

6 Learning Binary Neural Networks with BMXNet
   6.1 Contribution to the Work
   6.2 Manuscript

7 Image Captioner
   7.1 Contribution to the Work
   7.2 Manuscript

8 A Deep Semantic Framework for Multimodal Representation Learning
   8.1 Contribution to the Work
   8.2 Manuscript

9 Automatic Lecture Highlighting
   9.1 Contribution to the Work
   9.2 Manuscript

10 Medical Image Semantic Segmentation
   10.1 Contribution to the Work
   10.2 Manuscript

11 Discussion

12 Conclusion
   12.1 Future Work
   12.2 Some Concerns about DL

PART III: Appendices and References

A Ph.D. Publications

B Publications After Ph.D.

C Deep Learning Applications

References

Acronyms

List of Figures

2.1 The taxonomy of AI, ML and DL

2.2 Scale drives DL progress (source: Andrew Ng's lecture, 2013)

2.3 HOG feature for face verification

2.4 Artificial Neural Networks

2.5 Hierarchical feature learning of DNN (image source: Zeiler and Fergus 2013)

2.6 Visual pathway of the visual cortex (image source: Simon Thorpe)

2.7 Brief history of machine learning

2.8 Perceptron

2.9 Non-linear activation, e.g., the Sigmoid function

2.10 The CNN architecture of LeNet (image source: LeNet-5 [L+15])

2.11 A convolution operation on an image using a 3 × 3 kernel. Each pixel in the output image is the weighted sum of 9 pixels in the input image. (image credit: Tom Herold)

2.12 A pooling operation using a 2 × 2 kernel, stride 2, where max-pooling and average pooling are demonstrated.

2.13 Computational graph of an RNN in folded (left) and unfolded view (right). (image credit: Xiaoyin Che)

2.14 Detailed structure of a “Vanilla” RNN cell. (image credit: Xiaoyin Che)

2.15 Detailed structure of an LSTM cell. (image credit: Xiaoyin Che)

2.16 Computational graph of a Bidirectional RNN. (image credit: Xiaoyin Che)

2.17 (Left) Images generated by our text sample generation tool. (Right) Images taken from the ICDAR dataset of the robust reading challenge. (image credit: Christian Bartz)

2.18 Ground truth image created from computer games. (image credit: [RVRK16])

2.19 Pictorial representation of the concept Dropout

2.20 Activation functions: (top-left) sigmoid function, (bottom-left) derivative of the sigmoid function, (top-middle) tanh function, (bottom-middle) derivative of the tanh function, (top-right) ReLU function, (bottom-right) derivative of the ReLU function.

2.21 Derived linear unit activation functions: (left) PReLU and LReLU activation functions, (right) ELU activation function.

2.22 Architecture of AlexNet

2.23 Structure diagram of VGG-Net

2.24 A stack of three convolution layers with 3×3 kernels and stride 1 has the same active receptive field as a 7×7 convolution layer.

2.25 Structure diagram of GoogLeNet, which emphasizes the so-called “Inception Module”

2.26 Naive version of the “Inception Module”, where the numbers in the figure, e.g., 28×28×128, denote the width×height×depth of the feature maps.

2.27 “Bottleneck” design idea for a convolution layer: preserves width and height, but reduces depth

2.28 “Inception Module” with “bottleneck” layers, where the numbers in the figure, e.g., 28×28×128, denote the width×height×depth of the feature maps.

2.29 Structure diagram of ResNet. Left: a “Residual Module”, right: the overall network design. The different colors indicate blocks with different layer types or different numbers of filters.

2.30 “Residual Module”. Left: initial design of the residual block, right: residual block with “bottleneck” design.

2.31 A comparison of DNN models, which gives an indication for practical applications. Left: top-1 model accuracy on the ImageNet dataset, right: computational complexity comparison. (image credit: Alfredo Canziani [CPC16])

2.32 Exemplary visualization of the VisualBackProp method. (image credit: Mariusz Bojarski [BCC+16])

2.33 Block diagram of the VisualBackProp method. (image credit: Mariusz Bojarski [BCC+16])

2.34 Top-5 errors of essential DL models in the ImageNet challenge

2.35 Activity of DL frameworks. Left: arXiv mentions as of March 3, 2018 (past 3 months); Right: GitHub aggregate activity April - July 2017

2.36 Architecture of Generative Adversarial Networks (image source: Gharakhanian)

2.37 Multitask Learning Example

5.1 Samples from the ICDAR, SVT and IIIT5K datasets that show how well our model finds text regions and is able to follow the slope of the words.

List of Tables

1.1 Comparison of available implementations for binary neural networks. Alongside BMXNet, other implementations are difficult to use for actual applications, because actual model saving and deployment is not possible.

2.1 Classification accuracy on the ImageNet dataset of mainstream CNN architectures. Essential information about the parameters is provided, such as the number of weights and computation operations, etc.

2.2 DL framework features

2.3 DL framework benchmarking. Average time (ms) for 1000 images: ResNet-50 feature extraction (Source: analyticsindiamag.com)

5.1 Recognition accuracies on the ICDAR 2013, SVT and IIIT5K robust reading benchmarks. Here we only report results that do not use per-image lexicons. (*[JSVZ15] is not lexicon-free in the strict sense, as the outputs of the network itself are constrained to a 90k dictionary.)

PART I: Introduction and Fundamentals

1 Introduction

This Habilitationsschrift presents multimedia analysis with deep learning technology as its central theme. In this chapter, I demonstrate the need for and the benefits of applying deep learning methods, and present the research problems we aim to solve in this thesis. Subsequently, I summarize the scientific contributions and conclude the chapter with a structural overview of the thesis.

1.1 Motivation and Scope

In recent research on multimedia analysis and retrieval, multimodal representation learning has become one of the most beneficial methods. This research trend has been established thanks to significant improvements in machine learning technology and to the nature of multimedia data, which consist of multiple modalities that convey the common semantic meaning of information from heterogeneous sources (e.g., visual, textual and auditory content).

Due to the rapid development of digital devices, Internet bandwidth, multimedia portals, and social networks, the amount of multimedia data on the World Wide Web (WWW) has become enormous. More than 400 hours of video are uploaded to YouTube every minute [You18]; 350 million photos are uploaded to Facebook every day [Fac18]; and by 2021, video traffic is expected to make up more than 82% of all consumer Internet traffic [Cis17]. There is therefore a pressing need to develop novel methods for processing this “big multimedia data” more accurately and efficiently, to make it understandable and searchable based on its content. Owing to their high efficiency, machine learning technologies have been widely applied in this domain. There are two benefits of doing so. First, we can find more precise positions of target objects in multimedia content, e.g., searching for a visual target that appears in the scene of an image or a video, or locating a spoken phrase in an audio recording. Moreover, we intend to understand the semantic meaning of image or video content on top of the visual and auditory recognition results, and further to generate natural-sounding language sentences describing that content. The second benefit is that machine learning is a data-driven technology, which makes it highly suitable for processing massive amounts of data; the development of machine learning algorithms can therefore in turn benefit from multimedia data.

Artificial Intelligence (AI) is the intelligence exhibited by computers. The term is applied when a machine mimics “cognitive” functions that humans associate with other human minds, such as “learning” and “problem-solving”. Currently, researchers and developers in this field are working on AI and machine learning algorithms that train computers to mimic human skills such as “reading”, “listening”, “writing” and “decision making”. Some AI applications, such as Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR), have recently become conventional technologies in industry. One of the current machine learning approaches is Deep Learning, which is recognized as a particularly effective representation learning method for multimedia data.

Since 2006, Deep Learning (DL), or Deep Neural Networks, has attracted more and more attention in both academia and industry. DL is a sub-field of Machine Learning and Artificial Intelligence based on a set of algorithms that attempt to learn representations of data and model their high-level abstractions. In a deep neural network, there are multiple so-called neural layers between the input and output. The algorithm is allowed to use those layers to learn higher abstractions, composed of various linear and non-linear transformations. Owing to the rapid growth of computational power and the availability of copious amounts of training data, deep learning has achieved impressive success in many topics such as computer vision, speech recognition, and natural language processing. Recently, DL has also produced record-breaking results in many application areas, e.g., beating humans in strategic games like Go (Google’s AlphaGo [SSS+17]), self-driving cars [BDTD+16], and achieving dermatologist-level classification of skin cancer [EKN+17].

1.1.1 Research Questions

Human learning behavior inspires many research ideas in DL. The human brain can learn meaningful things from diverse contexts and perform several learning tasks at the same time. Unfortunately, current machine learning algorithms can only achieve high performance on individual perception tasks. It is still impossible to build a generic perception model for multiple tasks, for instance a computer vision model that detects and recognizes colors, objects and faces and reads text simultaneously. Moreover, humans learn new skills based on previous experience and knowledge rather than from scratch. Transfer learning and multi-task learning are therefore frequently discussed in the current DL research community.

In our work, we raised the following research questions and conducted research on different topics addressing them, which are discussed in section 1.1.2.

• Q1: DL is data hungry; how can we alleviate the reliance on substantial data annotations?

– through synthetic data?

– through unsupervised and semi-supervised learning methods?

• Q2: How can we perform multiple computer vision tasks with a uniform end-to-end neural network architecture?

• Q3: How can we apply DL models on low-power devices such as smartphones, embedded devices, wearables and IoT (Internet of Things) devices?

• Q4: Can DL models benefit multimodal and cross-modal representation learning tasks?

• Q5: Can we effectively and efficiently apply multimedia analysis and DL algorithms in real-world applications?

1.1.2 Research Topics

We conducted and investigated the following research topics according to the research questions raised in the previous section.

Natural Scene Text Detection and Recognition is one of the essential, still unsolved problems in computer vision, owing to numerous difficulties such as varying contrast and image quality, complicated backgrounds, lighting effects, blending and blurring, geometric distortion, and diverse font styles and sizes. In natural scenes, text can be found on cars, road signs, billboards, etc. Automatically detecting and reading text from natural scene images is a crucial component of systems used for several challenging tasks, such as image-based machine translation, autonomous driving, and image or video indexing.

In our work, we developed two automatic scene text recognition systems, SceneTextReg [YWBM16] and SEE [BYM18], following a supervised and a semi-supervised processing scheme, respectively. To address the lack of training data for the fully supervised approach, we developed a data generator that can efficiently create text image samples. We designed novel neural network architectures and achieved promising results in both recognition accuracy and processing speed.

Towards Lower Bit-width Neural Networks State-of-the-art deep models are computationally expensive and consume large amounts of storage space. At the same time, deep learning is strongly demanded by numerous applications in areas such as mobile platforms, wearable devices, autonomous robots, and IoT devices. How to efficiently apply deep models on such low-power devices has therefore become a challenging research problem. The recently introduced Binary Neural Networks (BNNs) are one possible solution to this problem.

We developed BMXNet, an open-source BNN implementation based on the well-known deep learning framework Apache MXNet [CLL+15]. We conducted an extensive study on the training strategy and execution efficiency of BNNs. We systematically evaluated different network architectures and hyperparameters to provide useful insights on how to train a BNN. Further, we present how we improved classification accuracy by increasing the number of connections through the network. We derived meaningful scientific insights and made our models and code publicly available; these can serve as a solid foundation for future research.

Deep Representation Learning for Multimodal Data Representation learning [BCV13a], or feature learning, is a set of techniques for transforming raw input data into good representations that can adequately support machine learning algorithms. The performance of a machine learning algorithm relies heavily on the quality of the data representation. Representation learning allows a machine to automatically learn good discriminative representations in the context of a specific machine learning task, and makes machine learning methods less dependent on labor-intensive feature engineering.

We can apply representation learning methods to multimodal data. For example, given an image with its textual tags, the learned word representation can be combined with the visual representation to enable the exploration of the shared semantics of the two modalities. Another example is automatic video indexing by combining visual and auditory representations. In multimodal representation learning, a joint representation of two different modalities can be learned via end-to-end neural networks. In our work, we therefore studied two sub-topics: visual-textual feature fusion for the multimodal and cross-modal document retrieval task [WYM16a], and visual-language feature learning with image captioning as its use case [WYBM16, WYM18]. The developed captioning model robustly generates novel sentences describing arbitrary given images in a very efficient way.

We studied the feasibility and practicality of the developed algorithms in two practical use cases: Automatic Online Lecture Analysis and Medical Image Segmentation.

Automatic Online Lecture Analysis In this work, we propose a comprehensive solution for highlighting online lecture videos at different levels of granularity. Our solution is based on the automatic analysis of multimedia lecture materials, such as speeches, transcripts and lecture slides (both in file and video format). The extracted highlighting information can assist learners, especially in the context of MOOCs (Massive Open Online Courses). In comparison with ground truth created by experts, our approach achieves satisfactory precision, which is better than baseline approaches and was also well received in user feedback.

Medical Image Segmentation In this work, we introduced a novel Conditional Refinement Generative Adversarial Network to address the medical image segmentation task. Our approach can solve several common problems in medical image segmentation, such as imbalanced class distributions and varying image dimensions and resolutions. We achieved promising results on three popular medical imaging datasets for the semantic segmentation of abnormal tissues as well as body organs: the BraTS2017 dataset for brain tumor segmentation (MRI images) [20117a], the LiTS2017 dataset for liver cancer segmentation (Computed Tomography (CT) images) [20117b], and the MDA231 microscopic light dataset of human breast carcinoma cells [BE15]. Overall, the achieved results demonstrate the strong generalization ability of the proposed method for the medical image segmentation task.

1.2 Contribution

As mentioned in the previous section, the central theme of this Habilitationsschrift is multimedia data analysis with deep learning algorithms. The scientific contributions and publications can thus be categorized according to the analysis tasks involved and the deep learning frameworks developed. The main contributions of the thesis are as follows:

• I extensively studied the text detection and recognition problem using deep learning technology in several different application contexts.

– First, I developed SceneTextReg (demo: https://youtu.be/fSacIqTrD9I), a real-time scene text recognition system. The system applies deep neural networks in both the text detection and the word recognition stage. I trained the corresponding models in a fully supervised manner. SceneTextReg achieved the same level of word recognition accuracy as Google’s PhotoOCR system [BCNN13a]. It is worth mentioning that Google’s system was trained on millions of real-world samples created by human annotators, whereas SceneTextReg was trained using only synthetic samples generated by our data engine.

– Although the data generator works well for text, the same concept is in most cases hard to port to arbitrary object classes, and it is technically impossible to simulate all scenarios. Therefore, with the long-term vision in mind, unsupervised as well as semi-supervised methods are desirable if we want to scale up the number of supported classes. We intended to develop a semi-supervised system for object detection and recognition; because of the experience accumulated in the past, we again chose text as the first experimental subject. We proposed SEE, a semi-supervised system for end-to-end scene text recognition. This system only applies a weak supervision signal for text detection and achieved state-of-the-art accuracy on many popular open benchmark datasets.

– As a useful application, we developed a new approach for the writer-independent verification of offline signatures. This approach is based on deep metric learning. By comparing triplets of two genuine and one forged signature, the system learns to embed signatures into a high-dimensional space in which the Euclidean distance functions as a metric of their similarity. Our system ranked best in nearly all evaluation metrics of the ICDAR SigWiComp 2013 challenge [MLA+13].

(related publications: [BYM18, BYM17a, YWBM16, RYM16, YWC+15, BQ15])

• Binary neural networks (BNNs) seem to be a promising approach for devices with low computational power. However, none of the existing mainstream deep learning frameworks, including Caffe [JSD+14], Tensorflow [ABC+16], MXNet [CLL+15], PyTorch [PCC+] and Chainer [TOHC15], natively supports such binary neural layers. Existing BNN or quantized NN approaches [RORF16, HCS+16, ZWN+16, LNZ+17, LZP17] show promising results; however, source code for the actual implementations is often not available (see Table 1.1), which makes follow-up research and application development based on BNNs difficult.

Table 1.1: Comparison of available implementations for binary neural networks (columns: GPU, CPU, Python API, C++ API, Save Binary Model, Deploy on Mobile, Open Source, Cross Platform). Alongside BMXNet, other implementations are difficult to use for actual applications, because model saving and deployment are not possible.

BNNs [HCS+16]: X X X
DoReFa-Net [ZWN+16]: X X X X X
XNOR-Net [RORF16]: X X
BMXNet [YFBM17a]: X X X X X X X X

Moreover, architectures, design choices, and hyperparameters are often presented without thorough explanation or experiments. To address these needs, we made the following contributions:

– First, we developed BMXNet [YFBM17a], an open-source BNN implementation based on the well-known deep learning framework Apache MXNet [CLL+15]. We share our code and the developed models for research use, so that both academia and industry can take advantage of them. To our knowledge, BMXNet is the first open-source BNN implementation that supports binary model saving and deployment on Android as well as iOS mobile devices.

– We further focus on increasing our understanding of the training process and making it accessible to everyone. We provide novel empirical evidence for the choice of methods and parameters commonly used to train BNNs, such as how to deal with the bottleneck architecture and the gradient clipping threshold. We found that dense shortcut connections can improve the classification accuracy of BNNs significantly and show how to create robust models with this architecture. We present an overview of the performance of commonly used network architectures with binary weights.

(related publications: [YFBM17a, BYBM18, BYBM19])


• The contributions to multimodal representation learning are summarized as follows:

– In visual-textual multimodal representation learning, we propose a deep semantic framework for mapping visual and textual features to a common feature space. By imposing supervised pre-training as a regularizer, we can better capture intra- and inter-modal relationships. For multimodal fusion, we show that combining visual and textual features achieves better performance than unimodal features. In our experiments, we used two mainstream datasets, the Wikipedia dataset [RCPC+10] and MIR Flickr 25K [HL08], and achieved state-of-the-art results on both cross-modal and multimodal retrieval tasks.

(related publications: [WYM16a, WYM15b, WYM15a])

– In multimodal video-representation learning, we explored the fusion of video appearance, motion, and auditory information to learn discriminative video representations. Our experimental results show that fusing spatial, temporal (motion) and auditory information can boost recognition performance with appropriate fusion strategies. Our approach achieved highly competitive performance compared to previous methods.

(related publication: [WYM16b])

– In visual-language representation learning, we developed an end-to-end trainable deep Bidirectional Long Short-Term Memory (BLSTM) network to capture the relationships between a visual input image and language sequences. The effectiveness and generalization ability of the proposed system have been evaluated on multiple benchmark datasets, including Flickr8K [RYHH10], Flickr30K [YLHH14], MSCOCO [LMB+14], and Pascal1K [RYHH10]. The experimental results show that our models outperformed related work in both the image captioning and the image-sentence retrieval task. Furthermore, we conducted a transfer-learning experiment on the Pascal1K dataset; the result demonstrates that even without using the training data from Pascal1K, our model still achieved the best performance on both tasks. We developed a real-time captioning system for demonstration purposes, called Neural Visual Translator (demo: https://youtu.be/a0bh9_2LE24).

(related publications: [WYBM16, WYM18])

• As mentioned in the previous sections, I extensively studied two application use cases; the corresponding contributions are summarized as follows:

– In automatic online lecture analysis, we propose a comprehensive solution for highlighting online lecture videos at both the lecture segment and the transcript sentence level. Our solution is based on the automatic analysis of multimedia lecture materials, such as speeches, transcripts and lecture slides (both in file and video format). The extracted highlighting information can assist learners, especially in the context of MOOCs.

For sentence-level lecture highlighting based on the audio and subtitles of MOOC videos, we achieved a precision of over 60% in comparison with ground truth created by experts. This is far better than the baseline work and was also well received in user feedback. Segment-level lecture highlighting relies on statistical analysis, mainly exploring speech transcripts, lecture slides, and their correlations. With ground truth created by a large number of users, an evaluation shows that the general accuracy can reach 70%, which is reasonably promising. Finally, we conducted and report a correlation study of the two types of lecture highlights.

(related publications: [CYM18, CLYM16, CYM15, CYM13])

– In medical image segmentation, we introduced a novel Conditional Refinement Generative Adversarial Network to address the medical image segmentation task. We studied the effects of several crucial architectural choices for the semantic segmentation task on medical imaging. We introduce a patient-wise mini-batch normalization technique that helps to accelerate the learning process and improve the accuracy.

We achieved promising results on three well-known medical imaging datasets for the semantic segmentation of abnormal tissues as well as body organs: the BraTS2017 dataset for brain tumor segmentation (MRI images) [20117a], the LiTS2017 dataset for liver cancer segmentation (Computed Tomography (CT) images) [20117b], and the MDA231 microscopic light dataset of human breast carcinoma cells [BE15].

(related publications: [RYM19a, RYM18, RYM19b])

• The recently proposed Generative Adversarial Networks (GANs) [GPAM+14] achieved state-of-the-art results on a large variety of unsupervised learning tasks, such as image generation, audio synthesis, and human language generation. However, GANs still have several significant shortcomings, such as missing modes from the data distribution or even collapsing large amounts of probability mass onto some modes. We extensively studied the mode-collapse problem and proposed to incorporate adversarial dropout in generative multi-adversarial networks. Our approach forces the generator not to constrain its output to satisfy a single discriminator, but instead to satisfy a dynamic ensemble of discriminators. We show that this approach leads to a more generalized generator, promoting variety in the generated samples and avoiding the mode-collapse problem commonly experienced with GANs. We provide evidence that the proposed solution promotes sample diversity on five different datasets, mitigates mode collapse and further stabilizes training.

(related publications: [MYM19a, MYM19b])

1.3 Publication

Earlier versions of several parts of this thesis have been published in international journals and presented at international scientific conferences.

According to the requirements for a cumulative Habilitationsschrift at the University of Potsdam, I prepared two publication lists, one for the time of my Ph.D. study and one for the time after my Ph.D.:

• Publications during my Ph.D. study (14): Appendix A

• Publications after my Ph.D. (40+): Appendix B

The following list of selected publications from Appendix B, assigned to the corresponding research questions (defined in section 1.1.1), forms the basis of this cumulative Habilitationsschrift:

• Q1, Q2:

– “SceneTextReg: A Real-Time Video OCR System” [YWBM16]

– “SEE: Towards Semi-Supervised End-to-End Scene Text Recognition” [BYM18]

• Q3:

– “BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet” [YFBM17b]

– “Learning to Train a Binary Neural Network” [BYBM18]

– “Back to Simplicity: How to Train Accurate BNNs from Scratch?” [BYBM19]

• Q4:

– “Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning” [WYM18]

– “A Deep Semantic Framework for Multimodal Representation Learning” [WYM16a]

• Q5:

– “Automatic Online Lecture Highlighting Based on Multimedia Analysis” [CYM18]

– “Recurrent Generative Adversarial Network for Learning Imbalanced Medical Image Semantic Segmentation” [RYM19b]


1.4 Outline of the Thesis

The thesis is organized in the following manner.

Many commonly used techniques in DL were proposed in the past two to three years and are updated rapidly. They are relatively new to readers without sufficient DL knowledge, and thorough explanations of these techniques are usually not provided in the scientific publications selected for Part II of the thesis. To improve the reader's understanding, I therefore wrote a “foundations” chapter (Chapter 2), which presents the fundamentals of DL techniques as well as a comprehensive review of recent efforts, developments and current limitations of DL technologies.

Chapters 4 to 10 present the publications selected for this cumulative Habilitationsschrift (see the paper list in section 1.3). For each selected paper, I provide an overview as well as a clarification of my own contribution.

Chapter 11 discusses the achievements and current limitations of the approaches presented in this thesis. Chapter 12 concludes the thesis and provides a comprehensive outlook on future work, followed by the appendices and references.

2 Deep Learning: The Current Highlighting Approach of Artificial Intelligence

Machine Learning (ML), a subset of Artificial Intelligence (AI), has been revolutionizing various application fields since the 1950s. Artificial Neural Networks (ANNs) are a subfield of ML from which Deep Learning (DL) emerged; DL is in turn considered a subfield of representation learning (cf. Figure 2.1).

Deep learning has demonstrated enormous success in a large variety of applications since AlexNet, a deep neural network architecture proposed by Krizhevsky et al. [KSH12], won the ImageNet challenge [DDS+09]. This new research field of machine learning has been growing rapidly and has opened new opportunities in AI research. Different models have been proposed for the different classes of learning approaches, including supervised, semi-supervised, unsupervised and deep reinforcement learning. In most cases, the experimental results show state-of-the-art performance of deep learning over traditional machine learning methods in the fields of Speech Recognition, Machine Translation, Computer Vision, Image and Video Processing, Medical Imaging, Robotics, Natural Language Processing (NLP) and many others. Meanwhile, DL is impacting many industrial products, e.g., autonomous driving, digital assistants, and digital health. The success of deep learning has opened the current wave of AI.


Figure 2.1: The taxonomy of AI, ML and DL

One of the most crucial advantages of DL is the ability to hierarchically learn features from large-scale data. If we say ML is a data-driven technique, then we can also say that scale is driving DL progress. Figure 2.2 compares deep neural networks of different sizes with traditional ML algorithms. From this we can see that one of the most significant benefits of DL models is their superior fitting and generalization ability: their performance can be significantly improved by adding new training data and appropriately increasing the network complexity. As the amount of data increases, the performance of traditional machine learning approaches plateaus. In contrast, the performance of deep learning models keeps increasing with the amount of data.

2.1 Data Representation

2.1.1 Feature Engineering

In traditional ML, Feature Engineering is fundamental: it is the process of using domain knowledge of the data to create features that make ML algorithms work. Creating handcrafted features is time-consuming and costly, and expert knowledge is required. Researchers have therefore explored the feasibility of using algorithms such as artificial neural networks to perform automated feature learning.

Figure 2.2: Scale drives DL progress (source: Andrew Ng's lecture, 2013)

In a traditional ML approach, given a new problem, we often perform the following working steps:

• Data preparation: create a labeled dataset, e.g., CIFAR [KH09] or ImageNet [DDS+09] for image classification

• Spend hours hand-engineering representative features, e.g., HOG [DT05], SIFT [Low04], LBP [OPM02], or bag-of-words [MSC+13a], which are fed into an ML algorithm

• Evaluate different ML algorithms, e.g., SVM [BGV92] or Random Forest [LW+02]

• Repeat the feature engineering and evaluation steps and pick the best configuration for the application

Figure 2.3 shows how we use the Histogram of Oriented Gradients (HOG), a traditional feature engineering method, for face verification. We first take a candidate face region image as the input. We then apply an edge filter to create the gradient map of the input image. Based on the gradient magnitudes in the horizontal and vertical directions, we can further calculate the gradient direction at each image pixel. We then use a histogram to compute the statistics of the gradient directions in local regions. This histogram of gradient directions, the so-called HOG feature, is fed into an ML classifier such as an SVM to distinguish the class categories “face” and “non-face”.


Figure 2.3: HOG feature for face verification
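As a minimal illustration of this pipeline (a sketch only, not the exact setup behind Figure 2.3), the following Python snippet extracts HOG features with scikit-image and feeds them into a linear SVM from scikit-learn; the image crops, their size and all parameter values are assumptions made for the example.

# Minimal HOG + linear SVM sketch for face vs. non-face classification.
# Assumptions: `face_crops` and `background_crops` are lists of grayscale
# 64x64 numpy arrays prepared elsewhere; parameters are illustrative only.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def extract_hog(image):
    # 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks -> one feature vector
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def train_face_classifier(face_crops, background_crops):
    X = np.array([extract_hog(img) for img in face_crops + background_crops])
    y = np.array([1] * len(face_crops) + [0] * len(background_crops))
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LinearSVC(C=1.0)          # the "ML classifier such as an SVM" step
    clf.fit(X_tr, y_tr)             # learn "face" vs. "non-face" from HOG features
    print("held-out accuracy:", clf.score(X_te, y_te))
    return clf

In practice, such a classifier would then be applied to many candidate regions of a larger image, e.g., in a sliding-window fashion.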

2.1.2 Representation Learning

DL or Deep Neural Networks (DNNs), on the other hand, consist of an input layer, an output layer, and several hidden layers in between (cf. Figure 2.4). A DNN allows many stages of non-linear information processing units (hidden layers) with hierarchical architectures that are exploited for feature learning and pattern classification [Sch15, LBH15]. All the weights in a DNN can be updated using the Backpropagation algorithm [RHW86a].

This automated feature learning process is referred to as representation learning. Bengio et al. define representation learning in [BCV13b]: “Learning method based on representations of data can be defined as representation learning.”

Figure 2.4: Artificial Neural Networks

Recent literature defines DL-based representation learning as involving a hierarchy of features or concepts, where the high-level concepts are defined from the low-level ones. Figure 2.5 visualizes learned features extracted from different levels of a DNN model: the low-level features are edges, corners, and gradients from different input color channels; the features from mid-level layers are feature groups showing parts of objects; finally, the high-level features depict more complete objects such as faces, wheels, and bodies.

Figure 2.5: Hierarchical feature learning of DNN (image source: Zeiler and Fergus 2013)
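This feature hierarchy can also be inspected empirically. The sketch below (illustrative only; it is not the tool used to produce Figure 2.5) registers forward hooks on a pretrained ResNet-18 from torchvision and records the activations of an early and a late stage for one input tensor; the chosen layers and the random input are assumptions made for the example.

# Record intermediate activations of a pretrained CNN to inspect the
# low-level vs. high-level feature hierarchy. Illustrative sketch only.
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# "layer1" captures early (edge/corner-like) features,
# "layer4" captures late, more object-like features.
model.layer1.register_forward_hook(save_activation("low_level"))
model.layer4.register_forward_hook(save_activation("high_level"))

x = torch.randn(1, 3, 224, 224)     # placeholder for a normalized input image
with torch.no_grad():
    model(x)

for name, feat in activations.items():
    print(name, tuple(feat.shape))  # e.g. low_level (1, 64, 56, 56), high_level (1, 512, 7, 7)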

The bionic evidence for this kind of hierarchical structure was found by Gallant and Van Essen et al. [FV91]: the mammalian visual cortex is hierarchical. As shown in Figure 2.6, the ventral (recognition) pathway in the visual cortex has multiple stages, Retina - LGN - V1 - V2 - V4 - PIT - AIT ..., which contain many intermediate representations of the visual signal, e.g., simple visual forms such as edges and corners at V1, intermediate visual forms such as feature groups at V4, and high-level object descriptions such as faces at AIT. This study successfully revealed the essential bionic significance of DNNs and shows that with DNNs we can learn higher abstractions (high-level features) in the late-stage layers, which demonstrate superior performance in a wide range of application domains.

In some articles, researchers describe DL as a universal learning approach that can solve almost all kinds of problems in different application domains, showing that DL is not task-specific [B+09].


Figure 2.6: Visual pathway of visual cortex (image source: Simon Thorpe)

2.2 Fundamentals

According to the Deep Learning book [GBCB16], DL approaches can be divided into three categories: supervised, semi-supervised and unsupervised. Moreover, there is another subfield of learning approaches called Deep Reinforcement Learning (DRL), which is often discussed within the scope of semi-supervised or weakly supervised learning methods.

Supervised Learning The methods in this category have a common characteristic: learning with fully labeled data, i.e., each sample xt from a dataset X has a corresponding label yt. Thus the environment provides a set of inputs and their corresponding outputs, (xt, yt) ∼ ρ, and a DL model is trained using the dataset X. For instance, if for an input xt the model predicts ŷt = f(xt), then the model receives a loss value L(yt, ŷt). The training algorithm iteratively updates the model parameters for a better approximation of the desired outputs, and the training process stops when we are satisfied with the model outputs. After successful training, the model can make correct predictions for given inputs.

ter successful training, the model can make correct predictions of given inputs.

There are different supervised learning approaches for DL including Convolu-

tional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) including

Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU). CNNs

and LSTM, as the key techniques applied in our work, are thoroughly described

in section 2.2.1.2, respectively.

Unsupervised Learning Unsupervised learning methods can roughly be summarized as learning approaches without labels. In this category, the learning algorithms learn internal representations and essential features to discover unknown relationships or structures within the input dataset. Typically, clustering, generative models, and dimensionality reduction methods are considered unsupervised learning approaches. In the development of DL, some unsupervised learning methods have achieved great success, including Auto-Encoders (AE) [VLBM08], Restricted Boltzmann Machines (RBM) [HS06], and the recently proposed Generative Adversarial Networks (GANs) [GPAM+14]. Section 2.3.4.2 gives a brief overview of GANs. Moreover, LSTM and RL are also applied for unsupervised learning in many application use cases.

Semi-supervised Learning, Deep Reinforcement Learning A learning approach with partially labeled datasets or using weakly supervised signals is considered a semi-supervised or weakly supervised method. In recent DL research, Deep Reinforcement Learning is one of the typical semi-supervised learning techniques.

For a given task, an RL system is set within the task-specific environment and executes a set of actions. The consequence of each action on the environment is measured, and a reward value is calculated. By maximizing the reward, the system finds a series of actions that are most effective towards achieving the specified task. This way of working is different from supervised learning and the other kinds of learning approaches studied before, including traditional statistical ML methods and ANNs.

We can apply RL in different fields, such as decision making in the fundamental sciences and ML in computer science. Furthermore, the reward strategy has been widely studied in engineering and mathematics, robotics control, power station control, etc., over the last couple of decades.

Deep Reinforcement Learning (DRL) was first practiced in 2013 in Google DeepMind's paper [MKS+13]. Since then, DRL has achieved great success in mastering strategy games, including AlphaGo and AlphaGo Zero for the game of Go [SSS+17], and other games such as Atari [MKS+15], Dota 2 [Blo18], chess and shogi [SHS+17].

Figure 2.7: Brief history of machine learning

Mathematically, let (xt) ∼ ρ denote the training samples; the intelligent agent gives the prediction ŷt = f(xt). The agent then receives a loss value ct ∼ P(ct | xt, ŷt), where P is an unknown probability distribution: the environment asks the agent a question and returns a noisy score for the agent's answer. The fundamental differences between RL and supervised learning are as follows. First, there is no straightforward loss function defined; in other words, we do not have full knowledge of the objective function being optimized and have to query it through interaction to get feedback. Second, we are interacting with a state-based environment: the actual input xt depends on previous actions.
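To make the interaction loop concrete, the following sketch implements tabular Q-learning on a toy five-state chain environment; the environment, rewards and hyperparameters are invented for the example and are unrelated to the systems discussed later. The agent never evaluates a loss function directly: it only observes the rewards returned by the environment, and its next input depends on its previous action.

# Tabular Q-learning on a toy 5-state chain: moving "right" from the last
# state yields a reward, everything else yields 0. Illustrative sketch only.
import random

N_STATES, ACTIONS = 5, (0, 1)        # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    """Environment: returns (next_state, reward). Reward only at the right end."""
    if action == 1 and state == N_STATES - 1:
        return 0, 1.0                # reached the goal, restart the episode
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    return next_state, 0.0

state = 0
for _ in range(5000):
    # epsilon-greedy action selection: explore sometimes, exploit otherwise
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = 0 if Q[state][0] >= Q[state][1] else 1
    next_state, reward = step(state, action)
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

print("learned Q-values:", [[round(q, 2) for q in row] for row in Q])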

LeCun et al. [LBH15] pointed out that DRL is one of the most promising directions of DL research.

2.2.1 Artificial Neural Networks

In this section, I introduce the fundamental technologies of Artificial Neural Networks according to their historical development timeline.

Figure 2.7 depicts a brief historical timeline of ML, from which we can categorize the development of ANNs into three time phases, discussed in the subsequent sections. A more comprehensive description of ANNs can be found in the literature [BB+95, GBCB16].


Figure 2.8: Perceptron

2.2.1.1 Neural Network 1.0

McCulloch and Pitts (1943) showed that neurons can be combined to construct a Turing machine using ANDs, ORs, and NOTs [MP43]. Inspired by McCulloch and Pitts, Rosenblatt invented the Perceptron [Ros58] in 1958. He showed that the perceptron learning algorithm converges if the target concept can be represented, i.e., if the data are linearly separable. The perceptron outputs a binary result y based on a linear combination of weighted inputs and a threshold θ:

y = \begin{cases} 1 & \text{if } \sum_i w_i x_i + b > \theta \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)

where w_i, x_i and b denote the weight parameters, the inputs, and a bias term, respectively.
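As a small worked example of Equation 2.1 (an illustration only, with hand-chosen hyperparameters), the snippet below implements the thresholded perceptron and trains it with the classical perceptron learning rule on the linearly separable AND function; the XOR function discussed next cannot be learned in this way.

# Perceptron of Equation 2.1 with the classical perceptron learning rule.
# Trained on the linearly separable AND function; illustrative sketch only.
import numpy as np

def perceptron_output(w, b, x, theta=0.0):
    return 1 if np.dot(w, x) + b > theta else 0   # Equation 2.1

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])                    # AND is linearly separable

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(20):                               # a few passes over the data
    for x, y in zip(X, y_and):
        error = y - perceptron_output(w, b, x)
        w += lr * error * x                       # perceptron update rule
        b += lr * error

print([perceptron_output(w, b, x) for x in X])    # -> [0, 0, 0, 1]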

In 1969, Minsky and Papert [MP69] showed the limitations of the Perceptron: it is not able to learn the XOR function, which is not linearly separable. This result effectively halted neural network research for a decade.

2.2.1.2 Neural Network 2.0

In the 1980s, the second wave of AI research emerged and built up the foundations employed today in the DL community. Many essential techniques were proposed during this period.

First, the most significant limitation of the Perceptron, its inability to solve non-linear problems, was overcome by adding a non-linear activation function to the neural layer. Figure 2.9 depicts a commonly used non-linear activation function, the sigmoid function, which is applied on top of the linear combination of the weighted inputs.

Figure 2.9: Non-linear activation, e.g., the Sigmoid function

Hornik et al. [HSW89] and Cybenko [Cyb89] proved that multilayer feedforward networks are universal approximators. This means that even a very simple ANN with only a single hidden layer followed by a sigmoid activation function can approximate any function from one finite-dimensional space to another with any desired accuracy, provided that the number of hidden neurons is large enough.

2.2.1.2.1 Backpropagation (BP) By adding a non-linear activation function to each layer, the Multi-Layer Perceptron (MLP) was obtained. As the name suggests, an MLP consists of multiple Perceptrons arranged in layers. However, how to train an MLP effectively remained a challenging problem until the BP algorithm was proposed by Rumelhart, Hinton et al. [RHW86b]. Algorithm 1 shows the pseudo code of basic BP. An MLP can easily be represented as a computation graph; we can then apply the chain rule to efficiently propagate the gradient from the network output back to the earlier layers, as shown in Algorithm 1 for a single-path network.

We can define an L-layer ANN as a composite function:

$y = f(x) = \varphi(w_L \cdots \varphi(w_2\,\varphi(w_1 x + b_1) + b_2) \cdots + b_L)$    (2.2)

If $L = 2$ in Equation 2.2, we can rewrite the function as

$y = f(x) = f(g(x))$    (2.3)

and, according to the chain rule, its derivative is

$\frac{\partial y}{\partial x} = \frac{\partial f(x)}{\partial x} = f'(g(x)) \cdot g'(x)$    (2.4)


Algorithm 1 Backpropagation
Input: a network with $l$ layers, activation functions $\sigma_l$, the hidden layer outputs
    $h_l = \sigma_l(W_l^T h_{l-1} + b_l)$
and the network output $\hat{y} = h_l$
Calculate the output gradient:
    $\delta \leftarrow \partial\varepsilon(\hat{y}, y) / \partial\hat{y}$
for $i \leftarrow l$ to $0$ do
    Calculate the gradient w.r.t. the weights of the current layer:
        $\partial\varepsilon(\hat{y}, y)/\partial W_l = (\partial\varepsilon(\hat{y}, y)/\partial h_l)(\partial h_l/\partial W_l) = \delta\,\partial h_l/\partial W_l$
    Calculate the gradient w.r.t. the bias of the current layer:
        $\partial\varepsilon(\hat{y}, y)/\partial b_l = (\partial\varepsilon(\hat{y}, y)/\partial h_l)(\partial h_l/\partial b_l) = \delta\,\partial h_l/\partial b_l$
    Apply SGD using $\partial\varepsilon(\hat{y}, y)/\partial W_l$ and $\partial\varepsilon(\hat{y}, y)/\partial b_l$
    Backpropagate the gradient to the previous layer:
        $\delta \leftarrow (\partial\varepsilon(\hat{y}, y)/\partial h_l)(\partial h_l/\partial h_{l-1}) = \delta\,\partial h_l/\partial h_{l-1}$
end
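To make Algorithm 1 concrete, the following minimal NumPy sketch trains a two-layer MLP with sigmoid activations on a toy task; the layer sizes, learning rate and squared-error loss are illustrative choices, not prescribed by the algorithm above:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 3))                        # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)    # toy targets

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = rng.standard_normal((3, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)) * 0.1, np.zeros(1)
lr = 0.5

for epoch in range(200):
    # forward pass: h1 = sigma(X W1 + b1), y_hat = sigma(h1 W2 + b2)
    h1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h1 @ W2 + b2)
    # backward pass (chain rule), starting from the squared-error gradient at the output
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)    # gradient at the output layer
    dW2, db2 = h1.T @ delta2, delta2.sum(axis=0)
    delta1 = (delta2 @ W2.T) * h1 * (1 - h1)      # backpropagate to the hidden layer
    dW1, db1 = X.T @ delta1, delta1.sum(axis=0)
    # SGD step with the gradients w.r.t. weights and biases
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1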


2.2.1.2.2 Stochastic Gradient Descent (SGD) In Algorithm 1, we can see that the model optimization is based on the SGD algorithm, which is a stochastic approximation of gradient descent optimization and an iterative method for minimizing a cost function. In DL, due to the large size of the datasets, SGD usually means mini-batch SGD: it performs an update for every mini-batch of $n$ training examples, which reduces the variance of the parameter updates. Algorithm 2 explains SGD in detail.

Algorithm 2 Stochastic Gradient Descent (SGD)
Input: cost function $\varepsilon$, learning rate $\eta$, a dataset $\{X, y\}$ and the model $\hat{y} = z(\theta, x)$
Output: the optimum $\theta$ which minimizes $\varepsilon$
repeat until convergence:
    shuffle $\{X, y\}$
    for each mini-batch $(x_i, y_i)$ of size $N$ in $\{X, y\}$ do
        $\hat{y}_i = z(\theta, x_i)$
        $\theta := \theta - \eta \cdot \frac{1}{N}\sum_{i=1}^{N} \partial\varepsilon(\hat{y}_i, y_i)/\partial\theta$
    end
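A minimal sketch of Algorithm 2 for a generic differentiable model in NumPy might look as follows; grad_fn, the mini-batch size and the fixed number of epochs are placeholders chosen for illustration:

import numpy as np

def sgd(theta, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    # grad_fn(theta, X_batch, y_batch) is assumed to return the averaged
    # gradient of the cost over the mini-batch
    n = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(n)               # shuffle {X, y}
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[idx], y[idx])
    return theta

# usage sketch for linear regression with a squared-error cost
grad_linreg = lambda th, Xb, yb: 2.0 * Xb.T @ (Xb @ th - yb) / len(yb)
X = np.random.randn(100, 3); y = X @ np.array([1.0, -2.0, 0.5])
theta = sgd(np.zeros(3), X, y, grad_linreg)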

Two epoch-making research works, Convolutional Neural Networks (CNN) [LBBH98] and Long Short-Term Memory (LSTM) [HS97], were published during this period. Since the two approaches are also the cornerstones of our recent work, I present their fundamental principles in the following paragraphs.

2.2.1.2.3 Convolutional Neural Networks (CNNs) Fukushima first proposed CNNs in 1988 [Fuk88]. However, due to hardware limitations, CNNs only became popular in the research community after LeCun et al. obtained successful results for handwritten digit classification [LBBH98].

CNNs are more similar to the human vision system, being efficient at learning

hierarchical abstractions of visual features (cf. section 2.1.2). The pooling layer of

CNNs is effective in absorbing shape variations and reducing the feature dimen-

sions. Furthermore, CNNs use small receptive fields, which results in sparse connectivity: instead of performing a matrix multiplication with the entire input image at once, a convolution with small kernels is applied to the input. This is the reason why CNNs have significantly fewer parameters than fully-connected layers, and why they possess the translation-invariance property. Moreover, CNNs are trained with gradient-based learning algorithms such as SGD, and suffer less from the vanishing gradient problem when activation functions such as the Rectified Linear Unit (ReLU) [NH10] are used.

Figure 2.10: The CNN architecture of Lenet (image source: Lenet5 [L+15])

Figure 2.10 shows the architecture of Lenet [L+15], developed by LeCun et al. in 1998 for handwritten digit recognition. Lenet is the cornerstone of current CNN architectures and consists of feature extractors and a classifier network.

Each layer of the network receives the output from its adjacent previous layer

as its input and passes its output as the input to the next layer. The feature

extractors, in turn, consist of convolution and pooling (subsampling) layers, which

are placed in the low and middle-level of the network.

Generally, as the features propagate from lower-level to higher-level layers, the dimensions of the feature maps are reduced progressively, while the number of feature maps is usually increased in order to maintain the capacity of the feature representation. In this way, we obtain higher feature abstractions through this layer-wise structure without losing information density. The outputs of the last pooling layer are fed into a fully-connected (fc) network, which is called the classification layer. Feed-forward NNs are used as the classification layer in many earlier network architectures. However, because fc-layers are expensive in terms of network parameters, researchers tend to apply techniques such as average-pooling and global average-pooling [HZRS16] as an alternative to fc-networks. The score of each class is calculated using a softmax layer, and the classifier outputs the class with the highest score.


Figure 2.11: A convolution operation on an image using a 3 × 3 kernel. Each pixel in

the output image is the weighted sum of 9 pixels in the input image. (image credit: Tom

Herold)

More formally, convolution layers are expressed as:

$Y^j(r) = \sigma\left(\sum_i K^{ij}(r) * X^i(r) + b^j(r)\right)$    (2.5)

where $X^i$ and $Y^j$ are the $i$-th input and $j$-th output map, respectively. $K^{ij}$ denotes the convolution kernel applied to the input maps and $*$ denotes the convolution operation. $\sigma$ is a non-linear activation function, e.g., ReLU. $b^j$ is the bias term of the $j$-th output map, and $r$ indicates the local region over which the convolution is performed.

Figure 2.11 shows the convolution of a single pixel in the output map $Y^j$. For example, a commonly used convolution kernel has a square size of 3×3; for each pixel in the output map, the surrounding 3×3 = 9 pixels are involved in the convolution operation. A convolution layer usually has a set of kernels with a pre-defined size, whose parameters are initialized randomly. This enhances the richness of the features, and each kernel corresponds to one of the desired output feature maps. Other common hyperparameters include stride and padding. The former defines how many pixels the kernel moves on the input feature map; the latter specifies how to handle convolutions along the edge of the input, where the kernel would need to include pixels from outside the image.

We use each kernel to convolve all possible positions of the input; this property is called parameter sharing. Instead of learning parameters for every node of the input, as a fully-connected layer does, a convolution layer only learns a set of $k_w \times k_h \times c \times o$ parameters, where $k_w$ and $k_h$ denote the width and height of the convolution kernel, $c$ the number of input channels, and $o$ the number of output feature maps. A non-

linear activation function is applied to the output of a convolutional layer. The

parameters of the kernels are learned during the training process by using the BP

algorithm. A CNN layer in an early stage learns to detect primitive features such as edges and corners, and higher feature abstractions are obtained through this layer-wise architecture.
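The following NumPy sketch implements Equation 2.5 for a single output map with stride 1 and no padding; the kernel size and the choice of ReLU as activation are illustrative, and real frameworks use heavily optimized implementations of the same operation:

import numpy as np

def conv2d_single(X, K, b=0.0):
    # X: input maps of shape (C, H, W); K: kernels of shape (C, kh, kw)
    C, H, W = X.shape
    _, kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # weighted sum over a local region r across all input channels
            out[i, j] = np.sum(X[:, i:i + kh, j:j + kw] * K) + b
    return np.maximum(out, 0.0)   # non-linear activation, here ReLU

X = np.random.randn(3, 8, 8)      # e.g. an RGB patch
K = np.random.randn(3, 3, 3) * 0.1
Y = conv2d_single(X, K)           # one 6x6 output feature map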

Figure 2.12: A pooling operation using a 2×2 kernel, stride 2, where max-pooling and average pooling are demonstrated.

In Lenet, a convolutional layer is typically followed by a pooling layer, also called a subsampling or downsampling layer (cf. Figure 2.10). Pooling layers are used to reduce the spatial dimensions (width and height) of the input data, while the number of input and output feature maps does not change: if the input has $N$ channels, then exactly $N$ output feature maps are created. If a 2×2 downsampling kernel with stride two is used, then each spatial dimension of the output map will be half of the corresponding input dimension. The commonly used pooling methods are average pooling and max-pooling, as shown in Figure 2.12. Average pooling computes the mean value over each patch of the input map, whereas max-pooling only outputs the maximum value within its neighborhood.
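As a small illustration of the 2×2, stride-2 pooling described above (the input sizes are assumed to be even; this is a sketch, not a library routine):

import numpy as np

def pool2x2(X, mode="max"):
    # X: a single feature map of shape (H, W) with even H and W
    H, W = X.shape
    patches = X.reshape(H // 2, 2, W // 2, 2)
    return patches.max(axis=(1, 3)) if mode == "max" else patches.mean(axis=(1, 3))

X = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(X, "max"))   # 2x2 output, halved spatial dimensions
print(pool2x2(X, "avg"))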

2.2.1.2.4 Recurrent Neural Networks (RNNs) Human thought has persistence: we do not discard what we were just thinking and start thinking from scratch every second. For example, when we read an article, we understand each sentence and paragraph based on our understanding of the previous sentences and paragraphs. This indicates that our comprehension depends significantly on sequential context information. Traditional feed-forward neural networks, however, cannot deal with such problems. Therefore, the idea of Recurrent Neural Networks (RNNs) was developed in the 1980s. Figure 2.13 shows the structure of a basic RNN in both folded and unfolded view.

Figure 2.13: Computational graph of a RNN in folded (left) and unfolded view (right). (image credit: Xiaoyin Che)

Unlike CNNs, RNNs consist of layers with recurrent connections that form a feedback loop to the previous state. In unrolled form, an RNN consists of several internal time steps, and the output of the previous step becomes part of the input of the next step. This allows RNNs to maintain internal states, also called hidden states. The parameters of the RNN layer are shared across all time steps.

In practice, RNNs are very well suited for processing sequential data of arbitrary length, such as videos (frame sequences) and texts (word sequences). Several different modelling schemes have been developed for RNNs, including many-to-many (language modelling, encoder-decoder models), many-to-one (sentiment analysis) and one-to-many (image captioning).

Figure 2.14: Detailed structure of a “Vanilla” RNN cell. (image credit: Xiaoyin Che)

Figure 2.14 depicts the detailed structure of a “Vanilla” RNN cell. Mathematically, the internal state $h_t$ of the current step can be computed as follows (for simplicity, the bias term $b_t$ is omitted):

$h_t = \tanh(U x_t + W h_{t-1})$    (2.6)

where $W$ and $U$ are the weight matrices for the hidden state of the previous time step $h_{t-1}$ and the current input $x_t$, respectively. The prediction $\hat{y}_t$ is calculated as:

$\hat{y}_t = \mathrm{softmax}(V h_t)$    (2.7)

i.e., a probability distribution is obtained by applying the softmax function on top of the output. Once the forward propagation over the whole sequence is finished, the cross-entropy loss for each time step $t$ is calculated, and the total loss is the sum of the losses over all steps.
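A minimal NumPy sketch of the forward pass defined by Equations 2.6 and 2.7 over a short input sequence; the dimensions and random weights are illustrative:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, H, K = 4, 8, 3                      # input dim, hidden dim, number of classes
U = rng.standard_normal((H, D)) * 0.1  # input-to-hidden weights
W = rng.standard_normal((H, H)) * 0.1  # hidden-to-hidden weights
V = rng.standard_normal((K, H)) * 0.1  # hidden-to-output weights

xs = [rng.standard_normal(D) for _ in range(5)]   # a sequence of 5 inputs
h = np.zeros(H)
for x_t in xs:
    h = np.tanh(U @ x_t + W @ h)       # Equation 2.6 (bias omitted)
    y_t = softmax(V @ h)               # Equation 2.7: class distribution at step t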

Backpropagating errors in an RNN can be done by applying the standard

BP algorithm to the unfolded computational graph of the RNN. This method

is called Back-Propagation Through Time (BPTT). The gradients obtained by

applying BPTT can be used by an optimization algorithm such as SGD to train

the RNN.

The main drawback of “Vanilla” RNNs is the exploding and vanishing gradient problem. While computing the gradient, the same weight matrices are multiplied with each other many times, which leads to the following issues:

• Exponentially growing gradients when the weight parameters have magnitudes larger than 1.0, which leads to unstable training.

• Exponentially vanishing gradients when the weight parameters have magnitudes smaller than 1.0, so that the model only learns to keep a “short memory”.

The exploding gradient problem can be partially solved by setting a threshold to “clip” the gradient. However, no good solution to the vanishing gradient problem has been found for the “Vanilla” RNN.

In 1997, Hochreiter and Schmidhuber proposed Long Short-Term Memory (LSTM) [HS97], which offers an “advanced” RNN cell structure. Figure 2.15 demonstrates the detailed structural design of an LSTM cell.

Figure 2.15: Detailed structure of a LSTM cell. (image credit: Xiaoyin Che)

Unlike “Vanilla” RNNs, LSTM is designed to capture sequential context over

long as well as short periods of time. The key idea of LSTM is the new design of

Cell State, intended to store long-term memory. It also introduces several new hidden cells; each cell holds its internal state and is carefully controlled by three trainable Gates. Those gates control the information flow and decide whether the cell's state should be altered. First, a forget gate $f_t$ decides which information to erase from the previous cell state $C_{t-1}$. Next, the input gate $i_t$ controls which parts of the new information $\tilde{C}_t$ from the current time step are stored in the cell state. Finally, the output gate $o_t$ constrains how the state information is used for computing the hidden state of the current time step $h_t$.

Overall, the LSTM is calculated as follows:

$\tilde{C}_t = \tanh(U_C x_t + W_C h_{t-1})$    (2.8)

$f_t = \sigma(U_f x_t + W_f h_{t-1})$    (2.9)

$i_t = \sigma(U_i x_t + W_i h_{t-1})$    (2.10)

$o_t = \sigma(U_o x_t + W_o h_{t-1})$    (2.11)

Long-term memory update:

$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t$    (2.12)

Short-term memory output:

$h_t = o_t \circ \tanh(C_t)$    (2.13)
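The following NumPy sketch performs one LSTM step following Equations 2.8–2.13; the weight shapes, random initialization and the omission of biases are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, P):
    # P holds the weight matrices U_* (input) and W_* (recurrent); biases omitted
    C_tilde = np.tanh(P["Uc"] @ x_t + P["Wc"] @ h_prev)      # Eq. 2.8 (candidate)
    f_t = sigmoid(P["Uf"] @ x_t + P["Wf"] @ h_prev)          # Eq. 2.9 (forget gate)
    i_t = sigmoid(P["Ui"] @ x_t + P["Wi"] @ h_prev)          # Eq. 2.10 (input gate)
    o_t = sigmoid(P["Uo"] @ x_t + P["Wo"] @ h_prev)          # Eq. 2.11 (output gate)
    C_t = f_t * C_prev + i_t * C_tilde                       # Eq. 2.12 (cell state)
    h_t = o_t * np.tanh(C_t)                                 # Eq. 2.13 (hidden state)
    return h_t, C_t

rng = np.random.default_rng(0)
D, H = 4, 8
P = {k: rng.standard_normal((H, D if k[0] == "U" else H)) * 0.1
     for k in ["Uc", "Wc", "Uf", "Wf", "Ui", "Wi", "Uo", "Wo"]}
h, C = np.zeros(H), np.zeros(H)
h, C = lstm_step(rng.standard_normal(D), h, C, P)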

Since LSTM has achieved significant success in a wide range of application areas, many derived variants have been proposed. For example, a simplified approach called the Gated Recurrent Unit (GRU) [CVMG+14] was developed for machine translation and demonstrates similar accuracy but better efficiency compared to LSTM. Another approach, the Bidirectional LSTM (BLSTM), has also attracted a lot of attention in different application fields. Figure 2.16 shows the basic computational graph of a Bidirectional RNN, where the backward RNN cells (blue) have a computation flow independent of the forward cells (yellow). On top of both data flows, an additional activation cell is used to create the integrated outputs. BLSTM has achieved many state-of-the-art results in different application areas, such as image captioning [WYBM16], phoneme classification [GS05], and named entity recognition [CN15].


Figure 2.16: Computational graph of a Bidirectional RNN. (image credit: Xiaoyin Che)

Although almost all of the core algorithms used nowadays in the DL community have existed since the 1980s and had already shown promising results, the second wave of AI research unfortunately only lasted until the mid-1990s. There are several important reasons for this: first, other ML methods such as SVMs achieved better results on mainstream tasks; second, ANNs were hard to train and training a network was computationally costly; moreover, the large-scale datasets used today were not available at that time. As a result, ANN research entered another cold winter.

2.2.2 Neural Network 3.0 - Deep Learning Algorithms

In 2012, AlexNet, proposed by Krizhevsky, Sutskever and Hinton [KSH12], won the ImageNet LSVRC-2012 competition with a large margin over the other competitors (top-5 error rate: 16% vs. 26.2% for second place). This event opened the current wave of AI, in which Deep Neural Networks (DNNs) are considered the core foundation.

Generally speaking, recent efforts in DL research focus on the optimization of the information flow in DNNs and on architecture engineering. The recent achievements in gradient flow optimization have focused on the following aspects:

• problem: gradient vanishing in deep networks; solution: ReLU [NH10]

• problem: “dying ReLU”; solution: LeakyReLU, PReLU, ELU [XWCL15], etc.

• problem: extremely “deep” networks are hard to train; solution: adding shortcut connections to enable a more flexible gradient circulation

• problem: strong bias in the data flow of deep networks; solution: forcing stability of the mean and variance of the activations with Batch Normalization [IS15a]

• problem: overfitting; solution: adding noise to the gradient flow, e.g., Dropout [SHK+14]

• problem: standard SGD relies heavily on manual hyperparameter tuning; solution: adaptive optimization methods, e.g., Adam [KB14]

I will give a detailed description of the mentioned achievements in the rest of this chapter, followed by an introduction to several of the most significant DNN architectures. For a more comprehensive treatment of DL algorithms, I highly recommend the following literature [GBCB16, LBH15, B+09].

2.2.2.1 Data Preprocessing and Initialization

Successfully training a DNN usually requires some advanced training techniques or components which need to be analyzed carefully. Different approaches are applied before feeding the data to the network. First, we use mean subtraction to reduce the bias of the dataset: in practice, for instance, we subtract the mean image from the input image in a pixel-wise manner, where the mean image is computed over the training set (e.g., the mean image of AlexNet with the channel size [3, 224, 224]). Another influential deep model, VGG-Net [SZ15], uses per-channel mean subtraction with a mean vector of three values [103.939, 116.779, 123.68] for the “blue”, “green” and “red” channels, respectively. Moreover, Google's Inception-Net [SLJ+15] uses zero-centered RGB channels and further squashes the input to [-1, 1].
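A minimal sketch of the three preprocessing variants mentioned above for a batch of RGB images in NumPy; the per-channel mean values are those quoted for VGG-Net, while the array layout and the batch used to compute the mean image are illustrative assumptions:

import numpy as np

# images: batch of RGB images with shape (N, H, W, 3), values in [0, 255]
images = np.random.randint(0, 256, size=(8, 224, 224, 3)).astype(np.float32)

# (a) pixel-wise mean image, in practice computed over the training set (AlexNet style)
mean_image = images.mean(axis=0)
centered_a = images - mean_image

# (b) per-channel mean subtraction (VGG style, BGR mean values from the text)
bgr_mean = np.array([103.939, 116.779, 123.68], dtype=np.float32)
centered_b = images[..., ::-1] - bgr_mean      # reorder RGB -> BGR, then subtract

# (c) zero-center and squash to [-1, 1] (Inception style)
centered_c = images / 127.5 - 1.0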


Figure 2.17: (Left) Images generated by our text sample generation tool. (Right) Images

taken from ICDAR dataset of the robust reading challenge. (image credit: Christian Bartz )

DL algorithms usually require large datasets for training, and creating large datasets with pixel-level labels is extremely costly due to the amount of human effort required. Therefore, data augmentation methods have been widely used to enlarge the training datasets; common techniques include sample rescaling, random cropping, flipping along the horizontal or vertical axis, color jittering, PCA/ZCA whitening, as well as arbitrary combinations of small translations, rotations, stretching, shearing, distortions and many others. Moreover, for some specific tasks for which training data are difficult to obtain, we often develop a data engine to generate synthetic data for model training and use real data for performance testing. The synthetic data should approximate the real data distribution as closely as possible. For example, in our work [YWBM16] we developed a data engine for generating text sample images with various factors, such as font styles, background blending, distortion, blurring, reflection, etc. Figure 2.17 demonstrates a comparison of generated samples with real-world images from the ICDAR dataset of the robust reading challenge. Based on this system we achieved word recognition results similar to Google's PhotoOCR system [BCNN13a], which was built on millions of manually annotated real-world images.
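As a small sketch of two of the augmentation techniques listed above (horizontal flipping and random cropping); the crop size and flip probability are illustrative parameters:

import numpy as np

def random_flip_and_crop(img, crop_hw=(200, 200), rng=np.random.default_rng()):
    # img: a single image of shape (H, W, C)
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                       # horizontal flip
    H, W, _ = img.shape
    ch, cw = crop_hw
    top = rng.integers(0, H - ch + 1)
    left = rng.integers(0, W - cw + 1)
    return img[top:top + ch, left:left + cw, :]     # random crop

img = np.random.rand(224, 224, 3)
aug = random_flip_and_crop(img)                     # shape (200, 200, 3)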

Moreover, Richter et al. [RVRK16] developed an approach for rapidly creating pixel-accurate semantic label maps for images extracted from modern computer games. The authors produced dense pixel-level semantic annotations for 25 thousand images synthesized by a photorealistic open-world computer game. Figure 2.18 shows an exemplary picture generated by this approach.

Figure 2.18: Ground truth image created from computer games. (image credit: [RVRK16])

2.2.2.1.1 Parameter Initialization Before Batch Normalization was proposed, the parameter initialization of deep networks had a considerable impact on the overall performance. This has been confirmed in many previous works [SMDH13]; most networks apply random initialization for the weights. However, for more complicated tasks with high-dimensional input data, effective initialization techniques are desired. The weights of a DNN should not be initialized symmetrically, in order to ease the backpropagation process. Many effective techniques have been proposed over the last few years. Glorot et al. [GB10] proposed a simple but effective approach in which the network weights $W_l$ of the $l$th layer are scaled by the inverse of the square root of the input dimension; more formally, this method can be represented as:

$\mathrm{Var}(in) = \frac{1}{D_l}, \qquad W_l = \frac{\mathrm{Random}(D_l, H)}{\sqrt{D_l}}$    (2.14)


where $D_l$ denotes the input dimension of the $l$th layer, and $\mathrm{Random}(D_l, H)$ denotes the randomly initialized weights. This method is known as Xavier initialization and is based on a symmetric activation function together with a linearity hypothesis [GB10]. However, this approach may still yield a biased data flow when used together with the ReLU activation function. Therefore, He et al. (2015) [HZRS15] introduced an additional term into the Xavier initialization which effectively alleviates this bias problem: the weights of the $l$th layer are drawn from a normal distribution with zero mean and variance $\frac{2}{n_l}$, expressed as follows:

$W_l \sim \mathcal{N}\!\left(0, \frac{2}{n_l}\right)$    (2.15)
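A minimal sketch of both initialization schemes in NumPy, following Equations 2.14 and 2.15; whether uniform or normal sampling is used varies between implementations, and the layer sizes below are arbitrary:

import numpy as np

def xavier_init(D_in, D_out, rng=np.random.default_rng()):
    # Glorot/Xavier: scale random weights by 1/sqrt(D_in), cf. Equation 2.14
    return rng.standard_normal((D_in, D_out)) / np.sqrt(D_in)

def he_init(D_in, D_out, rng=np.random.default_rng()):
    # He et al.: zero-mean normal with variance 2/D_in, cf. Equation 2.15
    return rng.standard_normal((D_in, D_out)) * np.sqrt(2.0 / D_in)

W_sigmoid_layer = xavier_init(512, 256)   # suited to symmetric activations
W_relu_layer = he_init(512, 256)          # suited to ReLU activations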

2.2.2.2 Batch Normalization

As mentioned in the previous section, in DL we often use mean subtraction to reduce the dataset bias. In this way, the network converges faster and shows better regularization behaviour during training, which has a positive impact on the overall accuracy. However, this process is performed outside the network before training, whereas a deep network has a layer-wise structure with many non-linear data transformations. Even small changes in the early layers can result in large outliers in the later layers, and such outliers lead to gradient bias in the BP process. Thus, additional compensation (more training epochs) is required to overcome such outliers.

Batch Normalization (BN) [IS15a] helps to accelerate the training of DNNs by reducing the internal covariate shift of the input data: the inputs of a layer are linearly transformed to have zero mean and unit variance. Since this normalization is applied to the internal layers of the DNN, the training process itself is optimized and can still be carried out with BP. BN is commonly used in most state-of-the-art DNN architectures, such as ResNet, Inception-Net, DenseNet, etc. The algorithm of BN is given in Algorithm 3.

Algorithm 3 Batch Normalization (BN)
Input: values of $x$ over a mini-batch: $B = \{x_1, \ldots, x_m\}$
Output: $\{y_i = BN_{\gamma,\beta}(x_i)\}$; parameters to be learned: $\gamma$, $\beta$
Calculate the mini-batch mean:
    $\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i$
Calculate the mini-batch variance:
    $\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$
Normalize:
    $\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
Scale and shift:
    $y_i = \gamma \hat{x}_i + \beta \equiv BN_{\gamma,\beta}(x_i)$

The parameters $\gamma$ and $\beta$ define a scale and a shift factor for the normalized values, so that the normalization does not depend solely on the layer values. Calculating BN is slightly different at test time: the mean and variance are not computed from the current batch; instead, a single fixed empirical mean and variance of the activations, estimated during training, is used. The advantages of BN, and some guidelines for its use, can be summarized as follows (a code sketch of the BN forward pass is given after the list):

• It prevents outliers and thus improves the gradient flow in the backward pass

• It allows a higher learning rate, which makes training faster

• It reduces the strong dependence on initialization

• It reduces the need for Dropout (reported by Ioffe and Szegedy [IS15a])

• Reduce the L2 weight regularization

• Remove Local Response Normalization (LRN), if used

• Shuffle the training samples more thoroughly

• Use less distortion of the images in the training set
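To make Algorithm 3 concrete, the following minimal NumPy sketch shows the BN forward pass for fully-connected activations, including the running statistics used at test time; the momentum value and the exact update rule for the running statistics are illustrative implementation choices:

import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, eps=1e-5, momentum=0.9):
    # x: activations of shape (batch, features)
    if training:
        mu = x.mean(axis=0)                       # mini-batch mean
        var = x.var(axis=0)                       # mini-batch variance
        # update the fixed empirical statistics used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var       # use training-time statistics
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalize
    return gamma * x_hat + beta, running_mean, running_var   # scale and shift

x = np.random.randn(32, 64)
gamma, beta = np.ones(64), np.zeros(64)
y, rm, rv = batch_norm(x, gamma, beta, np.zeros(64), np.ones(64))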


Figure 2.19: Pictorial representation of the concept Dropout

2.2.2.3 Regularization

Regularization techniques in ML are designed to enhance the generalization ability of a model. Rather than reducing the number of features, regularization methods keep all features but reduce the magnitude of their contribution. This works well when the ML model has a lot of features, as DNNs do, where each feature contributes a little to the classification result. The commonly used regularization methods in DNNs are L1, L2, and combined L1 + L2 weight penalties.

Different regularization approaches have been proposed for deep networks in the past few years. Among them, Dropout, proposed by Srivastava et al. [SHK+14], is a very straightforward but efficient approach. In Dropout, a randomly selected subset of activations within a layer is set to zero. The parameter dropout rate sets the probability of dropping, e.g., 50%, which means that 50% of the neuron activations of a particular layer are randomly set to zero. Training a deep model with Dropout can therefore be considered as training a large ensemble of sub-models; it prevents the co-adaptation of features and forces the network to learn a redundant representation. The concept of Dropout is shown in Figure 2.19. DropConnect, proposed by Wang et al. [WZZ+13], is another effective regularization approach: instead of dropping activations, subsets of the weights within a layer are set to zero. As a result, each layer receives a randomly selected subset of units from the previous layer.
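A minimal sketch of Dropout at training time in NumPy; the "inverted dropout" rescaling used here is a common implementation convention (so that nothing changes at test time), not something prescribed by the description above:

import numpy as np

def dropout(activations, drop_rate=0.5, training=True, rng=np.random.default_rng()):
    if not training or drop_rate == 0.0:
        return activations                      # identity at test time
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob   # randomly keep units
    # zero out dropped activations and rescale the rest ("inverted dropout")
    return activations * mask / keep_prob

h = np.random.randn(4, 8)
h_train = dropout(h, drop_rate=0.5, training=True)
h_test = dropout(h, training=False)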


Figure 2.20: Activation function: (top-left) sigmoid function, (bottom-left) derivative of

sigmoid function, (top-middle) tanh function, (bottom-middle) derivative of tanh function,

(top-right) ReLU function, (bottom-right) derivative of ReLU function.

2.2.2.4 Activation Function

Previously, the Sigmoid and Tanh activation functions were commonly used in neural networks. The corresponding graphical representations are shown in Figure 2.20 (cf. the left and middle columns); Equations 2.16 and 2.17 give the mathematical expressions.

$\mathrm{Sigmoid:}\ \sigma(x) = \frac{1}{1+e^{-x}}, \qquad \mathrm{Derivative:}\ \sigma'(x) = \sigma(x)(1-\sigma(x))$    (2.16)

$\mathrm{Tanh:}\ \tanh(x) = \frac{2}{1+e^{-2x}} - 1, \qquad \mathrm{Derivative:}\ \tanh'(x) = 1 - \tanh^2(x)$    (2.17)

However, both the Sigmoid and the Tanh function suffer from the vanishing gradient problem when the network has many hidden layers, so very deep networks could not be trained with these two activation functions. This problem was solved by the ReLU activation function, proposed by Nair et al. (2010) [NH10]. The basic concept of ReLU is to simply keep all values above zero and set all negative values to zero, as shown graphically in Figure 2.20 (cf. the right column) and formally in Equation 2.18. ReLU converges much faster than Sigmoid and Tanh, since the gradient always equals 1 when $x \geq 0$; this characteristic alleviates the vanishing gradient problem. However, when $x < 0$, the corresponding neurons are never activated in the forward pass and always receive zero gradients in the backward pass. This issue is the so-called “dying ReLU” problem.

$\mathrm{ReLU:}\ f(x) = \max(x, 0), \qquad \mathrm{Derivative:}\ f'(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1, & \text{if } x \geq 0 \end{cases}$    (2.18)

The straightforward solution to this problem is to modify the negative part of the ReLU function so as to enable a negative data flow. Several improved variants of ReLU have been proposed, such as the Parametric ReLU (PReLU) [HZRS15], the Leaky ReLU (LReLU) [MHN13] and the Exponential Linear Unit (ELU) [CUH15]. Figure 2.21 shows their graphical representations; the mathematical expressions are given in Equations 2.19, 2.20 and 2.21.

$\mathrm{Leaky\ ReLU:}\ f(x) = \begin{cases} 0.01x, & \text{if } x < 0 \\ x, & \text{if } x \geq 0 \end{cases}, \qquad \mathrm{Derivative:}\ f'(x) = \begin{cases} 0.01, & \text{if } x < 0 \\ 1, & \text{if } x \geq 0 \end{cases}$    (2.19)

$\mathrm{PReLU:}\ f(x) = \begin{cases} \alpha x, & \text{if } x < 0 \\ x, & \text{if } x \geq 0 \end{cases}, \qquad \mathrm{Derivative:}\ f'(x) = \begin{cases} \alpha, & \text{if } x < 0 \\ 1, & \text{if } x \geq 0 \end{cases}$    (2.20)

We can see that LReLU requires manually choosing the constant slope (0.01 in Equation 2.19), whereas PReLU adaptively learns the parameter $\alpha$ from the training data.

$\mathrm{ELU:}\ f(x) = \begin{cases} \alpha(e^x - 1), & \text{if } x < 0 \\ x, & \text{if } x \geq 0 \end{cases}, \qquad \mathrm{Derivative:}\ f'(x) = \begin{cases} f(x) + \alpha, & \text{if } x < 0 \\ 1, & \text{if } x \geq 0 \end{cases}$    (2.21)


Figure 2.21: Derived linear unit activation function: (left) PReLU and LReLU activation

function, (right) ELU activation function.

ELU retains the benefits of ReLU and does not suffer from the “dying ReLU” problem, but its exponential function is computationally more expensive. Xu et al. (2015) [XWCL15] present an empirical study of several rectified activations in CNNs, which provides further insights on this topic.
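For reference, the activation functions of Equations 2.16–2.21 and their derivatives can be written compactly in NumPy; the α values are the user-chosen or learned parameters discussed above, and the default values here are illustrative:

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
d_sigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

tanh, d_tanh = np.tanh, lambda x: 1.0 - np.tanh(x) ** 2

relu = lambda x: np.maximum(x, 0.0)
d_relu = lambda x: (x >= 0).astype(float)

leaky_relu = lambda x, a=0.01: np.where(x < 0, a * x, x)
prelu = lambda x, a: np.where(x < 0, a * x, x)          # a is learned from data
elu = lambda x, a=1.0: np.where(x < 0, a * (np.exp(x) - 1.0), x)

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), elu(x))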

2.2.2.5 Optimization Algorithms

In section 2.2.1.2, we presented SGD, a widely used optimization algorithm for training neural networks. With SGD we can obtain reliable results given a proper initialization and an appropriate learning rate scheduling scheme. However, SGD requires a lot of manual tuning of its hyperparameters and converges slowly compared to the recently proposed adaptive optimization methods. Furthermore, SGD has trouble navigating ravines, i.e., areas where the surface curves much more steeply in one dimension than in another, which are common around local optima or saddle points.

2.2.2.5.1 Momentum Momentum [Qia99] is a method that helps to accelerate training with SGD. It is analogous to momentum in physics: the technique boosts SGD in the relevant direction and dampens oscillations. The core idea is to utilize a moving average of the gradients instead of only the current value of the gradient.


Mathematically, it adds a fraction $\gamma$ of the update vector of the previous time step to the current update vector, which can be expressed as follows:

$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta), \qquad \theta = \theta - v_t$    (2.22)

where $\gamma$ is the momentum term, $\eta$ denotes the learning rate of the $t$th update, and $v$ denotes the “velocity”. The momentum term grows for dimensions whose gradients keep pointing in the same direction and dampens updates for dimensions whose gradients change direction. This results in faster convergence and reduced oscillation.
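A minimal sketch of the Momentum update of Equation 2.22 in NumPy; grad_fn stands for any routine returning the current gradient and, like the toy objective below, is a placeholder:

import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    g = grad_fn(theta)              # current gradient of J(theta)
    v = gamma * v + lr * g          # accumulate a moving average ("velocity")
    return theta - v, v             # update the parameters with the velocity

theta, v = np.zeros(3), np.zeros(3)
grad_fn = lambda th: 2.0 * (th - np.array([1.0, -1.0, 0.5]))  # toy quadratic
for _ in range(100):
    theta, v = momentum_step(theta, v, grad_fn)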

However, even though we can achieve some speedup with Momentum, we still have to manually choose an initial learning rate and define its update schedule. Therefore, adaptive learning rate methods are still highly desirable.

2.2.2.5.2 RMSprop RMSprop is an adaptive learning rate method proposed by Geoffrey Hinton in his Coursera lecture¹. RMSprop follows the same idea as Adagrad [DHS11] in adapting the learning rate to each parameter, performing larger updates for infrequent and smaller updates for frequent parameters. Unlike Adagrad, RMSprop utilizes only a decaying average of recent squared gradients, which prevents the monotonically decreasing learning rate of Adagrad and provides better performance in many cases. The mathematical expression of RMSprop is given in Equation 2.23.

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$    (2.23)

Instead of inefficiently accumulating all previous squared gradients, the sum is recursively defined as a decaying average of past squared gradients. $E[g^2]_t$ denotes this running average at time step $t$; from Equation 2.23 we can see that it only depends on the previous average and the current gradient. The learning rate $\eta$ is divided by this exponentially decaying average of squared gradients. The suggested default values of $\gamma$ and $\eta$ are 0.9 and 0.001, respectively.

¹ http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf


2.2.2.5.3 Adaptive Moment Estimation (Adam) Adam [KB14] is probably the most widely used adaptive optimization method in the DL community at present. Like its counterparts, Adam computes an individual learning rate for each parameter. The method can be considered a combination of Momentum and RMSprop: it utilizes an exponentially decaying average of past squared gradients like RMSprop (denoted by $v_t$ in Equation 2.24), and additionally keeps an exponentially decaying average of past gradients (denoted by $m_t$), similar to Momentum.

$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$    (2.24)

where $m_t$ and $v_t$ estimate the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively. The authors of [KB14] observed that $m_t$ and $v_t$ are biased towards zero; their bias-corrected versions are therefore:

$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$    (2.25)

Then, the parameter update function is expressed as:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$    (2.26)

The proposed default values for $\beta_1$, $\beta_2$ and $\epsilon$ are 0.9, 0.999 and $10^{-8}$, respectively. Adam works very well in practice and is particularly suggested as a good starting point for deeper, more complex networks.
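The following NumPy sketch implements the RMSprop update of Equation 2.23 and the Adam update of Equations 2.24–2.26 side by side, using the default hyperparameters quoted above; the gradient routine and the toy objective are placeholders for illustration:

import numpy as np

def rmsprop_step(theta, Eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2          # Eq. 2.23, running E[g^2]
    return theta - lr * grad / np.sqrt(Eg2 + eps), Eg2

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                         # Eq. 2.24, first moment
    v = b2 * v + (1 - b2) * grad ** 2                    # Eq. 2.24, second moment
    m_hat = m / (1 - b1 ** t)                            # Eq. 2.25, bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # Eq. 2.26, parameter update
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
grad_fn = lambda th: 2.0 * (th - 1.0)                    # toy quadratic objective
for t in range(1, 501):
    theta, m, v = adam_step(theta, m, v, grad_fn(theta), t)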

2.2.2.6 Loss Function

The Softmax function is an activation function applied to the last fc-layer to obtain a probability distribution. It maps any $K$-dimensional ($K \in \mathbb{N}^+$) input vector $x \in \mathbb{R}^K$ onto a vector with values in $[0, 1]$ whose elements sum to 1:

$p(x)_i = \frac{\exp(x_i)}{\sum_{k=1}^{K} \exp(x_k)}$


To train a neural network, we iteratively calculate the difference between the desired output class $y$ and the predicted class $\hat{y}$. Such an error measure between the predicted and the expected output is referred to as a loss function. A commonly used loss function for multi-class classification is the cross-entropy loss, which is also the loss mostly used in this thesis.

In binary classification, where the number of classes is $K = 2$, the cross-entropy can be calculated as:

$L_{\text{cross-entropy}} = -\big(y \log(p) + (1-y)\log(1-p)\big)$

For $K > 2$, we compute a separate loss for each class label per observation and sum the result:

$L_{\text{cross-entropy}} = -\sum_{k=1}^{K} y_{i,k}\, \log(p_{i,k})$

where $K$ denotes the number of classes, $\log$ is the natural logarithm, $y_{i,k}$ is the binary indicator (0 or 1) of whether class label $k$ is the correct classification for observation $i$, and $p_{i,k}$ is the predicted probability that observation $i$ belongs to class $k$.
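A minimal NumPy sketch of the softmax function and the multi-class cross-entropy loss defined above; the max-subtraction and the small eps are standard numerical-stability tricks not mentioned in the text:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())           # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(p, y_onehot, eps=1e-12):
    # p: predicted class probabilities, y_onehot: binary indicator vector
    return -np.sum(y_onehot * np.log(p + eps))

logits = np.array([2.0, 0.5, -1.0])   # output of the last fc-layer (K = 3)
p = softmax(logits)
y = np.array([1.0, 0.0, 0.0])          # the correct class is class 0
loss = cross_entropy(p, y)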

2.2.2.7 DNN Architectures

In section 2.2.1.2, I presented Lenet [L+15], which is recognized as the cornerstone of the subsequent CNN architectures. In this section, I will introduce some more recent CNN architectures, which in turn are the backbones of most AI applications. The top-5 errors of essential deep CNN models in the ImageNet classification challenge can be found in Figure 2.34.

2.2.2.7.1 AlexNet As already mentioned, AlexNet [KSH12] can be considered the starting signal of the current AI wave. It achieved state-of-the-art accuracy and won the highly competitive ImageNet ILSVRC challenge in 2012 (an image classification task with 1000 classes, 1.2 million training images and 50k validation images). It was a significant breakthrough in the field of machine learning and computer vision for visual recognition and classification.


Method                     LeNet-5   AlexNet    VGG-16    GoogLeNet    ResNet-50
Top-5 errors               N/A       16.4%      7.4%      6.7%         5.3%
Input size                 28×28     227×227    224×224   224×224      224×224
Number of conv-layers      2         5          16        21           50
Kernel sizes               {5}       {3, 5, 11} {3}       {1, 3, 5, 7} {1, 3, 7}
Number of weights (conv)   26k       2.3M       14.7M     6.0M         23.5M
Number of MACs (conv)      1.9M      666M       15.3G     1.43G        3.86G
Number of fc-layers        2         3          3         1            1
Number of weights (fc)     406k      58.6M      124M      1M           1M
Number of MACs (fc)        405k      58.6M      124M      1M           1M
Total weights              431k      61M        138M      7M           25.5M
Total MACs                 2.3M      724M       15.5G     1.43G        3.9G

Table 2.1: Classification accuracy of mainstream CNN architectures on the ImageNet dataset, together with essential information about the models, such as the number of weights and computation operations (MACs).

Figure 2.22: Architecture of AlexNet

As shown in Figure 2.22, AlexNet has five convolution layers and three fully-

connected (fc) layers. The first convolution layer consists of convolution and max-

pooling with Local Response Normalization (LRN), where 96 filters are used with

the size 11×11. The same operations are performed in the second convolution

layer with 256 5×5 filters, and in the 3rd, 4th and 5th conv-layers with 384, 384, and 256 filters, respectively. There are several novel concepts introduced in AlexNet,

including using ReLU activation function instead of Sigmoid and Tanh; adding

Dropout into the network as a regularizer; implementing CNNs using CUDA

which achieved a significant acceleration.

Table 2.1 shows some essential information of the network, such as the number

of convolution and fc layers, the number of weight parameters, and the number of

Multiplier-Accumulators (MACs) in convolution and fc layers, respectively. From


the table, we can see that the convolution layers account for a large proportion of the total computation (about 95% of the MACs), while the fully-connected layers contain most of the weight parameters (about 94%). This indicates that to reduce the model size we should consider eliminating the fc layers, and that to speed up the model computation we can apply parallel computing techniques in the convolutional layers.

Overall, we can clearly see the outstanding contribution of AlexNet as a pioneer and game-changer that provided important guidance for the subsequent development of DNNs.

2.2.2.7.2 VGG-Net After the success of AlexNet, the architecture engineering of deep CNNs developed rapidly and became a mainstream research direction in the DL community.

In 2013, Zeiler and Fergus won the ILSVRC'13 classification challenge with ZFNet, which merely adapts the filter size of the input convolution layer from 11×11 to 7×7 and increases the filter numbers of the last three convolution layers to 512, 1024 and 512. These changes yielded an accuracy improvement of about 5%.

In 2014, two essential approaches were published: VGG-Net [SZ15] and GoogLeNet [SLJ+15], which won the ILSVRC'14 localization and classification challenges, respectively. VGG-Net also achieved second place in the classification task.

The main contribution of VGG-Net is that it shows the importance of the depth of a network: alongside the other hyperparameters of CNNs, the depth can be a critical factor for achieving much better accuracy. As shown in

Figure 2.23, VGG-Net consists of several different modules of convolutional layers, which differ in the number of conv-layers and output feature maps. The ReLU activation function is applied to obtain non-linearity, each module is followed by a max-pooling layer, and the network ends with several fully connected layers. The final layer of the model is a softmax layer for classification.

Figure 2.23: Structure diagram of VGG-Net

Figure 2.24: A stack of three convolution layers with 3×3 kernels and stride 1 has the same active receptive field as a 7×7 convolution layer.

We can learn some interesting characteristics from the design of VGG-Net. It introduced for the first time the notion of a “module” or “block” in a deep network (cf. Figure 2.23, where the various colors characterize different modules). It strictly uses 3×3 filters with stride and padding of 1, which allows the number of weights of the network to be reduced significantly: stacking three conv-layers with 3×3 kernels has the same active receptive field as directly using a 7×7 kernel, but far fewer parameters ($3 \cdot (3^2 C^2)$ vs. $7^2 C^2$, where $C$ denotes the number of channels), as demonstrated in Figure 2.24.
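This parameter comparison can be checked with a few lines of Python; the channel count C is an arbitrary illustrative value and biases are ignored:

# weights of a conv layer: k_w * k_h * c_in * c_out (biases ignored)
C = 256
stacked_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 conv-layers
single_7x7 = 7 * 7 * C * C          # one 7x7 conv-layer, same receptive field
print(stacked_3x3, single_7x7)      # 1769472 vs. 3211264 -> roughly 45% fewer weights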

VGG-16 is one of the most influential deep models of the early DL age, because it was the preferred choice in the community as an image feature extractor for a large variety of applications; this only changed once the Inception-Net and ResNet models became available. It emphasizes the notion “keep it deep and keep it simple”. However, the VGG models are also among the most computationally expensive: VGG-16 already contains 138M weights and requires 15.5G MACs (cf. Table 2.1), and the deeper VGG-19 is even more costly.

2.2.2.7.3 GoogLeNet GoogLeNet [SLJ+15], proposed by Google researchers, is the winner of the ILSVRC'14 image classification challenge. The essential design aim of this model is to reduce the computational complexity compared to traditional CNNs.

Figure 2.25: Structure diagram of GoogLeNet, which emphasizes the so-called “Inception Module”

Figure 2.25 demonstrates the structure of GoogLeNet, from which we can see that the network consists of many sub-networks, the so-called “Inception Modules”. The idea behind this is to design a good local network topology (sub-network) and then stack these modules on top of each other. The sub-modules have variable receptive fields, created by different kernel sizes, and all filter outputs are concatenated in a depth-wise manner, which is called “Depth-Concat”. The initial concept of the Inception module is shown in Figure 2.26: it applies four parallel filter operations to the input of the current layer in order to learn visual features at different scales. A problem with this design is that it cannot effectively reduce the model complexity as initially expected. Therefore a simple but quite efficient idea was proposed: the “bottleneck” design.

Figure 2.26: Naive version of the “Inception Module”, where the numbers in the figure, e.g., 28×28×128, denote the width×height×depth of the feature maps.

Figure 2.27 describes this idea: given an input map with depth, width, and height of [64, 56, 56], we use a 1×1 conv-kernel with depth 32 and stride 1 to convolve the input; the output map then has the dimension [32, 56, 56]. In this way, we can preserve the width and height (spatial dimensions) of the feature map but arbitrarily reduce its depth.

Figure 2.27: “Bottleneck” design idea for a convolution layer: preserves width and height, but reduces depth

Figure 2.28: “Inception Module” with “bottleneck” layers, where the numbers in the figure, e.g., 28×28×128, denote the width×height×depth of the feature maps.

Figure 2.28 shows the final “Inception Module” of GoogLeNet with the “bottleneck” layer design. A 1×1 convolution layer with depth 64 is applied before the activations are fed into the 3×3 and 5×5 convolution layers, and after the 3×3 max-pooling layer, respectively. In this way, the number of operations of the “Inception Module” can be reduced effectively, from 854M to 358M.
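The effect of the “bottleneck” can be illustrated with a rough operation count for a single 5×5 branch on an assumed 28×28×192 input; these dimensions are chosen purely for the example and are not the exact GoogLeNet configuration behind the 854M/358M figures quoted above:

# multiply-accumulate operations of a conv layer:
# output_h * output_w * k_h * k_w * c_in * c_out (padding keeps 28x28 here)
def conv_macs(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

direct = conv_macs(28, 28, 5, 192, 32)                       # 5x5 conv directly
bottleneck = conv_macs(28, 28, 1, 192, 16) + conv_macs(28, 28, 5, 16, 32)
print(direct, bottleneck)    # ~120.4M vs. ~12.4M MACs for this single branch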

GoogLeNet is 22 layers deep (the deepest network at that time) and achieves a top-5 error of 6.7%, a gain of almost 10 percentage points compared to AlexNet. Moreover, the total number of weights is about 12× smaller than that of AlexNet and 27× smaller than that of VGG-19. The “bottleneck” design provided massive inspiration for the follow-up lightweight CNN models.

2.2.2.7.4 ResNet ResNet, proposed by He et al. [HZRS16], swept first place in all ILSVRC'15 and COCO'15¹ competitions, including the image classification, object detection, and object segmentation tasks. ResNet-152 achieved a 3.57% top-5 error in the ILSVRC'15 image classification task, which was the first time a machine learning model surpassed “human performance” (a 5.1% top-5 error, as reported in [RDS+15]) on this task. In my opinion, the main contribution of this work is that the authors found an efficient way to train ultra-deep CNNs without suffering from network degradation and the vanishing gradient problem. In the paper [HZRS16], the authors conducted a study on increasing the number of hidden layers of a “plain” CNN model with a VGG-Net-like architecture.

¹ http://cocodataset.org


Figure 2.29: Structure diagram of ResNet. Left: a “Residual Module”, right: the overall

network design. The different colors indicate the blocks with various layer type or the

different number of filters.


Figure 2.30: “Residual Module”. Left: initial design of the residual block, right: residual

block with “Bottleneck” design.

Surprisingly, a deeper model with 56 layers performed worse than a shallow model with 20 layers. Since both the testing and the training accuracy of the shallow model were better than those of the deeper one, this indicates that the issue was not overfitting. The authors interpreted this result as a network degradation problem and came to the conclusion that a multi-layer non-linear feed-forward network has difficulty learning the identity mapping: if we could copy the learned layers from the shallower model into a deeper one and set the additional layers to the identity mapping, the deeper model should perform at least as well as the shallower one. The authors therefore raised another question: since a direct mapping is difficult for a CNN to learn, can we instead learn the residual of the information flow? Based on this idea the authors proposed a new network module, the “Residual Module”, which adds a residual connection (serving as an identity mapping) to the conv-layers, as shown in the left part of Figure 2.29. The right part of Figure 2.29 depicts the overall design of ResNet.

ResNet has been developed with many different numbers of layers: 18, 34, 50, 101, 152, and even 1202 (on the CIFAR-10 dataset). The popular ResNet-50 contains

49 conv-layers and 1 fc layer at the end of the network for classification. The

total number of weights and MACs for the whole network are 25.5M and 3.9G

respectively.


Figure 2.31: A comparison of DNN models, which gives indication for practical appli-

cations. Left: top-1 model accuracy on ImageNet dataset, right: computation complexity

comparison. (image credit: Alfredo Canziani [CPC16])

The basic residual module architecture is shown in Figure 2.30 (left). Let $X$ denote the output of the previous layer and $F(X)$ the output after performing the operations of the block, such as convolution, BN, etc., followed by a ReLU activation function. The final output $H(X)$ of the residual unit at the current layer is then defined by the following equation:

$H(X) = F(X) + X$    (2.27)

If $F(X) = 0$, then $H(X) = X$ is an identity mapping.
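A toy sketch of Equation 2.27 in NumPy, where the residual branch F is an arbitrary small transformation of the input (here a single weight matrix with ReLU, chosen purely for illustration rather than the actual conv/BN block of ResNet):

import numpy as np

def residual_block(x, W):
    F_x = np.maximum(W @ x, 0.0)   # F(X): the learned residual branch
    return F_x + x                 # H(X) = F(X) + X (Equation 2.27)

x = np.random.randn(8)
W_zero = np.zeros((8, 8))
print(np.allclose(residual_block(x, W_zero), x))  # F(X)=0 -> identity mapping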

The whole ResNet is built from stacked residual modules, each containing at least two conv-layers; the network periodically doubles the number of filters and downsamples the spatial dimensions using stride 2. Inspired by GoogLeNet, the authors also utilized the “bottleneck” design (cf. Figure 2.30 (right)) for the deeper ResNets (≥ 50 layers), which improves efficiency and reduces the model size.

2.2.2.7.5 MobileNet There are two main approaches that allow for execution on mobile devices. One is to quantize the weights and activations to lower-precision values; e.g., binary NNs use only 1 bit of storage per value, and our work BMXNet [YFBM17b, BYBM18] belongs to this group. The other is to compress the information in a CNN through a compact network design. These designs rely on full-precision floating-point numbers, but reduce the total number of parameters through a more efficient network design while preventing a loss of accuracy.

One of the most impactful works is MobileNet [HZC+17], introduced by Howard et al. in 2017. It uses the so-called depth-wise separable convolution technique, proposed by Chollet [Cho17]: the convolution layers apply a single 3×3 filter to each input channel, and a 1×1 convolution is then employed to combine their outputs. The authors tend to keep large activation maps throughout the network and downsample them late, in order to retain more information. The total numbers of weights and MACs for the whole MobileNet are 4.2M and 568M, respectively. On the ImageNet dataset, it achieves accuracy similar to VGG-16 and GoogLeNet but requires far fewer weights and MACs.
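The saving from depth-wise separable convolutions can be illustrated with a simple operation count in Python; the feature-map size and channel numbers are illustrative and not MobileNet's exact configuration:

# MACs of a standard conv vs. a depth-wise separable conv (3x3 kernels)
h, w, c_in, c_out, k = 56, 56, 128, 128, 3

standard = h * w * k * k * c_in * c_out            # one dense 3x3 convolution
depthwise = h * w * k * k * c_in                   # one 3x3 filter per input channel
pointwise = h * w * c_in * c_out                   # 1x1 conv combines the channels
separable = depthwise + pointwise
print(standard, separable, standard / separable)   # roughly an 8-9x reduction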

Other approaches in this group include Xception [Cho17], SqueezeNet [IHM+16], Deep Compression [HMD15], and ShuffleNet [ZZLS17]. These approaches reduce memory consumption, but still require GPU hardware for efficient training and inference; specific acceleration strategies for CPUs still need to be developed for these methods. Therefore, in our ongoing research work we developed BMXNet [YFBM17b], which is intended to tackle this problem and is presented in section 6.

2.2.2.7.6 Summary AlexNet, VGG, Inception-Net, ResNet and MobileNet are all in wide use and available in the model zoos of the mainstream DL frameworks. Canziani et al. (2017) [CPC16] provide a detailed analysis of DNN models with regard to practical use cases; Figure 2.31 shows their comparison results in terms of both accuracy and model complexity.

The recent research interests of DL architectures can be roughly summarized

as follows:

• Design of layer and/or skip connections

• Further improvement of the gradient flow

• The more recent trend of examining the necessity of depth vs. width, and of residual connections

• Efficient, lightweight models for low-power devices

2.2.2.8 Visualization Tool for Network Development

In this section, I will introduce a visualization technique called VisualBackProp [BCC+16], which can serve as a visual indicator of which features have been learned by a deep CNN model. We can visualize features extracted from deeper CNN layers at the same resolution as the input image; in other words, we can visualize which sets of pixels of the input image contribute most to the final predictions of a CNN model. Due to its high efficiency, it can easily be applied in real time during both training and inference. VisualBackProp therefore serves as a debugging tool for the development of CNN-based systems. Figure 2.32 demonstrates some input images and the corresponding feature representations; pixels in the highlighted regions are more strongly correlated with the prediction results.

Figure 2.32: Exemplary visualization of the VisualBackProp method. (image credit: Mariusz Bojarski [BCC+16])

Figure 2.33 depicts the block diagram of VisualBackProp. The method uses a

forward pass to obtain a prediction. Then, it uses the feature maps obtained after

each ReLU activation function. Subsequently, the feature maps from each layer

are averaged, resulting in a single feature map per layer. Next, the averaged

feature map of the deepest convolutional layer is scaled up to the size of the feature map of the previous layer, which is achieved by using a deconvolution operation. The authors point out that the deconvolution uses the same filter size and stride as the convolutional layer that produced the feature map being scaled up; all deconvolution weights are set to 1 and the biases to 0. A Hadamard product is then computed between the obtained scaled-up averaged feature map and the averaged feature map of the previous layer.

Figure 2.33: Block diagram of the VisualBackProp method. (image credit: Mariusz Bojarski [BCC+16])

This process continues layer by layer, exactly as described above, until the network's input is reached, as shown in Figure 2.33. Finally, a result mask with the same size as the input image is obtained, whose values are normalized to the range [0, 1]. From this process we can see that the method backpropagates feature maps instead of gradients; it thus needs neither additional gradient calculations nor a backward pass through the network. The obtained visualizations can serve as useful guidance for CNN development.

2.3 Recent Development in the Age of Deep Learning

As mentioned, DL is currently being applied in a large variety of application areas, which is why we refer to it as a universal learning approach [B+09]. The automatically learned features in DL approaches are robust to natural variations of the data, and the same DL architecture can be applied to different data types.


In this section, I will review some recent efforts in DL research and applications from different perspectives.

2.3.1 Success Factors

The recognized success factors of deep learning can be summarized in the following three points:

• Enormous labeled datasets have become available for training deep neural networks, e.g.,

– the ImageNet dataset for image classification with 14,197k images in 21,841 categories

– the YouTube-8M dataset¹ with 7 million videos

• The rapid development of hardware acceleration and the massive amount of computational power available

– Applying GPUs to neural network computation has become common

– The training time of very complicated neural networks has been reduced significantly: ten years ago it took several months, today we count it in days

– The rapid development of cloud, high-performance and distributed computing methods

• Working ideas on how to train deep neural networks:

– Stacked Restricted Boltzmann Machines (RBM), Hinton et al. 2006 [HS06]

– Stacked Autoencoders (AE), Bengio et al. 2008 [VLBM08]

The two works [HS06] and [VLBM08] are considered cornerstones of the recent deep learning revolution: they proved that deep neural networks can be trained and indicated further research directions. At the same time, the rapid development of high-performance computing and the appearance of large-scale labeled datasets contributed to this revolution as well.

¹ https://research.google.com/youtube8m/


Figure 2.34: Top-5 errors of essential DL models in ImageNet challenge

2.3.2 DL Applications

DL has achieved many outstanding successes in the fields of computer vision and speech recognition. ImageNet [DDS+09] is, in a real sense, the first large-scale dataset for visual recognition; its classification challenge consists of 1.2 million training images with 1000 object classes. In 2012, AlexNet won the ImageNet challenge and outperformed the second-place method by almost 10% in classification accuracy. Since then, DL models have continuously broken the records and have ruled nearly all computer vision competitions. Figure 2.34 shows the accuracy of the ImageNet winners over the years. The winner of 2015, ResNet-152, achieved a top-5 classification error of only 3.57%, which is better than the human error rate of 5.1% on this task [RDS+15].

DL methods have also achieved great success in the speech recognition and machine translation fields. For instance, on the popular speech recognition dataset TIMIT [Gar93], recently developed deep learning approaches surpass all previous methods; specifically, they outperform CRF-based methods by about 15% in phone error rate (PER). Similar breakthroughs have been obtained in the NLP field: for machine translation, significant improvements have been accomplished by the recent Neural Machine Translation approaches [BCB14, WSC+16].

Many other challenging problems that previously could not be solved efficiently have been solved with DL in the past few years, for instance: image and video captioning [WYBM16, VRD+15], cross-domain image-to-image style transfer using Generative Adversarial Networks (GANs) [IZZE17], beating humans in the strategy game Go [SSS+17], achieving dermatologist-level classification of skin cancer [EKN+17], and many more, as described in Appendix C.

2.3.3 DL Frameworks

In the past few years, a good number of open-source libraries and frameworks for DL have been published, providing a rich environment to choose from. Selected mainstream DL frameworks and SDKs are listed below:

• Tensorflow : https://www.tensorflow.org/

• Caffe : http://caffe.berkeleyvision.org/

• KERAS : https://keras.io/

• MXNET : https://mxnet.apache.org/

• Theano (development stopped) : http://deeplearning.net/software/theano/

• Torch : http://torch.ch/

• PyTorch : http://pytorch.org/

• Chainer : http://chainer.org/

• DeepLearning4J : https://deeplearning4j.org/

• DIGITS : https://developer.nvidia.com/digits

• CNTK : https://github.com/Microsoft/CNTK

• MatConvNet : http://www.vlfeat.org/matconvnet/


• cuDNN : https://developer.nvidia.com/cudnn

Figure 2.35: Activity of DL frameworks. Left: arXiv mentions as of March 3, 2018 (past 3 months); Right: GitHub aggregate activity, April - July 2017

Framework  | Core language | Platform                       | Interface                     | Distributed training | Model zoo | Multi-GPU          | Multi-threaded CPU
Caffe      | C++           | Linux, MacOS, Windows          | Python, Matlab                | No                   | yes       | Only data-parallel | yes
Tensorflow | C++           | Linux, MacOS, Windows          | Python, Java, Go              | yes                  | yes       | Most flexible      | yes
MXNet      | C++           | Linux, MacOS, Windows, Devices | Python, Scala, R, Julia, Perl | yes                  | yes       | yes                | yes
PyTorch    | C++           | Linux, MacOS, Windows          | Python                        | yes                  | yes       | yes                | yes
Chainer    | Python        | Linux                          | Python                        | yes                  | yes       | yes                | openblas
CNTK       | C++           | Windows, Linux                 | Python, C#                    | yes                  | yes       | yes                | yes

Table 2.2: DL framework features

Figure 2.35 illustrates the activity around several mainstream DL frameworks, which gives some insight into their impact. Google's Tensorflow [ABC+16] is without doubt currently the most popular DL framework, from both a research and a development perspective.

Table 2.2 compares the features of selected DL frameworks, such as core language, supported platforms, programming interfaces, GPU support, etc.

Table 2.3 shows a speed evaluation of selected mainstream DL frameworks performed by Maladkar et al. 2018. They used the CIFAR-10 dataset with 50,000 training samples and 10,000 test samples, uniformly distributed over ten classes. The same CNN was used across the different frameworks with GPU support. Two GPU types, Nvidia K80 and P100, with CUDA and cuDNN [CWV+14] support, were used in this evaluation. The numbers in the table are the average time in ms for feature extraction using a ResNet-50 model on 1,000 test images.


DL Library        | K80/CUDA8/cuDNN6 | P100/CUDA8/cuDNN6
Caffe2            | 148              | 54
Chainer           | 162              | 69
CNTK              | 163              | 53
Gluon             | 152              | 62
Keras(CNTK)       | 194              | 76
Keras(TensorFlow) | 241              | 76
Keras(Theano)     | 269              | 93
Tensorflow        | 173              | 57
Theano(Lasagne)   | 253              | 65
MXNet             | 145              | 51
PyTorch           | 169              | 51
Julia-Knet        | 159              | n/a

Table 2.3: DL frameworks benchmarking. Average time (ms) for 1000 images: ResNet-50 feature extraction (Source: analyticsindiamag.com)

We can observe that MXNet and Caffe2 demonstrate good processing speed on both GPUs; by contrast, Theano(Lasagne), Keras(Theano) and Keras(TensorFlow) obtained relatively poor results in terms of efficiency.
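For orientation, the kind of measurement reported in Table 2.3 can be reproduced with a few lines of code. The following is a minimal sketch (not the benchmark code used by Maladkar et al.), assuming PyTorch and torchvision are installed and a CUDA GPU is available; the batch size and image resolution are illustrative choices.

```python
import time
import torch
import torchvision

# Minimal sketch: average per-image time for ResNet-50 feature extraction.
# Assumptions: PyTorch + torchvision installed, CUDA GPU available.
device = torch.device("cuda")
model = torchvision.models.resnet50(pretrained=True).to(device).eval()

batch = torch.randn(32, 3, 224, 224, device=device)  # dummy input batch

with torch.no_grad():
    for _ in range(5):                  # warm-up iterations
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    n_batches = 32                      # 32 batches x 32 images = 1024 images
    for _ in range(n_batches):
        model(batch)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print("avg time per image: %.2f ms" % (elapsed / (n_batches * batch.size(0)) * 1000))
```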

2.3.4 Current Research Topics

In this section, I summarize some currently popular research topics in the DL community, including region-based CNNs, deep generative models, semi-supervised or weakly supervised models, interpretability research, energy-efficient models, multi-task and multi-module learning, and a novel deep model, the Capsule Network.

2.3.4.1 Region-based CNN

This research topic has made significant contributions to several AI applications, e.g., autonomous driving, mobile vision, drones, medical imaging, etc. Girshick et al. (2014) proposed the Region-based Convolutional Neural Network (R-CNN) [GDDM14] for object recognition. R-CNN consists of three modules, i.e., candidate region generation, visual feature extraction from the regions using CNNs, and a set of class-specific SVMs for object classification.

Due to the low computation speed of R-CNN, the same authors further proposed the Fast R-CNN framework [Gir15] (2015), which builds on the R-CNN architecture and achieves much faster processing. Fast R-CNN consists of convolutional and pooling layers, a selective search module for region proposals, and a sequence of fully connected layers.

In the same year, Ren et al. proposed Faster R-CNN [RHGS15], which uses a Region Proposal Network (RPN) for real-time object detection. The RPN is a fully convolutional network which can efficiently generate region proposals. Moreover, the whole system, consisting of the RPN and a recognition network, can be optimized end-to-end. Many AI applications, such as autonomous driving and medical image processing, use this approach for their object detection engine. Other well-known approaches such as SSD [LAE+16] and YOLO [RDGF16] also belong to this category and further improve the processing speed.

He et al. (2017) proposed Mask R-CNN [HGDG17] for instance segmentation. Mask R-CNN extends the Faster R-CNN architecture with an extra branch that predicts object masks, which established a new state of the art.
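To illustrate how accessible this family of detectors has become, the following minimal sketch runs a pre-trained Faster R-CNN from torchvision on a single image. It is not the original implementation of [RHGS15]; the image file name and the score threshold of 0.8 are hypothetical, illustrative choices.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Minimal inference sketch with a pre-trained Faster R-CNN (ResNet-50 FPN backbone).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("street_scene.jpg").convert("RGB"))  # hypothetical input file

with torch.no_grad():
    prediction = model([image])[0]  # the model takes a list of image tensors

# Keep only confident detections (threshold chosen for illustration).
keep = prediction["scores"] > 0.8
print(prediction["boxes"][keep])   # [x1, y1, x2, y2] per detected object
print(prediction["labels"][keep])  # COCO category indices
```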

2.3.4.2 Deep Generative Models

In machine learning, generative models are used for modeling data with a probability density function. Generally, such a model is a probabilistic model of the joint probability distribution over observations and target (label) values. The recent development of deep generative models has enabled many promising applications: we can now generate different types of images, human speech, music, poems, texts, and many other things using deep generative models.

Van den Oord et al. proposed WaveNet, a deep neural network for generating raw audio [VDODZ+16]. WaveNet is composed of a stack of CNN layers and a softmax distribution layer for the outputs. Since WaveNet models audio directly at the waveform level, it can be applied to generate any acoustic signal, such as human speech, music, etc.

Gregor et al. proposed DRAW, a recurrent neural network for image generation [GDG+15]. DRAW is based on the Variational Auto-Encoder (VAE) architecture and applies RNNs for both the encoder and the decoder. Moreover, it introduces a human-like dynamic attention mechanism, which further improves the system performance.

Goodfellow et al. proposed Generative Adversarial Networks (GANs) for estimating generative models with an adversarial process [GPAM+14]. GAN is generally referred to as an unsupervised deep learning approach, which offers an alternative to maximum likelihood estimation techniques. The standard GAN framework consists of two different models: a generator G, which tries to capture the real data distribution in order to generate realistic-looking fake samples, and a discriminator D, which tries to become better at distinguishing real from fake samples. G maps a latent space to the data space by receiving Gaussian noise z as input and applying transformations to it to generate new samples, while D maps a given sample to the probability P of it coming from the real data distribution. In the ideal setting, given enough training epochs, G would eventually produce samples so realistic that D could no longer distinguish real from fake; D would then assign P = 0.5 to all samples, no matter whether they come from the real or the fake data distribution. However, given the training instability inherent to GANs, this equilibrium is hard to reach and is hardly ever achieved in practice.

As illustrated in Figure 2.36, the two players D and G play a minimax game with the value function V(D, G), expressed as follows:

min_G max_D V(D, G) = E_{x∼P_r(x)}[log(D(x))] + E_{z∼P_z(z)}[log(1 − D(G(z)))]   (2.28)

where P_z(z) represents the noise distribution used to sample G's input, and G(z) represents its output, which can be considered a fake sample originating from mapping the input noise to the data space. In contrast, P_r(x) represents the real data distribution, and D(x) represents the probability P of sample x being a real sample from the training set. To maximize Equation 2.28, D's goal is to maximize the probability of correctly classifying a sample as real or fake, i.e., to become better at distinguishing the two cases by assigning P close to 1 to real images and P close to 0 to generated images. On the contrary, to minimize Equation 2.28, G tries to minimize the probability of its generated samples being classified as fake, by fooling D into assigning them a P value close to 1.
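To make the interplay described by Equation 2.28 concrete, the following is a minimal training-loop sketch in PyTorch. It assumes a toy 2-D data distribution and simple fully connected G and D, and it uses the common binary cross-entropy formulation of the two objectives (with the non-saturating generator loss); it is an illustrative sketch, not a faithful reproduction of any published GAN implementation.

```python
import torch
import torch.nn as nn

# Minimal GAN sketch on toy 2-D data (assumption: "real" data ~ N([2, 2], I)).
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))                 # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 2) + 2.0          # samples from the "real" distribution
    z = torch.randn(64, 8)                   # Gaussian noise input for G
    fake = G(z)

    # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step (non-saturating variant): push D(G(z)) towards 1.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```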


Figure 2.36: Architecture of Generative Adversarial Networks (image source: Gharakhanian)

Beyond the vanilla GAN [GPAM+14], a number of improved versions have been proposed in recent years [SGZ+16]. The newly introduced approaches mainly address two problems: first, meaningfully correlating the loss metric with the generator's convergence and sample quality; second, improving the stability of the optimization process. In practice, GANs can produce photorealistic images for applications such as the visualization of interior or industrial design, shoes, bags, and clothing items.

2.3.4.3 Weakly Supervised Model, e.g., Deep Reinforcement Learning

Reinforcement Learning (RL) is categorized as a weakly (or semi-) supervised learning method. It uses a reward and punishment system to evaluate the next action generated by the learning model; it is mostly used for games and robots and usually solves decision-making problems.

Recently, Deep Reinforcement Learning (DRL), a combination of DNNs and RL, has become a hot topic due to its great success in mastering games. AI bots are beating human world champions and grandmasters in strategy and other games, for instance AlphaGo and AlphaGo Zero for the game of Go [SSS+17], Atari [MKS+15], Dota 2 [Blo18], and chess and shogi [SHS+17].


Many semi-supervised and unsupervised techniques have been implemented based on the concept of RL. In RL, there is no clearly defined loss function, which makes the learning process more difficult compared to fully supervised approaches. The fundamental differences between RL and fully supervised methods can be summarized as follows: first, we do not have full access to the objective function being optimized; second, RL uses an interactive querying process in which the agent interacts with a state-based environment; third, the input depends on the previous actions. It is foreseeable that DRL will remain an essential research direction for robotics and automation systems for a long time to come.
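The interaction loop described above can be illustrated even without a deep network. The following minimal sketch uses tabular Q-learning on a hypothetical 1-D corridor environment (all names, reward values and hyperparameters are illustrative assumptions, not part of any cited system); in DRL, the Q-table would be replaced by a deep network such as the Q-network of [MKS+15].

```python
import random

# Tabular Q-learning on a toy corridor: states 0..4, reward only at the last state.
n_states, n_actions = 5, 2             # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

def step(state, action):
    """Environment dynamics: move left/right, reward 1 when reaching the last state."""
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: the agent queries the environment interactively.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: the reward signal replaces an explicit supervised loss.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state
```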

2.3.4.4 Interpretable Machine Learning Research

Interpretable Machine Learning is a growing research area which aims to present the reasoning of an ML system to humans so that we can verify it and understand it better. Although DNNs have shown superior performance in a broad range of tasks, interpretability remains the Achilles' heel of deep learning models. In DL, people tend to convert a new problem into an objective function and solve it by optimizing this objective function in an end-to-end fashion. This end-to-end learning strategy turns DNN representations into a "black box": except for the final output (mostly, the prediction results), it is hard for a human to understand the logic of the DNN's intermediate computations hidden inside the network (the states of the hidden layers).

In recent years, a growing number of researchers have realized that high model interpretability is of significant value in both theory and practice. For instance, in a control system, interpreting ML models can help us to identify incorrect decisions, which is especially meaningful for automated medical systems and robots. Interpretability is crucial for applications where a single wrong decision can be extremely costly, e.g., self-driving cars.

According to [ZZ18], we can roughly categorize the recent studies on visual

interpretability for DL as follows:

• Visualizing the intermediate representations of a DNN. Gradient-based methods such as [ZF14, SVZ13, SDBR14] are the mainstream approaches in this category. These methods mainly compute the gradients of the score of a given convolutional unit with respect to the input image and use them to estimate the image appearance that maximizes the unit's score. Olah et al. (2017) [OMS17] proposed a toolbox of existing techniques to visualize the feature patterns encoded in different convolutional layers of a pre-trained DNN. The up-convolutional net [DB16] proposed by Dosovitskiy et al. (2016) is another typical technique to visualize CNN representations: it inverts CNN feature maps back to images and can thus be regarded as a tool that indirectly illustrates the image appearance corresponding to a feature map. The mentioned methods mainly invert the feature maps of a convolutional layer back to the input image or synthesize the image that maximizes the score of a given unit in a pre-trained CNN.

VisualBackProp (2017) is a related approach [BCC+16] which back-propagates the image (feature-map) values instead of the gradients. With VisualBackProp we can obtain visualizations of higher abstractions from the deeper layers at a higher resolution. Due to its high computational efficiency, VisualBackProp can be used as a debugging tool during the development of CNN-based systems. (A minimal gradient-based saliency and adversarial-perturbation sketch is given after this list.)

• Diagnosis of CNN representations. Some recent methods go beyond visualization and diagnose CNN representations further in order to extract understandable insights about the features encoded in a deep model. For example, the work in [Lu15, AR15] studied the feature distributions of different categories/attributes in the feature space of a pre-trained CNN. The work in [SVK17, KL17] aimed to compute adversarial samples for pre-trained CNNs in order to estimate the vulnerable points in the feature space. In other words, these studies try to determine the minimum noisy perturbation of the input image to which the final prediction is sensitive (see the sketch after this list). Influence functions can also provide plausible ways to create training samples that attack the learning of CNN models. Such samples are so-called adversarial samples, which have become a derived research topic at the intersection of internet security and ML.


• Disentanglement of “the mixture of patterns” encoded in the learned filters

of CNNs. These studies mainly disentangle complex representations in con-

volutional layers and transform network representations into interpretable

graphs. Related work: [ZCS+17, FH17]

• Semantic-level middle-to-end learning via human-computer interaction. A

clear semantic disentanglement of CNN representations may further enable

middle-to-end learning of neural networks with weak supervision. Related

work: [ZCWZ17, ZCZ+17]
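The gradient machinery behind both the visualization methods and the adversarial-sample studies referenced above fits in a few lines. The following is a minimal sketch, assuming a pre-trained torchvision classifier and a placeholder for an already normalized input tensor of shape (1, 3, 224, 224); it computes a simple gradient saliency map in the spirit of [SVZ13] and an FGSM-style perturbation (the step size eps is an arbitrary illustrative value), not the exact procedures of the cited papers.

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()

x = torch.randn(1, 3, 224, 224, requires_grad=True)  # placeholder for a normalized image
scores = model(x)
pred = scores.argmax(dim=1)

# Gradient of the predicted-class score w.r.t. the input pixels.
scores[0, pred].backward(retain_graph=True)
saliency = x.grad.abs().max(dim=1)[0]      # simple saliency map, in the spirit of [SVZ13]

# FGSM-style adversarial example: ascend the classification loss by one signed step.
x.grad.zero_()
loss = F.cross_entropy(scores, pred)
loss.backward()
eps = 0.03                                  # illustrative perturbation budget
x_adv = (x + eps * x.grad.sign()).detach()
```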

2.3.4.5 Energy Efficient Models for Low-power Devices

State-of-the-art deep models are computationally expensive and consume large amounts of storage space. At the same time, deep learning is strongly demanded by numerous applications in areas such as mobile platforms, wearable devices, autonomous robots and IoT devices. How to efficiently run deep models on such low-power devices has therefore become a challenging research problem. Recent research in this field can be roughly divided into two groups. The first is lightweight DNNs, which reduce the number of parameters through a compact design of the network architecture; these designs rely on full-precision floating-point numbers and try to prevent a loss of accuracy. The second possible solution is low-bit neural networks: the network information is compressed by avoiding the common use of full-precision floating-point weights, which require 32 bits of storage. Instead, quantized numbers with lower precision (e.g., 8 bits of storage) or even binary weights (1 bit of storage) are used in these approaches.

SqueezeNet was presented by Iandola et al. [IHM+16] in 2016. The authors replace a significant portion of the 3×3 filters in the convolutional layers with smaller 1×1 filters and reduce the number of input channels to the remaining 3×3 filters, resulting in fewer parameters. Additionally, they use late downsampling to maximize accuracy given the lower number of weights. Further compression is achieved by applying deep compression [HMD15] to the model, for an overall model size of 0.5 MB. MobileNet was introduced by Howard et al. [HZC+17]. They apply depth-wise separable convolutions, where a single 3×3 filter is applied to each input channel and a 1×1 convolution is then employed to combine the outputs. Zhang et al. [ZZLS17] use channel shuffling to achieve group convolutions in addition to depth-wise convolution; the resulting ShuffleNet achieves a comparably lower error rate for the same number of operations needed by MobileNet. In 2018, Tan et al. proposed MnasNet [TCP+18], whose network structure was created by an automatic search method inspired by NasNet [ZVSL17]; the authors extended NasNet to make it suitable for mobile devices. In terms of accuracy, number of parameters and running speed, it has surpassed all previously hand-designed lightweight DNNs, including MobileNet V2 [SHZ+18], which was recently released by Google. The mentioned approaches reduce memory requirements, but still require GPU hardware for efficient training and inference; specific acceleration strategies for CPUs still need to be developed for these methods.
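The central building block behind MobileNet-style architectures is easy to express directly. The following minimal sketch (a generic PyTorch formulation, not the original MobileNet code) contrasts a standard convolution with a depth-wise separable one; the channel sizes are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 64, 128

# Standard 3x3 convolution: in_ch * out_ch * 3 * 3 weights (plus biases).
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depth-wise separable convolution:
#   (1) one 3x3 filter per input channel (groups=in_ch), then
#   (2) a 1x1 convolution that mixes the channels.
depthwise_separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
    nn.Conv2d(in_ch, out_ch, kernel_size=1),
)

x = torch.randn(1, in_ch, 56, 56)
assert standard(x).shape == depthwise_separable(x).shape

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise_separable))  # roughly an 8x parameter reduction here
```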

In contrast, approaches that use binary weights instead of full-precision weights can achieve both compression and acceleration. However, the drawback is usually a severe drop in accuracy. For instance, the weights and activations in BNN are restricted to either +1 or -1, as presented by Hubara et al. [HCS+16]. XNOR-Net, proposed by Rastegari et al. [RORF16], builds on a similar idea; they suggested a channel-wise scaling factor to improve the approximation of full-precision weights, but require the weights between layers to be stored as full-precision numbers. DoReFa-Net was presented by Zhou et al. [ZWN+16]; it focuses on quantizing the gradients together with different bit-widths (down to binary values) for weights and activations. Lin et al. proposed ABC-Net [LZP17], which achieves a drop in top-1 accuracy of only about 5% on the ImageNet dataset compared to a full-precision network using the ResNet-18 architecture. However, this result is based on five binary weight bases, so the approach significantly increases model complexity and size. Finding a way to train a binary neural network accurately therefore remains an unsolved task.
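The core trick that makes such binary networks trainable with gradient descent is usually a sign function in the forward pass combined with a straight-through estimator (STE) in the backward pass. The following is a minimal PyTorch sketch of that idea, in the spirit of [HCS+16] rather than a faithful reproduction of any particular BNN implementation; the toy loss at the end is only there to show that gradients reach the latent full-precision weights.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Binarize weights/activations to {-1, +1}; pass gradients straight through."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Straight-through estimator: keep the gradient where |w| <= 1, cancel it elsewhere.
        return grad_output * (w.abs() <= 1).float()

w = torch.randn(4, 4, requires_grad=True)   # full-precision "latent" weights
w_bin = BinarizeSTE.apply(w)                # binary weights used in the forward pass
loss = (w_bin.sum() - 1.0) ** 2             # toy loss for illustration only
loss.backward()                             # gradients flow back to the latent weights
```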

2.3.4.6 Multitask and Multi-module Learning

Multitask learning is inspired by biological neural networks, e.g., the human vision system, which can simultaneously perform different tasks such as reading text, recognizing objects, distinguishing facial expressions, etc.

Figure 2.37: Multitask Learning Example

Figure 2.37 shows an example of multitask learning, where the tasks are distinguishing {dog or human} and {boy or girl}, respectively. We can use one neural network to solve both tasks through a joint feature learning process. This way, more general features can be learned in the layers shared by the two tasks; we can learn features that might not be easy to obtain from the original task alone. In this example, the features learned for the task {boy or girl} are beneficial for distinguishing a girl from a dog with long hair in the other task.

On the other hand, the parts of the representation that are not related to a given task act as noise for that task during learning, and such noise can improve the generalization ability of the model. Another benefit concerns the optimization objective: the local minima of different tasks lie in different locations, so the interaction of multiple tasks can help the optimization escape from local minima.

Many recently proposed DL applications apply multitask learning techniques, for instance Faster R-CNN [RHGS15] and Mask R-CNN [HGDG17] for object detection and instance segmentation, the Neural Visual Translator [WYBM16] for image captioning, SEE [BYM18] for scene text detection and recognition, ACMR [WYX+17] for cross-modal multimedia retrieval, and many others. A minimal sketch of the shared-backbone idea is given below.
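The following PyTorch sketch illustrates the shared-backbone scheme from Figure 2.37; the layer sizes, task names and the equally weighted loss sum are illustrative assumptions rather than part of any cited system.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """One shared feature extractor, one output head per task."""

    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(              # shared layers learn general features
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head_species = nn.Linear(16, 2)      # task 1: dog vs. human
        self.head_gender = nn.Linear(16, 2)       # task 2: boy vs. girl

    def forward(self, x):
        feat = self.shared(x)
        return self.head_species(feat), self.head_gender(feat)

model = MultiTaskNet()
x = torch.randn(8, 3, 64, 64)                     # dummy image batch
y1, y2 = torch.randint(0, 2, (8,)), torch.randint(0, 2, (8,))
out1, out2 = model(x)

# Joint training: the two task losses are simply summed (equal weights assumed here).
loss = nn.functional.cross_entropy(out1, y1) + nn.functional.cross_entropy(out2, y2)
loss.backward()
```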


2.3.4.7 Capsule Networks

Sabour et al. proposed Capsule Networks (CapsNet) [SFH17] (2017), which can be considered one of the most recent theoretical breakthroughs in DL. A CapsNet usually contains several convolutional layers and one capsule layer at the end. This work intends to address several current limitations of CNNs. Standard CNNs rely on so-called invariant representations:

Represent(X) = Represent(Transform(X)) (2.29)

where X is the input to a CNN. If we want a CNN to recognize inputs under small translations, rotations or distortions, we usually first have to transform the training data with those transformations and then let the CNN learn the transformed representations; the CNN itself cannot distinguish inputs with or without geometric transformations. CapsNet, on the contrary, intends to learn so-called equivariant representations, formulated as:

Transform(Represent(X)) = Represent(Transform(X)) (2.30)

where we do not lose the location and transformation information of an object. CapsNet uses layers of capsules instead of layers of neurons, where a capsule consists of a set of neurons. In the context of computer vision, a capsule can be considered a visual entity: active low-level entities make predictions, and when multiple predictions agree, a high-level entity becomes active. Low-level capsules account for significant changes in the location of objects in the visual content, while high-level capsules account for changes in the visual content itself. A routing-by-agreement mechanism is used for the communication between different levels of capsule layers, instead of the standard pooling operation, since conventional pooling loses the location information of the low-level entities, and this location information may express essential semantic properties for visual recognition tasks.
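For reference, the capsule outputs in [SFH17] are normalized with a "squash" non-linearity that preserves the orientation of a capsule vector while mapping its length into [0, 1). A minimal PyTorch sketch of this function is given below; the small epsilon for numerical stability and the example capsule shapes are my own assumptions.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Squash capsule vectors: short vectors shrink towards 0, long ones approach length 1."""
    squared_norm = (s ** 2).sum(dim=dim, keepdim=True)
    scale = squared_norm / (1.0 + squared_norm)
    return scale * s / torch.sqrt(squared_norm + eps)

# Example: 32 capsules with 8-dimensional pose vectors.
capsules = torch.randn(32, 8)
v = squash(capsules)
print(v.norm(dim=-1).max())  # all output lengths are strictly below 1
```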

CapsNet has achieved promising results on several small and medium-sized datasets, such as MNIST and Multi-digit MNIST. However, before it can become popular, its applicability still needs to be proven on large-scale datasets such as ImageNet, and its value needs to be demonstrated in real-world applications.


2.3.5 Current Limitation of DL

The current limitations of DL approaches can be summarized according to the

following aspects:

• There are many hyperparameters, such as the network architecture, learning rate, initialization scheme, loss function, activation function, etc., which have to be determined when training a DL model. Tuning such hyperparameters requires expert knowledge. Although recently proposed techniques such as AutoML [TCP+18] allow us to search for optimal hyperparameters automatically, the search process is extremely time- and cost-consuming. For instance, Google researchers utilized 800 modern GPUs for their neural architecture search method on the CIFAR-10 dataset [ZL16], and the search took several months.

• As mentioned in section 2.3.4, adversarial samples are a class of samples that are maliciously designed to attack machine learning models. It is almost impossible to distinguish real from adversarial samples with the naked eye, yet such adversarial samples lead a DL model, but not a human, to a wrong judgment. Given its particular importance in many application areas, the attack on and defense against adversarial samples has become a new, active research field.

• DL models have very limited interpretability. However, some recent methods have achieved promising results on this problem (cf. section 2.3.4.4), and from a long-term perspective, interpretability will remain an actively studied topic in the community.

• Gary Marcus published a critical review of DL methods [Mar18], in which he discusses their nature and limitations, such as requiring large amounts of data, not being sufficiently transparent, not being well integrated with prior knowledge, struggling with open-ended inference, and being unable to distinguish causation from correlation [Mar18]. He also argued that DL assumes a stable world and works as an approximation, but its answers often cannot be fully trusted; DL is difficult to engineer with and carries the risk of excessive hype. He suggested reconceptualizing DL by considering the possibilities of unsupervised learning, symbol manipulation and hybrid models, learning from cognitive science and psychology, and taking on bolder challenges [Mar18].

2.3.6 Applicable Scenario of DL

In this sub-section, we summarize some applicable scenarios of DL based on its advantages as well as its limitations. DL is employed in several situations where machine intelligence is beneficial:

• Learning a function that maps well-defined inputs to well-defined outputs.

– Classification tasks, e.g., labeling images of bird breeds

– Regression tasks, e.g., analyzing a loan application to predict the like-

lihood of future default

• Absence of a human expert (e.g., navigation on Mars), or situations where humans are unable to explain their expertise (e.g., speech recognition, computer vision).

• The task provides clear feedback with clearly definable goals and metrics.

• No need for a detailed explanation of how the decision was made.

– For instance, an unexplainable model may not be suitable for medical imaging tasks such as breast cancer prediction on MRI.

• The problem is too large for limited human reasoning capabilities, while, on the other hand, large (digital) datasets containing input-output pairs exist or can be created.

• DL (or ML) systems are less effective when the task requires long chains of

reasoning or complex planning that rely on common sense or background

knowledge unknown to the computer.


• Suitable for tasks where no specialized dexterity, physical skills, or mobility is required. This restriction is due to the current limitations of robotics, which will certainly improve in the future.


PART II:

Selected Publications


3

Assignment to the Research Questions

In this chapter, a list of selected publications from Appendix B is presented and assigned to the research questions.

• Q1: DL is data-hungry; how can we alleviate the reliance on substantial data annotations? Through synthetic data, and/or through unsupervised and semi-supervised learning methods?

Q2: How can we perform multiple computer vision tasks with a uniform

end-to-end neural network architecture?

Publication:

– “SceneTextReg: A Real-Time Video OCR System” [YWBM16], cf.

chapter 4

– “SEE: Towards Semi-Supervised End-to-End Scene text Recognition”

[BYM18], cf. chapter 5

• Q3: How can we apply DL models on low-power devices such as smartphones, embedded devices, wearables, and IoT devices?

Publication:

– “BMXNet: An Open-Source Binary Neural Network Implementation

Based on MXNet” [YFBM17b], cf. chapter 6


– “Learning to Train a Binary Neural Network” [BYBM18] and “Back to

Simplicity: How to Train Accurate BNNs from Scratch?” [BYBM19],

cf. chapter 6

• Q4: Can DL models master multimodal and cross-modal representation learning tasks?

Publication:

– “Image Captioning with Deep Bidirectional LSTMs and Multi-Task

Learning” [WYM18], cf. chapter 7

– “A Deep Semantic Framework for Multimodal Representation Learn-

ing” [WYM16a], cf. chapter 8

• Q5: Can we effectively and efficiently apply multimedia analysis and DL

algorithms in real-world applications?

Publication:

– “Automatic Online Lecture Highlighting Based on Multimedia Analy-

sis” [CYM18], cf. chapter 9

– “Recurrent Generative Adversarial Network for Learning Imbalanced

Medical Image Semantic Segmentation” [RYM19b], cf. chapter 10


4

SceneTextReg

In this paper, we present a system for real-time video text recognition. The system is based on the standard workflow of a text-spotting system, which includes a text detection and a word recognition procedure. We apply deep neural networks in both procedures. Our current implementation demonstrates real-time performance for recognizing scene text using a standard laptop with a webcam. The word recognizer achieves results that are quite competitive with state-of-the-art methods while using only synthetic training data.

4.1 Contribution to the Work

• Main contributor to the formulation and implementation of research ideas

• Main contributor to the conceptual implementation

• Main contributor to the technical implementation

• Core maintainer of the software project

4.2 Manuscript

This manuscript is an extended version of the original paper, in which I elaborated on the following sections: system design, evaluation, and future work, to further improve the legibility of the article. In addition to the manuscript, I prepared a demo video to demonstrate the proposed real-time system.1

1https://youtu.be/fSacIqTrD9I


SceneTextReg: A Real-Time Video OCR System

Haojin Yang, Cheng Wang, Christian Bartz, Christoph Meinel
Hasso Plattner Institute (HPI), University of Potsdam, Germany
P.O. Box 900460, D-14440 Potsdam

{haojin.yang, cheng.wang, meinel}@hpi.de{christian.bartz}@student.hpi.uni-potsdam.de

ABSTRACT
We present a system for real-time video text recognition. The system is based on the standard workflow of a text spotting system, which includes text detection and word recognition procedures. We apply deep neural networks in both procedures. In the text localization stage, textual candidates are roughly captured by using a Maximally Stable Extremal Regions (MSERs) detector with a high recall rate; false alarms are then eliminated by using a Convolutional Neural Network (CNN) verifier. For word recognition, we developed a skeleton based method for segmenting the text region from its background, then a CNN+LSTM (Long Short-Term Memory) based word recognizer is utilized for recognizing texts. Our current implementation demonstrates real-time performance for recognizing scene text by using a standard laptop with webcam1. The word recognizer achieves competitive results to state-of-the-art methods by only using synthetic training data.

CCS Concepts•Computing methodologies → Computer vision; Vi-sual content-based indexing and retrieval; •Computersystems organization → Real-time systems;

KeywordsVideo OCR; Multimedia Indexing; Deep Neural Networks

1. INTRODUCTIONThe amount of video data available on the World Wide

Web (WWW ) is growing rapidly. According to the officialstatistic-report of YouTube, 100 hours of video are uploadedevery minute. Therefore, how to efficiently retrieve videodata on the WWW or within large video archives has becomean essential and challenging task.

1A demo video is prepared: https://youtu.be/fSacIqTrD9I

Permission to make digital or hard copies of part or all of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for third-party components of this work must be honored.For all other uses, contact the owner/author(s).

MM ’16 October 15-19, 2016, Amsterdam, Netherlandsc© 2016 Copyright held by the owner/author(s).

ACM ISBN 978-1-4503-3603-1/16/10.

DOI: http://dx.doi.org/10.1145/2964284.2973811

On the other hand, due to the rapid popularization ofsmart mobile and wearable devices, large amounts of self-recorded “lifelogging” videos are created. Generally, it lacksmetadata for indexing such video data, since the only search-able textual content is often the title given by the uploader,which is typically brief and subjective. A more general so-lution is highly desired for gathering video metadata auto-matically.

Text in video is one of the most important high-level se-mantic feature, which directly depicts the video content. Ingeneral, text displayed in a video can be categorized intoscene text and overlay text (or artificial text). In contrastto overlay text, to detect and recognize scene text is oftenmore challenging. Numerous problems affecting the recogni-tion results, as e.g., texts appeared in a nature scene imagecan be in a very small size with the high variety of contrast;motion changes of the camera may affect the size, shape,and brightness of text content, and may lead to geometri-cal distortion. All of those factors have to be considered inorder to obtain a correct recognition result.

Most of the proposed scene-text recognition methods canbe briefly divided into two categories, either based on con-nected components (CCs) or sliding windows. The CCsbased approaches include Stroke Width Transform (SWT )[5], MSERs [17], Oriented Stroke [18] etc. One of the signifi-cant benefits of CCs based method is its computational effi-ciency since the detection is often a one pass process acrossimage pixels. The sliding window based methods as e.g.,[20, 3, 19, 9] usually apply representative visual features totrain a machine learning classifier for text detection. Herehand-crafted features [19, 14, 1], as well as deep features[20, 9] can be applied, and text regions will be detected byscanning the whole image with a sub-window in multiplescales with a potential overlapping. In [20, 3, 9], slidingwindow based methods with deep features achieved promis-ing accuracy for end-to-end text recognition. [22] proposeto consider scene text detection as a semantic segmenta-tion problem, by which their Fully Convolutional Network(FCN ) performs per-pixel prediction for classifying text andbackground. However, their proposed approaches may hardto achieve sufficient performance for real-time applicationdue to the expensive computation time.

In our approach, we intended to take advantages of bothcategories, i.e. the computation benefit of CCs based al-gorithm and the powerful text-classification ability of deepfeatures. The demonstrated system achieves real-time per-formance2 on a standard laptop (3.2 GHz CPU×4, 8G RAM,

2Similar to [17], we consider the real-time ability of a video

Figure 1: Network architecture of the verification CNN model. The numbers in the layer names e.g.,“conv1 64x5x5” describes the following information: layer type, the number of output feature maps andconvolution kernel width and height. “fc” indicates the fully-connected layer, followed by the number ofoutputs in the layer name.

Figure 2: CNN based text candidate verification.

NVIDIA GeForce 860M) with a webcam.The rest of the paper is organized as follows: section 2

demonstrates the overall system design, section 3 presentsthe evaluation results by using opened benchmark datasetand section 4 conclude the paper with an outlook on futurework.

2. SYSTEM DESIGNIn this section, we will describe the detailed workflow of

the proposed system, and report evaluation results on IC-DAR 2015 Robust Reading Competition Challenge 2 - Task3 “Focused Scene Word Recognition” [12].

2.1 Text DetectionIn [20, 10, 9], the authors were intended to achieve a bet-

ter end-to-end text recognition accuracy. Therefore in thetext detection step, their systems have been tuned to pro-duce text candidates with high recall and the subsequentrecognition engines will further eliminate the false alarms.Since our goal is to design a text spotting system with thereal-time ability and the recognition procedure is often timeconsuming, we thus keep the text detection result as accu-rate as possible and only pass the text candidates with highconfidence to the recognition stage. We apply a MESRs [15]based detector to roughly detect character candidates fromthe input video frame with a high recall rate. In order toincrease the recall rate, we apply both RGB and HSV colorchannels for ERs (Extremal Regions) detection. All candi-date regions are further verified by using a grouping methodand a Convolutional Neural Network (CNN) [13] classifier(cf. Figure 2). We utilize the CNN architecture inspired

text recognition system if its response time is comparable toa human.

by [6] which consists of two convolution and three fully-connected layers, as shown in Figure 1. The correspondinghyperparameters such as filter size and the number of fil-ters are also given in the figure, which has been determinedempirically.

To prepare the training samples, we applied a coarse-to-fine strategy. We created text image samples by using oursample generator, which will be discussed in detail in thesubsequent sections. We first used 5000 samples for eachclass to train an initial text verification model. For collectingnon-text samples, we applied our initial model on ImageNet[4] images and further manually collected the false positives.We iteratively update the model by increasing both posi-tive and negative training samples, until the model achievedgood performance on the testing dataset (150k samples foreach class). We used around 2 million non-text samplesand 4 million text samples to train our final model. Theachieved classification accuracy and F1-score is about 99.3%and 99.0%, respectively.

2.2 Text SegmentationWe developed a novel skeleton-based approach for text

segmentation, which will simplify the further OCR process.In short, we determine the text gradient direction for eachtext candidate by analyzing the content distribution of theirskeleton maps. We then calculate the threshold value forseed-selection by using the skeleton map which has beencreated with the correct gradient direction. Subsequently,a seed-region growing procedure starts from each seed pixeland extends the seed-region in its north, south, east, andwest directions. The region iteratively grows until it reachesthe character boundary. This method achieved the firstplace in ICDAR 2011 text segmentation challenge for borndigital images. More detailed description of this method canbe found in [21].

2.3 Word RecognitionIn this step, we first separate the verified text candidates

into words. Then, the word recognition is accomplished byperforming joint-training of an RCNN (Recurrent Convolu-tion Neural Network) model, which consists of a CNN anda Long Short-Term Memory (LSTM) network [7], followedby a standard spell checker.

Figure 3 depicts the CNN part of the proposed recog-nition network, which serves as a visual feature extractor.This CNN network consists of five conv-layers and threefully-connected layers. Max-pooling is used after the 1st,2nd and 4th conv-layer; ReLU activation function [16] and

Figure 3: Network architecture of the recognition CNN model, which serves as the visual feature extractor inthe whole recognition network. The numbers in the layer names e.g., “conv1 64x5x5” describes the followinginformation: layer type, the number of output feature maps and convolution kernel width and height. “fc”indicates the fully-connected layer, followed by the number of outputs in the layer name.

Figure 4: Network architecture of the recognitionLSTM model. Both sentence (word embedding)and CNN visual features (visual embedding) are fedinto the LSTM layers, which conducts a multi-modal(visual-language) learning problem.

Batch Normalization (BN) [8] are applied after every conv-and fc-layers except the last one through the network. Thefinal output feature vector has a dimension of 1794, whichis further fed into a LSTM network, depicted in Figure 4.

In the LSTM network, the input “CNN image features”

denotes the CNN network output, which will be fed intothe LSTM layer “lstm joint”. The word input is embeddedby using an embedding layer, and then the embedded wordvectors are fed into an LSTM layer, where the sequentialword representation can be obtained. Subsequently, a jointrepresentation of image and text modalities are learned inthe LSTM layer “lstm joint”. The final output of the LSTMnetwork is the softmax distribution of 78 entries includingEnglish characters, numbers and other special characters.

To improve the performance of our recognition network,we developed a data engine, which generated 9 millions ofsynthetic training data by considering different scene factors,discussed in detail in the next section.

2.4 Synthetic Data GenerationThe commonly used benchmark datasets for scene text

recognition, such as the ICDAR dataset, consist of a trainingdataset containing images used for training the system anda test dataset that acts as the benchmark dataset. Thosetraining datasets, most of the time, do not have more than2000 image samples, which is not enough for building a deeplearning model with tens of millions of parameters. It is thusnecessary to get more training data that have the same vi-sual complexity as real-world images. Image portals such asFlickr3 or Instagram4 provide a lot of user-uploaded photosthat may contain scene text, but none of these images hasannotations. Therefore, a ground truth that we can use fortraining the deep model has to be created manually, whichis a time- and cost-consuming task. A better solution is togenerate synthetic data with very closed properties as thereal-world data. We thus can use the synthetic data as agood estimator for the data a trained network has to expectwhen being fed with real-world data. We developed an im-age data engine for this job that is capable of performingthe following operations:

• Generating arbitrary text strings using fonts with a

3 https://www.flickr.com/
4 https://www.instagram.com

Figure 5: Comparison of generated samples withreal-world images from the ICDAR dataset of therobust reading challenge. (Left) Images generatedby our sample generation tool. (Right) Images takenfrom the ICDAR dataset.

broad range of varieties that we can also find in real-world images.

• To enhance the realistic effects, using different colors,sizes, shadows, borders with varying displacements tothe rendered texts.

• Using transformations such as rotation and distortionso that the model is robust to small projective distor-tions.

• Using background blending, reflection, and other vi-sual effects to enhance the similarity to the real-worldimages further.

We randomly select a font from Google fonts5 for creat-ing texts. We can generate words in two different ways:either supplying a wordlist file that contains words or sen-tences, or specifying which kinds of characters to use (lowercase characters, upper case characters, numbers) and thewordlist generation script will generate random words up toa specified maximum length. The sample generation processsubsequently renders text and a shadow image using the sup-plied font and the font size provided. Text image, shadowimage, and background image have randomly selected col-ors, but background color and shadow color are constrainedto have a minimum contrast compared to the text color.That minimum contrast value can be configured in the con-fig and could also be set to zero to allow any color as thebackground and shadow color. Next, the process will blendthe shadow image and text image using alpha compositing,after that we will get the base samples.

In the next step, the process applies image transforma-tions in the following order:

• Random Gaussian Blur The created text sample isblurred using gaussian blur, where the value of eachpixel is computed as the weighted average of the neigh-borhood of that pixel. The weighted average of allneighborhood pixels can be obtained by convolving theimage with a Gaussian filter. The Gaussian filter usedin this work uses a random blur radius.

5https://fonts.google.com/

• Random Distortion We add random distortions to theimage by applying perspective transformations. Thiscreates text image samples where the text is not per-fectly aligned but slightly skewed and incidental ori-ented.

• Random Rotation The data engine can further rotatethe already blurred and distorted text images with ar-bitrary degrees.

• Random Reflection We use predefined reflection over-lays that are alpha composited with the generated im-age and simulate a white glare on the image. It issafe to assume that a white glare is sufficient enoughto model a natural reflection as we are using grayscaleimages.

• Background Blending The last step in the generationprocess is to add a background to the sample. Thisbackground can be a plain color or a natural image.The first blending step is to add a plain colored back-ground with a random opacity to the generated textsample. In the second step, the tool decides whetherto add a natural background further or not. The opac-ity and probability of a natural background image toadd can be pre-configured.

The data engine can very well approximate the distributionof real-world data samples, and our experimental results alsoconfirm this conclusion. More details can be found in section3. Figure 5 shows a comparison of samples generated withour tool and samples from the ICDAR dataset.

2.5 Loss Function
To train the neural networks, the commonly used cross-entropy loss is applied in this paper. In binary classification, where the number of classes K equals 2, cross-entropy can be calculated as:

L_cross-entropy = −(y log(p) + (1 − y) log(1 − p))

For K > 2, we calculate a separate loss for each class label per observation and sum the result, which is expressed as follows:

L_cross-entropy = −∑_{k=1}^{K} y_{i,k} log(p_{i,k})

where K denotes the number of classes, log is the natural log function, y_{i,k} is the binary indicator (0 or 1) of whether class label k is the correct classification for observation i, and p_{i,k} is the predicted probability that observation i belongs to class k.
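As a minimal illustration, the multi-class loss for a single observation with a one-hot label y and a predicted distribution p can be computed as follows; the numbers are arbitrary example values.

```python
import numpy as np

# Cross-entropy for one observation with K = 3 classes (arbitrary example values).
y = np.array([0.0, 1.0, 0.0])        # one-hot label: class 2 is correct
p = np.array([0.2, 0.7, 0.1])        # predicted probabilities (sum to 1)

loss = -np.sum(y * np.log(p))        # = -log(0.7) ≈ 0.357
print(loss)
```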

3. EVALUATION

3.1 Implementation DetailThe model training in this paper has been performed on

Ubuntu 14.04/64-bit platform on a workstation which hasan Intel(R) Core(TM) i7-6900K CPU, 64 GB RAM and 2TITAN X GPUs. We applied deep learning framework Caffe[11] to train the deep networks. The real-time demo wascompiled on a standard laptop computer with Intel CPU×43.2 GHz, 8G RAM, GPU NVIDIA GeForce 860M.

        | Description                 | WRR
I.C.P   | PhotoOCR [2]                | 0.876
        | Our result                  | 0.857
        | Jaderberg's JOINT-model [9] | 0.818
IC05    | SRC-B-TextProcessingLab*    | 0.874
        | Baidu-IDL*                  | 0.872
        | Megvii-Image++ [22]         | 0.8283
        | PhotoOCR [2]                | 0.8283
        | Our result                  | 0.8237
        | NESP                        | 0.642
        | PicRead                     | 0.5799
        | PLT                         | 0.6237
        | MAPS                        | 0.6274
        | Feild's Method              | 0.4795
        | PIONEER                     | 0.537
        | Baseline                    | 0.453
        | TextSpotter [17]            | 0.2685

Table 1: Evaluation result on IC05. The baseline method is from a commercially available OCR system. We intended to include results that are not constrained to a pre-defined lexicon. However, the methods marked with * are not published; therefore they are not distinguishable. (Last access: 05/12/2016)

3.2 Experimental ResultWe evaluated our word recognizer by using ICDAR 2015

Robust Reading Competition Challenge 2 - Task 3 “FocusedScene Word Recognition” dataset (refer to IC05) in an un-constrained manner6. We can provide the output of ourword recognizer in both case-sensitive and case-insensitivemode. For the ICDAR 2015 dataset, we strictly followedthe evaluation metric in a case-sensitive manner (refer to“Word Recognition Rate” (WRR)). We also created evalu-ation result ignored capitalization and punctuation differ-ences using this dataset (refer to I.C.P), and compared withthe best-known methods [2, 9].

Table 1 shows the comparison with previous methods on the IC05 dataset. In the I.C.P evaluation, our current result outperforms the JOINT-model from [9], but comes slightly behind Google's PhotoOCR by 1.9%. We did not consider the DICT-model from [9], since its result is constrained to lexicons.

According to the IC05 ranking results, our approach iscurrently not able to outperform the results created by com-mercial organizations such as SRC-B, Baidu-IDL, Megvii,and Google, but improves on the next best one (NESP) by18% of WRR. Our result is still competitive to commercialorganizations, only 0.46% behind Google and Megvii, re-garding that we have only used synthetic training data, andapplied more succinct network architecture by taking intoaccount of the execution speed. This result confirms thatby using carefully implemented synthetic training data, wecan achieve excellent recognition results very close to themodels trained using real-world data, as e.g., [2] (Google’ssystem) applied several millions of manually labeled sam-ples, and the processing time of their system is around 1.4seconds per image. To process a 640 × 480 image, the sys-tem from [22] (Megvii) needs about 20 seconds on CPU or

6The OCR results are not constrained to a given lexicon.

Figure 6: Exemplary recognition result of our sys-tem by using a webcam in real-time.

1 second on GPU only for text localization. Therefore, oursystem is superior regarding running time, which can beproved by our demo video captured by using a laptop anda video camera 7. The OCR analysis has been performedon every input frame from the camera. Figure 6 demon-strates an exemplary recognition result. The methods from“Baidu-IDL” and “SRC-B” (marked with ∗ in Table 1) haveachieved excellent results in terms of WRR. However, sincethey are the unpublished methods, we thus can not makean accurate judgment. Based on experience, they shoulduse extremely sophisticated neural network architectures toensure high accuracy, which is less efficient than our method.

4. CONCLUSIONIn this paper, we presented a real-time video text recog-

nition system by taking advantages of the efficient classicalcomputer vision methods and highly accurate deep neuralnetworks. In particular, we proved that, by using carefullydesigned synthetic training data, we are able to successfullybuild our CRNN model, which achieves similar recognitionresults as the systems developed based on large-scale real-world datasets.

As the next step, we are thinking about two possible re-search ideas. First, are we able to integrate the detectionand recognition task into a unified neural network, and op-timize the whole problem in an end-to-end manner? Webelieve that this goal could be accomplished by using so-called multi-task learning techniques in the fully supervisedmanner. Moreover, are we able to optimize the detectiontask without bounding box annotations in a semi-supervisedway? The human vision system inspires this idea. We neverprovide fully supervised information, such as the exact lo-cation information and bounding box size of texts or otherobjects, to teach our kids. The human vision system canthus learn to find texts or other objects by cross-modalsupervision signals such as context information, attentioninformation, etc. We will try to explore the possibility ofimplementing a semi-supervised detection network by usingvisual attention information.

We will showcase the proposed video OCR system inter-actively. The input video stream will be captured by using

7https://youtu.be/fSacIqTrD9I

a live camera, and the OCR result will be directly displayedon the computer screen.

5. REFERENCES[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool.

Speeded-up robust features (surf). Comput. Vis.Image Underst., 110(3):346–359, June 2008.

[2] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven.Photoocr: Reading text in uncontrolled conditions. InThe IEEE International Conference on ComputerVision (ICCV), December 2013.

[3] A. Coates, B. Carpenter, C. Case, S. Satheesh,B. Suresh, T. Wang, D. J. Wu, and A. Y. Ng. Textdetection and character recognition in scene imageswith unsupervised feature learning. In Proc. ofInternational Conference on Document Analysis andRecognition, ICDAR ’11, pages 440–445, Washington,DC, USA, 2011. IEEE Computer Society.

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, andL. Fei-Fei. ImageNet: A Large-Scale HierarchicalImage Database. In CVPR09, 2009.

[5] B. Epshtein, E. Ofek, and Y. Wexler. Detecting textin natural scenes with stroke width transform. InProc. of International Conference on Computer Visionand Pattern Recognition, pages 2963–2970, 2010.

[6] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, andV. Shet. Multi-digit number recognition from streetview imagery using deep convolutional neuralnetworks. arXiv preprint arXiv:1312.6082v4, 2014.

[7] S. Hochreiter and J. Schmidhuber. Long short-termmemory. Neural Comput., 9(8):1735–1780, Nov. 1997.

[8] S. Ioffe and C. Szegedy. Batch normalization:Accelerating deep network training by reducinginternal covariate shift. arXiv preprintarXiv:1502.03167, 2015.

[9] M. Jaderberg, K. Simonyan, A. Vedaldi, andA. Zisserman. Deep structured output learning forunconstrained text recognition. In InternationalConference on Learning Representations, 2015.

[10] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deepfeatures for text spotting. In Proc. of EuropeanConference on Computer Vision (ECCV). Springer,2014.

[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev,J. Long, R. Girshick, S. Guadarrama, and T. Darrell.Caffe: Convolutional architecture for fast featureembedding. arXiv preprint arXiv:1408.5093, 2014.

[12] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou,S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas,L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait,S. Uchida, and E. Valveny. Icdar 2015 competition onrobust reading. In Proceedings of the 2015 13thInternational Conference on Document Analysis andRecognition (ICDAR), ICDAR ’15, pages 1156–1160,Washington, DC, USA, 2015. IEEE Computer Society.

[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.Gradient-based learning applied to documentrecognition. Proceedings of the IEEE,86(11):2278–2324, 1998.

[14] D. G. Lowe. Distinctive image features fromscale-invariant keypoints. Int. J. Comput. Vision,60(2):91–110, Nov. 2004.

[15] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robustwide-baseline stereo from maximally stable extremalregions. Image and Vision Computing, 22(10):761 –767, 2004. British Machine Vision Computing 2002.

[16] V. Nair and G. E. Hinton. Rectified linear unitsimprove restricted boltzmann machines. InProceedings of the 27th international conference onmachine learning (ICML-10), pages 807–814, 2010.

[17] L. Neumann and J. Matas. Real-time scene textlocalization and recognition. In Computer Vision andPattern Recognition (CVPR), 2012 IEEE Conferenceon, pages 3538–3545, June 2012.

[18] L. Neumann and J. Matas. Scene text localization andrecognition with oriented stroke detection. InComputer Vision (ICCV), 2013 IEEE InternationalConference on, pages 97–104, Dec 2013.

[19] K. Wang, B. Babenko, and S. Belongie. End-to-endscene text recognition. In Computer Vision (ICCV),2011 IEEE International Conference on, pages1457–1464, Nov 2011.

[20] T. Wang, D. Wu, A. Coates, and A. Ng. End-to-endtext recognition with convolutional neural networks. InPattern Recognition (ICPR), 2012 21st InternationalConference on, pages 3304–3308, Nov 2012.

[21] H. Yang, B. Quehl, and H. Sack. A skeleton basedbinarization approach for video text recognition. InImage Analysis for Multimedia Interactive Services(WIAMIS), 2012 13th International Workshop on,pages 1–4. IEEE, 2012.

[22] C. Yao, J. Wu, X. Zhou, C. Zhang, S. Zhou, Z. Cao,and Q. Yin. Incidental scene text understanding:Recent progresses on icdar 2015 robust readingcompetition challenge. arXiv preprintarXiv:1511.09207v2, 2016.

5

SEE

In this paper, we present an approach towards semi-supervised neural networks for scene text detection and recognition which can be optimized end-to-end. We show its feasibility by performing a range of experiments on standard benchmark datasets, where we achieve state-of-the-art results.

Moreover, in section 5.3 I provide additional experimental results of the proposed approach from another of our papers [BYM17b], which demonstrate the robust generalization ability of SEE.

5.1 Contribution to the Work

• Significantly contributed to the conceptual discussion and idea development.

• Guidance and supervision of the technical implementation

5.2 Manuscript

In addition to the manuscript, we prepared demo videos that demonstrate the proposed

approach on different datasets.1

1SVHN: https://youtu.be/GSq3_GeDZKk

FSNS: https://youtu.be/5lt6dAbbsu4

ICDAR: https://youtu.be/LNNrZ7kcmbU


SEE: Towards Semi-Supervised End-to-End Scene Text Recognition

Christian Bartz, Haojin Yang, Christoph Meinel
Hasso Plattner Institute, University of Potsdam

Prof.-Dr.-Helmert Straße 2-3, 14482 Potsdam, Germany

{christian.bartz, haojin.yang, meinel}@hpi.de

Abstract

Detecting and recognizing text in natural scene images is a challenging, yet not completely solved task. In recent years several new systems that try to solve at least one of the two sub-tasks (text detection and text recognition) have been proposed. In this paper we present SEE, a step towards semi-supervised neural networks for scene text detection and recognition, that can be optimized end-to-end. Most existing works consist of multiple deep neural networks and several pre-processing steps. In contrast to this, we propose to use a single deep neural network, that learns to detect and recognize text from natural images, in a semi-supervised way. SEE is a network that integrates and jointly learns a spatial transformer network, which can learn to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We introduce the idea behind our novel approach and show its feasibility, by performing a range of experiments on standard benchmark datasets, where we achieve competitive results.

Introduction

Text is ubiquitous in our daily lives. Text can be found on documents, road signs, billboards, and other objects like cars or telephones. Automatically detecting and reading text from natural scene images is an important part of systems that are to be used for several challenging tasks, such as image-based machine translation, autonomous cars or image/video indexing. In recent years the task of detecting text and recognizing text in natural scenes has seen much interest from the computer vision and document analysis community. Furthermore, recent breakthroughs (He et al. 2016a; Jaderberg et al. 2015b; Redmon et al. 2016; Ren et al. 2015) in other areas of computer vision enabled the creation of even better scene text detection and recognition systems than before (Gomez and Karatzas 2017; Gupta, Vedaldi, and Zisserman 2016; Shi et al. 2016). Although the problem of Optical Character Recognition (OCR) can be seen as solved for text in printed documents, it is still challenging to detect and recognize text in natural scene images. Images containing natural scenes exhibit large variations of illumination, perspective distortions, image qualities, text fonts, diverse backgrounds, etc.



Figure 1: Schematic overview of our proposed system. The input image is fed to a single neural network that consists of a text detection part and a text recognition part. The text detection part learns to detect text in a semi-supervised way, by being jointly trained with the recognition part.

The majority of existing research works developed end-to-end scene text recognition systems that consist of complex two-step pipelines, where the first step is to detect regions of text in an image and the second step is to recognize the textual content of that identified region. Most of the existing works only concentrate on one of these two steps.

In this paper, we present a solution that consists of a single Deep Neural Network (DNN) that can learn to detect and recognize text in a semi-supervised way. In this setting the network only receives the image and the textual labels as input. We do not supply any groundtruth bounding boxes. The text detection is learned by the network itself. This is contrary to existing works, where text detection and text recognition systems are trained separately in a fully-supervised way. Recent work (Dai, He, and Sun 2016) showed that Convolutional Neural Networks (CNNs) are capable of learning how to solve complex multi-task problems, while being trained in an end-to-end manner. Our motivation is to use these capabilities of CNNs and create an end-to-end trainable scene text recognition system, that can be trained on weakly labelled data. In order to create such a system, we learn a single DNN that is able to find single characters, words or even lines of text in the input image and recognize their content. This is achieved by jointly learning a localization network that uses a recurrent spatial transformer (Jaderberg et al. 2015b; Sønderby et al. 2015) as attention mechanism and a text recognition network. Figure 1 provides a schematic overview of our proposed system.

Our contributions are as follows: (1) We present a novel end-to-end trainable system for scene text detection and recognition by integrating spatial transformer networks. (2) We propose methods that can improve and ease the work with spatial transformer networks. (3) We train our proposed system end-to-end, in a semi-supervised way. (4) We demonstrate that our approach is able to reach competitive performance on standard benchmark datasets. (5) We provide our code1 and trained models2 to the research community.

This paper is structured in the following way: We first outline work of other researchers that is related to ours. Second, we describe our proposed system in detail. We then show and discuss our results on standard benchmark datasets and finally conclude our findings.

Related Work

Over the course of years a rich environment of different approaches to scene text detection and recognition has been developed and published. Nearly all systems use a two-step process for performing end-to-end recognition of scene text. The first step is to detect regions of text and extract these regions from the input image. The second step is to recognize the textual content and return the text strings of the extracted text regions.

It is further possible to divide these approaches into three broad categories: (1) Systems relying on hand crafted features and human knowledge for text detection and text recognition. (2) Systems using deep learning approaches, together with hand crafted features, or two different deep networks for each of the two steps. (3) Systems that do not consist of a two step approach but rather perform text detection and recognition using a single deep neural network. For each category, we will discuss some of these systems.

Hand Crafted Features  In the beginning, methods based on hand crafted features and human knowledge have been used to perform text detection and recognition. These systems used features like MSERs (Neumann and Matas 2010), Stroke Width Transforms (Epshtein, Ofek, and Wexler 2010) or HOG-Features (Wang, Babenko, and Belongie 2011) to identify regions of text and provide them to the text recognition stage of the system. In the text recognition stage sliding window classifiers (Mishra, Alahari, and Jawahar 2012) and ensembles of SVMs (Yao et al. 2014) or k-Nearest Neighbor classifiers using HOG features (Wang and Belongie 2010) were used. All of these approaches use hand crafted features that have a large variety of hyper parameters that need expert knowledge to correctly tune them for achieving the best results.

Deep Learning Approaches  More recent systems exchange approaches based on hand crafted features in one or both steps of recognition systems by approaches using DNNs. Gomez and Karatzas (Gomez and Karatzas 2017) propose a text-specific selective search algorithm that, together with a DNN, can be used to detect (distorted) text regions in natural scene images. Gupta et al. (Gupta, Vedaldi, and Zisserman 2016) propose a text detection model based on the YOLO-Architecture (Redmon et al. 2016) that uses a fully convolutional deep neural network to identify text regions.

1 https://github.com/Bartzi/see
2 https://bartzi.de/research/see

Bissacco et al. (Bissacco et al. 2013) propose a complete end-to-end architecture that performs text detection using hand crafted features. Jaderberg et al. (Jaderberg et al. 2015a; Jaderberg, Vedaldi, and Zisserman 2014) propose several systems that use deep neural networks for text detection and text recognition. In (Jaderberg et al. 2015a) Jaderberg et al. propose to use a region proposal network with an extra bounding box regression CNN for text detection. A CNN that takes the whole text region as input is used for text recognition. The output of this CNN is constrained to a pre-defined dictionary of words, making this approach only applicable to one given language.

Goodfellow et al. (Goodfellow et al. 2014) propose a text recognition system for house numbers, that has been refined by Jaderberg et al. (Jaderberg, Vedaldi, and Zisserman 2014) for unconstrained text recognition. This system uses a single CNN, taking the whole extracted text region as input, and recognizing the text using one independent classifier for each possible character in the given word. Based on this idea He et al. (He et al. 2016b) and Shi et al. (Shi, Bai, and Yao 2016) propose text recognition systems that treat the recognition of characters from the extracted text region as a sequence recognition problem. Shi et al. (Shi et al. 2016) later improved their approach by firstly adding an extra step that utilizes the rectification capabilities of Spatial Transformer Networks (Jaderberg et al. 2015b) for rectifying extracted text lines. Secondly they added a soft-attention mechanism to their network that helps to produce the sequence of characters in the input image. In their work Shi et al. make use of Spatial Transformers as an extra pre-processing step to make it easier for the recognition network to recognize the text in the image. In our system we use the Spatial Transformer as a core building block for detecting text in a semi-supervised way.

End-to-End trainable Approaches  The presented systems always use a two-step approach for detecting and recognizing text from scene text images. Although recent approaches make use of deep neural networks they are still using a huge amount of hand crafted knowledge in either of the steps or at the point where the results of both steps are fused together. Smith et al. (Smith et al. 2016) and Wojna et al. (Wojna et al. 2017) propose an end-to-end trainable system that is able to recognize text on French street name signs, using a single DNN. In contrast to our system it is not possible for the system to provide the location of the text in the image, only the textual content can be extracted. Recently Li et al. (Li, Wang, and Shen 2017) proposed an end-to-end system consisting of a single, complex DNN that is trained end-to-end and can perform text detection and text recognition in a single forward pass. This system is trained using groundtruth bounding boxes and groundtruth labels for each word in the input images, which stands in contrast to our method, where we only use groundtruth labels for each word in the input image, as the detection of text is learned by the network itself.

Proposed System

A human trying to find and read text will do so in a sequential manner. The first action is to put attention on a word, read each character sequentially and then attend to the next word. Most current end-to-end systems for scene text recognition do not behave in that way. These systems rather try to solve the problem by extracting all information from the image at once. Our system first tries to attend sequentially to different text regions in the image and then recognize their textual content. In order to do this, we created a single DNN consisting of two stages: (1) text detection, and (2) text recognition. In this section we will introduce the attention concept used by the text detection stage and the overall structure of the proposed system.

Detecting Text with Spatial Transformers

A spatial transformer proposed by Jaderberg et al. (Jaderberg et al. 2015b) is a differentiable module for DNNs that takes an input feature map I and applies a spatial transformation to this feature map, producing an output feature map O. Such a spatial transformer module is a combination of three parts. The first part is a localization network computing a function f_loc, that predicts the parameters θ of the spatial transformation to be applied. These predicted parameters are used in the second part to create a sampling grid, which defines a set of points where the input map should be sampled. The third part is a differentiable interpolation method, that takes the generated sampling grid and produces the spatially transformed output feature map O. We will shortly describe each component in the following paragraphs.

Localization Network  The localization network takes the input feature map I ∈ R^{C×H×W}, with C channels, height H and width W, and outputs the parameters θ of the transformation that shall be applied. In our system we use the localization network (f_loc) to predict N two-dimensional affine transformation matrices A^n_θ, where n ∈ {0, ..., N − 1}:

f_{loc}(I) = A^n_\theta = \begin{bmatrix} \theta^n_1 & \theta^n_2 & \theta^n_3 \\ \theta^n_4 & \theta^n_5 & \theta^n_6 \end{bmatrix}    (1)

N is thereby the number of characters, words or text lines the localization network shall localize. The affine transformation matrices predicted in that way allow the network to apply translation, rotation, zoom and skew to the input image.

In our system the N transformation matrices A^n_θ are produced by using a feed-forward CNN together with a Recurrent Neural Network (RNN). Each of the N transformation matrices is computed using the globally extracted convolutional features c and the hidden state h_n of each time-step of the RNN:

c = f^{conv}_{loc}(I)    (2)

h_n = f^{rnn}_{loc}(c, h_{n-1})    (3)

A^n_\theta = g_{loc}(h_n)    (4)

where g_loc is another feed-forward/recurrent network. We use a variant of the well known ResNet architecture (He et al. 2016a) as CNN for our localization network. We use this network architecture, because we found that with this network structure our system learns faster and more successfully, as compared to experiments with other network structures, such as the VGGNet (Simonyan and Zisserman 2015). We argue that this is due to the fact that the residual connections of the ResNet help with retaining a strong gradient down to the very first convolutional layers. The RNN used in the localization network is a Long-Short Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) unit. This LSTM is used to generate the hidden states h_n, which in turn are used to predict the affine transformation matrices. We used the same structure of the network for all our experiments we report in the next section. Figure 2 provides a structural overview of this network.
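To make Equations (2)-(4) concrete, the following minimal numpy sketch (not part of the original paper and not the authors' implementation) shows how a global convolutional feature vector c can drive a recurrent cell whose hidden state is mapped to one 2×3 affine matrix per time step. The layer sizes, the simple tanh recurrence standing in for the LSTM, and the identity-biased readout are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def f_conv_loc(image, feat_dim=256):
        """Stand-in for the convolutional feature extractor: returns the global feature vector c (Eq. 2)."""
        return rng.standard_normal(feat_dim)

    def f_rnn_loc(c, h_prev, W_c, W_h, b):
        """Simple tanh recurrence as a stand-in for the LSTM used in the paper (Eq. 3)."""
        return np.tanh(W_c @ c + W_h @ h_prev + b)

    def g_loc(h, W_out, b_out):
        """Linear readout to the 6 affine parameters, reshaped to a 2x3 matrix (Eq. 4)."""
        return (W_out @ h + b_out).reshape(2, 3)

    N, feat_dim, hidden = 3, 256, 64                   # N text regions to localize
    W_c = rng.standard_normal((hidden, feat_dim)) * 0.01
    W_h = rng.standard_normal((hidden, hidden)) * 0.01
    b = np.zeros(hidden)
    W_out = rng.standard_normal((6, hidden)) * 0.01
    b_out = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])   # bias the prediction towards the identity transform

    image = rng.standard_normal((3, 64, 200))          # dummy C x H x W input
    c = f_conv_loc(image, feat_dim)
    h = np.zeros(hidden)
    affine_matrices = []
    for n in range(N):
        h = f_rnn_loc(c, h, W_c, W_h, b)
        affine_matrices.append(g_loc(h, W_out, b_out))
    print(affine_matrices[0])                          # one 2x3 matrix A_theta^n per predicted region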

Rotation Dropout  During our experiments, we found that the network tends to predict transformation parameters which include excessive rotation. In order to mitigate such a behavior, we propose a mechanism that works similarly to dropout (Srivastava et al. 2014), which we call rotation dropout. Rotation dropout works by randomly dropping the parameters of the affine transformation which are responsible for rotation. This prevents the localization network from outputting transformation matrices that perform excessive rotation. Figure 3 shows a comparison of the localization result of a localization network trained without rotation dropout (top) and one trained with rotation dropout (middle).
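The paper does not spell out the exact formulation of rotation dropout; the following hypothetical sketch captures one plausible reading, namely randomly zeroing the two entries of the predicted affine matrix that are responsible for rotation/shear.

    import numpy as np

    def rotation_dropout(theta, drop_prob=0.5, rng=np.random.default_rng()):
        """Hedged sketch of rotation dropout: with probability drop_prob, zero the
        off-diagonal entries theta_2 and theta_4 of the 2x3 affine matrix, which
        control rotation/shear. The exact mechanism is our assumption."""
        theta = theta.copy()
        if rng.random() < drop_prob:
            theta[0, 1] = 0.0   # theta_2
            theta[1, 0] = 0.0   # theta_4
        return theta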

Grid Generator  The grid generator uses a regularly spaced grid G_o with coordinates y_{h_o}, x_{w_o}, of height H_o and width W_o. The grid G_o is used together with the affine transformation matrices A^n_θ to produce N regular grids G^n with coordinates u^n_i, v^n_j of the input feature map I, where i ∈ H_o and j ∈ W_o:

\begin{pmatrix} u^n_i \\ v^n_j \end{pmatrix} = A^n_\theta \begin{pmatrix} x_{w_o} \\ y_{h_o} \\ 1 \end{pmatrix} = \begin{bmatrix} \theta^n_1 & \theta^n_2 & \theta^n_3 \\ \theta^n_4 & \theta^n_5 & \theta^n_6 \end{bmatrix} \begin{pmatrix} x_{w_o} \\ y_{h_o} \\ 1 \end{pmatrix}    (5)

During inference we can extract the N resulting grids G^n, which contain the bounding boxes of the text regions found by the localization network. Height H_o and width W_o can be chosen freely.
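A possible numpy sketch of the grid generator in Equation 5 (an illustration under our own conventions, not the authors' code): a regular grid of normalized output coordinates is mapped through A^n_θ to obtain the input coordinates at which the feature map is sampled; the corners of the resulting grid give the bounding box vertices.

    import numpy as np

    def grid_generator(A_theta, H_o=32, W_o=100):
        """Map a regular H_o x W_o grid of normalized output coordinates through the
        2x3 affine matrix A_theta to obtain input coordinates (u, v), as in Eq. (5)."""
        ys, xs = np.meshgrid(np.linspace(-1, 1, H_o), np.linspace(-1, 1, W_o), indexing="ij")
        ones = np.ones_like(xs)
        grid = np.stack([xs, ys, ones], axis=0).reshape(3, -1)   # homogeneous coords (x, y, 1)
        uv = A_theta @ grid                                      # shape: 2 x (H_o * W_o)
        u = uv[0].reshape(H_o, W_o)                              # sampling rows in the input
        v = uv[1].reshape(H_o, W_o)                              # sampling columns in the input
        return u, v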

Localization specific regularizers  The datasets used by us do not contain any samples where text is mirrored either along the x- or y-axis. Therefore, we found it beneficial to add additional regularization terms that penalize grids which are mirrored along any axis. We furthermore found that the network tends to predict grids that get larger over the time of training, hence we included a further regularizer that penalizes large grids, based on their area. Lastly, we also included a regularizer that encourages the network to predict grids that have a greater width than height, as text is normally written in horizontal direction and typically wider than high. The main purpose of these localization specific regularizers is to enable faster convergence. Without these regularizers, the network will eventually converge, but it will take a very long time and might need several restarts of the training. Equation 7 shows how these regularizers are used for calculating the overall loss of the network.

Figure 2: The network used in our work consists of two major parts. The first is the localization network that takes the input image and predicts N transformation matrices, which are used to create N different sampling grids. The generated sampling grids are used in two ways: (1) for calculating the bounding boxes of the identified text regions, (2) for extracting N text regions. The recognition network then performs text recognition on these extracted regions. The whole system is trained end-to-end by only supplying information about the text labels for each text region.

Image Sampling  The N sampling grids G^n produced by the grid generator are now used to sample values of the feature map I at the coordinates u^n_i, v^n_j for each n ∈ N. Naturally these points will not always perfectly align with the discrete grid of values in the input feature map. Because of that we use bilinear sampling and define the values of the N output feature maps O^n at a given location i, j where i ∈ H_o and j ∈ W_o to be:

O^n_{ij} = \sum_{h}^{H} \sum_{w}^{W} I_{hw} \, \max(0, 1 - |u^n_i - h|) \, \max(0, 1 - |v^n_j - w|)    (6)

This bilinear sampling is (sub-)differentiable, hence it is possible to propagate error gradients to the localization network, using standard backpropagation.

The combination of localization network, grid generator and image sampler forms a spatial transformer and can in general be used in every part of a DNN. In our system we use the spatial transformer as the first step of our network. Figure 4 provides a visual explanation of the operation method of grid generator and image sampler.
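For illustration, a direct (and deliberately slow) numpy transcription of the bilinear sampling in Equation 6 could look as follows; restricting the double sum to the four neighboring pixels is equivalent because all other weights vanish. The coordinate conventions (u indexes rows, v indexes columns) are our assumption.

    import numpy as np

    def bilinear_sample(I, u, v):
        """Bilinear sampling of a single-channel feature map I (H x W) at real-valued
        coordinates u (rows) and v (columns), following Eq. (6). Coordinates are
        assumed to already be expressed in pixel units of I."""
        H, W = I.shape
        O = np.zeros_like(u, dtype=float)
        for idx in np.ndindex(u.shape):
            ui, vj = u[idx], v[idx]
            h0, w0 = int(np.floor(ui)), int(np.floor(vj))
            for h in (h0, h0 + 1):          # only the four neighbors contribute
                for w in (w0, w0 + 1):
                    if 0 <= h < H and 0 <= w < W:
                        weight = max(0.0, 1 - abs(ui - h)) * max(0.0, 1 - abs(vj - w))
                        O[idx] += I[h, w] * weight
        return O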

Text Recognition Stage

The image sampler of the text detection stage produces a set of N regions that are extracted from the original input image. The text recognition stage (a structural overview of this stage can be found in Figure 2) uses each of these N different regions and processes them independently of each other. The processing of the N different regions is handled by a CNN. This CNN is also based on the ResNet architecture, as we found that we could only achieve good results while using a variant of the ResNet architecture for our recognition network. We argue that using a ResNet in the recognition stage is even more important than in the detection stage, because the detection stage needs to receive strong gradient information from the recognition stage in order to successfully update the weights of the localization network. The CNN of the recognition stage predicts a probability distribution y over the label space L_ε, where L_ε = L ∪ {ε}, with L being the alphabet used for recognition, and ε representing the blank label. The network is trained by running a LSTM for a fixed number of T timesteps and calculating the cross-entropy loss for the output of each timestep. The choice of the number of timesteps T is based on the number of characters of the longest word in the dataset. The loss L is computed as follows:

L^n_{grid} = \lambda_1 \, L_{ar}(G^n) + \lambda_2 \, L_{as}(G^n) + L_{di}(G^n)    (7)

L = \sum_{n=1}^{N} \Big( \sum_{t=1}^{T} P(l^n_t \mid O^n) + L^n_{grid} \Big)    (8)

where L_{ar}(G^n) is the regularization term based on the area of the predicted grid n, L_{as}(G^n) is the regularization term based on the aspect ratio of the predicted grid n, and L_{di}(G^n) is the regularization term based on the direction of the grid n, that penalizes mirrored grids. λ1 and λ2 are scaling parameters that can be chosen freely. The typical range of these parameters is 0 < λ1, λ2 < 0.5. l^n_t is the label l at time step t for the n-th word in the image.
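The following sketch shows how such a regularized loss could be assembled; the concrete penalty formulas for L_ar, L_as and L_di are not given in the paper, so the ones below are merely plausible stand-ins, and the dummy predictions only serve to make the snippet runnable.

    import numpy as np

    def cross_entropy(probs, label):
        """Negative log-likelihood of the correct label under a softmax distribution."""
        return -np.log(probs[label] + 1e-12)

    def grid_regularizer(grid_u, grid_v, lam1=0.1, lam2=0.1):
        """Sketch of L^n_grid in Eq. (7); the individual penalties are assumptions."""
        width = grid_v[0, -1] - grid_v[0, 0]       # signed width along the top edge
        height = grid_u[-1, 0] - grid_u[0, 0]      # signed height along the left edge
        area_penalty = abs(width * height)                         # L_ar: discourage very large grids
        aspect_penalty = max(0.0, abs(height) - abs(width))        # L_as: prefer wide, flat grids
        direction_penalty = float(width < 0) + float(height < 0)   # L_di: penalize mirrored grids
        return lam1 * area_penalty + lam2 * aspect_penalty + direction_penalty

    # Total loss over N regions and T time steps (Eq. (8)) with random dummy predictions.
    N, T, vocab = 2, 5, 11
    rng = np.random.default_rng(0)
    loss = 0.0
    for n in range(N):
        grid_u, grid_v = rng.random((8, 25)), rng.random((8, 25))
        for t in range(T):
            probs = rng.dirichlet(np.ones(vocab))                  # softmax output at time step t
            loss += cross_entropy(probs, label=rng.integers(vocab))
        loss += grid_regularizer(grid_u, grid_v)
    print(loss)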

Model Training

The training set X used for training the model consists of a set of input images I and a set of text labels L_I for each input image. We do not use any labels for training the text detection stage. The text detection stage is learning to detect regions of text by using only the error gradients, obtained by calculating the cross-entropy loss, of the predictions and the textual labels, for each character of each word. During our experiments we found that, when trained from scratch, a network that shall detect and recognize more than two text lines does not converge. In order to overcome this problem we designed a curriculum learning strategy (Bengio et al. 2009) for training the system. The complexity of the supplied training images under this curriculum is gradually increasing, once the accuracy on the validation set has settled.

Figure 3: Top: predicted bounding boxes of network trained without rotation dropout. Middle: predicted bounding boxes of network trained with rotation dropout. Bottom: visualization of image parts that have the highest influence on the outcome of the prediction. This visualization has been created using Visualbackprop (Bojarski et al. 2016).

During our experiments we observed that the performance of the localization network stagnates as the accuracy of the recognition network increases. We found that restarting the training with the localization network initialized using the weights obtained by the last training and the recognition network initialized with random weights enables the localization network to improve its predictions and thus improve the overall performance of the trained network. We argue that this happens because the values of the gradients propagated to the localization network decrease as the loss decreases, leading to vanishing gradients in the localization network and hence nearly no improvement of the localization.

Experiments

In this section we evaluate our presented network architecture on standard scene text detection/recognition benchmark datasets. While performing our experiments we tried to answer the following questions: (1) Is the concept of letting the network automatically learn to detect text feasible? (2) Can we apply the method on a real world dataset? (3) Can we get any insights on what kind of features the network is trying to extract?

Figure 4: Operation method of grid generator and image sampler. First the grid generator uses the N affine transformation matrices A^n_θ to create N equally spaced sampling grids (red and yellow grids on the left side). These sampling grids are used by the image sampler to extract the image pixels at that location, in this case producing the two output images O1 and O2. The corners of the generated sampling grids provide the vertices of the bounding box for each text region that has been found by the network.

In order to answer these questions, we used different datasets. On the one hand we used standard benchmark datasets for scene text recognition. On the other hand we generated some datasets on our own. First, we performed experiments on the SVHN dataset (Netzer et al. 2011), that we used to prove that our concept as such is feasible. Second, we generated more complex datasets based on SVHN images, to see how our system performs on images that contain several words in different locations. The third dataset we experimented with was the French Street Name Signs (FSNS) dataset (Smith et al. 2016). This dataset is the most challenging we used, as it contains a vast amount of irregular, low resolution text lines, that are more difficult to locate and recognize than text lines from the SVHN datasets. We begin this section by introducing our experimental setup. We will then present the results and characteristics of the experiments for each of the aforementioned datasets. We will conclude this section with a brief explanation of what kinds of features the network seems to learn.

Experimental Setup

Localization Network  The localization network used in every experiment is based on the ResNet architecture (He et al. 2016a). The input to the network is the image where text shall be localized and later recognized. Before the first residual block the network performs a 3 × 3 convolution, followed by batch normalization (Ioffe and Szegedy 2015), ReLU (Nair and Hinton 2010), and a 2 × 2 average pooling layer with stride 2. After these layers three residual blocks with two 3 × 3 convolutions, each followed by batch normalization and ReLU, are used. The number of convolutional filters is 32, 48 and 48 respectively. A 2 × 2 max-pooling with stride 2 follows after the second residual block. The last residual block is followed by a 5 × 5 average pooling layer and this layer is followed by a LSTM with 256 hidden units. Each time step of the LSTM is fed into another LSTM with 6 hidden units. This layer predicts the affine transformation matrix, which is used to generate the sampling grid for the bilinear interpolation. We apply rotation dropout to each predicted affine transformation matrix, in order to overcome problems with excessive rotation predicted by the network.

Recognition Network  The inputs to the recognition network are N crops from the original input image, representing the text regions found by the localization network. In our SVHN experiments, the recognition network has the same structure as the localization network, but the number of convolutional filters is higher. The number of convolutional filters is 32, 64 and 128 respectively. We use an ensemble of T independent softmax classifiers as used in (Goodfellow et al. 2014) and (Jaderberg, Vedaldi, and Zisserman 2014) for generating our predictions. In our experiments on the FSNS dataset we found that using ResNet-18 (He et al. 2016a) significantly improves the obtained recognition accuracies.

Alignment of Groundtruth  During training we assume that all groundtruth labels are sorted in western reading direction, that means they appear in the following order: 1. from top to bottom, and 2. from left to right. We stress that currently it is very important to have a consistent ordering of the groundtruth labels, because if the labels are in a random order, the network rather predicts large bounding boxes that span over all areas of text in the image. We hope to overcome this limitation in the future, by developing a method that allows random ordering of groundtruth labels.
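As a small illustration of the assumed western reading order, word annotations could be sorted by their top-left coordinates, first top to bottom and then left to right; the coordinates below are made up and the plain lexicographic sort is a crude approximation that ignores line grouping.

    # Hypothetical helper illustrating the assumed "western reading order" of the labels.
    words = [("Satie", (120, 40)), ("Place", (10, 10)), ("Erik", (60, 38))]   # (text, (x, y))
    ordered = [w for w, _ in sorted(words, key=lambda item: (item[1][1], item[1][0]))]
    print(ordered)   # ['Place', 'Erik', 'Satie']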

Implementation  We implemented all our experiments using Chainer (Tokui et al. 2015). We conducted all our experiments on a work station which has an Intel(R) Core(TM) i7-6900K CPU, 64 GB RAM and 4 TITAN X (Pascal) GPUs.

Experiments on the SVHN dataset

With our first experiments on the SVHN dataset (Netzer et al. 2011) we wanted to prove that our concept works. We therefore first conducted experiments, similar to the experiments in (Jaderberg et al. 2015b), on SVHN image crops with a single house number in each image crop, that is centered around the number and also contains background noise. Table 1 shows that we are able to reach competitive recognition accuracies.

Based on this experiment we wanted to determine whether our model is able to detect different lines of text that are arranged in a regular grid, or placed at random locations in the image. In Figure 5 we show samples from our two generated datasets, that we used for our other experiments based on SVHN data. We found that our network performs well on the task of finding and recognizing house numbers that are arranged in a regular grid.

Method                      64px
(Goodfellow et al. 2014)    96.0%
(Jaderberg et al. 2015b)    96.3%
Ours                        95.2%

Table 1: Sequence recognition accuracies on the SVHN dataset, when recognizing house numbers on crops of 64 × 64 pixels, following the experimental setup of (Goodfellow et al. 2014).

Figure 5: Samples from our generated datasets, including bounding boxes predicted by our model. Left: Sample from regular grid dataset. Right: Sample from dataset with randomly positioned house numbers.

During our experiments on the second dataset, created by us, we found that it is not possible to train a model from scratch which can find and recognize more than two text lines that are scattered across the whole image. We therefore resorted to designing a curriculum learning strategy that starts with easier samples first and then gradually increases the complexity of the training images.

Experiments on the FSNS dataset

Following our scheme of increasing the difficulty of the task that should be solved by the network, we chose the French Street Name Signs (FSNS) dataset by Smith et al. (Smith et al. 2016) to be our next dataset to perform experiments on. The FSNS dataset contains more than 1 million images of French street name signs, which have been extracted from Google Streetview. This dataset is the most challenging dataset for our approach as it (1) contains multiple lines of text with varying length, which are embedded in natural scenes with distracting backgrounds, and (2) contains a lot of images where the text is occluded, not correct, or nearly unreadable for humans.

During our first experiments with that dataset, we found that our model is not able to converge when trained on the supplied groundtruth. We argue that this is because our network was not able to learn the alignment of the supplied labels with the text in the images of the dataset. We therefore chose a different approach, and started with experiments where we tried to find individual words instead of text lines with more than one word. Table 2 shows the performance of our proposed system on the FSNS benchmark dataset. We are currently able to achieve competitive performance on this dataset. We are still behind the results reported by Wojna et al. (Wojna et al. 2017). This is likely due to the fact that we used a feature extractor (ResNet-18) that is weaker than the one used by Wojna et al. (Inception-ResNet v2). Also recall that our method is not only able to determine the text in the images, but also able to extract the location of the text, although we never explicitly told the network where to find the text! The network learned this completely on its own in a semi-supervised manner.

Figure 6: Samples from the FSNS dataset. These examples show the variety of different samples in the dataset and also how well our system copes with these samples. The bottom row shows two samples where our system fails to recognize the correct text. The right image is especially interesting, as the system here tries to mix information extracted from two different street signs that should not be together in one sample.

Method                  Sequence Accuracy
(Smith et al. 2016)     72.5%
(Wojna et al. 2017)     84.2%
Ours                    78.0%

Table 2: Recognition accuracies on the FSNS benchmark dataset.

Insights

During the training of our networks, we used Visualbackprop (Bojarski et al. 2016) to visualize the regions that the network deems to be the most interesting. Using this visualization technique, we could observe that our system seems to learn different types of features for each subtask. Figure 3 (bottom) shows that the localization network learns to extract features that resemble edges of text and the recognition network learns to find strokes of the individual characters in each cropped word region. This is an interesting observation, as it shows that our DNN tries to learn features that are closely related to the features used by systems based on hand-crafted features.

Conclusion

In this paper we presented a system that can be seen as a step towards solving end-to-end scene text recognition, only using a single multi-task deep neural network. We trained the text detection component of our model in a semi-supervised way and are able to extract the localization results of the text detection component. The network architecture of our system is simple, but it is not easy to train this system, as a successful training requires a clever curriculum learning strategy. We also showed that our network architecture can be used to reach competitive results on different public benchmark datasets for scene text detection/recognition.

At the current state we note that our models are not fully capable of detecting text in arbitrary locations in the image, as we saw during our experiments with the FSNS dataset. Right now our model is also constrained to a fixed number of maximum words that can be detected with one forward pass. In our future work, we want to redesign the network in a way that makes it possible for the network to determine the number of text lines in an image by itself.

References

Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 41–48. New York, NY, USA: ACM.
Bissacco, A.; Cummins, M.; Netzer, Y.; and Neven, H. 2013. PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, 785–792.
Bojarski, M.; Choromanska, A.; Choromanski, K.; Firner, B.; Jackel, L.; Muller, U.; and Zieba, K. 2016. VisualBackProp: efficient visualization of CNNs. arXiv:1611.05418 [cs].
Dai, J.; He, K.; and Sun, J. 2016. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3150–3158.
Epshtein, B.; Ofek, E.; and Wexler, Y. 2010. Detecting text in natural scenes with stroke width transform. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2963–2970.
Goodfellow, I.; Bulatov, Y.; Ibarz, J.; Arnoud, S.; and Shet, V. 2014. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In ICLR 2014.
Gupta, A.; Vedaldi, A.; and Zisserman, A. 2016. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2315–2324.
Gomez, L., and Karatzas, D. 2017. TextProposals: A text-specific selective search algorithm for word spotting in the wild. Pattern Recognition 70:60–74.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
He, P.; Huang, W.; Qiao, Y.; Loy, C. C.; and Tang, X. 2016b. Reading scene text in deep convolutional sequences. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 3501–3508. AAAI Press.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, 448–456.
Jaderberg, M.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2015a. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision 116(1):1–20.
Jaderberg, M.; Simonyan, K.; Zisserman, A.; and Kavukcuoglu, K. 2015b. Spatial transformer networks. In Advances in Neural Information Processing Systems 28, 2017–2025. Curran Associates, Inc.
Jaderberg, M.; Vedaldi, A.; and Zisserman, A. 2014. Deep features for text spotting. In Computer Vision - ECCV 2014, number 8692 in Lecture Notes in Computer Science, 512–528. Springer International Publishing.
Li, H.; Wang, P.; and Shen, C. 2017. Towards end-to-end text spotting with convolutional recurrent neural networks. arXiv:1707.03985 [cs].
Mishra, A.; Alahari, K.; and Jawahar, C. 2012. Scene text recognition using higher order language priors. In BMVC 2012 - 23rd British Machine Vision Conference, 127.1–127.11. British Machine Vision Association.
Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, 5.
Neumann, L., and Matas, J. 2010. A method for text localization and recognition in real-world images. In Computer Vision - ACCV 2010, number 6494 in Lecture Notes in Computer Science, 770–783. Springer Berlin Heidelberg.
Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, 91–99. Curran Associates, Inc.
Shi, B.; Bai, X.; and Yao, C. 2016. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Shi, B.; Wang, X.; Lyu, P.; Yao, C.; and Bai, X. 2016. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4168–4176.
Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
Smith, R.; Gu, C.; Lee, D.-S.; Hu, H.; Unnikrishnan, R.; Ibarz, J.; Arnoud, S.; and Lin, S. 2016. End-to-end interpretation of the French street name signs dataset. In Computer Vision - ECCV 2016 Workshops, 411–426. Springer, Cham.
Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.
Sønderby, S. K.; Sønderby, C. K.; Maaløe, L.; and Winther, O. 2015. Recurrent spatial transformer networks. arXiv:1509.05329 [cs].
Tokui, S.; Oono, K.; Hido, S.; and Clayton, J. 2015. Chainer: a next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS).
Wang, K., and Belongie, S. 2010. Word spotting in the wild. In Computer Vision - ECCV 2010, number 6311 in Lecture Notes in Computer Science, 591–604. Springer Berlin Heidelberg.
Wang, K.; Babenko, B.; and Belongie, S. 2011. End-to-end scene text recognition. In 2011 International Conference on Computer Vision, 1457–1464.
Wojna, Z.; Gorban, A.; Lee, D.-S.; Murphy, K.; Yu, Q.; Li, Y.; and Ibarz, J. 2017. Attention-based extraction of structured information from street view imagery. arXiv:1704.03549 [cs].
Yao, C.; Bai, X.; Shi, B.; and Liu, W. 2014. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4042–4049.


5.3 Additional Experimental Results

In this section, we provide additional experimental results of our presented net-

work architecture on the ICDAR dataset [KGBN+15] for focused scene text recog-

nition, where we explored the performance of our model when it comes to finding

and recognizing single characters.

5.3.1 Experimental Setup

5.3.1.1 Localization Network

The localization network used in every experiment is based on the ResNet ar-

chitecture [HZRS16]. The input to the network is the image where text shall be

localized and later recognized. Before the first residual block, the network per-

forms a 3× 3 convolution followed by a 2× 2 average pooling layer with stride 2.

After these layers three residual blocks with two 3×3 convolutions, each followed

by batch normalization [IS15b], are used. The number of convolutional filters is

32, 48 and 48 respectively and ReLU is used as the activation function for each

convolutional layer. A 2 × 2 max-pooling with stride 2 follows after the second

residual block. The last residual block is followed by a 5×5 average pooling layer,

and a BLSTM follows this layer with 256 hidden units. For each time step of the

BLSTM, a fully connected layer with 6 hidden units follows. This layer predicts

the affine transformation matrix, that is used to generate the sampling grid for

the bilinear interpolation. As rectification of scene text is beyond the scope of

this work we disabled skew and rotation in the affine transformation matrices

by setting the corresponding parameters to 0. We will discuss the rectification

capabilities of Spatial Transformers for scene text detection in our future work.
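A tiny sketch of this constraint (our own illustration, not code from the thesis): the skew/rotation entries of the predicted 2×3 affine matrix are simply forced to zero, so that only scaling and translation remain.

    import numpy as np

    def constrain_to_scale_and_translation(theta):
        """Zero the skew/rotation entries of a 2x3 affine matrix, keeping only
        horizontal/vertical scaling and translation (illustrative assumption)."""
        theta = theta.copy()
        theta[0, 1] = 0.0
        theta[1, 0] = 0.0
        return theta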

5.3.1.2 Recognition Network

The inputs to the recognition network are N crops from the original input image

that represent the text regions found by the localization network. The recognition

network has the same structure as the localization network, but the number of

convolutional filters is higher. The number of convolutional filters is 32, 64 and

128 respectively. Depending on the experiment we either used an ensemble of

T independent softmax classifiers as used in [GBI+14] and [JVZ14a], where T

is the maximum length that a word may have, or we used CTC with best path

decoding as used in [HHQ+16] and [SBY16].

Method                 ICDAR 2013   SVT    IIIT5K
PhotoOCR [BCNN13b]     87.6         78.0   -
CharNet [JVZ14b]       81.8         71.7   -
DictNet* [JSVZ15]      90.8         80.7   -
CRNN [SBY16]           86.7         80.8   78.2
RARE [SWL+16]          87.5         81.9   81.9
Ours                   90.3         79.8   86

Table 5.1: Recognition accuracies on the ICDAR 2013, SVT and IIIT5K robust reading
benchmarks. Here we only report results that do not use per image lexicons. (*[JSVZ15]
is not lexicon-free in the strict sense as the outputs of the network itself are constrained to
a 90k dictionary.)

5.3.1.3 Implementation

We implemented all our experiments using MXNet [CLL+15]. We conducted

all our experiments on a workstation which has an Intel(R) Core(TM) i7-6900K

CPU, 64 GB RAM and 4 TITAN X (Pascal) GPUs.

5.3.2 Experiments on Robust Reading Datasets

In our next experiments, we used datasets where text regions are already cropped

from the input images. We wanted to see whether our text localization network

can be used as an intelligent sliding window generator that adapts to irregularities

of the text in the cropped text region. Therefore we trained our recognition model

using CTC on a dataset of synthetic cropped word images that we generated

using our data generator, which works similarly to the data generator introduced by

Jaderberg [JSVZ14].

In Table 5.1 we report the recognition results of our model on the ICDAR

2013 robust reading [KSU+13], the Street View Text (SVT) [WBB11] and the

IIIT5K [MAJ12] benchmark datasets. For evaluation on the ICDAR 2013 and

SVT datasets, we filtered all images that contain non-alphanumeric characters

and discarded all images that have less than 3 characters as done in [SWL+16,

WBB11]. We obtained our final results by post-processing the predictions using

the standard hunspell English (en-US) dictionary.

Figure 5.1: Samples from ICDAR, SVT and IIIT5K datasets that show how well our
model finds text regions and is able to follow the slope of the words.

Overall we find that our model achieves state-of-the-art performance for un-

constrained recognition models on the ICDAR 2013 and IIIT5K datasets and

competitive performance on the SVT dataset. In Figure 5.1 we show that our

model learns to follow the slope of the individual text regions, proving that our

model intelligently produces sliding windows.


6

Learning Binary Neural

Networks with BMXNet

In this work, we developed an open source Binary Neural Network (BNN) library based

on Apache MXNet. The implemented approaches can drastically reduce memory

size and accesses of a DL model, and replace the arithmetic operations by bit-

wise operations. This significantly improves the efficiency and lowers the energy

consumption at runtime, which enables the application of state-of-the-art deep

learning models on low power devices.

We further worked on increasing our understanding of the training process of

BNNs. We systematically evaluated different network architectures and hyper-

parameters to provide useful insights on how to train a binary neural network

based on BMXNet. Further, we present how we improved accuracy by increasing

the number of connections through the network.

6.1 Contribution to the Work

• Main contributor to the formulation and implementation of research ideas

• Main contributor of the conceptual and technical implementation

• Core maintainer of the software project

• Guidance and supervision of the further technical implementation


6.2 Manuscript


BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet

Haojin Yang, Martin Fritzsche, Christian Bartz, Christoph Meinel
Hasso Plattner Institute (HPI), University of Potsdam, Germany

Potsdam D-14480
{haojin.yang,christian.bartz,meinel}@hpi.de

[email protected]

ABSTRACT

Binary Neural Networks (BNNs) can drastically reduce memory size and accesses by applying bit-wise operations instead of standard arithmetic operations. Therefore they can significantly improve the efficiency and lower the energy consumption at runtime, which enables the application of state-of-the-art deep learning models on low power devices. BMXNet is an open-source BNN library based on MXNet, which supports both XNOR-Networks and Quantized Neural Networks. The developed BNN layers can be seamlessly applied with other standard library components and work in both GPU and CPU mode. BMXNet is maintained and developed by the multimedia research group at Hasso Plattner Institute and released under Apache license. Extensive experiments validate the efficiency and effectiveness of our implementation. The BMXNet library, several sample projects, and a collection of pre-trained binary deep models are available for download at https://github.com/hpi-xnor

CCS CONCEPTS

• Software and its engineering → Software libraries and repositories; • Computer systems organization → Neural networks; • Computing methodologies → Computer vision;

KEYWORDS

Open Source, Computer Vision, Binary Neural Networks, Machine Learning

ACM Reference format:
Haojin Yang, Martin Fritzsche, Christian Bartz, Christoph Meinel. 2017. BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet. In Proceedings of MM '17, Mountain View, CA, USA, October 23–27, 2017, 4 pages.
https://doi.org/10.1145/3123266.3129393

1 INTRODUCTION

In recent years, deep learning technologies achieved excellent performance and many breakthroughs in both academia and industry.


However, the state-of-the-art deep models are computationally expensive and consume large storage space. Deep learning is also strongly demanded by numerous applications from areas such as mobile platforms, wearable devices, autonomous robots and IoT devices. How to efficiently apply deep models on such low power devices becomes a challenging research problem. The recently introduced Binary Neural Networks (BNNs) could be one of the possible solutions for this problem.

Several approaches [4, 7, 13, 15] introduce the usage of BNNs. These BNNs have the capability of decreasing the memory consumption and computational complexity of the neural network. This is achieved on the one hand by storing the weights, that are typically stored as 32 bit floating point values, as binary values, by binarizing the floating point values with the sign function, to be of either {0, 1} or {−1, 1}, and storing several of them in a single 32 bit float or integer. Computational complexity, on the other hand, is reduced by using xnor and popcount for performing matrix multiplications used in convolutional and fully connected layers. Most of the publicly available implementations of BNNs do not store the weights in their binarized form [4, 7, 13, 15], nor use xnor and popcount [7, 15] while performing the matrix multiplications in convolutional and fully connected layers.

The deep learning library Tensorflow [8] tries to decrease the memory consumption and computational complexity of deep neural networks by quantizing the 32 bit floating point weights and inputs into 8 bit integers. Together with the minimum and maximum value of the weight/input matrix, 4× less memory usage and also decreased computational complexity is achieved, as all operations only need to be performed on 8 bit values rather than 32 bit values.

BMXNet stores the weights of convolutional and fully connected layers in their binarized format, which enables us to store 32/64 weights in a single 32/64 bit float/integer and use 32× less memory. During training and inference we binarize the input to each binary convolution and fully connected layer in the same way as the weights get binarized, and perform matrix multiplication using bit-wise operations (xnor and popcount). Our implementation is also prepared to use networks that store weights and use inputs with arbitrary bit widths as proposed by Zhou et al. [15].

The deep learning library MXNet [3] serves as a base for our code. MXNet is a high performance and modular deep learning library that is written in C++. MXNet provides bindings for other popular programming languages like Python, R, Scala and Go, and is used by a wide range of researchers and companies.


2 FRAMEWORK

BMXNet provides activation, convolution and fully connected layers that support quantization and binarization of input data and weights. These layers are designed as drop-in replacements for the corresponding MXNet variants and are called QActivation, QConvolution and QFullyConnected. They provide an additional parameter, act_bit, which controls the bit width the layers calculate with.

A Python example usage of our framework in comparison to MXNet is shown in Listing 1 and 2. We do not use binary layers for the first and last layer in the network, as we have confirmed the experiments of [13] showing that this greatly decreases accuracy. The standard block structure of a BNN in BMXNet is conducted as: QActivation-QConv/QFC-BatchNorm-Pooling, as shown in Listing 2.

Listing 1: LeNet

    def get_lenet():
        data = mx.symbol.Variable("data")
        # first conv layer
        conv1 = mx.sym.Convolution(...)
        tanh1 = mx.sym.Activation(...)
        pool1 = mx.sym.Pooling(...)
        bn1 = mx.sym.BatchNorm(...)
        # second conv layer
        conv2 = mx.sym.Convolution(...)
        bn2 = mx.sym.BatchNorm(...)
        tanh2 = mx.sym.Activation(...)
        pool2 = mx.sym.Pooling(...)
        # first fullc layer
        flatten = mx.sym.Flatten(...)
        fc1 = mx.symbol.FullyConnected(...)
        bn3 = mx.sym.BatchNorm(...)
        tanh3 = mx.sym.Activation(...)
        # second fullc
        fc2 = mx.sym.FullyConnected(...)
        # softmax loss
        lenet = mx.sym.SoftmaxOutput(...)
        return lenet

Listing 2: Binary LeNet

    def get_binary_lenet():
        data = mx.symbol.Variable("data")
        # first conv layer
        conv1 = mx.sym.Convolution(...)
        tanh1 = mx.sym.Activation(...)
        pool1 = mx.sym.Pooling(...)
        bn1 = mx.sym.BatchNorm(...)
        # second conv layer
        ba1 = mx.sym.QActivation(...)
        conv2 = mx.sym.QConvolution(...)
        bn2 = mx.sym.BatchNorm(...)
        pool2 = mx.sym.Pooling(...)
        # first fullc layer
        flatten = mx.sym.Flatten(...)
        ba2 = mx.symbol.QActivation(...)
        fc1 = mx.symbol.QFullyConnected(...)
        bn3 = mx.sym.BatchNorm(...)
        tanh3 = mx.sym.Activation(...)
        # second fullc
        fc2 = mx.sym.FullyConnected(...)
        # softmax loss
        lenet = mx.sym.SoftmaxOutput(...)
        return lenet

2.1 Quantization

The quantization on bit widths ranging from 2 to 31 bit is available for experiments with training and prediction, using low precision weights and inputs. The quantized data is still stored in the default 32 bit float values and the standard MXNet dot product operations are applied.

We quantize the weights following the linear quantization as shown by [15]. Equation 1 will quantize a real number input in the range [0, 1] to a number in the same range representable with a bit width of k bit.

quantize(input, k) = \frac{round((2^k - 1) \cdot input)}{2^k - 1}    (1)
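A direct numpy transcription of Equation 1 (an illustration, not BMXNet code):

    import numpy as np

    def quantize(x, k):
        """Linear quantization of Eq. (1): map a real input in [0, 1] to the nearest
        of the 2^k representable levels in the same range."""
        levels = 2 ** k - 1
        return np.round(levels * x) / levels

    x = np.array([0.0, 0.1, 0.33, 0.5, 0.9, 1.0])
    print(quantize(x, k=2))   # values snapped to the level set {0, 1/3, 2/3, 1}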

2.2 Binarization

The extreme case of quantizing to 1 bit wide values is the binarization. Working with binarized weights and input data allows for highly performant matrix multiplications by utilizing the CPU instructions xnor and popcount.

2.2.1 Dot Product with xnor and popcount. Fully connected and convolution layers heavily rely on dot products of matrices, which in turn require massive floating point operations. Most modern CPUs are optimized for these types of operations. But especially for real time applications on embedded or less powerful devices (cell phones, IoT devices) there are optimizations that improve performance, reduce memory and I/O footprint and lower power consumption [2].

To calculate the dot product of two binary matrices A ◦ B, no multiplication operation is required. The element-wise multiplication and summation of each row of A with each column of B can be approximated by first combining them with the xnor operation and then counting the number of bits set to 1 in the result, which is the population count [13].

Listing 3: Baseline xnor GEMM Kernel

    void xnor_gemm_baseline_no_omp(int M, int N, int K,
                                   BINARY_WORD *A, int lda,
                                   BINARY_WORD *B, int ldb,
                                   float *C, int ldc) {
        for (int m = 0; m < M; ++m) {
            for (int k = 0; k < K; k++) {
                BINARY_WORD A_PART = A[m*lda+k];
                for (int n = 0; n < N; ++n) {
                    C[m*ldc+n] += __builtin_popcountl(~(A_PART ^ B[k*ldb+n]));
                }
            }
        }
    }

We can approximate the multiplication and addition of two times 64 matrix elements in very concise processor instructions on x64 CPUs and two times 32 elements on x86 and ARMv7 processors. This is enabled by hardware support for the xnor and popcount operations. They translate directly into a single assembly command. The population count instruction is available on x86 and x64 CPUs supporting SSE4.2, while on ARM architecture it is included in the NEON instruction set.

An unoptimized GEMM (General Matrix Multiplication) implementation utilizing these instructions is shown in Listing 3. The compiler intrinsic __builtin_popcount is supported by both gcc and clang compilers and translates into the machine instruction on supported hardware. BINARY_WORD is the packed data type storing 32 (x86 and ARMv7) or 64 (x64) matrix elements, each represented by a single bit. We implemented several optimized versions of the xnor GEMM kernel. We leverage processor cache hierarchies by blocking and packing the data, and use unrolling and parallelization techniques.
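The bit-level trick can also be illustrated in a few lines of Python (an illustration of the idea, not BMXNet code): for vectors with entries in {−1, +1}, packed one element per bit, popcount(xnor(a, b)) counts the matching positions, and 2·matches − n recovers the ordinary dot product.

    import numpy as np

    def pack_bits(signs):
        """Pack a {-1, +1} vector into a single Python int, one bit per element."""
        bits = 0
        for i, s in enumerate(signs):
            if s > 0:
                bits |= 1 << i
        return bits

    def xnor_popcount_dot(a_bits, b_bits, n):
        """Dot product of two packed {-1, +1} vectors of length n via xnor + popcount."""
        mask = (1 << n) - 1
        matches = bin(~(a_bits ^ b_bits) & mask).count("1")   # popcount of the xnor result
        return 2 * matches - n                                # map [0, n] back to [-n, n]

    rng = np.random.default_rng(0)
    n = 64
    a = rng.choice([-1, 1], size=n)
    b = rng.choice([-1, 1], size=n)
    assert xnor_popcount_dot(pack_bits(a), pack_bits(b), n) == int(a @ b)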

2.2.2 Training. We carefully designed the binarized layers (utilizing xnor and population count operations) to exactly match the output of the built-in layers of MXNet (computing with BLAS dot product operations) when limiting those to the discrete values -1 and +1. This enables massively parallel training with GPU support by utilizing CuDNN on high performance clusters. The trained model can then be used on less powerful devices where the forward pass for prediction will calculate the dot product with the xnor and popcount operations instead of multiplication and addition.

The possible values after performing an xnor and popcount matrix multiplication A(m×n) ◦ B(n×k) are in the range [0, +n] with step size 1, whereas a normal dot product of matrices limited to the discrete values -1 and +1 will be in the range [−n, +n] with step size 2. To enable GPU supported training we modify the training process.


Figure 1: Processing time comparison of GEMM methods

Figure 2: Speedup comparison based on naive gemm method by varying filter number of the convolution layer. The input channel size is fixed to 256 while the kernel size and batch size are set to 5×5 and 200 respectively.

After calculation of the dot product we map the result back to the range [0, +n] to match the xnor dot product, as in Equation 2.

output_xnor_dot = (output_dot + n) / 2    (2)
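To make the mapping in Equation 2 concrete, here is a small standalone C sketch (illustrative only, not part of BMXNet): for two length-n vectors over {-1, +1}, (output_dot + n)/2 equals the number of positions in which the operands agree, which is exactly what xnor followed by popcount computes.

    #include <stdio.h>

    int main(void) {
        /* Toy example with n = 4 and values restricted to -1/+1. */
        const int n = 4;
        int a[4] = {+1, -1, +1, +1};
        int b[4] = {+1, +1, -1, +1};

        int dot = 0, agree = 0;
        for (int i = 0; i < n; ++i) {
            dot += a[i] * b[i];               /* ordinary dot product        */
            agree += (a[i] == b[i]) ? 1 : 0;  /* what xnor + popcount counts */
        }

        /* Equation 2: map the dot product back to the xnor/popcount range. */
        int mapped = (dot + n) / 2;
        printf("dot = %d, mapped = %d, xnor-popcount = %d\n", dot, mapped, agree);
        /* Prints: dot = 0, mapped = 2, xnor-popcount = 2 */
        return 0;
    }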

2.2.3 Model Converter. After training a network with BMXNet, the weights are stored in 32 bit float variables. This is also the case for networks trained with a bit width of 1 bit. We provide a model converter¹ that reads in a binary trained model file and packs the weights of QConvolution and QFullyConnected layers. After this conversion only 1 bit of storage and runtime memory is used per weight. A ResNet-18 network with full precision weights has a size of 44.7MB. The conversion with our model converter achieves 29× compression resulting in a file size of 1.5MB (cf. Table 1).
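The packing idea behind the converter can be sketched as follows (a simplified illustration written for this text; the helper pack_signs is invented here and is not the actual converter code): every group of 64 full-precision weights collapses into a single 64-bit word, which is where the roughly 29× reduction in file size comes from.

    #include <stdint.h>
    #include <stdio.h>

    /* Pack 64 full-precision weights into one 64-bit word:
       bit i is set to 1 if weights[i] >= 0 (binarizes to +1), else 0. */
    static uint64_t pack_signs(const float *weights) {
        uint64_t word = 0;
        for (int i = 0; i < 64; ++i)
            if (weights[i] >= 0.0f)
                word |= (uint64_t)1 << i;
        return word;
    }

    int main(void) {
        float w[64];
        for (int i = 0; i < 64; ++i)
            w[i] = (i % 3 == 0) ? 0.7f : -0.2f;   /* dummy trained weights */

        uint64_t packed = pack_signs(w);
        printf("64 floats (256 bytes) packed into 8 bytes: %#llx\n",
               (unsigned long long)packed);
        return 0;
    }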

3 EVALUATION
In this section we report the evaluation results of both efficiency analysis and classification accuracy over MNIST [12], CIFAR-10 [11] and ImageNet [5] datasets using BMXNet.

1 https://github.com/hpi-xnor/BMXNet/tree/master/smd_hpi/tools/model-converter

Figure 3: Speedup comparison based on naive gemm method by varying kernel size of the convolution layer. The input channel size, batch size and filter number are set to 256, 200 and 64 respectively.

3.1 Efficiency Analysis
All the experiments in this section have been performed on Ubuntu 16.04/64-bit platform with Intel 2.50GHz × 4 CPU with popcnt instruction (SSE4.2) and 8G RAM.

In the current deep neural network implementations, most of the fully connected and convolution layers are implemented using GEMM. According to the evaluation result from [9], over 90% of the processing time of the Caffe-AlexNet [10] model is spent on such layers. We thus conducted experiments to measure the efficiency of different GEMM methods. The measurements were performed within a convolution layer, where we fixed the parameters as follows: filter number=64, kernel size=5×5, batch size=200, and the matrix sizes M, N, K are 64, 12800, kernel_w × kernel_h × inputChannelSize, respectively. Figure 1 shows the evaluation results. The colored columns denote the processing time in milliseconds across varying input channel size; xnor_32 and xnor_64 denote the xnor_gemm operator in 32 bit and 64 bit; xnor_64_omp denotes the 64 bit xnor_gemm accelerated by using the OpenMP² parallel programming library; binarize input and xnor_64_omp further accumulates the processing time of input data binarization. From the results we can determine that xnor_64_omp achieved about 50× and 125× acceleration in comparison to Cblas (Atlas³) and the naive gemm kernel, respectively. By accumulating the binarization time of the input data we still achieved about 13× acceleration compared with the Cblas method.
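For reference, the matrix sizes above follow from the usual im2col lowering of a convolution to GEMM; the short sketch below (illustrative only, and assuming an 8×8 output feature map so that N matches the 12800 used in the measurement) reproduces the numbers:

    #include <stdio.h>

    /* GEMM dimensions of a convolution lowered via im2col (sketch):
       M = number of filters, K = kernel_h * kernel_w * input_channels,
       N = output positions per image * batch size. */
    int main(void) {
        int filters = 64, kernel = 5, in_channels = 256, batch = 200;
        int out_positions = 64;        /* output height * width, assumed 8x8 here */

        int M = filters;
        int K = kernel * kernel * in_channels;
        int N = out_positions * batch;

        printf("M = %d, N = %d, K = %d\n", M, N, K);  /* M = 64, N = 12800, K = 6400 */
        return 0;
    }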

Figures 2 and 3 illustrate the speedup achieved by varying filter number and kernel size based on the naive gemm method.

3.2 Classification Accuracy
We further conducted experiments with our BNNs on the MNIST, CIFAR-10 and ImageNet datasets. The experiments were performed on a work station which has an Intel(R) Core(TM) i7-6900K CPU, 64 GB RAM and 4 TITAN X (Pascal) GPUs.

By following the same strategy as applied in [7, 13, 15] we always avoid binarization at the first convolution layer and the last fully connected layer.

2 http://www.openmp.org/
3 http://math-atlas.sourceforge.net/


Table 1: Classification test accuracy of binary and full precision models trained on MNIST and CIFAR-10 dataset. No pre-training or data augmentation was used.

Dataset     Architecture   Test Accuracy (Binary/Full Precision)   Model Size (Binary/Full Precision)
MNIST       LeNet          0.97/0.99                               206kB/4.6MB
CIFAR-10    ResNet-18      0.86/0.90                               1.5MB/44.7MB

Table 2: Classification test accuracy of binary, partially binarized and full precision models trained on ImageNet. ResNet-18 architecture was used in the experiment.

Full Precision Stage   Val-acc-top-1   Val-acc-top-5   Model Size
none                   0.42            0.66            3.6MB
1st                    0.48            0.73            4.1MB
2nd                    0.44            0.69            5.6MB
3rd                    0.49            0.73            11.3MB
4th                    0.47            0.71            36MB
1st, 2nd               0.49            0.73            6.2MB
All                    0.61            0.84            47MB

Table 1 depicts the classification test accuracy of our binary, as well as full precision models trained on MNIST and CIFAR-10. The table shows that the size of binary models is significantly reduced, while the accuracy is still competitive. Table 2 demonstrates the validation accuracy of our binary, partially-binarized and full precision models trained on ImageNet. The ResNet implementation in MXNet consists of 4 ResUnit stages; we thus also report the results of a partially-binarized model with specific full precision stages. The partially-binarized model with the first full precision stage shows a great accuracy improvement with a very minor model size increase, compared to the fully binarized model.

4 EXAMPLE APPLICATIONS
4.1 Python Scripts
The BMXNet repository [1] contains python scripts that can train and validate binarized neural networks. The script smd_hpi/examples/binary_mnist/mnist_cnn.py will train a binary LeNet [14] with the MNIST [12] data set. To train a network with the CIFAR-10 [11] or ImageNet [5] data set there is a python script based on the ResNet-18 [6] architecture. Find it at smd_hpi/examples/binary-imagenet1k/train_cifar10/train_[dataset].py. For further information and example invocation see the corresponding README.md.

4.2 Mobile Applications
4.2.1 Image Classification. The Android application android-image-classification and the iOS application ios-image-classification can classify the live camera feed based on a binarized ResNet-18 model trained on the ImageNet dataset.

4.2.2 Handwritten Digit Detection. The iOS application ios-mnist can classify handwritten numbers based on a binarized LeNet model trained on the MNIST dataset.

5 CONCLUSION
We introduced BMXNet, an open-source binary neural network implementation in C/C++ based on MXNet. The evaluation results show up to 29× model size saving and much more efficient xnor GEMM computation. In order to demonstrate the applicability we developed sample applications for image classification on Android as well as iOS using a binarized ResNet-18 model. Source code, documentation, pre-trained models and sample projects are published on GitHub [1].

REFERENCES
[1] 2017. BMXNet: an open-source binary neural network library. https://github.com/hpi-xnor. (2017).
[2] Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. 2016. YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights. In VLSI (ISVLSI), 2016 IEEE Computer Society Annual Symposium on. IEEE, 236–241.
[3] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. CoRR abs/1512.01274 (2015).
[4] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28. 3123–3131.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[7] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized Neural Networks. In Advances in Neural Information Processing Systems 29. 4107–4115.
[8] Google Inc. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org.
[9] Yangqing Jia. 2014. Learning Semantic Image Representations at a Large Scale. Ph.D. Dissertation. EECS Department, University of California, Berkeley.
[10] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
[11] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. 2014. CIFAR-10 (Canadian Institute for Advanced Research). (2014).
[12] Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. (2010). http://yann.lecun.com/exdb/mnist/
[13] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In Computer Vision – ECCV 2016. 525–542.
[14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[15] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv:1606.06160 [cs] (2016).


Learning to Train a Binary Neural Network

Joseph Bethge, Haojin Yang, Christian Bartz, Christoph Meinel

Hasso Plattner Institute, University of Potsdam, Germany
P.O. Box 900460, Potsdam D-14480

{joseph.bethge,haojin.yang,christian.bartz,meinel}@hpi.de

Abstract. Convolutional neural networks have achieved astonishing results in different application areas. Various methods which allow us to use these models on mobile and embedded devices have been proposed. Especially binary neural networks seem to be a promising approach for these devices with low computational power. However, understanding binary neural networks and training accurate models for practical applications remains a challenge. In our work, we focus on increasing our understanding of the training process and making it accessible to everyone. We publish our code and models based on BMXNet for everyone to use¹. Within this framework, we systematically evaluated different network architectures and hyperparameters to provide useful insights on how to train a binary neural network. Further, we present how we improved accuracy by increasing the number of connections in the network.

1 Introduction

Nowadays, significant progress through research is made towards automating different tasks of our everyday lives. From vacuum robots in our homes to entire production facilities run by robots, many tasks in our world are already highly automated. Other advances, such as self-driving cars, are currently being developed and depend on strong machine learning solutions. Further, more and more ordinary devices are equipped with embedded chips (with limited resources) for various reasons, such as smart home devices. Even operating systems and apps on smartphones adopt deep learning techniques for tackling several problems and will likely continue to do so in the future. All these devices have limited computational power, often while trying to achieve minimal energy consumption, and might provide future applications for machine learning.

Consider a fully automated voice controlled coffee machine that identifies users by their face and remembers their favorite beverage. The machine could be connected to a cloud platform which runs the machine learning models and stores user information. The machine transfers the voice or image data to the server for processing, and receives the action to take or which settings to load.

There are a few requirements for this setup, which can be enumerated easily: A stable internet connection with sufficient bandwidth is required. Furthermore, the users have to agree on sending the required data to the company hosting the cloud platform.

1 https://github.com/Jopyth/BMXNet



Table 1: Comparison of available implementations for binary neural networks

Title            GPU | CPU | Python API | C++ API | Save Binary Model | Deploy on Mobile | Open Source | Cross Platform
BNNs [1]         X X X
DoReFa-Net [2]   X X X X X
XNOR-Net [3]     X X
BMXNet [4]       X X X X X X X X

This not only requires trust from the users, but data privacy can be an issue, too, especially in other potential application areas, such as healthcare or finances.

All of these potential problems can be avoided by hosting the machine learning models directly on the coffee machine itself. However, there are other challenges, such as limited computational resources and limited memory, in addition to a possible reliance on battery power. We focus on solving these challenges by training a Binary Neural Network (BNN). In a BNN the commonly used full-precision weights of a convolutional neural network are replaced with binary weights. This results in a storage compression by a factor of 32× and allows for more efficient inference on CPU-only architectures. We discuss existing approaches, which have promising results, in Section 2. However, architectures, design choices, and hyperparameters are often presented without thorough explanation or experiments. Often, there is no source code for actual BNN implementations present (see Table 1). This makes follow-up experiments and building actual applications based on BNNs difficult.

Therefore we provide our insights on existing network architectures and parameter choices, while striving to achieve a better understanding of BNNs (Section 3). We evaluate these choices and our novel ideas based on the open source framework BMXNet [4]. We discuss the results of a set of experiments on the MNIST, CIFAR-10 and ImageNet datasets (Section 4). Finally, we examine future ideas, such as quantized neural networks, wherein the binary weights are replaced with lower precision floating point numbers (Section 5).

Summarized, our contributions presented in this paper are:

– We provide novel empirical proof for the choice of methods and parameters commonly used to train BNNs, such as how to deal with bottleneck architectures and the gradient clipping threshold.

– We found that dense shortcut connections can improve the classification accuracy of BNNs significantly and show how to create efficient models with this architecture.

– We offer our work as a contribution to the open source framework BMXNet [4], from which both academia and industry can take advantage. We share our code and the models developed in this paper for research use.

– We present an overview of the performance of commonly used network architectures with binary weights.


2 Related Work

In this section we first present two network architectures, Residual Networks [5] and Densely Connected Networks [6], which focus on increasing information flow through the network. Afterwards we give an overview of networks and techniques which were designed to allow execution on mobile or embedded devices.

Residual Networks [5] combine the information of all previous layers with shortcut connections, leading to increased information flow. This is done by adding identity connections to the outputs of previous layers together with the output of the current layer. Consequently, the shortcut connections add neither extra weights nor computational cost.

In Densely Connected Networks [6] the shortcut connections are instead built by concatenating the outputs of previous layers and the current layer. Therefore, new information gained in one layer can be reused throughout the entire depth of the network. To reduce the total model size, the original full-precision architecture includes a bottleneck design, which reduces the number of filters in transition layers. These effectively keep the network at a very small total size, even though the concatenation adds new information into the network every few layers.

There are two main approaches which allow for execution on mobile devices: On the one hand, information in a CNN can be compressed through compact network design. These designs rely on full-precision floating point numbers, but reduce the total number of parameters with a clever network design, while preventing loss of accuracy. On the other hand, information can be compressed by avoiding the common usage of full-precision floating point weights, which use 32 bit of storage. Instead, quantized floating-point numbers with lower precision (e.g. 8 bit of storage) or even binary (1 bit of storage) weights are used in these approaches.

We first present a selection of techniques which utilize the former method. The first of these approaches, SqueezeNet, was presented by Iandola et al. [7] in 2016. The authors replace a large portion of 3×3 filters with smaller 1×1 filters in convolutional layers and reduce the number of input channels to the remaining 3×3 filters for a reduced number of parameters. Additionally, they facilitate late downsampling to maximize their accuracy based on the lower number of weights. Further compression is done by applying deep compression [8] to the model for an overall model size of 0.5 MB.

A different approach, MobileNets, was implemented by Howard et al. [9]. They use a depth-wise separable convolution where convolutions apply a single 3×3 filter to each input channel. Then, a 1×1 convolution is applied to combine their outputs. Zhang et al. [10] use channel shuffling to achieve group convolutions in addition to depth-wise convolution. Their ShuffleNet achieves a comparably lower error rate for the same number of operations needed for MobileNets. These approaches reduce memory requirements, but still require GPU hardware for efficient training and inference. Specific acceleration strategies for CPUs still need to be developed for these methods.


In contrast to this, approaches which use binary weights instead of full-precision weights can achieve compression and acceleration. However, the drawback usually is a severe drop in accuracy. For example, the weights and activations in Binarized Neural Networks are restricted to either +1 or -1, as presented by Hubara et al. [1]. They further provide efficient calculation methods of the equivalent of a matrix multiplication by using XNOR and popcount operations. XNOR-Nets are built on a similar idea and were published by Rastegari et al. [3]. They include a channel-wise scaling factor to improve approximation of full-precision weights, but require weights between layers to be stored as full-precision numbers. Another approach, called DoReFa-Net, was presented by Zhou et al. [2]. They focus on quantizing the gradients together with different bit-widths (down to binary values) for weights and activations and replace the channel-wise scaling factor with one constant scalar for all filters. Another attempt to remove everything except binary weights is taken in ABC-Nets by Lin et al. [11]. This approach achieves a drop in top1-accuracy of only about 5% on the ImageNet dataset compared to a full-precision network using the ResNet architecture. They suggest to use between 3 to 5 binary weight bases to approximate full-precision weights, which increases model capacity, but also model complexity and size. Therefore finding a way to accurately train a binary neural network still remains an unsolved task.

3 Methodology

In alignment with our goal to contribute to open-source frameworks, we publish the code and models and offer them as a contribution to the BMXNet framework. A few implementation details are provided here. We use the sign function for activation (and thus transform from real-valued values into binary values):

sign(x) = +1 if x ≥ 0, −1 otherwise    (1)

The implementation uses a Straight-Through Estimator (STE) [12] which cancels the gradients when they get too large, as proposed by Hubara et al. [1]. Let c denote the objective function, r_i be a real number input, and r_o ∈ {−1, +1} a binarized output. Furthermore, t_clip is a threshold for clipping gradients. In previous works the clipping threshold was set to t_clip = 1 [1]. Then, the straight-through estimator is:

Forward: r_o = sign(r_i)        Backward: ∂c/∂r_i = (∂c/∂r_o) · 1_{|r_i| ≤ t_clip}    (2)
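A minimal numeric sketch of Equation 2 in C (written for illustration only, not framework code): the forward pass binarizes the input with the sign function, and the backward pass lets the incoming gradient pass unchanged only where |r_i| ≤ t_clip.

    #include <math.h>
    #include <stdio.h>

    /* Forward pass of the straight-through estimator: binarize the input. */
    static float ste_forward(float r_i) {
        return (r_i >= 0.0f) ? +1.0f : -1.0f;
    }

    /* Backward pass: pass the gradient through only if |r_i| <= t_clip. */
    static float ste_backward(float grad_r_o, float r_i, float t_clip) {
        return (fabsf(r_i) <= t_clip) ? grad_r_o : 0.0f;
    }

    int main(void) {
        float t_clip = 1.0f;                     /* threshold used in [1] */
        float inputs[3] = {0.3f, -0.6f, 1.7f};
        for (int i = 0; i < 3; ++i) {
            float r_o = ste_forward(inputs[i]);
            float g   = ste_backward(0.5f, inputs[i], t_clip);  /* dummy upstream grad */
            printf("r_i = %+.1f -> r_o = %+.0f, grad = %.1f\n", inputs[i], r_o, g);
        }
        /* The gradient for r_i = 1.7 is canceled because |1.7| > t_clip. */
        return 0;
    }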

Usually in full-precision networks a large amount of calculations is spent on calculating dot products of matrices, as is needed by fully connected and convolutional layers. The computational cost of binary neural networks can be highly reduced by using the XNOR and popcount CPU instructions. Both operations combined approximate the calculation of dot products of matrices.


That is because element-wise multiplication and addition of a dot product can be replaced with the XNOR instruction and then counting all bits which are set to 1 (popcount) [3]. Let x, w ∈ {−1, +1}^n denote the input and weights respectively (with n being the number of inputs). Then the matrix multiplication x · w can be replaced as follows:

x · w = 2 · bitcount(xnor(x, w)) − n    (3)
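The identity in Equation 3 can be checked with a few lines of standalone C (a sketch independent of the actual BMXNet kernels), encoding +1 as a set bit and -1 as a cleared bit of a packed word:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* n = 8 inputs/weights in {-1,+1}, encoded as bits: 1 -> +1, 0 -> -1. */
        const int n = 8;
        uint8_t x = 0xB5;   /* 10110101 */
        uint8_t w = 0xE1;   /* 11100001 */

        /* Reference dot product computed on the decoded -1/+1 values. */
        int dot = 0;
        for (int i = 0; i < n; ++i) {
            int xi = ((x >> i) & 1) ? +1 : -1;
            int wi = ((w >> i) & 1) ? +1 : -1;
            dot += xi * wi;
        }

        /* Equation 3: xnor the packed words, count the set bits, rescale. */
        uint8_t xnor = (uint8_t)~(x ^ w);               /* cast clears the high bits */
        int bitcount = __builtin_popcount(xnor);
        int dot_from_bits = 2 * bitcount - n;

        printf("dot = %d, 2*bitcount - n = %d\n", dot, dot_from_bits);  /* both are 2 */
        return 0;
    }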

Preliminary experiments showed that an implementation as custom CUDA kernels was slower than using the highly optimized cuDNN implementation. But the above simplification means that we can still use normal training methods with GPU acceleration. We simply need to convert weights from {−1, +1} to {0, 1} before deployment on a CPU architecture. Afterwards we can take advantage of the CPU implementation.

In the following sections we describe which parameters we evaluate and how we gain explanations about the whole system. First, we discuss common training parameters, such as including a scaling factor during training and the threshold for clipping the gradients. Secondly, we examine different deep neural network architectures, such as AlexNet [13], Inception [14,15], ResNet [5], DenseNet [6]. During this examination, we focus on the effect of reducing weights in favor of increasing the number of connections on the example of the DenseNet architecture. Thirdly, we determine the differences of learned features between binary neural networks and full-precision networks with feature visualization.

3.1 Network Architectures

Before thinking about model architectures, we must consider the main aspects which are necessary for binary neural networks. First of all, the information density is theoretically 32 times lower compared to full-precision networks. Research suggests that the difference between 32 bits and 8 bits seems to be minimal and 8-bit networks can achieve almost identical accuracy as full-precision networks [8]. However, when decreasing the bit-width to four or even one bit (binary), the accuracy drops significantly [1]. Therefore, the precision loss needs to be alleviated through other techniques, for example by increasing information flow through the network. This can be successfully done through shortcut connections, which allow layers later in the network to access information gained in earlier layers despite the information loss through binarization. These shortcut connections were proposed for full-precision model architectures in Residual Networks [5] and Densely Connected Networks [6] (see Fig. 1a, c).

Following the same idea, network architectures including bottlenecks are always a challenge to adopt. The bottleneck architecture reduces the number of filters and values significantly between the layers, resulting in less information flow through binary neural networks. Therefore we hypothesize that we either need to eliminate the bottleneck parts or at least increase the number of filters in these bottleneck parts for accurate binary neural networks to achieve best results (see Fig. 1b, d).


Fig. 1: Two (identical) building blocks of different network architectures. (a) The original ResNet design features a bottleneck architecture (the length of the bold black line represents the number of filters). A low number of filters reduces information capacity for binary neural networks. (b) A variation of the ResNet architecture without the bottleneck design. The number of filters is increased, but with only two convolutions instead of three. (c) The original DenseNet design with a bottleneck in the second convolution operation. (d) The DenseNet design without a bottleneck. The two convolution operations are replaced by one 3×3 convolution

To increase the information flow, the blocks which add or derive new features in ResNet and DenseNet (see Fig. 1) have to be modified. In full-precision networks, the size of such a block ranges from 64 to 512 for ResNet [5]. The authors of DenseNet call this parameter growth rate and set it to k = 32 [6]. Our preliminary experiments showed that reusing the full-precision DenseNet architecture for binary neural networks and only removing the bottleneck architecture does not achieve satisfactory performance. There are different possibilities to increase the information flow for a DenseNet architecture. The growth rate can be increased (e.g. k = 64, k = 128), we can use a larger number of blocks, or a combination of both (see Fig. 2). Both approaches add roughly the same amount of parameters to the network. It is not exactly the same, since other layers also depend on the growth rate parameter (e.g. the first fully-connected layer which also changes the size of the final fully-connected layer and the transition layers). Our hypothesis of favoring an increased number of connections over simply adding more weights indicates that in this case increasing the number of blocks should provide better results (or a reduction of the total number of parameters for equal model performance) compared to increasing the growth rate.

3.2 Common Hyperparameters

One technique which was used in binary neural networks before is a scaling factor [2,3]. The result of a convolution operation is multiplied by this scaling factor.


[Figure 2 diagrams: (a) one 3×3, 128 block; (b) two 3×3, 64 blocks; (c) four 3×3, 32 blocks]

Fig. 2: Different ways to extract information with 3×3 convolutions. (a) A large block which generates a high amount of features through one convolution. (b) Splitting one large block in two, which are half as large and generate half as many features respectively. This allows the features generated in the first block to be used by the second block. (c) This process can be repeated until a minimal desirable block size is found (e.g. 32 for binary neural networks)

This should help binary weights to act more similarly to full-precision weights, by increasing the value range of the convolution operation. However, this factor was applied in different ways. We evaluate whether this scaling factor proves useful in all cases, because it adds additional complexity to the computation and the implementation, in Section 4.1.

Another parameter specific to binary neural networks is the clipping threshold t_clip. The value of this parameter influences which gradients are canceled and which are not. Therefore the parameter has a significant influence on the training result, and we evaluated different values for this parameter (also in Section 4.1).

3.3 Visualization of Trained Models

We used an implementation of the deep dream visualization [16] to visualize what the trained models had learned (see Fig. 5). The core idea is a normal forward pass followed by specifying an optimization objective, such as maximizing a certain neuron, filter, layer, or class during the backward pass.

Another tool we used for visualization is VisualBackProp [17]. It uses the high certainty about relevancy of information of the later layers in the network together with the higher resolution of earlier layers to efficiently identify those parts in the image which contribute most to the prediction.


4 Experiments and Results

Following the structure of the previous section, we provide our experimental results to compare the various parameters and techniques. First, we focus on classification accuracy as a measure to determine which parameter choices are better. Afterwards, we examine the results of our feature visualization techniques.

4.1 Classification Accuracy

In this section we apply classification accuracy as the general measurement to evaluate the different architectures, hyperparameters etc. We use the MNIST [18], CIFAR-10 [19] and ImageNet [20] datasets in terms of different levels of task complexity. The experiments were performed on a work station which has an Intel(R) Core(TM) i9-7900X CPU, 64 GB RAM and 4×Geforce GTX1080Ti GPUs.

As a general experiment setup, we use full-precision weights for the first (often a convolutional) layer and the last layer (often a fully connected layer which has a number of output neurons equal to the number of classes) for all involved deep networks. We did not apply a scaling factor as proposed by Rastegari et al. in [3] in our experiments. Instead we examined a (similar) scaling factor method proposed by Zhou et al. [2]. However, as shown in our hyperparameter evaluation (page 12) we chose not to apply this scaling factor for our other experiments. Further, the results of a binary LeNet for the MNIST dataset and a binary DenseNet with 21 layers can be seen in Table 2.

Popular Deep Architectures. In this experiment our intention is to evaluate a selection of popular deep learning architectures by using binary weights and activations. We wanted to discover positive and negative design patterns with respect to training binary neural networks. The first experiment is based on AlexNet [13], InceptionBN [21] and ResNet [5] (see Table 3). Using the AlexNet architecture, we were not able to achieve similar results as presented by Rastegari et al. [3]. This might be due to us disregarding their scaling factor approach. Further, we were quite surprised that InceptionBN achieved even worse results than AlexNet. Our assumption for the bad result is that the Inception series applies “bottleneck” blocks intended to reduce the number of parameters and computational costs, which may negatively impact information flow.

Table 2: Evaluation of model performance on the MNIST and CIFAR-10 datasets.

Dataset     Architecture   Accuracy   Model Size (Binary/Full Precision)
MNIST       LeNet          99.3%      202KB/4.4MB
CIFAR-10    DenseNet-21    87.1%      1.9MB/51MB


Table 3: Classification accuracy (Top-1 and Top-5) of several popular deep learning architectures using binary weights and activations in their convolution and fully connected layers. Full-precision results are denoted with FP. ResNet-34-thin applies a lower number of filters (64, 64, 128, 256, 512), whereas ResNet-34-wide and ResNet-68-wide use a higher number of filters (64, 128, 256, 512, 1024).

Architecture            Top-1    Top-5    Epoch   Model Size (Binary/Full Precision)   Top-1 FP   Top-5 FP
AlexNet                 30.2%    54.0%    70      22MB/233MB                           62.5%      83.0%
InceptionBN             24.8%    48.0%    80      8MB/44MB                             -          92.1%
ResNet-18               42.0%    66.2%    37      3.4MB/45MB                           -          -
ResNet-18 (from [11])   42.7%    67.6%    -       -                                    -          -
ResNet-26 bottleneck    25.2%    47.1%    40      -                                    -          -
ResNet-34-thin          44.3%    69.1%    40      4.8MB/84MB                           78.2%      94.3%
ResNet-34-wide          54.0%    77.2%    37      15MB/329MB                           -          -
ResNet-68-wide          57.5%    80.3%    40      25MB/635MB                           -          -

With this idea, we continued the experiments with several ResNet models, and the results seem to verify our conjecture. If the ResNet architecture is used for full-precision networks, gradually increasing the width and depth of the network yields improvements in accuracy. On the contrary, when using binary neural networks, the bottleneck design seems to limit the performance, as is expected. We were not able to obtain higher accuracy with the ResNet-26 bottleneck architecture compared to ResNet-18. Additionally, if we only increase the depth, without increasing the number of filters, we were not able to obtain a significant increase in accuracy (ResNet-34-thin compared to ResNet-18). To test our theory that the bottleneck design hinders information flow, we enlarged the number of filters throughout the network from (64, 64, 128, 256, 512) to (64, 128, 256, 512, 1024). This achieves almost 10% top-1 accuracy gain in a ResNet architecture with 34 layers (ResNet-34-wide). Further improvements can be obtained by using ResNet-68-wide with both increased depth and width. This suggests that network width and depth should be increased simultaneously for best results.

We also conducted experiments on further architectures such as VGG-Net [22], Inception-ResNet [23] and MobileNet [9]. Although we applied batch normalization, the VGG-style networks with more than 10 layers have to be trained accumulatively (layer by layer), since the models did not achieve any result when we trained them from scratch. Other networks such as Inception-ResNet and MobileNet are also not appropriate for binary training due to their designed architecture (bottleneck design and models with a low number of filters). We assume that the shortcut connections of the ResNet architecture can keep the information flow unobstructed during training. This is why we could directly train a binary ResNet model from scratch without additional support. Judging from the results obtained in our experiment, we achieved the


Table 4: Classification accuracy comparison by using binary DenseNet and ResNet models for the ImageNet dataset. The amount of parameters is kept on a similar level for both architectures to verify that improvements are based on the increased number of connections and not an increase of parameters.

Architecture     Top-1    Top-5    Epoch   Model Size (FP)   Number of Parameters
DenseNet-21      50.0%    73.3%    49      44MB              11 498 086
ResNet-18        42.0%    66.2%    37      45MB              11 691 950
DenseNet-45      57.9%    80.0%    52      250MB             62 611 886
ResNet-34-wide   54.0%    77.2%    37      329MB             86 049 198
ResNet-68-wide   57.5%    80.3%    35      635MB             166 283 182

same level in terms of classification accuracy compared to the latest result from ABC-Net [11] (ResNet-18 result with weight base 1 and activation base 1).

As we learned from the previous experiments, we consider the shortcut connections as a useful compensation for the reduced information flow. But this raised the following question: could we improve the model performance further by simply increasing the number of shortcut connections? To answer this question, we conducted further experiments based on the DenseNet [6] architecture.

Shortcut Connections Driven Accuracy Gain. In our first experiment we created binary models using both DenseNet and ResNet architectures with similar complexities. We keep the amount of parameters on a roughly equal level to verify that the improvements obtained by using the DenseNet architecture are coming from the increased number of connections and not a general increase of parameters. Our evaluation results show that these dense connections can significantly compensate for the information loss from binarization (see Table 4). The improvement gained by using DenseNet-21 compared to ResNet-18 is up to 8%², whereas the number of utilized parameters is even lower. Furthermore, when we compare binary ResNet-68-wide to DenseNet-45, the latter has less than half the number of parameters compared to the former, but can achieve a very similar result in terms of classification accuracy.

In our second set of experiments, we wanted to confirm our hypothesis that increasing the number of blocks is more efficient than just increasing block size on the example of a DenseNet architecture. We distinguish the four architectures through the main parameters relevant for this experiment: growth rate k per block, number of blocks per unit b, and total number of layers n, where n = 8·b + 5. The four architectures we are comparing are: DenseNet-13 (k = 256, b = 1), DenseNet-21 (k = 128, b = 2), DenseNet-37 (k = 64, b = 4), and DenseNet-69 (k = 32, b = 8).

2 We note that this is significantly more than the improvement between two full-precision models with a similar number of parameters (DenseNet-264 and ResNet-50), which is less than 2% (22.15% and 23.9% top-1 error rate, reported by [6]).


[Figure 3 plots: Top-1 and Top-5 accuracy over training epochs for DenseNet-13 (k=256, b=1; model 8.0 MB, 13.2 M weights), DenseNet-21 (k=128, b=2; model 6.5 MB, 9.7 M weights), DenseNet-37 (k=64, b=4; model 5.8 MB, 8.2 M weights), and DenseNet-69 (k=32, b=8; model 5.5 MB, 7.4 M weights)]

Fig. 3: Model performance of binary DenseNet models with different growth rates k and number of blocks b. Increasing b, while decreasing k leads to smaller models, without a significant decrease in accuracy, since the reduction of weights is compensated by increasing the number of connections

Despite a 31% reduction in model size between DenseNet-13 and DenseNet-69, the accuracy loss is only 1% (see Fig. 3). We further conclude that this similarity is not random, since all architectures perform very similarly over the whole training process. We note again that for all models the first convolutional layer and the final fully-connected layer use full-precision weights. Further, we set the size of the former layer depending on the growth rate k, with a number of filters equal to 2·k. Therefore, a large portion of the model size reduction comes from reducing the size of the first convolutional layer, which subsequently also reduces the size of the final fully connected layer.

However, a larger fully-connected layer could simply add additional duplicate or similar features, without affecting performance. This would mean that the reduction of model size in our experiments comes from a different independent variable. To eliminate this possibility, we ran a post-hoc analysis to check whether we can reduce the size of the first layer without impacting performance. We used DenseNet-13 with a reduced first layer, which has the same size as for DenseNet-69 (which uses k = 32), so 2·k = 64 filters. Even though the performance of the model is similar for the first few epochs, the accuracy does not reach comparable levels: after 31 epochs, its Top-1 accuracy is only 47.1% (2.1% lower) and its Top-5 accuracy is only 70.7% (2.1% lower). In addition to degrading the accuracy more than increasing connections would, it only reduces the model size by 6% (0.4 MB), since the transition layers are unchanged. This confirms our hypothesis that we can eliminate the usual reduction in accuracy of a binary neural network when reducing the number of weights by increasing the number of connections.

In summary, we have learned two important findings from the previous experiments for training an accurate binary network:

– Increasing information flow through the network improves the classification accuracy of a binary neural network.

– We found two ways to realize this: increase the network width appropriately while increasing depth, or increase the number of shortcut connections.


[Figure 4 plots: Top-5 accuracy over training epochs, (a) for clipping thresholds 0.1, 0.25, 0.5, 0.75, 1 and 2, (b) for ResNet-18 (N), ResNet-18 (FB) and ResNet-18 (B)]

Fig. 4: (a) Classification accuracy by varying gradient clipping threshold. The applied validation model is trained on ImageNet with ResNet-18 topology. (b) Accuracy evaluation by using scaling factor on network weights in three different modes: (N) no scaling, (B) use scaling factor on weights only in backward computation, and (FB) apply weight scaling in both forward and backward pass.

Specific Hyperparameter Evaluation. In this section we evaluated two specific hyperparameters for training a binary neural network: the gradient clipping threshold and the usage of a scaling factor.

Using a gradient clipping threshold t_clip was originally proposed by Hubara et al. [1], and reused in more recent work [3,11] (see Section 3.2). In short, when using the STE we only let the gradients pass through if the input r_i satisfies |r_i| ≤ t_clip. Setting t_clip = 1 is presented in the literature with only cursory explanation. Thus, we evaluated it by exploring a proper value range (see Fig. 4a). We used classification accuracy as the evaluation metric and selected thresholds from the value range of [0.1, 2.0] empirically. The validation model is trained on the ImageNet dataset with the ResNet-18 network architecture. From the results we can recognize that t_clip = 1 is suboptimal; the optimum is between 0.5 and 0.75. We thus applied t_clip = 0.5 to all other experiments in this paper.

Scaling factors have been proposed by Rastegari et al. [3]. In their work, the scaling factor is the mean of the absolute values of each output channel of weights. Subsequently, Zhou et al. [2] proposed a scaling factor which is intended to scale all filters instead of performing channel-wise scaling. The intuition behind both methods is to increase the value range of weights with the intention of solving the information loss problem during training of a binary network. We conducted an evaluation of accuracy in three running modes according to the implementation of Zhou et al. [2]: (N) no scaling, (B) use the scaling factor on weights only in backward computation, (FB) apply weight scaling in both forward and backward pass. The result indicates that no accuracy gain can be obtained by using a scaling factor on the ResNet-18 network architecture (see Fig. 4b). Therefore we did not apply a scaling factor in our other experiments.
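For illustration, the channel-wise scaling of [3] described above can be sketched as follows (a toy example written for this text, with invented weight values; not the code used in our experiments): α is the mean absolute value of one output channel's weights, and the binary convolution output of that channel would be multiplied by α to widen its value range.

    #include <math.h>
    #include <stdio.h>

    /* Channel-wise scaling factor as described for XNOR-Net [3]:
       alpha = mean of |w| over all weights of one output channel. */
    static float channel_scale(const float *w, int count) {
        float sum = 0.0f;
        for (int i = 0; i < count; ++i)
            sum += fabsf(w[i]);
        return sum / (float)count;
    }

    int main(void) {
        /* Dummy weights of one output channel (a flattened 3x3x1 filter). */
        float w[9] = {0.4f, -0.1f, 0.3f, -0.5f, 0.2f, -0.2f, 0.1f, 0.6f, -0.3f};
        float alpha = channel_scale(w, 9);
        printf("alpha = %f\n", alpha);   /* (0.4 + 0.1 + ... + 0.3) / 9 = 0.3 */
        return 0;
    }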


[Figure 5 panels: (a) DenseNet (FP), (b) DenseNet-21, (c) ResNet-18, (d) ResNet-68]

Fig. 5: The deep dream [16] visualization of binary models with different complexity and size (best viewed digitally with zoom). The DenseNet full precision model (a) is the only one which produces visualizations of animal faces and objects. Additional models and samples can be seen in the supplementary material

4.2 Visualization Results

To better understand the differences between binary and full-precision networks, and the various binary architectures, we created several visualizations. The results show that the full-precision version of the DenseNet captures overall concepts, since rough objects, such as animal faces, can be recognized in the DeepDream visualization (see Fig. 5a). The binary networks perform much worse. Especially the ResNet architecture (see Fig. 5c) with 18 layers seems to learn much more noisy and less coherent shapes. Further, we can see small and large areas of gray, which hints at the missing information flow in certain parts of the network. This most likely comes from the loss of information through binarization which stops neurons from activating. This issue is less visible for a larger architecture, but even there, small areas of gray appear (see Fig. 5d). However, the DenseNet architecture (see Fig. 5b) with 21 layers, which has a comparable number of parameters, produces more object-like pictures with less noise. The areas without any activations seem not to exist, indicating that the information can be passed through the network more efficiently in a binary neural network.

The visualization with VisualBackProp shows a similar difference in quality of the learned features (see Fig. 6). It reflects the parts of the image which contributed to the final prediction of the model. The visualization of a full-precision ResNet-18 clearly highlights the remarkable features of the classes to be detected (e.g. the outline of a lighthouse, or the head of a dog). In contrast, the visualization of a binary ResNet-18 only highlights small relevant parts of the image, and considers other less relevant elements in the image (e.g. a horizon behind a lighthouse).


Fig. 6: Two samples of the ImageNet dataset visualized with VisualBackProp of binary neural network architectures (from top to bottom): full-precision ResNet-18, binary ResNet-18, binary DenseNet-21. Each depiction shows (from left to right): original image, activation map, composite of both (best viewed digitally with zoom). Additional samples can be seen in the supplementary material

The binary DenseNet-21 model also achieves less clarity than the full-precision model, but highlights more of the relevant features (e.g. parts of the outline of a dog).

5 Conclusion

In this paper, we presented our insights on training binary neural networks. Our aim is to fill the information gap in theoretically designing binary neural networks by communicating our insights in this work and providing access to our code and models, which can be used on mobile and embedded devices. We evaluated hyperparameters, network architectures and different methods of training a binary neural network. Our results indicate that increasing the number of connections between layers of a binary neural network can improve its accuracy in a more efficient way than simply adding more weights.

Based on these results, we would like to explore more methods of increasing the number of connections in binary neural networks in our future work. Additionally, similar ideas for quantized networks can be explored, for example, how networks with multiple binary bases perform in comparison to quantized low bit-width networks. The information density should be equal in theory, but are there differences in practice when training these networks?


References

1. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Advances in Neural Information Processing Systems. (2016) 4107–4115
2. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. (2016) 1–14
3. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision, Springer (2016) 525–542
4. Yang, H., Fritzsche, M., Bartz, C., Meinel, C.: BMXNet: An open-source binary neural network implementation based on MXNet. In: Proceedings of the 2017 ACM on Multimedia Conference, ACM (2017) 1209–1212
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 770–778
6. Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Volume 1. (2017) 3
7. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. (2016) 1–13
8. Han, S., Mao, H., Dally, W.J.: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. (2015) 1–14
9. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. (2017)
10. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. (2017) 1–10
11. Lin, X., Zhao, C., Pan, W.: Towards accurate binary convolutional neural network. In: Advances in Neural Information Processing Systems. (2017) 344–352
12. Hinton, G.: Neural Networks for Machine Learning, Coursera. URL: http://coursera.org/course/neuralnets (last accessed 2018-03-13) (2012)
13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. (2012) 1097–1105
14. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions. In: CVPR (2015)
15. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Inception Architecture for Computer Vision. (2015)
16. Mordvintsev, A., Olah, C., Tyka, M.: Inceptionism: Going Deeper into Neural Networks. URL: https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html (last accessed 2018-03-13) (2015)
17. Bojarski, M., Choromanska, A., Choromanski, K., Firner, B., Jackel, L., Muller, U., Zieba, K.: VisualBackProp: efficient visualization of CNNs. (2016)
18. LeCun, Y., Cortes, C.: MNIST handwritten digit database. (2010)
19. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research). (2014)
20. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR09. (2009)
21. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. (2015) 448–456
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2015)
23. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI. Volume 4. (2017) 12

Back to Simplicity: How to Train Accurate BNNs from Scratch?

Joseph Bethge∗, Haojin Yang∗, Marvin Bornstein, Christoph Meinel
Hasso Plattner Institute, University of Potsdam, Germany

{joseph.bethge,haojin.yang,meinel}@hpi.de, {marvin.bornstein}@student.hpi.de

Abstract

Binary Neural Networks (BNNs) show promising progress in reducing computational and memory costs but suffer from substantial accuracy degradation compared to their real-valued counterparts on large-scale datasets, e.g., ImageNet. Previous work mainly focused on reducing quantization errors of weights and activations, whereby a series of approximation methods and sophisticated training tricks have been proposed. In this work, we make several observations that challenge conventional wisdom. We revisit some commonly used techniques, such as scaling factors and custom gradients, and show that these methods are not crucial in training well-performing BNNs. On the contrary, we suggest several design principles for BNNs based on the insights learned and demonstrate that highly accurate BNNs can be trained from scratch with a simple training strategy. We propose a new BNN architecture BinaryDenseNet, which significantly surpasses all existing 1-bit CNNs on ImageNet without tricks. In our experiments, BinaryDenseNet achieves 18.6% and 7.6% relative improvement over the well-known XNOR-Network and the current state-of-the-art Bi-Real Net in terms of top-1 accuracy on ImageNet, respectively. https://github.com/hpi-xnor/BMXNet-v2

1. Introduction

Convolutional Neural Networks have achieved state-of-the-art on a variety of tasks related to computer vision, for example, classification [17], detection [7], and text recognition [15]. By reducing memory footprint and accelerating inference, there are two main approaches which allow for the execution of neural networks on devices with low computational power, e.g. mobile or embedded devices: On the one hand, information in a CNN can be compressed through compact network design. Such methods use full-precision floating point numbers as weights, but reduce the total number of parameters and operations through clever network design, while minimizing loss of accuracy, e.g., SqueezeNet [13], MobileNets [10], and ShuffleNet [30].

∗Authors contributed equally

On the other hand, information can be compressed by avoiding the common usage of full-precision floating point weights and activations, which use 32 bits of storage. Instead, quantized floating-point numbers with lower precision (e.g. 4 bit of storage) [31] or even binary (1 bit of storage) weights and activations [12, 19, 22, 23] are used in these approaches. A BNN achieves up to 32× memory saving and 58× speedup on CPUs by representing both weights and activations with binary values [23]. Furthermore, computationally efficient, bitwise operations such as xnor and bitcount can be applied for convolution computation instead of arithmetical operations. Despite the essential advantages in efficiency and memory saving, BNNs still suffer from a noticeable accuracy degradation that prevents their practical usage. To improve the accuracy of BNNs, previous approaches mainly focused on reducing quantization errors by using complicated approximation methods and training tricks, such as scaling factors [23], multiple weight/activation bases [19], fine-tuning a full-precision model, multi-stage pre-training, or custom gradients [22]. These works applied well-known real-valued network architectures such as AlexNet, GoogLeNet or ResNet to BNNs without thorough explanation or experiments on the design choices. However, they don't answer the simple yet essential question: Are those real-valued network architectures seamlessly suitable for BNNs? Therefore, appropriate network structures for BNNs should be adequately explored.

In this work, we first revisit some commonly used techniques in BNNs. Surprisingly, our observations do not match conventional wisdom. We found that most of these techniques are not necessary to reach state-of-the-art performance. On the contrary, we show that highly accurate BNNs can be trained from scratch by “simply” maintaining rich information flow within the network. We present how increasing the number of shortcut connections improves the accuracy of BNNs significantly and demonstrate this by designing a new BNN architecture BinaryDenseNet. Without bells and whistles, BinaryDenseNet reaches state-of-the-art by using a standard training strategy which is much more efficient than previous approaches.



Table 1: A general comparison of the most related methods to this work. Essential characteristics such as value space of inputs and weights, numbers of multiply-accumulate operations (MACs), numbers of binary operations, theoretical speedup rate and operation types, are depicted. The results are based on a single quantized convolution layer from each work. β and α denote the full-precision scaling factor used in proper methods, whilst m, n, k denote the dimension of weight (W ∈ R^(n×k)) and input (I ∈ R^(k×m)). The table is adapted from [28].

Methods          Inputs        Weights          MACs      Binary Operations   Speedup   Operations
Full-precision   R             R                n×m×k     0                   1×        mul, add
BC [3]           R             {−1, 1}          n×m×k     0                   ∼2×       sign, add
BWN [23]         R             {−α, α}          n×m×k     0                   ∼2×       sign, add
TTQ [33]         R             {−αn, 0, αp}     n×m×k     0                   ∼2×       sign, add
DoReFa [31]      {0, 1}×4      {0, α}           n×k       8×n×m×k             ∼15×      and, bitcount
HORQ [18]        {−β, β}×2     {−α, α}          4×n×m     4×n×m×k             ∼29×      xor, bitcount
TBN [28]         {−1, 0, 1}    {−α, α}          n×m       3×n×m×k             ∼40×      and, xor, bitcount
XNOR [23]        {−β, β}       {−α, α}          2×n×m     2×n×m×k             ∼58×      xor, bitcount
BNN [12]         {−1, 1}       {−1, 1}          0         2×n×m×k             ∼64×      xor, bitcount
Bi-Real [22]     {−1, 1}       {−1, 1}          0         2×n×m×k             ∼64×      xor, bitcount
Ours             {−1, 1}       {−1, 1}          0         2×n×m×k             ∼64×      xor, bitcount

Summarized, our contributions in this paper are:

• We show that highly accurate binary models can be trained by using a standard training strategy, which challenges conventional wisdom. We analyze why applying common techniques (e.g., scaling methods, custom gradients, and fine-tuning a full-precision model) is ineffective when training from scratch and provide empirical proof.

• We suggest several general design principles for BNNs and further propose a new BNN architecture BinaryDenseNet, which significantly surpasses all existing 1-bit CNNs for image classification without tricks.

• To guarantee reproducibility, we contribute to an open source framework for BNN/quantized NN. We share the code and models implemented in this paper for classification and object detection. Additionally, we implemented the most influential BNNs including [12, 19, 22, 23, 31] to facilitate follow-up studies.

The rest of the paper is organized as follows: We de-scribe related work in Section 2. We revisit common tech-niques used in BNNs in Section 3. Section 4 and 5 presentour approach and the main result.

2. Related Work

In this section, we roughly divide the recent efforts for binarization and compression into three categories: (i) compact network design, (ii) networks with quantized weights, and (iii) networks with quantized weights and activations.

Compact Network Design. This sort of method uses full-precision floating-point numbers as weights but reduces the total number of parameters and operations through compact network design, while minimizing the loss of accuracy. The commonly used techniques include replacing a large portion of 3×3 filters with smaller 1×1 filters [13], using depth-wise separable convolutions to reduce operations [10], and utilizing channel shuffling to achieve group convolutions in addition to depth-wise convolutions [30]. These approaches still require GPU hardware for efficient training and inference. A strategy to accelerate the computation of all these methods on CPUs has yet to be developed.

Quantized Weights and Real-valued Activations. Recent efforts in this category include, for instance, BinaryConnect (BC) [3], Binary Weight Network (BWN) [23], and Trained Ternary Quantization (TTQ) [33]. In these works, the network weights are quantized to lower precision or even binary. Thus, considerable memory savings with relatively little accuracy loss have been achieved. However, no noteworthy acceleration can be obtained due to the real-valued inputs.

Quantized Weights and Activations. On the contrary, approaches adopting quantized weights and activations can achieve both compression and acceleration. Remarkable attempts include DoReFa-Net [31], High-Order Residual Quantization (HORQ) [18] and SYQ [6], which reported promising results on ImageNet [4] with 1-bit weights and multi-bit activations.

Binary Weights and Activations. BNN is the extreme case of quantization, where both weights and activations are binary. Hubara et al. proposed the Binarized Neural Network (BNN) [12], where weights and activations are restricted to +1 and −1. They provide efficient calculation methods for the equivalent of matrix multiplication by using xnor and bitcount operations. XNOR-Net [23] improved the performance of BNNs by introducing a channel-wise scaling factor to reduce the approximation error of full-precision parameters.

Table 2: The influence of using scaling, a full-precision downsampling convolution, and the approxsign function on the CIFAR-10 dataset, based on a binary ResNetE18. Using approxsign instead of sign slightly boosts accuracy, but only if training a model with scaling factors.

Use scaling of [23] | Downsampl. convolution | Use approxsign of [22] | Accuracy Top-1/Top-5
no | binary | yes | 84.9%/99.3%
no | binary | no | 87.2%/99.5%
no | full-precision | yes | 86.1%/99.4%
no | full-precision | no | 87.6%/99.5%
yes | binary | yes | 84.2%/99.2%
yes | binary | no | 83.6%/99.2%
yes | full-precision | yes | 84.4%/99.3%
yes | full-precision | no | 84.7%/99.2%

ABC-Nets [19] used multiple weight bases and activation bases to approximate their full-precision counterparts. Despite the promising accuracy improvement, the significant growth of weight and activation copies offsets the memory saving and speedup of BNNs. Wang et al. [28] attempted to use binary weights and ternary activations in their Ternary-Binary Network (TBN). They achieved a certain degree of accuracy improvement with more operations compared to fully binary models. In Bi-Real Net, Liu et al. [22] proposed several modifications to ResNet. They achieved state-of-the-art accuracy by applying an extremely sophisticated training strategy that consists of full-precision pre-training, multi-step initialization (ReLU→leaky clip→clip [21]), and custom gradients.

Table 1 gives a thorough overview of the recent efforts in this research domain. We can see that our work follows the same straightforward binarization strategy as BNN [12], which achieves the highest theoretical speedup rate and the highest compression ratio. Furthermore, we directly train a binary network from scratch by adopting a simple yet effective strategy.

3. Study on Common Techniques

In this section, to ease understanding, we first provide a brief overview of the major implementation principles of a binary layer (see the supplementary materials for more details). We then revisit three commonly used techniques in BNNs: scaling factors [23, 31, 28, 27, 18, 33, 19], full-precision pre-training [31, 22], and the approxsign function [22]. We did not observe the expected accuracy gains. We analyze why these techniques are not as effective as previously presented when training from scratch and provide empirical proof. The findings from this study motivate us to explore more effective solutions for training accurate BNNs.

Figure 1: An exemplary implementation shows that normalization minimizes the difference between a binary convolution with scaling (right column) and one without (middle column). In the top row, the columns from left to right respectively demonstrate the gemm results of full-precision, binary, and binary with scaling. The bottom row shows their results after normalization. Errors are the absolute differences between the full-precision and binary results. The results indicate that normalization dilutes the effect of scaling.

3.1. Implementation of Binary Layers

We apply the sign function for binary activation, thus transforming floating-point values into binary values:

\[ \mathrm{sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise.} \end{cases} \tag{1} \]

The implementation uses a Straight-Through Estimator (STE) [1] with the addition that it cancels the gradients when the inputs get too large, as proposed by Hubara et al. [12]. Let $c$ denote the objective function, $r_i$ a real-valued input, and $r_o \in \{-1, +1\}$ a binary output. Furthermore, $t_{\mathrm{clip}}$ is the threshold for clipping gradients, which was set to $t_{\mathrm{clip}} = 1$ in previous works [31, 12]. The resulting STE is then:

\[ \text{Forward:} \quad r_o = \mathrm{sign}(r_i). \tag{2} \]

\[ \text{Backward:} \quad \frac{\partial c}{\partial r_i} = \frac{\partial c}{\partial r_o}\,\mathbf{1}_{|r_i| \le t_{\mathrm{clip}}}. \tag{3} \]
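To make the estimator concrete, the following minimal NumPy sketch (not part of the original implementation) shows the forward pass of Equation (2) and the clipped straight-through backward pass of Equation (3) with $t_{\mathrm{clip}} = 1$:

```python
import numpy as np

def sign_forward(r_i):
    """Forward pass of Eq. (2): binarize real-valued inputs to {-1, +1}."""
    return np.where(r_i >= 0, 1.0, -1.0)

def sign_backward(grad_out, r_i, t_clip=1.0):
    """Backward pass of Eq. (3): pass the incoming gradient straight through,
    but cancel it wherever |r_i| exceeds the clipping threshold t_clip."""
    return grad_out * (np.abs(r_i) <= t_clip)

r = np.array([-1.7, -0.3, 0.0, 0.8, 2.1])
g = np.ones_like(r)              # incoming gradient dc/dr_o
print(sign_forward(r))           # [-1. -1.  1.  1.  1.]
print(sign_backward(g, r))       # [0. 1. 1. 1. 0.]
```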

3.2. Scaling Methods

Binarization always introduces an approximation error compared to a full-precision signal. In their analysis, Zhou et al. [32] show that this error linearly degrades the accuracy of a CNN.

Consequently, Rastegari et al. [23] propose to scale the output of the binary convolution by the average absolute weight value per channel ($\alpha$) and the average absolute activation over all input channels ($K$):

\[ x \ast w \approx \mathrm{binconv}(\mathrm{sign}(x), \mathrm{sign}(w)) \cdot K \cdot \alpha \tag{4} \]

Table 3: The influence of using scaling, a full-precision downsampling convolution, and the approxsign function on the ImageNet dataset, based on a binary ResNetE18.

Use scaling of [23] | Downsampl. convolution | Use approxsign of [22] | Accuracy Top-1/Top-5
no | binary | yes | 54.3%/77.6%
no | binary | no | 54.5%/77.8%
no | full-precision | yes | 56.6%/79.3%
no | full-precision | no | 58.1%/80.6%
yes | binary | yes | 53.3%/76.4%
yes | binary | no | 52.7%/76.1%
yes | full-precision | yes | 55.3%/78.3%
yes | full-precision | no | 55.6%/78.4%

The scaling factors should help binary convolutions to increase their value range, producing results closer to those of full-precision convolutions and reducing the approximation error. However, these different scaling values influence specific output channels of the convolution. Therefore, a BatchNorm [14] layer directly after the convolution (which is used in all modern architectures) theoretically minimizes the difference between a binary convolution with scaling and one without. Thus, we hypothesize that learning a useful scaling factor is made inherently difficult by BatchNorm layers. Figure 1 demonstrates an exemplary implementation of our hypothesis.

We empirically evaluated the influence of the scaling factors (as proposed by Rastegari et al. [23]) on the accuracy of our trained models based on the binary ResNetE architecture (see Section 4.2). First, the results of our CIFAR-10 [17] experiments verify our hypothesis that applying scaling when training a model from scratch does not lead to better accuracy (see Table 2). All models show a decrease in accuracy between 0.7% and 3.6% when applying scaling factors. Secondly, we evaluated the influence of scaling on the ImageNet dataset (see Table 3). The result is similar: scaling reduces model accuracy by 1.0% to 1.7%. We conclude that the BatchNorm layers following each convolution layer absorb the effect of the scaling factors. To avoid the additional computational and memory costs, we do not use scaling factors in the rest of the paper.
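The hypothesis can be illustrated with a small NumPy experiment of our own (the numbers are synthetic and not taken from Figure 1): a positive channel-wise scaling factor applied to the outputs of a binary convolution is removed almost exactly by the subsequent per-channel normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical binary-convolution outputs of one channel over a mini-batch
y = rng.integers(-9, 10, size=1000).astype(np.float64)
alpha = 0.37                      # hypothetical positive channel-wise scaling factor

def normalize(x, eps=1e-5):
    """Per-channel normalization as performed by BatchNorm (affine part omitted)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# a positive per-channel scale is removed (up to eps) by the normalization
print(np.allclose(normalize(y), normalize(alpha * y)))   # True
```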

3.3. Full-Precision Pre-Training

Fine-tuning a full-precision model to a binary one is beneficial only if it yields better results within a comparable total training time. We trained our binary ResNetE18 in three different ways: fully from scratch (1), and by fine-tuning a full-precision ResNetE18 with ReLU (2) or clip (proposed by [22]) (3) as activation function (see Figure 2). The full-precision trainings followed the typical configuration of

Figure 2: Top-1 validation accuracy per epoch of training a binary ResNetE18 from scratch (red, 40 epochs, Adam), and from full-precision pre-training (20 epochs, SGD) with clip (green) and ReLU (blue) as activation function; the final top-1 accuracies of the three runs are 57.0%, 56.3%, and 55.1%. The degradation peak of the green and blue curves at epoch 20 depicts a heavy "re-learning" effect when we start fine-tuning a full-precision model into a binary one.

momentum SGD with weight decay over 20 epochs with a learning rate decay of 0.1 after 10 and 15 epochs. For all binary trainings, we used Adam [16] without weight decay, with learning rate updates at epochs 10 and 15 for the fine-tuning and at epochs 30 and 38 for the full binary training. Our experiment shows that clip performs worse than ReLU for fine-tuning and in general. Additionally, training from scratch yields a slightly better result than training with pre-training. Pre-training inherently adds complexity to the training procedure, because the different architecture of binary networks does not allow the use of published ReLU models. Thus, we advocate avoiding the fine-tuning of full-precision models. Note that our observations are based on the architectures involved in this work. A more comprehensive evaluation of other networks remains future work.
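For readers who wish to reproduce the binary training schedule, the following Gluon-style sketch sets up Adam without weight decay and learning rate drops at epochs 30 and 38. The initial learning rate, the batch size, and the placeholder network are assumptions and not values stated in this section:

```python
import mxnet as mx
from mxnet import gluon

net = gluon.nn.Dense(10)          # placeholder model; a real binary network would go here
net.initialize()

batches_per_epoch = 5005          # assumption: ~1.28M ImageNet images / batch size 256
schedule = mx.lr_scheduler.MultiFactorScheduler(
    step=[30 * batches_per_epoch, 38 * batches_per_epoch],   # decay at epochs 30 and 38
    factor=0.1)

trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 0.002,             # assumed initial learning rate
                         'lr_scheduler': schedule,
                         'wd': 0.0})                         # no weight decay for binary training
```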

3.4. Backward Pass of the Sign Function

Liu et al. [22] claim that a differentiable approximation of the sign function, called approxsign, can be obtained by replacing the backward pass with

\[ \frac{\partial c}{\partial r_i} = \frac{\partial c}{\partial r_o}\,\mathbf{1}_{|r_i| \le t_{\mathrm{clip}}} \cdot \begin{cases} 2 - 2 r_i & \text{if } r_i \ge 0, \\ 2 + 2 r_i & \text{otherwise.} \end{cases} \tag{5} \]
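For comparison with the STE sketch above, a minimal NumPy version of the approxsign backward pass of Equation (5) could look as follows (again not taken from the released code):

```python
import numpy as np

def approxsign_backward(grad_out, r_i, t_clip=1.0):
    """Backward pass of Eq. (5): piecewise-linear approximation of the sign
    gradient (Bi-Real Net), clipped in the same way as the regular STE."""
    slope = np.where(r_i >= 0, 2.0 - 2.0 * r_i, 2.0 + 2.0 * r_i)
    return grad_out * (np.abs(r_i) <= t_clip) * slope

r = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
print(approxsign_backward(np.ones_like(r), r))   # [0. 1. 2. 1. 0.]
```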

Since this could also be beneficial when training a binary network from scratch, we evaluated it in our experiments. We compared the regular backward pass of sign with approxsign. First, the results of our CIFAR-10 experiments seem to depend on whether we use scaling or not. If we use scaling, both functions perform similarly (see Table 2). Without scaling, the approxsign function leads to less accurate models on CIFAR-10. In our experiments on ImageNet, the performance difference between the two functions is minimal (see Table 3). We conclude that the benefit of applying approxsign instead of the sign function seems to be specific to

Table 4: Comparison of our binary ResNetE18 model to state-of-the-art binary models using ResNet18 on the ImageNet dataset. The top-1 and top-5 validation accuracy are reported. For the sake of fairness we use the ABC-Net result with 1 weight base and 1 activation base in this table.

Downsampl. convolution | Size | Our result | Bi-Real [22] | TBN [28] | HORQ [18] | XNOR [23] | ABC-Net (1/1) [19]
full-precision | 4.0 MB | 58.1%/80.6% | 56.4%/79.5% | 55.6%/74.2% | 55.9%/78.9% | 51.2%/73.2% | n/a
binary | 3.4 MB | 54.5%/77.8% | n/a | n/a | n/a | n/a | 42.7%/67.6%

fine-tuning from full-precision models [22]. We thus do not use approxsign in the rest of the paper for simplicity.

4. Proposed Approach

In this section, we present several essential design principles for training accurate BNNs from scratch. We then apply our design philosophy to the binary ResNetE model, where we believe that the shortcut connections are essential for an accurate BNN. Based on the insights gained, we propose a new BNN model, BinaryDenseNet, which reaches state-of-the-art accuracy without tricks.

4.1. Golden Rules for Training Accurate BNNs

As shown in Table 4, with a standard training strategy our binary ResNetE18 model outperforms other state-of-the-art binary models that use the same network structure. We successfully train our model from scratch by following several general design principles for BNNs, summarized as follows:

• The core of our theory is maintaining a rich information flow through the network, which can effectively compensate the precision loss caused by quantization.

• Not all well-known real-valued network architectures can be seamlessly applied to BNNs. The network architectures from the category compact network design are not well suited for BNNs, since their design philosophies are mutually exclusive (eliminating redundancy ↔ compensating information loss).

• Bottleneck design [26] should be eliminated in your BNNs. We will discuss this in detail in the following paragraphs (also confirmed by [2]).

• Seriously consider using a full-precision downsampling layer in your BNNs to preserve the information flow.

• Using shortcut connections is a straightforward way to avoid bottlenecks of information flow, which is particularly essential for BNNs.

• To overcome bottlenecks of information flow, we should appropriately increase the network width (the dimension of the feature maps) while going deeper (e.g., see BinaryDenseNet37/37-dilated/45 in Table 7). However, this may introduce additional computational costs.

• The previously proposed complex training strategies, e.g., scaling factors, the approxsign function, and FP pre-training, are not necessary to reach state-of-the-art performance when training a binary model directly from scratch.

Before thinking about model architectures, we must consider the main drawbacks of BNNs. First of all, the information density is theoretically 32 times lower compared to full-precision networks. Research suggests that the difference between 32 bits and 8 bits is minimal and that 8-bit networks can achieve almost identical accuracy to full-precision networks [8]. However, when decreasing the bit-width to four or even one bit (binary), the accuracy drops significantly [12, 31]. Therefore, the precision loss needs to be alleviated through other techniques, for example by increasing the information flow through the network. We further describe three main methods in detail, which help to preserve information despite the binarization of the model:

First, a binary model should use as many shortcut connections as possible in the network. These connections allow layers later in the network to access information gained in earlier layers despite the precision loss through binarization. Furthermore, this means that increasing the number of connections between layers should lead to better model performance, especially for binary networks.

Secondly, network architectures including bottlenecks are always a challenge to adopt. The bottleneck design reduces the number of filters and values significantly between the layers, resulting in less information flow through BNNs. Therefore we hypothesize that we either need to eliminate the bottlenecks or at least increase the number of filters in these bottleneck parts for BNNs to achieve the best results.

The third way to preserve information comes from replacing certain crucial layers in a binary network with full-precision layers. The reasoning is as follows: If layers that do not have a shortcut connection are binarized, the information lost (due to binarization) cannot be recovered in subsequent layers of the network. This affects the first (convolutional) layer and the last layer (a fully connected layer whose number of output neurons equals the number of classes), as learned from previous work [23, 31, 22, 28, 12]. These layers generate the initial information for the network or consume the final information for the prediction, respectively. Therefore, as in previous work, the first and the final layer are always kept in full precision.

Figure 3: A single building block of different network architectures (the length of the bold black lines represents the number of filters). (a) The original ResNet design features a bottleneck architecture. A low number of filters reduces the information capacity for BNNs. (b) A variation of the ResNet without the bottleneck design. The number of filters is increased, but with only two convolutions instead of three. (c) The ResNet architecture with an additional shortcut, first introduced in [22]. (d) The original DenseNet design with a bottleneck in the second convolution operation. (e) The DenseNet design without a bottleneck. The two convolution operations are replaced by one 3×3 convolution. (f) Our suggested change to a DenseNet, where a convolution with N filters is replaced by two layers with N/2 filters each.

Another crucial part of deep networks is the downsampling convolution, which converts all previously collected information of the network into smaller feature maps with more channels (this convolution often has stride two and twice as many output channels as input channels). Any information lost in this downsampling process is effectively no longer available. Therefore, it should always be considered whether these downsampling layers should be kept in full precision, even though this slightly increases the model size and the number of operations.

4.2. ResNetE

ResNet combines the information of all previous layers with shortcut connections. This is done by adding the input of a block to its output with an identity connection. As suggested in the previous section, we remove the bottleneck of a ResNet block by replacing the three convolution layers (kernel sizes 1, 3, 1) of a regular ResNet block with two 3×3 convolution layers with a higher number of filters (see Figure 3a, b). We subsequently increase the number of connections by reducing the block size from two convolutions per block to one convolution per block, as inspired by [22].

Figure 4: The downsampling layers of ResNet, DenseNet and BinaryDenseNet. The bold black lines mark the downsampling layers which can be replaced with FP layers. If we use FP downsampling in a BinaryDenseNet, we increase the reduction rate to reduce the number of channels (the dashed lines depict the number of channels without reduction). We also swap the positions of the pooling and conv layers, which effectively reduces the number of MACs.

This leads to twice the number of shortcuts, as there are as many shortcuts as blocks if the number of layers is kept the same (see Figure 3c). However, [22] also incorporates other changes to the ResNet architecture. Therefore, we call this specific change in the block design ResNetE (short for Extra shortcut). The second change is using a full-precision downsampling convolution layer (see Figure 4a). In the following, we conduct an ablation study to measure the exact accuracy gain and the impact on the model size.
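The resulting block design can be sketched structurally in MXNet Gluon (the framework family underlying BMXNet). The `BinaryConv2D` class below is only a placeholder standing in for a real binary convolution with STE-based sign activations, and the exact ordering of sign, convolution, and BatchNorm may differ from the released models:

```python
from mxnet.gluon import nn

class BinaryConv2D(nn.Conv2D):
    """Placeholder for a binary convolution layer (e.g. as provided by BMXNet).
    It behaves like a regular Conv2D here; in a real BNN both its weights and
    the sign-activated inputs would be 1-bit and trained with the STE."""
    pass

class ResNetEBlock(nn.HybridBlock):
    """One ResNetE block (Figure 3c): a single binary 3x3 convolution wrapped
    by an identity shortcut, so there are as many shortcuts as convolutions."""
    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        self.conv = BinaryConv2D(channels, kernel_size=3, padding=1, use_bias=False)
        self.bn = nn.BatchNorm()

    def hybrid_forward(self, F, x):
        # sign -> binary conv -> BatchNorm; `channels` must match the input channels
        out = self.bn(self.conv(F.sign(x)))
        return out + x            # identity shortcut keeps earlier information alive
```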

We evaluated the difference between using binary and full-precision downsampling layers, which has often been ignored in the literature. First, we examine the results of binary ResNetE18 on CIFAR-10. Using full-precision downsampling instead of binary leads to an accuracy gain between 0.2% and 1.2% (see Table 2). However, the model size also increases from 1.39 MB to 2.03 MB, which is arguably too much for this minor increase in accuracy. Our results show a significant difference on ImageNet (see Table 3). The accuracy increases by 3% when using full-precision downsampling. Similar to CIFAR-10, the model size increases by 0.64 MB, in this case from 3.36 MB to 4.0 MB. The larger base model size makes the relative model size difference lower and provides a stronger argument for this trade-off. We conclude that the increase in accuracy is significant, especially for ImageNet.

Inspired by the achievement of binary ResNetE, we naturally explored the DenseNet architecture further, which is supposed to benefit even more from its densely connected layer design.

4.3. BinaryDenseNet

DenseNets [11] apply shortcut connections that, contrary to ResNet, concatenate the input of a block to its output (see Figure 3d, b).

Table 5: The performance of different BinaryDenseNet models when using different downsampling methods, evaluated on ImageNet.

Blocks, growth rate | Model size (binary) | Downsampl. convolution, reduction | Accuracy Top-1/Top-5
16, 128 | 3.39 MB | binary, low | 52.7%/75.7%
16, 128 | 3.03 MB | FP, high | 55.9%/78.5%
32, 64 | 3.45 MB | binary, low | 54.3%/77.3%
32, 64 | 3.08 MB | FP, high | 57.1%/80.0%

Therefore, new information gained in one layer can be reused throughout the entire depth of the network. We believe this is a significant characteristic for maintaining information flow. Thus, we construct a novel BNN architecture: BinaryDenseNet.

The bottleneck design and transition layers of the original DenseNet effectively keep the network at a smaller total size, even though the concatenation adds new information to the network in every layer. However, as previously mentioned, we have to eliminate bottlenecks for BNNs. The bottleneck design can be modified by replacing the two convolution layers (kernel sizes 1 and 3) with one 3×3 convolution (see Figure 3d, e). However, our experiments showed that the DenseNet architecture does not achieve satisfactory performance even after this change. This is due to the limited representation capacity of binary layers. There are different ways to increase this capacity. We can increase the growth rate parameter k, which is the number of newly concatenated features from each layer. We can also use a larger number of blocks. Both approaches individually add roughly the same number of parameters to the network. To keep the number of parameters equal for a given BinaryDenseNet, we can halve the growth rate and double the number of blocks at the same time (see Figure 3f), or vice versa. We assume that in this case increasing the number of blocks should provide better results compared to increasing the growth rate. This assumption is derived from our hypothesis: favoring an increased number of connections over simply adding weights.

Another characteristic difference of BinaryDenseNet compared to binary ResNetE is that the downsampling layer reduces the number of channels. To preserve the information flow in these parts of the network, we found two options: On the one hand, we can use a full-precision downsampling layer, similarly to binary ResNetE. Since the full-precision layer preserves more information, we can use a higher reduction rate for the downsampling layers. To reduce the number of MACs, we modify the transition block by swapping the positions of the pooling and convolution layers. We use MaxPool→ReLU→1×1-Conv instead of 1×1-Conv→AvgPool in the transition block (see Figure 4c, b).

Table 6: The accuracy of different BinaryDenseNet models obtained by successively splitting blocks, evaluated on ImageNet. As the number of connections increases, the model size (and number of binary operations) changes only marginally, but the accuracy increases significantly.

Blocks | Growth rate | Model size (binary) | Accuracy Top-1/Top-5
8 | 256 | 3.31 MB | 50.2%/73.7%
16 | 128 | 3.39 MB | 52.7%/75.7%
32 | 64 | 3.45 MB | 55.5%/78.1%

On the other hand, we can use a binary downsampling conv-layer instead of a full-precision layer, with a lower reduction rate or even no reduction at all. We coupled the decision whether to use a binary or a full-precision downsampling convolution with the choice of the reduction rate. The two variants we compare in our experiments (see Section 4.3.1) are thus called full-precision downsampling with high reduction (halve the number of channels in all transition layers) and binary downsampling with low reduction (no reduction in the first transition, divide the number of channels by 1.4 in the second and third transition).
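A structural sketch of one BinaryDenseNet unit and the full-precision transition with high reduction, under the same assumptions as the ResNetE sketch above (placeholder binary convolution; the layer ordering is not guaranteed to match the released models):

```python
from mxnet.gluon import nn

class BinaryConv2D(nn.Conv2D):
    """Placeholder binary convolution, same assumption as in the ResNetE sketch."""
    pass

class BinaryDenseUnit(nn.HybridBlock):
    """One BinaryDenseNet unit (Figure 3f): a binary 3x3 convolution producing
    `growth_rate` new feature maps that are concatenated to the unit's input."""
    def __init__(self, growth_rate, **kwargs):
        super().__init__(**kwargs)
        self.conv = BinaryConv2D(growth_rate, kernel_size=3, padding=1, use_bias=False)
        self.bn = nn.BatchNorm()

    def hybrid_forward(self, F, x):
        new_features = self.bn(self.conv(F.sign(x)))
        # dense connectivity: all earlier feature maps stay available downstream
        return F.concat(x, new_features, dim=1)

def fp_transition(out_channels):
    """Full-precision transition with high reduction (Figure 4c):
    MaxPool -> ReLU -> 1x1 convolution, which needs fewer MACs than the
    usual 1x1 convolution followed by average pooling."""
    block = nn.HybridSequential()
    block.add(nn.MaxPool2D(pool_size=2, strides=2),
              nn.Activation('relu'),
              nn.Conv2D(out_channels, kernel_size=1, use_bias=False))
    return block
```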

4.3.1 Experiment

Downsampling Layers. In the following we present our evaluation results for a BinaryDenseNet when using full-precision downsampling with high reduction versus binary downsampling with low reduction. The results of a BinaryDenseNet21 with growth rate 128 on CIFAR-10 show an accuracy increase of 2.7%, from 87.6% to 90.3%. The model size increases from 673 KB to 1.49 MB. This is an arguably sharp increase in model size, but the model is still smaller than a comparable binary ResNet18 while reaching a much higher accuracy. The results of two BinaryDenseNet architectures (16 and 32 blocks combined with growth rates 128 and 64, respectively) on ImageNet show an accuracy increase ranging from 2.8% to 3.2% (see Table 5). Further, because of the higher reduction rate, the model size decreases by 0.36 MB at the same time. This shows a higher effectiveness and efficiency of using an FP downsampling layer for a BinaryDenseNet compared to a binary ResNet.

Splitting Layers. We tested our proposed architecture change (see Figure 3f) by comparing BinaryDenseNet models with varying growth rates and numbers of blocks (and thus layers). The results show that increasing the number of connections by adding more layers rather than simply increasing the growth rate improves accuracy in an efficient way (see Table 6). Doubling the number of blocks and halving the growth rate leads to an accuracy gain ranging from 2.5% to 2.8%.

Table 7: Comparison of our BinaryDenseNet to state-of-the-art 1-bit CNN models on ImageNet.

Model size | Method | Top-1/Top-5 accuracy
∼4.0 MB | XNOR-ResNet18 [23] | 51.2%/73.2%
∼4.0 MB | TBN-ResNet18 [28] | 55.6%/74.2%
∼4.0 MB | Bi-Real-ResNet18 [22] | 56.4%/79.5%
∼4.0 MB | BinaryResNetE18 | 58.1%/80.6%
∼4.0 MB | BinaryDenseNet28 | 60.7%/82.4%
∼5.1 MB | TBN-ResNet34 [28] | 58.2%/81.0%
∼5.1 MB | Bi-Real-ResNet34 [22] | 62.2%/83.9%
∼5.1 MB | BinaryDenseNet37 | 62.5%/83.9%
∼5.1 MB | BinaryDenseNet37-dilated* | 63.7%/84.7%
7.4 MB | BinaryDenseNet45 | 63.7%/84.8%
46.8 MB | Full-precision ResNet18 | 69.3%/89.2%
249 MB | Full-precision AlexNet | 56.6%/80.2%

* BinaryDenseNet37-dilated is slightly different from the other models, as it applies dilated convolution kernels while the spatial dimension of the feature maps is unchanged in the 2nd, 3rd and 4th stage, which enables a broader information flow.

Since the training of a very deep BinaryDenseNet becomes slow (this is less of a problem during inference, since no additional memory is needed at inference time for storing intermediate results), we have not trained even more highly connected models, but we strongly suspect that this would increase accuracy even further. The total model size increases slightly, since every second half of a split block has slightly more inputs compared to those of a double-sized normal block. In conclusion, our technique of increasing the number of connections is highly effective and size-efficient for a BinaryDenseNet.

5. Main Results

In this section, we report our main experimental results on image classification and object detection using BinaryDenseNet. We further report the computation cost in comparison with other quantization methods. Our implementation is based on the BMXNet framework first presented by Yang et al. [29]. Our models are trained from scratch using a standard training strategy. Due to space limitations, more details of the experiments can be found in the supplementary materials.

Image Classification. To evaluate the classification accuracy, we report our results on ImageNet [4]. Table 7 shows the comparison of our BinaryDenseNet to state-of-the-art BNNs of different sizes. For this comparison, we chose growth and reduction rates for the BinaryDenseNet models to match the model size and complexity of the corresponding binary ResNet architectures as closely as possible. Our results show that BinaryDenseNet surpasses all existing 1-bit CNNs by a noticeable margin. In particular, BinaryDenseNet28, with 60.7% top-1 accuracy, is better than our binary ResNetE18 and achieves up to 18.6% and 7.6% relative improvement over the well-known XNOR-Net and the current state-of-the-art Bi-Real Net, even though they use a more complex training strategy and additional techniques, e.g., custom gradients and a scaling variant.

Table 8: Object detection performance (in mAP) of our BinaryDenseNet37/45 and other BNNs on the VOC2007 test set.

Method | Ours† (37/45) | TBN* (ResNet34) | XNOR-Net* (ResNet34)
Binary SSD | 66.4/68.2 | 59.5 | 55.1
Full-precision SSD512/Faster R-CNN/YOLO | 76.8/73.2/66.4 | – | –

* SSD300 result read from [28]; † SSD512 result.

Figure 5: The trade-off between top-1 validation accuracy on ImageNet and the number of operations. All the binary/quantized models are based on ResNet18 except BinaryDenseNet. Compared models: BinaryResNetE18 and BinaryDenseNet{28, 37, 45} (ours), XNOR-Net, Bi-Real Net, HORQ, TBN, ABC-Net {1/1, 5/5}, DoReFa (W:1, A:4), SYQ (W:1, A:8), and full-precision ResNet18.

Preliminary Result on Object Detection. We adopted the off-the-shelf toolbox Gluon-CV [9] for the object detection experiment. We changed the base model of the adopted SSD architecture [20] to BinaryDenseNet, trained our models on the combination of the PASCAL VOC2007 trainval and VOC2012 trainval sets, and tested on the VOC2007 test set [5]. Table 8 shows the results of the binary SSD as well as some FP detection models [20, 25, 24].

Efficiency Analysis. For this analysis, we adopted the same calculation method as [22]. Figure 5 shows that our binary ResNetE18 achieves higher accuracy at the same computational complexity compared to other BNNs, and that BinaryDenseNet28/37/45 achieve significant accuracy improvements with only a small additional computational overhead. For a more challenging comparison, we include models with 1-bit weights and multi-bit activations, DoReFa-Net (w:1, a:4) [31] and SYQ (w:1, a:8) [6], as well as a model with multiple weight and activation bases, ABC-Net {5/5}. Overall, our BinaryDenseNet models show superior performance when measuring both accuracy and computational efficiency.
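The operation counting behind Figure 5 can be sketched as follows, assuming (as in Bi-Real Net) that binary operations are roughly 64 times cheaper than floating-point operations on a 64-bit CPU; the concrete numbers below are purely illustrative and not measurements from the paper:

```python
def effective_ops(float_ops, binary_ops, speedup=64):
    """Combined operation count: binary operations are counted as ~64x cheaper
    than floating-point operations on a 64-bit CPU (assumption following
    Bi-Real Net), so OPs = FLOPs + BOPs / 64."""
    return float_ops + binary_ops / speedup

# purely illustrative numbers
print(effective_ops(float_ops=1.3e8, binary_ops=1.7e9))   # ~1.57e8 effective operations
```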

In closing, although the task is still arduous, we hope the ideas and results of this paper will provide new potential directions for the future development of BNNs.

References

[1] Y. Bengio, N. Léonard, and A. C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
[2] J. Bethge, H. Yang, C. Bartz, and C. Meinel. Learning to train a binary neural network. CoRR, abs/1809.10463, 2018.
[3] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303–338, June 2010.
[6] J. Faraone, N. Fraser, M. Blott, and P. H. Leong. SYQ: Learning symmetric quantization for efficient deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[8] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.
[9] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li. Bag of tricks for image classification with convolutional neural networks. arXiv preprint arXiv:1812.01187, 2018.
[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[11] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[12] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, 2016.
[13] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[15] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In Computer Vision – ECCV 2014, pages 512–528, Cham, 2014. Springer International Publishing.
[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[17] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10. URL http://www.cs.toronto.edu/kriz/cifar.html, 2010.
[18] Z. Li, B. Ni, W. Zhang, X. Yang, and W. Gao. Performance guaranteed network acceleration via high-order residual quantization. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[19] X. Lin, C. Zhao, and W. Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 344–352, 2017.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37, 2016.
[21] Z. Liu, W. Luo, B. Wu, X. Yang, W. Liu, and K. Cheng. Bi-Real Net: Binarizing deep network towards real-network performance. CoRR, abs/1811.01335, 2018.
[22] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In ECCV, September 2018.
[23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[24] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
[25] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91–99, 2015.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
[27] W. Tang, G. Hua, and L. Wang. How to train a compact binary neural network with high accuracy. In AAAI, 2017.
[28] D. Wan, F. Shen, L. Liu, F. Zhu, J. Qin, L. Shao, and H. Tao Shen. TBN: Convolutional neural network with ternary inputs and binary weights. In ECCV, September 2018.
[29] H. Yang, M. Fritzsche, C. Bartz, and C. Meinel. BMXNet: An open-source binary neural network implementation based on MXNet. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1209–1212. ACM, 2017.
[30] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[31] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[32] Y. Zhou, S.-M. Moosavi-Dezfooli, N.-M. Cheung, and P. Frossard. Adaptive quantization for deep neural network. 2017.
[33] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.


7

Image Captioner

In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM) model to address the image captioning problem. We demonstrate that Bi-LSTM models achieve state-of-the-art performance on both the caption generation and the image-sentence retrieval task. Our experiments also prove that multi-task learning is beneficial to increase model generality and gain performance. Our model significantly outperforms previous methods on the Pascal1K dataset.

7.1 Contribution to the Work

• Contributor to the formulation and implementation of research ideas

• Significantly contributed to the conceptual discussion and implementation.

• Guidance and supervision of the technical implementation

7.2 Manuscript

In addition to the manuscript, we prepared a demo video to demonstrate the proposed real-time system.1

1https://youtu.be/a0bh9_2LE24


Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning

CHENG WANG, HAOJIN YANG, and CHRISTOPH MEINEL, Hasso Plattner Institute, University of Potsdam

Generating a novel and descriptive caption of an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM (Long Short-Term Memory)) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent overfitting when training deep models. To understand how our models "translate" images to sentences, we visualize and qualitatively analyze the evolution of the Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: the Flickr8K, Flickr30K, MSCOCO, and Pascal1K datasets. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model). Our experiments also prove that multi-task learning is beneficial to increase model generality and gain performance. We also demonstrate that the transfer learning performance of the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.

CCS Concepts: • Computing methodologies → Natural language generation; Neural networks; Computer vision representations;

Additional Key Words and Phrases: Deep learning, LSTM, multimodal representations, image captioning, multi-task learning

ACM Reference format:
Cheng Wang, Haojin Yang, and Christoph Meinel. 2018. Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning. ACM Trans. Multimedia Comput. Commun. Appl. 14, 2s, Article 40 (April 2018), 20 pages. https://doi.org/10.1145/3115432

1 INTRODUCTION

It is challenging to describe an image using sentence-level captions (Karpathy and Li 2015; Karpathy et al. 2014; Kiros et al. 2014b; Kuznetsova et al. 2012, 2014; Mao et al. 2015; Socher et al. 2014; Vinyals et al. 2015), where the task is to map the input image to a sentence output that possesses its own structure. Inspired by the success of machine translation, which translates a source language to a target language, an image captioning system tries to "translate" an image to a sentence.


Fig. 1. Deep Multimodal Bidirectional LSTM. L1: sentence embedding layer. L2: Text-LSTM (T-LSTM) layer which receives text only. L3: Multimodal-LSTM (M-LSTM) layer which receives both image and text input. L4: Softmax layer. We feed the sentence in both forward (blue arrows) and backward (red arrows) order, which allows our model to summarize context information from both the left and right sides for generating a sentence word by word over time. Our model is end-to-end trainable by minimizing a joint loss.

It requires not only the recognition of visual objects in an image and the semantic interactions between objects, but also the ability to capture visual-language interactions and learn how to "translate" the visual understanding into sensible sentence descriptions. A general approach is to train a visual model using images and a language model using the provided captions. By learning a multimodal joint representation of images and captions, the semantic similarity of images and captions can be measured, and thus the most descriptive caption can be recommended for a given input image. The most important part at the center of this visual-language modeling is to capture the semantic correlations across the image and text modalities. While some previous works (Li et al. 2011; Kulkarni et al. 2013; Mitchell et al. 2012; Kuznetsova et al. 2012, 2014) have been proposed to address the problem of image captioning, they mostly use sentence templates, or treat image captioning as a retrieval task by ranking the best matching sentence in a database as the caption. Those approaches usually have difficulties in generating variable-length and novel sentences. Recent work (Karpathy and Li 2015; Karpathy et al. 2014; Kiros et al. 2014b; Mao et al. 2015; Socher et al. 2014; Vinyals et al. 2015) indicates that embedding vision and language into a common semantic space with a relatively shallow recurrent neural network (RNN) yields promising results.

In this work, we propose novel architectures to generate novel image descriptions. An overview of the architecture is shown in Figure 1. Different from previous approaches, we learn a visual-language space where sentence embeddings are encoded using a bidirectional Long Short-Term Memory (Bi-LSTM) network and visual embeddings are encoded with a Convolutional Neural Network (CNN). Typically, in unidirectional sentence generation, one general way of predicting the next word $w_t$ given the visual context $I$ and the textual history $w_{1:t-1}$ is to maximize $\log P(w_t \mid I, w_{1:t-1})$. While the unidirectional model includes past context, it is still limited in retaining the future context $w_{t+1:T}$ that can be used for reasoning about the previous word $w_t$ by maximizing $\log P(w_t \mid I, w_{t+1:T})$. The bidirectional model tries to overcome the shortcomings that each unidirectional (forward and backward direction) model suffers on its own and exploits both past and future dependence to give a prediction. As shown in Figure 2, two example images with bidirectionally generated sentences intuitively support our assumption that bidirectional captions are complementary; combining them can generate more sensible captions. Thus, our Bi-LSTM is able to summarize long-range visual-language interactions from the forward and backward directions.


Fig. 2. Illustration of generated captions. Two example images from the Flickr8K dataset and their best matching captions generated in forward order (blue) and backward order (red). Bidirectional models capture different levels of visual-language interactions (for more evidence see Section 4.7). The final caption is the sentence with the higher probability (histogram under the sentence). In both examples, the backward caption is selected as the final caption for the corresponding image.

Inspired by the architectural depth of the human brain, we also explore deeper bidirectional LSTM architectures to learn higher-level visual-language embeddings, where we increase the nonlinearity by adding a hidden-to-hidden transformation layer. All of our proposed models can be trained in an end-to-end way by optimizing a joint loss in the forward and backward directions. In addition, we employ multi-task learning (Caruana 1998) and transfer learning (Pan and Yang 2010) to increase the generality of the proposed method across different datasets.

The core contributions of this work are fourfold:

—We propose an end-to-end trainable multimodal bidirectional LSTM and its deeper variant models (see Section 3.3) that embed image and sentence into a high-level semantic space by exploiting both long-term history and future context. The code, networks, and examples for this work can be found in our GitHub repository.1

—We evaluate the effectiveness of the proposed models on three benchmark datasets: Flickr8K, Flickr30K, and MSCOCO. Our experimental results show that bidirectional LSTM models achieve highly competitive performance on caption generation (Section 4.6).

—We explore the generality of multi-task/transfer learning models on Pascal1K (Section 4.5). It demonstrates that transferring a multi-task joint model trained on Flickr8K, Flickr30K, and MSCOCO to Pascal1K is beneficial and performs significantly better than recent methods (see Section 4.6).

—We visualize the evolution of the hidden states of the bidirectional LSTM units to qualitatively analyze and understand how a sentence is generated, conditioned on visual context information over time (see Section 4.7).

The rest of the article is organized as follows. In Section 2, we review related work on image captioning using deep architectures. In Section 3, we introduce the proposed deep multimodal bidirectional LSTM for image captioning and explore its deeper variant models. Section 4 presents several groups of experiments to illustrate the effectiveness of the proposed methods. In Section 4.6, we compare our models with state-of-the-art methods; it shows that Bi-LSTM models achieve very competitive performance. In Section 4.7, we visualize the internal states of the LSTM hidden units and show how our methods generalize to new datasets with multi-task/transfer learning; we also provide some illustrative examples. Section 5 summarizes our methods and presents future work.

2 RELATED WORK

This section provides the related background. It starts by introducing the Recurrent Neural Network (RNN), which equips neural networks with memory, followed by a review of recently proposed approaches to the image captioning task.

1https://github.com/deepsemantic/image_captioning.


2.1 RNN

RNN is a powerful network architecture for processing sequential data. It has been widely used in natural language processing (Socher et al. 2011), speech recognition (Graves et al. 2013), and handwriting recognition (Graves et al. 2009) in recent years. An RNN allows cyclical connections and the reuse of weights across different instances of neurons, each of which is associated with a different timestep. This idea explicitly enables the network to learn the entire history of previous states and map it to the current state. With this property, an RNN is able to map an arbitrary-length sequence to a fixed-length vector.

LSTM (Long Short-Term Memory) (Hochreiter and Schmidhuber 1997) is a particular form of traditional RNN. Compared to a traditional RNN, an LSTM can learn the long-term dependencies between inputs and outputs; it can also effectively prevent backpropagation errors from vanishing or exploding. LSTM has recently gained increasing popularity in the fields of machine translation (Cho et al. 2014), speech recognition (Graves et al. 2013), and sequence learning (Sutskever et al. 2014). Another special type of RNN is the Gated Recurrent Unit (GRU) (Cho et al. 2014). GRU simplifies LSTM by removing the memory cell and provides a different way to prevent the vanishing gradient problem. GRU has recently been explored in language modeling (Chung et al. 2015), face aging (Wang et al. 2016a), face alignment (Wang et al. 2016b), and speech synthesis (Wu and King 2016). Motivated by those works, in the context of automatic image captioning, our networks build on the bidirectional LSTM in order to learn the long-term interactions across image and sentence from both history and future information.

2.2 Image Captioning

Multimodal representation learning (Ngiam et al. 2011; Srivastava and Salakhutdinov 2012; Wang et al. 2016c) has significant value in multimedia understanding and retrieval. The shared concept across modalities plays an important role in bridging the "semantic gap" of multimodal data (Rasiwasia et al. 2007; Yang et al. 2015, 2016). Image captioning falls into this general category of learning multimodal representations.

Recently, several approaches have been proposed for image captioning. We can roughly classify these methods into three categories. The first category is template-based approaches that generate caption templates by detecting objects and discovering attributes in an image. For example, the work of Li et al. (2011) parses a whole sentence into several phrases and learns the relationships between phrases and objects in an image. In Kulkarni et al. (2013), a conditional random field (CRF) was used to relate objects, attributes, and prepositions of the image content and predict the best label. Other similar methods were presented in Mitchell et al. (2012) and Kuznetsova et al. (2012, 2014). These methods are typically hand-designed and rely on a fixed template, which mostly leads to poor performance in generating variable-length sentences. The second category is retrieval-based approaches. This sort of method treats image captioning as a retrieval task by leveraging a distance metric to retrieve similarly captioned images, and then modifying and combining the retrieved captions to generate a caption (Kuznetsova et al. 2014). But these approaches generally need additional procedures, such as modification and generalization processes, to fit the image query.

Inspired by the recent success of CNNs (Krizhevsky et al. 2012; Zeiler and Fergus 2014) and RNNs (Mikolov et al. 2010, 2011; Bahdanau et al. 2015), the third category emerged as neural network based methods (Vinyals et al. 2015; Xu et al. 2015; Kiros et al. 2014b; Karpathy et al. 2014; Karpathy and Li 2015). Our work also belongs to this category. The work conducted by Kiros et al. (2014a) can be seen as pioneering work on using neural networks for image captioning with a multimodal neural language model. In their follow-up work (Kiros et al. 2014b), Kiros et al. introduced an encoder-decoder pipeline where a sentence is encoded by an LSTM and decoded with a structure-content neural language model (SC-NLM).


Socher et al. (2014) presented a DT-RNN (Dependency Tree-Recursive Neural Network) to embed a sentence into a vector space in order to retrieve images. Later on, Mao et al. (2015) proposed m-RNN, which replaces the feed-forward neural language model of Kiros et al. (2014b). Similar architectures were introduced in NIC (Vinyals et al. 2015) and LRCN (Donahue et al. 2015); both approaches use LSTM to learn the text context. But NIC only feeds visual information at the first timestep, while Mao et al. (2015) and LRCN (Donahue et al. 2015) consider the image context at each timestep. Another group of neural network based approaches was introduced in Karpathy et al. (2014) and Karpathy and Li (2015), where object detection with R-CNN (region-CNN) (Girshick et al. 2014) was used for inferring the alignment between image regions and descriptions.

Most recently, Fang et al. (2015) used multi-instance learning and a traditional maximum-entropy language model for image description generation. Chen and Zitnick (2015) proposed to learn visual representations with an RNN for generating image captions. Xu et al. (2015) introduced an attention mechanism of the human visual system into an encoder-decoder framework (Cho et al. 2015). It is shown that an attention model can visualize what the model "sees" and yields significant improvements in image caption generation. In You et al. (2016), the authors proposed a semantic attention model by combining top-down and bottom-up approaches in the framework of recurrent neural networks. In the bottom-up approach, semantic concepts or attributes are used as candidates. In the top-down approach, visual features are employed to guide where and when attention should be activated.

Unlike those models, our model directly assumes that the mapping relationship between vision and semantics is antisymmetric and dynamically learns long-term bidirectional and hierarchical visual-semantic interactions with deep LSTM models. This proves to be very effective for the generation and retrieval tasks, as we demonstrate in Section 4.

3 MODEL

In this section, we describe our multimodal Bi-LSTM model and explore its deeper variants. We first briefly introduce LSTM; the LSTM we use is described in Zaremba and Sutskever (2014).

3.1 Long Short-Term Memory

Our model builds on the LSTM cell. As shown in Figure 3, the reading and writing memory cell $c$ is controlled by a group of sigmoid gates. At a given timestep $t$, the LSTM receives inputs from different sources: the current input $x_t$, the previous hidden state of all LSTM units $h_{t-1}$, as well as the previous memory cell state $c_{t-1}$. The updating of these gates at timestep $t$ for the given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$ is as follows:

\[ i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \tag{1} \]
\[ f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \tag{2} \]
\[ o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \tag{3} \]
\[ g_t = \phi(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \tag{4} \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot g_t, \tag{5} \]
\[ h_t = o_t \odot \phi(c_t), \tag{6} \]

where, without considering the optional peephole connections, $W$ denotes the weight matrices learned by the network and $b$ the bias terms. $\sigma$ is the sigmoid activation function $\sigma(x) = \frac{1}{1+\exp(-x)}$ and $\phi$ denotes the hyperbolic tangent $\phi(x) = \frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}$. $\odot$ denotes the element-wise product with a gate value.


Fig. 3. Long Short-Term Memory (LSTM) cell. It consists of an input gate i, a forget gate f, a memory cell c, and an output gate o. The input gate decides whether to let an incoming signal go through to the memory cell or block it. The output gate can allow new output or prevent it. The forget gate decides whether to remember or forget the cell's previous state. Updating the cell state is performed by feeding the previous cell output back to itself through recurrent connections in two consecutive timesteps.

The LSTM hidden output $h_t = \{h_{tk}\}_{k=0}^{K}$, $h_t \in \mathbb{R}^K$, is used to predict the next word via a Softmax function with parameters $W_s$ and $b_s$:

\[ \mathcal{F}(p_{ti}; W_s, b_s) = \frac{\exp(W_s h_{ti} + b_s)}{\sum_{j=1}^{K} \exp(W_s h_{tj} + b_s)}, \tag{7} \]

where $p_{ti}$ is the probability distribution of the predicted word. Our key motivation for choosing LSTM is that it can learn long-term temporal dependencies and avoids the quickly exploding and vanishing gradient problems that a traditional RNN suffers from during backpropagation optimization.
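A compact NumPy sketch of one LSTM step following Equations (1) through (7); the dictionary-based parameter layout and the toy dimensions are ours and only serve to make the update rules explicit, not to reflect the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing Eqs. (1)-(6). W and b hold the gate parameters,
    e.g. W['xi'] maps the input to the input gate and W['hi'] the previous state."""
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])   # input gate,  Eq. (1)
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])   # forget gate, Eq. (2)
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])   # output gate, Eq. (3)
    g = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])   # candidate,   Eq. (4)
    c = f * c_prev + i * g                                   # cell update, Eq. (5)
    h = o * np.tanh(c)                                       # hidden out,  Eq. (6)
    return h, c

def next_word_distribution(h, W_s, b_s):
    """Softmax over the vocabulary, Eq. (7)."""
    logits = W_s @ h + b_s
    e = np.exp(logits - logits.max())
    return e / e.sum()

# toy dimensions: input size 4, hidden size 3, vocabulary size 5
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4) if k[0] == 'x' else (3, 3))
     for k in ['xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc']}
b = {k: np.zeros(3) for k in ['i', 'f', 'o', 'c']}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, b)
p = next_word_distribution(h, rng.normal(size=(5, 3)), np.zeros(5))
print(p.sum())   # 1.0
```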

3.2 Bidirectional LSTM

In order to make use of both the past and future context information of a word in sentence prediction, we propose a bidirectional model that feeds a sentence to the LSTM in forward and in backward order. Figure 1 presents the overview of our model; it is comprised of three modules: a CNN for encoding image inputs, a Text-LSTM (T-LSTM) for encoding sentence inputs, and a Multimodal LSTM (M-LSTM) for embedding visual and textual vectors into a common semantic space and decoding them into a sentence. The bidirectional LSTM is implemented with two separate LSTM layers for computing the forward hidden sequence →h and the backward hidden sequence ←h. The forward LSTM starts at time t = 1 and the backward LSTM starts at time t = T. Formally, our model works as follows: for a given raw image input I, forward-order sentence →S, and backward-order sentence ←S, the encoding is performed as

I_t = C(I; Θ_v),   →h^1_t = T(→E →S; →Θ_l),   ←h^1_t = T(←E ←S; ←Θ_l),   (8)

where C and T represent the CNN and the T-LSTM, respectively, and Θ_v, Θ_l are their corresponding weights. Following previous work (Mao et al. 2015; Donahue et al. 2015), I_t is considered at all timesteps as visual context information. →E and ←E are bidirectional embedding matrices learned by the network. The encoded visual and textual representations are then embedded into the multimodal LSTM by

→h^2_t = M(→h^1_t, I_t; →Θ_m),   ←h^2_t = M(←h^1_t, I_t; ←Θ_m),   (9)


Fig. 4. Illustrations of the proposed deep architectures for image captioning. The network in (a) is commonly used in previous work. (b) Our proposed Bidirectional LSTM (Bi-LSTM). (c) Our proposed Bidirectional Stacked LSTM (Bi-S-LSTM). (d) Our proposed Bidirectional LSTM with fully connected (FC) transition layer (Bi-F-LSTM). T-LSTM receives text input only and M-LSTM receives both image and text input.

where M denotes the M-LSTM with weights Θ_m. M aims to capture the correlation between the visual context and the words at different timesteps. We feed the visual vector I_t to the model at each timestep to capture strong visual-word correlations. On top of the M-LSTM are Softmax layers with parameters W_s and b_s, which compute the probability distribution of the next predicted word by

→p_{t+1} = F(→h^2_t; →W_s, →b_s),   ←p_{t+1} = F(←h^2_t; ←W_s, ←b_s),   (10)

where p ∈ R^K and K is the vocabulary size.
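The sketch below mirrors the structure of Equations (8)-(10) for a single direction, using PyTorch as a stand-in for the Caffe-based implementation actually used in this work; the class name, layer sizes, and the way the image feature is concatenated to the T-LSTM output are illustrative assumptions. The backward branch is obtained by running the same pipeline on the reversed sentence with its own set of parameters.

```python
import torch
import torch.nn as nn

class OneDirectionCaptioner(nn.Module):
    """One direction of the Bi-LSTM captioner: T-LSTM -> M-LSTM -> Softmax."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, img_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # E in Eq. (8)
        self.t_lstm = nn.LSTMCell(embed_dim, hidden_dim)              # T in Eq. (8)
        self.m_lstm = nn.LSTMCell(hidden_dim + img_dim, hidden_dim)   # M in Eq. (9)
        self.softmax_layer = nn.Linear(hidden_dim, vocab_size)        # W_s, b_s in Eq. (10)

    def forward(self, words, img_feat):
        # words: (T,) word indices of one sentence; img_feat: (img_dim,) CNN feature I_t
        h1 = c1 = h2 = c2 = torch.zeros(1, self.t_lstm.hidden_size)
        logits = []
        for w in words:
            x = self.embed(w.view(1))                        # word embedding
            h1, c1 = self.t_lstm(x, (h1, c1))                # text encoding, Eq. (8)
            m_in = torch.cat([h1, img_feat.view(1, -1)], dim=1)
            h2, c2 = self.m_lstm(m_in, (h2, c2))             # multimodal fusion, Eq. (9)
            logits.append(self.softmax_layer(h2))            # next-word scores, Eq. (10)
        return torch.softmax(torch.cat(logits), dim=1)        # (T, vocab_size)

# toy usage
model = OneDirectionCaptioner(vocab_size=100)
probs = model(torch.tensor([1, 5, 7, 2]), torch.randn(4096))
print(probs.shape)  # torch.Size([4, 100])
```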

3.3 Deeper LSTM Architecture

The recent success of deep CNNs in image classification and object detection (Krizhevsky et al. 2012; Simonyan and Zisserman 2014b) demonstrates that deep, hierarchical models can be more efficient at learning representations than shallower ones. This motivated our work to explore deeper LSTM architectures in the context of learning bidirectional visual-language embeddings. As claimed in Pascanu et al. (2013), if we consider LSTM as a composition of multiple hidden layers unfolded in time, LSTM is already a deep network. But this only increases the "horizontal depth", in which the network weights W are reused at each timestep, and it is limited in learning more representative features compared with increasing the "vertical depth" of the network. To design a deep LSTM, one straightforward way is to stack multiple LSTM layers as a hidden-to-hidden transition. Alternatively, instead of stacking multiple LSTM layers, we propose to add a multilayer perceptron (MLP) as an intermediate transition between LSTM layers. This can not only increase the LSTM network depth, but can also prevent the parameter size from growing dramatically, because the number of recurrent connections at a hidden layer can be largely decreased.

Directly stacking multiple LSTMs on top of each other leads to Bi-S-LSTM (Figure 4(c)). In addition, we propose to use a fully connected layer as an intermediate transition layer. Our motivation comes from the finding of Pascanu et al. (2013), in which DT(S)-RNN (deep transition RNN with shortcut) is designed by adding a hidden-to-hidden multilayer perceptron (MLP) transition; such a network is arguably easier to train. Inspired by this, we extend Bi-LSTM (Figure 4(b)) with a fully connected layer, which we call Bi-F-LSTM (Figure 4(d)); a shortcut connection between the input and hidden states is introduced to make the model easier to train. The aim of the extended models is to learn an extra hidden transition function F_h.


Fig. 5. Transition for Bi-S-LSTM (left) and Bi-F-LSTM (right).

Formally, in Bi-S-LSTM,

h^{l+1}_t = F_h(h^{l-1}_t, h^l_{t-1}) = U h^{l-1}_t + V h^l_{t-1},   (11)

where h^l_t denotes the hidden state of the l-th layer at time t, and U and V are matrices connected to the transition layer (also see Figure 5 (left)). For readability, we consider one-direction training and suppress bias terms. Similarly, in Bi-F-LSTM, we learn a hidden transition function F_h by

h^{l+1}_t = F_h(h^{l-1}_t) = φ_r(W h^{l-1}_t ⊕ V(U h^{l-1}_t)),   (12)

where ⊕ is the operator that concatenates h^{l-1}_t and its abstraction into a long hidden state (also see Figure 5 (right)), and φ_r is the rectified linear unit (ReLU) activation function of the transition layer, which performs φ_r(x) = max(0, x).
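The two transition functions can be sketched as follows. This is a PyTorch reading of Equations (11) and (12) rather than the original Caffe layers; in particular, the bottleneck width of the Bi-F-LSTM transition and the exact placement of the nonlinearity relative to the concatenation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class StackedTransition(nn.Module):
    """Hidden-to-hidden transition of Bi-S-LSTM, Eq. (11): U h_t^{l-1} + V h_{t-1}^{l}."""

    def __init__(self, dim):
        super().__init__()
        self.U = nn.Linear(dim, dim, bias=False)
        self.V = nn.Linear(dim, dim, bias=False)

    def forward(self, h_below_t, h_same_prev):
        return self.U(h_below_t) + self.V(h_same_prev)

class FCTransition(nn.Module):
    """Transition of Bi-F-LSTM, Eq. (12): ReLU(W h ⊕ V(U h)) with a shortcut concatenation."""

    def __init__(self, dim, bottleneck=128):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.U = nn.Linear(dim, bottleneck, bias=False)
        self.V = nn.Linear(bottleneck, dim, bias=False)

    def forward(self, h_below_t):
        shortcut = self.W(h_below_t)                  # direct path
        abstraction = self.V(self.U(h_below_t))       # low-rank abstraction of the input
        return torch.relu(torch.cat([shortcut, abstraction], dim=-1))

h_below, h_prev = torch.randn(1, 512), torch.randn(1, 512)
print(StackedTransition(512)(h_below, h_prev).shape)   # torch.Size([1, 512])
print(FCTransition(512)(h_below).shape)                # torch.Size([1, 1024])
```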

3.4 Data Augmentation

One of the most challenging aspects of training deep bidirectional LSTM models is preventing overfitting. Since our largest dataset has only 80K images (Lin et al. 2014), which can easily cause overfitting, we adopted several techniques that were commonly used in previous work, such as fine-tuning a pre-trained visual model, weight decay, dropout, and early stopping. Additionally, it has been shown that data augmentation such as random cropping and horizontal mirroring (Simonyan and Zisserman 2014a; Lu et al. 2014), or adding noise, blur, and rotation (Wang et al. 2015), can effectively alleviate overfitting. Inspired by this, we designed new data augmentation techniques to increase the number of image-sentence pairs. Our implementation operates on the visual model, as follows:

—Multi-Crop: Instead of randomly cropping the input image, we crop at the four corners and the center region, because we found that random cropping tends to select the center region and easily causes overfitting. By cropping the four corners and the center, the variations of the network input can be increased to alleviate overfitting.

—Multi-Scale: To further increase the number of image-sentence pairs, we rescale the input image to multiple scales. Each input image I with size H × W is resized to 256 × 256; then we randomly select a region with a size of s·H × s·W, where s ∈ [1, 0.925, 0.875, 0.85] is the scale ratio. s = 1 means we do not perform the multi-scale operation on the given image. Finally, we resize the region to the AlexNet input size 227 × 227 or the VGG-16 input size 224 × 224.

—Vertical Mirror: Motivated by the effectiveness of the widely used horizontal mirror, it is natural to also consider the vertical mirror of an image for the same purpose.


Those augmentation techniques are implemented in a real-time fashion: each input image is randomly transformed by one of the augmentations before it is fed to the network for training. In principle, our data augmentation can increase the number of image-sentence training pairs by roughly 40 times (5 × 4 × 2). We report the evaluation of data augmentation in Section 4.4.
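A minimal Pillow sketch of the three augmentations and the random dispatch described above is given below; the function names, the 256 × 256 intermediate size handling, and the use of Pillow itself are illustrative assumptions and do not reproduce the Caffe data layer actually used.

```python
import random
from PIL import Image, ImageOps

SCALES = [1.0, 0.925, 0.875, 0.85]  # scale ratios s from Section 3.4

def multi_crop(img, net_size=227):
    """Multi-Crop: return one of the four corner crops or the center crop."""
    w, h = img.size
    boxes = [(0, 0, net_size, net_size), (w - net_size, 0, w, net_size),
             (0, h - net_size, net_size, h), (w - net_size, h - net_size, w, h),
             ((w - net_size) // 2, (h - net_size) // 2,
              (w + net_size) // 2, (h + net_size) // 2)]
    return img.crop(random.choice(boxes))

def multi_scale(img, net_size=227):
    """Multi-Scale: crop a random region of side s*256, then resize to the network input."""
    s = random.choice(SCALES)
    side = int(img.size[0] * s)
    x = random.randint(0, img.size[0] - side)
    y = random.randint(0, img.size[1] - side)
    return img.crop((x, y, x + side, y + side)).resize((net_size, net_size))

def vertical_mirror(img, net_size=227):
    """Vertical Mirror: flip the image top-to-bottom, then crop to the network input."""
    return multi_crop(ImageOps.flip(img), net_size)

def augment(img, net_size=227):
    """Resize to 256x256 and apply one randomly chosen augmentation per training sample."""
    img = img.resize((256, 256))
    return random.choice([multi_crop, multi_scale, vertical_mirror])(img, net_size)

# example usage: augmented = augment(Image.open("example.jpg").convert("RGB"))
```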

3.5 Multi-Task/Transfer Learning

Although our data augmentation can reduce overfitting when training the deep LSTM network, it only helps to a certain extent. Increasing the effective training size with fresh training examples can further enlarge the variation of the training data; this effectively prevents the training loss from dropping too quickly, reduces overfitting, and also increases the model's robustness and generality. To address this issue, we propose to combine the training examples from different datasets; in our case, D_multi = D_Flickr8K ∪ D_Flickr30K ∪ D_MSCOCO. With the combined dataset D_multi, we train a multi-task joint model M_multi and then evaluate its performance on the validation/test sets of the different datasets, respectively.

In order to further test the generality and performance of the multi-task joint model M_multi in transferring knowledge learned on D_multi to a new dataset, we propose to use M_multi to perform image captioning and image-sentence retrieval on the target dataset D_Pascal1K. Here, we do not use any images from Pascal1K for training, only for validation. We report the evaluation of multi-task/transfer learning of Bi-LSTM in Section 4.5.

3.6 Training and Inference

Our model is end-to-end trainable using Stochastic Gradient Descent (SGD). The joint loss function L = →L + ←L is computed by accumulating the Softmax losses of the forward and backward directions. Our objective is to minimize L, which is equivalent to maximizing the probabilities of correctly generated sentences. We compute the gradient ∇L with the Back-Propagation Through Time (BPTT) algorithm (Werbos 1990).

The trained model is used to predict a word w_t given the image context I and the previous word context w_{1:t-1} by P(w_t | w_{1:t-1}, I) in forward order, or by P(w_t | w_{t+1:T}, I) in backward order. We set w_1 = w_T = 0 at the start point for the forward and backward directions, respectively. Ultimately, given the sentences generated in the two directions, we decide the final sentence for a given image, p(w_{1:T} | I), according to the average word probability within the sentence:

p(w_{1:T} | I) = max( (1/T) Σ_{t=1}^{T} →p(w_t | I),  (1/T) Σ_{t=1}^{T} ←p(w_t | I) ),   (13)

→p(w_t | I) = Π_{t=1}^{T} p(w_t | w_1, w_2, ..., w_{t-1}, I),   (14)

←p(w_t | I) = Π_{t=1}^{T} p(w_t | w_{t+1}, w_{t+2}, ..., w_T, I).   (15)

Following previous work, we adopted beam search, which keeps the best k candidate sentences at time t to infer the sentence at the next timestep. In our work, we fix k = 1 in all experiments, although better results (an average gain of about 2 BLEU points (Papineni et al. 2002)) can be achieved with k = 20 compared to k = 1, as reported in Vinyals et al. (2015).
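As a small illustration of Equation (13), the snippet below selects the final caption from the forward- and backward-generated sentences by comparing their mean word probabilities; the example captions and probability values are made-up placeholders, and the greedy (k = 1) decoding that produces each direction's sentence is omitted.

```python
def mean_word_probability(word_probs):
    """Average per-word probability of one generated sentence, as in Eq. (13)."""
    return sum(word_probs) / len(word_probs)

def select_final_caption(fwd_caption, fwd_probs, bwd_caption, bwd_probs):
    """Pick the direction whose generated sentence has the higher mean word probability."""
    if mean_word_probability(fwd_probs) >= mean_word_probability(bwd_probs):
        return fwd_caption
    return bwd_caption

# toy example with made-up word probabilities
fwd = (["a", "train", "at", "a", "station"], [0.41, 0.35, 0.28, 0.44, 0.37])
bwd = (["a", "train", "in", "a", "tunnel"], [0.39, 0.33, 0.21, 0.40, 0.30])
print(select_final_caption(fwd[0], fwd[1], bwd[0], bwd[1]))
```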

4 EXPERIMENTS

In this section, we design several groups of experiments to accomplish the following objectives:


—Measure the benefits and performance of the proposed bidirectional model and its deeper variant models, in which we increase the nonlinearity depth in different ways.

—Examine the influence of data augmentation and multi-task/transfer learning on the bidirectional LSTM.

—Compare our approach with state-of-the-art methods on sentence generation and image-sentence retrieval tasks on popular benchmark datasets.

—Qualitatively analyze and understand how the bidirectional multimodal LSTM learns to generate a sentence conditioned on visual context information over time.

4.1 Datasets

To validate the effectiveness, generality, and robustness of our models, we conduct experiments on four benchmark datasets: Flickr8K (Hodosh et al. 2013), Flickr30K (Young et al. 2014), MSCOCO (Lin et al. 2014), and Pascal1K (Rashtchian et al. 2010) (used only for the transfer learning experiment).

Flickr8K. It consists of 8,000 images, each with five sentence-level captions. We follow the standard dataset division provided by the authors: 6,000/1,000/1,000 images for training/validation/testing, respectively.

Flickr30K. An extended version of Flickr8K. It has 31,783 images, each with five captions. We follow the publicly accessible2 dataset split by Karpathy and Li (2015), in which 29,000/1,000/1,000 images are used for training/validation/testing, respectively.

MSCOCO. This is a recently released dataset that covers 82,783 images for training and 40,504 images for validation. Each image has five sentence annotations. Since there is no standard split, we also follow the split provided by Karpathy and Li (2015), namely 80,000 training images and 5,000 images each for validation and testing.

Pascal1K. This dataset is only used for evaluating the generality of our models in the transfer learning experiment. It is a subset of images from the PASCAL VOC challenge and contains 1,000 images, each with five sentence descriptions. We do not use any images from this dataset for training. Following the protocol in Socher et al. (2014), we randomly selected 100 images for validation.

4.2 Implementation Details

Visual feature. We use two visual models for encoding images: the Caffe (Jia et al. 2014) reference model, which is pre-trained with AlexNet (Krizhevsky et al. 2012), and the 16-layer VGG model (Simonyan and Zisserman 2014b). We extract features from the last fully connected layer and feed them to train the visual-language model with LSTM. Previous work (Vinyals et al. 2015; Mao et al. 2015) has demonstrated that more powerful image models such as GoogleNet (Szegedy et al. 2015) and ResNet (He et al. 2016) can achieve promising improvements. To make a fair comparison with recent works, we selected two widely used models for our experiments.

Textual feature. We first represent each word w within a sentence as a one-hot vector, w ∈ R^K, where K is the vocabulary size built on the training sentences of a given dataset. After basic tokenization and removing words that occur fewer than five times in the training set, we have 2,028, 7,400, and 8,801 words in the Flickr8K, Flickr30K, and MSCOCO vocabularies, respectively.
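The vocabulary construction can be sketched as follows; the whitespace tokenization and the tiny example captions are illustrative stand-ins for the basic tokenization mentioned above.

```python
from collections import Counter

def build_vocabulary(training_captions, min_count=5):
    """Count words in the training captions and keep those occurring at least min_count times."""
    counts = Counter()
    for caption in training_captions:
        counts.update(caption.lower().split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}

def one_hot(word, vocab):
    """Represent a word as a one-hot vector w in R^K, with K the vocabulary size."""
    vec = [0] * len(vocab)
    if word in vocab:
        vec[vocab[word]] = 1
    return vec

vocab = build_vocabulary(["a dog runs", "a dog jumps", "a cat runs",
                          "a dog runs", "a dog sits", "a dog runs"], min_count=5)
print(vocab, one_hot("dog", vocab))
```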

Our work uses the LSTM implementation of Donahue et al. (2015) on the Caffe framework. All of our experiments were conducted on Ubuntu 14.04 with 16 GB RAM and a single Titan X GPU with 12 GB memory. Our LSTMs use 1,000 hidden units, and weights were initialized uniformly from [−0.08, 0.08]. The batch sizes are 150, 100, and 100 for Bi-LSTM, Bi-S-LSTM, and Bi-F-LSTM, respectively,

2 http://cs.stanford.edu/people/karpathy/deepimagesent/.


Fig. 6. METEOR/CIDEr scores on data augmentation.

when we use AlexNet as the visual model; when we use VGG as the visual model, the batch size is set to 32. Models are trained with learning rates η = 0.01 (AlexNet-based training) and η = 0.005 (VGG-based training), weight decay λ = 0.0005, and momentum 0.9. Each model is trained for 18–35 epochs with early stopping.
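In a modern framework, the reported solver settings would look roughly like the following; this PyTorch snippet is only a stand-in for the Caffe solver configuration actually used, and the parameter tensor is a placeholder.

```python
import torch

# Placeholder parameter initialized uniformly from [-0.08, 0.08], as described above.
params = [torch.nn.Parameter(torch.empty(1000, 1000).uniform_(-0.08, 0.08))]

optimizer = torch.optim.SGD(
    params,
    lr=0.01,             # eta = 0.01 for AlexNet-based training, 0.005 for VGG-based training
    momentum=0.9,
    weight_decay=0.0005, # lambda
)
```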

4.3 Evaluation Metrics

We evaluate our models mainly on caption generation; following previous work, we use BLEU-N (N = 1, 2, 3, 4) scores (Papineni et al. 2002):

B_N = min(1, e^{1 - r/c}) · e^{(1/N) Σ_{n=1}^{N} log p_n},   (16)

where r and c represent the lengths of the reference sentence and the generated sentence, respectively, and p_n is the modified n-gram precision. We also report METEOR (Lavie 2014) and CIDEr (Vedantam et al. 2015) scores for further comparison. To evaluate the generality of our models, we conduct a transfer learning experiment using Pascal1K on image-sentence retrieval3 (image query sentence and vice versa). It is performed by computing the score of each image-sentence pair and ranking the scores to obtain the top-K (K = 1, 5, 10) retrieved results. We adopt R@K and Mean r as the evaluation metrics: R@K is the recall rate R at the top K candidates, and Mean r is the mean rank. All mentioned metric scores are computed by the MSCOCO caption evaluation server,4 which is commonly used for the image captioning challenge.5
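For intuition, a toy single-reference implementation of Equation (16) is shown below; the actual scores reported in this paper are computed by the MSCOCO caption evaluation server, which handles multiple references and smoothing differently, so this sketch is illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, N=4):
    """BLEU-N for one candidate/reference pair following Eq. (16):
    brevity penalty times the geometric mean of modified n-gram precisions."""
    c, r = len(candidate), len(reference)
    log_precisions = []
    for n in range(1, N + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        p_n = max(clipped / total, 1e-9)          # avoid log(0) when no n-gram matches
        log_precisions.append(math.log(p_n))
    brevity_penalty = min(1.0, math.exp(1 - r / c)) if c > 0 else 0.0
    return brevity_penalty * math.exp(sum(log_precisions) / N)

cand = "a man rides a wave on a surfboard".split()
ref = "a man riding a wave on top of a surfboard".split()
print(round(bleu(cand, ref, N=2), 3))
```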

4.4 Experiments on Data Augmentation

In this subsection, we design a group of experiments to examine the effects of the data augmentation techniques. To this end, we use Bi-S-LSTM for the experiment, because it is a deeper LSTM and we believe that training a deeper LSTM network on limited data is more challenging and therefore better suited to measuring the benefits brought by data augmentation. In this experiment, we turn off the augmentation techniques introduced in Section 3.4 and keep the other configurations unchanged. The BLEU performance is reported in Table 1 and Table 2; the METEOR/CIDEr performance is reported in Figure 6 (shown as Bi-S-LSTMA,−D). It is clear that without data augmentation, the model performance drops significantly on all metrics. Those results also reveal how data augmentation affects datasets of different scales. For example, the model performance on the small-scale

3 Although this work focuses on the image captioning task, we conduct an image-sentence retrieval experiment here to examine the generality of our models across datasets and tasks. The task has been discussed widely in our previous work (Wang et al. 2016d).
4 https://github.com/tylin/coco-caption.
5 http://mscoco.org/home/.


Table 1. BLEU-N Performance Comparison on Flickr8K and Flickr30K (Higher Is Better)

                                             Flickr8K                      Flickr30K
Models                                       B-1   B-2   B-3   B-4    |    B-1   B-2   B-3   B-4
NIC (Vinyals et al. 2015)G,‡                 63    41    27.2  -      |    66.3  42.3  27.7  18.3
X. Chen et al. (Chen and Zitnick 2014)       -     -     -     14.1   |    -     -     -     12.6
LRCN (Donahue et al. 2015)A,‡                -     -     -     -      |    58.8  39.1  25.1  16.5
DeepVS (Karpathy and Li 2015)V               57.9  38.3  24.5  16     |    57.3  36.9  24.0  15.7
m-RNN (Mao et al. 2015)A,‡                   56.5  38.6  25.6  17.0   |    54    36    23    15
m-RNN (Mao et al. 2015)V,‡                   -     -     -     -      |    60    41    28    19
Hard-Attention (Xu et al. 2015)V             67    45.7  31.4  21.3   |    66.9  43.9  29.6  19.9
ATT-FCN (You et al. 2016)G                   -     -     -     -      |    64.7  46.0  32.4  23.0
C. Wang et al. (Wang et al. 2016d)V          65.5  46.8  32.0  21.5   |    62.1  42.6  28.1  19.3
Bi-LSTMA                                     63.7  44.7  31    20.9   |    61.0  40.9  27.1  18.1
Bi-S-LSTMA                                   65.1  45.0  29.3  18.4   |    60.0  40.3  27.1  18.2
Bi-F-LSTMA                                   63.9  44.6  30.2  19.9   |    60.7  41.0  27.5  18.5
Bi-LSTMV                                     66.7  48.3  33.7  23     |    63.3  44.1  29.6  20.1
Bi-S-LSTMV                                   66.9  48.8  33.3  22.8   |    63.6  44.8  30.4  20.5
Bi-F-LSTMV                                   66.5  48.4  32.8  22.4   |    63.4  44.3  30.1  20.4
Bi-LSTMA,+M                                  58.4  42.1  28.6  18.2   |    61.0  41.4  27.8  18.5
Bi-S-LSTMA,−D                                55.4  38.0  24.6  15.3   |    58.2  39.0  25.1  16.3

The superscript "A" means the visual model is AlexNet (or a similar network), "V" is VGG-16, and "G" is GoogleNet; "−D" means without the data augmentations of Section 3.4; "+M" means using the multi-task learning of Section 3.5; "-" indicates an unknown value; "‡" means different data splits.6 The best results are marked in bold and the second best results with an underline (the superscripts are also applicable to Tables 2, 3, and 4).

dataset Flickr8K is worse than that on Flickr30K and MSCOCO. This confirms that data augmentation is beneficial in preventing overfitting and is particularly helpful on small-scale datasets.

4.5 Experiments on Multi-Task/Transfer Learning

In addition to using data augmentation to increase the variation of training examples and reduce overfitting, another effective approach is multi-task learning. Inspired by Simonyan and Zisserman (2014a) and Donahue et al. (2015), in which datasets were combined to train a joint model, we combine the training sets of Flickr8K, Flickr30K, and MSCOCO in order to increase the number of training examples. We then train a multi-task joint model on the combined training sets and evaluate it on each validation set to examine its performance and generality. To save training time, we initialize the training of the multi-task joint model with the best-performing pre-trained MSCOCO model. We change the number of input units of the embedding layers and the number of output units of the last fully connected layer to the vocabulary size (11,557) of the combined training set.

To compare with baseline models that do not use multi-task learning, we select the best-performing models7 for the Flickr8K, Flickr30K, MSCOCO, and Pascal1K datasets, respectively. The comparison with the baseline models in terms of BLEU scores is reported in Table 1 and Table 2. The results show that the multi-task joint model (shown as Bi-LSTMA,+M) did not improve the BLEU

6 On the MSCOCO dataset, NIC uses 4K images for validation and test. LRCN randomly selects 5K images from the MSCOCO validation set for validation and test. m-RNN uses 4K images for validation and 1K for test.
7 The model from the 100,000th iteration has the best performance on Flickr8K; the model from the 90,000th iteration performs best on the rest of the datasets.


Table 2. BLEU-N, METEOR, and CIDEr Performance Comparison on MSCOCO

Models                                       B-1   B-2   B-3   B-4   METEOR  CIDEr
NIC (Vinyals et al. 2015)G,‡                 66.6  46.1  32.9  24.6  -       -
X. Chen et al. (Chen and Zitnick 2014)       -     -     -     19.0  20.4    -
LRCN (Donahue et al. 2015)A,‡                62.8  44.2  30.4  -     -       -
DeepVS (Karpathy and Li 2015)V               62.5  45    32.1  23    19.5    66.0
m-RNN (Mao et al. 2015)V,‡                   67    49    35    25    -       -
Hard-Attention (Xu et al. 2015)V             71.8  50.4  35.7  25    23.0    -
ATT-FCN (You et al. 2016)G                   70.9  53.7  40.2  30.4  24.3    -
C. Wang et al. (Wang et al. 2016d)V          67.2  49.2  35.2  24.4  21.6    71.0
Bi-LSTMA                                     65.1  45.0  29.3  18.4  20.0    64.1
Bi-S-LSTMA                                   64.1  45.4  31.3  21.1  20.7    68.1
Bi-F-LSTMA                                   64.0  45.5  31.5  21.5  20.5    67.5
Bi-LSTMV                                     68.5  50.5  36.0  25.3  22.1    73.0
Bi-S-LSTMV                                   68.7  50.9  36.4  25.8  22.9    73.9
Bi-F-LSTMV                                   68.2  50.6  36.1  25.6  22.6    73.5
Bi-LSTMA,+M                                  65.6  47.4  33.3  23.0  21.1    69.5
Bi-S-LSTMA,−D                                62.8  44.4  30.2  20.0  19.7    60.7

Fig. 7. METEOR/CIDEr scores on multi-task learning.

score on the small dataset Flickr8K. We conjecture that, on the one hand, the multi-task joint model increases the diversity of training examples and improves model generality; on the other hand, it enlarges the differences between training and validation data. Those factors lead to worse BLEU performance on Flickr8K, even though the generated sentences are highly descriptive and sensible (also see examples in Figure 11). However, the multi-task joint model shows promising improvements on Flickr30K and MSCOCO. In addition, in Table 1 and Table 2 we also find that the multi-task joint model tends to improve B-2, B-3, and B-4 performance (increases of 2.4, 4.0, and 4.6 points on MSCOCO). In Figure 7, the METEOR/CIDEr performance is improved with multi-task learning, except METEOR on Flickr8K.

4.6 Comparison with State-of-The-Art Methods

4.6.1 Model Performance. We now compare with state-of-the-art methods. Table 1 and Table 2 summarize the comparison results in terms of BLEU-N. Our approach achieves very competitive


Table 3. Image Captioning Performance Comparison on Pascal1K

Methods                                      BLEU   METEOR
Midge (Mitchell et al. 2012)                 2.89   8.80
Baby talk (Kulkarni et al. 2011)             0.49   9.69
RNN (Chen and Zitnick 2014)                  2.79   10.08
RNN+IF (Chen and Zitnick 2014)               10.16  16.43
RNN+IF+FT (Chen and Zitnick 2014)            10.18  16.45
X. Chen et al. (Chen and Zitnick 2014)       10.48  16.69
X. Chen et al.+FT (Chen and Zitnick 2014)    10.77  16.87
Bi-LSTMA (transfer)                          16.4   18.30

performance on the evaluated datasets, even with a less powerful visual model (AlexNet). Increasing the depth of the LSTM is beneficial for the generation task. The deeper variant models mostly obtain better performance than Bi-LSTM, but they are inferior to the latter in B-3 and B-4 on Flickr8K. We believe the reason is that Flickr8K is a relatively small dataset, which makes training deep models with limited data difficult. One interesting observation is that stacking multiple LSTM layers is generally superior to an LSTM with a fully connected transition layer, although Bi-S-LSTM needs more training time. Replacing AlexNet with VGG-16 results in significant improvements on all BLEU evaluation metrics. We should be aware that a recent interesting work (Xu et al. 2015) achieves the best results on B-1 by integrating an attention mechanism (LeCun et al. 2015; Xu et al. 2015), and semantic attention (You et al. 2016) with GoogleNet achieves the best performance on B-2, B-3, and B-4.

Regarding METEOR and CIDEr performance, our baseline model (Bi-LSTMA) outperforms DeepVSV (Karpathy and Li 2015) by a certain margin. It achieves 19.1/51.8 on Flickr8K (compared to 16.7/31.8 for DeepVSV) and 16.1/29.0 on Flickr30K (15.3/24.7 for DeepVSV). On MSCOCO, our best results are 22.9/73.9; the METEOR score is slightly inferior to the 23.0 of Xu et al. (2015) and the 24.3 of You et al. (2016) but exceeds the rest of the methods. Although we believe incorporating an attention mechanism into our framework could bring further improvements, our current model already achieves competitive results, with only a small gap to the attention-based models (Xu et al. 2015; You et al. 2016).

Compared to our prior work (Wang et al. 2016d), we use the mean probability in Equation (13), rather than the sum probability of all words, when selecting the final caption from the bidirectionally generated captions. This slightly improves our model performance on nearly all metrics, by an average of 1.7 points on Flickr8K, 1.2 points on Flickr30K, and 1.1 points on MSCOCO.

4.6.2 Model Generality. In order to further evaluate the generality of our model for image captioning, we test our joint model on the Pascal1K validation dataset. Table 3 presents the comparison with related work on BLEU and METEOR. We can see that even with our base model Bi-LSTMA, the performance on the generation task exceeds previous approaches by a clear margin, without using any training images from Pascal1K.

On the same dataset, we also examine the generality of our model on a different task: image-sentence retrieval. The results are reported in Table 4. They show that without using any training images from Pascal1K, our model substantially outperforms previous work on all metrics. Particularly on R@1, transfer learning gains more than 20 points on both the image-to-sentence and the sentence-to-image retrieval task.

Those experiments demonstrate that, although using a less powerful visual model, our simplest network (Bi-LSTM) achieves the best performance on both the image captioning and the image-sentence retrieval task.


Table 4. Image-Sentence Retrieval Performance Comparison on Pascal1K

                                                 Image to Sentence              Sentence to Image
Methods                                          R@1   R@5   R@10  M_r     |    R@1   R@5   R@10  M_r
Random Ranking                                   4.0   9.0   12.0  71.0    |    1.6   5.2   10.6  50.0
KCCA (Socher et al. 2014)                        21.0  47.0  61.0  18.0    |    16.4  41.4  58.0  15.9
DeViSE (Frome et al. 2013)                       17.0  57.0  68.0  11.9    |    21.6  54.6  72.4  9.5
SDT-RNN (Socher et al. 2014)                     25.0  56.0  70.0  13.4    |    35.4  65.2  84.4  7.0
DeepFE (Karpathy et al. 2014)                    39.0  68.0  79.0  10.5    |    23.6  65.2  79.8  7.6
RNN+IF (Chen and Zitnick 2014)                   31.0  68.0  87.0  6.0     |    27.2  65.4  79.8  7.0
X. Chen et al. (Chen and Zitnick 2014)           25.0  71.0  86.0  5.4     |    28.0  65.4  82.2  6.8
X. Chen et al. (T+I) (Chen and Zitnick 2014)     30.0  75.0  87.0  5.0     |    28.0  67.4  83.4  6.2
Bi-LSTMA (transfer)                              65.0  90.0  95.0  2.0     |    52.8  86.0  95.4  2.1

Fig. 8. Visualization of LSTM cells. The horizontal axis corresponds to timesteps; the vertical axis is the cell index. Here we visualize the gates and cell states of the first 32 Bi-LSTM units of T-LSTM in the forward direction over 11 timesteps.


Fig. 9. Pattern of the first 96 hidden units chosen at each layer of Bi-LSTM in both forward and backward directions. The vertical axis presents timesteps; the horizontal axis corresponds to different LSTM units. In this example, we visualize the T-LSTM layer for text only, the M-LSTM layer for both text and image, and the Softmax layer for word prediction. The model was trained on the Flickr30K dataset for generating a sentence word by word at each timestep. In (g), we provide the predicted words at different timesteps and their corresponding index in the vocabulary, which can also be read from (e) and (f) (the highlighted point in each row). The word with the highest probability is selected as the predicted word.

Fig. 10. Examples of generated captions for a given query image on the MSCOCO validation set. Blue captions are generated in the forward direction and red captions are generated in the backward direction. The final caption is selected according to Equation (13), which selects the sentence with the higher mean probability. The final captions are marked in bold.

4.7 Visualization and Qualitative Analysis

The aim of this set of experiments is to visualize the properties of the proposed bidirectional LSTM model and explain how it works when generating a sentence word by word over time.

First, we examine the temporal evolution of the internal gate states to understand how bidirectional LSTM units retain valuable context information and attenuate unimportant information. Figure 8 shows the input and output data, the patterns of the three sigmoid gates (input, forget, and output), as well as the cell states. We can clearly see that dynamic states are periodically distilled into the units from timestep t = 0 to t = 11. At t = 0, the input data are sigmoid-modulated at the input gate i(t),


Fig. 11. Examples of generated captions for given query images on the Flickr8K, Flickr30K, MSCOCO, and Pascal1K validation sets. Left: input images. Right: → and ← present the captions generated in the forward and backward direction, respectively. The superscript M or T means the captions are generated with multi-task or transfer learning. The final captions are marked in bold.

whose values lie within [0, 1]. At this step, the values of the forget gates f(t) of the different LSTM units are zero. As the timestep increases, the forget gate starts to decide which unimportant information should be forgotten and which useful information should be retained. Then the memory cell states c(t) and the output gate o(t) gradually absorb the valuable context information over time and produce a rich representation h(t) of the output data.

Next, we examine how visual and textual features are embedded into the common semantic space and used to predict words over time. Figure 9 shows the evolution of the hidden units at different layers. The T-LSTM layer, where LSTM units are conditioned on textual context from the past


and future, acts as the encoder of the forward and backward sentences. At the M-LSTM layer, LSTM units are conditioned on both visual and textual context; this layer learns the correlations between the input word sequence and the visual information encoded by the CNN. At a given timestep, by removing unimportant information that contributes less to correlating the input word with the visual context, the units tend to exhibit a sparsity pattern and learn more discriminative representations from the inputs. At the higher layer, the embedded multimodal representations are used to compute the probability distribution of the next predicted word with Softmax. It should be noted that, for a given image, the numbers of words in the sentences generated in the forward and backward directions can differ.

Figure 10 presents some example images with generated captions. From the generated captions, we found that the bidirectionally generated captions cover different semantic information; for example, in (b) the forward sentence captures "couch" and "table" while the backward one describes "chairs" and "table." We also found that a significant proportion (88%, measured on 1,000 randomly selected images of the MSCOCO validation set) of the generated sentences are novel (they do not appear in the training set), yet the generated sentences are highly similar to the ground-truth captions; for example, in (d), the forward caption is similar to one of the ground-truth captions ("A passenger train that is pulling into a station") and the backward caption is similar to the ground-truth caption ("a train is in a tunnel by a station"). This illustrates that our model has a strong capability of learning visual-language correlations and generating novel sentences.

More example sentence generations on Flickr8K, Flickr30K, MSCOCO, and Pascal1K can be found in Figure 11. Those examples demonstrate that, without using an explicit language model pre-trained on an additional corpus, our models generate sentences that are highly descriptive and semantically relevant to the corresponding images.

5 CONCLUSIONS

We proposed a bidirectional LSTM model that generates a descriptive sentence for an image by taking both history and future context into account. We further designed deep bidirectional LSTM architectures to embed image and sentence in a high-level semantic space for learning a visual-language model. We showed that multi-task learning of Bi-LSTM is beneficial for increasing model generality, which was further confirmed by a transfer learning experiment. We also qualitatively visualized the internal states of the proposed model to understand how the multimodal bidirectional LSTM generates words at consecutive timesteps. The effectiveness, generality, and robustness of the proposed models were evaluated on numerous datasets and two different tasks: image captioning and image-sentence retrieval. Our models achieve highly competitive results on both tasks. Our future work will focus on exploring more sophisticated language representations (e.g., word2vec) and incorporating an attention mechanism into our model. It would also be interesting to explore the multilingual caption generation problem. We also plan to apply our models to other sequence learning tasks such as text recognition and video captioning.

REFERENCES

D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR 2015.
Rich Caruana. 1998. Multitask learning. In Learning to Learn. Springer, 95–133.
Xinlei Chen and C. Lawrence Zitnick. 2014. Learning a recurrent visual representation for image caption generation. arXiv:1411.5654.
X. Chen and C. Lawrence Zitnick. 2015. Mind's eye: A recurrent visual representation for image caption generation. In CVPR. 2422–2431.
Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia 17, 11 (2015), 1875–1886.
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.


Junyoung Chung, Caglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In ICML. 2067–2075.
J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In CVPR. 2625–2634.
H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, and J. Platt. 2015. From captions to visual concepts and back. In CVPR. 1473–1482.
A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov. 2013. Devise: A deep visual-semantic embedding model. In NIPS. 2121–2129.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition.
Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. 2009. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 5 (2009), 855–868.
A. Graves, A. Mohamed, and G. E. Hinton. 2013. Speech recognition with deep recurrent neural networks. In ICASSP. IEEE, 6645–6649.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research 47 (2013), 853–899.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACMMM. ACM, 675–678.
A. Karpathy, A. Joulin, and F.-F. Li. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS. 1889–1897.
A. Karpathy and F.-F. Li. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR. 3128–3137.
R. Kiros, R. Salakhutdinov, and R. Zemel. 2014a. Multimodal neural language models. In ICML. 595–603.
R. Kiros, R. Salakhutdinov, and R. Zemel. 2014b. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS. 1097–1105.
Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11). IEEE, 1601–1608.
G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 35, 12 (2013), 2891–2903.
P. Kuznetsova, V. Ordonez, A. C. Berg, T. Berg, and Y. Choi. 2012. Collective generation of natural image descriptions. In ACL, Vol. 1. ACL, 359–368.
P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. 2014. TREETALK: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics (TACL) 2, 10 (2014), 351–362.
M. Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. ACL (2014), 376.
Y. LeCun, Y. Bengio, and G. E. Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. 2011. Composing simple image descriptions using web-scale n-grams. In CoNLL. ACL, 220–228.
T-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick. 2014. Microsoft coco: Common objects in context. In ECCV. Springer, 740–755.
Xin Lu, Zhe Lin, Hailin Jin, Jianchao Yang, and James Z. Wang. 2014. Rapid: Rating pictorial aesthetics using deep learning. In Proceedings of the ACM International Conference on Multimedia. ACM, 457–466.
J. H. Mao, W. Xu, Y. Yang, J. Wang, Z. H. Huang, and A. Yuille. 2015. Deep captioning with multimodal recurrent neural networks (m-rnn). ICLR 2015.
T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH. 1045–1048.
T. Mikolov, S. Kombrink, L. Burget, J. H. Černocky, and S. Khudanpur. 2011. Extensions of recurrent neural network language model. In ICASSP. IEEE, 5528–5531.
M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In ACL. ACL, 747–756.
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. 2011. Multimodal deep learning. In ICML. 689–696.


Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL. ACL, 311–318.
R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. 2013. How to construct deep recurrent neural networks. arXiv:1312.6026.
C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshop. Association for Computational Linguistics, 139–147.
Nikhil Rasiwasia, Pedro J. Moreno, and Nuno Vasconcelos. 2007. Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia 9, 5 (2007), 923–938.
K. Simonyan and A. Zisserman. 2014a. Two-stream convolutional networks for action recognition in videos. In NIPS. 568–576.
K. Simonyan and A. Zisserman. 2014b. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics (TACL) 2 (2014), 207–218.
Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In 28th International Conference on Machine Learning (ICML'11). 129–136.
N. Srivastava and R. Salakhutdinov. 2012. Multimodal learning with deep boltzmann machines. In NIPS. 2222–2230.
I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. 3104–3112.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In CVPR. 1–9.
R. Vedantam, Z. Lawrence, and D. Parikh. 2015. Cider: Consensus-based image description evaluation. In CVPR. 4566–4575.
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and tell: A neural image caption generator. In CVPR. 3156–3164.
Cheng Wang, Haojin Yang, Christian Bartz, and Christoph Meinel. 2016d. Image captioning with deep bidirectional LSTMs. arXiv:1604.00790.
Cheng Wang, Haojin Yang, and Christoph Meinel. 2016c. A deep semantic framework for multimodal representation learning. Multimedia Tools and Applications (2016), 1–22.
Wei Wang, Zhen Cui, Yan Yan, Jiashi Feng, Shuicheng Yan, Xiangbo Shu, and Nicu Sebe. 2016a. Recurrent face aging. In IEEE Conference on Computer Vision and Pattern Recognition. 2378–2386.
Wei Wang, Sergey Tulyakov, and Nicu Sebe. 2016b. Recurrent convolutional face alignment. In Asian Conference on Computer Vision. Springer, 104–120.
Zhangyang Wang, Jianchao Yang, Hailin Jin, Eli Shechtman, Aseem Agarwala, Jonathan Brandt, and Thomas S. Huang. 2015. DeepFont: Identify your font from an image. In 23rd ACM International Conference on Multimedia (MM'15). ACM, New York, 451–459. DOI:http://dx.doi.org/10.1145/2733373.2806219
Paul J. Werbos. 1990. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE 78, 10 (1990), 1550–1560.
Zhizheng Wu and Simon King. 2016. Investigating gated recurrent networks for speech synthesis. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'16). IEEE, 5140–5144.
K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. ICML 2015.
Xuyong Yang, Tao Mei, Ying-Qing Xu, Yong Rui, and Shipeng Li. 2016. Automatic generation of visual-textual presentation layout. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 12, 2 (2016), 33.
Xiaoshan Yang, Tianzhu Zhang, and Changsheng Xu. 2015. Cross-domain feature learning in multimedia. IEEE Transactions on Multimedia 17, 1 (2015), 64–78.
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In CVPR. 4651–4659.
P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL) 2 (2014), 67–78.
W. Zaremba and I. Sutskever. 2014. Learning to execute. arXiv:1410.4615.
M. D. Zeiler and R. Fergus. 2014. Visualizing and understanding convolutional networks. In ECCV. Springer, 818–833.

Received December 2016; revised March 2017; accepted March 2017


8 A Deep Semantic Framework for Multimodal Representation Learning

In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. The extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves promising results compared to both shallow and deep models in multimodal and cross-modal retrieval tasks.

8.1 Contribution to the Work

• Contributor to the formulation and implementation of research ideas

• Significantly contributed to the conceptual discussion and implementation.

• Guidance and supervision of the technical implementation

8.2 Manuscript


Multimed Tools Appl (2016) 75:9255–9276. DOI 10.1007/s11042-016-3380-8

A deep semantic framework for multimodal representation learning

Cheng Wang1 · Haojin Yang1 · Christoph Meinel1

Received: 24 September 2015 / Revised: 21 December 2015 / Accepted: 18 February 2016 / Published online: 3 March 2016. © Springer Science+Business Media New York 2016

Abstract Multimodal representation learning has gained increasing importance in various real-world multimedia applications. Most previous approaches focused on exploring inter-modal correlation by learning a common or intermediate space in a conventional way, e.g., Canonical Correlation Analysis (CCA). These works neglected the exploration of fusing multiple modalities at a higher semantic level. In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt a deep learning feature as the image representation and a topic feature as the text representation, respectively. In joint model learning, a 5-layer neural network is designed and enforced with supervised pre-training in the first 3 layers for intra-modal regularization. The extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves state-of-the-art results compared to both shallow and deep models in multimodal and cross-modal retrieval.

Keywords Multimodal representation · Deep neural networks · Semantic feature · Cross-modal retrieval


1 Hasso Plattner Institute, University of Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany


1 Introduction

Multimodal data has been the subject of much attention in recent years due to the rapid increase of multimedia data on the Web. It brings new challenges for many multimedia analysis tasks such as multimedia retrieval and multimedia content recommendation. Conventional unimodal analytical frameworks that concentrate on domain-specific knowledge are limited in exploring the semantic dependency across modalities and bridging the "semantic gap" [35] between different modalities. In the field of multimedia, multimodal representation learning is becoming more and more necessary and important, as different modalities typically carry different information. Furthermore, one modality can be a semantic complement for another modality [32] in expressing similar concepts. Many research works such as [7, 20, 25, 36] have illustrated that multimodal representations can outperform unimodal representation based approaches in various applications and achieve remarkable results.

Learning the correlations in multimodal data is a prevalent approach for handling problems such as multimodal and cross-modal retrieval. Recently, several approaches have been proposed to learn the correlation between image and text data. In image representation, the conventional approach is to represent the image as visual words with the SIFT [21] descriptor. In text representation, bag-of-words (BOW) features and topic features derived from Latent Dirichlet Allocation (LDA) [3] are usually used as textual features. For learning cross-modal and multimodal correlations, one popular approach is to build a joint model for fusing the text and image modality and discovering the underlying shared concepts across them. Figure 1 shows two example images with associated text. Note that image and text are loosely related to each other, because not all text words are representative of the corresponding image. But the shared semantics between them are key to learning the latent relationships between different content modalities. Although previous works have made significant progress, they are restrictive for exploring intra-modal and inter-modal relationships in a higher semantic space. Recent advances in deep learning [16, 17, 25, 39] open up new opportunities in multimodal data representation and modeling. In this paper, we are interested in exploring the highly non-linear "semantic-level" relationships across modalities with deep networks. To perform semantic correlation learning, we utilize deep convolutional neural network (CNN) features as the image (visual) representation. CNN features have demonstrated powerful image representation ability compared to traditional hand-crafted visual features derived from SIFT [21], DSIFT [38], SURF [1], Fisher vectors [29], etc. For the text modality, we adopt topic features derived from LDA as the textual representation. Since the visual features are highly abstracted by deep CNNs and the textual features are extracted by computing topic distributions, both of them are high-level features. Based on the extracted semantic features, we propose a 5-layer neural network to learn a joint model in which visual and textual features are fused. Thus, our proposed framework consists of three parts: a visual model for learning image features, a textual model for learning topic features, and a joint model for semantically correlating the multiple features from different modalities. Note that during the training phase we use image-text pairs to train the joint model, which is then used as a feature extractor in multimodal and cross-modal retrieval.

As an instance of multi-task learning [5, 12, 51], our unified framework can be generalized to address multimodal and cross-modal retrieval problems. In multimodal retrieval [19, 30, 31], both the image and text modalities are involved; we use the pre-trained joint model to extract the shared representation across image-text pairs. In cross-modal (unimodal) retrieval, only a single modality is available. Similarly, we use the joint model to project image


Fig. 1 Two examples of image-text pairs: (a) is selected from the Wikipedia dataset ("geography" category), (b) is selected from the MIR Flickr 25K dataset. The shared concept between image and text is the key to multimodal and cross-modal retrieval tasks

and text to a common semantic space, respectively. Because the different modalities have been mapped to a common semantic space, cross-modal retrieval can then be performed by employing distance metrics. Our approach can be applied to additional modalities such as audio and video; however, in this work we investigate the image-text modality for multimodal representation learning.

Our main novelties and contributions can be summarized as follows:

1. We proposed a supervised deep architecture for mapping different modalities to a common feature space. By imposing supervised pre-training as a regularizer, intra-modal and inter-modal relationships can be better captured and better performance is achieved.

2. We investigated deep CNN features as image representation in multimodal representation learning and explored visual-textual fusion at a higher semantic level. Our work can be complementary to existing non-deep feature based approaches.

3. Extensive experiments were conducted to demonstrate the effectiveness of the proposed framework in multimodal/cross-modal tasks on open benchmark datasets. Detailed comparisons and discussion with related approaches (including deep and non-deep models) are provided. Our experiments show that the proposed approach achieves competitive results on multimodal retrieval and state-of-the-art results on the cross-modal retrieval task.

The rest of this paper is structured as follows. Section 2 introduces recent works on multimodal and cross-modal retrieval. Section 3 describes the learning architecture for image and text representation and the network we propose to learn multimodal representations. Section 4 demonstrates the training procedure of the proposed network. Section 5 presents the experiments and results for verifying the effectiveness of our approach in various retrieval tasks. Section 6 concludes this work and gives an outlook.

2 Related work

Recently, many works have concentrated on multimodal/cross-modal problems. We can divide these works into the following categories: (1) CCA-based approaches, (2) topic model-based approaches, (3) hashing-based approaches, (4) deep learning-based approaches, and (5) others.

One of the most popular approaches to perform cross-modal retrieval is to match the text and image modality via canonical correlation analysis (CCA) [37]. N. Rasiwasia and J. Costa Pereira et al. [27, 32] proposed semantic correlation matching (SCM), which is combined with CCA to learn the maximally correlated subspace. In [33], A. Sharma et al. extended CCA to generalized multiview analysis (GMA), which is useful for cross-view classification


and cross-media retrieval. Unfortunately, CCA and its extensions have difficulty processing unpaired data during training; therefore, unpaired data are not appropriately considered.

In [18], a nonparametric Bayesian approach was proposed to learn upstream supervised topic models for analyzing multimodal data. Y. F. Wang et al. [42] introduced a supervised multimodal mutual topic reinforce modeling (M3R) approach by building a joint cross-modal probabilistic graphical model. This approach is intended to discover mutually consistent semantic topics via appropriate interactions between model factors. Nevertheless, topic model based approaches generally require complicated computations.

In the work of [45], supervised coupled dictionary learning with group structures for multi-modal retrieval (SliM2) was proposed; it can be seen as a constrained dictionary learning problem. In [52], J. Zhou et al. proposed an approach named Latent Semantic Sparse Hashing (LSSH) to perform cross-modal similarity search by employing Sparse Coding and Matrix Factorization. In [48], a discriminative coupled dictionary hashing (DCDH) method was proposed; it aims to preserve not only the intra-similarity but also the inter-correlation among multimodal data.

The recent success of deep learning advances the study of modeling multimodal data with deep neural networks. The powerful ability of deep learning has been proved in many fields, including image classification [16], speech recognition [9], text detection [13], video classification [44], and multimodal data modeling [8, 24, 25, 34, 36, 41]. In [24], the authors proposed to correlate images and text captions with deep canonical correlation analysis (DCCA); it was shown that canonical correlation is a very competitive objective, not only for shallowly learnt features, but also in the context of deep learning. In [25], J. Ngiam et al. applied a multimodal deep learning approach to audio-visual speech classification by greedily training a restricted Boltzmann machine (RBM) and a deep autoencoder. N. Srivastava et al. [36] proposed a deep Boltzmann machine (DBM) based approach to extract a unified representation from different data modalities; the evaluation results show that this representation is useful in addressing both classification and information retrieval problems. In [41], the stacked autoencoder was extended to the multimodal stacked autoencoder (MSAE); compared to previous works, it is an effective mapping mechanism that requires little prior knowledge. In [34], weakly-shared Deep Transfer Networks (DTNs) are proposed to translate cross-domain information from text to image. Based on two basic uni-modal autoencoders, a correspondence autoencoder (Corr-AE) [8] was proposed, and it was shown that this combination of representation learning is very effective.

Besides CCA-, topic model-, hashing- and deep learning-based approaches, some other approaches have also been proposed to tackle the multimodal/cross-modal problem. J. Yu et al. [47] designed a cross-modal retrieval system that considers image-text statistical correlations. In [49], X. Zhai et al. proposed a cross-modality correlation propagation (CMCP) algorithm, which aims to exploit both positive and negative correlations for cross-modal retrieval. In another work by K.Y. Wang et al. [40], a method combining common subspace learning and coupled feature selection was proposed for the cross-modal matching problem: by selecting features with the l21-norm from coupled modalities, coupled linear regression is used to project the data into a common space. Recent research work [43] introduced an approach called the Bi-directional Cross-Media Semantic Representation Model (Bi-CMSRM). It performs bi-directional ranking by learning a latent space for both image-query-text retrieval and text-query-image retrieval. One of the most recent works addressing cross-modal retrieval is Local Group based Consistent Feature Learning (LGCFL) [15], in which local group prior knowledge is introduced to improve block-based image features (HOG, GIST [26]).


Unfortunately, on the one hand, the features used in previous work are low-level or sophisticated low-level features. We found that most works adopted BOVW (bag of visual words) representations (e.g. 128-D [15, 18, 40, 41, 50, 52], 500-D [45], 512-D [23], 1000-D [43, 45, 48] and 4096-D [27] BOVW) to represent images and LDA-based 10-D feature vectors to represent text (limited to the Wikipedia dataset). Although some other features such as GIST [26] and PHOW [4, 38] have been explored in multimodal/cross-modal scenarios, none of those works considered deep CNN features (highly abstracted semantic features) [13, 16] as the image representation. For text representation, two features are commonly adopted in those works: the first is the BOW (bag of words) feature, which generally has high dimensionality; the other is an LDA-based topic distribution, which can be computed with the prior parameters of a pre-trained model when modeling a new document. On the other hand, most of the projection models are shallow models. These methods are restrictive for exploring the high-level semantic correlations across modalities. In this paper, we leverage deep CNN features and topic features as the visual and textual representations respectively, and train a joint deep model to capture the correlations between the visual and textual representations at a higher level of abstraction.

3 Learning architecture

An overview of the proposed architecture is shown in Fig. 2. The overall framework consists of three components: (1) a visual model for image representation learning, (2) a textual model for text representation learning and (3) a joint model for multimodal representation learning. Components (1) and (3) involve deep neural networks. To represent the image and text modalities as semantic features, the first step is to train the visual model Mv and the textual model Mt separately. Then, with the pre-trained models, we can represent each image-text pair document as a semantic feature pair. For a given training dataset S that contains N documents, S = {D1, D2, ..., DN}, each image-text pair D = {Ir, Tr} can be represented as a feature-level pair D = {I, T}, where I and T denote the visual and textual semantic features extracted from the raw inputs Ir and Tr with the visual model Mv and the textual model Mt respectively:

Mv(Ir) → I,    Mt(Tr) → T.    (1)

Fig. 2 Deep Semantic Multimodal Learning Framework. The input image-text pair D = {Ir, Tr} is represented as a 4096-D deep CNN feature I ∈ ΨI and a 20-D topic or 2000-D tag feature T ∈ ΨT respectively. W(l) denotes the weights between layers l and l+1. Both visual and textual features are available in network learning as well as in the multimodal retrieval phase. In cross-modal retrieval, only a single modality is available and the other one is initialized with zero values as input; e.g., with the pre-trained joint model J, mapping an image to Π can be performed by J(I) = J(I, 0) → πi


Multimodal representation can be understood as a joint representation in which features from multiple modalities are combined. To learn the multimodal representation, we propose a regularized deep neural network (RE-DNN) for fusing visual and textual features. The intention of RE-DNN is to learn a joint model J,

J : ΨI , ΨT → Π (2)

where ΨI and ΨT represent the image and text semantic feature spaces, and Π denotes the common semantic space learned by RE-DNN for correlating the different modalities. We can then learn the shared representation π ∈ Π for a given image-text pair {I, T} by

J ((I, T ),W, b) → πi (3)

where I ∈ ΨI and T ∈ ΨT are the extracted visual and textual features respectively, and W and b are the weights and biases learned by RE-DNN from the training data. In cross-modal mapping with J, since only one modality is available, we set the other modality to zero, similar to [8, 25]:

J((I, 0), W, b) → πi,    J((0, T), W, b) → πt.    (4)

where πi and πt represent the features projected into the common semantic space from the visual feature I and the textual feature T respectively. In our work, πi and πt are the pre-activation values of the 5th layer. Since πi ∈ Π and πt ∈ Π lie in the same feature space, it is straightforward to calculate the similarity between different modalities.
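To make this zero-filling strategy concrete, the following Python sketch (a minimal numpy stand-in for illustration only, not the authors' Matlab implementation; the layer sizes, the random weights in `P` and the `nc_similarity` helper are our own placeholder assumptions) pushes a single-modality input through a network of the described shape and compares the two projections in the common space:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def branch(x, W1, b1, W2, b2):
    """Unimodal branch: layers 2 and 3 of the 5-layer network."""
    return sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)

def joint_projection(I, T, P):
    """Project a (possibly zero-filled) feature pair into the common space.
    Following Section 3, pi is taken as the pre-activation of the 5th layer."""
    zv = branch(I, P['W1v'], P['b1v'], P['W2v'], P['b2v'])
    zt = branch(T, P['W1t'], P['b1t'], P['W2t'], P['b2t'])
    h = sigmoid(P['W3'] @ np.concatenate([zv, zt]) + P['b3'])
    return P['W4'] @ h + P['b4']

def nc_similarity(p, q):
    """Normalized correlation between two projected vectors."""
    p = (p - p.mean()) / (p.std() + 1e-12)
    q = (q - q.mean()) / (q.std() + 1e-12)
    return float(np.dot(p, q)) / len(p)

# toy dimensions mirroring the Wikipedia setting: 4096-D CNN, 20-D topic, M = 10
NV, NT, H, M = 4096, 20, 100, 10
rng = np.random.default_rng(0)
P = {'W1v': rng.normal(0, 0.01, (H, NV)), 'b1v': np.zeros(H),
     'W2v': rng.normal(0, 0.01, (M, H)),  'b2v': np.zeros(M),
     'W1t': rng.normal(0, 0.01, (H, NT)), 'b1t': np.zeros(H),
     'W2t': rng.normal(0, 0.01, (M, H)),  'b2t': np.zeros(M),
     'W3':  rng.normal(0, 0.01, (H, 2 * M)), 'b3': np.zeros(H),
     'W4':  rng.normal(0, 0.01, (M, H)),  'b4': np.zeros(M)}

pi_i = joint_projection(rng.random(NV), np.zeros(NT), P)   # image-only query
pi_t = joint_projection(np.zeros(NV), rng.random(NT), P)   # text-only query
print(nc_similarity(pi_i, pi_t))
```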

4 Methodology

This section elaborates the training and optimization procedures of our proposed RE-DNN for learning the image-text multimodal joint model.

In our proposed 5-layer DNN, one third of the network is dedicated to the image modality, one third to the text modality, and the last third to multimodal joint modeling. The whole learning process can be divided into two phases: (1) supervised pre-training for intra-modal regularization and (2) whole-network training. Intra-modal regularization is mainly responsible for learning optimal parameters for the first three layers of the DNN. Y. Bengio et al. [2] pointed out that greedy pre-training generally works better than training without pre-training. Since our DNN has inputs from two different modalities, we adopt supervised pre-training to estimate the weights and biases for each modality separately; the training of the whole network is then initialized with these optimized weights and biases. Formally, given N training samples, each of them is preprocessed as a set of {visual feature, textual feature, ground truth label}. The training set is denoted as D = \{v^{(n)}, t^{(n)}, y^{(n)}\}_{n=1}^{N}, where v = \{v_p\}_{p=1}^{N_I}, t = \{t_q\}_{q=1}^{N_T} and y = \{y_k\}_{k=1}^{K}, i.e., a single input vector v ∈ R^{N_I}, t ∈ R^{N_T} and y ∈ R^{K}. Let x_i denote the input from layer l-1; the pre-activation value z of the j-th unit at layer l can be formulated as

z_j^{(l)} = \sum_{i=1}^{N^{(l-1)}} W_{ij}^{(l-1)} x_i + b^{(l-1)}    (5)

where N^{(l-1)} represents the number of units at layer l-1, W_{ij} is the weight between the i-th and j-th units, and b is the bias. The j-th neural unit at layer l is then computed by

y_j^{(l)} = f^{(l)}(z_j^{(l)}), \quad l > 1    (6)
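For concreteness, equations (5) and (6) amount to the usual affine-transformation-plus-nonlinearity step of a fully connected layer; a minimal numpy rendering (the toy dimensions are assumptions) could look as follows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(x, W, b, activation=sigmoid):
    """Eq. (5): z_j = sum_i W_ij * x_i + b; Eq. (6): y_j = f(z_j)."""
    z = W @ x + b              # pre-activation of every unit j at layer l
    return activation(z), z

# toy example: a 4096-D input feeding a 100-unit hidden layer
rng = np.random.default_rng(0)
x = rng.random(4096)
W = rng.normal(0, 0.01, (100, 4096))
b = np.zeros(100)
y, z = layer_forward(x, W, b)
print(y.shape, z.shape)
```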


In our work, the sigmoid function f(x) = 1/(1 + e^{-x}) is used as the activation function for all hidden layers (l = 2, 3, 4), and the softmax function f(x_k) = e^{(x_k - \epsilon)} / \sum_{k=1}^{K} e^{(x_k - \epsilon)}, with \epsilon = \max_k(x_k), is used to activate the output layer (l = 5). The intra-modal regularization problem is to minimize the overall error over the training samples for each modality, described as

\arg\min_{\theta^*} C_m = \frac{1}{2N} \sum_{n=1}^{N} \| y_m^{(n)} - y^{(n)} \|^2 + \frac{\lambda}{2} \sum_{l=1}^{L_m - 1} \| W^l \|_F^2 .    (7)

where the parameter space is \theta^* = \{W_m^{(l)}, b_m^{(l)}\}, m ∈ {v, t}, and L_m = 3. In back propagation we use the ground-truth semantics to regulate the weights and to find the optimal parameters W_t^{(1)}, W_t^{(2)}, b_t^{(1)} and b_t^{(2)} for the text modality, and W_v^{(1)}, W_v^{(2)}, b_v^{(1)} and b_v^{(2)} for the image modality at layers l_2 and l_3 respectively. The intra-modal regularization is introduced to reduce noisy features and to preserve the intrinsic, representative features of each modality before fusing them at the shared hidden layers. The second part of equation (7) is a weight decay term used to prevent overfitting during training; \lambda is the weight decay parameter. The training of the whole network is performed by initializing with the optimal weights and biases \theta^* learned from intra-modal regularization. We randomly initialize W^{(3)}, W^{(4)}, b^{(3)} and b^{(4)} as in standard training. The objective is to learn a parameter space \Omega^* = \{W_m^{(l)}, W^{(l)}, b_m^{(l)}, b^{(l)}\} by

\arg\min_{\Omega^*} C(\theta^*) = C_m + \frac{1}{2N} \sum_{n=1}^{N} \| \hat{y}^{(n)} - y^{(n)} \|^2 + \frac{\lambda}{2} \sum_{l=1}^{L-1} \| W^l \|_F^2    (8)

where \hat{y}^{(n)} denotes the output of the whole network for the n-th sample. The back propagation of the whole network differs from standard back propagation in that the derivatives of the weights and biases for the different modalities must be considered separately. The procedure of supervised pre-training for intra-modal regularization is summarized in Algorithm 1, and the procedure for whole-network learning is described in Algorithm 2.
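Since Algorithms 1 and 2 are not reproduced here, the two-phase procedure can be sketched as follows. The snippet below is an illustrative PyTorch stand-in rather than the authors' Matlab implementation: the reduced epoch count, the random toy data and the use of the optimizer's built-in weight decay in place of the explicit Frobenius-norm term of Eqs. (7) and (8) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Unimodal sub-network [N/100/M] used for supervised pre-training."""
    def __init__(self, in_dim, n_classes, hidden=100):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(in_dim, hidden), nn.Linear(hidden, n_classes)
    def forward(self, x):
        return torch.sigmoid(self.fc2(torch.sigmoid(self.fc1(x))))

class REDNN(nn.Module):
    """Joint layers [2M/100/M] stacked on top of the two pre-trained branches."""
    def __init__(self, visual, textual, n_classes, hidden=100):
        super().__init__()
        self.visual, self.textual = visual, textual
        self.fc3, self.fc4 = nn.Linear(2 * n_classes, hidden), nn.Linear(hidden, n_classes)
    def forward(self, v, t):
        z = torch.cat([self.visual(v), self.textual(t)], dim=1)
        return torch.softmax(self.fc4(torch.sigmoid(self.fc3(z))), dim=1)

def fit(model, inputs, targets, epochs, lr=0.001, momentum=0.9, wd=1e-4, batch=41):
    """Mini-batch SGD on the squared error; weight decay via the optimizer."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum, weight_decay=wd)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for i in range(0, len(targets), batch):
            opt.zero_grad()
            out = model(*[x[i:i + batch] for x in inputs])
            loss_fn(out, targets[i:i + batch]).backward()
            opt.step()

# toy data: 2173 samples, 4096-D CNN features, 20-D topics, 10 one-hot classes
N, M = 2173, 10
v, t = torch.rand(N, 4096), torch.rand(N, 20)
y = torch.eye(M)[torch.randint(0, M, (N,))]

vis, txt = Branch(4096, M), Branch(20, M)
fit(vis, (v,), y, epochs=5)          # Algorithm 1: intra-modal pre-training (image)
fit(txt, (t,), y, epochs=5)          # Algorithm 1: intra-modal pre-training (text)
joint = REDNN(vis, txt, M)
fit(joint, (v, t), y, epochs=5)      # Algorithm 2: whole-network training
```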


5 Experiments

This section introduces the experiments we conducted on two benchmark datasets: the Wikipedia dataset1 and MIR Flickr 25K.2 We also compare the proposed approach with state-of-the-art methods for multimodal and cross-modal retrieval.

5.1 Experiment setups

5.1.1 Dataset descriptions

The first dataset utilized in our experiments is the Wikipedia dataset [32], which is drawn from the "featured articles" of Wikipedia. It is a continually updated collection of 2866 labeled documents, each composed of an image and a corresponding text description. The whole dataset covers 10 semantic categories such as sport, art and history. The dataset is randomly split into two parts, of which 2173 documents are selected as the training set and 693 documents as the test set. The authors of [32] also published extracted feature data, namely 10-D topic features for text representation and 128-D SIFT features for image representation. In this paper, we represent text as a 20-D topic feature derived from LDA, and represent each image as a 4096-D deep CNN feature.

1 http://www.svcl.ucsd.edu/project/crossmodal/
2 http://press.liacs.nl/mirflickr/

The second dataset is MIR Flickr 25K [11], which has recently been introduced for the evaluation of image retrieval and multimodal/cross-modal retrieval. It consists of 25000 images downloaded from Flickr.3 Each image is accompanied by corresponding tags; the average number of tags per image is 8.94. However, some tags are weakly labeled, so that those tags are not actually relevant to the image. The dataset covers 38 semantic categories such as sky, flower and food. It should be noted that each image may belong to multiple categories. In this work, 15K images were randomly selected for training and the remaining 10K images for testing; 5K images of the test set were randomly selected as the query database, and another 1K non-overlapping images of the test set were randomly selected as queries. For the representation, we adopted the 2000-D tag features from [36] as the text feature and generated 4096-D deep CNN features as the image representation.

For the images in both datasets, we adopted the Caffe reference model4 [14] trained on the ImageNet ILSVRC12 dataset [6] (1.2M training images) for deep CNN feature extraction. The features were extracted from the 7th layer (a fully connected layer) of the Caffe reference model. To further verify the effectiveness of our proposed approach and to compare with deep models, we also conducted a group of experiments on MIR Flickr 25K in which the deep CNN features were replaced by the 3857-D features (a combination of PHOW [4], GIST [26], MPEG-7 descriptors [22], etc.) used in [36].
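A rough sketch of this feature extraction step with pycaffe is given below; the deploy prototxt, model weights, mean file and image paths are placeholders that would have to be adapted to a local Caffe installation.

```python
import numpy as np
import caffe

caffe.set_mode_gpu()
net = caffe.Net('deploy.prototxt',                      # placeholder paths
                'bvlc_reference_caffenet.caffemodel',
                caffe.TEST)

# standard ImageNet-style preprocessing for the reference model
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))            # HWC -> CHW
transformer.set_mean('data', np.load('ilsvrc_2012_mean.npy').mean(1).mean(1))
transformer.set_raw_scale('data', 255)                  # [0,1] -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))         # RGB -> BGR

img = caffe.io.load_image('example.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', img)
net.forward()
fc7 = net.blobs['fc7'].data[0].copy()                   # 4096-D deep CNN feature
print(fc7.shape)
```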

5.1.2 Configurations

For visual feature extraction, we applied the Caffe framework on Ubuntu 12.04 with an Nvidia GTX 780 GPU with 3 GB memory. Our textual model was trained on Ubuntu 12.04 with an Intel 3.20 GHz × 4 CPU and 8 GB RAM. To learn the multimodal representation, we trained the joint model using Matlab on a Windows 8 platform with an Intel 3.20 GHz × 4 CPU and 8 GB RAM.

In the DNN learning, the first three layers are designed for intra-modal regularization for both the image and the text modality. The unimodal networks are designed as [N/100/M] (N: the dimension of the image or text feature input, M: the number of categories; e.g. the image network is designed as [4096/100/10] for deep CNN features). The last three layers are set as [2M/100/M] for exploring the inter-modal correlation across modalities. In our experiments, the learning rate α = αm = 0.001 with momentum = 0.9 achieved the best performance on the Wikipedia dataset, and α = αm = 0.01 with momentum = 0.9 achieved the best performance on MIR Flickr 25K. According to the scale of our training data (2173 training samples), we adopted mini-batch gradient descent with batch size 41 for Wikipedia and 100 for MIR Flickr 25K. For both datasets, the number of epochs was fixed at K = 200 and the weight decay parameter at λ = 10^-4.

3 https://www.flickr.com/
4 https://github.com/BVLC/caffe/tree/master/models/


5.1.3 Evaluation metrics

In order to compare with previous work, we also adopt mean average precision (mAP) to evaluate the retrieval performance. For each query, assume N documents are retrieved, among which Ntest documents are relevant. We compute the average precision by

AP = \frac{1}{N_{test}} \sum_{i=1}^{N} p(i)\, r(i),    (9)

where p(i) represents the precision of the top i retrieved documents and r(i) denotes the relevance of a given rank: r(i) = 1 means the i-th retrieved document is relevant to the query and r(i) = 0 otherwise. The mAP over Nq queries can then be computed by

mAP = \frac{1}{N_q} \sum_{n=1}^{N_q} AP_n.    (10)
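A small Python helper makes the two definitions concrete; the binary relevance lists in the toy example are made up, and the normalizer N_test is approximated by the number of relevant items appearing in the ranked list.

```python
import numpy as np

def average_precision(relevance):
    """Eq. (9): AP over one ranked list, where relevance[i] is 1 if the
    (i+1)-th retrieved document is relevant to the query and 0 otherwise.
    N_test is approximated here by the number of relevant items in the list."""
    relevance = np.asarray(relevance, dtype=float)
    n_relevant = relevance.sum()
    if n_relevant == 0:
        return 0.0
    precision_at_i = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)
    return float((precision_at_i * relevance).sum() / n_relevant)

def mean_average_precision(ranked_relevances):
    """Eq. (10): mAP as the mean of the per-query AP values."""
    return float(np.mean([average_precision(r) for r in ranked_relevances]))

# toy example: two queries with binary relevance of their ranked results
print(mean_average_precision([[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]))
```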

Because the different modalities can be projected into the common semantic space by the deep architecture, it is natural to adopt distance metrics for multimodal and cross-modal retrieval. For the distance calculation, we adopted four different metrics as in [27]: Euclidean distance (Euclidean), Kullback-Leibler divergence (KL), cosine distance (Cosine) and normalized correlation (NC).
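These four metrics reduce to a few lines of numpy. The sketch below is only illustrative: the small epsilon values and the renormalization applied before the KL divergence are our own assumptions, and note that normalized correlation is a similarity (higher means closer) while the other three are distances.

```python
import numpy as np

def euclidean(p, q):
    return float(np.linalg.norm(p - q))

def cosine_distance(p, q):
    return 1.0 - float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

def kl_divergence(p, q, eps=1e-12):
    # both vectors are clipped and renormalized to behave like distributions
    p = np.clip(p, eps, None); p = p / p.sum()
    q = np.clip(q, eps, None); q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def normalized_correlation(p, q):
    # similarity, not a distance: higher values mean more similar vectors
    p = (p - p.mean()) / (p.std() + 1e-12)
    q = (q - q.mean()) / (q.std() + 1e-12)
    return float(np.dot(p, q)) / len(p)

rng = np.random.default_rng(0)
a, b = rng.random(10), rng.random(10)
print(euclidean(a, b), cosine_distance(a, b), kl_divergence(a, b), normalized_correlation(a, b))
```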

5.2 Experiments on Wikipedia dataset

5.2.1 Effects of distance metrics

This set of experiments aims to explore the effect of the distance metric on retrieval performance. Table 1 shows the mAP performance of different query types on the Wikipedia dataset. Here we primarily consider three query types: multimodal query (QI+T), image query (QI) and text query (QT). We also consider the average performance of image and text query, (QI + QT)/2. Since NC achieved the best performance on the Wikipedia dataset, we applied NC as the distance metric in the comparison with recent works.

5.2.2 Multimodal retrieval

We further tested our approach on multimodal image retrieval. Similar to [28], the text data within a document serve as semantically complementary information for improving image retrieval performance. In line with previous work, we also adopted mean average precision (mAP) and the Precision-Recall curve (P-R curve) to evaluate the overall performance. Figure 3 shows the mAP performance for each semantic category using RE-DNN with different distance metrics.

Table 1 mAP of different distance metrics (Wikipedia)

            Q_{I+T}   Q_I      Q_T      (Q_I+Q_T)/2
KL          0.6243    0.3439   0.272    0.308
Cosine      0.6384    0.3439   0.3175   0.3307
Euclidean   0.6146    0.3403   0.2581   0.2992
NC          0.6395    0.3404   0.3526   0.3465


Fig. 3 Per-category mAP (Wikipedia)

Table 2 presents the mAP of our approach and related work on the multimodal retrieval task. The best result of RE-DNN is 0.6395 using NC, which is comparable to the state-of-the-art result on the Wikipedia dataset.

5.2.3 Unimodal retrieval

As mentioned before, in unimodal retrieval only one modality is available in the retrieval phase. The query image (text) and the text (image) database are each mapped to the common semantic feature space. Our experiments consider (1) using text to query images and (2) vice versa.

Figure 4 presents the mAP performance against the retrieval scope. In this experiment, we observe how the mAP changes as the retrieval scope increases, that is, the value of kt in image query and ki in text query, starting from kt = ki = 2 and increasing in steps of 2 up to 693 (all test samples). In text query, we note that the NC distance metric improves the mAP more than the other metrics as ki increases. In image query, all distance metrics show similar performance as kt increases. This demonstrates that NC is more appropriate for handling cross-modal retrieval based on high-level semantic features.

Table 3 summarizes the comparison between our approach and recent work at two retrieval scopes, kt = ki = 8 and kt = ki = 50. At kt = ki = 8, we achieved results comparable to PFAR [23], and we outperform PFAR [23] on image query and on the average mAP score; the best mAP performance is obtained using the NC distance metric. Similarly, we compared RE-DNN to SliM2 [45], for which the retrieval scope was set to ki = kt = 50. The best results of SliM2 [45] are 0.2548 for image query and 0.2025 for text query.

Table 2 mAP of multimodal retrieval (Wikipedia)

Method                            Q_{I+T}
Multi-Modal SGM RF [46]           0.641
Multi-Modal SGM Gaussian [46]     0.581
RIS [28]                          0.356
TTI [10]                          0.323
RE-DNN                            0.6395


Fig. 4 mAP performance against retrieval scope (Wikipedia)

At that scope, our image query results range from 0.2803 (Euclidean) to 0.2854 (KL), outperforming SliM2. The text query obtains mAP scores of 0.1455 (KL), 0.1983 (Cosine), 0.1428 (Euclidean) and 0.2416 (NC) respectively. RE-DNN outperforms SliM2 for all query types by a certain margin. Overall, the proposed RE-DNN with the NC metric achieves state-of-the-art results at both retrieval scopes. Figure 5a and b present the Precision-Recall (P-R) curves for image query and text query separately. In both cases we considered the whole retrieval scope (ki = kt = 693) and compared with SCM [27], CMTC [47] and CMCP [49].5 Note that for both image and text query tasks, our approach obtains better precision at most recall levels compared to CMCP and CMTC. Although RE-DNN does not outperform SCM on image query, it performs better than SCM on text query; moreover, RE-DNN has higher precision at almost all recall levels in text query.

Table 4 describes further comparison results between RE-DNN and SCM [32], CMTC [47], CMCP [49], LCFS [40], Bi-CMSRM [43] and Corr-Full-AE [8] in terms of mAP performance. Our approach achieves competitive results on text query (0.3526) and on the average mAP score (0.3465). It should be noted that an exact comparison with LCFS [40], Bi-CMSRM [43] and Corr-Full-AE [8] is difficult, because the training/test splits in those works differ from ours. In this work, we strictly followed the scheme of [32, 47, 49] with 2173 training / 693 test documents. However, [40] used a 1300/1566 training/test split, [43] used a 1500/500/866 training/validation/test split, and Corr-Full-AE [8] used a 2173/462/231 training/validation/test split. Besides, the retrieval scope in [8] was set to ki = kt = 50, which differs from our experimental setting (ki = kt = 693). The redivided datasets, unfortunately, were not published, so a direct comparison between our approach and [40, 43] is not possible. Nevertheless, our approach achieves results similar to [8] and outperforms [40, 43] by a large margin.

5.3 Experiments on MIR Flickr 25K dataset

This section presents the comparison with deep models regarding multimodal and unimodal retrieval. We first compare our approach to related work. Then, we conduct a group of experiments in which the deep CNN features are replaced with the image features used in [36], in order to verify the effectiveness of the proposed RE-DNN without deep CNN features.

5 The P-R values were read from the graphs.


Table 3 mAP performance comparison at different retrieval scopes (Wikipedia)

               kt = ki = 8                      kt = ki = 50
               Q_I      Q_T      (Q_I+Q_T)/2    Q_I      Q_T      (Q_I+Q_T)/2
PFAR [23]      0.298    0.273    0.286          -        -        -
SliM2 [45]     -        -        -              0.2548   0.2025   0.2287
RE-DNN         0.3519   0.2300   0.291          0.2815   0.2416   0.2616

The features used in [36] are 3857-D features mixed from 2000-D Pyramid Histogram of Words (PHOW) features, 960-D GIST features, 256-D Color Structure Descriptor features and others. Here, we refer to the mixed features as the "PHOW feature" because of its dominant share in the mix. Since some images belong to more than one category, in the retrieval procedure a retrieved item is assumed to be relevant to the query if the two have overlapping category labels [36].
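This relevance convention for the multi-label MIR Flickr data is easy to state in code; the label sets in the toy example below are made up for illustration.

```python
def is_relevant(query_labels, retrieved_labels):
    """A retrieved item counts as relevant if it shares at least one
    category label with the query (overlap criterion, as in [36])."""
    return len(set(query_labels) & set(retrieved_labels)) > 0

# toy example with MIR-Flickr-style categories
print(is_relevant({"sky", "clouds"}, {"clouds", "sunset"}))   # True
print(is_relevant({"food"}, {"flower", "plant_life"}))        # False
```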

5.3.1 Effects of distance metrics

First of all, we also report our approach (RE-DNN with deep CNN features) using different distance metrics, in order to explore the sensitivity of the learned joint features to the metric. As shown in Table 5, in multimodal query, Euclidean and KL perform slightly worse than Cosine and NC. In unimodal query, all distance metrics achieve similar results. To make a fair comparison, we selected the cosine distance metric as in [36]; therefore, in the rest of the experiments, Cosine is used as the distance metric for comparison purposes.

Fig. 5 Precision-Recall and average mAP performance comparison


Table 4 Comparison of mAP performance (Wikipedia)

                     Q_I      Q_T      (Q_I+Q_T)/2
CCA [27]             0.21     0.174    0.192
SM [27]              0.350    0.249    0.300
SCM [27]             0.362    0.273    0.318
CMTC [47]            0.293    0.232    0.266
CMCP [49]            0.326    0.251    0.289
LCFS [40]*           0.2798   0.2141   0.2470
Bi-CMSRM [43]*       0.2528   0.2123   0.2326
Corr-Full-AE [8]*    0.335    0.368    0.352
RE-DNN               0.3404   0.3526   0.3465

5.3.2 Multimodal retrieval

This subsection compares RE-DNN with other deep models, namely the Autoencoder [25], DBM (Deep Boltzmann Machine) [36] and DBN (Deep Belief Network) [36], on multimodal retrieval. Unlike those works, our architecture trains the joint model in a supervised manner.

Figure 6a shows the precision-recall curves of RE-DNN and the compared approaches in multimodal retrieval. We note that our approach consistently improves the precision at all recall levels by a large margin compared to the other deep models. From Table 6a, we can see that RE-DNN achieves an mAP of 0.719, which significantly outperforms the other deep models and constitutes a state-of-the-art result for multimodal retrieval on the MIR Flickr 25K dataset. Since the Autoencoder and DBM are unsupervised architectures, our experiment also shows that a supervised DNN architecture is more capable of capturing the relationships across modalities and of learning a joint representation without additional data for pre-training.

5.3.3 Unimodal retrieval

The comparison of RE-DNN with the deep models DBN and DBM [36] in terms of unimodal retrieval performance is shown in Fig. 6b. It illustrates that RE-DNN consistently improves the precision in image query. The best mAP performance of RE-DNN is 0.677, which significantly outperforms image-DBN (0.578), image-DBM (0.587) and multimodal-DBM (0.614).

Table 5 mAP of different distance metrics (MIR Flickr 25K)

            Q_{I+T}   Q_I      Q_T      (Q_I+Q_T)/2
KL          0.7048    0.6548   0.5919   0.6234
Cosine      0.7191    0.6771   0.5760   0.6269
Euclidean   0.7029    0.6663   0.5816   0.624
NC          0.7178    0.6789   0.5755   0.6272


Fig. 6 Precision Recall on MIR Flickr 25K

5.3.4 Experiment with PHOW feature

In the previous experiments, deep CNN features were used as image features for learning a joint model that captures the correlation across modalities. To verify the effectiveness and generality of the proposed RE-DNN, this experiment explores the performance of the RE-DNN architecture on traditional features. As mentioned, we adopted the features used in [36], which we refer to as the "PHOW" feature. We replaced the 4096-D deep CNN feature by the 3857-D PHOW feature and kept the other configurations identical to the previous experiments (the only difference is the input layer of the image network, which is set to 3857). We re-trained the image network and the joint model for mapping the different modalities to the common semantic space. Our experimental results are reported in Fig. 6a for multimodal query and Fig. 6b for unimodal query. From the precision-recall curve in Fig. 6a, we find that RE-DNN with PHOW has higher precision than DBM at most recall levels, but is inferior to RE-DNN with deep CNN features in multimodal query. Similarly, in unimodal query, RE-DNN with PHOW shows slight improvements compared to DBM. The performance comparison is shown in Table 6.

Table 6 Comparison of mAP with deep models

(a) Multimodal query
Methods                      mAP
DBN [36]                     0.609
Autoencoder [25]             0.612
DBM [36]                     0.622
RE-DNN (PHOW feature)        0.648
RE-DNN (CNN feature)         0.719

(b) Unimodal query
Methods                      mAP
Image-DBN [36]               0.578
Image-DBM [36]               0.587
Multimodal-DBM [36]          0.614
RE-DNN (PHOW feature)        0.632
RE-DNN (CNN feature)         0.677


Fig. 7 Illustrative examples of multimodal image retrieval. Textual data serve as auxiliary, semantically complementary information in image retrieval. Each row presents one example; the query image is marked by a red bounding box and the top-4 retrieved results are placed next to it

With the PHOW feature, RE-DNN achieves an mAP of 0.648 for multimodal query and 0.632 for image query. This further demonstrates the capability of deep CNN features in multimodal and unimodal retrieval, which had rarely been explored in previous work.

5.4 Illustrative examples

This subsection gives some examples of multimodal retrieval on the Wikipedia and MIR Flickr 25K datasets (see Figs. 7 and 8) and of cross-modal retrieval on the Wikipedia dataset (see Fig. 9).

Fig. 8 Multimodal queries on MIR Flickr 25K. The query images are marked with a red bounding box; the top-4 retrieved results are shown next to them


Fig. 9 Illustrative examples of image and text queries. Top-left: image or text query with the corresponding semantic category probability distribution. Top-right: ground-truth image or text of the query with its corresponding semantic category probability distribution. Middle: top-4 retrieved images or texts. Bottom: semantic category probability distributions of the retrieved images or texts


In Fig. 7, two exemplary results of multimodal image retrieval are illustrated. In contrast to traditional content-based image retrieval, in which only visual features are involved, our approach can explore the deeper semantic relations between the different modalities. It lifts retrieval to the "semantic level" by leveraging textual information as a semantic complement. We can see that some visually dissimilar images are found because they share hidden topics or concepts with the query image: the retrieved items in the first example share a biology-related concept, while the images in the second example share the concepts of war and warfare.

Figure 8 presents two multimodal query examples. We can see that the top-4 retrieved results share similar visual or textual concepts with the query. For example, all entries in the first query share the concepts flower, nature and color. As mentioned before, we observed that some ground-truth tags are not exactly related to the corresponding image, such as d50, d80 and nikond50. This further demonstrates that our approach is able to retain the intrinsic, essential information and to remove noisy information in the feature learning phase.

Figure 9 presents some examples of image query and text query; in both cases, the top-4 most relevant results are returned. We use a red histogram to represent queries and a green histogram to represent the ground-truth text corresponding to an image query; a wine-colored histogram is used for the retrieved results. We also display the probability scores for the queries, the ground truth and the retrieved results. Each histogram is computed using features derived from the output of the 4th layer of RE-DNN. For the image query, we note that all displayed texts are correctly retrieved because all of them are perceived as belonging to semantic category 9 (sport). Similarly, all results for the text query are also correct, because the probabilities of all results for semantic category 3 (geography) are around 0.6, which is consistent with the semantics of both the query text and the ground-truth image.

5.5 Discussion

5.5.1 Relations to shallow models

In contrast to the previous shallow models in [23, 27, 40, 43, 45, 47, 48], RE-DNN captures two relevant relationships with a deep neural network: the intra-modal relationship (between feature and semantics) and the inter-modal relationship (between image and text). This work is also an attempt to adopt deep CNN features for the cross-modal problem at a higher semantic level. The image representations used in previous work are hand-crafted features such as SIFT [21], GIST [26] and PHOW [4]. We propose to apply a robust CNN model trained on a large-scale dataset such as ImageNet (1.2M images) to extract image features at a higher semantic level (the 7th layer). Visual and textual features are then fused in a supervised way, which achieves state-of-the-art results on both evaluation datasets.

Another advantage of RE-DNN is that it is robust to the modality-missing problem, whereas most of the compared approaches have difficulty processing unpaired data. By setting the missing modality (e.g. text) to zero (see Section 3), we are able to run the whole network without this modality. In this way, for unimodal input, RE-DNN can infer the missing modality with the pre-trained joint model, in which the correlations between the different modalities have been stored. Exploiting this property of RE-DNN, we perform cross-modal retrieval and achieve the best result to date.

5.5.2 Relations to deep models

Unlike the recently introduced deep model in [25], which focuses on video-audio data fusion, our work addresses the fusion of very different modalities: images and long text (or text tags). Furthermore, we train our network in a supervised manner, while the works of [8, 25, 36] are unsupervised architectures. For instance, in [36] the image network is designed as [3857/1024/1024], the text network as [2000/1024/1024], and the joint layer has 2048 units. Our architecture is designed as follows: the image network is [N/100/M] and the text network is [N/100/M], where N denotes the dimensionality of the input feature and M the number of categories; the two joint layers are designed as [2M/100/M] (for further details on the network configuration see Section 5.1.2). Compared to the network structure in [36], we use fewer hidden units, which means fewer parameters have to be learned. Overall, our approach is superior in both model complexity and retrieval performance.
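The complexity claim can be made concrete by counting weight parameters. The arithmetic below uses the layer sizes quoted in the text; ignoring bias terms and simplifying the exact connectivity of the DBM joint layer are our own assumptions.

```python
def n_weights(layer_sizes):
    """Number of weight parameters in a fully connected stack (biases ignored)."""
    return sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

# multimodal DBM configuration reported for [36]
dbm = (n_weights([3857, 1024, 1024])      # image pathway
       + n_weights([2000, 1024, 1024])    # text pathway
       + n_weights([1024 + 1024, 2048]))  # joint layer (simplified connectivity)

# RE-DNN on MIR Flickr 25K: M = 38 categories, 4096-D CNN and 2000-D tag inputs
M = 38
rednn = (n_weights([4096, 100, M])        # image branch
         + n_weights([2000, 100, M])      # text branch
         + n_weights([2 * M, 100, M]))    # joint layers

print(f"DBM-style model: ~{dbm / 1e6:.1f}M weights, RE-DNN: ~{rednn / 1e6:.2f}M weights")
```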

6 Conclusion

We have introduced a unified DNN framework for multimodal representation learning. By extracting 4096-D visual features with a deep CNN and 20-D (or 2000-D) textual features, the correlation across modalities is explored with the proposed RE-DNN. Through supervised pre-training, RE-DNN can capture both intra-modal and inter-modal relationships at a higher semantic level. Our experimental studies on open benchmarks show that RE-DNN outperforms the alternative approaches and achieves state-of-the-art performance on multimodal image retrieval and cross-modal retrieval.

Our future work will focus on the optimization problems of learning DNNs for various multimodal modeling tasks; a more reasonable and effective framework will be developed based on the current study. Moreover, considering that our framework can easily be applied to other multimodal scenarios, it would also be interesting to apply RE-DNN to bi-modal fusion or cross-modal mapping for video-text and audio-video modality pairs.

References

1. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-up robust features (SURF). Comput Vis Image Underst 110(3):346–359

2. Bengio Y, Lamblin P, Popovici D, Larochelle H et al (2007) Greedy layer-wise training of deep networks. Adv Neural Inf Process Syst 19:153

3. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022

4. Bosch A, Zisserman A, Munoz X (2007) Image classification using random forests and ferns. In: IEEE 11th international conference on computer vision, ICCV 2007. IEEE, pp 1–8

5. Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75

6. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255

7. Escalante HJ, Hernadez CA, Sucar LE, Montes M (2008) Late fusion of heterogeneous methods for multimedia image retrieval. In: Proceedings of the 1st ACM international conference on multimedia information retrieval. ACM, pp 172–179

8. Feng F, Wang X, Li R (2014) Cross-modal retrieval with correspondence autoencoder. In: Proceedings of the ACM international conference on multimedia. ACM, pp 7–16

9. Hinton G, Deng L, Yu D, Dahl GE, Mohamed A-r, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97

10. Hoffman J, Rodner E, Donahue J, Darrell T, Saenko K (2013) Efficient learning of domain-invariant image representations. arXiv:1301.3224

11. Huiskes MJ, Lew MS (2008) The MIR Flickr retrieval evaluation. In: Proceedings of the 1st ACM international conference on multimedia information retrieval. ACM, pp 39–43

12. Jacob L, Vert J-P, Bach FR (2009) Clustered multi-task learning: a convex formulation. Adv Neural Inf Process Syst, pp 745–752


13. Jaderberg M, Vedaldi A, Zisserman A (2014) Deep features for text spotting. In: ECCV. Springer, Berlin, pp 512–528

14. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: convolutional architecture for fast feature embedding. arXiv:1408.5093

15. Kang C, Xiang S, Liao S, Xu C, Pan C (2015) Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Trans Multimedia 17(3):370–381

16. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst, pp 1097–1105

17. Lee H, Grosse R, Ranganath R, Ng AY (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 609–616

18. Liao R, Zhu J, Qin Z (2014) Nonparametric bayesian upstream supervised multi-modal topic models. In: Proceedings of the 7th ACM international conference on Web search and data mining. ACM, pp 493–502

19. Lienhart R, Romberg S, Horster E (2009) Multilayer pLSA for multimodal image retrieval. In: Proceedings of the ACM international conference on image and video retrieval. ACM, p 9

20. Liu D, Lai K-T, Ye G, Chen M-S, Chang S-F (2013) Sample-specific late fusion for visual category recognition. In: IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 803–810

21. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110

22. Manjunath BS, Ohm J-R, Vasudevan VV, Yamada A (2001) Color and texture descriptors. IEEE Trans Circuits Syst Video Technol 11(6):703–715

23. Mao X, Lin B, Cai D, He X, Pei J (2013) Parallel field alignment for cross media retrieval. In: Proceedings of the 21st ACM international conference on multimedia. ACM, pp 897–906

24. Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR). IEEE

25. Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 689–696

26. Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175

27. Pereira JC, Coviello E, Doyle G, Rasiwasia N, Lanckriet GRG, Levy R, Vasconcelos N (2014) On the role of correlation and abstraction in cross-modal multimedia retrieval. IEEE Trans Pattern Anal Mach Intell 36(3):521–535

28. Pereira JC, Vasconcelos N (2014) Cross-modal domain adaptation for text-based regularization of image semantics in image retrieval systems. Comput Vis Image Underst 124:123–135

29. Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR 2007). IEEE, pp 1–8

30. Pham T-T, Maillot NE, Lim J-H, Chevallet J-P (2007) Latent semantic fusion model for image retrieval and annotation. In: Proceedings of the sixteenth ACM conference on information and knowledge management. ACM, pp 439–444

31. Pulla C, Jawahar CV (2010) Multi modal semantic indexing for image retrieval. In: Proceedings of the ACM international conference on image and video retrieval. ACM, pp 342–349

32. Rasiwasia N, Pereira JC, Coviello E, Doyle G, Lanckriet GRG, Levy R, Vasconcelos N (2010) A new approach to cross-modal multimedia retrieval. In: Proceedings of the international conference on multimedia. ACM, pp 251–260

33. Sharma A, Kumar A, Daume H, Jacobs DW (2012) Generalized multiview analysis: a discriminative latent space. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 2160–2167

34. Shu X, Qi G-J, Tang J, Wang J (2015) Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation. In: Proceedings of the 23rd annual ACM conference on multimedia. ACM, pp 35–44

35. Smeulders AWM, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 22(12):1349–1380

36. Srivastava N, Salakhutdinov R (2012) Multimodal learning with deep Boltzmann machines. Adv Neural Inf Process Syst, pp 2222–2230

37. Thompson B (2005) Canonical correlation analysis. Encyclopedia of statistics in behavioral science

38. Vedaldi A, Fulkerson B (2010) Vlfeat: an open and portable library of computer vision algorithms. In: Proceedings of the international conference on multimedia. ACM, pp 1469–1472


39. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408

40. Wang K, He R, Wang W, Wang L, Tan T (2013) Learning coupled feature spaces for cross-modal matching. In: IEEE international conference on computer vision (ICCV). IEEE, pp 2088–2095

41. Wang W, Ooi BC, Yang X, Zhang D, Zhuang Y (2014) Effective multi-modal retrieval based on stacked auto-encoders. Proceedings of the VLDB Endowment 7(8):649–660

42. Wang Y, Wu F, Song J, Li X, Zhuang Y (2014) Multi-modal mutual topic reinforce modeling for cross-media retrieval. In: Proceedings of the ACM international conference on multimedia. ACM, pp 307–316

43. Wu F, Lu X, Zhang Z, Yan S, Rui Y, Zhuang Y (2013) Cross-media semantic representation via bi-directional learning to rank. In: Proceedings of the 21st ACM international conference on multimedia. ACM, pp 877–886

44. Wu Z, Jiang Y-G, Wang J, Pu J, Xue X (2014) Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In: Proceedings of the ACM international conference on multimedia. ACM, pp 167–176

45. Wu F, Zhang Y, Lu WM, Zhuang YT, Wang YF (2013) Supervised coupled dictionary learning with group structures for multi-modal retrieval. In: Twenty-Seventh AAAI Conference on Artificial Intelligence

46. Xie L, Pan P, Lu Y (2013) A semantic model for cross-modal and multi-modal retrieval. In: Proceedings of the 3rd ACM international conference on multimedia retrieval. ACM, pp 175–182

47. Yu J, Cong Y, Qin Z, Wan T (2012) Cross-modal topic correlations for multimedia retrieval. In: Proceedings of the 21st international conference on pattern recognition (ICPR). IEEE, pp 246–249

48. Yu Z, Wu F, Yang Y, Tian Q, Luo J, Zhuang Y (2014) Discriminative coupled dictionary hashing for fast cross-media retrieval. In: Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval. ACM, pp 395–404

49. Zhai X, Peng Y, Xiao J (2012) Cross-modality correlation propagation for cross-media retrieval. In: IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2337–2340

50. Zhai X, Peng Y, Xiao J (2013) Cross-media retrieval by intra-media and inter-media correlation mining. Multimedia Systems 19(5):395–406

51. Zhang Y, Yeung D-Y (2012) A convex formulation for learning task relationships in multi-task learning. In: UAI

52. Zhou J, Ding G, Guo Y (2014) Latent semantic sparse hashing for cross-modal similarity search. In: Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval. ACM, pp 415–424

Cheng Wang is currently a Ph.D. student at the chair of Internet Technologies and Systems, Hasso Plattner Institute (HPI), University of Potsdam, Germany. Prior to HPI, he received a Master of Science degree from Sichuan University, China, in 2013 and a Bachelor of Management from Shandong Jianzhu University, China, in 2010. In 2015, he made short-term research visits to the Department of Computer Science, University of Cape Town, South Africa, and to the Israel Institute of Technology, Israel. His research interests include deep learning, artificial intelligence, machine learning and multimedia retrieval.


Haojin Yang received the Diploma Engineering degree from the Technical University Ilmenau, Germany, in 2008. In 2013, he received the doctorate degree from the Hasso Plattner Institute for IT-Systems Engineering (HPI) at the University of Potsdam, Germany. His current research interests revolve around multimedia analysis, information retrieval, deep learning technologies, computer vision and content-based video search technologies.

Christoph Meinel studied mathematics and computer science at Humboldt University in Berlin. He received the doctorate degree in 1981 and was habilitated in 1988. After visiting positions at the University of Paderborn and the Max Planck Institute for Computer Science in Saarbrücken, he became a full professor of computer science at the University of Trier. He is now the president and CEO of the Hasso Plattner Institute for IT-Systems Engineering at the University of Potsdam and a full professor of computer science with a chair in Internet technologies and systems. He is a member of acatech, the German National Academy of Science and Engineering, and of numerous scientific committees and supervisory boards. His research focuses on IT-security engineering, teleteaching, telemedicine and multimedia retrieval. He has published more than 500 papers in high-profile scientific journals and at international conferences.


9

Automatic Lecture Highlighting

In this paper, we propose a novel solution to highlight online lecture videos at both the sentence and the segment level, just as is done with paper books. The solution is based on automatic analysis of multimedia lecture materials, such as speeches, transcripts and slides, in order to facilitate online learners in the current era of e-learning, especially with MOOCs. With the ground truth created by massive users, an evaluation process shows that the general accuracy can reach 70%, which is reasonably promising. Finally, we also attempt to find the potential correlation between these two types of lecture highlights.

9.1 Contribution to the Work

• Contributor to the formulation and implementation of research ideas

• Significantly contributed to the conceptual discussion and implementation.

• Guidance and supervision of the technical implementation

• Maintainer of the software project

9.2 Manuscript



Automatic Online Lecture Highlighting Based on Multimedia Analysis

Xiaoyin Che, Haojin Yang and Christoph Meinel, Member, IEEE

Abstract—Textbook highlighting is widely considered to be beneficial for students. In this paper, we propose a comprehensive solution to highlight online lecture videos at both the sentence and the segment level, just as is done with paper books. The solution is based on automatic analysis of multimedia lecture materials, such as speeches, transcripts and slides, in order to facilitate online learners in this era of e-learning, especially with MOOCs. Sentence-level lecture highlighting mainly uses acoustic features from the audio, and the output is implemented in the subtitle files of the corresponding MOOC videos. In comparison with ground truth created by experts, the precision is over 60%, which is better than the baseline works and is also welcomed by user feedback. Segment-level lecture highlighting, on the other hand, works with statistical analysis, mainly by exploring the speech transcripts, the lecture slides and their connections. With the ground truth created by massive users, an evaluation process shows that the general accuracy can reach 70%, which is fairly promising. Finally, we also attempt to find a potential correlation between these two types of lecture highlights.

Index Terms—Lecture Highlighting, Acoustic Analysis, Statistical Analysis, MOOC


1 INTRODUCTION

Many people like using a marker to highlight books while reading, especially students with textbooks in hand [1]. Research shows that properly highlighted content indeed supports understanding [2]. Perhaps this is the reason why quite a lot of book authors already highlight the key concepts, features or equations in their books, and more are requested to do so [3]. Generally, there are two types of highlighting: content highlighting and table-of-contents highlighting (see Fig. 1). The former mostly emphasizes sentences, while the latter works on a larger scale, indicating which section should be given special attention.

Not only is highlighting a widespread practice with traditional paper books, it is also welcomed in the era of e-books [4, 5] and is widely implemented in many e-book applications. In this case, while a marker is no longer required, highlighting is still based on book-like textual materials. However, what if there is nothing textual, such as when attending a lecture without a textbook: does it still make sense to highlight the lecture?

We believe the answer is yes. In a lecture, there are always some key points, such as definitions, illustrations, functions, applications, etc., which are more important to students than the other contents of the lecture. Fortunately, good teachers always know these key points in their lectures and emphasize them while teaching [6]. These emphases may draw attention from students through tone changes in speech and further improve the teaching performance [7]. Once captured and presented to students, particularly to self-learning students, they could be very helpful. A good teacher's emphasis should also be the students' learning focus [8].

In recent years, with the rapid development of distance learning technology, especially in the form of MOOCs (Massive Open Online Courses), numerous lectures have been recorded on video, uploaded to the Internet and made freely accessible online. However, research shows that for MOOC learners, the median engagement time when watching a lecture video is at most 6 minutes [9]. Unfortunately, many online lectures are much longer than that.

All authors are with the Hasso Plattner Institute and can be contacted by email: {xiaoyin.che, haojin.yang, christoph.meinel}@hpi.de, or by post: Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany.

After 6 minutes, the learners become less concentrated or effective, and sometimes even close the video without finishing it, causing the phenomenon of "in-video dropout" [10].

Lecture highlighting may help in this situation. We could offer some highlighted sentences with alerts, which might work as refreshments when the learners gradually get distracted. Perhaps these alerts cannot keep the learners concentrated all the time, but at least the learners know when the key points are presented in the video. We could also highlight some video segments covering specific subtopics emphasized by the teacher, which would make it easy for learners to jump directly to these key segments before "in-video dropout" occurs. With this effort, at least they would encounter the most important knowledge in the video before quitting. And if a key segment arouses the interest of some learners, they may even decide not to drop out at all.

Meanwhile, the big improvements achieved in video display and lecture recording systems make the practical implementation of key sentences and segments much easier. Enabling live transcripts or subtitles, also referred to as CC (Closed Captions), is popular not only with Internet video service providers like YouTube and Vimeo, but also in traditional TV services, such as ARD or ZDF in Germany [11], as well as on the majority of MOOC platforms [12, 13]. These additional synchronized textual data are very suitable for sentence-level highlight implementation. Fig. 2 shows a screenshot of a slide-inclusive lecture recorded by the tele-TASK system [14]. In this kind of modern recording, the slide content can be thoroughly obtained and transition detection can be applied to logically segment the lecture video [15]. With the visual navigation bar in the bottom-right of Fig. 2, key segments can easily be marked up by simply adding a sign or changing the color. Besides, the textual segment list at the bottom-left is also applicable. Some other online lecture archives (e.g. LectureVideo.NET) offer similar functions.

Based on all the above motives, we propose two technically independent but practically related approaches to highlight online lectures at both the sentence and the segment level automatically. Please note that we intend to finish the highlighting before the lecture officially goes online, so that both real-time MOOC users and archive lecture learners can benefit from it.



Fig. 1. Two examples of book highlighting. (a) Content highlighting; copyright of the image belongs to Maryellen Weimer on http://www.facultyfocus.com. (b) Table of contents highlighting; image published on http://backtoluther.blogspot.de, copyright belongs to the original author(s)

However, enabling users to create personalized highlights is not the topic of this paper; we concentrate on "we highlight for you".

Sentence-level highlighting focuses on acoustic emphasis detection; the result is presented in the lecture transcripts, or more specifically the subtitles, and is evaluated by both subjective and objective standards. The major technical contribution of the sentence-level approach is that we innovatively manifest the speaking rate through syllable duration and pause rate at the sentence level and combine it with the frequently used features pitch and energy in a general decision scheme. More importantly, as far as we know, we are the first to automatically detect acoustic-based highlights in lecture videos and to implement a practical application in the educational domain. Segment-level highlighting, on the other hand, mainly depends on exploring the correlation between speech and slides, with a new form of statistical analysis oriented to the characteristics of online lecture videos. User feedback and forum threads are used for evaluation.
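As a rough illustration of the kind of sentence-level acoustic descriptors involved, the sketch below computes mean pitch, mean energy and a crude pause rate with librosa. It is not the authors' feature extractor: the pitch range, the energy threshold and the pause-rate definition are our own assumptions, and the file name is a placeholder.

```python
import numpy as np
import librosa

def sentence_features(wav_path, sr=16000):
    """Crude sentence-level descriptors: mean pitch, mean energy, pause rate."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)          # frame-wise pitch (Hz)
    rms = librosa.feature.rms(y=y)[0]                       # frame-wise energy
    pause_rate = float(np.mean(rms < 0.1 * rms.max()))      # fraction of low-energy frames
    return {"pitch_mean": float(np.nanmean(f0)),
            "energy_mean": float(rms.mean()),
            "pause_rate": pause_rate}

# usage with a placeholder path:
# print(sentence_features("sentence_0001.wav"))
```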

The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 and Section 4 introduce the sentence-level lecture transcript highlighting and the segment-level lecture video highlighting in detail, respectively. Section 5 compares key sentences and key segments and attempts to find a connection between them. This is followed by the conclusion.

2 RELATED WORK

Detecting emphasis in speech is a long-term research topic. Early attempts aimed to segment speech recordings or summarize spoken discourses based on the detected acoustic emphasis [16, 17]. Since then, almost all approaches have taken pitch as an indispensable feature for this task, since it is widely acknowledged that the pitch value changes as the speaker's status changes [18, 19]. However, Kochanski et al. argued that loudness and duration are more crucial than pitch for classifying acoustic prominences, and carried out an experiment at the syllable level with positive results to support their argument [20]. Since then, more approaches prefer to take pitch, loudness and duration into consideration together.

Syllable-level prominence detection is fundamental to this topic. As the most microscopic linguistic element, the stress of a syllable in a word can be decisive in stress languages like English: "RE-cord" and "re-CORD" can be semantically different. Therefore, there are already many successful systems that automatically classify them in different languages [21–23]. The research interest then moved upwards from the syllable level to the word level. Acoustically, no new features were introduced, although there have been discussions about whether to sample on syllables or directly on whole words [24, 25]. Meanwhile, lexical features started to be included at the word level [26, 27].

It is natural to make a similar extension from words to utterances or, as we say, sentences. Fairly promising results have been reported for locating "hot spots" in meeting recordings, which is a kind of conversational speech [28, 29]. However, the lecture speech in our case is generally a kind of solo speech. As far as we know, the performance of sentence-level emphasis detection on solo speech has not been reported before, and we would like to be the first to do so.

Although research on sentence-level speech emphasis detection is quite limited, there is a highly related and well-researched topic: speech emotion recognition [30, 31]. It shares the same research foundation with emphasis detection approaches by using the same features (pitch, energy, duration, etc.) [32], and emotions are believed to be more suitable for describing phonetic structures in longer time frames than emphases [33]. More specifically, observation shows that some acoustic phenomena of speech emphasis are highly similar to the widely used emotion states "happy/joy" and "angry/anger", such as higher pitch and energy [34]. Sometimes emphasis is even taken as an independent emotion state, "emphatic" [35]. These facts suggest that not only the technical insights but also the experimental results from emotion recognition approaches can inform our emphasis detection.

Another related research topic is the social tendency analysis of language. Speech emphasis is considered by some researchers to be a preliminary element in constructing social signals [36, 37]. And as already introduced, teachers use emphasis in their lecture speeches in order to draw the listeners' attention. This behavior is functionally a typical manifestation of high "extraversion" in the classic "Big Five" personality model [38].


Fig. 2. A screenshot of a slide-inclusive lecture recorded by the tele-TASK system. The visual navigation bar with slide preview can be seen in the bottom area of the "desktop stream" on the right, while the textual segment list is under the "lecturer stream" on the left.

ior is functionally a typical manifestation of high “extraversion” inclassic “Big Five” personality model [38]. Social tendencies canbe classified by text analysis [39], as well as emotions [40]. Theselexical approaches may also provide baseline result for us.

Once forwarding from sentence-level to segment-level, videobecomes the major carrier in emphasis analyzing research, andthe term “highlight” is more frequently mentioned to address keyvideo segment. Highlight detection in broadcasting sports videos isthe most well-researched subarea, but the features they used, suchas specific scenes, commentator’s keywords or replay sessions, areonly available in context of sports video [41, 42]. For other typesof video, highlight detection is generally taken as a step in videosummarization or abstraction. The video will be deconstructedinto shots, from which key-frames will be extracted and furtherevaluated according to their visual similarity, timing information,and features of synchronized audio signal [43–45].

Lecture video, on the other hand, is something different[46, 47]. It has very limited scene changes, which makes almostall the key-frames extracted visually similar. However, lecturevideo sometimes includes external multimedia data, such as slides,which enabled an early attempt by He et al. on slide-inclusivelectures [48]. They put audio features, slide transition informationand user statistics into the model, but focused more on contentintegrity than segment importance for the purpose of lecture videoabstraction. Taskiran et al. also contributed to the summarizationof lectures [49]. They used pauses in speech to segment thevideo and calculated importance scores for segments by detectingword co-occurrence based on transcripts. Inspired by these ideas,we plan to further explore the connection between slides andtranscripts in segment-level lecture video highlighting.

3 SENTENCE-LEVEL LECTURE HIGHLIGHTING

3.1 Sentence Units Acquisition

The development and popularization of online lectures, especially in the form of MOOCs, have contributed greatly to breaking the geographical barrier of knowledge dissemination. Subtitles, whether translated into other languages or only available in the original language, are considered the best breaker of the language barrier so far [50, 51]. To meet this need of the learners, many course providers offer subtitles as supplementary material or facilitate the potential integration of subtitles, as already mentioned in Section 1. In this case, the subtitles are manually generated by professional producing teams or volunteer groups, fully punctuated, well synchronized and properly segmented into subtitle items in a user-friendly way. We can directly take these subtitle items as sentence units for our purpose.

However, if a course has no existing subtitle files, we can create them automatically. Starting with ASR (Automated Speech Recognition), unpunctuated or under-segmented transcripts of the lecture videos can be obtained. Then SBD (Sentence Boundary Detection) can be employed, in which a deep neural network model classifies whether a punctuation mark should be inserted after the k-th word of a continuous n-word sequence, with word vectors and pauses as the major features [52, 53]. The transcripts with restored punctuation marks can then be reasonably segmented into subtitle items to complete the automatic subtitle generation procedure.
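To make the SBD step more concrete, the following is a minimal sketch of such a punctuation classifier in PyTorch. It is not the implementation used in [52, 53]; the window size, embedding dimensionality and network shape are illustrative assumptions.

# Hypothetical sketch: decide whether a punctuation mark follows the k-th word of
# an n-word window, using concatenated word vectors plus per-word pause durations.
import torch
import torch.nn as nn

N_WORDS = 5      # sliding window size (assumption)
EMB_DIM = 100    # word-vector dimensionality (assumption)

class PunctuationClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = N_WORDS * EMB_DIM + N_WORDS   # word vectors + pause features
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 2),                  # classes: no punctuation / punctuation
        )

    def forward(self, word_vecs, pauses):
        # word_vecs: (batch, N_WORDS, EMB_DIM); pauses: (batch, N_WORDS), in seconds
        return self.net(torch.cat([word_vecs.flatten(1), pauses], dim=1))

# Usage: slide the window over the ASR transcript and insert a mark when class 1 wins.
model = PunctuationClassifier()
logits = model(torch.randn(1, N_WORDS, EMB_DIM), torch.rand(1, N_WORDS))
insert_mark = logits.argmax(dim=1).item() == 1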

Unfortunately, errors can hardly be avoided in such automatically generated subtitles, such as improperly recognized words. But since the further processing within this section focuses on acoustic features in the audio rather than lexical information in the text, the potential negative influence is minimized. In the end, a sentence unit contains a segment of audio, obtained via the time tags of the corresponding subtitle item, along with its textual content.

3.2 Voiced/Unvoiced Sound Classification

Lecture speech is typically solo speech. Specifically, when preparing videos for a MOOC, many lecturers prefer to talk to a camera in a studio rather than in a classroom with real students [54, 55]. As a result, the audio signal of lecture videos is of high quality with a quite low level of noise. Therefore, we consider a denoising step unnecessary and treat all acoustic information as deriving from the speaker.


Fig. 3. The short-term energy and zero-crossing rate of the example sentence unit, along with its voiced/unvoiced deconstruction result.

Taking sentence units as input, our analysis process starts with voiced/unvoiced sound classification.

Typically, speech consists of three categories of elements: voiced sound (V), unvoiced sound (U) and silence (S). For example, the pronunciation of the English word "breakfast" should theoretically have the structure "U-V-U-V-U-U", corresponding to "b-rea-k-fa-s-t". And in a sentence of actual speech, "breakfast" is very likely to be surrounded by two "S". In many speech analysis tasks, V/U/S classification is an important pre-processing step [56], and emphasis detection is no exception. In this work, we classify voiced and unvoiced sounds using the Short-Term Energy and the Zero-Crossing Rate.

Short-Term Energy (addressed as energy or E hereafter) is a basic acoustic feature, commonly used to measure the instantaneous loudness of the audio signal. The Zero-Crossing Rate (ZCR or Z) is the rate of sign changes along the signal, which can be seen as a simple measurement of frequency within a small time window. It is widely acknowledged that voiced sounds have high energy and low ZCR, while unvoiced sounds are the opposite: low energy but high ZCR [57–59]. Silence fragments are easy to classify because both energy and ZCR approach 0.

Fig. 3 shows the energy and ZCR level of an example sentence unit with the content "this is the speed, with which the machine is working." This example comes from the MOOC "In-Memory Data Management" in 2012¹, with the sample rate of its audio signal being 48 kHz. In this work, both energy and ZCR are sampled with a window size of 0.02 s and a step size of 0.01 s, which means each sample covers 960 sampling points in total, over which the average value is taken. Both features are extracted with the Yaafe toolkit².
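For illustration, the two features can be computed roughly as follows. This is only a NumPy sketch with the window and step sizes stated above; the actual extraction in this work is done with the Yaafe toolkit.

# Sketch: per-frame short-term energy and zero-crossing rate.
import numpy as np

def short_term_features(signal, sr=48000, win_s=0.02, step_s=0.01):
    win, step = int(sr * win_s), int(sr * step_s)     # 960 and 480 samples at 48 kHz
    energy, zcr = [], []
    for start in range(0, len(signal) - win + 1, step):
        frame = signal[start:start + win]
        energy.append(np.mean(frame ** 2))                        # average power
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))  # sign-change rate
    return np.array(energy), np.array(zcr)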

We use a heuristic-adaptive decision scheme for the V/U/S classification. First, the average energy value of the whole sentence (Ē) is calculated. The i-th sample in the sentence unit is taken as a voiced sample if Ei > Ē and Ei > Zi. Adjacent voiced samples are then connected to form voiced sounds, while single isolated voiced samples are considered accidental and abandoned.

Unvoiced sound classification is more complicated. The challenge lies in distinguishing it from both voiced sound and noisy silence. After observing the speech signals of several different lecturers, we set the following requirements:

1. https://open.hpi.de/courses/imdb2012
2. http://yaafe.sourceforge.net/

• The sample is NOT a voiced sample.
• Ei < max{Ē, Zi}
• Zi > Z̄ × 1.5 or Ei + Zi < Ē

These requirements demand that Ei be comparatively small, but not too small, and that Zi be large. Only when a sample meets all three requirements is it taken as an unvoiced sample. Similarly, continuous unvoiced samples are gathered together as unvoiced sounds. All samples that belong to neither voiced sounds nor unvoiced sounds are considered silence, although some of them might be isolated voiced or unvoiced samples.
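The decision scheme described above can be sketched as follows; the rule set mirrors the requirements in the text, while the grouping of neighbouring samples is simplified.

# Sketch of the heuristic V/U/S classification over per-frame energy E and ZCR Z.
import numpy as np

def vus_classify(E, Z):
    E_avg, Z_avg = E.mean(), Z.mean()
    labels = np.full(len(E), "S", dtype=object)            # default: silence
    voiced = (E > E_avg) & (E > Z)                          # voiced-sample rule
    unvoiced = (~voiced) & (E < np.maximum(E_avg, Z)) & \
               ((Z > 1.5 * Z_avg) | (E + Z < E_avg))        # the three unvoiced rules
    labels[voiced], labels[unvoiced] = "V", "U"
    # isolated voiced/unvoiced samples are treated as accidental and fall back to silence
    for i in range(len(labels)):
        left = labels[i - 1] if i > 0 else None
        right = labels[i + 1] if i < len(labels) - 1 else None
        if labels[i] in ("V", "U") and labels[i] not in (left, right):
            labels[i] = "S"
    return labels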

The voiced/unvoiced deconstruction result of the example sentence unit can also be found in Fig. 3. Theoretically, there should be 12 voiced sounds and 7 unvoiced sounds based on the textual content. Our deconstruction scheme successfully classifies 11 voiced sounds and 6 unvoiced sounds, missing the unvoiced "-d" in "speed" and mistaking the voiced "ma-" in "machine" for an unvoiced sound. Generally, this scheme keeps the accuracy around 85~90% when the lecturer speaks calmly and fluently. For the purpose of emphasis analysis, this accuracy is acceptable.

3.3 Acoustic Emphasis Analysis

We believe that when speakers emphasize something, they speak louder and/or raise the tone of their voice. Some acoustic features, such as pitch and loudness, are affected directly. Moreover, the speaker definitely wants the audience to clearly catch every single word that is emphasized and might give the audience some extra time to respond, which may result in longer pauses between words. In order to capture these possible clues, we measure the following features to analyze the emphases acoustically:

• Loudness. As in the V/U/S classification, we measure the loudness of a sentence unit by the short-term energy, but in a different way. Only the samples in voiced sounds are included to calculate an average value Ē. Each sample in the calculation is treated equally, regardless of its position in the voiced sound it belongs to or the position of that voiced sound in the sentence unit. Ē represents the loudness level of the given sentence unit; the average energy of each individual voiced sound is not calculated.


• Pitch. Similar to loudness, we calculate the average pitch of the sentence unit (addressed as P) only over the voiced sounds. It is widely believed that males often speak at 65 to 260 Hz, while females speak in the 100 to 525 Hz range. In our experiments, however, the pitch value within unvoiced sound periods can easily reach 1000 Hz, which has to be excluded from the calculation. The pitch level in our work is extracted with the Aubio toolkit³.

• Syllable Duration. It is reasonable to measure the speaking rate in words when the speech sample is long enough. However, in our task we have already segmented the speech into sentence units of only a few words each. In this case, the length of a word matters more. For example, the German words "Ja" and "Immatrikulationsbescheinigung", which mean "yes" and "confirmation of enrollment" respectively, should not both be counted as one word. A better measurement uses syllables, where "Ja" has only 1 syllable and "Immatrikulationsbescheinigung" has 11. Ideally, the syllables in the transcript should match the voiced sounds in the speech one by one [37]. In practice, however, the numbers may differ because of hesitations ("eh...") by the speaker or mistakes like the one mentioned for "machine" in Fig. 3. Therefore, we calculate both the average syllable duration and the average voiced sound duration and apply the smaller one, addressed as D.

• Pause Rate. The pause rate (Rp) is the percentage of silence in the whole sentence unit. It is supposed to be larger when the speaker emphasizes speech elements with extra pauses, as described previously. Practically speaking, we sum up the total time of the previously classified voiced and unvoiced sounds and deduct it from the sentence unit duration. It can be seen as an additional speaking-rate feature.

With all the above features, the acoustic importance value Aj of the j-th sentence unit in a lecture video with n sentence units in total can be defined as

$$A_j = \bar{E} \times \Big(1 + \frac{j-1}{n-1} \times 0.1\Big) + P \times \lambda + D \times \mu + R_p \times \eta \qquad (1)$$

where λ, µ and η are weights that balance the influences of the different features. They are necessary because the absolute value ranges of the features differ a lot; based on our observations, for instance, Ē ∈ [0, 0.5) while P ∈ (100, 500). In practice, we calculate the average values of the features for each lecture and use these averages as a benchmark to tune λ, µ and η per lecture, so that the influences of all four features become roughly equal. The amendment of Ē is introduced because, as the lecture proceeds, the speaker gradually gets tired and the loudness level decreases unconsciously. The timeline-based amendment compensates for this general energy decay and gives the sentence units in the later phase of the lecture a fairer chance to be detected as emphasized.

In this section, we aim to highlight a certain proportion of sentences with acoustic emphases from a lecture, with the purpose of facilitating learning. Thus there is no need to set a hard threshold to decide whether a sentence unit is acoustically emphasized or not.

3. https://aubio.org/manpages/latest/aubiopitch.1.html

Instead, all sentence units of a lecture are sorted by their calculated importance values in descending order, and the top ones are marked as emphasized. Since a sentence unit might not be a complete sentence, if only one part of a sentence is selected as a highlight, the highlight is extended to the complete sentence.
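A compact sketch of this ranking step is given below. The per-lecture weights are assumed to be tuned beforehand as described above, and the data structure holding the four features is illustrative.

# Sketch: score each sentence unit with Eq. (1) and return the indices of the top ~1/6.
def rank_sentence_units(units, lam, mu, eta, proportion=1 / 6):
    # units: list of dicts with keys "E", "P", "D", "Rp"
    n = len(units)
    scores = []
    for j, u in enumerate(units, start=1):
        decay_fix = 1 + (j - 1) / (n - 1) * 0.1 if n > 1 else 1.0   # energy-decay amendment
        scores.append(u["E"] * decay_fix + u["P"] * lam + u["D"] * mu + u["Rp"] * eta)
    k = max(1, round(n * proportion))
    return set(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])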

3.4 Experimental Implementation

After confirming which sentence units should be highlighted, the next step is to figure out how to highlight them in a user-friendly way in the MOOC context. We could literally do "transcript highlighting" by re-formatting the subtitle file into a pure textual transcript file, highlighting the selected sentence units with bold font, background color or underlining, just as we do with traditional paper books, and making it downloadable. However, watching the video while simultaneously checking external reading material would be an unpleasant experience for the learners, so we have no reason to be optimistic about how well this would work.

Alternatively, we need to implement highlighted sentences in a more easily accessible way, ideally displayed simultaneously with the video. Beeping or flashing could be an option, since these are common ways to arouse attention in various scenarios, but we are afraid they could be too aggressive in an educational setting. We consider the best way to be adding a visual sign for the highlighted sentences in the subtitle file. Such signs would make the user instantly aware of these sentences.

Since we have no previous experience in designing such signs, we can only make an experimental attempt. Each highlighted sentence is surrounded by a pair of solid star pentagons, as shown in Fig. 4. Additionally, a pair of empty star pentagons is used to mark the subtitle item preceding a highlighted sentence unit as a reminder. We collect user feedback on this type of implementation in a later section.
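The subtitle marking itself is straightforward; the following sketch illustrates the idea on a simplified list of subtitle texts. Real subtitle files additionally carry indices and time codes, which would be left untouched.

# Illustrative sketch: wrap highlighted items in solid stars and mark the preceding
# item with empty stars as a reminder.
def mark_subtitles(items, highlighted):
    # items: list of subtitle texts; highlighted: set of indices to emphasize
    marked = list(items)
    for i in sorted(highlighted):
        marked[i] = "\u2605 " + items[i] + " \u2605"                # solid star pair
        if i > 0 and (i - 1) not in highlighted:
            marked[i - 1] = "\u2606 " + items[i - 1] + " \u2606"    # empty star reminder
    return marked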

The importance analysis method we apply at the sentence level is mostly based on acoustic features and is therefore theoretically language independent. There is also no problem with performing the importance analysis on the original teaching language and offering its result in a translated target language, as in Fig. 4, which is a screenshot from the MOOC "Internetworking." This MOOC is taught in English but offered to Chinese-speaking users with subtitles in simplified Chinese⁴.

3.5 Evaluation

In order to evaluate the performance of the proposed scheme, we offered the highlighting result in lectures 4.5 and 4.6 of "Internetworking." It is a comparatively limited-scale evaluation because we did not know how well learners would accept this new "we highlight for you" feature. The evaluation consists of three aspects. First we demonstrate a few highlighted examples and explain the rationale behind them, then we analyze the precision based on ground truth created by multiple experts, and finally we present user feedback. In this experiment, the coefficients in equation (1) are set as follows: λ = 0.001, µ = 0.02, η = 1, according to the principle introduced in Section 3.3, and the selection proportion is around 1/6.

3.5.1 Example Demonstration

The first example is a fragment of lecture 4.5, which talks about the "Neighbor Discovery Protocol (NDP)".

4. https://openhpi.cn/courses/internetworking2016


Fig. 4. The "highlighted" subtitles in the MOOC environment, with star pentagon signs.

The total length is 7:28, and it is segmented into 58 sentence units in the subtitle file. 10 units among them are acoustically highlighted, plus 2 added by the "complete sentence" policy. Here we extract the textual content as a transcript and mark the highlighted part in bold font (please note that this is different from what learners actually see):

"... And the first, I want to mention is the neighbor discovery protocol. The task of the neighbor discovery protocol, NDP, is to facilitate the interaction between adjacent nodes. What are the adjacent nodes? They are neighbor nodes. In IPv6 nodes are considered adjacent, if they are located on the same 'link'. And the IPv6 link is the network area that is bounded by a router ..."

Grammatically, the highlighted content in this example is the answer to a "hypophora," also addressed as "anthypophora." In plain English, a hypophora is a self-answering question, which is generally believed to be used to draw attention or arouse curiosity from the audience [60, 61], in order to heighten the effect of what is being spoken, in other words, to create emphasis. This phenomenon widely exists in educational contexts [62, 63]. Semantically, the content of the highlighted sentences is the explanation of an important technical term, the neighbor node, in a lecture about NDP. All these facts strongly suggest that highlighting this sentence is highly logical.

As mentioned before, "Internetworking" is a MOOC recorded in English but offered in Chinese. The prepared subtitles are completely in simplified Chinese, and the above example is actually presented as:

. . .首先我要提的是邻机发现协议。邻机发现协议NDP的任务是协助邻近节点之间的互动,什么是邻近节点?这些是邻近节点。在IPv6中如果节点位于相同的"链路",则它们被认为是邻近节点。IPv6链路是指一台路由器覆盖的网络区域. . .

We quote the Chinese text here because, due to considerations of word order and fluency, the second and third highlighted sentence units swap their positions when translated from English into Chinese (for non-Chinese speakers, please focus on the different positions of the quotation marks in the example). However, the result is not affected, thanks to the "complete sentence" policy.

The second example derives from lecture 4.6, which talks about the "Dynamic Host Configuration Protocol" under the framework of IPv6 (DHCPv6). This lecture lasts 7:45, with 66 sentence units in total, 12 of them acoustically highlighted and 1 added by the "complete sentence" policy. This example actually corresponds to Fig. 4:

"... In IPv4, this was only possible with the DHCP protocol, the Dynamic Host Configuration Protocol. The DHCP protocol was responsible to dynamically allocate IP address to the host, to allocate the host names, to provide information about default gateway, and information about responsible DNS server (Domain name service). See, the DHCP protocol works in a stateful mode. That means the respective DHCP server knows which host uses which configuration and keeps track of all the interactions ..."

By comparing with the slide shown in the right part of Fig. 4, we can see that the highlighted part in this example is the same as what is written on the slide. People use slides as the outline of a talk; in other words, the slide is the collection of important terms the speaker wants to mention. Detecting them as the key content seems to be a good option.

Despite these successful examples, we are aware that our result is far from perfect. Therefore, we provide a precision analysis in a more general way.


TABLE 1
Precision Analysis on Sentence-Level Highlighting

Method         All Sentences        Highlighted Sentences
               Num     Ave          Num    Ave    Hit    Precision
ToneAnalyzer   124     0.86         22     0.86    8     36.4%
Vokaturi       124     0.86         22     1.11   12     54.5%
Proposed       124     0.86         22     1.14   14     63.6%

3.5.2 Precision Analysis

In this subsection, we evaluate the general accuracy of the automatically highlighted sentences in a fairly objective way. Unlike in a typical classification problem, it is impossible to obtain absolutely objective ground truth in this case, because people always apply different measures as to whether a sentence in the speech is more important. There is only recommending or not recommending, no right and wrong. Instead, we invited multiple experts on the corresponding topics of the test lectures to give their opinions on potential lecture highlights, and their combined opinion is then taken as the ground truth.

The testing videos are again lectures 4.5 and 4.6 of "Internetworking." We asked 10 different experts in total, 5 for each lecture, who graduated from IT-related majors at different universities and still work in this profession, to rate the importance of each sentence unit on three levels: 2 (recommend as highlight), 1 (neutral) and 0 (not important). An importance score for each sentence unit is then calculated by averaging these 5 ratings. If the score of a sentence unit is strictly greater than 1, it is taken as a ground-truth key sentence. Since we did not restrict how many sentences the experts could recommend as highlights, the total number of ground-truth key sentences is significantly larger than the number of automatically highlighted sentences (38 vs. 22), so we only measure precision in this experiment, not recall.
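The aggregation of the expert ratings into ground truth and the resulting precision can be expressed as a short sketch (variable names are illustrative):

# Sketch: a unit is a ground-truth key sentence when its mean rating is strictly above 1.
def precision_against_experts(ratings, highlighted):
    # ratings: {unit_id: [five ratings in {0, 1, 2}]}; highlighted: set of unit_ids
    key_sentences = {u for u, r in ratings.items() if sum(r) / len(r) > 1}
    hits = len(highlighted & key_sentences)
    return hits / len(highlighted) if highlighted else 0.0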

Unfortunately, research on sentence-level emphasis detection in the educational domain is still limited. We therefore could not find any publicly available toolkit to test on our data or any public dataset to run our approach on. However, as already mentioned in Section 2, emphasis is one of the fundamental elements for classifying "happy/joy" and "angry/anger" in emotion analysis, and it is functionally similar to the social tendency "extraversion". We therefore run the state-of-the-art audio-based emotion detection approach Vokaturi⁵ and the highly reputable linguistics-based IBM Watson ToneAnalyzer⁶ for comparison. For each baseline approach, the measurement of emphasis or highlight is defined as the sum of the respective confidence values for being classified as "happy/joy", "angry/anger" and, when applicable, "extraversion". As with our method, all sentence units are sorted by this measurement in descending order and the top 1/6 are selected as highlights.

Please note that the "complete sentence" policy is not applied here for the proposed approach. As can be seen in Table 1, "Ave" represents the average importance score given by the experts for all corresponding sentences, while "Hit" is the number of sentence units that are both highlighted by our acoustic analysis and rated as highlights by the experts. The results show that 14 of the 22 sentence units highlighted by our method are correct, giving a precision of 63.6%, which is higher than both baselines.

5. https://developers.vokaturi.com/downloads/sdk
6. https://tone-analyzer-demo.mybluemix.net/

TABLE 2
Statistics about the Survey (each answer with its Count and Ratio)

Q1: Do you think the feature "we highlight for you" is meaningful in the context of MOOCs?
    (1) Yes             56    76.7%
    (2) No               6     8.2%
    (3) I'm not sure    11    15.1%

Q2: Have you noticed the highlighted sentences in previous lectures?
    (1) Yes             56    77.8%
    (2) No              16    22.2%

Q3: Do you think our current implementation, with star polygon pairs, is appropriate?
    (1) Yes, it's completely appropriate.               32    47.1%
    (2) It's OK, but the reminder is unnecessary.       14    20.6%
    (3) It's OK, but the sign should be more obvious.   13    19.1%
    (4) It's OK, but the sign is too garish.             5     7.4%
    (5) No, it's terrible.                               4     5.9%

Q4: Please rate the current accuracy of the highlights offered (5 stars is the highest, 1 the lowest).
    (1) ★★★★★       20    28.2%
    (2) ★★★★         22    31.0%
    (3) ★★★           21    29.6%
    (4) ★★             1     1.4%
    (5) ★               7     9.9%

Q5: With the current level of accuracy, do you want us to formally apply "we highlight for you" in the following lectures and courses?
    (1) Yes             52    76.5%
    (2) No               7    10.3%
    (3) I'm not sure     9    13.2%

We must stress again, however, that neither Vokaturi nor ToneAnalyzer is specifically designed for emphasis detection, so this task may not make full use of their potential.

3.5.3 User Feedback

Besides the subjective and objective evaluation from the developer's side, opinions directly from the user's side are also crucial. In MOOCs, only when a new feature is welcomed and used by learners can it actually be beneficial. We set up a survey about the general acceptance of the proposed sentence-level "we highlight for you" approach in the form of highlighted subtitles. Since all survey items are optional in principle and independent of each other, the total number of replies per item can differ.

When discussing the prospects of newly developed techniques, the Technology Acceptance Model (TAM) is frequently referenced [64], in which "perceived usefulness" and "perceived ease-of-use" are considered the basic reactions of users who encounter new technology; these then form the "attitude towards using" and finally affect actual use. As illustrated in Table 2, 76.7% of survey respondents acknowledge the positive meaning of the proposed "we highlight for you" feature (Q1), while 77.8% noticed the existence of the highlighted sentences (Q2). These numbers demonstrate the potential usefulness of our work and how easily it can be accessed.

Regarding technical details, users expressed different and somewhat contradictory opinions about the way we implemented the lecture highlights (Q3), and many of them are basically satisfied with the current accuracy (Q4): an average rating of 3.66 is achieved, which can be transformed into 66.5% and is similar to the objective precision obtained (63.6%). Finally, 76.5% of users explicitly indicated a "Yes" attitude towards our new feature by encouraging us to formally adopt subtitles with highlighted items in follow-up lectures and courses, while another 13.2% are not against the idea either (Q5).


Generally speaking, the users accept the "we highlight for you" feature and are basically content with the technical aspects of the proposed sentence-level highlighting approach. However, we are aware that the scale of our survey is limited, not only because of the comparatively small number of participants, but also because of their homogeneity: all of them are native Chinese speakers. Moreover, the structure of the survey and the methodology of the feedback analysis are also relatively simple. We will certainly attempt to improve on this when we have the chance.

4 SEGMENT-LEVEL LECTURE HIGHLIGHTING

4.1 Segment Units Preparation

In segment-level lecture highlighting, the first task is to define the segment. Lecture video segmentation has been researched for years [65–67]. It differs from traditional natural video segmentation, because there are generally only very few scene changes in lecture videos, whereas scene changes are the most important feature in natural video segmentation [68]. However, many lecturers use slides as additional teaching material [69, 70]. The slide transitions can be detected and applied as the boundaries of lecture segments [71, 72]. In this section, we work only with slide-inclusive videos and address the lecture segments as SUs (Slide Units).

Each SU has beginning and ending time tags, and a textual outline can be created from the corresponding screenshotted slide image by OCR (Optical Character Recognition) and TOG (Tree-Structure Outline Generation) [73, 74]. If the digital slide file (in .pdf or .pptx format) is available, the slide content can also be parsed from the file with better accuracy, which further improves the quality of the textual outline for each SU. Meanwhile, the subtitle files of the lecture videos can also be split by the time tags, so that each SU possesses a paragraph of the lecture transcript. The following direct parameters are then available for each SU:

• Type: T-SU (pure textual slide), NT-SU (except for the title, there is no text in the slide but only illustrations, such as charts, images, etc.) and HT-SU (mixed).
• Duration (d): counted in seconds.
• O-Words (WO): total number of words in the slide outline.
• O-Items (I): total number of textual items in the slide outline, including title, topics and subtopics.
• S-Words (WS): total number of words in the speech paragraph.
• Co-Occur (C): total number of words shared by both the slide outline and the speech paragraph.

Based on these direct parameters, we define several indirect parameters to better represent the characteristics of the SUs:

• Speaking Rate: RS = WS / (d / 60)
• Matching Rate: RM = C / WO
• Explanation Rate: RE = WS / WO
• Average O-Item Length: LI = WO / I
• Average O-Item Duration: dI = d / I
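For clarity, the mapping from direct to indirect parameters can be summarized in a small sketch; the field names mirror the notation above and are otherwise illustrative.

# Sketch of one slide unit (SU) with its direct fields and derived indirect parameters.
from dataclasses import dataclass

@dataclass
class SlideUnit:
    d: float    # duration in seconds
    WO: int     # words in the slide outline
    I: int      # textual items in the outline
    WS: int     # words in the speech paragraph
    C: int      # words shared by outline and speech

    @property
    def RS(self): return self.WS / (self.d / 60)   # speaking rate (words per minute)
    @property
    def RM(self): return self.C / self.WO          # matching rate
    @property
    def RE(self): return self.WS / self.WO         # explanation rate
    @property
    def LI(self): return self.WO / self.I          # average outline-item length
    @property
    def dI(self): return self.d / self.I           # average outline-item duration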

With slide-inclusive videos as input, the complete segment preparation process is fully automated. Extra-long lectures, which contain too many SUs, are first cut into several clips by exploring the inter-slide logic [75]. Each clip is then treated as an independent lecture.

Fig. 5. Ascending trend of RE while SU duration increases.

4.2 Importance Analysis of T-SU

For educational purposes, slides generally serve as the outline of the textbook. In these slides the lecturer lists the titles of the subtopics one by one, in addition to some short explanations. In our approach, such a textual-slide-based SU is considered a T-SU. The lecturer should, however, offer some extra information in the lecture speech; otherwise learners could simply read the slides by themselves. The importance evaluation of a T-SU therefore focuses mainly on the connections between the information conveyed in the speech and that on the slide. This involves the following factors:

4.2.1 Expected Explanation Rate

The idea here is based on a simple assumption: the lecturer will explain in more detail when talking about something important. Naturally, a T-SU under such conditions will have a comparatively higher explanation rate. Meanwhile, we notice that the absolute value of RE is not suitable to be taken directly as the measurement, because when we collect data from a complete course ("Web Technologies"), there is an apparent ascending trend of RE with increasing SU duration. Fig. 5 illustrates this trend clearly. Therefore, we introduce the concept of an expected explanation rate, which is estimated from the SU duration d based on the linear trend line fitted in Fig. 5 and addressed as RE(d).

Similarly, another expected explanation rate can be estimated based on the course-scale observation of the SU parameter "average item length (LI)". A smaller LI indicates more key words or key phrases in the slide, while a larger LI indicates that there might be more complete sentences. It is then quite understandable that a lecturer needs to add more extra information in the speech when LI decreases. Fig. 6 captures this trend with the descending trend line, from which we calculate the second expected explanation rate RE(LI).

Now we can take the difference between the expected explanation rates and the actual one as the measurement. The first evaluation factor of a T-SU, fE, is calculated as

$$f_E = R_E - \frac{R_E(d) + R_E(L_I)}{2} \qquad (2)$$


Fig. 6. Descending trend of RE while LI increases.

fE can be either positive or negative, and we expect it to be large and positive if the content of the corresponding T-SU is important.

4.2.2 Hypothesis on Speaking Rate and Matching Rate

As already mentioned in Section 3.2, lecture speech, especially for MOOCs, is generally solo speech recorded in a studio with very limited interference. In such scenarios, the lecturer is uninterrupted and it is easy to keep the speaking rate stable. However, well-experienced teachers know that a lecture should not be delivered like a lullaby. When, where and how to place emphasis is very important for improving teaching quality. When emphasizing, intentional slowing down is a frequently used and effective trick [76, 77].

But we cannot simply take a low speaking rate as evidence of emphasis. Pauses caused by hesitation may also result in a low speaking rate [78]. Unfortunately, such pauses are very difficult to avoid, even for experienced teachers. Thus we cannot distinguish whether a slow-down is intentional or accidental by simply checking the speaking rate.

However, we hypothesize that when the lecturer slows down intentionally, the content of the speech should be closely related to the slide, because, just like the outline of a textbook, the slide should only include the key points, which are the potential targets of emphasis. In this case, the matching rate should be comparatively high. Based on this hypothesis, we introduce the second evaluation factor fH for T-SUs:

$$f_H = \frac{(R_M - \bar{R}_M) \times 100 - (R_S - \bar{R}_S)}{2} \qquad (3)$$

where R̄M and R̄S refer to the average values of RM and RS over the whole course. We also expect fH of important SUs to be large and positive.

4.2.3 Overview Bonus

Many MOOCs are designed for the purpose of popularizing science. A given lecture in such a course is very likely to be an initial introduction to a specific topic, not academically advanced, but covering as many subtopics as possible. For example, a lecture giving a first glance at programming languages may introduce C, C++, Java, Python, etc., separately and briefly. For these lectures, there is probably an overview slide placed at the beginning of the corresponding video, which works as an abstract and is actually the most important part of the video. In our approach, if a video clip is not long (less than 10 minutes), contains only a few slide pages (less than 10) and the first slide is an independent slide, defined by a title discontinuous with that of the second slide, then we acknowledge the first slide of this video as an overview page and give the corresponding SU a bonus (BO).

We can now summarize all 3 factors into a final importance value of a T-SU: VT = fE + λ × fH + BO + µ, where λ is a weight that adjusts the influence of fH and µ is a course-based fixed offset that keeps VT positive. We naturally expect VT to be larger for SUs with higher importance.

4.3 Importance Analysis of NT-SU

In an NT-SU, the slide structure is generally simple: a title and a full-page illustration. The illustration might be a chart, a diagram, an image or, in IT-related courses, a code block. Since O-Words (WO) is meaningless in such slides, several features used for T-SUs are no longer available, such as the explanation rate and the matching rate. We therefore adopt a simple measure here: the total amount of information contained in an NT-SU, which depends on the S-Words (WS). We assume that if a full-page illustration is introduced for a key procedure of a technique or a significant exhibition of an important system, the lecturer will explain it in detail in the speech, logically resulting in a large WS. Illustrations that the lecturer only briefly mentions in a few words are considered less important. We simply define the importance value of an NT-SU as VNT = WS.

4.4 Importance Analysis of HT-SU

The situation of an HT-SU lies between T-SU and NT-SU. With illustrations occupying half the page, there is still a considerable portion of text, which makes all SU parameters available. But since it is very difficult to quantify the proportions of information carried by text and by illustrations, the explanation rate becomes much less convincing. Alternatively, we use the average item duration (dI) as the measurement of how thoroughly the lecturer teaches within a given HT-SU. An illustration is counted as one additional item on the slide.

On the other hand, similar to NT-SUs, we also suggest that the importance of an HT-SU is positively related to the amount of information the lecturer gives, including both WS and WO. The importance value of an HT-SU is defined as

$$V_{HT} = \frac{W_S + W_O - C}{2} + d_I \qquad (4)$$

where C is the co-occurrence, which we subtract as redundancy. VHT should be large when the HT-SU is a key segment.
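The three importance values of Sections 4.2 to 4.4 can be summarized in the following sketch, which reuses the SlideUnit sketch from Section 4.1. The expected explanation rates are modelled here with simple linear fits to the course data, and λ = 0.1 and µ = 3.5 are the course-specific values reported later in Section 4.6.

# Sketch of V_T, V_NT and V_HT; trend coefficients and weights are course-specific.
import numpy as np

def fit_trend(xs, ys):
    # linear trend line used for an expected explanation rate, e.g. R_E(d) or R_E(L_I)
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return lambda x: slope * x + intercept

def importance_T(su, RE_of_d, RE_of_LI, RM_avg, RS_avg, overview_bonus=0.0, lam=0.1, mu=3.5):
    f_E = su.RE - (RE_of_d(su.d) + RE_of_LI(su.LI)) / 2          # Eq. (2)
    f_H = ((su.RM - RM_avg) * 100 - (su.RS - RS_avg)) / 2        # Eq. (3)
    return f_E + lam * f_H + overview_bonus + mu                 # V_T

def importance_NT(su):
    return su.WS                                                 # V_NT

def importance_HT(su):
    return (su.WS + su.WO - su.C) / 2 + su.dI                    # Eq. (4), V_HT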

4.5 Ground-Truth Acquisition

In order to evaluate whether the segments highlighted by our approach are correct, we added survey questions to the self-tests of the MOOC "Web Technologies," a 6-week course taught in English on the openHPI platform⁷.

7. https://open.hpi.de/courses/webtech2015


Fig. 7. General trend illustration for all 3 types of SUs: (a) T-SU, (b) NT-SU, (c) HT-SU.

10022 learners enrolled in this course during the opening period, 1328 participants took the final exam and 1179 of them successfully earned the certificate.

We placed survey questions in 43 video clips with a total length of 632 minutes. 348 SUs are obtained automatically from these videos. In the survey, we asked our learners to select one segment as the most important one in the corresponding lecture video. Over 5000 replies were received for the first video, and even with users dropping out, over 1000 users still took part in the survey of the last video.

For the i-th SU in a video with n SUs in total, if ui users choose it as the most important segment, its basic importance factor IFi is set to

$$IF_i = \frac{u_i}{\sum_{j=1}^{n} u_j} \times n \qquad (5)$$

With this calculation, the importance factor better represents how important the corresponding SU is from the users' point of view. It is based on the proportion of users who select a given SU as the most important, not the absolute number, which avoids the negative influence of varying numbers of survey participants. It is also related to the total number of SUs in the video. Obviously, earning 33% of the votes in a 10-SU video is already quite high for an SU, but earning 33% in a 3-SU video is just average. This property is also well represented in (5). Mathematically, the average value of IFi in a lecture, and in the whole course, is 1.

Moreover, we also followed the discussion forum attached to the course. We believe that the important parts of a lecture intrigue learners to ask more questions. So we counted the total number of questions related to a given SU by content. Each related question earns a small bonus for the importance factor of that SU, and the bonus is also balanced, since there are obviously more questions in the early stage of the course than in the later stage. If the i-th SU has qi related questions, the final importance factor IF′i is set to

$$IF'_i = IF_i + \frac{q_i}{\sqrt{\sum_{j=1}^{n} u_j}} \times \eta \qquad (6)$$

where η is a coefficient that keeps the bonus value in a proper range and can only be set manually based on how many forum threads are created. For "Web Technologies", η is set to 10, which makes each question worth 0.1~0.2. Since this coefficient is only used in the evaluation, it does not affect the automatic process of detecting highlights. In the end, IF′i is taken as the ground truth for the i-th SU in the following evaluation.
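Equations (5) and (6) translate directly into a short computation; u and q are the per-SU vote and question counts of one video, and η is the manually chosen coefficient (10 for "Web Technologies").

# Sketch of the ground-truth importance factors IF and IF' for the SUs of one video.
import math

def importance_factors(u, q, eta=10.0):
    n, total = len(u), sum(u)
    IF = [u_i / total * n for u_i in u]                               # Eq. (5)
    return [IF[i] + q[i] / math.sqrt(total) * eta for i in range(n)]  # Eq. (6)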

Fig. 8. A ratio of 1/k means that, when all SUs are sorted by descending calculated importance, the top 1/k SUs are selected as "key segments". Precision generally increases as the selection rate gets smaller.

TABLE 3
Precision Analysis on Top 1/6 Selection

Type     All Segments        Selected Key Segments (Top 1/6)
         Num     A-IF′       Num    A-IF′    Correct    Accuracy
T-SU     268     1.15        44     1.58     31         70.5%
NT-SU     42     0.82         7     1.40      5         71.4%
HT-SU     38     0.83         6     1.13      4         66.7%
All      348     -           57     -        40         70.2%

4.6 Evaluation

Based on the data collected from "Web Technologies," we set λ = 0.1 to keep the influences of fE and fH at the same level for T-SUs, and µ = 3.5 to shift all VT above zero. Since VT, VNT and VHT have different definitions, we first show their evaluation results separately. As shown in Fig. 7, with the calculated importance value (VT, VNT or VHT) on the x-axis and the ground-truth importance IF′ on the y-axis, the ascending trend, i.e. the positive relation between the two, is sound.

More specifically for T-SUs, which form the majority of all SUs, the hypothesis mentioned in Section 4.2.2 (fH) does not meet our initial expectations on its own. But it is effective in eliminating T-SUs with a low matching rate and a high speaking rate. The assumption about the explanation rate in Section 4.2.1 acts as the foundation of the final result, and the overview bonus in Section 4.2.3 proves to be a positive boost.

After the general evaluation, we also pay attention to potential applications. If we simply sort all SUs in descending order of their calculated importance value, select a certain portion from the top as the highlighted segments and offer them to the learners, precision is the vital measurement.


We define correctness as follows: a highlighted segment is correct if its ground-truth IF′ is greater than 1. Again, the three types of SUs are treated separately, and we take T-SU as an example. Fig. 8 shows the precision as the selection ratio changes.

We further take 1/6 as the selection ratio, just as in sentence-level highlighting, to look at more details. As listed in Table 3, the precisions for the selected T-SUs, NT-SUs and HT-SUs are 70.5%, 71.4% and 66.7% respectively. A-IF′ represents the average IF′ of the SUs included. It is obvious that the average of the highlighted segments is larger than the average of all segments, for every class. The overall precision is 70.2%, which is quite promising to us.

5 POTENTIAL CONNECTION BETWEEN KEY SENTENCES AND SEGMENTS

Since we have achieved fairly positive results in both sentence-level and segment-level lecture highlighting, it is quite natural to ask: is there any connection between the highlighted key sentences and the key segments? In order to find out, we ran the sentence-level analysis introduced in Section 3 on all the test data used in Section 4. The sentence units are distributed to the slide units based on their time tags, on average 18 to 1, and the following features are collected per SU:

• The normalized mean acoustic importance value. First we obtain the acoustic importance value of each sentence unit and then calculate a simple mean value (Ai for the i-th SU in the lecture), with each sentence unit weighted equally. Since the importance values of SUs from different lectures may differ a lot, due to possible differences in recording conditions or lecturer status, we cannot simply apply the absolute value of Ai as the measurement. Instead, we further calculate a lecture-average acoustic importance value (Ā) and define the normalized value as Ai/Ā.

• The standard deviation of the acoustic importance value. The calculation follows exactly the mathematical definition and is addressed as Di.

• Highlighting Rate. The simple ratio of highlighted sentence units among all sentence units, addressed as Ri.

If there were a positive relation between key segments and key sentences, a highlighted segment should be supported by more highlighted sentence units, which means larger values of Ai and Ri. Meanwhile, emphasis requires a contrast between peaks and troughs and would thus result in a larger value of Di. Again we use static coefficients to balance the influences of these features based on their course-wide average values, according to the principle introduced in Section 3.3, and sum them up as VA = Ai + λ × Di + Ri + µ, with λ = 10 and µ = −1.5; this sum is then plotted against the ground-truth importance factor IF′ in Fig. 9. However, the data points are in general randomly distributed and no ascending trend line can be found. Based on the data from "Web Technologies," we must conclude that there is no evidence to support the theory that key segments are constructed from key sentences.
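The per-SU combination used for this check can be sketched as follows; the coefficients correspond to the values given above, and the input structures are illustrative.

# Sketch: combine mean acoustic importance (normalized by the lecture average), its
# standard deviation and the highlighting rate into V_A for one slide unit.
import statistics

def segment_acoustic_value(sentence_scores, highlighted, lecture_avg, lam=10.0, mu=-1.5):
    # sentence_scores: acoustic importance values A of the sentence units in this SU
    A_norm = statistics.mean(sentence_scores) / lecture_avg
    D = statistics.pstdev(sentence_scores)
    R = len(highlighted) / len(sentence_scores)
    return A_norm + lam * D + R + mu          # plotted against IF' in Fig. 9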

Although this result is not ideal, it is actually logical. Acoustically selected key sentences in fact derive from acoustic prominences. Whether it concerns loudness, tone or speaking rate, a prominence is a short-time phenomenon and focuses only on the local context.

Fig. 9. The distribution of data points and the simulated linear trend line of acoustic importance value and ground-truth importance factor.

Generally, it comes from the lecturer's subconscious reaction when attempting to draw attention.

A segment, in contrast, consists on average of 18 sentence units and lasts 109 seconds in "Web Technologies." If a speaker consistently produced acoustic prominences over such a long period of time, it would sound over-excited and quickly lead to fatigue, which is exactly what an experienced teacher tries to avoid in a lecture. It is therefore understandable that the key segments are not acoustically significant and thus not positively related to the key sentences. Key segments, on the other hand, mainly depend on a high explanation rate, a large amount of information and the overview bonus, as explained in Section 4.6. All of these are structural elements and originate from thoughtful decisions made when the teacher prepares the lecture beforehand.

As introduced in Section 1, although both sentence-level and segment-level lecture highlighting aim to improve the online learning experience, their detailed goals are different. Sentence highlighting in subtitles, or in some other kind of real-time transcript, works as a reminder that keeps learners focused. Highlighted segments are more like a selector, giving learners better navigation. From this point of view, the acoustics-based key sentences and the structure-related key segments can accomplish their tasks separately.

6 CONCLUSION

In this paper we proposed to highlight online lectures at both the sentence and the segment level. In the sentence-level approach, we decompose the audio signal into voiced/unvoiced sounds and apply a new scheme that integrates different acoustic features into an importance value. Sentences with larger importance values are highlighted and implemented in subtitle files. Based on expert-generated ground truth, the general precision reached 63.6%, better than the baselines. Example demonstrations and user feedback also support the outcomes.

Segment-level highlighting is based on a novel form of statistical analysis. Slide transitions are used to create SUs, which are further divided into three types according to the slide content.


Taking structural features such as the explanation rate, matching rate and speaking rate as measurements, the connection between lecture speech and slides is explored and the segment-level importance value is calculated. The evaluation based on large-scale user-created ground truth shows a quite promising result, with the precision reaching 70.2%.

Besides these achievements, there are also some limitations to be discussed. Our attempt to find a correlation between highlighted sentences and segments remains unsuccessful. The evaluation scale of the sentence-level approach is quite limited, due to the difficulty of ground-truth creation. Perhaps we could also provide an interactive highlighting tool for voluntary users in order to collect ground truth from the users' perspective. At the segment level, experiments on different courses with different lecturers are needed to test the robustness of the proposed approach.

In the future, we intend to further improve the quality of the highlights. Since the majority of the parameters in the approach are currently set heuristically, it should be possible to optimize them further as the research proceeds. An attempt with machine learning, especially with deep learning, could also be made when more labelled data become available. If possible, we also want to quantitatively evaluate how these highlighted contents actually help online learners by setting up experimental and control groups in a suitable context.

ACKNOWLEDGMENT

The authors would like to thank all the participants in the surveys, especially the experts who helped with generating the ground truth for sentence-level highlighting.

REFERENCES

[1] K. E. Bell and J. E. Limber, "Reading skill, textbook marking, and course performance," Literacy Research and Instruction, vol. 49, no. 1, pp. 56–67, 2009.

[2] R. L. Fowler and A. S. Barker, "Effectiveness of highlighting for retention of text material," Journal of Applied Psychology, vol. 59, no. 3, p. 358, 1974.

[3] R. V. Hogg and J. Ledolter, Applied Statistics for Engineers and Physical Scientists. Macmillan New York, 1992, vol. 59.

[4] J. R. Huffman, R. D. Cruickshank, S. N. Jambhekar, J. Van Myers, and R. L. Collins, "Electronic book having highlighting feature," Sep. 2 1997, US Patent 5,663,748.

[5] E. H. Chi, L. Hong, M. Gumbrecht, and S. K. Card, "ScentHighlights: highlighting conceptually-related sentences during reading," in Proceedings of the 10th International Conference on Intelligent User Interfaces. ACM, 2005, pp. 272–274.

[6] P. Scott, "Teacher talk and meaning making in science classrooms: A Vygotskian analysis and review," Studies in Science Education, vol. 32, no. 1, pp. 45–80, 1998.

[7] L. Pickering, "The role of tone choice in improving ITA communication in the classroom," TESOL Quarterly, pp. 233–255, 2001.

[8] G. Gibbs and M. Coffey, "The impact of training of university teachers on their teaching skills, their approach to teaching and the approach to learning of their students," Active Learning in Higher Education, vol. 5, no. 1, pp. 87–100, 2004.

[9] P. J. Guo, J. Kim, and R. Rubin, "How video production affects student engagement: An empirical study of MOOC videos," in Proceedings of the First ACM Conference on Learning@Scale. ACM, 2014, pp. 41–50.

[10] J. Kim, P. J. Guo, D. T. Seaton, P. Mitros, K. Z. Gajos, and R. C. Miller, "Understanding in-video dropouts and interaction peaks in online lecture videos," in Proceedings of the First ACM Conference on Learning@Scale. ACM, 2014, pp. 31–40.

[11] A. Kurch, N. Malzer, and K. Munch, "Qualitätsstudie zu Live-Untertitelungen – am Beispiel des "TV-Duells"," 2015.

[12] A. Ng and J. Widom, "Origins of the modern MOOC (xMOOC)," in F. M. Hollands and D. Tirthali (eds.), MOOCs: Expectations and Reality: Full Report, pp. 34–47, 2014.

[13] N. Mamgain, A. Sharma, and P. Goyal, "Learner's perspective on video-viewing features offered by MOOC providers: Coursera and edX," in MOOC, Innovation and Technology in Education (MITE), 2014 IEEE International Conference on. IEEE, 2014, pp. 331–336.

[14] F. Grunewald, H. Yang, E. Mazandarani, M. Bauer, and C. Meinel, "Next generation tele-teaching: Latest recording technology, user engagement and automatic metadata retrieval," in Human Factors in Computing and Informatics. Springer, 2013, pp. 391–408.

[15] H. Yang, C. Oehlke, and C. Meinel, "An automated analysis and indexing framework for lecture video portal," in International Conference on Web-Based Learning. Springer, 2012, pp. 285–294.

[16] B. Arons, "Pitch-based emphasis detection for segmenting speech recordings," in ICSLP, 1994.

[17] F. R. Chen and M. Withgott, "The use of emphasis to automatically summarize a spoken discourse," in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, vol. 1. IEEE, 1992, pp. 229–232.

[18] K. E. A. Silverman, "The structure and processing of fundamental frequency contours," Ph.D. dissertation, University of Cambridge, 1987.

[19] J. Hirschberg and B. Grosz, "Intonational features of local and global discourse structure," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 441–446.

[20] G. Kochanski, E. Grabe, J. Coleman, and B. Rosner, "Loudness predicts prominence: Fundamental frequency lends little," The Journal of the Acoustical Society of America, vol. 118, no. 2, pp. 1038–1054, 2005.

[21] R. Silipo and S. Greenberg, "Automatic transcription of prosodic stress for spontaneous English discourse," in Proc. of the XIVth International Congress of Phonetic Sciences (ICPhS), vol. 3, 1999, p. 2351.

[22] F. Tamburini, "Automatic prosodic prominence detection in speech using acoustic features: an unsupervised system," in INTERSPEECH, 2003.

[23] G. Christodoulides and M. Avanzi, "An evaluation of machine learning methods for prominence detection in French," in INTERSPEECH, 2014, pp. 116–119.

[24] M. Heldner, E. Strangert, and T. Deschamps, "A focus detector using overall intensity and high frequency emphasis," in Proc. of ICPhS, vol. 99, 1999, pp. 1491–1494.

[25] R. Fernandez and B. Ramabhadran, "Automatic exploration of corpus-specific properties for expressive text-to-speech: A case study in emphasis," in 6th ISCA Workshop on Speech Synthesis, 2007.


Synthesis, 2007.[26] J. M. Brenier, D. M. Cer, and D. Jurafsky, “The detection

of emphatic words using acoustic and lexical features.” inINTERSPEECH, 2005, pp. 3297–3300.

[27] S. Kakouros, J. Pelemans, L. Verwimp, P. Wambacq, andO. Rasanen, “Analyzing the contribution of top-down lexicaland bottom-up acoustic cues in the detection of sentenceprominence,” Interspeech 2016, pp. 1074–1078, 2016.

[28] L. S. Kennedy and D. P. Ellis, “Pitch-based emphasis detec-tion for characterization of meeting recordings,” in AutomaticSpeech Recognition and Understanding, 2003. ASRU’03.2003 IEEE Workshop on. IEEE, 2003, pp. 243–248.

[29] B. Wrede and E. Shriberg, “Spotting ”hot spots” in meetings:human judgments and prosodic cues.” in INTERSPEECH,2003.

[30] F. Dellaert, T. Polzin, and A. Waibel, “Recognizing emotionin speech,” in Spoken Language, 1996. ICSLP 96. Proceed-ings., Fourth International Conference on, vol. 3. IEEE,1996, pp. 1970–1973.

[31] S. G. Koolagudi and K. S. Rao, “Emotion recognition fromspeech: a review,” International journal of speech technol-ogy, vol. 15, no. 2, pp. 99–117, 2012.

[32] L.-c. Yang and N. Campbell, “Linking form to meaning: theexpression and recognition of emotions through prosody,” in4th ISCA Tutorial and Research Workshop (ITRW) on SpeechSynthesis, 2001.

[33] O. Niebuhr, “On the phonetics of intensifying emphasis ingerman,” Phonetica, vol. 67, no. 3, pp. 170–198, 2010.

[34] D. Ververidis and C. Kotropoulos, “Emotional speech recog-nition: Resources, features, and methods,” Speech communi-cation, vol. 48, no. 9, pp. 1162–1181, 2006.

[35] D. Erickson, O. Fujimura, and B. Pardo, “Articulatory corre-lates of prosodic control: Emotion and emphasis,” Languageand Speech, vol. 41, no. 3-4, pp. 399–417, 1998.

[36] A. Pentland, “Social dynamics: Signals and behavior,” inInternational Conference on Developmental Learning, SalkInstitute, San Diego, CA, 2004.

[37] W. T. Stoltzman, “Toward a social signaling framework:Activity and emphasis in speech,” Ph.D. dissertation, Mas-sachusetts Institute of Technology, 2006.

[38] M. R. Barrick and M. K. Mount, “The big five personality di-mensions and job performance: a meta-analysis,” Personnelpsychology, vol. 44, no. 1, pp. 1–26, 1991.

[39] Y. R. Tausczik and J. W. Pennebaker, “The psychologi-cal meaning of words: Liwc and computerized text analy-sis methods,” Journal of language and social psychology,vol. 29, no. 1, pp. 24–54, 2010.

[40] Y. Wang and A. Pal, “Detecting emotions in social media:A constrained optimization approach.” in IJCAI, 2015, pp.996–1002.

[41] N. H. Bach, K. Shinoda, and S. Furui, “Robust highlightextraction using multi-stream hidden markov models forbaseball video,” in IEEE International Conference on ImageProcessing 2005, vol. 3. IEEE, 2005, pp. III–173.

[42] Y.-F. Huang and W.-C. Chen, “Rushes video summarizationby audio-filtering visual features,” International Journal ofMachine Learning and Computing, vol. 4, no. 4, p. 359,2014.

[43] Y. Zheng, G. Zhu, S. Jiang, Q. Huang, and W. Gao, “Visual-aural attention modeling for talk show video highlight detec-tion,” in 2008 IEEE International Conference on Acoustics,

Speech and Signal Processing. IEEE, 2008, pp. 2213–2216.[44] F. Wang and B. Merialdo, “Multi-document video summa-

rization,” in 2009 IEEE International Conference on Multi-media and Expo. IEEE, 2009, pp. 1326–1329.

[45] S. Lu, Z. Wang, T. Mei, G. Guan, and D. D. Feng, “A bag-of-importance model with locality-constrained coding basedfeature learning for video summarization,” IEEE Transac-tions on Multimedia, vol. 16, no. 6, pp. 1497–1509, 2014.

[46] H.-P. Chou, J.-M. Wang, C.-S. Fuh, S.-C. Lin, and S.-W. Chen, “Automated lecture recording system,” in SystemScience and Engineering (ICSSE), 2010 International Con-ference on. IEEE, 2010, pp. 167–172.

[47] A. R. Ram and S. Chaudhuri, “Media for distance education,”in Video Analysis and Repackaging for Distance Education.Springer, 2012, pp. 1–9.

[48] L. He, E. Sanocki, A. Gupta, and J. Grudin, “Auto-summarization of audio-video presentations,” in Proceedingsof the seventh ACM international conference on Multimedia(Part 1). ACM, 1999, pp. 489–498.

[49] C. M. Taskiran, Z. Pizlo, A. Amir, D. Ponceleon, andE. J. Delp, “Automated video program summarization us-ing speech transcripts,” IEEE Transactions on Multimedia,vol. 8, no. 4, pp. 775–791, 2006.

[50] T. Beaven, A. Comas-Quinn, M. Hauck, B. de los Arcos,and T. Lewis, “The open translation mooc: creating onlinecommunities to transcend linguistic barriers,” Journal ofInteractive Media in Education, vol. 2013, no. 3, 2013.

[51] X. Che, S. Luo, C. Wang, and C. Meinel, “An attempt atmooc localization for chinese-speaking users,” InternationalJournal of Information and Education Technology, vol. 6,no. 2, p. 90, 2016.

[52] X. Che, C. Wang, H. Yang, and C. Meinel, “Punctuationprediction for unsegmented transcript based on word vector,”in Proceedings of the Tenth International Conference onLanguage Resources and Evaluation (LREC 2016), 2016.

[53] X. Che, S. Luo, H. Yang, and C. Meinel, “Sentence boundarydetection based on parallel lexical and acoustic models,”Interspeech 2016, pp. 2528–2532, 2016.

[54] D. Garcia, M. Ball, and A. Parikh, “L@s 2014 demo: bestpractices for mooc video,” in Proceedings of the first ACMconference on Learning@ scale conference. ACM, 2014,pp. 217–218.

[55] W. Krauth, “Coming home from a mooc,” Computing inScience & Engineering, vol. 17, no. 2, pp. 91–95, 2015.

[56] Y. Qi and B. R. Hunt, “Voiced-unvoiced-silence classifica-tions of speech using hybrid features and a network classi-fier,” IEEE Transactions on Speech and Audio Processing,vol. 1, no. 2, pp. 250–255, 1993.

[57] B. Atal and L. Rabiner, “A pattern recognition approachto voiced-unvoiced-silence classification with applicationsto speech recognition,” IEEE Transactions on Acoustics,Speech, and Signal Processing, vol. 24, no. 3, pp. 201–212,1976.

[58] H. Deng and D. O’Shaughnessy, “Voiced-unvoiced-silencespeech sound classification based on unsupervised learning,”in 2007 IEEE International Conference on Multimedia andExpo. IEEE, 2007, pp. 176–179.

[59] R. Bachu, S. Kopparthi, B. Adapa, and B. D. Barkana,“Voiced/unvoiced decision for speech signals based on zero-crossing rate and energy,” in Advanced Techniques in Com-puting Sciences and Software Engineering. Springer, 2010,

1939-1382 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TLT.2017.2716372, IEEETransactions on Learning Technologies

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING & TRANSACTIONS ON LEARNING TECHNOLOGIES 14

pp. 279–282.[60] O. Luzanova, “Means of speech for the creation of a positive

image of man in a panegyric discourse. typical deviationsin english speech, made by non-native speakers (consideringenglish personal advertisements),” 2014.

[61] A. Crines and T. Heppell, “Rhetorical style and issue em-phasis within the conference speeches of ukip’s nigel farage2010–2014,” British Politics, 2016.

[62] H. Pedrosa-de Jesus and B. da Silva Lopes, “Exploringthe relationship between teaching and learning conceptionsand questioning practices, towards academic development,”Higher Education Research Network Journal, p. 37, 2012.

[63] M. Li and X. Jiang, “Art appreciation instruction and changesof classroom questioning at senior secondary school in visualculture context,” Cross-Cultural Communication, vol. 11,no. 1, p. 43, 2015.

[64] F. D. Davis, “Perceived usefulness, perceived ease of use, anduser acceptance of information technology,” MIS quarterly,pp. 319–340, 1989.

[65] M. Onishi, M. Izumi, and K. Fukunaga, “Blackboard seg-mentation using video image of lecture and its applications,”in Pattern Recognition, 2000. Proceedings. 15th Interna-tional Conference on, vol. 4. IEEE, 2000, pp. 615–618.

[66] M. Lin, J. F. Nunamaker Jr, M. Chau, and H. Chen,“Segmentation of lecture videos based on text: a methodcombining multiple linguistic features,” in System Sciences,2004. Proceedings of the 37th Annual Hawaii InternationalConference on. IEEE, 2004, pp. 9–pp.

[67] T. Tuna, M. Joshi, V. Varghese, R. Deshpande, J. Subhlok,and R. Verma, “Topic based segmentation of classroomvideos,” in Frontiers in Education Conference (FIE), 2015.32614 2015. IEEE. IEEE, 2015, pp. 1–9.

[68] I. Koprinska and S. Carrato, “Temporal video segmentation:A survey,” Signal processing: Image communication, vol. 16,no. 5, pp. 477–500, 2001.

[69] A. Hill, T. Arford, A. Lubitow, and L. M. Smollin, ““i’mambivalent about it” the dilemmas of powerpoint,” TeachingSociology, vol. 40, no. 3, pp. 242–256, 2012.

[70] D. G. Levasseur and J. Kanan Sawyer, “Pedagogy meetspowerpoint: A research review of the effects of computer-generated slides in the classroom,” The Review of Communi-cation, vol. 6, no. 1-2, pp. 101–123, 2006.

[71] K. Li, J. Wang, H. Wang, and Q. Dai, “Structuring lecturevideos by automatic projection screen localization and anal-ysis,” IEEE Transactions on Pattern Analysis & MachineIntelligence, vol. 37, no. 6, pp. 1233–1246, 2015.

[72] H. J. Jeong, T.-E. Kim, H. G. Kim, and M. H. Kim,“Automatic detection of slide transitions in lecture videos,”Multimedia Tools and Applications, vol. 74, no. 18, pp.7537–7554, 2015.

[73] H. Yang and C. Meinel, “Content based lecture videoretrieval using speech and video text information,” IEEETransactions On Learning Technologies, vol. 7, no. 2, pp.142–154, 2014.

[74] X. Che, H. Yang, and C. Meinel, “Adaptive e-lecture videooutline extraction based on slides analysis,” in Advances inWeb-Based Learning–ICWL 2015. Springer, 2015, pp. 59–68.

[75] ——, “Lecture video segmentation by automatically analyz-ing the synchronized slides,” in Proceedings of the 21st ACMinternational conference on Multimedia. ACM, 2013, pp.

345–348.[76] U. Natke, J. Grosser, and K. T. Kalveram, “Fluency, funda-

mental frequency, and speech rate under frequency-shiftedauditory feedback in stuttering and nonstuttering persons,”Journal of Fluency Disorders, vol. 26, no. 3, pp. 227–241,2001.

[77] H. Quene, “Multilevel modeling of between-speaker andwithin-speaker variation in spontaneous speech tempo,” TheJournal of the Acoustical Society of America, vol. 123, no. 2,pp. 1104–1113, 2008.

[78] D. O’Shaughnessy, “Timing patterns in fluent and disfluentspontaneous speech,” in Acoustics, Speech, and Signal Pro-cessing, 1995. ICASSP-95., 1995 International Conferenceon, vol. 1. IEEE, 1995, pp. 600–603.

Xiaoyin Che was born on January 2, 1987 in Beijing, China. He entered university in 2005 and received his bachelor's degree from the College of Computer Science, Beijing University of Technology (BJUT) in 2009, majoring in computer science and technology. He then started his master's program in the Multimedia and Intelligent Software Technology Laboratory, College of Computer Science, BJUT, and received the degree in 2012. He is currently a PhD student at the Chair of Internet Technologies and Systems, Hasso Plattner Institute, Potsdam, Germany. He previously researched video coding standards and image processing. His current research interests include multimedia analysis, natural language processing and their applications in e-Learning.

Haojin Yang received the Diploma engineering degree from the Technical University Ilmenau, Germany, in 2008. In 2013, he received the doctorate degree at the Hasso-Plattner-Institute for IT-Systems Engineering (HPI) at the University of Potsdam. His current research interests revolve around multimedia analysis, information retrieval, computer vision and deep learning technology.

Christoph Meinel was born in 1954. He studied mathematics and computer science at the Humboldt University of Berlin from 1974 to 1979. In 1981 he received his PhD degree with the title Dr. rer. nat. From 1981 to 1991 he served as a research assistant at Humboldt University and at the Institute for Mathematics at the Berlin Academy of Sciences. He completed his habilitation in 1988, earning the title Dr. sc. nat.

He has been the Scientific Director and CEO of the Hasso Plattner Institute for Software Systems Engineering GmbH (HPI), Potsdam, Germany, since 2004. Previously he worked at the University of Saarbrucken and the University of Paderborn, and became full professor (C4) for computer science at the University of Trier. His research focuses on Internet and information security, Web 3.0, Semantic Web, social and service Web, and the domains of e-learning, tele-teaching and tele-medicine. Christoph Meinel is author or co-author of 9 books and 4 anthologies, as well as editor of various conference proceedings. More than 400 of his papers have been published in high-profile scientific journals and at international conferences. Prof. Meinel is a member of acatech, the German "National Academy of Science and Engineering", and a member of the IEEE.


10 Medical Image Semantic Segmentation

In this work, we introduce a fully automatic conditional generative adversarial network (cGAN) for medical image semantic segmentation. The proposed framework consists of three components: a generator, a discriminator and a refinement model. The three models are trained jointly, and the final segmentation masks are composed from the outputs of the three models. Our experimental results show that the proposed framework can be successfully applied to different types of medical images of varied sizes.

10.1 Contribution to the Work

• Contributed to the formulation and implementation of the research ideas

• Significantly contributed to the conceptual discussion and implementation

• Provided guidance and supervision of the technical implementation

10.2 Manuscript


Multimedia Tools and Applications
https://doi.org/10.1007/s11042-019-7305-1

Recurrent generative adversarial network for learning imbalanced medical image semantic segmentation

Mina Rezaei · Haojin Yang · Christoph Meinel

Received: 1 October 2018 / Revised: 18 December 2018 / Accepted: 29 January 2019

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract  We propose a new recurrent generative adversarial architecture named RNN-GAN to mitigate the imbalanced data problem in medical image semantic segmentation, where the number of pixels belonging to the desired object is significantly lower than the number belonging to the background. A model trained with imbalanced data tends to bias towards healthy data, which is not desired in clinical applications, and the outputs predicted by such networks have high precision but low recall. To mitigate the impact of imbalanced training data, we train RNN-GAN with the proposed complementary segmentation masks in addition to the ordinary segmentation masks. The RNN-GAN consists of two components: a generator and a discriminator. The generator is trained on sequences of medical images to learn the corresponding segmentation label map plus the proposed complementary label, both at the pixel level, while the discriminator is trained to distinguish whether a segmentation image comes from the ground truth or from the generator network. Both generator and discriminator are equipped with bidirectional LSTM units to enhance temporal consistency and to obtain inter- and intra-slice representations of the features. We show evidence that the proposed framework is applicable to different types of medical images of varied sizes. In our experiments on the ACDC-2017, HVSMR-2016, and LiTS-2017 benchmarks we find consistently improved results, demonstrating the efficacy of our approach.

Keywords  Imbalanced medical image semantic segmentation · Recurrent generative adversarial network

1 Introduction

Medical imaging plays an important role in disease diagnosis, treatment planning, and clinical monitoring [4, 24]. One of the major challenges in medical image analysis is the imbalanced training sample, where pixels of the desired class (lesion or body organ) are often much lower in number than non-lesion pixels. A model learned from class-imbalanced training data is biased towards the majority class. The predicted results of such networks have low sensitivity, i.e. a reduced ability to correctly predict non-healthy classes. In medical applications, the cost of misclassifying the minority class can be much higher than the cost of misclassifying the majority class. For example, the risk of not detecting a tumor can be much higher than that of referring a healthy subject to doctors.

The problem of class imbalance has recently been addressed in disease classification, tumor localization, and tumor segmentation. Two types of approaches have been proposed in the literature: data-level approaches and algorithm-level approaches.

At the data level, the objective is to balance the class distribution by re-sampling the data space [35, 52], for example by applying SMOTE (Synthetic Minority Over-sampling Technique) to the positive class [10] or by under-sampling the negative class [23]. However, these approaches often remove important samples or add redundant samples to the training set.

Algorithm-level solutions address the class imbalance problem by modifying the learning algorithm to alleviate the bias towards the majority class. Examples are cascade training [8, 11] and training with cost-sensitive functions [47], such as the Dice coefficient loss [11, 13, 41] and the asymmetric similarity loss [18], which modify the training distribution with regard to the misclassification cost.

In this paper, we mitigate the effect of imbalanced training samples on two levels. At the data level, we explore the advantage of training the network with inverse-class-frequency segmentation masks, named complementary segmentation masks, in addition to the ground truth segmentation masks (ordinary masks), which can then be used to improve the overall quality of the predicted segmentation. Assume Y is the true segmentation label annotated by an expert and Ȳ is the synthesized complementary label of the corresponding image. In the complementary mask Ȳ, the majority and minority pixel values are swapped to skew the bias away from the majority pixels, with a negative label for the majority class and a positive label for the remaining c − 1 classes. Our network is then trained with both the ordinary segmentation mask Y and the complementary segmentation mask Ȳ at the same time, but with multiple losses. The final segmentation mask is refined by considering both the ordinary and the complementary mask predictions.

At the algorithm level, we study the advantage of mixing the adversarial loss with a categorical accuracy loss, compared to traditional losses such as the ℓ1 loss. Image segmentation is an important task in medical imaging that attempts to identify the exact boundaries of objects such as organs or abnormal regions (e.g. tumors). Automating medical image segmentation is a challenging task due to the high diversity in the appearance of tissues among different patients and, in many cases, the similarity between healthy and non-healthy tissues. Numerous automatic approaches have been developed to speed up medical image segmentation [32]. We can roughly divide the current automated algorithms into two categories: those based on generative models and those based on discriminative models.

Generative probabilistic approaches build the model based on prior domain knowledge about the appearance and spatial distribution of the different tissue types. Traditionally, generative probabilistic models have been popular, where simple conditionally independent Gaussian models [14] or Bayesian learning [33] are used for tissue appearance. In contrast, discriminative probabilistic models directly learn the relationship between local image features [3] and segmentation labels without any domain knowledge. Traditional discriminative approaches such as SVMs [2, 9], random forests [27], and guided random walks [12] have been used in medical image segmentation. Deep neural networks (DNNs) are one of the most popular discriminative approaches, where the machine learns a hierarchical representation of features without any handcrafted features [26, 51]. In the field of medical image segmentation, Ronneberger et al. [38] presented a fully convolutional neural network, named UNet, for segmenting neuronal structures in electron microscopy stacks.

Recently, GANs [15] have gained a lot of momentum in the research community. Mirza et al. [28] extended the GAN framework to the conditional setting by making both the generator and the discriminator network class-conditional. Conditional GANs (cGANs) have the advantage of being able to provide better representations for multi-modal data generation, since there is control over the modes of the data being generated. This makes cGANs suitable for the image semantic segmentation task, where we condition on an observed image and generate a corresponding output image.

Unlike previous works on cGANs [22, 29, 48], we map a 2D sequence of medical images into a 2D sequence of semantic segmentations. In our method, 3D biomedical images are represented as a sequence of 2D slices (i.e. as z-stacks). We use bidirectional LSTM units [16], which are an extension of classical LSTMs and are able to improve model performance on sequence processing by enhancing temporal consistency. We use time distribution between convolutional layers and bidirectional LSTM units at the bottleneck of the generator and the discriminator to obtain inter- and intra-slice representations of features.

Summarizing, the main contributions of this paper are:

– We introduce RNN-GAN, a new adversarial framework that improves semantic segmentation accuracy. The proposed architecture shows promising results for the segmentation of small lesions as well as anatomical regions.

– Our proposed method mitigates imbalanced training data with biased complementary masks in the task of semantic segmentation.

– We study the effect of different losses and architectural choices that improve semantic segmentation.

The rest of the paper is organized as follows: in the next section, we review recent methods for handling imbalanced training data and semantic segmentation tasks. Section 3 explains the proposed approach for semantic segmentation, while the detailed experimental results are presented in Section 4. We conclude the paper and give an outlook on future research in Section 5.

2 Related work

This section briefly reviews previous studies in the areas of learning from imbalanced datasets, generative adversarial networks, and medical image semantic segmentation, mostly from recent years.

Handling imbalanced training datasets. Cascade architectures [8] and ensemble approaches [43] have provided the best performance on highly imbalanced medical datasets such as LiTS-2017 for the segmentation of very small lesion(s). Some works have focused on balancing recall and precision with an asymmetric loss [18], others used an accuracy loss [41] or weighted the imbalanced classes according to their frequency in the dataset [8, 36]. Similar to some recent work [39, 41], we mitigate the negative impact of class imbalance by mixing an adversarial loss with a categorical accuracy loss and by training the deep model with complementary masks.

Learning with complementary labels. Recently, complementary labels have been used in the context of machine learning [21] by assuming that the transition probabilities are identical and by modifying traditional one-versus-all and pairwise-comparison losses for multi-class classification. Ishida et al. [21] theoretically prove that an unbiased estimator of the classification risk can be obtained from complementary labels. Yu et al. [50] study how learning from both complementary labels and ordinary labels can provide a useful application for the multi-class classification task. Inspired by these recent successes [21, 50], we train the proposed RNN-GAN with both complementary and ordinary labels for the task of semantic segmentation, in order to skew the bias away from the majority pixels.

Generative adversarial networks. Previous works [22, 54] show the success of conditional GANs as a general setting for image-to-image translation. Some recent works applied GANs unconditionally for image-to-image translation, forcing the generator to predict the desired output under an ℓ1 [48] or ℓ2 [31, 53] regression. Here, we study mixing the adversarial loss in a conditional setting with a traditional loss and an accuracy loss, motivated by attenuating the effect of the imbalanced training dataset. Our method also differs from prior works [22, 25, 29, 55] in the architectural setting of the generator and the discriminator: we use bidirectional LSTM units on top of the generator and discriminator architectures to capture the temporal consistency between 2D slices.

Medical image semantic segmentation. The UNet has achieved promising results in medical image segmentation [38], since it allows low-level features to be concatenated with high-level features, which provides a better learning representation. Later, UNet combined with a residual network [6] or in a cascade of 2D and 3D models [20] was used for cardiac image segmentation and heterogeneous liver segmentation [8]. The generator network in RNN-GAN is a modified UNet, where high-resolution features are concatenated with up-sampled global low-resolution features to help the network learn both local and global information.

3 Method

In this section we present the recurrent generative adversarial network for medical image semantic segmentation. To tackle the misclassification cost and mitigate imbalanced pixel labels, we mix the adversarial loss with a categorical accuracy loss (Section 3.1). Moreover, we explain our intuition for skewing the bias away from the majority pixels with the proposed complementary labels (Section 3.2).

3.1 Recurrent generative adversarial network

In a conventional generative adversarial network, the generative model G tries to learn a mapping from a random noise vector z to an output image y, G : z → y. Meanwhile, a discriminative model D estimates the probability of a sample coming from the training data x_real rather than from the generator x_fake. The GAN objective function is a two-player mini-max game with value function V(G, D):

\min_G \max_D V(D, G) = \mathbb{E}_{y}[\log D(y)] + \mathbb{E}_{z}[\log(1 - D(G(z)))] \qquad (1)

In a conditional GAN, the generative model learns a mapping from an observed image x and a random vector z to the output image y, G : x, z → y. The discriminator D, on the other hand, attempts to discriminate between the generator's output images and the training set images. According to (2), in the cGAN training procedure both G and D are conditioned on the desired output y.

\min_G \max_D V(D, G) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \qquad (2)
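To make the two-player objective in (1)-(2) concrete, the following sketch (not taken from the authors' released code) shows how the discriminator and generator terms of a conditional GAN can be written with TensorFlow/Keras. The models `generator` and `discriminator`, the noise dimension of 64 and the optimizer objects are hypothetical placeholders, not the exact configuration of this paper.

```python
# Minimal sketch of the conditional GAN objective in Eq. (2), written with
# TensorFlow/Keras primitives. `generator` and `discriminator` are assumed to
# be tf.keras.Model instances; the discriminator takes the condition x and a
# mask and outputs a probability of "real".
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(d_real, d_fake):
    # E_{x,y}[log D(x,y)] + E_{x,z}[log(1 - D(x,G(x,z)))], written as a loss to minimize
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_adv_loss(d_fake):
    # Non-saturating form: the generator tries to make D label its output as real
    return bce(tf.ones_like(d_fake), d_fake)

def train_step(x, y_seg, generator, discriminator, g_opt, d_opt):
    z = tf.random.normal([tf.shape(x)[0], 64])           # random noise vector z
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        y_fake = generator([x, z], training=True)         # G(x, z)
        d_real = discriminator([x, y_seg], training=True)
        d_fake = discriminator([x, y_fake], training=True)
        d_loss = discriminator_loss(d_real, d_fake)
        g_loss = generator_adv_loss(d_fake)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return g_loss, d_loss
```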


More specifically, in our proposed RNN-GAN network, the generative model learns the mapping from a given sequence of 2D medical images x_i to the corresponding semantic segmentation labels y_seg^i, G : x_i, z → {y_seg^i} (where i refers to the 2D slice index, between 1 and 20, for the 20 slices acquired from ACDC-2017). The training procedure for the semantic segmentation task is similar to the two-player mini-max game (3). While the generator predicts the segmentation at the pixel level, the discriminator takes the ground truth and the generator's output to determine whether the predicted label is real or fake.

\mathcal{L}_{adv} \leftarrow \min_G \max_D V(D, G) = \mathbb{E}_{x,y_{seg}}[\log D(x, y_{seg})] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \qquad (3)

We mix the adversarial loss with the ℓ1 distance (4) to minimize the absolute difference between the predicted segmentation and the ground truth. The ℓ1 objective therefore takes into account CNN features and the differences between the predicted segmentation and the ground truth, resulting in less noise and smoother boundaries.

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,z} \left\| y_{seg} - G(x, z) \right\|_{1} \qquad (4)

\mathcal{L}_{\ell_{acc}}(G) = \frac{1}{c} \sum_{j=1}^{c} \sum_{i=1}^{n} \frac{y_{seg}^{ij} \cap G(x^{ij}, z)}{y_{seg}^{ij} \cup G(x^{ij}, z)} \qquad (5)

where j and i index the semantic classes and the 2D slices of each patient, respectively, with c the number of classes and n the number of slices.

Moreover, we mix in the categorical accuracy loss ℓ_acc (5) in order to mitigate imbalanced training data by assigning a higher cost to the less represented set of pixels, boosting its importance during the learning process. The categorical accuracy loss checks whether the maximal true value is equal to the maximal predicted value for each category of the segmentation.

Then, the final adversarial loss for the semantic segmentation task with RNN-GAN is calculated through (6).

\mathcal{L}_{RNN\text{-}GAN}(D, G) = \mathcal{L}_{adv}(D, G) + \mathcal{L}_{L1}(G) + \mathcal{L}_{\ell_{acc}}(G) \qquad (6)
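The sketch below illustrates how the mixed generator objective of Eqs. (4)-(6) could be assembled in code, under the same assumptions as the earlier sketch. The per-class accuracy term of Eq. (5) is relaxed here to a soft intersection-over-union on one-hot/softmax maps, and it is negated to turn the score into a loss; this is an illustrative reading, not the authors' exact implementation.

```python
# Sketch of the mixed generator objective in Eqs. (4)-(6): adversarial term +
# l1 distance + a per-class soft intersection-over-union "accuracy" term.
import tensorflow as tf

def l1_loss(y_seg, y_fake):
    # Eq. (4): E || y_seg - G(x, z) ||_1
    return tf.reduce_mean(tf.abs(y_seg - y_fake))

def soft_class_accuracy(y_seg, y_fake, eps=1e-6):
    # Eq. (5), relaxed to soft masks: mean over the c classes of |y ∩ G| / |y ∪ G|.
    # Inputs: (batch, H, W, c) one-hot ground truth and softmax predictions.
    inter = tf.reduce_sum(y_seg * y_fake, axis=[0, 1, 2])
    union = tf.reduce_sum(y_seg + y_fake - y_seg * y_fake, axis=[0, 1, 2])
    return tf.reduce_mean(inter / (union + eps))

def generator_total_loss(d_fake, y_seg, y_fake):
    # Eq. (6): L_adv + L_L1 + L_acc; the accuracy term is a score (higher is
    # better), so it is subtracted here to act as a loss.
    adv = tf.keras.losses.BinaryCrossentropy()(tf.ones_like(d_fake), d_fake)
    return adv + l1_loss(y_seg, y_fake) - soft_class_accuracy(y_seg, y_fake)
```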

In this work, similar to Isola et al. [22], we use Gaussian noise z in the generator alongside the input data x. As discussed by Isola et al. [22], when training a conditional generative model of the conditional distribution P(y|x), it is preferable that the trained model can produce more than one sample y for each input x. When the generator G takes a random vector z in addition to the input image x, G(x, z) can generate as many different outputs for each x as there are values of z. Especially for medical image segmentation, the diversity of image acquisition methods (e.g., MRI, fMRI, CT, ultrasound), their settings (e.g., echo time, repetition time), geometry (2D vs. 3D), and differences in hardware (e.g., field strength, gradient performance) can result in variations in the appearance of body organs and tumour shapes [19]; learning with the random vector z alongside the input image x therefore makes the network robust against noise and improves the output samples. This has been confirmed by our experimental results on datasets with a large range of variation.

3.2 Complementary label

In order to mitigate the impact of imbalanced pixel labels in medical images, the proposed RNN-GAN, as described in Fig. 1, is trained with complementary masks (Fig. 2, third column) in addition to the ordinary masks (Fig. 2, columns 4–6). Similar to Yu et al. [50], we assume that the transition probabilities are identical, so that the adversarial loss (i.e. the categorical cross-entropy loss) provides an unbiased estimator for minimizing the risk. Since we make the same assumption, we skip the theoretical proof and experimentally show that complementary labels, in addition to the ordinary losses, are able to provide more accurate results for the task of semantic segmentation.

Fig. 1 The architecture of RNN-GAN consists of two deep networks: a generative network G and a discriminative network D. G takes a sequence of 2D images as a condition and generates the sequence of 2D semantic segmentation outputs; D determines whether those outputs are real or fake. RNN-GAN captures inter- and intra-slice feature representations with bidirectional LSTM units at the bottleneck of both the G and D networks. Here, G is a modified UNet architecture and D is a fully convolutional encoder.
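The snippet below sketches one plausible reading of the complementary mask described in Section 3.2: the majority (background) class receives the negative label and all remaining c − 1 classes the positive label. The function name and the toy example are hypothetical illustrations, not the authors' exact construction.

```python
# Build a complementary mask from a one-hot ground-truth segmentation, under
# the interpretation described in the lead-in (illustrative only).
import numpy as np

def complementary_mask(one_hot_mask, background_index=0):
    """one_hot_mask: (H, W, c) one-hot ground-truth segmentation."""
    background = one_hot_mask[..., background_index]
    # Positive wherever the pixel belongs to any of the c-1 non-background classes.
    return 1.0 - background

# Example: a tiny 4-class cardiac mask (background, RV, myocardium, LV)
y = np.zeros((4, 4, 4), dtype=np.float32)
y[..., 0] = 1.0                        # everything background ...
y[1, 1, 0], y[1, 1, 2] = 0.0, 1.0      # ... except one myocardium pixel
print(complementary_mask(y))           # 1 only at the non-background pixel
```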

3.3 Network architecture

The proposed architecture is shown in Fig. 1, with the generator network G on the left followed by the discriminator network D on the right side of the figure. We place bidirectional LSTM units at the bottleneck of both G and D to capture the non-linear relationship between the previous, current, and next 2D slices, which is key to processing sequential data.

3.3.1 Recurrent generator

The recurrent generator takes a random vector z plus a sequence of 2D medical images. Similar to the UNet architecture, we add skip connections between each layer r and the corresponding layer t − 1 − r, where t represents the total number of layers. Each skip connection simply concatenates all channels at layer r with those at layer t − 1 − r. Feature maps from the convolutional part of the down-sampling path are fed into the up-convolution part of the up-sampling path. The generator is trained on sequences of input images from the same patient and the same acquisition plane. We use convolutional layers with kernel size 5 × 5 and stride 2 for down-sampling, and perform up-sampling with an image resize layer with a factor of 2 followed by a convolutional layer with kernel size 3 × 3 and stride 1.

Fig. 2 The chest MR image from ACDC-2017 after pre-processing. The first column is the semantic segmentation mask corresponding to the MR images in the second column. Columns 3–6 present the complementary label mask, right ventricle, myocardium vessel, and left ventricle, where we map the 2D images from the second column into the four segmentation masks presented in columns 3–6.

Fig. 3 The cardiac MR image from ACDC-2017 after pre-processing; the left image shows an end-systolic sample and the right image the end-diastolic phase. We extract the complementary mask from the inverse of the ground truth file annotated by a medical expert, presented in the second and seventh columns. The other binary masks extracted from the ground truth file, in columns 3–5 and 8–10, are the right ventricle, myocardium vessel, and left ventricle, which are used by the discriminator. The first and sixth columns are an example input of the generator.
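A compact tf.keras sketch of the recurrent generator described above is given below: TimeDistributed 5 × 5/stride-2 convolutions for down-sampling, a bidirectional LSTM over the slice dimension at the bottleneck, and ×2 resize plus 3 × 3 convolutions with skip connections for up-sampling. The depth, filter counts, LSTM width, and the reduced 64 × 64 resolution are placeholders for illustration, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def recurrent_generator(seq_len=10, size=64, channels=1, classes=4):
    x = layers.Input(shape=(seq_len, size, size, channels))

    # Down-sampling path (each 5x5/stride-2 conv halves the spatial resolution)
    d1 = layers.TimeDistributed(layers.Conv2D(32, 5, strides=2, padding="same",
                                              activation="tanh"))(x)
    d2 = layers.TimeDistributed(layers.Conv2D(64, 5, strides=2, padding="same",
                                              activation="tanh"))(d1)

    # Bidirectional LSTM bottleneck over the slice dimension
    h = layers.TimeDistributed(layers.Flatten())(d2)
    h = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(h)
    h = layers.TimeDistributed(layers.Dense((size // 4) * (size // 4) * 64,
                                            activation="tanh"))(h)
    h = layers.Reshape((seq_len, size // 4, size // 4, 64))(h)

    # Up-sampling path: x2 resize + 3x3 conv, with skip connections to the encoder
    u1 = layers.TimeDistributed(layers.UpSampling2D(2))(h)
    u1 = layers.concatenate([u1, d1])
    u1 = layers.TimeDistributed(layers.Conv2D(32, 3, padding="same",
                                              activation="tanh"))(u1)
    u2 = layers.TimeDistributed(layers.UpSampling2D(2))(u1)
    out = layers.TimeDistributed(layers.Conv2D(classes, 3, padding="same",
                                               activation="softmax"))(u2)
    return tf.keras.Model(x, out, name="recurrent_generator")
```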

3.3.2 Recurrent discriminator

The discriminator network is a classifier and has a structure similar to the encoder of the generator network. Hierarchical features are extracted from the fully convolutional encoder of the discriminator and used to classify between the generator's segmentation output and the ground truth. More specifically, the discriminator is trained to minimize the average negative cross-entropy between the predicted and the true labels.

The two models are then trained through back-propagation corresponding to the two-player mini-max game (see (3)). We use categorical cross-entropy [30] as the adversarial loss. In this work, the recurrent architecture selected for both the discriminator and the generator is a bidirectional LSTM [16].
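For symmetry with the generator sketch, the following is an illustrative version of the recurrent discriminator: a TimeDistributed convolutional encoder over (image, mask) pairs, a bidirectional LSTM on the bottleneck, and a real/fake decision per sequence. All sizes are placeholders, and the sigmoid real/fake head matches the simplified binary loss used in the earlier training-step sketch rather than the paper's exact output layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

def recurrent_discriminator(seq_len=10, size=64, channels=1, classes=4):
    image = layers.Input(shape=(seq_len, size, size, channels))
    mask = layers.Input(shape=(seq_len, size, size, classes))
    h = layers.concatenate([image, mask])                 # condition on the image
    for filters in (32, 64):
        h = layers.TimeDistributed(layers.Conv2D(filters, 5, strides=2,
                                                 padding="same",
                                                 activation="tanh"))(h)
    h = layers.TimeDistributed(layers.Flatten())(h)
    h = layers.Bidirectional(layers.LSTM(128))(h)         # inter-slice features
    out = layers.Dense(1, activation="sigmoid")(h)        # real vs. fake
    return tf.keras.Model([image, mask], out, name="recurrent_discriminator")
```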

4 Experiments

We validated the performance of RNN-GAN on three recent public medical imaging challenges with real patient data: the MICCAI 2017 automated cardiac MRI segmentation challenge (ACDC-2017) [5], the CT liver tumour segmentation challenge (LiTS-2017), and the 2016 whole-heart and great vessel segmentation challenge (HVSMR).

4.1 Datasets and pre-processing

Our experiments are based on three independent datasets, two cardiac MR datasets and an abdominal CT dataset, all segmented manually by radiologists at the pixel level.

ACDC. The ACDC dataset1 comprises 150 patients with 3D cine-MR images acquired in clinical routine. The training database is composed of 100 patients; for all these data, the corresponding manual references were given by a clinical expert. The testing database consists of 50 patients without manual references. Figure 3 shows cardiac MR images from the ACDC dataset.

1 https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html


Fig. 4 The abdominal CT image from LiTS-2017. The first and second columns show the slice before and after pre-processing. Our generator takes the pre-processed slices (second column) and learns to map them to the third and fourth columns by getting feedback from the discriminator.

HVSMR. Thirty training cine MRI scans from 10 patients were provided by the organizers of the HVSMR challenge.2 Three images were provided for each patient: a complete axial cine MRI, the same image cropped around the heart and the thoracic aorta, and a cropped short-axis reconstruction.

LiTS. In the third experiment, we used the LiTS-2017 benchmark3, which comprises 130 CT training and 70 test subjects. The examined patients were suffering from different liver cancers. The challenging part is the segmentation of very small lesion targets in a highly unbalanced dataset. Here, pre-processing is carried out in a slice-wise fashion. We windowed the Hounsfield unit (HU) values to the range [100, 400] to exclude irrelevant organs and objects, as shown in Fig. 4. Furthermore, we applied histogram equalization to increase the contrast for better differentiation of abnormal liver tissue.
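A small sketch of this slice-wise CT pre-processing is shown below: the Hounsfield units are windowed to [100, 400] (the window stated in the text) and histogram equalization is applied. The implementation uses only NumPy; everything beyond the window limits is an illustrative choice.

```python
import numpy as np

def preprocess_ct_slice(hu_slice, lo=100, hi=400, bins=256):
    clipped = np.clip(hu_slice, lo, hi)
    scaled = (clipped - lo) / float(hi - lo)               # map window to [0, 1]
    # Histogram equalization via the cumulative distribution function
    hist, bin_edges = np.histogram(scaled, bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]
    return np.interp(scaled.ravel(), bin_edges[:-1], cdf).reshape(scaled.shape)

ct = np.random.randint(-1000, 1500, size=(256, 256))       # synthetic CT slice in HU
print(preprocess_ct_slice(ct).min(), preprocess_ct_slice(ct).max())
```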

Pre-processing of MR images. The gray-scale distribution of MR images depends on the acquisition protocol and the hardware. This makes learning difficult, since we expect to have the same data distribution from one subject to another. Therefore, pre-processing is an important step toward bringing all subjects under similar distributions. We applied a bias field correction to the MR images from the HVSMR and ACDC datasets to correct the intensity non-uniformity using N4ITK [42]. Lastly, we applied histogram matching normalization to all 2D slices from the sagittal, coronal, and axial planes.
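The following is a hedged sketch (not the authors' pipeline) of the two MR pre-processing steps named above, using SimpleITK: N4 bias field correction followed by histogram matching against a chosen reference volume. The file paths, the Otsu mask, and the parameter values are placeholders.

```python
import SimpleITK as sitk

def preprocess_mr(moving_path, reference_path):
    img = sitk.Cast(sitk.ReadImage(moving_path), sitk.sitkFloat32)
    ref = sitk.Cast(sitk.ReadImage(reference_path), sitk.sitkFloat32)

    # N4 bias field correction (intensity non-uniformity), with an Otsu mask as
    # in the standard SimpleITK example
    mask = sitk.OtsuThreshold(img, 0, 1, 200)
    corrected = sitk.N4BiasFieldCorrectionImageFilter().Execute(img, mask)

    # Histogram matching normalization against the reference subject
    matcher = sitk.HistogramMatchingImageFilter()
    matcher.SetNumberOfHistogramLevels(1024)
    matcher.SetNumberOfMatchPoints(7)
    matcher.ThresholdAtMeanIntensityOn()
    return matcher.Execute(corrected, ref)
```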

4.2 Implementation and configuration

The RNN-GAN architecture is implemented with the Keras [7] and TensorFlow [1] libraries. The code is available on the author's GitHub.4 All training was conducted on a workstation equipped with an NVIDIA TITAN X GPU.

The model was trained for up to 120 epochs with batch size 10, 450 iterations, and an initial learning rate of 0.001 on the ACDC dataset. Similarly, for HVSMR we used an initial learning rate of 0.001, batch size 10, 2750 iterations, and 100 epochs, using all 2D slices from the coronal, sagittal, and axial planes with size 256 × 256. The generator and discriminator use the tanh activation function in all layers except the output layer, which uses softmax. We use categorical cross-entropy as the adversarial loss, mixed with categorical accuracy and ℓ1. The RMSprop optimizer was used for both the generator and the discriminator; RMSprop divides the learning rate by an exponentially decaying average of squared gradients.
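As a minimal sketch of the optimizer settings mentioned above (RMSprop with a learning rate of 0.001 for both networks), assuming the hypothetical models and custom training step from the earlier sketches:

```python
import tensorflow as tf

# RMSprop with the initial learning rate stated in the text
g_opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
d_opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)

# These optimizers would then be passed to the custom train_step shown earlier,
# iterating over mini-batches of size 10 for up to 120 epochs.
```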

2 http://segchd.csail.mit.edu/
3 https://competitions.codalab.org/competitions/17094
4 https://github.com/HPI-DeepLearning/Recurrent-GAN


Table 1  Comparison of the achieved accuracy in terms of the Dice metric on the ACDC benchmark with related approaches and top-ranked methods; the best performance in each cardiac phase and region of interest is shown in bold.

Methods                 Phase   Left ventricle   Right ventricle   Myocardium
RNN-GAN                 ED      0.968            0.940             0.933
                        ES      0.951            0.919             0.925
cGAN                    ED      0.934            0.906             0.899
                        ES      0.918            0.874             0.870
Isensee et al. [20]     ED      0.955            0.925             0.865
                        ES      0.905            0.834             0.882
Wolterink et al. [46]   ED      0.96             0.92              0.86
                        ES      0.91             0.84              0.88
Rohe et al. [37]        ED      0.94             0.96              0.90
                        ES      0.92             0.95              0.90
Zotti et al. [56]       ED      0.96             0.94              0.89
                        ES      0.94             0.87              0.90
U-Net [38]              ED      0.96             0.88              0.78
                        ES      0.92             0.79              0.76
Poudel et al. [34]      −       0.90             −                 −

The network was trained with both the ground truth and complementary masks, and the adversarial loss was mixed with ℓ1 and categorical accuracy.

Table 2  Comparison of the achieved accuracy in terms of the Hausdorff distance (mm) on the ACDC benchmark with top-ranked participant approaches and related work; the best performance in each cardiac phase and region of interest is shown in bold.

Methods                 Phase   Left ventricle   Right ventricle   Myocardium
RNN-GAN                 ED      6.82             8.95              8.08
                        ES      8.02             12.17             8.69
cGAN                    ED      8.62             12.16             9.04
                        ES      9.44             13.2              9.50
Isensee et al. [20]     ED      7.38             10.12             8.72
                        ES      6.90             12.14             8.67
Wolterink et al. [46]   ED      7.47             11.87             11.12
                        ES      9.6              13.39             10.06
Rohe et al. [37]        ED      7.04             14.04             11.50
                        ES      10.92            15.92             13.03
Zotti et al. [56]       ED      5.96             13.48             8.68
                        ES      6.57             16.66             8.99
U-Net [38]              ED      6.17             20.51             15.25
                        ES      8.29             21.20             17.92

Here, RNN-GAN was trained with the ground truth and complementary masks, and the adversarial loss was mixed with ℓ1 and categorical accuracy.


Fig. 5 The cardiac segmentation results at test time by RNN-GAN from the ACDC 2017 benchmark on Patient084. The red, green, and blue contours present the right ventricle, myocardium, and left ventricle regions, respectively. The top two rows show the diastolic phase in different slices from t=0 to t=9 of the cycle. The third and fourth rows present the systolic cardiac phase from t=0 to t=9 of the cycle.

Training took eight hours on ACDC for a total of 120 epochs on parallel NVIDIA TITAN X GPUs; with the same configuration, it took 12 hours on the HVSMR dataset. With this implementation, we are able to produce a cardiac segmentation mask in 500-700 ms per patient for the same cardiac phase from the ACDC dataset on an axial plane.

The proposed approach is trained on 75% of the training data released by the HVSMR-2016 and LiTS-2017 benchmarks. We used all provided images from the sagittal, coronal, and axial axes for training, validation and testing. For the ACDC dataset, we trained our system on 75 exams from the axial, coronal, and sagittal planes and validated it on the remaining 25 exams.

In both the training and testing phases, a mini-batch consists of 2D images from the same patient, the same acquisition plane and the same cardiac phase. We initially normalize the inputs, where the mean and variance are computed for a specific patient from the same acquisition plane and from all available images in the same cardiac phase (ED, ES). This normalization helps to restrict the effect of outliers. With batch normalization, we additionally normalize the inputs (activations coming from the previous layer) going into each layer using the mean and variance of the activations over the entire mini-batch.
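A small sketch of the per-patient input normalization described above follows: mean and variance are computed over all slices of one patient from the same acquisition plane and cardiac phase. The array shape and helper name are illustrative.

```python
import numpy as np

def normalize_patient(slices, eps=1e-8):
    # slices: (num_slices, H, W) array for one patient, one plane, one phase
    mean = slices.mean()
    std = slices.std()
    return (slices - mean) / (std + eps)

phase_stack = np.random.rand(10, 256, 256).astype(np.float32)
normalized = normalize_patient(phase_stack)
print(normalized.mean(), normalized.std())   # ~0 and ~1
```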


Table 3  Dice scores for different losses, evaluated on the ACDC benchmark for the segmentation of cardiac MR images.

Methods                         Phase   Left ventricle   Right ventricle   Myocardium
RNN-GAN (adv + ℓ1 + acc + CL)   ED      0.968            0.940             0.933
                                ES      0.951            0.919             0.925
RNN-GAN (adv + CL)              ED      0.965            0.938             0.933
                                ES      0.950            0.917             0.921
RNN-GAN (adv + ℓ1)              ED      0.961            0.931             0.927
                                ES      0.949            0.913             0.917
RNN-GAN (adv + acc)             ED      0.952            0.94              0.929
                                ES      0.946            0.907             0.913
cGAN (adv)                      ED      0.934            0.906             0.899
                                ES      0.918            0.874             0.870

The best performance is achieved when RNN-GAN is trained with complementary labels (CL) in addition to the ℓ1 and accuracy (acc) losses.

Let us mention that Wolterink's method (using an ensemble of six trained CNNs) took 4 seconds to compute the prediction masks per patient on a system equipped with an NVIDIA TITAN X GPU on the ACDC benchmark, as reported in [46], while RNN-GAN took 500 ms on average per patient on a system equipped with a single NVIDIA TITAN X GPU.

4.3 Evaluation criteria

The evaluation and comparison were performed using the quality metrics introduced by each challenge organizer. Semantic segmentation masks were evaluated in a five-fold cross-validation. For each patient, corresponding images for the end-diastolic (ED) instant and for the end-systolic (ES) instant were provided. As described by ACDC-2017, the cardiac regions are defined by labels 1, 2 and 3, representing the right ventricle, myocardium and left ventricle, respectively. In order to standardize the computation of the different error measures, the Python scripts for the Dice coefficient (7) and the Hausdorff distance (8) were obtained from the ACDC organizers for all participants.

The average boundary distance (ADB), in addition to Dice and Hausdorff, is considered for evaluating the blood pool and myocardium in HVSMR-2016 and, similarly, for validating the liver lesion segmentation on LiTS-2017. Besides these parameters, we calculated sensitivity and specificity, since they are good indicators of the misclassification rate (false positives and false negatives) (see Tables 5 and 6).

\mathrm{Dice}(P, T) \leftarrow \frac{|P \wedge T|}{(|P| + |T|)/2} \qquad (7)

\mathrm{Haus}(P, T) \leftarrow \max\left\{ \sup_{p \in P} \inf_{t \in T} d(p, t),\ \sup_{t \in T} \inf_{p \in P} d(t, p) \right\} \qquad (8)

where P and T indicate the output predicted by our proposed method and the ground truth annotated by a medical expert, respectively.
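For reference, an illustrative NumPy implementation of Eqs. (7) and (8) for binary masks is given below. The Hausdorff distance here is the exact symmetric definition computed by brute force, which is fine for small masks but slow for full volumes; it is a sketch, not the challenge's evaluation script.

```python
import numpy as np

def dice(pred, target):
    pred, target = pred.astype(bool), target.astype(bool)
    return np.logical_and(pred, target).sum() / ((pred.sum() + target.sum()) / 2.0)

def hausdorff(pred, target):
    p = np.argwhere(pred)          # coordinates of foreground pixels
    t = np.argwhere(target)
    d = np.sqrt(((p[:, None, :] - t[None, :, :]) ** 2).sum(-1))  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

a = np.zeros((8, 8), int); a[2:5, 2:5] = 1
b = np.zeros((8, 8), int); b[3:6, 3:6] = 1
print(dice(a, b), hausdorff(a, b))
```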

4.4 Comparison with related methods and discussion

As shown in Table 1, our method outperforms other top-ranked approaches from the ACDC benchmark. Based on Table 1, in terms of the Dice coefficient our method achieved slightly better results than Wolterink et al. [46] on the ACDC challenge for left ventricle and myocardium segmentation. However, Rohe et al. [37] achieved outstanding performance for right ventricle segmentation, since they applied multi-atlas registration and segmentation at the same time. Poudel et al. [34] achieved competitive results on left ventricle segmentation, with an overall Dice of 0.93, based on recurrent fully convolutional networks.

Fig. 6 The ACDC 2017 challenge results using the RNN-GAN and cGAN architectures. The left sub-figure shows the Dice coefficient in the two cardiac phases; its y-axis shows the Dice metric and its x-axis the segmentation performance of cGAN and RNN-GAN in the ED and ES cardiac phases. In each sub-figure, the mean is presented in red. The y-axis of sub-figure (b) codes the Hausdorff distance in mm, with the x-axis again presenting the segmentation performance of cGAN and RNN-GAN in the ED and ES cardiac phases.

Based on Tables 1 and 2, the right ventricle is a difficult organ for all participants, mainly because of its complicated shape, the partial volume effect close to the free wall, and intensity inhomogeneity. Our achieved accuracy in terms of the Hausdorff distance is on average 1.2 ± 0.2 mm lower than that of the other participants. This is a strong indicator of boundary precision, suggesting that the RNN-GAN architecture equipped with bidirectional LSTM units is a suitable solution for capturing the temporal consistency between slices. Compared to cGAN (Tables 1 and 2), RNN-GAN provides better results when the network is trained with the complementary segmentation masks, including better sensitivity and precision.

Table 4  Comparison of segmentation results on the HVSMR dataset in terms of the Dice metric and average boundary distance with other participants; the best performance in each metric is shown in bold.

Methods                 Dice1   Dice2   ADB1   ADB2
RNN-GAN                 0.86    0.94    0.92   0.84
cGAN                    0.74    0.91    1.19   1.07
Yu et al. [49]          0.84    0.93    0.99   0.86
Wolterink et al. [45]   0.80    0.93    0.89   0.96
Shahzad et al. [40]     0.75    0.89    1.10   1.15
U-Net [38]              0.68    0.81    2.04   1.82

For all columns, index 1 is the myocardium and index 2 the blood pool.

Compared to the expert-annotated references on the original ED phase instants, individual Dice scores of 0.968 for the left ventricle (LV), 0.933 for the myocardium (MYO), and 0.940 for the right ventricle (RV) (see Table 1) were achieved at test time on 25 patients. Qualitatively, the RNN-GAN segmentation results are promising (see Figs. 5 and 7), where we can see robust and smooth boundaries for all substructures.

We report the effect of different losses for RNN-GAN in Table 3. As expected, the best performance is obtained when the network is trained by mixing categorical cross-entropy (as the adversarial loss) with ℓ1 and categorical accuracy. Using an ℓ1 loss encourages the output to respect the input, since the ℓ1 loss penalizes the distance between the ground-truth outputs, which match the input, and the synthesized outputs. Using categorical accuracy forces the network to assign a higher cost to the less represented set of objects, boosting their importance during the learning process.

As depicted in Fig. 5 and Table 1, the right ventricle is a complex organ to segment; most failures happened in the systolic phase. Based on Fig. 5 and the accuracy achieved at test time on the ACDC benchmark, we observe that the average results in the diastolic phase (first and second rows) are better than the average results in the systolic phase (third and fourth rows). We evaluated the results quantitatively using the Hausdorff distance and Dice, as shown in Fig. 6. As expected, the achieved Hausdorff distance for the left ventricle (median of 6.82/8.02 for the ED/ES frames) tends to be lower than for the two other regions of interest, with the myocardium at 8.08/8.69 and the right ventricle at 8.95/12.07 for ED/ES.

Table 5  Comparison of segmentation errors on the HVSMR dataset in terms of Hausdorff distance, sensitivity, and specificity with other participant approaches; the best performance in each metric is shown in bold.

Methods                 HD1    HD2    Sen1   Sen2   Spec1   Spec2
RNN-GAN                 5.84   6.35   0.89   0.92   0.97    0.99
cGAN                    6.79   9.2    0.82   0.88   0.94    0.99
Yu et al. [49]          6.41   7.03   −      −      −       −
Wolterink et al. [45]   6.13   7.07   −      −      −       −
Shahzad et al. [40]     6.05   7.49   −      −      −       −
U-Net [38]              8.86   11.2   0.78   0.74   0.91    0.99

For all columns, index 1 is the myocardium and index 2 the blood pool.

Fig. 7 The cardiac segmentation results at test time by RNN-GAN from the HVSMR 2016 benchmark. The top row shows the output predicted by RNN-GAN and the second row presents the corresponding ground truth annotated by a medical expert. The cyan contour describes the blood pool and the dark blue contour the myocardium region.

Based on Tables 4 and 5 and Fig. 7, the results show good agreement with the ground truth for the blood pool; the average value of the Dice index is around 0.94. The main source of error here is the inability of the method to completely segment all the great vessels, where the average Dice score is 0.86. Regarding the results in Tables 4 and 5, comparing the first and second rows, the achieved accuracy is better when the conditional GAN is equipped with bidirectional LSTM units. This architecture provides a better representation of features by capturing spatio-temporal information in forward and backward dependencies. In this context, Poudel et al. [34] designed unidirectional LSTMs on top of a UNet architecture to capture inter- and intra-slice features and achieved competitive results for the segmentation of the left ventricle.

The qualitative results of liver tumour segmentation are presented in Fig. 8. Based on Fig. 8 and Table 6, RNN-GAN is able to detect the complex and heterogeneous structure of all lesions. The RNN-GAN architecture trained with complementary masks yielded better results and a better trade-off between Dice and sensitivity. The Dice score is a good measure for class imbalance, as it indicates the true positive rate while taking false negative and false positive pixels into account. The effect of class balancing can be seen by comparing the first and second rows of Table 6. As expected, the RNN-GAN trained with complementary segmentation labels in addition to binary segmentation masks computed more accurate results, with an average improvement of 3% in Dice and 6% in sensitivity.

Fig. 8 LiTS-2017 test results for liver tumour segmentation using RNN-GAN. The predicted liver tumour region is overlaid on the CT images in blue. Compared to the green contour annotated by a medical expert from the ground truth file, we achieved a Dice score of 0.83 and a sensitivity of 0.74.

Table 6  Quantitative segmentation results for liver lesion segmentation on the LiTS-2017 dataset.

Architecture           Dice   Sen    VOE   RVD   ASD    HD
RNN-GAN                0.83   0.74   14    −6    6.4    40.1
RNN-GAN *              0.80   0.68   20    −2    9.7    52.3
cGAN                   0.76   0.57   21    −1    10.8   87.1
UNet [8]               0.72   −      22    −3    9.5    165.7
ResNet+Fusion [6]      −      −      16    −6    5.3    48.3
H-Dense+UNet [17]      −      −      39    7.8   1.1    7.0
FCN [44]               −      −      35    12    1.0    7.0

The first and second rows show the accuracy achieved for the task of liver lesion segmentation when our network was trained with (RNN-GAN) and without (RNN-GAN *) complementary segmentation masks, respectively.

We compared the results predicted by RNN-GAN at test time with other top-ranked and related approaches on LiTS-2017 in terms of the volume overlap error (VOE), relative volume difference (RVD), average symmetric surface distance (ASD), and maximum surface distance or Hausdorff distance (HD), as introduced by the challenge organizer. As the results in Table 6 show, cascaded UNet [8] or ensemble network [6, 17] architectures achieved better performance than training only a fully convolutional neural network (FCN) [44]. In contrast to prior work such as [6, 8, 17], our proposed method can be generalized to segment very small lesions as well as multiple organs in medical data of different modalities.

5 Conclusion

In this paper, we introduced a new deep architecture to mitigate the issue of imbalanced pixel labels in the task of medical image segmentation. To this end, we developed a recurrent generative adversarial architecture named RNN-GAN, which consists of two networks: a recurrent generator and a recurrent discriminator. To mitigate imbalanced pixel labels, we mixed the adversarial loss with a categorical accuracy loss and trained the RNN-GAN with ordinary and complementary masks. Moreover, we analyzed the effects of different losses and architectural choices that help to improve semantic segmentation results. Our proposed method shows outstanding results for the segmentation of anatomical regions (i.e. cardiac image semantic segmentation). Based on the segmentation results on two cardiac benchmarks, RNN-GAN is robust against slice misalignment and different CMRI protocols. The experimental results reveal that our method produces an average Dice score of 0.95. Given the high accuracy and fast processing speed, we think it has the potential to be used in routine clinical tasks. We also validated RNN-GAN on tumor segmentation based on abdominal CT images and achieved competitive results on the LiTS benchmark.

The impact of learning from complementary labels with different imbalance ratios may also be useful in the context of semantic segmentation; we will investigate this issue in the future. In terms of applications, we plan to investigate the potential of the RNN-GAN network for learning multiple clinical tasks, such as disease classification and semantic segmentation.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org

2. Afshin M, Ayed IB, Punithakumar K, Law M, Islam A, Goela A, Peters T, Li S (2014) Regional assessment of cardiac left ventricular myocardial function via mri statistical features. IEEE Trans Med Imaging 33(2):481–494

3. Avola D, Cinque L (2008) Encephalic nmr image analysis by textural interpretation. In: Proceedings of the 2008 ACM symposium on applied computing, pp 1338–1342. ACM

4. Avola D, Cinque L, Di Girolamo M (2011) A novel t-cad framework to support medical image analysis and reconstruction. In: International conference on image analysis and processing, pp 414–423. Springer

5. Bernard O, Lalande A, Zotti C, Cervenansky F, Yang X, Heng PA, Cetin I, Lekadir K, Camara O, Ballester MAG et al (2018) Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Transactions on Medical Imaging

6. Bi L, Kim J, Kumar A, Feng D (2017) Automatic liver lesion detection using cascaded deep residual networks. arXiv:1704.02703

7. Chollet F et al (2015) Keras

8. Christ PF, Ettlinger F, Grun F, Elshaer MEA, Lipkova J, Schlecht S, Ahmaddy F, Tatavarty S, Bickel M, Bilic P, Rempfler M, Hofmann F, D'Anastasi M, Ahmadi S, Kaissis G, Holch J, Sommer WH, Braren R, Heinemann V, Menze BH (2017) Automatic liver and tumor segmentation of CT and MRI volumes using cascaded fully convolutional neural networks. arXiv:1702.05970

9. Ciecholewski M (2011) Support vector machine approach to cardiac spect diagnosis. In: International workshop on combinatorial image analysis, pp 432–443. Springer

10. Douzas G, Bacao F (2018) Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst Appl 91:464–471

11. Drozdzal M, Chartrand G, Vorontsov E, Shakeri M, Di Jorio L, Tang A, Romero A, Bengio Y, Pal C, Kadoury S (2018) Learning normalized inputs for iterative estimation in medical image segmentation. Med Image Anal 44:1–13

12. Eslami A, Karamalis A, Katouzian A, Navab N (2013) Segmentation by retrieval with guided random walks: application to left ventricle segmentation in mri. Med Image Anal 17(2):236–253

13. Fidon L, Li W, Garcia-Peraza-Herrera LC, Ekanayake J, Kitchen N, Ourselin S, Vercauteren T (2017) Generalised wasserstein dice score for imbalanced multi-class segmentation using holistic convolutional networks. In: International MICCAI Brainlesion workshop, pp 64–76. Springer

14. Fischl B, Salat DH, Van Der Kouwe AJ, Makris N, Segonne F, Quinn BT, Dale AM (2004) Sequence-independent segmentation of magnetic resonance images. Neuroimage 23:S69–S84

15. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial networks. ArXiv e-prints

16. Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Netw 18(5-6):602–610

17. Han X (2017) Automatic liver lesion segmentation using a deep convolutional neural network method. arXiv:1704.07239

18. Hashemi SR, Salehi SSM, Erdogmus D, Prabhu SP, Warfield SK, Gholipour A (2018) Tversky as a loss function for highly unbalanced image segmentation using 3d fully convolutional deep networks. arXiv:1803.11078

19. Inda Maria-del-Mar RB, Seoane J (2014) Glioblastoma multiforme: A look inside its heterogeneous nature. In: Cancer archive 226–239

Multimedia Tools and Applications

20. Isensee F, Jaeger PF, Full PM, Wolf I, Engelhardt S, Maier-Hein KH (2017) Automatic cardiac dis-ease assessment on cine-mri via time-series segmentation and domain specific features. In: Internationalworkshop on statistical atlases and computational models of the heart, pp 120–129. Springer

21. Ishida T, Niu G, Hu W, Sugiyama M (2017) Learning from complementary labels. In: Advances inneural information processing systems, pp 5639–5649

22. Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarialnetworks. In: The IEEE conference on computer vision and pattern recognition (CVPR)

23. Jang J, Eo T, Kim M, Choi N, Han D, Kim D, Hwang D (2014) Medical image match-ing using variable randomized undersampling probability pattern in data acquisition. In: 2014international conference on electronics, information and communications (ICEIC), pp 1–2.https://doi.org/10.1109/ELINFOCOM.2014.6914453

24. Kaur R, Juneja M, Mandal A (2018) A comprehensive review of denoising techniques for abdominal ctimages. Multimedia Tools and Applications pp 1–36

25. Kohl S, Bonekamp D, Schlemmer H, Yaqubi K, Hohenfellner M, Hadaschik B, Radtke J, Maier-HeinKH (2017) Adversarial networks for the detection of aggressive prostate cancer. arXiv:1702.08014

26. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–44427. Mahapatra D (2014) Automatic cardiac segmentation using semantic information from random forests.

J Digit Imaging 27(6):794–80428. Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv:1411.178429. Moeskops P, Veta M, Lafarge MW, Eppenhof KAJ, Pluim JPW (2017) Adversarial training and dilated

convolutions for brain MRI segmentation. arXiv:1707.0319530. Nasr GE, Badr E, Joun C (2002) Cross entropy error function in neural networks: Forecasting gasoline

demand. In: FLAIRS conference, pp 381–38431. Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA (2016) Context encoders: Feature learning

by inpainting. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp2536–2544

32. Peng P, Lekadir K, Gooya A, Shao L, Petersen SE, Frangi AF (2016) A review of heart chamber seg-mentation for structural and functional analysis using cardiac magnetic resonance imaging. Magn ResonMater Phys, Biol Med 29(2):155–195

33. Pohl KM, Fisher J, Grimson WEL, Kikinis R, Wells WM (2006) A bayesian model for joint segmentationand registration. Neuroimage 31(1):228–239

34. Poudel RP, Lamata P, Montana G (2016) Recurrent fully convolutional neural networks for multi-slicemri cardiac segmentation. In: Reconstruction, segmentation, and analysis of medical images, pp 83–94.Springer

35. Prabhu V, Kuppusamy P, Karthikeyan A, Varatharajan R (2018) Evaluation and analysis of data drivenin expectation maximization segmentation through various initialization techniques in medical images.Multimed Tools Appl 77(8):10375–10390

36. Qiu Q, Song Z (2018) A nonuniform weighted loss function for imbalanced image classification. In:Proceedings of the 2018 international conference on image and graphics processing, pp 78–82. ACM

37. Rohe MM, Sermesant M, Pennec X (2017) Automatic multi-atlas segmentation of myocardium withsvf-net. In: Statistical atlases and computational modeling of the heart (STACOM) workshop

38. Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmen-tation. In: International conference on medical image computing and computer-assisted intervention, pp234–241. Springer International Publishing

39. Rota Bulo S, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmenta-tion. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2126–2135

40. Shahzad R, Gao S, Tao Q, Dzyubachyk O, van der Geest R (2016) Automated cardiovascular segmenta-tion in patients with congenital heart disease from 3d cmr scans: combining multi-atlases and level-sets.In: Reconstruction, segmentation, and analysis of medical images, pp 147–155

41. Sudre CH, Li W, Vercauteren T, Ourselin S, Cardoso MJ (2017) Generalised dice overlap as a deeplearning loss function for highly unbalanced segmentations. In: Deep learning in medical image analysisand multimodal learning for clinical decision support, pp 240–248. Springer

42. Tustison NJ, Avants BB, Cook PA, Zheng Y, Egan A, Yushkevich PA, Gee JC (2010) N4itk: improvedn3 bias correction. IEEE Trans Med Imaging 29(6):1310–1320

43. Vorontsov E, Tang A, Pal C, Kadoury S (2018) Liver lesion segmentation informed by joint liversegmentation. In: 15th IEEE international symposium on biomedical imaging (ISBI 2018), pp 1332–1335

44. Vorontsov E, Tang A, Pal C, Kadoury S (2018) Liver lesion segmentation informed by joint liver seg-mentation. In: 15th IEEE international symposium on biomedical imaging (ISBI 2018), pp 1332–1335

Multimedia Tools and Applications

45. Wolterink JM, Leiner T, Viergever MA, Isgum I (2016) Dilated convolutional neural networks for car-diovascular mr segmentation in congenital heart disease. In: Reconstruction, segmentation, and analysisof medical images, pp 95–102. Springer

46. Wolterink JM, Leiner T, Viergever MA, Isgum I (2017) Automatic segmentation and disease classifica-tion using cardiac cine mr images. arXiv:1708.01141

47. Xu J, Schwing AG, Urtasun R (2014) Tell me what you see and i will show you where it is. In:Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3190–3197

48. Xue Y, Xu T, Zhang H, Long LR, Huang X (2017) Segan: Adversarial network with multi-scalel1 lossfor medical image segmentation. arXiv:1706.01805

49. Yu L, Yang X, Qin J, Heng PA (2016) 3d fractalnet: dense volumetric segmentation for cardiovascularmri volumes. In: Reconstruction, segmentation, and analysis of medical images, pp 103–110. Springer

50. Yu X, Liu T, Gong M, Tao D (2018) Learning with biased complementary labels. In: The europeanconference on computer vision (ECCV)

51. Zhang YD, Muhammad K, Tang C (2018) Twelve-layer deep convolutional neural network with stochas-tic pooling for tea category classification on gpu platform. Multimedia Tools and Applications pp1–19

52. Zhang YD, Zhao G, Sun J, Wu X, Wang ZH, Liu HM, Govindaraj VV, Zhan T, Li J (2017) Smartpathological brain detection by synthetic minority oversampling technique, extreme learning machine,and jaya algorithm. Multimedia Tools and Applications pp 1–20

53. Zhou Y, Berg TL (2016) Learning temporal transformations from time-lapse videos. In: Europeanconference on computer vision, pp 262–277

54. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistentadversarial networks. In: The IEEE international conference on computer vision (ICCV)

55. Zhu W, Xie X (2016) Adversarial deep structural networks for mammographic mass segmentation.arXiv:1612.05970

56. Zotti C, Luo Z, Humbert O, Lalande A, Jodoin PM (2017) Gridnet with automatic shape prior registrationfor automatic mri cardiac segmentation. arXiv:1705.08943

Mina Rezaei is currently a Ph.D. student at the Chair of Internet Technologies and Systems, Hasso Plattner Institute (HPI), University of Potsdam, Germany. Prior to HPI, she received her master's degree in artificial intelligence from Shiraz University in 2013 and her bachelor's degree in software engineering from Arak University in 2008. She worked for more than five years as a software developer at the Statistical Center of Iran. In 2013 she made a research visit to the Dept. of CAMP, Technical University of Munich, Germany, and in 2017 she completed short-term research visits to the Dept. of CS, University of Cape Town, South Africa, and to Nanjing University, China. Her research interests include deep learning, generative models, learning from imbalanced data, and medical image analysis.


Haojin Yang received the Diploma Engineering degree from the Technical University Ilmenau, Germany, in 2008. In 2013, he received his doctorate from the Hasso Plattner Institute for IT-Systems Engineering (HPI) at the University of Potsdam, Germany. His current research interests revolve around multimedia analysis, information retrieval, deep learning technologies, computer vision, and content-based video search technologies.

Christoph Meinel studied mathematics and computer science at Humboldt University in Berlin. He received his doctorate in 1981 and was habilitated in 1988. After visiting positions at the University of Paderborn and the Max Planck Institute for Computer Science in Saarbrücken, he became a full professor of computer science at the University of Trier. He is now the president and CEO of the Hasso Plattner Institute for IT-Systems Engineering at the University of Potsdam, where he is a full professor of computer science holding the chair of Internet Technologies and Systems. He is a member of acatech, the German National Academy of Science and Engineering, and of numerous scientific committees and supervisory boards. His research focuses on IT-security engineering, tele-teaching, telemedicine, and multimedia retrieval. He has published more than 500 papers in high-profile scientific journals and at international conferences.


Affiliations

Mina Rezaei1 · Haojin Yang1 · Christoph Meinel1

Haojin Yang
[email protected]

Christoph Meinel
[email protected]

1 Hasso Plattner Institute, Prof. Dr. Helmert Street 2-3, Potsdam, Germany


11

Discussion

In the previous chapters, representative papers of my research work have been presented. Below, I give a short discussion of the achievements with respect to the research questions defined in section 1.1.1.

• Q1: DL is data hungry; how can we alleviate the reliance on substantial data annotations? Through synthetic data, and/or through unsupervised and semi-supervised learning methods?

Q2: How can we perform multiple computer vision tasks with a uniform end-to-end neural network architecture?

Discussion: In [YWBM16], we developed SceneTextReg, a real-time scene text recognition system that takes advantage of both classical computer vision techniques (e.g., the MSER detector) and highly accurate deep learning models. To address the lack of training data, we developed a synthetic data engine which can produce large amounts of text images with a broad range of variation. We achieved accuracy similar to that of a method trained on large-scale real-world samples. Through this paper we therefore show that sufficient model accuracy can be obtained with a carefully implemented synthetic data engine. Given this advantage, synthetic data engines have become one of the most efficient solutions for training DL models; a minimal sketch of such a rendering pipeline is given at the end of this discussion point.


As mentioned in our paper, synthetic data generation works for some use cases, such as image text recognition, object detection, and segmentation, but it does not work for many other tasks, for which a highly accurate data generator is hard to obtain. Therefore, semi-supervised as well as unsupervised methods play a more crucial role in those use cases.

In the context of scene text recognition, we considered two possible research ideas. First, are we able to integrate the detection and recognition tasks into a uniform neural network and optimize the whole network end-to-end? It is obviously feasible to accomplish this with so-called multi-task learning techniques in a fully supervised manner. However, are we able to solve the problem in a semi-supervised way? And if solving the whole task with unsupervised or semi-supervised methods is too complicated, could we at least solve the intermediate task in a semi-supervised way? This is the motivation of our work SEE [BYM18]. To our knowledge, SEE is the first work that tries to solve the text detection and recognition task with one end-to-end neural network while training the text detection part using only the weak supervision signal delivered by the text recognition part. The idea is inspired by the human vision system: when we teach children to recognize an object, we never provide fully supervised information such as its exact location and bounding-box size; the human vision system learns to find the target objects from weak supervision signals such as context or attention information. In this work, we successfully trained deep models for end-to-end scene text recognition and focused scene word recognition, and achieved state-of-the-art results on different benchmark datasets. Every approach has its own shortcomings, and so does ours; the current limitations as well as future research directions are discussed in section 12.1.
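
To make the synthetic data engine idea referenced above concrete, the following is a minimal sketch of such a text-image rendering pipeline, not the actual engine from [YWBM16]. It assumes Pillow is installed; the word list and the font path are placeholder assumptions, and a real engine would additionally vary blending, geometric distortion, noise, and background imagery.

import random
from PIL import Image, ImageDraw, ImageFont

WORDS = ["exit", "coffee", "Berlin", "sale", "42"]   # toy lexicon (assumption)
FONT_PATHS = ["DejaVuSans.ttf"]                      # assumed TrueType font files

def render_word(word, font_path, size=(256, 64)):
    """Render one word with random colours, font size and placement."""
    background = tuple(random.randint(0, 255) for _ in range(3))
    foreground = tuple(random.randint(0, 255) for _ in range(3))
    image = Image.new("RGB", size, background)
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, random.randint(20, 40))
    draw.text((random.randint(0, 40), random.randint(0, 15)),
              word, fill=foreground, font=font)
    return image

def generate(n):
    """Yield (image, label) pairs; the labels come for free with the rendering."""
    for _ in range(n):
        word = random.choice(WORDS)
        yield render_word(word, random.choice(FONT_PATHS)), word

for i, (image, label) in enumerate(generate(3)):
    image.save("synthetic_{}_{}.png".format(i, label))

Because the annotation is produced together with the image, such a pipeline can generate arbitrarily large labeled training sets at no labeling cost, which is exactly the property exploited above.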

• Q3: How can we apply DL models on low-power devices such as smartphones, embedded devices, wearables, and IoT devices?

Discussion: State-of-the-art deep models are computationally expensive and require large amounts of storage. At the same time, DL is in strong demand for numerous applications on mobile platforms, wearable devices, autonomous robots, and IoT devices. How to efficiently apply deep models on such low-power devices thus becomes a challenging research problem. In this thesis, we presented two works that address this issue by following the recently introduced Binary Neural Networks (BNNs).

We developed BMXNet and published it as open-source software from which both the research community and industry can benefit. We conducted an extensive study on the training strategy and execution efficiency of BNNs. The results show that BMXNet achieves excellent performance regarding inference speed and memory usage, and that we can easily reduce the model size by a large compression ratio (a theoretical compression rate of 32×); a minimal sketch of the underlying weight binarization is given at the end of this discussion point.

After successfully building this foundation framework, we aimed to further address the general accuracy issue of BNNs. We therefore systematically evaluated different network architectures and hyperparameters to provide useful insights on how to train a BNN, which can benefit further research. We derived the following insights about binary neural networks which had not been reported in previous work:

– We evaluated the importance of removing the bottleneck design.

– Increasing the number of shortcut connections increases accuracy and reduces model size.

– Changing the clipping threshold can have a significant influence on training.

– Increasing the bandwidth of the information flow appears to be one of the most important factors for accuracy gains.

We reported meaningful scientific insights and made our models and code publicly available, which can serve as a solid foundation for future research. The biggest remaining issue of BNNs is the large accuracy gap to their full-precision counterparts. Our idea for further enhancing the information flow is described in the future work section.
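
To make the 32× figure and the clipping threshold mentioned above concrete, the following is a minimal NumPy sketch of deterministic weight binarization with a channel-wise scaling factor, in the spirit of XNOR-Net-style BNNs; it is not the BMXNet implementation, and the bit counting below ignores the word-level packing a real runtime would use.

import numpy as np

def binarize_weights(w):
    """Approximate a float weight tensor w (out_channels, ...) by
    sign(w) * alpha, with one scaling factor alpha per output channel."""
    flat = w.reshape(w.shape[0], -1)
    alpha = np.abs(flat).mean(axis=1)          # channel-wise scaling factors
    w_bin = np.sign(w)
    w_bin[w_bin == 0] = 1.0                    # map exact zeros to +1
    return w_bin, alpha

def ste_backward(x, grad_out, clip=1.0):
    """Straight-through estimator used during training: the gradient is passed
    through sign() unchanged but zeroed outside [-clip, clip]; 'clip' is the
    clipping threshold referred to in the insights above."""
    return grad_out * (np.abs(x) <= clip)

w = np.random.randn(256, 256, 3, 3).astype(np.float32)
w_bin, alpha = binarize_weights(w)
dense_bits = w.size * 32                       # full-precision storage
binary_bits = w.size * 1 + alpha.size * 32     # 1 bit per weight plus scales
print(dense_bits / binary_bits)                # roughly 31.6 for this layer

For such a layer the ratio approaches the theoretical 32× bound because the per-channel scaling factors are negligible compared with the number of weights; an actual inference engine additionally packs the bits into machine words so that convolutions can be computed with XNOR and popcount operations.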

• Q4: Can DL models address multimodal and cross-modal representation learning tasks?


Discussion: For this research question, we worked on two sub-topics: visual-textual feature fusion for multimodal and cross-modal retrieval [WYM16a], and visual-language feature learning with image captioning as its use case [WYBM16, WYM18].

For the former, we proposed RE-DNN, a hybrid deep model which captures the correlations between paired images and texts. In the image retrieval task, the image representations used in previous work are handcrafted features such as SIFT, GIST, and PHOW. We instead propose to use high-level CNN features extracted from an AlexNet model trained on ImageNet. Subsequently, visual and textual features are fused in a supervised manner (a minimal sketch of this kind of fusion is given at the end of this discussion point). Overall, we achieved state-of-the-art results on two evaluation datasets. Another advantage of RE-DNN is that it handles the missing-modality problem robustly, whereas most comparable approaches have difficulty processing unpaired data. In our paper, we show that RE-DNN solves this problem reliably and outperforms the alternative approaches, achieving state-of-the-art performance on multimodal image retrieval and cross-modal retrieval tasks.

For the latter, the image captioning task, we developed an end-to-end trainable deep bidirectional LSTM network to capture the semantic correspondences between images and their caption sentences. We studied several architectural designs and investigated the corresponding activation visualizations, which improved our understanding of how fused representations are learned when both visual image features and sentence features are fed into one LSTM network. The effectiveness and generalization ability of the proposed model were evaluated on mainstream benchmark datasets, including Flickr8K [RYHH10], Flickr30K [YLHH14], MSCOCO [LMB+14], and Pascal1K [RYHH10]. The experimental results show that our models outperformed related work on almost all image captioning and image-sentence retrieval tasks. However, even though we achieved very promising results on multiple benchmark datasets, we are still far away from completely solving the multimodal representation learning task: the academic datasets have many limitations and are far from comprehensively approximating the real-world data distribution. Therefore, many challenges still need to be addressed in future work.
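
As an illustration of the supervised visual-textual fusion referenced in the RE-DNN paragraph above, the following toy PyTorch sketch projects pre-extracted image and text features into a joint space, fuses them by concatenation, and trains the fused representation with a semantic classification loss. It is not the actual RE-DNN architecture; the feature dimensions, depths, and the zero-filling strategy for a missing modality are illustrative assumptions.

import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Toy supervised fusion: project each modality, concatenate, classify.
    Dimensions are illustrative (e.g. 4096-d CNN features such as AlexNet fc7,
    2000-d text features, a 512-d joint space, 10 semantic categories)."""

    def __init__(self, img_dim=4096, txt_dim=2000, joint_dim=512, n_classes=10):
        super().__init__()
        self.joint_dim = joint_dim
        self.img_branch = nn.Sequential(nn.Linear(img_dim, joint_dim), nn.ReLU())
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, joint_dim), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.Linear(2 * joint_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, n_classes),
        )

    def forward(self, img_feat=None, txt_feat=None):
        # A missing modality is replaced by zeros, so unpaired inputs can
        # still be pushed through the shared classifier.
        ref = img_feat if img_feat is not None else txt_feat
        zeros = torch.zeros(ref.size(0), self.joint_dim, device=ref.device)
        zi = self.img_branch(img_feat) if img_feat is not None else zeros
        zt = self.txt_branch(txt_feat) if txt_feat is not None else zeros
        return self.classifier(torch.cat([zi, zt], dim=1))

model = LateFusionNet()
logits = model(torch.randn(8, 4096), torch.randn(8, 2000))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
loss.backward()

In a retrieval setting, the activations of the joint hidden layer, rather than the class scores, would serve as the common embedding space in which images and texts are compared.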

• Q5: Can we effectively and efficiently apply multimedia analysis and DL algorithms in real-world applications?

Discussion: Regarding this question, we studied the feasibility and practicality of the developed techniques in two use cases: Automatic Online Lecture Analysis and Medical Image Segmentation.

In the first use case, we developed a solution based on the automatic analysis methods to highlight online lecture videos at different levels of granularity. We analyze a wide range of learning materials, including lecture speech, transcripts, and lecture slides both in file and in video format (the slide screen is captured during the presentation by a dedicated recording system). In this approach, we applied analytical methods as well as DL models to gather several different lecture insights, to create highlighting information based on them, and to further analyze learner behavior. From our user study, we found that the extracted highlighting information can assist learners, especially in the context of MOOCs. In the qualitative evaluation, our approach achieves satisfactory precision, outperforming the baseline methods, and it was also welcomed in the user feedback.

In the second use case, we investigated the practicability of state-of-the-art DL-based object detection and instance segmentation techniques for the medical image segmentation task. We proposed a novel end-to-end DL architecture to address brain tumor and liver tumor segmentation. The proposed architecture achieves promising results on popular medical image benchmark datasets and generalizes well to medical images of different types and varied sizes. The applied benchmark datasets include the BraTS 2017 dataset for brain tumor segmentation [20117a] (MRI images), the LiTS 2017 dataset for liver cancer segmentation [20117b] (Computed Tomography (CT) images), and the MDA231 dataset for microscopic cell segmentation [BE15]. Overall, the achieved results demonstrate a strong generalization ability of the proposed method for the medical image segmentation task.

From our experimental results, we can draw the preliminary conclusion that DL technologies have huge application potential in the field of medical image processing. DL models can achieve outstanding performance given sufficient, high-quality labeled training data. In the future, DL models could be an excellent aid to radiologists, effectively alleviating the shortage of experienced doctors and helping them to ensure the accuracy of diagnoses.


12

Conclusion

In this thesis, I have presented several previous as well as ongoing studies on deep representation learning using multimedia data. The research topics involved cover a broad range: scene text recognition, a typical computer vision problem, using fully supervised and semi-supervised DL methods; multimodal retrieval and multimodal feature fusion for image captioning; binary neural networks with BMXNet; and two application use cases, online lecture highlighting and medical image segmentation using DL technologies. Furthermore, I wrote a relatively comprehensive overview that summarizes the history, development, vision, and related technical fundamentals of DL techniques, in order to give readers a better understanding of the included manuscripts.

There is still considerable room for improvement in the current work, and several exciting research directions derived from it remain to be followed. Therefore, a comprehensive outlook on future work is provided in the next section. I will also discuss some of my opinions on the future development of DL technologies.

12.1 Future Work

In current scene text recognition research, most approaches focus on improving text localization and word recognition performance using fully supervised methods; less work addresses unsupervised or semi-supervised solutions. Our work SEE attempts to open up this more challenging but meaningful research direction. This goal motivated us to provide our code, trained models, and compiled datasets as open source to the research community, in order to encourage more researchers to join us in this direction.

However, it matches our intuition that a semi-supervised method such as SEE cannot yet offer a text localizer as strong as those based on fully supervised methods. How to further improve both localization recall and precision using weakly supervised methods remains an open research question. We have achieved some promising results in semi-supervised text localization on the FSNS dataset; however, the appearance of text in this dataset is somewhat monotonous. There is only a small variety of font sizes and font styles (due to the nature of this "street sign" dataset), and the degree of geometric distortion of the text is limited. Moreover, in the current state, our models are not fully capable of detecting scene text at arbitrary locations in the image, as we observed during our experiments with the FSNS dataset. Currently, our model is also constrained to a fixed maximum number of words that can be detected in one forward pass. In future work, we want to redesign the network so that it can determine the number of text lines in an image by itself. We therefore still need to prove our approach in more challenging scene text localization scenarios. A straightforward idea is to use an additional weak supervision signal to give more guidance to the localization network.

On the other hand, beyond the specific object type "text", we also aim to apply the idea of SEE to the general object detection task. Several challenging questions need to be answered in this direction: in order to measure the probability that a captured image region is an object, we need a new metric, an objectness score, but how do we define and evaluate the objectness of detected regions? Once we obtain regions with solid objectness scores, how can we assign the most probable classes to them? The recently proposed model distillation technique [HVD15] might be a good direction to follow and could be combined with the SEE network. Model distillation is an effective technique for transferring knowledge from a teacher to a student network; the typical application is to transfer from a powerful large network or an ensemble to a small network in order to meet low-memory or fast-execution requirements. The idea is to use the distillation method and a teacher model, trained on different object classes, to train a student model in a semi-supervised manner. The model should be guided to learn the latent correlation between two visually correlated objects. For instance, consider a football and a Pomeranian puppy of the same white color moving on a lawn: they are obviously two different objects, but seen from a distance they show similar visual characteristics to an object detector or object tracker. Whether we can exploit this visual correlation for developing a more general object detector is an exciting research problem. A minimal sketch of the distillation loss is given below.
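
For concreteness, the following is a minimal PyTorch sketch of the standard soft-target distillation loss of [HVD15], i.e. only the knowledge-transfer term itself; how such a loss would be coupled with a SEE-style localizer is exactly the open question raised above, and the class count and temperature below are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Soft-target distillation plus the usual hard-label loss.

    The temperature softens the teacher distribution so that its 'dark
    knowledge' about class similarities (e.g. football vs. white puppy)
    is exposed; alpha weights the knowledge-transfer term."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=1)
    log_soft_student = F.log_softmax(student_logits / t, dim=1)
    # The KL term is scaled by t^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# toy usage: 8 region proposals, 20 object classes
student_logits = torch.randn(8, 20, requires_grad=True)
teacher_logits = torch.randn(8, 20)
labels = torch.randint(0, 20, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()

In the semi-supervised setting sketched in the text, the hard-label term would only be available for the labeled subset, while the soft-target term could also be applied to unlabeled regions scored by the teacher.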

In BNN research, almost all recent methods focus on accuracy enhancement, because this drawback significantly limits the applicability of BNNs and needs particular attention. In our work [BYBM18, BYBM19], we systematically evaluated different network architectures and hyperparameters to provide useful insights on how to train a BNN. Building on that, our future work will also focus on reducing the precision gap between binary networks and their full-precision counterparts. Specifically, both the network architecture and the binary layer design should be improved. I will mainly follow two ideas: first, going beyond existing approaches, a more efficient method for approximating the full-precision weights and activations using scaling factors is urgently needed; second, we will try to find a better way to further enhance the capacity of the information flow, because our study showed that the information flow can significantly affect the optimization results. I believe that a better-optimized information flow path is one of the critical factors that can ease the training process and significantly improve the overall performance. The starting point here might be how to further improve the shortcut connections beyond DenseNet.

For multimodal representation learning, we proposed a bidirectional LSTM model that can generate caption sentences for an image by taking both past and future context into account. We studied several deep bidirectional LSTM architectures that embed images and sentences in a high-level semantic space for learning the visual-language model. In this work, we also showed that multi-task learning with bidirectional LSTMs is beneficial for increasing model generalization ability, which was further confirmed by our transfer learning experiments. We qualitatively visualized the internal states of the proposed model to understand how a bidirectional LSTM with multimodal information generates words at consecutive time steps. The robustness of the proposed models was evaluated on numerous datasets for two different tasks: image captioning and image-sentence retrieval.

As a future direction, we will focus on exploring more efficient language representations, e.g., word2vec [MSC+13b], and on incorporating an attention mechanism [VTBE15] into our model. Furthermore, multilingual caption generation is another interesting research problem to address. The proposed bidirectional models can also be applied to other sequence learning tasks such as text recognition and video captioning.

In the medical image segmentation paper, we successfully applied a novel deep architecture to the brain and liver tumor segmentation task and achieved promising results in several popular medical imaging challenges. As future work, we plan to cooperate with doctors and further evaluate the generalization ability of the developed framework on more medical image data in a clinical context. Moreover, we will investigate the applicability of the current model to learning multiple clinical tasks, such as disease diagnosis, besides semantic segmentation.

I briefly presented the basic idea and theoretical foundation of GANs in section 2.3.4.2. Although GANs have achieved state-of-the-art results on a large variety of unsupervised learning tasks, training them is considered highly unstable, very difficult, and sensitive to hyperparameters, all the while missing modes of the data distribution or even collapsing large amounts of probability mass onto a few modes. Successful GAN training usually requires a large amount of human and computing effort to fine-tune the hyperparameters in order to stabilize training and avoid mode collapse. People typically rely on their own experience and tend to publish hyperparameters and recipes rather than a systematic method for training GANs. In our recent work [MYM18], we extensively studied the mode-collapse problem of GANs and proposed to incorporate adversarial dropout into generative multi-adversarial networks. Our approach forces the single generator not to constrain its output to satisfy a single discriminator but, instead, to fulfill a dynamic ensemble of discriminators (a minimal sketch of this mechanism is given below). We show that this approach leads to a more generalized generator, promoting variety in the generated samples and avoiding the mode-collapse problem commonly experienced with GANs. In future work, we will apply the proposed approach to medical image segmentation as well as to language generation, to provide more evidence of its ability to eliminate mode collapse and stabilize training.
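
The snippet below is a simplified illustration of dropping out discriminators when computing the generator loss, not the full training procedure of [MYM18]; generator, discriminators, and g_optimizer are assumed to be ordinary PyTorch modules and an optimizer, and a non-saturating GAN loss is used for brevity.

import random
import torch
import torch.nn.functional as F

def generator_step(generator, discriminators, g_optimizer,
                   batch_size=64, z_dim=100, keep_prob=0.5):
    """One generator update against a randomly dropped-out discriminator set.

    The surviving subset changes from step to step, so the generator has to
    satisfy a dynamic ensemble instead of a single fixed critic."""
    z = torch.randn(batch_size, z_dim)
    fake = generator(z)

    # Drop each discriminator with probability 1 - keep_prob,
    # but always keep at least one so that the loss is defined.
    kept = [d for d in discriminators if random.random() < keep_prob]
    if not kept:
        kept = [random.choice(list(discriminators))]

    # Non-saturating loss: every kept critic should label the samples as real.
    losses = []
    for d in kept:
        score = d(fake)
        losses.append(F.binary_cross_entropy_with_logits(
            score, torch.ones_like(score)))
    g_loss = torch.stack(losses).mean()

    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()
    return g_loss.item()

In this sketch the discriminators are assumed to be updated separately in the usual way; only the feedback that reaches the generator is subsampled, which is what promotes sample variety and discourages collapsing onto the modes preferred by any single critic.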

12.2 Some Concerns about DL

Regarding future development, although DL has achieved many record-breaking results in a large variety of perception tasks, it is still far from capable of common-sense reasoning. Yann LeCun even said that it would be enough to meet his expectations if, within his lifetime, DL could reach the level of a mouse in common-sense reasoning. Some researchers, like Gary Marcus, believe that DL should learn more from how humans explore the cognitive world and apply more cognitive representations of objects, datasets, spaces, etc. But researchers from the DL camp, like LeCun, claim that DL does not have to simulate human cognitive behavior.

I personally, like many others, would like to replace the term Artificial Intelligence with Machine Intelligence. The reasoning is as follows: the steam engine freed up human strength, but it does not mimic human strength; cars run faster than humans, but they do not imitate human legs. Future intelligent machines may likewise free up some human brain power, but computers do not think like the human brain; machines should have their own way of thinking. Moreover, our understanding of the human brain itself is extremely limited. Human beings need to learn to respect machine intelligence technologies, and machines may well have their own unique thinking and logic. Intelligent machines should serve as assistants to humans rather than as a replacement for human beings.

Researchers like Ali Rahimi believe that many of the methods currently used in ML lack theoretical understanding, especially in the field of DL. Being able to understand is undoubtedly of great significance, and interpretable ML is one of the most important topics in the research community. But, meanwhile, we also have another important goal, which is to develop new technologies and applications. Here I am more inclined to LeCun's opinion: in the history of science and technology, engineering products have always preceded theoretical understanding. Lenses and telescopes came before optics theory, steam engines before thermodynamics, aircraft before flight aerodynamics, radio and data communication before information theory, and computers before computer science. The reason is that theoretical researchers spontaneously study the "simple" phenomena first; they only divert their attention to complex problems when those problems begin to have important practical implications.

If we want DL to have a longer-lasting and sustainable future, close collaboration between industry and academia is required. Generally, industry lacks the latest algorithms and the talent for algorithm engineering, while the academic community lacks large-scale datasets representing real-world problems as well as computing resources. Thus, good cooperation between these two groups can significantly promote overall development. On the other hand, we should be vigilant about excessive hype from the media, venture capitalists, and some startup companies, which may cause suspicion of and resentment towards the artificial intelligence industry in society at large.


PART III:

Appendices and References


Appendix A

Ph.D. Publications

• Ph.D. thesis: Haojin Yang, Automatic Video Indexing and Retrieval Us-

ing Video OCR Technology, Hasso-Plattner-Institute (HPI), Uni-Potsdam,

2013. Grade: “summa cum laude”1,2

• In Journals (3):

– Haojin Yang and Christoph Meinel, Content Based Lecture Video Re-

trieval Using Speech and Video Text Information. IEEE Transactions

on Learning Technologies (TLT), DOI: 10.1109/TLT.2014.2307305, on-

line ISSN: 1939-1382, pp. 142-154, volume 7, number 2, Publisher:

IEEE Computer Society and IEEE Education Society, April-June 2014

– Haojin Yang, Bernhard Quehl and Harald Sack, A Framework for Im-

proved Video Text Detection and Recognition. International Journal

of Multimedia Tools and Applications (MTAP), Print ISSN:1380-7501,

online ISSN:1573-7721, Publisher: Springer Netherlands, DOI:10.1007

/s11042-012-1250-6, 2012

– Haojin Yang, Harald Sack and Christoph Meinel, Lecture Video Index-

ing and Analysis Using Video OCR Technology. International Journal

of Multimedia Processing and Technologies (JMPT), Volume: 2, Is-

sue: 4, pp. 176-196, Print ISSN: 0976-4127, Online ISSN: 0976-4135,

December 2011
1https://de.wikipedia.org/wiki/Dissertation#Bewertungsstufen_einer_Dissertation
2https://de.wikipedia.org/wiki/Promotion_(Doktor)#Deutschland


• In Conferences (10):

– Haojin Yang, Franka Grünewald, Matthias Bauer and Christoph Meinel,

Lecture Video Browsing Using Multimodal Information Resources. 12th

International Conference on Web-based Learning (ICWL 2013), Octo-

ber 6-9, 2013, Kenting, Taiwan. Springer lecture notes

– Franka Grünewald, Haojin Yang, Elnaz Mazandarani, Matthias Bauer

and Christoph Meinel Next Generation Tele-Teaching: Latest Recording

Technology, User Engagement and Automatic Metadata Retrieval. In-

ternational Conference on Human Factors in Computing and Informat-

ics (southCHI), Lecture Notes in Computer Science (LNCS) Springer,

01–03 July, 2013 Maribor, Slovenia

– Haojin Yang, Christoph Oehlke and Christoph Meinel, An Automated

Analysis and Indexing Framework for Lecture Video Portal. 11th Inter-

national Conference on Web-based Learning (ICWL 2012), September

2-4, 2012, Sinaia, Romania. Springer lecture notes, pp. 285–294, Vol-

ume 7558, 2012 (best student paper award)

– Haojin Yang, Bernhard Quehl, Harald Sack, A skeleton based binariza-

tion approach for video text recognition. 13th International Workshop

on Image analysis for multimedia interactive services (WIAMIS 2012),

IEEE Press, pp. 1–4, Dublin Ireland, May. 23-25, 2012

– C. Hentschel, J. Hercher, M. Knuth, J. Osterhoff, B. Quehl, H. Sack,

N. Steinmetz, J. Waitelonis, H-J.Yang (Alphabetical order of author’s

name): Open Up Cultural Heritage in Video Archives with Mediaglobe.

12th International Conference on Innovative Internet Community Ser-

vices (I2CS 2012), Trondheim (Norway), June. 13-15, 2012 (best pa-

per award)

– Haojin Yang, Franka Grünewald and Christoph Meinel, Automated ex-

traction of lecture outlines from lecture videos: a hybrid solution for

lecture video indexing. 4th International Conference on Computer Sup-

ported Education (CSEDU 2012), SciTePress, Porto Portugal, pp. 13–

22, Publisher: SciTePress, April. 16-18, 2012


– Haojin Yang, Bernhard Quehl and Harald Sack, Text detection in video

images using adaptive edge detection and stroke width verification. 19th

International Conference on Systems, Signals and Image Processing

(IWSSIP 2012), IEEE Press, Vienna, Austria, pp. 9–12, April. 11-13,

2012

– Haojin Yang, Maria Siebert, Patrick Lühne, Harald Sack and Christoph

Meinel, Lecture Video Indexing and Analysis Using Video OCR Tech-

nology. 7th International Conference on Signal Image Technology and

Internet Based Systems (SITIS 2011), Track Internet Based Computing

and Systems, IEEE Press, Dijon (France), pp. 54–61, November. 28 -

December. 1, 2011

– Haojin Yang, Maria Siebert, Patrick Lühne, Harald Sack and Christoph

Meinel, Automatic Lecture Video Indexing Using Video OCR Technol-

ogy. IEEE International Symposium on Multimedia 2011 (ISM 2011),

IEEE Press, Dana Point, CA, USA, December. 5-7, 2011

– Haojin Yang, Christoph Oehlke and Christoph Meinel, A Solution for

German Speech Recognition for Analysis and Processing of Lecture Videos.

10th IEEE/ACIS International Conference on Computer and Informa-

tion Science (ICIS 2011) , IEEE Press, ISBN 9783642336423, pp. 285–

294, Sanya, Heinan Island, China, May. 2011


Appendix B

Publications After Ph.D.

• In Journals (5):

– Mina Rezaei, Haojin Yang and Christoph Meinel: Recurrent generative

adversarial network for learning imbalanced medical image semantic

segmentation. International Journal of Multimedia Tools and Appli-

cations (MTAP), Special Issue: “Deep Learning for Computer-aided

Medical Diagnosis”, http://dx.doi.org/10.1007/s11042-019-7305-1

Feb. 2019

– Cheng Wang, Haojin Yang and Christoph Meinel, Image Captioning

with Deep Bidirectional LSTMs and Multi-Task Learning. ACM Trans-

actions on Multimedia Computing, Communications, and Applications

(TOMM), Volume 14 Issue 2s, No. 40, May 2018

– Xiaoyin Che, Haojin Yang, Christoph Meinel, Automatic Online Lec-

ture Highlighting Based on Multimedia Analysis, IEEE Transactions on

Learning Technologies (TLT), Publisher: IEEE Computer Society and

IEEE Education Society, Volume: PP, Issue: 99, Print ISSN: 1939-

1382, 2017

– Cheng Wang, Haojin Yang and Christoph Meinel, A Deep Seman-

tic Framework for Multimodal Representation Learning, International

Journal of Multimedia Tools and Applications (MTAP), online ISSN:1573-

7721, Print ISSN:1380-7501, Special Issue: Representation Learning for

Multimedia Data Understanding, March 2016


– Xiaoyin Che, Haojin Yang, Christoph Meinel, The Automated Gen-

eration and Further Application of Tree-Structure Outline for Lecture

Videos with Synchronized Slides, International Journal of Technology

and Educational Marketing, Volume 4, Number 1, IGI Global, 2014

• In Conferences (>40):

– 2019

∗ Jonathan Sauder, Xiaoyin Che, Gonçalo Mordido, Ting Hu, Hao-

jin Yang and Christoph Meinel, Best Student Forcing: A Novel

Training Mechanism in Adversarial Language Generation, the 57th

Annual Meeting of the Association for Computational Linguistics

(ACL 2019) (under review)

∗ Joseph Bethge, Haojin Yang, Marvin Bornstein, Christoph Meinel,

Back to Simplicity: How to Train Accurate BNNs from Scratch?.

International Conference on Computer Vision (ICCV 2019) (under

review)

∗ Mina Rezaei, Haojin Yang, Christoph Meinel, Medical Image Se-

mantic Segmentation using Conditional Refinement Generative Ad-

versarial Networks. IEEE Winter Conference on Applications of

Computer Vision (WACV) IEEE, 2019

∗ Mina Rezaei, Haojin Yang, Christoph Meinel: Learning Imbalanced

Semantic Segmentation through Cross-Domain Relations of Multi-

Agent Generative Adversarial Networks. Accepted by SPIE Medi-

cal Imaging - Computer Aided Diagnosis (SPIE19)

– 2018

∗ Christian Bartz, Haojin Yang, Christoph Meinel, SEE: Towards

Semi-Supervised End-to-End Scene text Recognition, the Thirty-

Second AAAI Conference on Artificial Intelligence (AAAI-18), Febru-

ary 2–7, 2018, New Orleans, Louisiana, USA

∗ Joseph Bethge, Haojin Yang, Christian Bartz, Christoph Meinel

Learning to Train a Binary Neural Network. In: arXiv preprint

arXiv:1809.10463, 2018


∗ Joseph Bethge, Marvin Bornstein, Adrian Loy, Haojin Yang, Christoph

Meinel Training Competitive Binary Neural Networks from Scratch.

In: arXiv preprint arXiv:1812.01965, 2018

∗ Gonçalo Mordido, Haojin Yang and Christoph Meinel, Dropout-

GAN: Learning from a Dynamic Ensemble of Discriminators ACM

KDD’18 Deep Learning Day (KDD DLDay 2018), London UK,

2018

∗ Mina Rezaei, Haojin Yang and Christoph Meinel Instance Tumor

Segmentation using Multitask Convolutional Neural Network Inter-

national Joint Conference on Neural Networks (IJCNN) 2018

∗ Mina Rezaei, Haojin Yang, Christoph Meinel Whole Heart and

Great Vessel Segmentation with Context-aware of Generative Ad-

versarial Networks, Bildverarbeitung für die Medizin (BVM) 2018

∗ Christian Bartz, Haojin Yang and Christoph Meinel, LoANs: Weakly

Supervised Object Detection with Localizer Assessor Networks In-

ternational Workshop on Advanced Machine Vision for Real-life

and Industrially Relevant Applications (AMV’18), Perth Australia,

2018

∗ Mina Rezaei, Haojin Yang, Christoph Meinel: voxel-GAN: Adver-

sarial Framework for Learning Imbalanced Brain Tumor Segmen-

tation. BrainLes@MICCAI 2018

∗ Mina Rezaei, Haojin Yang and Christoph Meinel, Generative Ad-

versarial Framework for Learning Multiple Clinical Tasks. Digital

Image Computing: Techniques and Applications (DICTA 2018)

∗ Mina Rezaei, Haojin Yang, Christoph Meinel, ”Automatic Cardiac

MRI Segmentation via Context-aware Recurrent Generative Adver-

sarial Neural Network”, Computer Assisted Radiology and Surgery

(CARS18)

∗ Jonathan Sauder, Xiaoyin Che, Gonçalo Mordido, Haojin Yang and

Christoph Meinel. Pseudo-Ground-Truth Training for Adversarial

Text Generation with Reinforcement Learning. Deep Reinforce-

ment Learning Workshop at NeurIPS18


∗ Mina Rezaei, Haojin Yang, Christoph Meinel Recurrent Generative

Adversarial Network for Learning Multiple Clinical Tasks. Machine

Learning for Health Workshop at NeurIPS 2018 (ML4H)

– 2017

∗ Haojin Yang, Martin Fritzsche, Christian Bartz, Christoph Meinel,

BMXNet: An Open-Source Binary Neural Network Implementation

Based on MXNet ACM International Conference on Multimedia

(ACM MM), October 23-27, 2017, Mountain View, CA USA

∗ Christian Bartz, Haojin Yang, Christoph Meinel, STN-OCR: A

single Neural Network for Text Detection and Text Recognition,

arXiv:1707.08831v1 2017

∗ Xiaoyin Che, Nico Ring, Willi Raschkowski, Haojin Yang and Christoph

Meinel, Traversal-Free Word Vector Evaluation in Analogy Space,

RepEval workshop at EMNLP 17 (Empirical Methods in Natural

Language Processing), September 7–11, 2017, Copenhagen, Den-

mark

∗ Christian Bartz, Tom Herold, Haojin Yang and Christoph Meinel

Language Identification Using Deep Convolutional Recurrent Neu-

ral Networks, 24th International Conference on Neural Information

Processing (ICONIP 2017), November 14-18, 2017, Guangzhou,

China

∗ Mina Rezaei, Haojin Yang and Christoph Meinel Deep Neural Net-

work with l2-norm Unit for Brain Lesions Detection, 24th Inter-

national Conference on Neural Information Processing (ICONIP

2017), November 14-18, 2017, Guangzhou, China

∗ Xiaoyin Che, Nico Ring, Willi Raschkowski, Haojin Yang and Christoph

Meinel Automatic Lecture Subtitle Generation and How It Helps,

17th IEEE International Conference on Advanced Learning Tech-

nologies (ICALT 2017), July 3-7, 2017, Timisoara, Romania

– 2016


∗ Haojin Yang, Cheng Wang, Christian Bartz, Christoph Meinel Scene-

TextReg: A Real-Time Video OCR System, ACM international con-

ference on Multimedia (ACM MM 2016), system demonstration

session, 15-19 October 2016, Amsterdam, The Netherlands

∗ Cheng Wang, Haojin Yang, Christian Bartz, Christoph Meinel Im-

age Captioning with Deep Bidirectional LSTMs, ACM international

conference on Multimedia (ACM MM 2016), full paper (oral pre-

sentation), 15-19 October 2016, Amsterdam, The Netherlands

∗ Xiaoyin Che, Sheng Luo, Haojin Yang and Christoph Meinel, Sen-

tence Boundary Detection Based on Parallel Lexical and Acoustic

Models, INTERSPEECH 2016, San Francisco, California, USA in

September 8-12, 2016

∗ Cheng Wang, Haojin Yang and Christoph Meinel, Exploring Mul-

timodal Video Representation for Action Recognition, the annual

International Joint Conference on Neural Networks (IJCNN 2016),

Vancouver, Canada, July 24-29, 2016

∗ Haojin Yang, Real-Time Video OCR System, system demonstration

at 41st IEEE International Conference on Acoustics, Speech and

Signal Processing (ICASSP 2016), Show&Tell session, Shanghai

China, 20-25 March 2016

∗ Xiaoyin Che, Cheng Wang, Haojin Yang and Christoph Meinel,

Punctuation Prediction for Unsegmented Transcript Based on Word

Vector, the 10th International Conference on Language Resources

and Evaluation (LREC 2016), Portorož (Slovenia), 23-28 May 2016

∗ Sheng Luo, Haojin Yang, Cheng Wang, Xiaoyin Che, and Christoph

Meinel, Action Recognition in Surveillance Video Using ConvNets

and Motion History Image, International Conference on Artificial

Neural Networks (ICANN 2016), Barcelona Spain, 6th-9th of Septem-

ber 2016

∗ Sheng Luo, Haojin Yang, Cheng Wang, Xiaoyin Che and Christoph

Meinel, Real-time action recognition in surveillance videos using


ConvNets, in the 23rd International Conference on Neural Infor-

mation Processing (ICONIP 2016), in Kyoto (Japan), 16th-21th of

October 2016

∗ Hannes Rantzsch, Haojin Yang and Christoph Meinel Signature

Embedding: Writer Independent Offline Signature Verification with

Deep Metric Learning in 12th International Symposium on Visual

Computing (ISVC’16), Las Vegas USA, December 12-14, 2016

∗ Xiaoyin Che, Sheng Luo, Haojin Yang, Christoph Meinel Sentence-

Level Automatic Lecture Highlighting Based on Acoustic Analysis

16th IEEE International Conference on Computer and Information

Technology (IEEE CIT 2016), Shangri-La’s Fijian Resort, Fiji, 7-10

December 2016

∗ Xiaoyin Che, Thomas Staubitz, Haojin Yang and Christoph Meinel,

Pre-Course Key Segment Analysis of Online Lecture Videos, 16th

IEEE International Conference on Advancing Learning Technolo-

gies (ICALT-2016), Austin, Texas, USA, July 25-28, 2016

– 2015

∗ Cheng Wang, Haojin Yang, Christoph Meinel, Deep Semantic Map-

ping for Cross-Modal Retrieval, the 27th IEEE International Con-

ference on Tools with Artificial Intelligence (ICTAI 2015), Vietri

sul Mare, Italy, November 9-11, 2015

∗ Cheng Wang, Haojin Yang and Christoph Meinel, Does Multilevel

Semantic Representation Improve Text Categorization?, the 26th

International Conference on Database and Expert Systems Appli-

cations (DEXA 2015), Valencia, Spain, September 1-4, 2015

∗ Haojin Yang, Cheng Wang, Xiaoyin Che, Sheng Luo and Christoph Meinel.

An Improved System For Real-Time Scene Text Recognition, ACM

International Conference on Multimedia Retrieval (ICMR 2015),

system demonstration session, Shanghai, June 23-26, 2015

∗ Cheng Wang, Haojin Yang, Xiaoyin Che and Christoph Meinel,

Concept-Based Multimodal Learning for Topic Generation, the 21st


MultiMedia Modelling Conference (MMM2015), Sydney, Australia,

Jan 5-7, 2015

∗ Sheng Luo, Haojin Yang and Christoph Meinel, Reward-based In-

termittent Reinforcement in Gamification for E-learning, 7th Inter-

national Conference on Computer Supported Education (CSEDU),

Lisbon, Portugal, Mai 23-25, 2015

∗ Xiaoyin Che, Haojin Yang and Christoph Meinel, Table Detection

from Slide Images, 7th Pacific Rim Symposium on Image and Video

Technology (PSIVT2015), 23-27 November, 2015, Auckland, New

Zealand

∗ Xiaoyin Che, Haojin Yang and Christoph Meinel, Adaptive E-Lecture

Video Outline Extraction Based on Slides Analysis, the 14th Inter-

national Conference on Web-based Learning (ICWL 2015), Guangzhou,

China, November 5-8, 2015

∗ Cheng Wang, Haojin Yang and Christoph Meinel, Visual-Textual

Late Semantic Fusion Using Deep Neural Network for Document

Categorization, the 22nd International Conference on Neural In-

formation Processing (ICONIP2015), Istanbul, Turkey, November

9-12, 2015

– 2014

∗ Bernhard Quehl, Haojin Yang and Harald Sack, Improving text

recognition by distinguishing scene and overlay text, the 7th Inter-

national Conference on Machine Vision (ICMV 2014), Milan, Italy,

November 19-21, 2014

– 2013

∗ Xiaoyin Che, Haojin Yang, Christoph Meinel, Lecture Video Seg-

mentation by Automatically Analyzing the Synchronized Slides, The

21st ACM International Conference on Multimedia (ACM MM),

October 21-25, 2013, Barcelona, Spain

∗ Franka Grünewald, Haojin Yang, Christoph Meinel, Evaluating the

Digital Manuscript Functionality - User Testing For Lecture Video

Annotation Features, 12th International Conference on Web-based


Learning (ICWL 2013), 6 - 9th October 2013, Kenting, Taiwan.

Springer lecture notes, 2013. (best student paper award)

∗ Xiaoyin Che, Haojin Yang, Christoph Meinel, Tree-Structure Out-

line Generation for Lecture Videos with Synchronized Slides, The

Second International Conference on E-Learning and E-Technologies

in Education (ICEEE2013), 23-25th September 2013, Lodz Poland


Appendix C

Deep Learning Applications

An incomplete list of DL applications:

• document processing [HS11]

• image classification and recognition [SZ14b, KSH12, HZRS16]

• video classification [KTS+14]

• sequence generation [Gra13]

• text, speech, image and video processing [LBH15]

• speech recognition and spoken language understanding [HDY+12, ZCY+16]

• text-to-speech generation [WSRS+17, ACC+17]

• sentence classification and modelling [Kim14, KGB14]

• premise selection [ISA+16]

• document and sentence processing [LM14]

• generating image captions [VTBE15, WYBM16]

• photographic style transfer [LPSB17]

• natural image manifold [ZKSE16]

• image colorization [ZIE16]


• visual question answering [AAL+15]

• generating textures and stylized images [ULVL16]

• visual recognition and description [DAHG+15]

• object detection [SMH+11]

• character motion synthesis and editing [HSK16]

• word representation [MCCD13]

• singing synthesis [BB17]

• person identification [LZXW14]

• face recognition and verification [TYRW14]

• action recognition in videos [SZ14a]

• classifying and visualizing motion capture sequences [CC14]

• handwriting generation and prediction [CHJO16]

• machine translation [BCB14, WSC+16]

• named entity recognition [LBS+16]

• conversational agents [GBC+17]

• cancer detection [EKN+17]

• audio generation [VDODZ+16]

• X-ray CT reconstruction [KMY17]

• hardware acceleration [HLM+16]

• robotics [LLS15]

• autonomous driving [CSKX15]

• pedestrian detection [OW13]


• internet of things [LOD18]

• signature identification [RYM16]


References

[20117a] 2017, BraTS: BraTS 2017. https://www.med.upenn.edu/sbia/

brats2017.html, 2017 8, 13, 223

[20117b] 2017, LiTS: LiTS 2017 . https://competitions.codalab.org/

competitions/15595, 2017 8, 13, 223

[AAL+15] Antol, Stanislaw ; Agrawal, Aishwarya ; Lu, Jiasen ;

Mitchell, Margaret ; Batra, Dhruv ; Lawrence Zitnick, C

; Parikh, Devi: Vqa: Visual question answering. In: Proceedings

of the IEEE international conference on computer vision, 2015, S.

2425–2433 246

[ABC+16] Abadi, Martín ; Barham, Paul ; Chen, Jianmin ; Chen, Zhifeng

; Davis, Andy ; Dean, Jeffrey ; Devin, Matthieu ; Ghemawat,

Sanjay ; Irving, Geoffrey ; Isard, Michael u. a.: Tensorflow: a

system for large-scale machine learning. In: OSDI Bd. 16, 2016,

S. 265–283 9, 64

[ACC+17] Arik, Sercan O. ; Chrzanowski, Mike ; Coates, Adam ; Di-

amos, Gregory ; Gibiansky, Andrew ; Kang, Yongguo ; Li,

Xian ; Miller, John ; Ng, Andrew ; Raiman, Jonathan u. a.:

Deep voice: Real-time neural text-to-speech. In: arXiv preprint

arXiv:1702.07825 (2017) 245

[AR15] Aubry, Mathieu ; Russell, Bryan C.: Understanding deep

features with computer-generated imagery. In: Proceedings of

the IEEE International Conference on Computer Vision, 2015, S.

2875–2883 70


[B+09] Bengio, Yoshua u. a.: Learning deep architectures for AI. In:

Foundations and Trends® in Machine Learning 2 (2009), Nr. 1, S.

1–127 21, 37, 60

[BB+95] Bishop, Chris ; Bishop, Christopher M. u. a.: Neural networks

for pattern recognition. Oxford university press, 1995 24

[BB17] Blaauw, Merlijn ; Bonada, Jordi: A neural parametric singing

synthesizer. In: arXiv preprint arXiv:1704.03809 (2017) 246

[BCB14] Bahdanau, Dzmitry ; Cho, Kyunghyun ; Bengio, Yoshua: Neu-

ral machine translation by jointly learning to align and translate.

In: arXiv preprint arXiv:1409.0473 (2014) 63, 246

[BCC+16] Bojarski, Mariusz ; Choromanska, Anna ; Choromanski,

Krzysztof ; Firner, Bernhard ; Jackel, Larry ; Muller, Urs ;

Zieba, Karol: Visualbackprop: visualizing cnns for autonomous

driving. In: arXiv preprint (2016) xv, 59, 60, 70

[BCNN13a] Bissacco, A. ; Cummins, M. ; Netzer, Y. ; Neven, H.: Pho-

toOCR: Reading Text in Uncontrolled Conditions. In: 2013 IEEE

International Conference on Computer Vision, 2013. – ISSN 1550–

5499, S. 785–792 9, 39

[BCNN13b] Bissacco, Alessandro ; Cummins, Mark ; Netzer, Yuval ;

Neven, Hartmut: PhotoOCR: Reading Text in Uncontrolled Con-

ditions. In: Proceedings of the IEEE International Conference on

Computer Vision, 2013, 785-792 101

[BCV13a] Bengio, Y. ; Courville, A. ; Vincent, P.: Representation

Learning: A Review and New Perspectives. In: IEEE Transactions

on Pattern Analysis and Machine Intelligence 35 (2013), Aug, Nr.

8, S. 1798–1828. http://dx.doi.org/10.1109/TPAMI.2013.50. –

DOI 10.1109/TPAMI.2013.50. – ISSN 0162–8828 7


[BCV13b] Bengio, Yoshua ; Courville, Aaron ; Vincent, Pascal: Repre-

sentation learning: A review and new perspectives. In: IEEE trans-

actions on pattern analysis and machine intelligence 35 (2013), Nr.

8, S. 1798–1828 20

[BDTD+16] Bojarski, Mariusz ; Del Testa, Davide ; Dworakowski,

Daniel ; Firner, Bernhard ; Flepp, Beat ; Goyal, Prasoon

; Jackel, Lawrence D. ; Monfort, Mathew ; Muller, Urs ;

Zhang, Jiakai u. a.: End to end learning for self-driving cars. In:

arXiv preprint arXiv:1604.07316 (2016) vi, 5

[BE15] Biological Engineering, Massachusetts Institute of T. o.:

MDA231 human breast carcinoma cell dataset. http://www.

celltrackingchallenge.net/datasets.html, 2015 8, 13, 223

[BGV92] Boser, Bernhard E. ; Guyon, Isabelle M. ; Vapnik, Vladimir N.:

A Training Algorithm for Optimal Margin Classifiers. In: Pro-

ceedings of the Fifth Annual Workshop on Computational Learning

Theory. New York, NY, USA : ACM, 1992 (COLT ’92). – ISBN

0–89791–497–X, 144–152 19

[Blo18] Blog, OpenAI: OpenAI Five has started to defeat amateur human

teams at Dota 2. 2018 23, 68

[BQ15] Quehl, Bernhard ; Yang, Haojin ; Sack, Harald: Improving text recognition by distinguishing scene and overlay text, 2015, 9445 - 9445 - 5 9

[BYBM18] Bethge, Joseph ; Yang, Haojin ; Bartz, Christian ; Meinel,

Christoph: Learning to Train a Binary Neural Network. In: arXiv

preprint arXiv:1809.10463 (2018) 10, 14, 57, 82, 227

[BYBM19] Bethge, Joseph ; Yang, Haojin ; Bornstein, Marvin ; Meinel,

Christoph: Back to Simplicity: How to Train Accurate BNNs from

Scratch? In: arXiv preprint arXiv:1906.08637 (2019) 10, 14, 82,

227


[BYM17a] Bartz, Christian ; Yang, Haojin ; Meinel, Christoph: STN-

OCR: A single Neural Network for Text Detection and Text Recog-

nition. In: CoRR abs/1707.08831 (2017). http://arxiv.org/

abs/1707.08831 9

[BYM17b] Bartz, Christian ; Yang, Haojin ; Meinel, Christoph: STN-

OCR: A single Neural Network for Text Detection and Text Recog-

nition. In: arXiv preprint arXiv:1707.08831 (2017) 91

[BYM18] Bartz, Christian ; Yang, Haojin ; Meinel, Christoph: SEE:

Towards Semi-Supervised End-to-End Scene Text Recognition. In:

Proceedings of the 2018 Conference on Artificial Intelligence, 2018

(AAAI ’18) vi, 6, 9, 14, 73, 81, 220

[CC14] Cho, Kyunghyun ; Chen, Xi: Classifying and visualizing motion

capture sequences using deep neural networks. In: Computer Vi-

sion Theory and Applications (VISAPP), 2014 International Con-

ference on Bd. 2 IEEE, 2014, S. 122–130 246

[CHJO16] Carter, Shan ; Ha, David ; Johnson, Ian ; Olah, Chris: Exper-

iments in handwriting with a neural network. In: Distill 1 (2016),

Nr. 12, S. e4 246

[Cho17] Chollet, Francois: Xception: Deep learning with depthwise separable convolutions. In: arXiv preprint arXiv:1610.02357 (2017) 58

[Cis17] Cisco: Cisco Visual Networking Index: Forecast and Methodology, 2016–2021. Version: 2017. https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/complete-white-paper-c11-481360.pdf. Cisco public, 2017, 3 vi, 3

[CLL+15] Chen, Tianqi ; Li, Mu ; Li, Yutian ; Lin, Min ; Wang, Naiyan

; Wang, Minjie ; Xiao, Tianjun ; Xu, Bing ; Zhang, Chiyuan ;

Zhang, Zheng: Mxnet: A flexible and efficient machine learning


library for heterogeneous distributed systems. In: arXiv preprint

arXiv:1512.01274 (2015) vii, 6, 9, 10, 101

[CLYM16] Che, Xiaoyin ; Luo, Sheng ; Yang, Haojin ; Meinel, Christoph:

Sentence Boundary Detection Based on Parallel Lexical and Acous-

tic Models. In: Interspeech, 2016, S. 2528–2532 12

[CN15] Chiu, Jason P. ; Nichols, Eric: Named entity recognition with

bidirectional LSTM-CNNs. In: arXiv preprint arXiv:1511.08308

(2015) 35

[CPC16] Canziani, Alfredo ; Paszke, Adam ; Culurciello, Eugenio: An

analysis of deep neural network models for practical applications.

In: arXiv preprint arXiv:1605.07678 (2016) xv, 57, 58

[CSKX15] Chen, Chenyi ; Seff, Ari ; Kornhauser, Alain ; Xiao, Jianx-

iong: Deepdriving: Learning affordance for direct perception in

autonomous driving. In: Proceedings of the IEEE International

Conference on Computer Vision, 2015, S. 2722–2730 246

[CUH15] Clevert, Djork-Arne ; Unterthiner, Thomas ; Hochreiter,

Sepp: Fast and accurate deep network learning by exponential

linear units (elus). In: arXiv preprint arXiv:1511.07289 (2015) 44

[CVMG+14] Cho, Kyunghyun ; Van Merrienboer, Bart ; Gulcehre,

Caglar ; Bahdanau, Dzmitry ; Bougares, Fethi ; Schwenk,

Holger ; Bengio, Yoshua: Learning phrase representations us-

ing RNN encoder-decoder for statistical machine translation. In:

arXiv preprint arXiv:1406.1078 (2014) 35

[CWV+14] Chetlur, Sharan ; Woolley, Cliff ; Vandermersch, Philippe

; Cohen, Jonathan ; Tran, John ; Catanzaro, Bryan ; Shel-

hamer, Evan: cudnn: Efficient primitives for deep learning. In:

arXiv preprint arXiv:1410.0759 (2014) 64

[Cyb89] Cybenko, George: Approximation by superpositions of a sig-

moidal function. In: Mathematics of control, signals and systems

2 (1989), Nr. 4, S. 303–314 25


[CYM13] Che, Xiaoyin ; Yang, Haojin ; Meinel, Christoph: Lecture

video segmentation by automatically analyzing the synchronized

slides. In: Proceedings of the 21st ACM international conference

on Multimedia ACM, 2013, S. 345–348 12

[CYM15] Che, Xiaoyin ; Yang, Haojin ; Meinel, Christoph: Adaptive

e-lecture video outline extraction based on slides analysis. In: In-

ternational Conference on Web-Based Learning Springer, 2015, S.

59–68 12

[CYM18] Che, Xiaoyin ; Yang, Haojin ; Meinel, Christoph: Automatic

Online Lecture Highlighting Based on Multimedia Analysis. In:

IEEE Transactions on Learning Technologies 11 (2018), Nr. 1, S.

27–40 12, 14, 82

[DAHG+15] Donahue, Jeffrey ; Anne Hendricks, Lisa ; Guadarrama,

Sergio ; Rohrbach, Marcus ; Venugopalan, Subhashini ;

Saenko, Kate ; Darrell, Trevor: Long-term recurrent convo-

lutional networks for visual recognition and description. In: Pro-

ceedings of the IEEE conference on computer vision and pattern

recognition, 2015, S. 2625–2634 246

[DB16] Dosovitskiy, Alexey ; Brox, Thomas: Inverting visual represen-

tations with convolutional networks. In: Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, 2016, S.

4829–4837 70

[DDS+09] Deng, J. ; Dong, W. ; Socher, R. ; Li, L.-J. ; Li, K. ; Fei-Fei,

L.: ImageNet: A Large-Scale Hierarchical Image Database. In:

CVPR09, 2009 17, 19, 62

[DHS11] Duchi, John ; Hazan, Elad ; Singer, Yoram: Adaptive sub-

gradient methods for online learning and stochastic optimization.

In: Journal of Machine Learning Research 12 (2011), Nr. Jul, S.

2121–2159 46


[DT05] Dalal, Navneet ; Triggs, Bill: Histograms of oriented gradients

for human detection. In: Computer Vision and Pattern Recogni-

tion, 2005. CVPR 2005. IEEE Computer Society Conference on

Bd. 1 IEEE, 2005, S. 886–893 19

[EKN+17] Esteva, Andre ; Kuprel, Brett ; Novoa, Roberto A. ; Ko,

Justin ; Swetter, Susan M. ; Blau, Helen M. ; Thrun, Sebas-

tian: Dermatologist-level classification of skin cancer with deep

neural networks. In: Nature 542 (2017), Nr. 7639, S. 115 vi, 5, 63,

246

[Fac18] Facebook: Facebook statistics. http://facebook.com/, 2018 vi,

3

[FH17] Frosst, Nicholas ; Hinton, Geoffrey: Distilling a neural net-

work into a soft decision tree. In: arXiv preprint arXiv:1711.09784

(2017) 71

[Fuk88] Fukushima, Kunihiko: Neocognitron: A hierarchical neural net-

work capable of visual pattern recognition. In: Neural networks 1

(1988), Nr. 2, S. 119–130 28

[FV91] Felleman, Daniel J. ; Van Essen, David C.: Distributed hierarchical processing in the primate cerebral cortex. In: Cerebral Cortex 1 (1991), Nr. 1, S. 1–47 20

[Gar93] Garofolo, John S.: TIMIT acoustic phonetic continuous speech

corpus. In: Linguistic Data Consortium, 1993 (1993) 62

[GB10] Glorot, Xavier ; Bengio, Yoshua: Understanding the difficulty

of training deep feedforward neural networks. In: Proceedings of

the thirteenth international conference on artificial intelligence and

statistics, 2010, S. 249–256 39, 40

[GBC+17] Ghazvininejad, Marjan ; Brockett, Chris ; Chang, Ming-

Wei ; Dolan, Bill ; Gao, Jianfeng ; Yih, Wen-tau ; Galley,

Michel: A knowledge-grounded neural conversation model. In:

arXiv preprint arXiv:1702.01932 (2017) 246


[GBCB16] Goodfellow, Ian ; Bengio, Yoshua ; Courville, Aaron: Deep learning. Bd. 1. MIT Press, Cambridge, 2016 22, 24, 37

[GBI+14] Goodfellow, Ian ; Bulatov, Yaroslav ; Ibarz, Julian ;

Arnoud, Sacha ; Shet, Vinay: Multi-digit Number Recogni-

tion from Street View Imagery using Deep Convolutional Neural

Networks. In: ICLR2014, 2014 101

[GDDM14] Girshick, Ross ; Donahue, Jeff ; Darrell, Trevor ; Malik,

Jitendra: Rich feature hierarchies for accurate object detection and

semantic segmentation. In: Proceedings of the IEEE conference on

computer vision and pattern recognition, 2014, S. 580–587 65

[GDG+15] Gregor, Karol ; Danihelka, Ivo ; Graves, Alex ; Rezende,

Danilo J. ; Wierstra, Daan: Draw: A recurrent neural network

for image generation. In: arXiv preprint arXiv:1502.04623 (2015)

66

[Gir15] Girshick, Ross: Fast r-cnn. In: Proceedings of the IEEE inter-

national conference on computer vision, 2015, S. 1440–1448 66

[GPAM+14] Goodfellow, Ian ; Pouget-Abadie, Jean ; Mirza, Mehdi ;

Xu, Bing ; Warde-Farley, David ; Ozair, Sherjil ; Courville,

Aaron ; Bengio, Yoshua: Generative adversarial nets. In: Ad-

vances in neural information processing systems, 2014, S. 2672–

2680 13, 23, 67, 68

[Gra13] Graves, Alex: Generating sequences with recurrent neural net-

works. In: arXiv preprint arXiv:1308.0850 (2013) 245

[GS05] Graves, Alex ; Schmidhuber, Jurgen: Framewise phoneme clas-

sification with bidirectional LSTM and other neural network archi-

tectures. In: Neural Networks 18 (2005), Nr. 5-6, S. 602–610 35

[HCS+16] Hubara, Itay ; Courbariaux, Matthieu ; Soudry, Daniel ; El-

Yaniv, Ran ; Bengio, Yoshua: Binarized neural networks. In:


Advances in neural information processing systems, 2016, S. 4107–

4115 9, 10, 72

[HDY+12] Hinton, Geoffrey ; Deng, Li ; Yu, Dong ; Dahl, George E. ;

Mohamed, Abdel-rahman ; Jaitly, Navdeep ; Senior, Andrew

; Vanhoucke, Vincent ; Nguyen, Patrick ; Sainath, Tara N.

u. a.: Deep neural networks for acoustic modeling in speech recog-

nition: The shared views of four research groups. In: IEEE Signal

processing magazine 29 (2012), Nr. 6, S. 82–97 245

[HGDG17] He, Kaiming ; Gkioxari, Georgia ; Dollar, Piotr ; Girshick,

Ross: Mask r-cnn. In: Computer Vision (ICCV), 2017 IEEE

International Conference on IEEE, 2017, S. 2980–2988 66, 73

[HHQ+16] He, Pan ; Huang, Weilin ; Qiao, Yu ; Loy, Chen C. ; Tang,

Xiaoou: Reading scene text in deep convolutional sequences. In:

Proceedings of the Thirtieth AAAI Conference on Artificial Intel-

ligence, AAAI Press, 2016, S. 3501–3508 101

[HL08] Huiskes, Mark J. ; Lew, Michael S.: The MIR flickr retrieval

evaluation. In: Proceedings of the 1st ACM international confer-

ence on Multimedia information retrieval ACM, 2008, S. 39–43 11

[HLM+16] Han, Song ; Liu, Xingyu ; Mao, Huizi ; Pu, Jing ; Pedram,

Ardavan ; Horowitz, Mark A. ; Dally, William J.: EIE: efficient

inference engine on compressed deep neural network. In: Computer

Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International

Symposium on IEEE, 2016, S. 243–254 246

[HMD15] Han, Song ; Mao, Huizi ; Dally, William J.: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In: arXiv preprint arXiv:1510.00149 (2015) 58, 71


[HS97] Hochreiter, Sepp ; Schmidhuber, Jurgen: Long short-term

memory. In: Neural computation 9 (1997), Nr. 8, S. 1735–1780 28,

34

[HS06] Hinton, Geoffrey E. ; Salakhutdinov, Ruslan R.: Reducing

the dimensionality of data with neural networks. In: science 313

(2006), Nr. 5786, S. 504–507 23, 61

[HS11] Hinton, Geoffrey ; Salakhutdinov, Ruslan: Discovering binary

codes for documents by learning deep generative models. In: Topics

in Cognitive Science 3 (2011), Nr. 1, S. 74–91 245

[HSK16] Holden, Daniel ; Saito, Jun ; Komura, Taku: A deep learning

framework for character motion synthesis and editing. In: ACM

Transactions on Graphics (TOG) 35 (2016), Nr. 4, S. 138 246

[HSW89] Hornik, Kurt ; Stinchcombe, Maxwell ; White, Halbert: Mul-

tilayer feedforward networks are universal approximators. In: Neu-

ral networks 2 (1989), Nr. 5, S. 359–366 25

[HVD15] Hinton, Geoffrey ; Vinyals, Oriol ; Dean, Jeff: Distill-

ing the knowledge in a neural network. In: arXiv preprint

arXiv:1503.02531 (2015) 226

[HZC+17] Howard, Andrew G. ; Zhu, Menglong ; Chen, Bo ; Kalenichenko, Dmitry ; Wang, Weijun ; Weyand, Tobias ; Andreetto, Marco ; Adam, Hartwig: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. In: arXiv preprint arXiv:1704.04861 (2017) 58, 71

[HZRS15] He, Kaiming ; Zhang, Xiangyu ; Ren, Shaoqing ; Sun, Jian:

Delving deep into rectifiers: Surpassing human-level performance

on imagenet classification. In: Proceedings of the IEEE interna-

tional conference on computer vision, 2015, S. 1026–1034 40, 44


[HZRS16] He, Kaiming ; Zhang, Xiangyu ; Ren, Shaoqing ; Sun, Jian:

Deep residual learning for image recognition. In: Proceedings of

the IEEE conference on computer vision and pattern recognition,

2016, S. 770–778 29, 54, 100, 245

[IHM+16] Iandola, Forrest N. ; Han, Song ; Moskewicz, Matthew W. ; Ashraf, Khalid ; Dally, William J. ; Keutzer, Kurt: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. In: arXiv preprint arXiv:1602.07360 (2016) 58, 71

[IS15a] Ioffe, Sergey ; Szegedy, Christian: Batch normalization: Accel-

erating deep network training by reducing internal covariate shift.

In: arXiv preprint arXiv:1502.03167 (2015) 37, 40, 41

[IS15b] Ioffe, Sergey ; Szegedy, Christian: Batch normalization: Accel-

erating deep network training by reducing internal covariate shift.

In: International conference on machine learning, 2015, S. 448–456

100

[ISA+16] Irving, Geoffrey ; Szegedy, Christian ; Alemi, Alexander A.

; Een, Niklas ; Chollet, Francois ; Urban, Josef: Deepmath-

deep sequence models for premise selection. In: Advances in Neural

Information Processing Systems, 2016, S. 2235–2243 245

[IZZE17] Isola, Phillip ; Zhu, Jun-Yan ; Zhou, Tinghui ; Efros,

Alexei A.: Image-to-image translation with conditional adversarial

networks. In: arXiv preprint (2017) 63

[JSD+14] Jia, Yangqing ; Shelhamer, Evan ; Donahue, Jeff ; Karayev,

Sergey ; Long, Jonathan ; Girshick, Ross ; Guadarrama, Ser-

gio ; Darrell, Trevor: Caffe: Convolutional architecture for fast

feature embedding. In: Proceedings of the 22nd ACM international

conference on Multimedia ACM, 2014, S. 675–678 9


[JSVZ14] Jaderberg, M. ; Simonyan, K. ; Vedaldi, A. ; Zisserman, A.:

Synthetic Data and Artificial Neural Networks for Natural Scene

Text Recognition. In: Workshop on Deep Learning, NIPS, 2014

101

[JSVZ15] Jaderberg, Max ; Simonyan, Karen ; Vedaldi, Andrea ; Zisserman, Andrew: Reading Text in the Wild with Convolutional Neural Networks. In: International Journal of Computer Vision 116 (2015), Nr. 1, 1-20. http://dx.doi.org/10.1007/s11263-015-0823-z. – DOI 10.1007/s11263–015–0823–z. – ISSN 0920–5691, 1573–1405 xvii, 101

[JVZ14a] Jaderberg, Max ; Vedaldi, Andrea ; Zisserman, Andrew:

Deep Features for Text Spotting. In: Computer Vision - ECCV

2014, Springer International Publishing, 2014 (Lecture Notes in

Computer Science 8692). – ISBN 978–3–319–10592–5 978–3–319–

10593–2, 512-528 101

[JVZ14b] Jaderberg, Max ; Vedaldi, Andrea ; Zisserman, Andrew:

Deep Features for Text Spotting. In: Fleet, David (Hrsg.) ; Pa-

jdla, Tomas (Hrsg.) ; Schiele, Bernt (Hrsg.) ; Tuytelaars,

Tinne (Hrsg.): Computer Vision - ECCV 2014, Springer Interna-

tional Publishing, 2014 (Lecture Notes in Computer Science 8692).

– ISBN 978–3–319–10592–5 978–3–319–10593–2, 512-528 101

[KB14] Kingma, Diederik P. ; Ba, Jimmy: Adam: A method for stochas-

tic optimization. In: arXiv preprint arXiv:1412.6980 (2014) 37,

47

[KGB14] Kalchbrenner, Nal ; Grefenstette, Edward ; Blunsom,

Phil: A convolutional neural network for modelling sentences. In:

arXiv preprint arXiv:1404.2188 (2014) 245

[KGBN+15] Karatzas, Dimosthenis ; Gomez-Bigorda, Lluis ; Nicolaou,

Anguelos ; Ghosh, Suman ; Bagdanov, Andrew ; Iwamura,

Masakazu ; Matas, Jiri ; Neumann, Lukas ; Chandrasekhar,


Vijay R. ; Lu, Shijian ; Shafait, Faisal ; Uchida, Seiichi ; Val-

veny, Ernest: ICDAR 2015 Competition on Robust Reading. In:

Proceedings of the 2015 13th International Conference on Docu-

ment Analysis and Recognition (ICDAR). Washington, DC, USA :

IEEE Computer Society, 2015 (ICDAR ’15). – ISBN 978–1–4799–

1805–8, 1156–1160 100

[KH09] Krizhevsky, Alex ; Hinton, Geoffrey: Learning multiple layers

of features from tiny images / Citeseer. 2009. – Forschungsbericht

19

[Kim14] Kim, Yoon: Convolutional neural networks for sentence classifica-

tion. In: arXiv preprint arXiv:1408.5882 (2014) 245

[KL17] Koh, Pang W. ; Liang, Percy: Understanding black-box predic-

tions via influence functions. In: arXiv preprint arXiv:1703.04730

(2017) 70

[KMY17] Kang, Eunhee ; Min, Junhong ; Ye, Jong C.: A deep convolu-

tional neural network using directional wavelets for low-dose X-ray

CT reconstruction. In: Medical physics 44 (2017), Nr. 10 246

[KSH12] Krizhevsky, Alex ; Sutskever, Ilya ; Hinton, Geoffrey E.:

Imagenet classification with deep convolutional neural networks.

In: Advances in neural information processing systems, 2012, S.

1097–1105 17, 36, 48, 245

[KSU+13] Karatzas, Dimosthenis ; Shafait, Faisal ; Uchida, Seiichi ;

Iwamura, Masakazu ; Bigorda, Lluis G. ; Mestre, Sergi R.

; Mas, Joan ; Mota, David F. ; Almazan, Jon A. ; Heras,

Lluis P. l.: ICDAR 2013 robust reading competition. In: 2013 12th

International Conference on Document Analysis and Recognition,

IEEE, 2013, S. 1484–1493 101

[KTS+14] Karpathy, Andrej ; Toderici, George ; Shetty, Sanketh ; Le-

ung, Thomas ; Sukthankar, Rahul ; Fei-Fei, Li: Large-scale


video classification with convolutional neural networks. In: Pro-

ceedings of the IEEE conference on Computer Vision and Pattern

Recognition, 2014, S. 1725–1732 245

[L+15] LeCun, Yann u. a.: LeNet-5, convolutional neural networks. In: URL: http://yann.lecun.com/exdb/lenet (2015), S. 20 xiii, 29, 48

[LAE+16] Liu, Wei ; Anguelov, Dragomir ; Erhan, Dumitru ; Szegedy,

Christian ; Reed, Scott ; Fu, Cheng-Yang ; Berg, Alexander C.:

Ssd: Single shot multibox detector. In: European conference on

computer vision Springer, 2016, S. 21–37 66

[LBBH98] LeCun, Yann ; Bottou, Leon ; Bengio, Yoshua ; Haffner,

Patrick: Gradient-based learning applied to document recognition.

In: Proceedings of the IEEE 86 (1998), Nr. 11, S. 2278–2324 28

[LBH15] LeCun, Yann ; Bengio, Yoshua ; Hinton, Geoffrey: Deep learn-

ing. In: nature 521 (2015), Nr. 7553, S. 436 20, 24, 37, 245

[LBS+16] Lample, Guillaume ; Ballesteros, Miguel ; Subramanian,

Sandeep ; Kawakami, Kazuya ; Dyer, Chris: Neural ar-

chitectures for named entity recognition. In: arXiv preprint

arXiv:1603.01360 (2016) 246

[LLS15] Lenz, Ian ; Lee, Honglak ; Saxena, Ashutosh: Deep learning for

detecting robotic grasps. In: The International Journal of Robotics

Research 34 (2015), Nr. 4-5, S. 705–724 246

[LM14] Le, Quoc ; Mikolov, Tomas: Distributed representations of sen-

tences and documents. In: International Conference on Machine

Learning, 2014, S. 1188–1196 245

[LMB+14] Lin, Tsung-Yi ; Maire, Michael ; Belongie, Serge ; Hays, James

; Perona, Pietro ; Ramanan, Deva ; Dollar, Piotr ; Zitnick,

C L.: Microsoft coco: Common objects in context. In: European

conference on computer vision Springer, 2014, S. 740–755 11, 222


[LNZ+17] Li, Zefan ; Ni, Bingbing ; Zhang, Wenjun ; Yang, Xiaokang ; Gao, Wen: Performance Guaranteed Network Acceleration via High-Order Residual Quantization. In: arXiv preprint arXiv:1708.08687 (2017) 9

[LOD18] Li, He ; Ota, Kaoru ; Dong, Mianxiong: Learning IoT in edge:

deep learning for the internet of things with edge computing. In:

IEEE Network 32 (2018), Nr. 1, S. 96–101 247

[Low04] Lowe, David G.: Distinctive image features from scale-invariant

keypoints. In: International journal of computer vision 60 (2004),

Nr. 2, S. 91–110 19

[LPSB17] Luan, Fujun ; Paris, Sylvain ; Shechtman, Eli ; Bala, Kavita: Deep photo style transfer. In: CoRR abs/1703.07511 (2017) 245

[Lu15] Lu, Yao: Unsupervised learning on neural network outputs:

with application in zero-shot learning. In: arXiv preprint

arXiv:1506.00990 (2015) 70

[LW+02] Liaw, Andy ; Wiener, Matthew u. a.: Classification and regres-

sion by randomForest. In: R news 2 (2002), Nr. 3, S. 18–22 19

[LZP17] Lin, Xiaofan ; Zhao, Cong ; Pan, Wei: Towards Accurate Binary

Convolutional Neural Network. In: Advances in Neural Informa-

tion Processing Systems, 2017, S. 344–352 9, 72

[LZXW14] Li, Wei ; Zhao, Rui ; Xiao, Tong ; Wang, Xiaogang: Deepreid:

Deep filter pairing neural network for person re-identification. In:

Proceedings of the IEEE Conference on Computer Vision and Pat-

tern Recognition, 2014, S. 152–159 246

[MAJ12] Mishra, Anand ; Alahari, Karteek ; Jawahar, C. V.: Scene Text

Recognition using Higher Order Language Priors. In: BMVC 2012-

23rd British Machine Vision Conference, British Machine Vision

Association, 2012. – ISBN 1–901725–46–4, 127.1-127.11 101


[Mar18] Marcus, Gary: Deep learning: A critical appraisal. In: arXiv

preprint arXiv:1801.00631 (2018) 75, 76

[MCCD13] Mikolov, Tomas ; Chen, Kai ; Corrado, Greg ; Dean, Jeffrey:

Efficient estimation of word representations in vector space. In:

arXiv preprint arXiv:1301.3781 (2013) 246

[MHN13] Maas, Andrew L. ; Hannun, Awni Y. ; Ng, Andrew Y.: Rectifier

nonlinearities improve neural network acoustic models. In: Proc.

icml Bd. 30, 2013, S. 3 44

[MKS+13] Mnih, Volodymyr ; Kavukcuoglu, Koray ; Silver, David ;

Graves, Alex ; Antonoglou, Ioannis ; Wierstra, Daan ;

Riedmiller, Martin: Playing atari with deep reinforcement learn-

ing. In: arXiv preprint arXiv:1312.5602 (2013) 23

[MKS+15] Mnih, Volodymyr ; Kavukcuoglu, Koray ; Silver, David ;

Rusu, Andrei A. ; Veness, Joel ; Bellemare, Marc G. ;

Graves, Alex ; Riedmiller, Martin ; Fidjeland, Andreas K. ;

Ostrovski, Georg u. a.: Human-level control through deep rein-

forcement learning. In: Nature 518 (2015), Nr. 7540, S. 529 23,

68

[MLA+13] Malik, M. I. ; Liwicki, M. ; Alewijnse, L. ; Ohyama, W. ;

Blumenstein, M. ; Found, B.: ICDAR 2013 Competitions on

Signature Verification and Writer Identification for On- and Offline

Skilled Forgeries (SigWiComp 2013). In: 2013 12th International

Conference on Document Analysis and Recognition, 2013. – ISSN

1520–5363, S. 1477–1483 9

[MP43] McCulloch, Warren S. ; Pitts, Walter: A logical calculus of

the ideas immanent in nervous activity. In: The bulletin of math-

ematical biophysics 5 (1943), Nr. 4, S. 115–133 25

[MP69] Minsky, Marvin ; Papert, Seymour A.: Perceptrons: An intro-

duction to computational geometry. MIT press, 1969 25


[MSC+13a] Mikolov, Tomas ; Sutskever, Ilya ; Chen, Kai ; Corrado,

Greg S. ; Dean, Jeff: Distributed representations of words and

phrases and their compositionality. In: Advances in neural infor-

mation processing systems, 2013, S. 3111–3119 19

[MSC+13b] Mikolov, Tomas ; Sutskever, Ilya ; Chen, Kai ; Corrado,

Greg S. ; Dean, Jeff: Distributed representations of words and

phrases and their compositionality. In: Advances in neural infor-

mation processing systems, 2013, S. 3111–3119 228

[MYM18] Mordido, Goncalo ; Yang, Haojin ; Meinel, Christoph:

Dropout-GAN: Learning from a Dynamic Ensemble of Discrimi-

nators. In: arXiv preprint arXiv:1807.11346 (2018) 228

[MYM19a] Mordido, Goncalo ; Yang, Haojin ; Meinel, Christoph:

Dropout-GAN: Learning from a Dynamic Ensemble of Discrimi-

nators (under review). In: Proceedings of the 2019 Conference on

Artificial Intelligence, 2019 (AAAI ’19) 13

[MYM19b] Mordido, Goncalo ; Yang, Haojin ; Meinel, Christoph: Mi-

croGAN: Promoting Variety through Micro-Batch Discrimination

(under review). In: Proceedings of the International Conference on

Learning Representations, 2019 (ICLR ’19) 13

[NH10] Nair, Vinod ; Hinton, Geoffrey E.: Rectified linear units im-

prove restricted boltzmann machines. In: Proceedings of the 27th

international conference on machine learning (ICML-10), 2010, S.

807–814 29, 37, 43

[OMS17] Olah, Chris ; Mordvintsev, Alexander ; Schubert, Ludwig:

Feature visualization. In: Distill 2 (2017), Nr. 11, S. e7 70

[OPM02] Ojala, Timo ; Pietikainen, Matti ; Maenpaa, Topi: Multires-

olution gray-scale and rotation invariant texture classification with

local binary patterns. In: IEEE Transactions on pattern analysis

and machine intelligence 24 (2002), Nr. 7, S. 971–987 19


[OW13] Ouyang, Wanli ; Wang, Xiaogang: Joint deep learning for pedes-

trian detection. In: Proceedings of the IEEE International Confer-

ence on Computer Vision, 2013, S. 2056–2063 246

[PCC+] Paszke, Adam ; Chintala, Soumith ; Collobert, Ronan ; Kavukcuoglu, Koray ; Farabet, Clement ; Bengio, Samy ; Melvin, Iain ; Weston, Jason ; Mariethoz, Johnny: PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration, May 2017 9

[Qia99] Qian, Ning: On the momentum term in gradient descent learning

algorithms. In: Neural networks 12 (1999), Nr. 1, S. 145–151 45

[RCPC+10] Rasiwasia, Nikhil ; Costa Pereira, Jose ; Coviello,

Emanuele ; Doyle, Gabriel ; Lanckriet, Gert R. ; Levy, Roger

; Vasconcelos, Nuno: A new approach to cross-modal multi-

media retrieval. In: Proceedings of the 18th ACM international

conference on Multimedia ACM, 2010, S. 251–260 11

[RDGF16] Redmon, Joseph ; Divvala, Santosh ; Girshick, Ross ;

Farhadi, Ali: You only look once: Unified, real-time object de-

tection. In: Proceedings of the IEEE conference on computer vision

and pattern recognition, 2016, S. 779–788 66

[RDS+15] Russakovsky, Olga ; Deng, Jia ; Su, Hao ; Krause, Jonathan

; Satheesh, Sanjeev ; Ma, Sean ; Huang, Zhiheng ; Karpathy,

Andrej ; Khosla, Aditya ; Bernstein, Michael u. a.: Imagenet

large scale visual recognition challenge. In: International Journal

of Computer Vision 115 (2015), Nr. 3, S. 211–252 54, 62

[RHGS15] Ren, Shaoqing ; He, Kaiming ; Girshick, Ross ; Sun, Jian:

Faster r-cnn: Towards real-time object detection with region pro-

posal networks. In: Advances in neural information processing

systems, 2015, S. 91–99 66, 73


[RHW86a] Rumelhart, David E. ; Hinton, Geoffrey E. ; Williams,

Ronald J.: Learning representations by back-propagating errors.

In: nature 323 (1986), Nr. 6088, S. 533 20

[RHW86b] Rumelhart, David E. ; Hinton, Geoffrey E. ; Williams,

Ronald J.: Learning representations by back-propagating errors.

In: nature 323 (1986), Nr. 6088, S. 533 26

[RORF16] Rastegari, Mohammad ; Ordonez, Vicente ; Redmon, Joseph

; Farhadi, Ali: Xnor-net: Imagenet classification using binary

convolutional neural networks. In: European Conference on Com-

puter Vision Springer, 2016, S. 525–542 9, 10, 72

[Ros58] Rosenblatt, Frank: The perceptron: a probabilistic model for

information storage and organization in the brain. In: Psychologi-

cal review 65 (1958), Nr. 6, S. 386 25

[RVRK16] Richter, Stephan R. ; Vineet, Vibhav ; Roth, Stefan ;

Koltun, Vladlen: Playing for Data: Ground Truth from Com-

puter Games. In: Leibe, Bastian (Hrsg.) ; Matas, Jiri (Hrsg.)

; Sebe, Nicu (Hrsg.) ; Welling, Max (Hrsg.): European Con-

ference on Computer Vision (ECCV) Bd. 9906, Springer Interna-

tional Publishing, 2016 (LNCS), S. 102–118 xiv, 39

[RYHH10] Rashtchian, Cyrus ; Young, Peter ; Hodosh, Micah ; Hock-

enmaier, Julia: Collecting image annotations using Amazon’s

Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Work-

shop on Creating Speech and Language Data with Amazon’s Me-

chanical Turk Association for Computational Linguistics, 2010, S.

139–147 11, 222

[RYM16] Rantzsch, Hannes ; Yang, Haojin ; Meinel, Christoph: Signa-

ture Embedding: Writer Independent Offline Signature Verification

with Deep Metric Learning. In: Bebis, George (Hrsg.) ; Boyle,

Richard (Hrsg.) ; Parvin, Bahram (Hrsg.) ; Koracin, Darko


(Hrsg.) ; Porikli, Fatih (Hrsg.) ; Skaff, Sandra (Hrsg.) ; En-

tezari, Alireza (Hrsg.) ; Min, Jianyuan (Hrsg.) ; Iwai, Daisuke

(Hrsg.) ; Sadagic, Amela (Hrsg.) ; Scheidegger, Carlos (Hrsg.)

; Isenberg, Tobias (Hrsg.): Advances in Visual Computing. Cham

: Springer International Publishing, 2016. – ISBN 978–3–319–

50832–0, S. 616–625 9, 247

[RYM18] Rezaei, Mina ; Yang, Haojin ; Meinel, Christoph: Instance Tumor Segmentation using Multitask Convolutional Neural Network. In: IEEE International Joint Conference on Neural Networks (IJCNN) IEEE, 2018 13

[RYM19a] Rezaei, Mina ; Yang, Haojin ; Meinel, Christoph: Conditional Generative Adversarial Refinement Networks for Unbalanced Medical Image Semantic Segmentation, 2019 (WACV ’19) 13

[RYM19b] Rezaei, Mina ; Yang, Haojin ; Meinel, Christoph: Recurrent

generative adversarial network for learning imbalanced medical im-

age semantic segmentation. In: Multimedia Tools and Applications

(2019), Feb. http://dx.doi.org/10.1007/s11042-019-7305-1.

– DOI 10.1007/s11042–019–7305–1. – ISSN 1573–7721 13, 14, 82

[SBY16] Shi, Baoguang ; Bai, Xiang ; Yao, Cong: An end-to-end train-

able neural network for image-based sequence recognition and its

application to scene text recognition. In: IEEE Transactions on

Pattern Analysis and Machine Intelligence (2016) 101

[Sch15] Schmidhuber, Jurgen: Deep learning in neural networks: An

overview. In: Neural networks 61 (2015), S. 85–117 20

[SDBR14] Springenberg, Jost T. ; Dosovitskiy, Alexey ; Brox, Thomas

; Riedmiller, Martin: Striving for simplicity: The all convolu-

tional net. In: arXiv preprint arXiv:1412.6806 (2014) 69

[SFH17] Sabour, Sara ; Frosst, Nicholas ; Hinton, Geoffrey E.: Dy-

namic routing between capsules. In: Advances in Neural Informa-

tion Processing Systems, 2017, S. 3856–3866 74


[SGZ+16] Salimans, Tim ; Goodfellow, Ian ; Zaremba, Wojciech ;

Cheung, Vicki ; Radford, Alec ; Chen, Xi: Improved tech-

niques for training gans. In: Advances in Neural Information Pro-

cessing Systems, 2016, S. 2234–2242 68

[SHK+14] Srivastava, Nitish ; Hinton, Geoffrey ; Krizhevsky, Alex ;

Sutskever, Ilya ; Salakhutdinov, Ruslan: Dropout: a simple

way to prevent neural networks from overfitting. In: The Journal

of Machine Learning Research 15 (2014), Nr. 1, S. 1929–1958 37,

42

[SHS+17] Silver, David ; Hubert, Thomas ; Schrittwieser, Julian ;

Antonoglou, Ioannis ; Lai, Matthew ; Guez, Arthur ; Lanc-

tot, Marc ; Sifre, Laurent ; Kumaran, Dharshan ; Grae-

pel, Thore u. a.: Mastering chess and shogi by self-play with

a general reinforcement learning algorithm. In: arXiv preprint

arXiv:1712.01815 (2017) 23, 68

[SHZ+18] Sandler, Mark ; Howard, Andrew ; Zhu, Menglong ; Zhmogi-

nov, Andrey ; Chen, Liang-Chieh: Inverted residuals and linear

bottlenecks: Mobile networks for classification, detection and seg-

mentation. In: arXiv preprint arXiv:1801.04381 (2018) 72

[SLJ+15] Szegedy, Christian ; Liu, Wei ; Jia, Yangqing ; Sermanet, Pierre ; Reed, Scott ; Anguelov, Dragomir ; Erhan, Dumitru ; Vanhoucke, Vincent ; Rabinovich, Andrew u. a.: Going deeper with convolutions. In: CVPR, 2015 37, 50, 51

[SMDH13] Sutskever, Ilya ; Martens, James ; Dahl, George ; Hinton,

Geoffrey: On the importance of initialization and momentum in

deep learning. In: International conference on machine learning,

2013, S. 1139–1147 39

[SMH+11] Susskind, Joshua ; Mnih, Volodymyr ; Hinton, Geoffrey u. a.:

On deep generative models with applications to recognition. In:


Computer Vision and Pattern Recognition (CVPR), 2011 IEEE

Conference on IEEE, 2011, S. 2857–2864 246

[SSS+17] Silver, David ; Schrittwieser, Julian ; Simonyan, Karen ;

Antonoglou, Ioannis ; Huang, Aja ; Guez, Arthur ; Hubert,

Thomas ; Baker, Lucas ; Lai, Matthew ; Bolton, Adrian u. a.:

Mastering the game of go without human knowledge. In: Nature

550 (2017), Nr. 7676, S. 354 vi, 5, 23, 63, 68

[SVK17] Su, Jiawei ; Vargas, Danilo V. ; Kouichi, Sakurai: One

pixel attack for fooling deep neural networks. In: arXiv preprint

arXiv:1710.08864 (2017) 70

[SVZ13] Simonyan, Karen ; Vedaldi, Andrea ; Zisserman, Andrew:

Deep inside convolutional networks: Visualising image classifica-

tion models and saliency maps. In: arXiv preprint arXiv:1312.6034

(2013) 69

[SWL+16] Shi, Baoguang ; Wang, Xinggang ; Lyu, Pengyuan ; Yao, Cong

; Bai, Xiang: Robust scene text recognition with automatic rec-

tification. In: Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition, 2016, S. 4168–4176 101, 102

[SZ14a] Simonyan, Karen ; Zisserman, Andrew: Two-stream convolu-

tional networks for action recognition in videos. In: Advances in

neural information processing systems, 2014, S. 568–576 246

[SZ14b] Simonyan, Karen ; Zisserman, Andrew: Very deep convolutional

networks for large-scale image recognition. In: arXiv preprint

arXiv:1409.1556 (2014) 245

[SZ15] Simonyan, Karen ; Zisserman, Andrew: Very deep convolutional

networks for large-scale image recognition. In: ICLR, 2015 37, 50

[TCP+18] Tan, Mingxing ; Chen, Bo ; Pang, Ruoming ; Vasudevan, Vijay

; Le, Quoc V.: MnasNet: Platform-Aware Neural Architecture

Search for Mobile. In: arXiv preprint arXiv:1807.11626 (2018)

72, 75


[TOHC15] Tokui, Seiya ; Oono, Kenta ; Hido, Shohei ; Clayton, Justin:

Chainer: a next-generation open source framework for deep learn-

ing. In: Proceedings of workshop on machine learning systems

(LearningSys) in the twenty-ninth annual conference on neural in-

formation processing systems (NIPS) Bd. 5, 2015, S. 1–6 9

[TYRW14] Taigman, Yaniv ; Yang, Ming ; Ranzato, Marc’Aurelio ; Wolf,

Lior: Deepface: Closing the gap to human-level performance in

face verification. In: Proceedings of the IEEE conference on com-

puter vision and pattern recognition, 2014, S. 1701–1708 246

[ULVL16] Ulyanov, Dmitry ; Lebedev, Vadim ; Vedaldi, Andrea ; Lem-

pitsky, Victor S.: Texture Networks: Feed-forward Synthesis of

Textures and Stylized Images. In: ICML, 2016, S. 1349–1357 246

[VDODZ+16] Van Den Oord, Aaron ; Dieleman, Sander ; Zen, Heiga ;

Simonyan, Karen ; Vinyals, Oriol ; Graves, Alex ; Kalch-

brenner, Nal ; Senior, Andrew W. ; Kavukcuoglu, Koray:

WaveNet: A generative model for raw audio. In: SSW, 2016, S.

125 66, 246

[VLBM08] Vincent, Pascal ; Larochelle, Hugo ; Bengio, Yoshua ; Man-

zagol, Pierre-Antoine: Extracting and composing robust features

with denoising autoencoders. In: Proceedings of the 25th interna-

tional conference on Machine learning ACM, 2008, S. 1096–1103

23, 61

[VRD+15] Venugopalan, Subhashini ; Rohrbach, Marcus ; Donahue,

Jeffrey ; Mooney, Raymond ; Darrell, Trevor ; Saenko, Kate:

Sequence to sequence-video to text. In: Proceedings of the IEEE

international conference on computer vision, 2015, S. 4534–4542

63

[VTBE15] Vinyals, Oriol ; Toshev, Alexander ; Bengio, Samy ; Erhan,

Dumitru: Show and tell: A neural image caption generator. In:


Proceedings of the IEEE conference on computer vision and pattern

recognition, 2015, S. 3156–3164 228, 245

[WBB11] Wang, Kai ; Babenko, B. ; Belongie, S.: End-to-end scene

text recognition. In: 2011 International Conference on Computer

Vision, 2011, S. 1457–1464 101, 102

[WSC+16] Wu, Yonghui ; Schuster, Mike ; Chen, Zhifeng ; Le, Quoc V. ;

Norouzi, Mohammad ; Macherey, Wolfgang ; Krikun, Maxim

; Cao, Yuan ; Gao, Qin ; Macherey, Klaus u. a.: Google’s neural

machine translation system: Bridging the gap between human and

machine translation. In: arXiv preprint arXiv:1609.08144 (2016)

63, 246

[WSRS+17] Wang, Yuxuan ; Skerry-Ryan, RJ ; Stanton, Daisy ; Wu,

Yonghui ; Weiss, Ron J. ; Jaitly, Navdeep ; Yang, Zongheng ;

Xiao, Ying ; Chen, Zhifeng ; Bengio, Samy u. a.: Tacotron: A

fully end-to-end text-to-speech synthesis model. In: arXiv preprint

(2017) 245

[WYBM16] Wang, Cheng ; Yang, Haojin ; Bartz, Christian ; Meinel,

Christoph: Image Captioning with Deep Bidirectional LSTMs.

In: Proceedings of the 2016 ACM on Multimedia Conference. New

York, NY, USA : ACM, 2016 (MM ’16). – ISBN 978–1–4503–3603–

1, 988–997 vi, 7, 12, 35, 63, 73, 222, 245

[WYM15a] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: Deep se-

mantic mapping for cross-modal retrieval. In: Tools with Artificial

Intelligence (ICTAI), 2015 IEEE 27th International Conference on

IEEE, 2015, S. 234–241 11

[WYM15b] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: Visual-

Textual Late Semantic Fusion Using Deep Neural Network for

Document Categorization. In: International Conference on Neural

Information Processing Springer, 2015, S. 662–670 11


[WYM16a] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: A deep

semantic framework for multimodal representation learning. In:

Multimedia Tools and Applications 75 (2016), Aug, Nr. 15, 9255–

9276. http://dx.doi.org/10.1007/s11042-016-3380-8. – DOI

10.1007/s11042–016–3380–8 vi, 7, 11, 14, 82, 222

[WYM16b] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: Exploring

multimodal video representation for action recognition. In: Neural

Networks (IJCNN), 2016 International Joint Conference on IEEE,

2016, S. 1924–1931 11

[WYM18] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: Image Cap-

tioning with Deep Bidirectional LSTMs and Multi-Task Learning.

In: ACM Trans. Multimedia Comput. Commun. Appl. 14 (2018),

April, Nr. 2s, 40:1–40:20. http://dx.doi.org/10.1145/3115432.

– DOI 10.1145/3115432. – ISSN 1551–6857 vi, 7, 12, 14, 82, 222

[WYX+17] Wang, Bokun ; Yang, Yang ; Xu, Xing ; Hanjalic, Alan ; Shen,

Heng T.: Adversarial cross-modal retrieval. In: Proceedings of the

2017 ACM on Multimedia Conference ACM, 2017, S. 154–162 73

[WZZ+13] Wan, Li ; Zeiler, Matthew ; Zhang, Sixin ; Le Cun, Yann ;

Fergus, Rob: Regularization of neural networks using dropcon-

nect. In: International Conference on Machine Learning, 2013, S.

1058–1066 42

[XWCL15] Xu, Bing ; Wang, Naiyan ; Chen, Tianqi ; Li, Mu: Empirical

evaluation of rectified activations in convolutional network. In:

arXiv preprint arXiv:1505.00853 (2015) 37, 45

[YFBM17a] Yang, Haojin ; Fritzsche, Martin ; Bartz, Christian ; Meinel,

Christoph: Bmxnet: An open-source binary neural network imple-

mentation based on mxnet. In: Proceedings of the 2017 ACM on

Multimedia Conference ACM, 2017, S. 1209–1212 10

[YFBM17b] Yang, Haojin ; Fritzsche, Martin ; Bartz, Christian ; Meinel,

Christoph: BMXNet: An Open-Source Binary Neural Network


Implementation Based on MXNet. In: Proceedings of the 2017

ACM on Multimedia Conference. New York, NY, USA : ACM,

2017 (MM ’17). – ISBN 978–1–4503–4906–2, 1209–1212 14, 57,

58, 81

[YLHH14] Young, Peter ; Lai, Alice ; Hodosh, Micah ; Hockenmaier,

Julia: From image descriptions to visual denotations: New sim-

ilarity metrics for semantic inference over event descriptions. In:

Transactions of the Association for Computational Linguistics 2

(2014), S. 67–78 11, 222

[You18] YouTube: YouTube statistics. http://youtube.com/, 2018 vi,

3

[YWBM16] Yang, Haojin ; Wang, Cheng ; Bartz, Christian ; Meinel,

Christoph: SceneTextReg: A Real-Time Video OCR System. In:

Proceedings of the 2016 ACM on Multimedia Conference. New

York, NY, USA : ACM, 2016 (MM ’16). – ISBN 978–1–4503–

3603–1, 698–700 vi, 6, 9, 14, 38, 81, 219

[YWC+15] Yang, Haojin ; Wang, Cheng ; Che, Xiaoyin ; Luo, Sheng ;

Meinel, Christoph: An Improved System For Real-Time Scene

Text Recognition. In: Proceedings of the 5th ACM on International

Conference on Multimedia Retrieval. New York, NY, USA : ACM,

2015 (ICMR ’15). – ISBN 978–1–4503–3274–3, 657–660 9

[ZCS+17] Zhang, Quanshi ; Cao, Ruiming ; Shi, Feng ; Wu, Ying N. ;

Zhu, Song-Chun: Interpreting cnn knowledge via an explanatory

graph. In: arXiv preprint arXiv:1708.01785 (2017) 71

[ZCWZ17] Zhang, Quanshi ; Cao, Ruiming ; Wu, Ying N. ; Zhu, Song-

Chun: Mining object parts from cnns via active question-

answering. In: Proc IEEE Conf on Computer Vision and Pattern

Recognition, 2017, S. 346–355 71

[ZCY+16] Zhang, Yu ; Chen, Guoguo ; Yu, Dong ; Yao, Kaisheng ; Khudanpur, Sanjeev ; Glass, James: Highway long short-term memory rnns for distant speech recognition. In: Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on IEEE, 2016, S. 5755–5759 245

[ZCZ+17] Zhang, Quanshi ; Cao, Ruiming ; Zhang, Shengming ; Red-

monds, Mark ; Wu, Ying N. ; Zhu, Song-Chun: Interactively

transferring CNN patterns for part localization. In: arXiv preprint

arXiv:1708.01783 (2017) 71

[ZF14] Zeiler, Matthew D. ; Fergus, Rob: Visualizing and understand-

ing convolutional networks. In: European conference on computer

vision Springer, 2014, S. 818–833 69

[ZIE16] Zhang, Richard ; Isola, Phillip ; Efros, Alexei A.: Colorful

image colorization. In: European Conference on Computer Vision

Springer, 2016, S. 649–666 245

[ZKSE16] Zhu, Jun-Yan ; Krahenbuhl, Philipp ; Shechtman, Eli ;

Efros, Alexei A.: Generative visual manipulation on the natu-

ral image manifold. In: European Conference on Computer Vision

Springer, 2016, S. 597–613 245

[ZL16] Zoph, Barret ; Le, Quoc V.: Neural architecture search with re-

inforcement learning. In: arXiv preprint arXiv:1611.01578 (2016)

75

[ZVSL17] Zoph, Barret ; Vasudevan, Vijay ; Shlens, Jonathon ; Le, Quoc V.: Learning transferable architectures for scalable image recognition. In: arXiv preprint arXiv:1707.07012 (2017) 72

[ZWN+16] Zhou, Shuchang ; Wu, Yuxin ; Ni, Zekun ; Zhou, Xinyu ; Wen, He ; Zou, Yuheng: DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. In: arXiv preprint arXiv:1606.06160 (2016) 9, 10, 72


[ZZ18] Zhang, Quan-shi ; Zhu, Song-Chun: Visual interpretability for

deep learning: a survey. In: Frontiers of Information Technology

& Electronic Engineering 19 (2018), Nr. 1, S. 27–39 69

[ZZLS17] Zhang, Xiangyu ; Zhou, Xinyu ; Lin, Mengxiao ; Sun, Jian: ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In: arXiv preprint arXiv:1707.01083 (2017) 58, 72


Acronyms

ACM Association for Computing Machinery

AI Artificial Intelligence

AM Acoustic Model

ASR Automated Speech Recognition

BIC Bayesian Information Criterion

CC Connected Component

CMU Carnegie Mellon University

CNN Convolutional Neural Network

CPU Central Processing Unit

CV Computer Vision

DCT Discrete Cosine Transform

DL Deep Learning

DNN Deep Neural Network

eLBP edge-Based Local Binary Pattern

ERSB Energy Ratio in Subband

FTP File Transfer Protocol

GPU Graphics Processing Unit

GUI Graphical User Interface

HOG Histogram of Oriented Gradients

HPI Hasso Plattner Institute

HSV Hue Saturation Value

HTTP Hypertext Transfer Protocol

ICDAR International Conference on Document Analysis and Recognition

IDC International Data Corporation

IEEE Institute of Electrical and Electronics Engineers

IoT Internet of Things

IPv6 Internet Protocol Version 6

IR Information Retrieval

KFS Key-Frames Selection

LALM Line Against Line Matching

LBP Local Binary Pattern

LM Language Model

LOD Linked Open Data

LSTM Long Short-Term Memory

MFCC Mel Frequency Cepstral Coefficient

MSE Mean Square Error

NPR Non-Pitch Ratio

NSR Non-Silence Ratio

OCR Optical Character Recognition

OpenCV Open Source Computer Vision Library

OS Operating System

RBF Radial Basis Function

RNN Recurrent Neural Network

SEMEX SEmantic Media EXplorer

SIFT Scale Invariant Feature Transform

SMIL Synchronized Multimedia Integration Language

SOAP Simple Object Access Protocol

SPARQL SPARQL Protocol and RDF Query Language

SPR Smooth Pitch Ratio

Stanford POS Tagger Stanford Log-linear Part-Of-Speech Tagger


SVM Support Vector Machine, a supervised learning model with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis

TCP Transmission Control Protocol

TED Technology Entertainment and Design

tele-TASK tele-Teaching Anywhere Solution Kit

TFIDF Term Frequency Inverse Document Frequency

TREC Text REtrieval Conference

UDP User Datagram Protocol

UIMA Unstructured Information Management Architecture

VDR Volume Dynamic Range

VSM Vector Space Model

VSTD Volume Standard Deviation

WER Word Error Rate

WWW World Wide Web

ZCR Zero Crossing Rate

ZSTD Standard Deviation of Zero Crossing Rate
