Deep Representation Learning for Multimedia Data Analysis

Habilitationsschrift (cumulative)
for the attainment of the academic degree Dr. rer. nat. habil.
submitted to the Digital Engineering Fakultät
of the Universität Potsdam
presented by
Dr. rer. nat. Haojin Yang, born on 02.12.1981 in Henan
“I do not know what I may appear to the world, but to myself I seem to havebeen only like a boy playing on the seashore, and diverting myself in now and thenfinding a smoother pebble or a prettier shell than ordinary, whilst the great oceanof truth lay all undiscovered before me.”
– Isaac Newton
“There is only one heroism in the world: to see the world as it is, and to loveit.”
– Romain Rolland
Dean: Prof. Dr. Christoph Meinel, Digital Engineering Fakultät (DEF)
Reviewers: Prof. Dr. Christoph Meinel, Universität Potsdam
Prof. Dr. Wolfgang Effelsberg, Universität Mannheim
Prof. Dr. Ralf Steinmetz, Technische Universität Darmstadt
Examination committee: Prof. Dr. Felix Naumann, Informationssysteme DEF (chair)
Prof. Dr. Christoph Meinel, Internet-Technologien und -Systeme DEF
Prof. Dr. Wolfgang Effelsberg, Multimedia Systeme, Uni Mannheim
Prof. Dr. Ralf Steinmetz, Multimedia Kommunikation, TU Darmstadt
Prof. Dr. Andreas Polze, Betriebssysteme und Middleware DEF
Prof. Dr. Robert Hirschfeld, Software-Architekturen DEF
Prof. Dr. Patrick Baudisch, Human Computer Interaction DEF
Prof. Dr. Tobias Friedrich, Algorithm Engineering DEF
Prof. Dr. Erwin Böttinger, Digital Health Center DEF
Prof. Dr. Manfred Stede, Angewandte Computerlinguistik MNF
Submission: Wednesday, 10 October 2018
Colloquium: Title: “Deep Representation Learning for Multimedia Data Analysis”
Date: Friday, 10 May 2019, 2:00 p.m., HPI Hörsaal 2
Trial lecture: Title: “A Concise History of Neural Networks”
Date: Friday, 14 June 2019, 9:00 a.m., HPI Hörsaal 3
Abstract
In the last decade, due to the rapid development of digital devices, Internet
bandwidth and social networks, an enormous amount of multimedia data has
been created on the WWW (World Wide Web). According to publicly available
statistics, more than 400 hours of video are uploaded to YouTube every minute
[You18]; 350 million photos are uploaded to Facebook every day [Fac18]; and by
2021, video traffic is expected to make up more than 82% of all consumer
Internet traffic [Cis17]. There is thus a pressing need to develop automated
technologies for analyzing and indexing this “big multimedia data” more
accurately and efficiently. One of the current approaches is Deep Learning, a
method recognized as particularly effective for learning from multimedia data.
Deep learning (DL) is a sub-field of Machine Learning and Artificial Intelligence,
and is based on a set of algorithms that attempt to learn representations of data
and model their high-level abstractions. Since 2006, DL has attracted more and
more attention in both academia and industry. Recently, DL has produced
record-breaking results in a broad range of areas, such as beating humans in
strategic games like Go (Google’s AlphaGo [SSS+17]), autonomous driving
[BDTD+16], and achieving dermatologist-level classification of skin cancer
[EKN+17].
In this Habilitationsschrift, I mainly address the following research problems:
Natural scene text detection and recognition with deep learning. In this work, we
developed two automatic scene text recognition systems, SceneTextReg [YWBM16]
and SEE [BYM18], following a supervised and a semi-supervised processing
scheme, respectively. We designed novel neural network architectures and
achieved promising results in both recognition accuracy and efficiency.
Deep representation learning for multimodal data. We studied two sub-topics:
visual-textual feature fusion for multimodal and cross-modal document retrieval
[WYM16a], and visual-language feature learning with image captioning as its use
case. The developed captioning model robustly and efficiently generates novel
sentence descriptions for a given image [WYBM16, WYM18].
We developed BMXNet, an open-source Binary Neural Network (BNN) imple-
mentation based on the well-known deep learning framework Apache MXNet
[CLL+15]. We further conducted an extensive study on training strategies and
the execution efficiency of BNNs on image classification tasks. We obtained
meaningful scientific insights and made our models and code publicly available;
these can serve as a solid foundation for future research.
The operability and accuracy of all proposed methods have been evaluated using
publicly available benchmark datasets. While designing and developing theo-
retical algorithms, we also explored how to apply these algorithms in practice.
We investigated the applicability of two use cases, namely automatic online
lecture analysis and medical image segmentation. The results demonstrate that
such techniques might significantly impact or even disrupt traditional industries
and our daily lives.
Acknowledgments
First, I owe a great debt of gratitude to my family for their unconditional support
throughout the years of my doctoral studies and habilitation. I would like to
thank my parents: even though we are geographically separated, we have always
remained close to each other. Especially to my father, you are my paragon and
spiritual support forever. Words fail to express my appreciation to my love
Wei, my angel Yuetong and my little rider Yanqing. Without you, I could never
have finished this work.
Second, I would like to express my sincere gratitude to Professor Christoph
Meinel, my supervisor at the Hasso Plattner Institute, for his guidance and
inspiration. Professor Meinel, I am much indebted to you for sharing your
valuable time, and your professionalism never ceases to impress. Without your
support, I could not have successfully finished my work at HPI.
I would like to thank the colleagues who have contributed to this Habilita-
tionsschrift: Xiaoyin Che, Cheng Wang, Christian Bartz, Mina Rezaei and
Joseph Bethge.
Last but not least, I would like to thank Professor Hasso Plattner. You estab-
lished HPI, an excellent place for research, and your selfless support allows people
to chase great dreams. I am grateful for the wonderful time at HPI; everything
I experienced and learned here has changed my life and motivates me to keep
moving forward.
Contents
List of Figures xiii
List of Tables xvii
PART I: Introduction and Fundamentals 1
1 Introduction 3
1.1 Motivation and Scope . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Research Topics . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Publication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Deep Learning: The Current Highlighting Approach of Artificial Intelligence 17
2.1 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.1 Feature Engineering . . . . . . . . . . . . . . . . . . . . . 18
2.1.2 Representation Learning . . . . . . . . . . . . . . . . . . . 20
2.2 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . 24
2.2.1.1 Neural Network 1.0 . . . . . . . . . . . . . . . . . 25
2.2.1.2 Neural Network 2.0 . . . . . . . . . . . . . . . . . 25
2.2.2 Neural Network 3.0 - Deep Learning Algorithms . . . . . . 36
2.2.2.1 Data Preprocessing and Initialization . . . . . . . 37
2.2.2.2 Batch Normalization . . . . . . . . . . . . . . . . 40
2.2.2.3 Regularization . . . . . . . . . . . . . . . . . . . 42
2.2.2.4 Activation Function . . . . . . . . . . . . . . . . 43
2.2.2.5 Optimization Algorithms . . . . . . . . . . . . . 45
2.2.2.6 Loss Function . . . . . . . . . . . . . . . . . . . . 47
2.2.2.7 DNN Architectures . . . . . . . . . . . . . . . . . 48
2.2.2.8 Visualization Tool for Network Development . . . 59
2.3 Recent Development in the Age of Deep Learning . . . . . . . . . 60
2.3.1 Success Factors . . . . . . . . . . . . . . . . . . . . . . . . 61
2.3.2 DL Applications . . . . . . . . . . . . . . . . . . . . . . . . 62
2.3.3 DL Frameworks . . . . . . . . . . . . . . . . . . . . . . . . 63
2.3.4 Current Research Topics . . . . . . . . . . . . . . . . . . . 65
2.3.4.1 Region-based CNN . . . . . . . . . . . . . . . . . 65
2.3.4.2 Deep Generative Models . . . . . . . . . . . . . . 66
2.3.4.3 Weakly Supervised Model, e.g., Deep Reinforcement Learning . . . . . 68
2.3.4.4 Interpretable Machine Learning Research . . . . . 69
2.3.4.5 Energy Efficient Models for Low-power Devices . 71
2.3.4.6 Multitask and Multi-module Learning . . . . . . 72
2.3.4.7 Capsule Networks . . . . . . . . . . . . . . . . . 74
2.3.5 Current Limitation of DL . . . . . . . . . . . . . . . . . . 75
2.3.6 Applicable Scenario of DL . . . . . . . . . . . . . . . . . . 76
PART II: Selected Publications 78
3 Assignment to the Research Questions 81
4 SceneTextReg 83
4.1 Contribution to the Work . . . . . . . . . . . . . . . . . . . . . . 83
4.2 Manuscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 SEE 91
5.1 Contribution to the Work . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Manuscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Additional Experimental Results . . . . . . . . . . . . . . . . . . 100
5.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 100
5.3.1.1 Localization Network . . . . . . . . . . . . . . . . 100
5.3.1.2 Recognition Network . . . . . . . . . . . . . . . . 100
5.3.1.3 Implementation . . . . . . . . . . . . . . . . . . . 101
5.3.2 Experiments on Robust Reading Datasets . . . . . . . . . 101
6 Learning Binary Neural Networks with BMXNet 103
6.1 Contribution to the Work . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Manuscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7 Image Captioner 135
7.1 Contribution to the Work . . . . . . . . . . . . . . . . . . . . . . 135
7.2 Manuscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8 A Deep Semantic Framework for Multimodal Representation Learning 157
8.1 Contribution to the Work . . . . . . . . . . . . . . . . . . . . . . 157
8.2 Manuscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9 Automatic Lecture Highlighting 181
9.1 Contribution to the Work . . . . . . . . . . . . . . . . . . . . . . 181
9.2 Manuscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
10 Medical Image Semantic Segmentation 197
10.1 Contribution to the Work . . . . . . . . . . . . . . . . . . . . . . 197
10.2 Manuscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
11 Discussion 219
12 Conclusion 225
12.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
12.2 Some Concerns about DL . . . . . . . . . . . . . . . . . . . . . . 229
PART III: Appendices and References 231
A Ph.D. Publications 233
B Publications After Ph.D. 237
C Deep Learning Applications 245
References 249
Acronyms 277
List of Figures
2.1 The taxonomy of AI, ML and DL . . . . . . . . . . . . . . . . . . 18
2.2 Scale drives DL progress (source: Andrew Ng’s lecture 2013 ) . . . . 19
2.3 HOG feature for face verification . . . . . . . . . . . . . . . . . . 20
2.4 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Hierarchical feature learning of DNN (image source: Zeiler and
Fergus 2013 ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Visual pathway of visual cortex (image source: Simon Thorpe) . . 22
2.7 Brief history of machine learning . . . . . . . . . . . . . . . . . . 24
2.8 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.9 Non-linear activation, e.g., Sigmoid function . . . . . . . . . . . . 26
2.10 The CNN architecture of Lenet (image source: Lenet5 [L+15]) . . 29
2.11 A convolution operation on an image using a 3 × 3 kernel. Each
pixel in the output image is the weighted sum of 9 pixels in the
input image. (image credit: Tom Herold) . . . . . . . . . . . . . . 30
2.12 A pooling operation using a 2 × 2 kernel, stride 2, where max-
pooling and average pooling are demonstrated. . . . . . . . . . . . 31
2.13 Computational graph of a RNN in folded (left) and unfolded view
(right). (image credit: Xiaoyin Che) . . . . . . . . . . . . . . . . 32
2.14 Detailed structure of a “Vanilla” RNN cell. (image credit: Xiaoyin
Che) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.15 Detailed structure of a LSTM cell. (image credit: Xiaoyin Che) . 34
2.16 Computational graph of a Bidirectional RNN. (image credit: Xiaoyin Che) . . . 36
2.17 (Left) Images generated by our text sample generation tool. (Right)
Images taken from ICDAR dataset of the robust reading challenge.
(image credit: Christian Bartz ) . . . . . . . . . . . . . . . . . . . 38
2.18 Ground truth image created from computer games. (image credit:
[RVRK16]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.19 Pictorial representation of the concept Dropout . . . . . . . . . . 42
2.20 Activation function: (top-left) sigmoid function, (bottom-left) derivative
of sigmoid function, (top-middle) tanh function, (bottom-middle) derivative
of tanh function, (top-right) ReLU function, (bottom-right) derivative of
ReLU function. . . . . . . . . . . 43
2.21 Derived linear unit activation function: (left) PReLU and LReLU
activation function, (right) ELU activation function. . . . . . . . . 45
2.22 Architecture of AlexNet . . . . . . . . . . . . . . . . . . . . . . . 49
2.23 Structure diagram of VGG-Net . . . . . . . . . . . . . . . . . . . 50
2.24 A stack of three convolution layers with 3×3 kernels and stride 1 has
the same effective receptive field as a 7×7 convolution layer. . . 51
2.25 Structure diagram of GoogLeNet, which emphasizes the so-called
“Inception Module” . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.26 Naive version of “Inception Module”, where the numbers in the
figure e.g., 28×28×128 denote the width×height×depth of the fea-
ture maps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.27 “Bottleneck” design idea for a convolution layer: preserves width
and height, but reduces depth . . . . . . . . . . . . . . . . . . . . 53
2.28 “Inception Module” with “bottleneck” layers, where the numbers
in the figure e.g., 28×28×128 denote the width×height×depth of
the feature maps. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.29 Structure diagram of ResNet. Left: a “Residual Module”, right:
the overall network design. The different colors indicate the blocks
with various layer type or the different number of filters. . . . . . 55
2.30 “Residual Module”. Left: initial design of the residual block, right:
residual block with “Bottleneck” design. . . . . . . . . . . . . . . 56
2.31 A comparison of DNN models, which gives an indication for practical
applications. Left: top-1 model accuracy on the ImageNet dataset,
right: computation complexity comparison. (image credit: Alfredo
Canziani [CPC16]) . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.32 Exemplary visualization of VisualBackProp method. (image credit:
Mariusz Bojarski [BCC+16]) . . . . . . . . . . . . . . . . . . . . . 59
2.33 Block diagram of the VisualBackProp method. (image credit:
Mariusz Bojarski [BCC+16]) . . . . . . . . . . . . . . . . . . . . . 60
2.34 Top-5 errors of essential DL models in ImageNet challenge . . . . 62
2.35 Activity of DL frameworks. Left: arXiv mentions as of March 3,
2018 (past 3 months); Right: Github aggregate activity April -
July 2017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.36 Architecture of Generative Adversarial Networks (image source:
Gharakhanian) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.37 Multitask Learning Example . . . . . . . . . . . . . . . . . . . . . 73
5.1 Samples from ICDAR, SVT and IIIT5K datasets that show how
well our model finds text regions and is able to follow the slope of
the words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
List of Tables
1.1 Comparison of available implementations for binary neural net-
works. Unlike BMXNet, other implementations are difficult to
use for actual applications, because model saving and deployment
are not possible . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Classification accuracy on the ImageNet dataset of mainstream CNN
architectures. Essential information on the parameters is pro-
vided, such as the number of weights and computation operations,
etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2 DL framework features . . . . . . . . . . . . . . . . . . . . . . . . 64
2.3 DL frameworks benchmarking. Average time (ms) for 1000 images:
ResNet-50 Feature Extraction (Source: analyticsindiamag.com) . 65
5.1 Recognition accuracies on the ICDAR 2013, SVT and IIIT5K ro-
bust reading benchmarks. Here we only report results that do not
use per image lexicons. (*[JSVZ15] is not lexicon-free in the strict
sense as the outputs of the network itself are constrained to a 90k
dictionary.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
1
Introduction
This Habilitationsschrift presents multimedia analysis with deep learning tech-
nology as its central theme. In this chapter, I will demonstrate the need for and
the benefits of applying deep learning methods, and present the research problems
we aim to solve in this thesis. Subsequently, I will summarize the scientific
contributions and conclude the chapter with a structural overview of the thesis.
1.1 Motivation and Scope
In recent research on multimedia analysis and retrieval, multimodal represen-
tation learning has become one of the most beneficial methods. This research
trend has been established thanks to significant improvements in machine learn-
ing technology and to the nature of multimedia data, which consist of multiple
modalities conveying the common semantic meaning of information from hybrid
resources (e.g., visual, textual and auditory content).
Due to the rapid development of digital devices, Internet bandwidth, multi-
media portals, and social networks, the amount of multimedia data on the WWW
(World Wide Web) has become particularly huge. More than 400 hours of video
are uploaded to YouTube every minute [You18]; 350 million photos are uploaded
to Facebook every day [Fac18]; and by 2021, video traffic is expected to make up
more than 82% of all consumer Internet traffic [Cis17]. There is therefore a
pressing need to develop novel methods for processing this “big multimedia
data” more accurately and efficiently, to make it understandable and searchable
based on its content. Owing to their high efficiency, machine learning technolo-
gies have been widely applied in this domain. There are two benefits of doing
so. First, we can find more precise positions of target objects in multimedia
content, e.g., searching for a visual target that appears in the scene of an image
or a video, or locating spoken text in an audio recording. Moreover, we intend
to understand the semantic meaning of image or video content on top of the
visual and auditory recognition results, and further generate natural-sounding
language sentences describing the content. The second benefit is that machine
learning is a data-driven technology, which makes it highly suitable for process-
ing massive amounts of data; the development of machine learning algorithms
can therefore in turn benefit from multimedia data.
Artificial Intelligence (AI ) is intelligence exhibited by computers. The term is
applied when a machine mimics “cognitive” functions that humans associate
with other human minds, such as “learning” and “problem-solving”. Currently,
researchers and developers in this field are working on AI and machine learning
algorithms that train computers to mimic human skills such as “reading”,
“listening”, “writing” and “decision making”. Some AI applications, such as
Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR),
have recently become conventional technologies in industry. One of the current
machine learning approaches is Deep Learning, which is recognized as a partic-
ularly effective representation learning method for multimedia data.
Since 2006, Deep Learning (DL), based on deep neural networks, has attracted
more and more attention in both academia and industry. DL is a sub-field of
Machine Learning and Artificial Intelligence based on a set of algorithms that
attempt to learn representations of data and model their high-level abstractions.
In a deep neural network, there are multiple so-called neural layers between the
input and output. The algorithm uses these layers to learn higher abstractions,
composed of various linear and non-linear transformations. Due to rapidly in-
creasing computational power and the availability of copious amounts of training
data, deep learning has achieved impressive success in many topics such as
computer vision, speech recognition, and natural language processing. Recently,
DL has also produced record-breaking results in many application areas, e.g.,
beating humans in strategic games like Go (Google’s AlphaGo [SSS+17]),
self-driving cars [BDTD+16], achieving dermatologist-level classification of skin
cancer [EKN+17], etc.
1.1.1 Research Questions
Human learning behavior inspires many research ideas in DL. The human brain
can learn meaningful things from diverse contexts and perform several learning
tasks at the same time. Unfortunately, current machine learning algorithms can
only achieve high performance on individual perception tasks. It is still not
possible to build a generic perception model for multiple tasks, such as solving
computer vision problems like detecting and recognizing colors, objects, faces,
and text simultaneously. Moreover, humans learn new skills based on previous
experience and knowledge rather than from scratch. Thus, transfer learning and
multi-task learning are frequently discussed in the current DL research
community.
In our work, we raised the following research questions and conducted research
on corresponding topics, which will be discussed in Section 1.1.2.
• Q1: DL is data-hungry; how can we alleviate the reliance on substantial
data annotations?
– through synthetic data?
– through unsupervised and semi-supervised learning methods?
• Q2: How can we perform multiple computer vision tasks with a uniform
end-to-end neural network architecture?
• Q3: How can we apply DL models on low-power devices, e.g., smart-
phones, embedded devices, wearables and IoT (Internet of Things) devices?
• Q4: Can DL models benefit multimodal and cross-modal representation
learning tasks?
• Q5: Can we effectively and efficiently apply multimedia analysis and DL
algorithms in real-world applications?
1.1.2 Research Topics
We conducted and investigated the following research topics according to the
research questions raised in the previous section.
Natural Scene Text Detection and Recognition is one of the essential,
still unsolved problems in computer vision due to numerous difficulties, such
as varying contrast and image quality, complicated backgrounds, lighting effects,
blending and blurring, geometric distortion, and various font styles and font
sizes. In natural scenes, text can be found on cars, road signs, billboards, etc.
Automatically detecting and reading text from natural scene images is a crucial
component of systems used for several challenging tasks, such as image-based
machine translation, autonomous driving, and image or video indexing.
In our work, we developed two automatic scene text recognition systems, Scene-
TextReg [YWBM16] and SEE [BYM18], following a supervised and a semi-
supervised processing scheme, respectively. To address the lack of training data
for the fully supervised approach, we developed a data generator which can
efficiently create text image samples. We designed novel neural network architec-
tures and achieved promising results in both recognition accuracy and processing
speed.
Towards Lower Bit-width Neural Networks State-of-the-art deep mod-
els are computationally expensive and consume large amounts of storage. At
the same time, deep learning is in strong demand for applications in areas such
as mobile platforms, wearable devices, autonomous robots, and IoT devices.
How to efficiently run deep models on such low-power devices has become a
challenging research problem. The recently introduced Binary Neural Networks
(BNNs) are one possible solution to this problem.
We developed BMXNet, an open-source BNN implementation based on the
well-known deep learning framework Apache MXNet [CLL+15]. We conducted
an extensive study on training strategies and the execution efficiency of BNNs.
We systematically evaluated different network architectures and hyperparameters
to provide useful insights on how to train a BNN. Further, we present how we
improved classification accuracy by increasing the number of connections through
the network. We obtained meaningful scientific insights and made our models
and code publicly available. These can serve as a solid foundation for future
research.
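As a rough illustration of the core idea behind BNNs (a generic sketch, not the BMXNet implementation): when weights and activations are constrained to {−1, +1}, a dot product can be replaced by cheap bitwise XNOR and popcount operations, which is what makes BNNs attractive for low-power hardware:

```python
import numpy as np

def binarize(x):
    # Deterministic binarization: map to {-1, +1} via the sign (0 maps to +1)
    return np.where(x >= 0, 1, -1).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.standard_normal(64)   # full-precision weights
a = rng.standard_normal(64)   # full-precision activations

wb, ab = binarize(w), binarize(a)

# Encode -1 -> 0 and +1 -> 1; the {-1,+1} dot product then equals
# 2 * popcount(XNOR(w_bits, a_bits)) - n
wbits = (wb > 0).astype(np.uint8)
abits = (ab > 0).astype(np.uint8)
xnor = 1 - (wbits ^ abits)          # 1 exactly where the signs agree
dot_via_xnor = 2 * int(xnor.sum()) - len(w)

print(dot_via_xnor == int(np.dot(wb, ab)))  # True
```

The identity holds because the ±1 dot product is (number of agreeing positions) minus (number of disagreeing positions), i.e. 2·agreements − n. Training such networks additionally requires techniques like the straight-through gradient estimator, which this sketch does not cover.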
Deep Representation Learning for Multimodal Data Representation
learning [BCV13a], or feature learning, is a set of techniques for transforming
raw input data into good representations which can adequately support machine
learning algorithms. The performance of a machine learning algorithm heavily
relies on the quality of the data representation. Representation learning allows a
machine to automatically learn good discriminative representations in the con-
text of a specific machine learning task, making machine learning methods less
dependent on labor-intensive feature engineering.
We can apply representation learning methods to multimodal data: e.g., given
an image with its textual tags, the learned word representation can be combined
with the visual representation to enable exploration of the shared semantics of
the two modalities. Another example is automatic video indexing by combin-
ing visual and auditory representations. In multimodal representation learning,
a joint representation of two different modalities can be learned via end-to-end
neural networks. In our work, we therefore studied two sub-topics: visual-
textual feature fusion for multimodal and cross-modal document retrieval
[WYM16a], and visual-language feature learning with image captioning as its
use case [WYBM16, WYM18]. The developed captioning model robustly and
efficiently generates novel sentences describing arbitrary given images.
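The fusion idea can be sketched in a few lines. The following is a toy illustration only, not the actual model from [WYM16a]: the dimensions are hypothetical (2048-d visual features as from a CNN, 300-d textual features as from word vectors), and a single random linear projection stands in for a trained end-to-end fusion network:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical unimodal features for a batch of 4 image-text pairs
visual = rng.standard_normal((4, 2048))   # e.g., CNN image features
textual = rng.standard_normal((4, 300))   # e.g., aggregated word vectors

# Joint representation: concatenate the modalities and project into a
# shared 256-d space (in a real system, W is learned end-to-end)
W = rng.standard_normal((2048 + 300, 256)) * 0.01
fused = np.concatenate([visual, textual], axis=1)
joint = np.tanh(fused @ W)   # shared embedding for both modalities

print(joint.shape)  # (4, 256)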
We studied the feasibility and practicality of the developed algorithms in two
practical use cases: Automatic Online Lecture Analysis and Medical Image Seg-
mentation.
Automatic Online Lecture Analysis In this work, we propose a com-
prehensive solution for highlighting online lecture videos at different levels of
granularity. Our solution is based on automatic analysis of multimedia lecture
materials, such as speech, transcripts and lecture slides (both in file and
video format). The extracted highlighting information can assist learners,
especially in the context of MOOCs (Massive Open Online Courses). In com-
parison with ground truth created by experts, our approach achieves satisfactory
precision, outperforms baseline approaches, and was welcomed in user feedback.
Medical Image Segmentation In this work, we introduced a novel Condi-
tional Refinement Generative Adversarial Network to address the medical image
segmentation task. Our approach can mitigate several common problems in medi-
cal image segmentation, such as unbalanced class distributions and varying image
dimensions and resolutions. We achieved promising results on three popular
medical imaging datasets for semantic segmentation of abnormal tissues as well as
body organs, including the BraTS2017 dataset for brain tumor segmentation (MRI
images) [20117a], the LiTS2017 dataset for liver cancer segmentation (Computed
Tomography (CT ) images) [20117b], and the MDA231 microscopic light dataset
of human breast carcinoma cells [BE15]. Overall, the achieved results demonstrate
the strong generalization ability of the proposed method for the medical image
segmentation task.
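As background on the class-imbalance problem mentioned above: overlap-based measures such as the Dice coefficient are standard in medical segmentation precisely because they focus on the (often tiny) foreground region rather than the dominant background. A minimal generic sketch (an evaluation metric, not the thesis’s Conditional Refinement GAN):

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    # Dice coefficient: 2*|P ∩ T| / (|P| + |T|), robust to class imbalance
    # because correctly predicted background pixels do not inflate the score
    inter = np.sum(pred * target)
    return (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

# Toy 3x3 binary masks: predicted vs. ground-truth tumor region
pred   = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
target = np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]])
print(round(dice_score(pred, target), 3))  # 0.8
```

The same quantity, made differentiable over soft predictions, is also widely used directly as a training loss for imbalanced segmentation tasks.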
1.2 Contribution
As mentioned in the last section, the central theme of this Habilitationsschrift
is multimedia data analysis with deep learning algorithms. Thus, the scientific
contribution and publication can be categorized according to the involved analysis
tasks and developed deep learning frameworks. The main contributions of the
thesis are presented as follows:
• I extensively studied the text detection and recognition problem using deep
learning technology in several different application contexts.
– First, I developed SceneTextReg (demo: https://youtu.be/fSacIqTrD9I),
a real-time scene text recognition system. The system applies deep
neural networks in both the text detection and the word recognition
stage. I trained the corresponding models in a fully supervised manner.
SceneTextReg achieved the same level of word recognition accuracy as
Google’s PhotoOCR system [BCNN13a]. It is worth mentioning that
Google’s system was trained on millions of real-world samples created
by human annotators, whereas SceneTextReg was trained using only
synthetic samples generated by our data engine.
– Although the data generator works well for text, the same concept is
hard to port to arbitrary object classes most of the time, and it is tech-
nically impossible to simulate all scenarios. Therefore, with a long-term
vision in mind, unsupervised as well as semi-supervised methods are
desirable if we want to scale up the number of supported classes. We
intended to develop a semi-supervised system for object detection and
recognition. Because of the experience accumulated in the past, we
again chose text as the first experimental subject. We proposed SEE,
a semi-supervised system for end-to-end scene text recognition. This
system applies only a weak supervision signal for text detection and
achieved state-of-the-art accuracy on many popular open benchmark
datasets.
– As a useful application, we developed a new approach for writer-
independent verification of offline signatures. This approach is based
on deep metric learning. By comparing triplets of two genuine and
one forged signature, the system learns to embed signatures into a
high-dimensional space in which the Euclidean distance functions as a
metric of their similarity. Our system ranks best in nearly all evaluation
metrics of the ICDAR SigWiComp 2013 challenge [MLA+13].
(related publications: [BYM18, BYM17a, YWBM16, RYM16, YWC+15,
BQ15])
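The triplet objective underlying the signature-verification approach can be illustrated with a small sketch (toy 2-d embeddings for readability; real signature embeddings are high-dimensional and produced by a trained network):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Euclidean-distance triplet loss: pull the two genuine signatures
    # together and push the forged one away by at least `margin`
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical embeddings of two genuine signatures and one forgery
genuine_a = np.array([0.0, 0.0])
genuine_b = np.array([0.1, 0.0])
forged    = np.array([2.0, 0.0])

loss = triplet_loss(genuine_a, genuine_b, forged)
print(loss)  # 0.0: the forgery is already farther away than the margin
```

Minimizing this loss over many triplets shapes the embedding space so that a simple distance threshold can decide whether two signatures share a writer.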
• Binary neural networks (BNNs) seem to be a promising approach for devices
with low computational power. However, none of the existing mainstream
deep learning frameworks natively supports such binary neural layers, in-
cluding Caffe [JSD+14], Tensorflow [ABC+16], MXNet [CLL+15], PyTorch
[PCC+], Chainer [TOHC15]. Existing BNNs or quantized NN approaches
[RORF16, HCS+16, ZWN+16, LNZ+17, LZP17], which have promising re-
sults. However, there is often no source code for actual implementations
9
1. INTRODUCTION
Table 1.1: Comparison of available implementations for binary neural networks. Unlike BMXNet, the other implementations are difficult to use in actual applications, because actual model saving and deployment are not possible.

Title                 GPU  CPU  Python  C++  Save    Deploy  Open    Cross
                                API     API  Binary  on      Source  Platform
                                             Model   Mobile
BNNs [HCS+16]          X         X                            X
DoReFa-Net [ZWN+16]    X    X    X                            X        X
XNOR-Net [RORF16]      X                                      X
BMXNet [YFBM17a]       X    X    X       X    X       X       X        X
present (see Table 1.1). This makes follow-up research and application development based on BNNs difficult. Moreover, architectures, design choices, and hyperparameters are often presented without thorough explanation or experiments. To address these needs, we made the following contributions:
– First, we developed BMXNet [YFBM17a], an open-source BNN imple-
mentation based on the well-known deep learning framework Apache
MXNet [CLL+15]. We share our code and the developed models for research use, from which both academia and industry can benefit. To our knowledge, BMXNet is the first open-source BNN
implementation that supports binary model saving and de-
ployment on Android as well as iOS mobile devices.
– We further focus on increasing our understanding of the training pro-
cess and making it accessible to everyone. We provide novel empirical
proof for the choice of methods and parameters commonly used to train
BNNs, such as how to deal with the bottleneck architecture and the
gradient clipping threshold. We found that dense shortcut connections
can improve the classification accuracy of BNNs significantly and show
how to create robust models with this architecture. We present an
overview of the performance of commonly used network architectures
with binary weights.
(related publications: [YFBM17a, BYBM18, BYBM19])
1.2 Contribution
• Our contributions to multimodal representation learning are summarized as follows:
– In visual-textual multimodal representation learning, we proposed a deep semantic framework for mapping visual and textual features into a common feature space. By imposing supervised pre-training as a regularizer, we can better capture intra- and inter-modal relationships. For multimodal fusion, we show that combining visual and textual features achieves better performance than unimodal features. In our experiments, we used two mainstream datasets, the Wikipedia dataset [RCPC+10] and MIR Flickr 25K [HL08], and achieved state-of-the-art results on both cross-modal and multimodal retrieval tasks.
(related publication: [WYM16a, WYM15b, WYM15a])
– In multimodal video-representation learning, we explored the fusion of video appearance, motion, and auditory information to learn discriminative video representations. Our experimental results show that fusing spatial, temporal (motion), and auditory information can boost recognition performance with appropriate fusion strategies. Our approach achieved highly competitive performance compared to previous methods.
(related publication: [WYM16b])
– In visual-language representation learning, we developed an end-to-end trainable deep Bidirectional Long Short-Term Memory (BLSTM) network to capture the relationships between a visual input image and language sequences. The effectiveness and generalization ability of the proposed system have been evaluated on multiple benchmark datasets, including Flickr8K [RYHH10], Flickr30K [YLHH14], MSCOCO [LMB+14], and Pascal1K [RYHH10]. The experimental results show that our models outperformed related work on both the image captioning and the image-sentence retrieval task. Furthermore, we conducted a transfer-learning experiment on the Pascal1K dataset; the results demonstrate that even without using the training data from Pascal1K, our model still achieved the best performance on both tasks. We also developed a real-time captioning system for demonstration purposes, called Neural Visual Translator1.
(related publication: [WYBM16, WYM18])
• As mentioned previously, I extensively studied two application use cases; the corresponding contributions are summarized as follows:
– In automatic online lecture analysis, we proposed a comprehensive solution for highlighting online lecture videos at both the lecture-segment and the transcript-sentence level. Our solution is based on the automatic analysis of multimedia lecture materials such as speech, transcripts, and lecture slides (both in file and in video format). The extracted highlighting information can assist learners, especially in the context of MOOCs.
For sentence-level lecture highlighting based on the audio and subtitles of MOOC videos, we achieved a precision of over 60% compared with ground truth created by experts. This is considerably better than the baseline work and was also welcomed in user feedback. Segment-level lecture highlighting works with statistical analysis, mainly by exploring speech transcripts, lecture slides, and their correlations. With ground truth created by a large number of users, the evaluation shows that the overall accuracy reaches 70%, which is reasonably promising. Finally, we conducted and report a correlation study of the two types of lecture highlights.
(related publication: [CYM18, CLYM16, CYM15, CYM13])
– In medical image segmentation, we introduced a novel Conditional Refinement Generative Adversarial Network to address the medical image segmentation task. We studied the effects of several crucial architectural choices for the semantic segmentation task on medical imaging. We also introduced a patient-wise mini-batch normalization technique that helps to accelerate the learning process and improves accuracy.
1https://youtu.be/a0bh9_2LE24
We achieved promising results on three well-known medical imaging datasets for the semantic segmentation of abnormal tissues as well as body organs: the BraTS2017 dataset for brain tumor segmentation (MRI images) [20117a], the LiTS2017 dataset for liver cancer segmentation (Computed Tomography (CT) images) [20117b], and the MDA231 microscopic light dataset of human breast carcinoma cells [BE15].
(related publication: [RYM19a, RYM18, RYM19b])
• The recently proposed Generative Adversarial Networks (GANs) [GPAM+14] achieved state-of-the-art results on a large variety of unsupervised learning tasks, such as image generation, audio synthesis, and human language generation. However, GANs still have several significant shortcomings, such as missing modes of the data distribution or even collapsing large amounts of probability mass onto some modes. We extensively studied the mode-collapse problem and proposed to incorporate adversarial dropout in generative multi-adversarial networks. Our approach forces the single generator not to constrain its output to satisfy a single discriminator but to fulfill a dynamic ensemble of discriminators. We show that this approach leads to a more generalized generator, promoting variety in the generated samples and avoiding the mode-collapse problem commonly experienced with GANs. We provide evidence that the proposed solution promotes sample diversity on five different datasets, mitigates mode collapse, and further stabilizes training. (related publication: [MYM19a, MYM19b])
1.3 Publication
Earlier versions of several parts of this thesis have been published in international journals and presented at international scientific conferences.
According to the requirements of the cumulative Habilitationsschrift at the University of Potsdam, I prepared two publication lists, one for the time of my Ph.D. study and one for the period after the Ph.D.:
• Publications during my Ph.D. study (14): Appendix A
• Publications after Ph.D. (40+): Appendix B
The following list of selected publications from Appendix B, assigned to the corresponding research questions (defined in section 1.1.1), forms the basis of this cumulative Habilitationsschrift:
• Q1, Q2:
– “SceneTextReg: A Real-Time Video OCR System” [YWBM16]
– “SEE: Towards Semi-Supervised End-to-End Scene Text Recognition”
[BYM18]
• Q3:
– “BMXNet: An Open-Source Binary Neural Network Implementation
Based on MXNet” [YFBM17b]
– “Learning to Train a Binary Neural Network” [BYBM18]
– “Back to Simplicity: How to Train Accurate BNNs from Scratch?”
[BYBM19]
• Q4:
– “Image Captioning with Deep Bidirectional LSTMs and Multi-Task
Learning” [WYM18]
– “A Deep Semantic Framework for Multimodal Representation Learn-
ing” [WYM16a]
• Q5:
– “Automatic Online Lecture Highlighting Based on Multimedia Analy-
sis” [CYM18]
– “Recurrent Generative Adversarial Network for Learning Imbalanced
Medical Image Semantic Segmentation” [RYM19b]
1.4 Outline of the Thesis
The thesis is organized in the following manner.
Many commonly used DL techniques were proposed in the past two to three years and are updated rapidly. They are relatively new to readers without sufficient DL knowledge, and thorough explanations of those techniques are usually not provided in the selected scientific publications in Part II of the thesis. Thus, to improve the reader's understanding, I wrote a "foundations" chapter (Chapter 2), which presents the fundamentals of DL techniques as well as a comprehensive review of recent efforts, developments, and current limitations of DL technologies.
Chapters 4 to 10 present the selected publications for this cumulative Habilitationsschrift (for the paper list, cf. section 1.3). I provide an overview as well as a clarification of my own contribution to each selected paper.
Chapter 11 discusses the achievements and the current limitations of the approaches presented in this thesis. Chapter 12 concludes the thesis and provides a comprehensive outlook on future work, followed by the appendices and references.
2
Deep Learning: The Current
Highlighting Approach of
Artificial Intelligence
Machine Learning (ML), a subset of Artificial Intelligence (AI) with roots in the 1950s, has been revolutionizing numerous application fields over the last few decades. Artificial Neural Networks (ANN) are a subfield of ML, from which Deep Learning (DL) spawned; DL is considered a subfield of representation learning (cf. Figure 2.1).
Deep learning has demonstrated enormous success in a large variety of applications since AlexNet1 won the ImageNet challenge [DDS+09]. This research field of machine learning has been growing rapidly and has opened new opportunities in AI research. Different models have been proposed for the different classes of learning approaches, including supervised, semi-supervised, unsupervised, and deep reinforcement learning. In most cases, experimental results show the state-of-the-art performance of deep learning over traditional machine learning methods in fields such as speech recognition, machine translation, computer vision, image and video processing, medical imaging, robotics, and natural language processing (NLP). Meanwhile, DL is impacting many industrial products, e.g., autonomous driving, digital assistants, and digital health. The success of deep learning has opened the current wave of AI.
1AlexNet is a deep neural network architecture, proposed by Krizhevsky et al. [KSH12]
Figure 2.1: The taxonomy of AI, ML and DL
One of the most crucial advantages of DL is the ability to hierarchically learn features from large-scale data. If we say ML is a data-driven technique, then we can also say that scale is driving DL progress. Figure 2.2 compares deep neural networks of different sizes with traditional ML algorithms. It shows that one of the most significant benefits of DL models is their superior fitting and generalization ability: their performance can be significantly improved by adding new training data and appropriately increasing the network complexity. As the amount of data increases, the performance of traditional machine learning approaches plateaus, whereas the performance of deep learning models keeps increasing with the amount of data.
2.1 Data Representation
2.1.1 Feature Engineering
In traditional ML, Feature Engineering, the process of using domain knowledge of the data to create features that make ML algorithms work, is fundamental to most applications. Creating handcrafted features is a time- and cost-expensive task that requires expert knowledge. Therefore, researchers sought to explore the feasibility of using algorithms such as artificial neural networks to perform automated feature learning.
Figure 2.2: Scale drives DL progress (source: Andrew Ng's lecture, 2013)
In a traditional ML approach, given a new problem, we often perform the following working steps:
• Data preparation: create a labeled dataset such as CIFAR [KH09] or ImageNet [DDS+09] for image classification
• Spend hours hand-engineering representative features, e.g., HOG [DT05], SIFT [Low04], LBP [OPM02], or bag-of-words [MSC+13a], to feed into an ML algorithm
• Evaluate different ML algorithms, e.g., SVM [BGV92] or Random Forest [LW+02]
• Repeat the feature engineering and evaluation steps and pick the best configuration for the application
Figure 2.3 shows how the Histogram of Oriented Gradients (HOG), a traditional feature engineering method, is used for face verification. We first take a candidate face region image as the input. We then apply an edge filter to create the gradient map of the input image. Based on the gradient magnitudes in the horizontal and vertical directions, we can further calculate the gradient direction at each image pixel. We then use a histogram to compute the statistics of the gradient directions in local regions. This histogram of gradient directions, the so-called HOG feature, is fed into an ML classifier such as an SVM to distinguish the class categories "face" and "non-face."
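The gradient-histogram idea behind HOG can be sketched in a few lines of numpy. This is a deliberately simplified whole-image variant for illustration only: real HOG descriptors use local cells, block normalization, and fixed binning schemes, and all names here are illustrative.

```python
import numpy as np

def orientation_histogram(image, n_bins=9):
    """Simplified HOG-style descriptor: compute per-pixel gradients with
    finite differences, then histogram the gradient directions over the
    whole image, weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))   # row- and column-wise gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % np.pi    # unsigned direction in [0, pi)
    hist, _ = np.histogram(orientation, bins=n_bins,
                           range=(0.0, np.pi), weights=magnitude)
    # Normalize so the descriptor is invariant to overall contrast.
    return hist / (hist.sum() + 1e-8)

# A vertical edge produces strong horizontal gradients (orientation ~ 0),
# so the first histogram bin dominates.
img = np.zeros((8, 8)); img[:, 4:] = 1.0
h = orientation_histogram(img)
```

Such a fixed-length descriptor would then be the input to a classifier such as an SVM, exactly as in the pipeline above.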
Figure 2.3: HOG feature for face verification
2.1.2 Representation Learning
DL, or Deep Neural Networks (DNN), on the other hand, consist of input, output, and several hidden layers in between (cf. Figure 2.4). DNNs allow many stages of non-linear information processing units (hidden layers) with hierarchical architectures that are exploited for feature learning and pattern classification [Sch15, LBH15]. All the weights in a DNN can be updated using the Backpropagation [RHW86a] algorithm.
This automated feature learning process is referred to as representation learning. Bengio et al. give a definition of Representation Learning in [BCV13b]: "Learning method based on representations of data can be defined as representation learning".
Recent literature describes DL-based representation learning as involving a hierarchy of features or concepts, where high-level concepts are defined from low-level ones. Figure 2.5 visualizes the learned features extracted from different levels of a DNN model: the low-level features are edges, corners, and gradients from the different input color channels; the features from mid-level layers are feature groups showing parts of objects; finally, the high-level features depict more complete objects such as faces, wheels, and bodies.
Biological evidence for this kind of hierarchical structure has been found by Gallant and Van Essen et al. [FV91]: the mammalian visual cortex is hierarchical. In Figure 2.6, the ventral (recognition) pathway in the visual cortex has multiple stages, Retina - LGN - V1 - V2 - V4 - PIT - AIT ..., which contains lots
Figure 2.4: Artificial Neural Networks
Figure 2.5: Hierarchical feature learning of DNN (image source: Zeiler and Fergus 2013 )
of intermediate representations of the visual signal, e.g., simple visual forms like edges and corners at V1, intermediate visual forms like feature groups at V4, and high-level object descriptions like faces at AIT. This study revealed the essential biological significance of DNNs and shows that with DNNs we can learn higher abstractions (high-level features) from late-stage layers, which demonstrate superior performance in a wide range of application domains. In some articles, researchers describe DL as a universal learning approach that can solve almost all kinds of problems in different application domains, i.e., DL is not task-specific [B+09].
Figure 2.6: Visual pathway of visual cortex (image source: Simon Thorpe)
2.2 Fundamentals
According to the Deep Learning book [GBCB16], DL approaches can be divided into three categories: supervised, semi-supervised, and unsupervised learning. Moreover, there is a further subfield of learning approaches called Deep Reinforcement Learning (DRL), which is often discussed under the scope of semi-supervised or weakly supervised learning methods.
Supervised Learning The methods in this category share a common characteristic: learning with fully labeled data, i.e., each sample x_t from a dataset X has a corresponding label y_t. Thus the environment provides a set of inputs and their corresponding outputs (x_t, y_t) ∼ ρ. A DL model is trained using the dataset X.
For instance, if for an input x_t the model predicts ŷ_t = f(x_t), then the model receives a loss value L(ŷ_t, y_t). The training algorithm then iteratively updates the model parameters to better approximate the desired outputs. The training process stops when we are satisfied with the model outputs. After successful training, the model can make correct predictions for given inputs. Supervised learning approaches for DL include Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), the latter including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). CNNs and LSTM, the key techniques applied in our work, are thoroughly described in section 2.2.1.2.
Unsupervised Learning Unsupervised learning methods can be roughly summarized as learning approaches without labels. In this category, the learning algorithms learn internal representations and essential features to discover unknown relationships or structures within the input dataset. Typically, clustering, generative models, and dimensionality reduction methods are considered unsupervised learning approaches. In the development of DL, some unsupervised learning methods have achieved great success, including Auto-Encoders (AE) [VLBM08], Restricted Boltzmann Machines (RBM) [HS06], and the recently proposed Generative Adversarial Networks (GANs) [GPAM+14]. Section 2.3.4.2 gives a brief overview of GANs. Moreover, LSTM and RL are also applied for unsupervised learning in many application use cases.
Semi-supervised Learning, Deep Reinforcement Learning A learning approach that uses partially labeled datasets or a weakly supervised signal is considered a semi-supervised or weakly supervised method. In recent DL research, Deep Reinforcement Learning is one of the typical semi-supervised learning techniques.
For a given task, an RL system is placed within a task-specific environment and executes a set of actions. The consequence of each action on the environment is measured, and a reward value is calculated. By maximizing the reward, the system finds the series of actions that is most effective towards achieving the specified task. This way of working differs from supervised learning and the other kinds of learning approaches studied before, including traditional statistical ML methods and ANN.
RL can be applied in many different fields, such as decision making in the fundamental sciences and ML in computer science. Furthermore, the reward strategy has been widely studied over the last decades in engineering and mathematics, e.g., for robotics control and power station control.
Deep Reinforcement Learning (DRL) was first demonstrated in 2013 in Google DeepMind's paper [MKS+13]. Since then, DRL has achieved great success in mastering strategy games, including AlphaGo and AlphaGo Zero for the game Go [SSS+17], and other games like Atari [MKS+15], Dota 2 [Blo18], Chess, and Shogi [SHS+17].
Mathematically, let x_t ∼ ρ denote the training samples, for which the intelligent agent gives the prediction ŷ_t = f(x_t). The agent then receives a cost value c_t ∼ P(c_t | x_t, ŷ_t), where P is an unknown probability distribution: the environment asks the agent a question and returns a noisy score for the agent's answer. The fundamental differences between RL and supervised learning are the following. First, there is no straightforward loss function defined; in other words, we do not have full knowledge of the objective function being optimized and have to query it through interaction to get feedback. Second, we are interacting with a state-based environment: the actual input x_t depends on previous actions.
LeCun et al. [LBH15] pointed out that DRL is one of the most promising directions of DL research.

Figure 2.7: Brief history of machine learning
2.2.1 Artificial Neural Networks
In this section, I introduce the fundamental technologies related to Artificial Neural Networks along their historical development timeline.
Figure 2.7 depicts a brief historical timeline of ML, from which we can categorize the development of ANN into three time phases, discussed in the subsequent sections. A more comprehensive description of ANN can be found in the literature [BB+95, GBCB16].
Figure 2.8: Perceptron
2.2.1.1 Neural Network 1.0
McCulloch and Pitts (1943) showed that neurons can be combined to construct a Turing machine using ANDs, ORs, and NOTs [MP43]. Inspired by McCulloch and Pitts, Rosenblatt invented the Perceptron [Ros58] in 1958. He showed that the perceptron converges if the learning objective can be well represented. The perceptron outputs a binary result y based on a linear combination of weighted inputs and a threshold θ:

    y = { 1  if Σ_i w_i x_i + b > θ
        { 0  otherwise                                      (2.1)

where w_i, x_i, and b respectively denote the weight parameters, the inputs, and a bias term.
In 1969, Minsky and Papert [MP69] showed the limitations of the Perceptron: it is not able to learn the XOR function, which is not linearly separable. This result effectively halted research in neural networks for a decade.
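Equation 2.1 is simple enough to state directly in code (a minimal numpy sketch; the AND example and the parameter values are illustrative):

```python
import numpy as np

def perceptron(x, w, b, theta=0.0):
    """Rosenblatt's perceptron (Equation 2.1): output 1 if the weighted
    sum of the inputs plus the bias exceeds the threshold theta, else 0."""
    return 1 if np.dot(w, x) + b > theta else 0

# A perceptron computing logical AND: it fires only when both inputs are 1.
w, b = np.array([1.0, 1.0]), -1.5
outputs = [perceptron(np.array(x), w, b) for x in
           [(0, 0), (0, 1), (1, 0), (1, 1)]]   # [0, 0, 0, 1]
```

For XOR, by contrast, no choice of (w, b) can produce the outputs [0, 1, 1, 0], which is exactly Minsky and Papert's linear-separability argument.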
2.2.1.2 Neural Network 2.0
In the 1980s, the second wave of AI research emerged and built up the basics that are employed today in the DL community. Many essential techniques were proposed during this period.
First, the most significant limitation of the Perceptron, its restriction to linear problems, was overcome by adding a non-linear activation function to the neural layer. Figure 2.9 depicts a commonly used non-linear activation function, the Sigmoid function, which is applied on top of the linear combination of the weighted inputs. Hornik et al. [HSW89] and Cybenko [Cyb89] proved that multilayer feedforward networks are universal approximators: even a very simple ANN with a single hidden layer followed by a sigmoid activation function can approximate any function from one finite-dimensional space to another with any desired accuracy, provided that the number of hidden neurons is large enough.

Figure 2.9: Non-linear activation, e.g., Sigmoid function
2.2.1.2.1 Backpropagation (BP) By adding a non-linear activation function to each layer, the Multi-Layer Perceptron (MLP) was obtained. As the name suggests, an MLP consists of multiple perceptrons arranged in layers. However, how to effectively train an MLP remained a challenging problem until the BP algorithm was proposed by Rumelhart and Hinton et al. [RHW86b]. Algorithm 1 shows the pseudocode of basic BP. Since MLPs (and ANN models in general) can be represented as computation graphs, we can apply the chain rule to efficiently propagate the gradient from the network output back to the front layers, as shown in Algorithm 1 for a single-path network.
We can define a composite function for an L-layer ANN:

    y = f(x) = ϕ(w_L ⋯ ϕ(w_2 ϕ(w_1 x + b_1) + b_2) ⋯ + b_L)    (2.2)

For L = 2, Equation 2.2 can be rewritten as

    y = f(x) = f(g(x))                                          (2.3)

and, according to the chain rule, its derivative is

    ∂y/∂x = ∂f(x)/∂x = f′(g(x)) · g′(x)                         (2.4)
Algorithm 1 Backpropagation
Input: a network with l layers, activation functions σ_i, hidden-layer outputs
    h_i = σ_i(W_i^T h_{i-1} + b_i), and the network output ŷ = h_l
Initialize the gradient at the output:
    δ ← ∂ε(ŷ, y)/∂ŷ
For i ← l down to 0 do
    Calculate the gradient w.r.t. the weights of the present layer:
        ∂ε(ŷ, y)/∂W_i = (∂ε(ŷ, y)/∂h_i)(∂h_i/∂W_i) = δ ∂h_i/∂W_i
    Calculate the gradient w.r.t. the bias of the present layer:
        ∂ε(ŷ, y)/∂b_i = (∂ε(ŷ, y)/∂h_i)(∂h_i/∂b_i) = δ ∂h_i/∂b_i
    Apply SGD using ∂ε(ŷ, y)/∂W_i and ∂ε(ŷ, y)/∂b_i
    Backpropagate the gradient to the previous layer:
        δ ← δ ∂h_i/∂h_{i-1}
End
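Algorithm 1 can be made concrete with a short numpy sketch. This is illustrative only: the squared-error loss, the sigmoid activations, the layer sizes, the learning rate, and the XOR toy data are choices made for the example, not specifics from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse(out, y):
    return float(((out - y) ** 2).mean())

# Toy dataset: XOR, the function a single perceptron cannot learn.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden layer, 4 units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output layer
eta = 1.0                                        # learning rate

initial_error = mse(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), y)
for _ in range(5000):
    # Forward pass through both layers.
    h1 = sigmoid(X @ W1 + b1)
    out = sigmoid(h1 @ W2 + b2)
    # delta is d(error)/d(pre-activation) at the output layer
    # (squared-error loss combined with the sigmoid derivative).
    delta = (out - y) * out * (1 - out)
    dW2, db2 = h1.T @ delta, delta.sum(axis=0)
    # Chain rule (cf. Eq. 2.4): push delta back through W2 and the hidden sigmoid.
    delta = (delta @ W2.T) * h1 * (1 - h1)
    dW1, db1 = X.T @ delta, delta.sum(axis=0)
    # Gradient descent step (full batch here, for simplicity).
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

final_error = mse(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), y)
```

The backward pass reuses δ layer by layer, which is exactly the efficiency argument behind Algorithm 1: each layer's gradients cost only one extra matrix product.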
2.2.1.2.2 Stochastic Gradient Descent (SGD) In Algorithm 1, the model optimization is based on the SGD algorithm, a stochastic approximation of gradient descent optimization and an iterative method for minimizing a cost function. In DL, due to the large size of the datasets, SGD usually means mini-batch SGD: it performs an update for every mini-batch of n training examples, which reduces the variance of the parameter updates. Algorithm 2 explains SGD in detail.

Algorithm 2 Stochastic Gradient Descent (SGD)
Input: cost function ε, learning rate η, a dataset {X, y}, and the model ŷ = z(θ, x)
Output: the optimal θ that minimizes ε
Repeat until convergence:
    Shuffle {X, y}
    For each mini-batch (x_i, y_i) of size N in {X, y} do
        ŷ_i = z(θ, x_i)
        θ ← θ − η · (1/N) Σ_{i=1}^{N} ∂ε(ŷ_i, y_i)/∂θ
    End
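Algorithm 2 can be sketched as follows (illustrative only: the linear model, the squared-error cost, and all hyperparameters and data are choices made for the example):

```python
import numpy as np

# Mini-batch SGD on a linear model z(theta, x) = x @ theta with
# squared-error cost; the true parameters are recovered from noisy data.
rng = np.random.default_rng(1)
true_theta = np.array([2.0, -3.0])
X = rng.normal(size=(256, 2))
y = X @ true_theta + 0.01 * rng.normal(size=256)

theta, eta, batch = np.zeros(2), 0.1, 32
for epoch in range(50):
    perm = rng.permutation(len(X))                # shuffle {X, y}
    for start in range(0, len(X), batch):
        idx = perm[start:start + batch]
        xb, yb = X[idx], y[idx]
        y_hat = xb @ theta                        # model prediction
        grad = 2 * xb.T @ (y_hat - yb) / len(xb)  # mean gradient of the cost
        theta -= eta * grad                       # parameter update
```

Each update only looks at 32 of the 256 samples, which is precisely what makes SGD tractable on the large datasets typical for DL.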
Two epoch-making research works, Convolutional Neural Networks (CNN) [LBBH98] and Long Short-Term Memory (LSTM) [HS97], were published in this period. Since the two approaches are also the cornerstones of our recent work, I describe their fundamental principles in the following paragraphs.
2.2.1.2.3 Convolutional Neural Networks (CNNs) Fukushima first proposed CNNs in 1988 [Fuk88]. However, due to hardware limitations, CNNs first became popular in the research community after LeCun et al. obtained successful results for handwritten digit classification [LBBH98].
CNNs resemble the human vision system, being efficient at learning hierarchical abstractions of visual features (cf. section 2.1.2). The pooling layers of CNNs are effective at absorbing shape variations and reducing the feature dimensions. Furthermore, CNNs apply small receptive fields, which yield sparse connections: instead of performing a matrix multiplication with the entire input image at once, a CNN performs convolution operations with small kernels on the input. This is why CNNs have significantly fewer parameters than a fully-connected neural layer, and why CNNs have a translation-invariant property. Moreover, CNNs are trained with gradient-based learning algorithms such as SGD and suffer less from the vanishing gradient problem when using activation functions such as the Rectified Linear Unit (ReLU) [NH10].

Figure 2.10: The CNN architecture of Lenet (image source: Lenet5 [L+15])
Figure 2.10 shows the architecture of Lenet [L+15], developed by LeCun et al. in 1998 for handwritten digit recognition. Lenet is the cornerstone of current CNN architectures and consists of feature extractors and a classifier network. Each layer of the network receives the output of its adjacent previous layer as its input and passes its own output as input to the next layer. The feature extractors, in turn, consist of convolution and pooling (subsampling) layers, which are placed in the lower and middle levels of the network.
Generally, as the features propagate from lower-level layers to higher-level layers, the dimensions of the feature maps are reduced progressively, while the number of feature maps is usually increased in order to maintain the capacity of the feature representations. In this way, we obtain higher feature abstractions through this layer-wise structure without losing information density. The outputs of the last pooling layer are fed into a fully-connected (fc) network, the so-called classification layer. Feed-forward NNs are used as the classification layer in many previous network architectures. However, because fc layers are expensive in terms of network parameters, researchers tend to apply techniques such as average pooling and global average pooling [HZRS16] as an alternative to fc networks. The score of each class is calculated using a softmax layer, and the classifier outputs the class with the highest score.
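The final scoring step can be illustrated with a minimal softmax sketch (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1; the
    maximum is subtracted first for numerical stability."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])    # hypothetical fc-layer outputs
probs = softmax(logits)
predicted_class = int(probs.argmax())  # index of the highest-scoring class
```

The subtraction of the maximum leaves the result unchanged mathematically but prevents overflow in the exponentials for large scores.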
Figure 2.11: A convolution operation on an image using a 3 × 3 kernel. Each pixel in the output image is the weighted sum of 9 pixels in the input image. (image credit: Tom Herold)
More formally, convolution layers are expressed as:

    Y^j(r) = σ( Σ_i K^{ij}(r) ∗ X^i(r) + b^j(r) )    (2.5)

where X^i and Y^j are the i-th input and the j-th output map, respectively. K^{ij} denotes the convolution kernel applied to the input maps, and ∗ denotes the convolution operation. σ is a non-linear activation function, e.g., ReLU. b^j is the bias term of the j-th output map, and r indicates the local region where the convolution is performed.
Figure 2.11 shows the convolution producing a single pixel of the output map Y^j. A commonly used convolution kernel has a square size of 3 × 3; for each pixel in the output map, the surrounding 3 × 3 = 9 input pixels are involved in the convolution operation. A convolution layer usually has a set of kernels of a pre-defined size, whose parameters are initialized randomly. This enhances the richness of the features, where each kernel corresponds to one dimension of the desired output. Other common hyperparameters include stride and padding. The former defines how many pixels the kernel moves on the input feature map; the latter specifies how to handle convolutions along the edges of the input, where the kernel would need to include pixels from outside the image. Each kernel is used to convolve all possible positions of the input, a property called parameter sharing. Instead of learning parameters for every node in the input, as an fc layer does, a convolution layer only learns a set of k_w × k_h × c × o parameters, where k_w and k_h denote the width and height of the convolution kernel, c the number of input channels, and o the number of output feature maps. A non-linear activation function is applied to the output of a convolutional layer. The parameters of the kernels are learned during the training process using the BP algorithm. A CNN layer in the frontal stage learns to detect primitive features such as edges and corners, and higher feature abstractions are obtained through this layer-wise architecture.

Figure 2.12: A pooling operation using a 2 × 2 kernel with stride 2, demonstrating max pooling and average pooling.
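The convolution of Equation 2.5, restricted to a single input and output map and without the activation and bias, can be written naively in numpy. This is an illustrative sketch: real frameworks use heavily optimized implementations and, strictly speaking, compute cross-correlation under the name "convolution".

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2D convolution (cross-correlation, as in DL frameworks):
    each output pixel is the weighted sum of a kernel-sized input patch."""
    kh, kw = kernel.shape
    h = (image.shape[0] - kh) // stride + 1
    w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # weighted sum over the patch
    return out

# A 3x3 horizontal-gradient (Sobel) kernel responds to vertical edges.
img = np.zeros((5, 5)); img[:, 3:] = 1.0
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
edges = conv2d(img, sobel_x)
```

Note the parameter sharing: the same 9 kernel weights are reused at every output position, whereas an fc layer would need separate weights for each.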
In Lenet, a convolutional layer is typically followed by a pooling layer, also
called subsampling or downsampling layer (cf. Figure 2.10). Pooling layers are
used to reduce the spatial dimensions (width and height) of input data. However,
the number of input and output feature maps does not change. For example, if
the number of input channels is N, then precisely N output feature maps will be created. If a
2×2 downsampling kernel with stride two is used, the spatial dimensions
of the output map will be half of the corresponding input dimensions. The
commonly used pooling methods are average pooling and max-pooling, as shown
in Figure 2.12. Average pooling sums over N×N patches of
the input maps and computes the mean value, whereas max-pooling
only outputs the maximum value within its neighborhood.
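A minimal NumPy sketch of the two pooling variants from Figure 2.12 (the helper `pool2d` is illustrative):

```python
import numpy as np

def pool2d(x, k=2, stride=2, mode="max"):
    """x: (n, h, w) feature maps; n maps in, n maps out (spatial size halved for k=stride=2)."""
    n, h, w = x.shape
    h_out, w_out = (h - k) // stride + 1, (w - k) // stride + 1
    out = np.zeros((n, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
            out[:, i, j] = patch.max(axis=(1, 2)) if mode == "max" else patch.mean(axis=(1, 2))
    return out

x = np.array([[[1., 3., 2., 0.],
               [4., 6., 1., 1.],
               [0., 2., 5., 7.],
               [1., 1., 8., 6.]]])
print(pool2d(x, mode="max")[0])   # → 6, 2 / 2, 8
print(pool2d(x, mode="avg")[0])   # → 3.5, 1.0 / 1.0, 6.5
```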
2.2.1.2.4 Recurrent Neural Networks (RNNs) Human thoughts have
persistence: we do not lose our thought of the moment and
2. DEEP LEARNING: THE CURRENT HIGHLIGHTING APPROACH OF ARTIFICIAL INTELLIGENCE
Figure 2.13: Computational graph of a RNN in folded (left) and unfolded view (right).
(image credit: Xiaoyin Che)
start thinking from scratch every second. For example, when reading an
article, we understand each sentence and paragraph based on our understanding of
the previous sentences and paragraphs. This indicates that our comprehension significantly
depends on the sequential context information. However, traditional feed-forward
neural networks cannot deal with such problems. Therefore, the idea of Recur-
rent Neural Networks (RNNs) was developed in the 1980s. Figure 2.13 shows the
structure of basic RNNs in both folded and unfolded view.
Unlike CNNs, RNNs consist of layers with recurrent connections forming a
feedback loop to the previous state. An RNN in unrolled form consists of several
internal time steps. The output of the previous step becomes part of the input
of the next step. This allows RNNs to maintain internal states, also called hidden
states. The parameters of the RNN layer are shared across consecutive time
steps.
In practice, RNNs are very well suited for processing sequential data of arbitrary
lengths, such as videos (frame sequences) and texts (word sequences). Several
different modelling schemes have been developed for RNNs, including many-to-many
(language models, encoder-decoder models), many-to-one (sentiment analysis)
and one-to-many (image captioning).
Figure 2.14 depicts the detailed structure of a “Vanilla” RNN cell. Mathe-
matically, the internal state ht of the current step can be computed as follows
Figure 2.14: Detailed structure of a “Vanilla” RNN cell. (image credit: Xiaoyin Che)
(for simplicity the bias term bt is omitted):
ht = tanh(Uxt +Wht−1) (2.6)
where W and U are the weight matrices for the hidden state of the previous time
step ht−1 and the current input xt, respectively. The prediction result yt can be
calculated as:
yt = softmax(V ht) (2.7)
The probability distribution is obtained by applying the softmax function on top of
the output. Once the forward propagation for the whole sequence has finished, the
cross-entropy loss for each time step t is calculated, and the total loss is
the sum of the losses over all steps.
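The forward pass described by Equations 2.6 and 2.7, including the summed per-step cross-entropy loss, can be sketched as follows (NumPy; the sizes and the random toy sequence are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H, D, K = 4, 3, 2                  # hidden size, input size, number of classes (toy values)
U = rng.normal(0, 0.1, (H, D))     # input-to-hidden weights
W = rng.normal(0, 0.1, (H, H))     # hidden-to-hidden weights (shared across all steps)
V = rng.normal(0, 0.1, (K, H))     # hidden-to-output weights

xs = [rng.normal(size=D) for _ in range(5)]   # a length-5 input sequence
ys = [0, 1, 1, 0, 1]                          # target class per step
h = np.zeros(H)
total_loss = 0.0
for x, y in zip(xs, ys):
    h = np.tanh(U @ x + W @ h)     # Eq. 2.6 (bias omitted)
    p = softmax(V @ h)             # Eq. 2.7
    total_loss += -np.log(p[y])    # cross-entropy at this step
print(total_loss)                  # total loss: sum of the per-step losses
```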
Backpropagating errors in an RNN can be done by applying the standard
BP algorithm to the unfolded computational graph of the RNN. This method
is called Back-Propagation Through Time (BPTT). The gradients obtained by
applying BPTT can be used by an optimization algorithm such as SGD to train
the RNN.
The main drawback of “Vanilla” RNNs is the exploding and vanishing gradient
problem. While computing the gradients, the same weight matrices are multiplied
with each other many times, which leads to the following issues:
Figure 2.15: Detailed structure of an LSTM cell. (image credit: Xiaoyin Che)
• Exponentially growing gradients if the absolute values of the weight parameters are larger
than 1.0. This leads to unstable training.
• Exponentially vanishing gradients if the absolute values of the weight parameters are smaller
than 1.0. The model then only learns to keep a “short memory”.
The gradient exploding problem can be partially solved by setting a threshold
to “clip” the gradients. However, no good solution has been found for the gradient
vanishing problem of “Vanilla” RNNs.
In 1997, Hochreiter and Schmidhuber proposed Long Short-Term Memory (LSTM) [HS97],
which offers an “advanced” RNN cell structure.
Figure 2.15 demonstrates the detailed structural design of an LSTM cell.
Unlike “Vanilla” RNNs, LSTM is designed to capture sequential context over
long as well as short periods of time. The key idea of LSTM is the newly designed
cell state, intended to store long-term memory. It also introduces several new
hidden units; each cell holds its internal state and is carefully controlled by three
trainable gates. Those gates control the information flow and decide whether the cell's
state should be altered.
First, a forget gate ft decides which information to erase from the previous
cell state Ct−1. Next, the input gate it controls which parts of the new
information C̃t from the current time step are stored in the cell state. Finally, the
output gate ot constrains how the state information is used for computing the hidden
state ht of the current time step.
Overall, the LSTM is calculated as follows:

C̃t = tanh(UC xt + WC ht−1) (2.8)
ft = σ(Uf xt + Wf ht−1) (2.9)
it = σ(Ui xt + Wi ht−1) (2.10)
ot = σ(Uo xt + Wo ht−1) (2.11)

Long-term memory update:

Ct = ft ◦ Ct−1 + it ◦ C̃t (2.12)

Short-term memory output:

ht = ot ◦ tanh(Ct) (2.13)
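A sketch of one LSTM step following Equations 2.8–2.13 (NumPy; biases are omitted as in the equations, and the parameter shapes are toy values):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM step (biases omitted for simplicity)."""
    Uc, Wc, Uf, Wf, Ui, Wi, Uo, Wo = params
    C_tilde = np.tanh(Uc @ x + Wc @ h_prev)   # candidate cell state (2.8)
    f = sigmoid(Uf @ x + Wf @ h_prev)         # forget gate (2.9)
    i = sigmoid(Ui @ x + Wi @ h_prev)         # input gate (2.10)
    o = sigmoid(Uo @ x + Wo @ h_prev)         # output gate (2.11)
    C = f * C_prev + i * C_tilde              # long-term memory update (2.12)
    h = o * np.tanh(C)                        # short-term memory output (2.13)
    return h, C

rng = np.random.default_rng(1)
H, D = 4, 3
# U-matrices map the input (H, D); W-matrices map the previous hidden state (H, H).
params = [rng.normal(0, 0.1, (H, D)) if k % 2 == 0 else rng.normal(0, 0.1, (H, H))
          for k in range(8)]
h, C = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), params)
print(h.shape, C.shape)   # (4,) (4,)
```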
Since LSTM has achieved significant success in a wide range of application areas,
many variants have been derived from it. For example, a simplified approach
called Gated Recurrent Unit (GRU) [CVMG+14] was developed for machine
translation, demonstrating similar accuracy but better efficiency compared
to LSTM. Another approach named Bidirectional LSTM (BLSTM) has also
attracted a lot of attention from different application fields. Figure 2.16 shows
the basic computational graph of a Bidirectional RNN, where the backward RNN
cells (blue) have a computation flow independent of the forward cells
(yellow). On top of both data flows, an additional activation cell is used
to create the integrated outputs. BLSTM has achieved many state-of-the-art results
in different application areas, such as image captioning [WYBM16], phoneme
classification [GS05], and named entity recognition [CN15].
Figure 2.16: Computational graph of a Bidirectional RNN. (image credit: Xiaoyin Che)
Although almost all of the core algorithms used nowadays in the DL community
have existed since the 1980s and successful evidence had been shown, the second
wave of AI research unfortunately only lasted until the mid-1990s. There
are several important reasons for this: first of all, other ML methods,
such as SVMs, achieved better results in the mainstream tasks; second, ANNs were
too hard to train, and training a network was computationally costly; moreover, the
large-scale datasets used today were not available at that time. As a result,
research on ANNs ushered in another cold winter.
2.2.2 Neural Network 3.0 - Deep Learning Algorithms
In 2012, AlexNet, proposed by Krizhevsky, Sutskever and Hinton [KSH12], won the
ImageNet ILSVRC-2012 competition by a large margin over the other competitors
(top-5 error rate: 16.4% vs. 26.2% for the second place). This event opened
the current wave of AI, for which Deep Neural Networks (DNNs) are considered
to be the core foundation.
Generally speaking, recent efforts in DL research focus on optimizing the
information flow of DNNs and on architecture engineering. The
recent achievements in gradient flow optimization have focused on the following
aspects:
• problem: gradient vanishing in deep networks; solution: ReLU [NH10]
• problem: “dying ReLU”; solution: LeakyReLU, PReLU, ELU [XWCL15],
etc.
• problem: extremely “deep” networks are hard to train; solution: adding shortcut
connections to enable more flexible gradient circulation
• problem: strong bias in the data flow of deep networks; solution: forced
stabilization of the mean and variance of activations, Batch Normalization [IS15a]
• problem: overfitting; solution: adding noise to the gradient flow, Dropout
[SHK+14]
• problem: standard SGD relies heavily on manual hyperparameter tuning;
solution: adaptive optimization methods, e.g., Adam [KB14]
I will give a detailed description of the mentioned achievements in the rest of
this chapter, followed by an introduction to several of the most significant DNN
architectures. For a more comprehensive treatment of DL algorithms, I would highly
recommend the following literature: [GBCB16, LBH15, B+09].
2.2.2.1 Data Preprocessing and Initialization
Successfully training a DNN usually requires some advanced training techniques
or components, which need to be analyzed carefully. Different preprocessing
approaches are applied before feeding the data to the network. First, we
use mean-subtraction techniques to reduce the bias of the dataset. In practice,
for instance, we subtract the mean image from the input image in a pixel-wise
manner, where the mean image is computed on the training set (e.g., the mean
image of AlexNet with the shape [3, 224, 224]). Another influential deep
model, VGG-Net [SZ15], uses per-channel mean subtraction with a mean vector of
[103.939, 116.779, 123.68] for the “blue”, “green” and “red” channels,
respectively. Moreover, Google's Inception-Net [SLJ+15] uses zero-centered
RGB channels and further squashes the input to [-1, 1].
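The three preprocessing variants can be sketched as follows (NumPy; the random toy “training set” stands in for real image data):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.uniform(0, 255, (10, 3, 224, 224))   # toy training set of RGB images

mean_image = train.mean(axis=0)              # pixel-wise mean image (AlexNet style)
mean_vector = train.mean(axis=(0, 2, 3))     # per-channel mean (VGG style)

x = train[0]
x_pixelwise = x - mean_image                       # subtract the mean image
x_per_channel = x - mean_vector[:, None, None]     # subtract per-channel means
x_inception = x / 127.5 - 1.0                      # zero-center and squash to [-1, 1]
print(x_inception.min() >= -1.0 and x_inception.max() <= 1.0)   # True
```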
Figure 2.17: (Left) Images generated by our text sample generation tool. (Right) Images
taken from ICDAR dataset of the robust reading challenge. (image credit: Christian Bartz )
DL algorithms usually require large datasets for training, and creating large
datasets with pixel-level labels is extremely costly due to the
amount of human effort required. Therefore, data augmentation methods have
been widely used to enhance the training datasets. Techniques for preparing
a dataset include sample rescaling, random cropping, flipping along the
horizontal or vertical axis, color jittering, PCA/ZCA whitening, as well as arbitrary
combinations of small translations, rotations, stretching, shearing, distortions
and many others. Moreover, for some specific tasks for which training data are
challenging to obtain, we often develop a data engine to generate synthetic
data for model training and use the real data for performance testing. The
synthetic data should approximate the real data distribution as closely as possible.
For example, in our work [YWBM16] we developed a data engine for generating
text sample images with various factors, such as font styles, background blending,
distortion, blurring, reflection, etc. Figure 2.17 shows a comparison
of generated samples with real-world images from the ICDAR dataset of the robust
reading challenge. Based on this system we achieved similar word recognition
Figure 2.18: Ground truth image created from computer games. (image credit:
[RVRK16])
results compared to Google's PhotoOCR system [BCNN13a], which was built
on millions of manually annotated real-world samples.
Moreover, Richter et al. [RVRK16] developed an approach for rapidly creating
pixel-accurate semantic label maps for images extracted from modern computer
games. The authors produced dense pixel-level semantic annotations for 25 thousand
images synthesized by a photorealistic open-world computer game. Figure
2.18 shows an exemplary picture generated by this approach.
2.2.2.1.1 Parameter Initialization Before Batch Normalization was proposed,
the parameter initialization of deep networks had a considerable impact on the
overall performance. This fact has been confirmed in many previous works
[SMDH13]; most networks apply random initialization for the weights. However,
for more complicated tasks, effective initialization techniques are desired
for high-dimensional input data. The weights of a DNN should not be
symmetrical, in order to ease the backpropagation process. Many effective
techniques have been proposed over the last few years. Glorot et al. [GB10] proposed
a simple but effective approach, by which the network weights Wl of the lth layer are
scaled by the inverse of the square root of the input dimension. More formally,
this method can be represented as:
Var(in) = 1/Dl,    Wl = Random(Dl, H) / √Dl    (2.14)
where Dl denotes the dimension of the input of the lth layer, and Random(Dl, H)
denotes the randomly initialized weights. This method is known as Xavier initial-
ization, which is based on the symmetric activation function with respect to the
hypothesis of linearity [GB10]. However, this approach might also receive biased
data flow when used together with the ReLU activation function. Therefore, He
et al. (2015) [HZRS15] introduced an additional factor into the Xavier initialization
which effectively alleviates this data bias problem. The weights of
the lth layer are drawn from a normal distribution with zero mean and variance 2/nl,
expressed as follows:

Wl ∼ N(0, 2/nl)    (2.15)
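Both initialization schemes can be sketched in a few lines (NumPy; the layer sizes are illustrative):

```python
import numpy as np

def xavier_init(d_in, d_out, rng):
    """Glorot-style init: scale random weights by 1/sqrt(d_in)."""
    return rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)

def he_init(d_in, d_out, rng):
    """He init for ReLU layers: draw from N(0, 2/d_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out))

rng = np.random.default_rng(0)
W1 = xavier_init(1024, 512, rng)
W2 = he_init(1024, 512, rng)
print(round(W1.var(), 4))   # ≈ 1/1024 ≈ 0.001
print(round(W2.var(), 4))   # ≈ 2/1024 ≈ 0.002
```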
2.2.2.2 Batch Normalization
As mentioned in the previous section, in DL we often use mean subtraction to
reduce dataset bias. In this way, the network converges faster and shows
better regularization during training, which has a positive impact on the overall
accuracy. However, this process is performed outside of the network before
training, while a deep network has a layer-wise structure with many non-linear
data transformations. Even small changes in the front layers can result in
big outliers in the later layers, and such outliers lead to gradient bias in the
BP process. Thus, additional compensation (more training epochs) is required
to cope with such outliers.
Batch Normalization (BN) [IS15a] helps to accelerate DL training by reducing
this internal covariate shift: the inputs of internal layers are linearly
transformed to have zero mean and unit variance, and since the normalization is
performed inside the network, it is taken into account by BP. BN is commonly used
in most state-of-the-art DNN architectures, such as ResNet, Inception-Net,
DenseNet, etc.
The algorithm of BN is given in Algorithm 3.

Algorithm 3 Batch Normalization (BN)
Input: values of x over a mini-batch: B = {x1 . . . xm}; parameters to be learned: γ, β
Output: {yi = BNγ,β(xi)}
Calculate the mini-batch mean:
µB ← (1/m) Σ_{i=1}^{m} xi
Calculate the mini-batch variance:
σ²B ← (1/m) Σ_{i=1}^{m} (xi − µB)²
Normalize:
x̂i ← (xi − µB) / √(σ²B + ε)
Scale and shift:
yi ← γ x̂i + β ≡ BNγ,β(x̂i)

The parameters γ and β are defined as scale and shift factors for the normalized
values, so that the normalization does not depend only on the layer values.
Calculating BN is slightly different at test time: the mean and variance are not
computed per mini-batch; instead, a fixed empirical mean and variance of the
activations, collected during training, are used. The advantages of BN and
some recommendations for its use can be summarized as follows:
• Prevents outliers, thus improving the gradient flow in the backward pass
• Allows a higher learning rate, which makes training faster
• Reduces the strong dependence on initialization
• Reduces the need for Dropout (as reported by Ioffe and Szegedy [IS15a])
• Reduce L2 weight regularization
• Remove Local Response Normalization (LRN), if used
• Shuffle the training samples more thoroughly
• Use less distortion of the images in the training set
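The training-time forward pass of Algorithm 3 can be sketched as follows (NumPy; the batch of strongly biased activations is synthetic):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """BN forward pass over a mini-batch x of shape (m, d), per Algorithm 3."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, (64, 10))         # strongly biased activations
y = batch_norm_train(x, gamma=np.ones(10), beta=np.zeros(10))
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))   # True: zero mean
print(np.allclose(y.var(axis=0), 1.0, atol=1e-3))    # True: unit variance
```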
Figure 2.19: Pictorial representation of the Dropout concept
2.2.2.3 Regularization
Regularization techniques in ML are designed to enhance the generalization ability
of an ML model. Unlike feature-selection techniques, regularization methods keep the
number of features but reduce the magnitude of their weights. This works well
when an ML model has a lot of features, as DNNs do, where each feature
contributes a bit to the classification result. The commonly used regularization
methods in DNNs are L1, L2, and L1 + L2.
Different regularization approaches have been proposed for deep networks in the
past few years. Among them, Dropout, proposed by Srivastava et al. [SHK+14],
is a very straightforward but efficient approach. In Dropout, a randomly
selected subset of activations is set to zero within a layer. The parameter
dropout-rate sets the probability of dropping, e.g., 50%, which means that
50% of the neuron activations of a particular layer will randomly be set to zero.
Training a deep model with Dropout can therefore be considered as training a
large ensemble of sub-models. It prevents the co-adaptation of features and forces the
network to have a redundant representation. The concept of Dropout is shown in
Figure 2.19. DropConnect, proposed by Wang et al. [WZZ+13], is another effective
regularization approach. Instead of dropping the activations,
subsets of the weights within layers are set to zero. As a result, each layer receives
a randomly selected subset of units from the previous layer.
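A sketch of Dropout (NumPy). Note that this implements the common “inverted” variant, which rescales the surviving activations during training so that no extra scaling is needed at test time:

```python
import numpy as np

def dropout(a, rate, rng, training=True):
    """Inverted dropout: zero activations with probability `rate`, rescale survivors."""
    if not training:
        return a                           # identity at test time
    mask = rng.random(a.shape) >= rate     # keep each activation with prob 1 - rate
    return a * mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((4, 8))
out = dropout(a, rate=0.5, rng=rng)
print(sorted(set(out.ravel())))   # activations are either dropped (0.0) or rescaled (2.0)
```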
Figure 2.20: Activation function: (top-left) sigmoid function, (bottom-left) derivative of
sigmoid function, (top-middle) tanh function, (bottom-middle) derivative of tanh function,
(top-right) ReLU function, (bottom-right) derivative of ReLU function.
2.2.2.4 Activation Function
Previously, the Sigmoid and Tanh activation functions were commonly used
in neural networks. The corresponding graphical representations are shown in
Figure 2.20 (cf. the left and middle columns). Equations 2.16 and 2.17 show the
mathematical expressions.
Sigmoid: σ(x) = 1 / (1 + e^(−x)),    Derivative: σ′(x) = σ(x)(1 − σ(x))    (2.16)

Tanh: tanh(x) = 2 / (1 + e^(−2x)) − 1,    Derivative: tanh′(x) = 1 − tanh(x)²    (2.17)
However, both the Sigmoid and Tanh functions suffer from the gradient vanishing
problem when the network has many hidden layers. We were thus not able to train
very deep networks using these two activation functions. This problem has been
solved by the ReLU activation function, proposed by Nair and Hinton (2010) [NH10].
The basic concept of ReLU is to simply keep all values above zero and set all
negative values to zero, as shown graphically in Figure 2.20 (cf. the right column)
and formally in Equation 2.18. ReLU converges much faster than Sigmoid and
Tanh, since the gradient always equals 1 when x ≥ 0; this characteristic solves
the gradient vanishing problem. But when x < 0, the corresponding neurons will
never be activated in the forward pass and always get zero gradients in the backward
pass. This issue is the so-called “dying ReLU” problem.
ReLU: f(x) = max(x, 0),    Derivative: f′(x) = { 0, if x < 0;  1, if x ≥ 0 }    (2.18)
The straightforward solution to this problem is simply to modify
the negative part of the ReLU function so as to enable a negative data flow. Several
improved variants of ReLU have been proposed, such as Parametric ReLU
(PReLU) [HZRS15], Leaky ReLU (LReLU) [MHN13] and the Exponential Linear
Unit (ELU) [CUH15]. Figure 2.21 shows their graphical representations, and the
mathematical expressions are given in Equations 2.19, 2.20 and 2.21.
Leaky ReLU: f(x) = { 0.01x, if x < 0;  x, if x ≥ 0 },    Derivative: f′(x) = { 0.01, if x < 0;  1, if x ≥ 0 }    (2.19)

PReLU: f(x) = { αx, if x < 0;  x, if x ≥ 0 },    Derivative: f′(x) = { α, if x < 0;  1, if x ≥ 0 }    (2.20)
We can quickly see that LReLU requires manually choosing the constant
parameter (0.01 in Equation 2.19), while PReLU adaptively learns the
parameter α from the training data.
ELU: f(x) = { α(e^x − 1), if x < 0;  x, if x ≥ 0 },    Derivative: f′(x) = { f(x) + α, if x < 0;  1, if x ≥ 0 }    (2.21)
Figure 2.21: Derived linear unit activation function: (left) PReLU and LReLU activation
function, (right) ELU activation function.
ELU retains all the benefits of ReLU and does not suffer from the “dying ReLU” problem,
but its exponential function is computationally expensive. Xu et al. (2015)
[XWCL15] present an empirical study of several rectified activations in CNNs,
which provides more insights on this topic.
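The four rectified activations of Equations 2.18–2.21 can be compared side by side (NumPy sketch; the inputs are arbitrary test values):

```python
import numpy as np

def relu(x):        return np.maximum(x, 0.0)
def leaky_relu(x):  return np.where(x < 0, 0.01 * x, x)
def prelu(x, a):    return np.where(x < 0, a * x, x)        # α is learned in practice
def elu(x, a=1.0):  return np.where(x < 0, a * (np.exp(x) - 1.0), x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))          # → 0, 0, 0, 1.5
print(leaky_relu(x))    # → -0.02, -0.005, 0, 1.5: small negative data flow
print(elu(x))           # negative inputs saturate smoothly towards -a
```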
2.2.2.5 Optimization Algorithms
In section 2.2.1.2, we presented SGD, a widely used optimization algorithm for
training neural networks. With SGD we can obtain reliable results given
a proper initialization and an appropriate learning rate scheduling scheme.
However, SGD requires a lot of manual tuning of its hyperparameters and converges
slowly compared to the recently proposed adaptive optimization methods.
Furthermore, SGD has trouble navigating ravines, i.e., areas where the surface
curves much more steeply in one dimension than in another, which are common
around local optima and saddle points.
2.2.2.5.1 Momentum Momentum [Qia99] is a method which helps to accelerate
training with the SGD approach. It functions similarly to its namesake
in physics: the technique boosts SGD in the relevant direction and dampens
oscillations. The core idea of this method is to use a moving average of the
gradients instead of only the current value of the gradient.
Mathematically, it adds a fraction γ of the update vector of the previous
time step to the current update vector, which can be expressed as follows:

vt = γ vt−1 + η ∇θ J(θ),    θ = θ − vt    (2.22)
where γ is the momentum term, η denotes the learning rate for the tth update, and
v denotes the “velocity”. The momentum term increases for dimensions whose
gradients point in the same direction and dampens updates for dimensions whose
gradients change direction. This results in faster convergence and reduced
oscillation.
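A minimal sketch of the momentum update (2.22) on a toy one-dimensional quadratic loss (NumPy; the γ and η values are illustrative):

```python
import numpy as np

def grad(theta):                        # gradient of the toy loss J(θ) = θ²
    return 2.0 * theta

theta, v = np.array([5.0]), np.array([0.0])
gamma, eta = 0.9, 0.1                   # momentum term and learning rate
for _ in range(200):
    v = gamma * v + eta * grad(theta)   # Eq. 2.22: accumulate "velocity"
    theta = theta - v
print(abs(theta[0]) < 1e-2)             # True: converged close to the optimum 0
```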
However, even though we can achieve some speedup with Momentum, we still have to
manually choose an initial learning rate and define its update scheduling scheme.
Therefore, adaptive learning rate methods are still highly desired.
2.2.2.5.2 RMSprop RMSprop is an adaptive learning rate method proposed
by Geoffrey Hinton in his Coursera lecture¹. RMSprop follows the same idea as
Adagrad [DHS11] in adapting the learning rate to each parameter, performing
larger updates for infrequent and smaller updates for frequent parameters.
Unlike Adagrad, RMSprop uses only the magnitude of the gradients of recent
iterations, which prevents the monotonically decreasing learning rate of
Adagrad and provides better performance in many cases. The mathematical
expression of RMSprop is presented in Equation 2.23.
E[g²]t = γ E[g²]t−1 + (1 − γ) g²t,    θt+1 = θt − (η / √(E[g²]t + ε)) gt    (2.23)
Instead of inefficiently accumulating all previous squared gradients, the sum
is recursively defined as a decaying average of all past squared gradients, where
E[g²]t denotes the running average at time step t. From Equation
2.23, we can see that E[g²]t only depends on the previous average and the
current gradient. The learning rate η is divided by this exponentially decaying
average of squared gradients. The suggested default values of γ and η are 0.9 and
0.001, respectively.
1http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
2.2.2.5.3 Adaptive Moment Estimation (Adam) Adam [KB14] is probably
the most widely used adaptive optimization method in the DL community at
present. Like its counterparts, Adam computes a learning rate for
each parameter. The method can be considered a combination of Momentum
and RMSprop: on the one hand it uses an exponentially decaying average of past
squared gradients like RMSprop (denoted by vt in Equation 2.24), and on the
other hand it also keeps an exponentially decaying average of past gradients
(denoted by mt), similar to Momentum.
mt = β1 mt−1 + (1 − β1) gt,    vt = β2 vt−1 + (1 − β2) g²t    (2.24)
where mt and vt estimate the first moment (the mean) and the second moment
(the variance) of the gradients, respectively. The authors of [KB14] observed
that mt and vt are biased towards zero. Therefore, the bias-corrected versions are:
m̂t = mt / (1 − β1^t),    v̂t = vt / (1 − β2^t)    (2.25)
Then, the parameter update function is expressed as:

θt+1 = θt − (η / (√v̂t + ε)) m̂t    (2.26)
The proposed default values for β1, β2 and ε are 0.9, 0.999 and 10⁻⁸, respectively.
Adam works quite well in practice and is particularly suggested as a good starting
point for deeper, more complex networks.
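A minimal Adam loop following Equations 2.24–2.26 (NumPy; the toy ill-scaled quadratic loss and the step count are illustrative, the β1, β2 and ε values are the proposed defaults):

```python
import numpy as np

def adam_minimize(grad_fn, theta, steps=2000, eta=0.01,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean), Eq. 2.24
        v = beta2 * v + (1 - beta2) * g * g    # second moment (variance), Eq. 2.24
        m_hat = m / (1 - beta1 ** t)           # bias correction, Eq. 2.25
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)   # update, Eq. 2.26
    return theta

# Toy loss with very differently scaled dimensions: f(x, y) = 100x² + y².
grad_fn = lambda p: np.array([200.0 * p[0], 2.0 * p[1]])
theta = adam_minimize(grad_fn, np.array([1.0, 1.0]))
print(np.round(theta, 2))   # both coordinates end up near the optimum (0, 0)
```

The per-parameter scaling by √v̂t is what lets both the steep and the flat coordinate make progress with a single global learning rate.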
2.2.2.6 Loss Function
The Softmax function is an activation function applied to the last fc-layer to
obtain a probability distribution. It maps any K-dimensional (K ∈ N+) input
vector x ∈ R^K onto a vector with values in [0, 1], where the sum of
all elements is 1:

p(x)i = exp(xi) / Σ_{k=1}^{K} exp(xk)
To train a neural network, we iteratively calculate the difference between the desired
output class y and the predicted class ŷ. We refer to such an error measure between
the predicted and expected output as a loss function. A commonly used
loss function for multi-class classification problems is the cross-entropy loss, which
is also the one mostly used in this thesis.
In binary classification, where the number of classes K = 2, cross-entropy can
be calculated as:
Lcross-entropy = −(y log(p) + (1− y) log(1− p))
For K > 2, we compute a separate loss for each class label per observation
and sum the results, which is expressed as follows:

Lcross-entropy = − Σ_{k=1}^{K} yi,k log(pi,k)
where K denotes the number of classes, log is the natural logarithm, yi,k is the
binary indicator (0 or 1) of whether class label k is the correct classification for
observation i, and pi,k is the predicted probability that observation i is of class k.
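A small numerical example of softmax and the cross-entropy loss (NumPy; the logits are arbitrary toy scores):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p, y_onehot):
    return -np.sum(y_onehot * np.log(p))

logits = np.array([2.0, 1.0, 0.1])   # raw scores from the last fc-layer
p = softmax(logits)
print(round(p.sum(), 6))             # 1.0: a valid probability distribution
y = np.array([1.0, 0.0, 0.0])        # ground truth: class 0
print(round(cross_entropy(p, y), 4)) # ≈ 0.417: low loss, true class has high probability
```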
2.2.2.7 DNN Architectures
In section 2.2.1.2, I presented Lenet [L+15], which has been recognized as the
cornerstone of the subsequent CNN architectures. In this section, I will introduce
some recent CNN architectures, which are in turn the backbones of most
AI applications. The top-5 errors of essential deep CNN models in the ImageNet
classification challenge can be found in Figure 2.34.
2.2.2.7.1 AlexNet As already mentioned, AlexNet [KSH12] can be considered
the starting signal of the current AI wave. It achieved state-of-the-art accuracy
and won the most difficult ImageNet ILSVRC challenge in 2012 (an image
classification task with 1000 classes, 1.2 million training and 50k validation images).
It was a significant breakthrough in the fields of machine learning and computer
vision for visual recognition and classification.
Method                   | LeNet-5 | AlexNet    | VGG-16  | GoogLeNet    | ResNet-50
Top-5 errors             | N/A     | 16.4%      | 7.4%    | 6.7%         | 5.3%
Input size               | 28×28   | 227×227    | 224×224 | 224×224      | 224×224
Number of conv-layers    | 2       | 5          | 16      | 21           | 50
Kernel size              | {5}     | {3, 5, 11} | {3}     | {1, 3, 5, 7} | {1, 3, 7}
Number of weights (conv) | 26k     | 2.3M       | 14.7M   | 6.0M         | 23.5M
Number of MACs (conv)    | 1.9M    | 666M       | 15.3G   | 1.43G        | 3.86G
Number of fc-layers      | 2       | 3          | 3       | 1            | 1
Number of weights (fc)   | 406k    | 58.6M      | 124M    | 1M           | 1M
Number of MACs (fc)      | 405k    | 58.6M      | 124M    | 1M           | 1M
Total weights            | 431k    | 61M        | 138M    | 7M           | 25.5M
Total MACs               | 2.3M    | 724M       | 15.5G   | 1.43G        | 3.9G

Table 2.1: Classification accuracy of mainstream CNN architectures on the ImageNet dataset.
Essential information about the parameters is provided, such as the number of weights and
computation operations.
Figure 2.22: Architecture of AlexNet
As shown in Figure 2.22, AlexNet has five convolution layers and three fully-connected
(fc) layers. The first convolution layer consists of convolution and max-pooling
with Local Response Normalization (LRN), where 96 filters of size 11×11 are used.
The same operations are performed in the second convolution
layer with 256 5×5 filters, and in the 3rd, 4th and 5th conv-layers with 384, 384 and
256 filters, respectively. Several novel concepts were introduced in AlexNet,
including the use of the ReLU activation function instead of Sigmoid and Tanh; adding
Dropout to the network as a regularizer; and implementing CNNs on CUDA,
which achieved significant acceleration.
Table 2.1 shows some essential information about the network, such as the number
of convolution and fc layers, the number of weight parameters, and the number of
Multiply-Accumulate operations (MACs) in the convolution and fc layers, respectively. From
Figure 2.23: Structure diagram of VGG-Net
the table we can see that the convolution layers occupy a large proportion of
the total computation (95%), while the fully-connected layers hold a large share
of the weight parameters (94%). This indicates that if we want to reduce the model
size, we should consider eliminating fc layers, and that to speed up the model
computation, we could apply parallel computing techniques to the convolutional
layers.
Overall, we can clearly see the outstanding contribution of AlexNet as a pioneer
and game-changer that provided important guidance for the subsequent development
of DNNs.
2.2.2.7.2 VGG-Net After the success of AlexNet, the architecture engineering
of deep CNNs developed rapidly and became a mainstream research
direction in the DL community.
In 2013, Zeiler and Fergus won the ILSVRC'13 classification challenge with
ZFNet, which adapts the filter size of the input convolution layer from
11×11 to 7×7 and increases the filter numbers of the last three convolution layers
to 512, 1024 and 512. These changes yielded an accuracy improvement of about
5%.
In 2014, two essential approaches were published: VGG-Net [SZ15] and
GoogLeNet [SLJ+15], which won the ILSVRC'14 localization and classification
challenges, respectively. VGG-Net also achieved second place in the classification
task.
The main contribution of VGG-Net is that it shows the importance of the
depth of a network. Alongside the other hyperparameters of CNNs, the depth
Figure 2.24: A stack of three convolution layers with 3×3 kernel and stride 1, has the
same active receptive field as a 7×7 convolution layer.
could be a critical component for achieving much better accuracy. As shown in
Figure 2.23, VGG-Net consists of several modules of convolutional layers that
differ in the number of conv-layers and output feature maps. The ReLU activation
function is applied to obtain non-linearity, followed by a single max-pooling
layer and several fully connected layers. The final layer of the model is a
softmax layer for classification.
We can learn some interesting characteristics from the design of VGG-Net. It
introduced for the first time the notion of a “module” or “block” in a deep network
(cf. Figure 2.23, where the different colors characterize different modules). It
strictly uses 3×3 filters with stride and padding of 1. In this way, the number of
weights of the network can be significantly reduced: given a 7×7 input map,
stacking three conv-layers with 3×3 kernels has the same active receptive field
but much fewer parameters than directly using a 7×7 kernel (number of parameters:
3·(3²C²) = 27C² vs. 7²C² = 49C², for C channels per layer), as
demonstrated in Figure 2.24.
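The parameter comparison can be verified with a two-line computation (Python; C = 256 is an arbitrary example channel count):

```python
# Parameter count for C input and C output channels (biases ignored).
def params_stacked_3x3(C, depth=3):
    return depth * (3 * 3 * C * C)   # three stacked 3x3 conv layers: 27 * C^2

def params_single_7x7(C):
    return 7 * 7 * C * C             # one 7x7 conv layer: 49 * C^2

C = 256
print(params_stacked_3x3(C))   # 1769472
print(params_single_7x7(C))    # 3211264: same receptive field, ~1.8x more weights
```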
VGG-16 is one of the most influential deep models of the early DL age,
because it was the community's preferred choice as an image feature extractor
for a large variety of applications. This trend did not change until the
Inception-Net and ResNet models became available. It emphasizes the notion
of “keep it deep, keep it simple”. However, the VGG models are also computationally
expensive: VGG-16 contains 138M weights and requires 15.5G MACs, and VGG-19,
the most accurate model of the series, is even more costly.
2.2.2.7.3 GoogLeNet GoogLeNet [SLJ+15], proposed by Google researchers, is
the winner of the ILSVRC'14 image classification challenge. The essential
Figure 2.25: Structure diagram of GoogLeNet, which emphasizes the so-called “Inception
Module”
design aim of this model is to reduce the computation complexity compared to
the traditional CNNs.
Figure 2.25 demonstrates the structure of GoogLeNet, from which we can
see that the network consists of many sub-networks, the so-called “Inception
Modules”. The idea behind this is to design a good local network topology (sub-
network) and then stack these modules on top of each other. The sub-modules
have variable receptive fields, created by different kernel sizes, and all
filter outputs are concatenated in a depth-wise manner, which is
called “Depth-Concat”. The initial concept of the Inception module is shown
in Figure 2.26. It applies four parallel filter operations to the input of the current
layer in order to learn visual features at different scales. A problem with this design
is that it cannot effectively reduce the model complexity as initially expected.
Therefore a simple but quite efficient idea was proposed: the “bottleneck” design.
Figure 2.27 describes this idea: given an input map with a depth, width, and height of [64, 56, 56], we use a 1×1 conv-kernel with depth 32 and stride 1 to convolve the input. The output map then has the dimension [32, 56, 56]. In this way, we preserve the width and height (the spatial dimensions) of the feature map but can arbitrarily reduce its depth.
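The shape and cost arithmetic of such a 1×1 "bottleneck" layer can be sketched as follows; the helper function and the multiply-accumulate (MAC) count are illustrative, not taken from the GoogLeNet implementation:

```python
def conv_1x1(in_shape, n_filters):
    # in_shape: (depth, height, width); a 1x1 convolution with stride 1
    # preserves the spatial dimensions and sets the output depth to n_filters
    depth, height, width = in_shape
    out_shape = (n_filters, height, width)
    macs = height * width * depth * n_filters  # multiply-accumulate count
    return out_shape, macs

# the example from the text: a [64, 56, 56] input map and 32 filters
out_shape, macs = conv_1x1((64, 56, 56), 32)
print(out_shape, macs)  # (32, 56, 56) 6422528
```

Because the MAC count grows linearly with the input depth, shrinking the depth in front of a 3×3 or 5×5 convolution directly cuts the cost of that larger convolution.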
Figure 2.26: Naive version of the "Inception Module", where the numbers in the figure, e.g., 28×28×128, denote the width×height×depth of the feature maps.

Figure 2.27: "Bottleneck" design idea for a convolution layer: it preserves width and height but reduces depth.

Figure 2.28: "Inception Module" with "bottleneck" layers, where the numbers in the figure, e.g., 28×28×128, denote the width×height×depth of the feature maps.

Figure 2.28 shows the final "Inception Module" of GoogLeNet with the "bottleneck" layer design. A 1×1 convolution layer with depth 64 is applied before the activations feed into the 3×3 and 5×5 convolution layers, and after the 3×3 max-pooling layer, respectively. In this way, the number of operations of the "Inception Module" is effectively reduced from 854M to 358M.
GoogLeNet consists of 22 conv-layers (the deepest network at that time) and achieves a top-5 error of 6.7%, a gain of almost 10 percentage points over AlexNet. Moreover, its total number of weights is 12× smaller than that of AlexNet and 27× smaller than that of VGG-19. The "bottleneck" design provided massive inspiration for the follow-up lightweight CNN models.
2.2.2.7.4 ResNet ResNet, proposed by He et al. [HZRS16], swept first place in all ILSVRC'15 and COCO'151 competitions, including the image classification, object detection, and object segmentation tasks. ResNet-152 achieved a 3.57% top-5 error in ILSVRC'15 image classification, marking the first time a machine learning model surpassed human performance (a 5.1% top-5 error, as reported in [RDS+15]) on this task. In my opinion, the main contribution of this work is that the authors found an efficient way to train ultra-deep CNNs without suffering from the network degradation and vanishing gradient problems. In [HZRS16], the authors conducted a study on increasing the number of hidden layers of a "plain" CNN model with a VGG-Net-like architecture.
1http://cocodataset.org
Figure 2.29: Structure diagram of ResNet. Left: a "Residual Module"; right: the overall network design. The different colors indicate blocks with different layer types or different numbers of filters.
Figure 2.30: "Residual Module". Left: initial design of the residual block; right: residual block with the "bottleneck" design.
Surprisingly, a deeper model with 56 layers performed worse than a shallow model with 20 layers. Since both the training and testing accuracy of the shallow model were better than those of the deeper one, overfitting could be ruled out as the cause. The authors attributed this result to the network degradation problem and drew the conclusion that a multi-layer non-linear feed-forward network has difficulty learning the identity mapping. Yet if one copied the learned layers from the shallower model into a deeper one and set the additional layers to the identity mapping, the deeper model should be able to perform at least as well as the shallower one. The authors therefore raised another question: since the direct mapping in a CNN is difficult to learn, can we instead learn the residual of the information flow? Based on this idea, the authors proposed a new network module, called the "Residual Module", which adds a residual connection (serving as an identity mapping) across the conv-layers, as shown in the left part of Figure 2.29. The right part of Figure 2.29 depicts the overall design of ResNet.
ResNet has been developed with many different numbers of layers: 18, 34, 50, 101, 152, and even 1202 (on the CIFAR-10 dataset). The popular ResNet-50 contains 49 conv-layers and one fc layer at the end of the network for classification. The total numbers of weights and MACs for the whole network are 25.5M and 3.9G, respectively.
Figure 2.31: A comparison of DNN models, which gives an indication for practical applications. Left: top-1 model accuracy on the ImageNet dataset; right: computation complexity comparison. (image credit: Alfredo Canziani [CPC16])
The basic residual module architecture is shown in Figure 2.30 (left). Let X denote the output of the previous layer, and let F(X) denote the output after performing the operations of the block, such as convolution, BN, etc., followed by a ReLU activation function. The output H(X) of the residual unit at the current layer is then defined by the following equation:

H(X) = F(X) + X (2.27)

If F(X) = 0, then H(X) = X is an identity mapping.
The whole ResNet is built from stacked residual modules, each of which contains at least two conv-layers; the network periodically doubles the number of filters and downsamples the spatial dimensions using a stride of 2. Inspired by GoogLeNet, the authors also utilized the "bottleneck" design (cf. Figure 2.30 (right)) for the deeper ResNets (≥ 50 layers), which successfully improves the efficiency and reduces the model size.
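A minimal sketch of Equation 2.27 in plain Python may make the skip connection concrete. Here F(X) is modeled by two small dense layers with a ReLU in between, standing in for the two conv-layers of a basic residual module; all names and sizes are illustrative:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, w):
    # dense layer (a stand-in for a conv layer in this 1-D sketch)
    return [sum(wi * xi for wi, xi in zip(row, v)) for row in w]

def residual_unit(x, w1, w2):
    # F(x): two stacked layers with a ReLU in between (cf. Figure 2.30, left)
    f = linear(relu(linear(x, w1)), w2)
    # H(x) = F(x) + x  (Equation 2.27): the skip connection adds the input
    # back; in the real module a further ReLU typically follows this sum
    return [fi + xi for fi, xi in zip(f, x)]

x = [1.0, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
# with all-zero weights F(x) = 0, so H(x) = x: an identity mapping
print(residual_unit(x, zeros, zeros))  # [1.0, -2.0, 0.5]
```

The degenerate all-zero case shows why degradation is avoided: the block can always fall back to the identity, so adding more blocks should never make the network worse in principle.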
2.2.2.7.5 MobileNet There are two main approaches that allow deep models to be executed on mobile devices. The first is to use quantized floating-point numbers with lower-precision values for weights and activations; binary NNs, for example, use only 1 bit of storage per weight. Our work BMXNet [YFBM17b, BYBM18] belongs to this group. The second is to compress the information in a CNN through a compact network design. Such designs rely on full-precision floating-point numbers but reduce the total number of parameters through a more efficient network architecture, while preventing loss of accuracy.
One of the most impactful works is MobileNet [HZC+17], implemented by Howard et al. in 2017. It uses the depth-wise separable convolution technique proposed by Chollet [Cho17] (2017): the convolution layers apply a single 3×3 filter to each input channel, and a 1×1 convolution is then employed to combine their outputs. The authors keep large activation maps throughout the network and downsample late, in order to retain more information. The total numbers of weights and MACs for the whole MobileNet are 4.2M and 568M, respectively. On the ImageNet dataset, it achieves similar accuracy to VGG-16 and GoogLeNet but requires far fewer weights and MACs.
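The parameter saving of the depth-wise separable convolution is easy to verify by counting weights. The layer sizes below (3×3 kernel, 128 input and 256 output channels) are hypothetical choices for illustration, not MobileNet's actual configuration:

```python
def standard_conv_params(k, c_in, c_out):
    # a regular kxk convolution: every output channel sees every input channel
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one kxk filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution combining the channels
    return depthwise + pointwise

std = standard_conv_params(3, 128, 256)        # 294912 weights
sep = depthwise_separable_params(3, 128, 256)  # 33920 weights
print(std, sep)  # 294912 33920
```

For this configuration the separable variant uses roughly 8.7× fewer weights, which is where MobileNet's efficiency mainly comes from.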
Other approaches in this group include Xception [Cho17], SqueezeNet [IHM+16], Deep Compression [HMD15], and ShuffleNet [ZZLS17]. These approaches reduce memory consumption but still require GPU hardware for efficient training and inference; specific acceleration strategies for CPUs still need to be developed for these methods. Therefore, in our ongoing research work we developed BMXNet [YFBM17b], which is intended to tackle this problem and is presented in Section 6.
2.2.2.7.6 Summary AlexNet, VGG, Inception-Net, ResNet, and the MobileNets are all in wide use and available in the model zoos of the mainstream DL frameworks. Canziani et al. (2017) [CPC16] provide a detailed analysis of DNN models with regard to practical use cases; Figure 2.31 shows the comparison results regarding both accuracy and model complexity.

The recent research interests in DL architectures can be roughly summarized as follows:
• Design of layer and/or skip connections
• Further improvement of the gradient flow
• A more recent trend towards examining the necessity of depth vs. width, and of residual connections
• Efficient, lightweight models for low-power devices

Figure 2.32: Exemplary visualization of the VisualBackProp method. (image credit: Mariusz Bojarski [BCC+16])
2.2.2.8 Visualization Tool for Network Development

In this section, I introduce a visualization technique called VisualBackProp [BCC+16], which serves as a visual indicator of what features a deep CNN model has learned. It visualizes the features extracted from deeper CNN layers at the same resolution as the input image; in other words, it shows which sets of pixels of the input image contribute most to the final predictions of the CNN model. Due to its high efficiency, it can easily be applied in real time during both training and inference, which makes VisualBackProp a useful debugging tool for the development of CNN-based systems. Figure 2.32 shows some input images and the corresponding feature representations; pixels in the highlighted regions are more strongly correlated with the prediction results.
Figure 2.33 depicts the block diagram of VisualBackProp. The method first performs a forward pass to obtain a prediction and collects the feature maps produced after each ReLU activation function. The feature maps of each layer are then averaged, resulting in a single feature map per layer. Next, the averaged feature map of the deepest convolutional layer is scaled up to the size of the feature map of the previous layer by means of a deconvolution operation. The authors point out that the deconvolution uses the same filter size and stride as the convolutional layer that produced the feature map being scaled up; all deconvolution weights are set to 1 and all biases to 0. A Hadamard product is then computed between the scaled-up averaged feature map and the averaged feature map of the previous layer. This process continues layer by layer until the network's input is reached, as shown in Figure 2.33. Finally, a result mask of the same size as the input image is obtained, with its values normalized to the range [0, 1]. Note that this method backpropagates the feature maps instead of the gradients; it therefore requires neither additional gradient calculations nor a backward pass through the network. The obtained visualization can serve as useful guidance for CNN development.

Figure 2.33: Block diagram of the VisualBackProp method. (image credit: Mariusz Bojarski [BCC+16])
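The procedure can be sketched in a few lines of NumPy. This is a simplified reading of the method: the all-ones deconvolution is replaced by a nearest-neighbour resize, and the toy feature maps are random, so the sketch only illustrates the averaging, upscaling, Hadamard-product, and normalization steps:

```python
import numpy as np

def average_maps(maps):
    # average all channels of one layer into a single feature map
    return maps.mean(axis=0)

def upscale(mask, target_shape):
    # stand-in for the all-ones deconvolution of the paper: a simple
    # nearest-neighbour resize to the previous layer's spatial size
    # (assumes the target size is an integer multiple of the mask size)
    ry = target_shape[0] // mask.shape[0]
    rx = target_shape[1] // mask.shape[1]
    return np.repeat(np.repeat(mask, ry, axis=0), rx, axis=1)

def visual_backprop(relu_maps):
    # relu_maps: per-layer post-ReLU feature maps (channels, h, w),
    # ordered from the input side to the deepest layer
    mask = average_maps(relu_maps[-1])
    for maps in reversed(relu_maps[:-1]):
        avg = average_maps(maps)
        mask = upscale(mask, avg.shape) * avg  # Hadamard product
    lo, hi = mask.min(), mask.max()
    return (mask - lo) / (hi - lo + 1e-8)  # normalize to [0, 1]

rng = np.random.default_rng(0)
maps = [rng.random((3, 8, 8)), rng.random((4, 4, 4))]  # toy two-layer "net"
mask = visual_backprop(maps)
print(mask.shape)  # (8, 8)
```

Since only feature maps are propagated, no gradient computation or backward pass is needed, which is what makes the method cheap enough for real-time use.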
2.3 Recent Development in the Age of Deep Learning
As mentioned, DL is currently being applied in a large variety of application areas, which is why we refer to it as a universal learning approach [B+09]. DL approaches are robust to the natural variations of the automatically learned features in those applications, and the same DL architecture can be applied to different data types.
In this section, I will review some recent efforts in DL research and applications
regarding different perspectives.
2.3.1 Success Factors

The recognized success factors of deep learning can be summarized in the following three points:
• Enormous labeled datasets have become available for training deep neural networks, e.g.,

– the ImageNet dataset for image classification, with about 14.2 million images in 21,841 categories

– the YouTube-8M dataset1 with 7 million videos

• The rapid development of hardware acceleration and the massive amounts of computational power now available

– Using GPUs for neural network computation has become common

– The training time of very complicated neural networks has been reduced significantly: ten years ago we needed several months, whereas today we count it in days

– The rapid development of cloud, high-performance, and distributed computing methods
• Working ideas on how to train deep neural networks:

– Stacked Restricted Boltzmann Machines (RBM), Hinton et al. 2006 [HS06]

– Stacked Autoencoders (AE), Bengio et al. 2008 [VLBM08]
The two works [HS06] and [VLBM08] are considered the cornerstones of the recent deep learning revolution: they proved that training deep neural networks is possible and indicated further research directions. At the same time, the rapid development of high-performance computing and the appearance of large-scale labeled datasets contributed to this revolution as well.
1https://research.google.com/youtube8m/
Figure 2.34: Top-5 errors of essential DL models in ImageNet challenge
2.3.2 DL Applications

DL has achieved many outstanding successes in the fields of computer vision and speech recognition. ImageNet [DDS+09] is, in the real sense, the first large-scale dataset for visual recognition; it consists of 1.2 million training images covering 1000 object classes. In 2012, AlexNet won the ImageNet challenge and outperformed the second-place method by almost 10% in classification accuracy. Since then, DL models have continuously broken the records and ruled nearly all computer vision competitions. Figure 2.34 shows the accuracy of the ImageNet winners over the years. The winner of 2015, ResNet-152, achieved a top-5 classification error of only 3.57%, which is better than the human error of 5.1% on this task [RDS+15].
DL methods have also achieved great success in the fields of speech recognition and machine translation. For instance, on the popular speech recognition dataset TIMIT [Gar93], the recently developed deep learning approaches surpass all previous methods; specifically, they outperform the CRF-based methods by about 15% in phone error rate (PER). Similar breakthroughs have been obtained in the NLP field: for machine translation, significant improvements have been accomplished by the recent Neural Machine Translation approaches [BCB14, WSC+16].
Many other challenging problems that could not be solved efficiently before have been solved with DL in the past few years, for instance: image and video captioning [WYBM16, VRD+15], cross-domain image-to-image style transfer using Generative Adversarial Networks (GANs) [IZZE17], beating humans in the strategy game Go [SSS+17], achieving dermatologist-level classification of skin cancer [EKN+17], and many more, as described in Appendix C.
2.3.3 DL Frameworks

In the past few years, a good number of open-source libraries and frameworks for DL have been published. They provide a rich application environment for people to choose from.
Selected mainstream DL frameworks and SDKs are listed below:
• Tensorflow : https://www.tensorflow.org/
• Caffe : http://caffe.berkeleyvision.org/
• KERAS : https://keras.io/
• MXNET : https://mxnet.apache.org/
• Theano (stopped) : http://deeplearning.net/software/theano/
• Torch : http://torch.ch/
• PyTorch : http://pytorch.org/
• Chainer : http://chainer.org/
• DeepLearning4J : https://deeplearning4j.org/
• DIGITS : https://developer.nvidia.com/digits
• CNTK : https://github.com/Microsoft/CNTK
• MatConvNet : http://www.vlfeat.org/matconvnet/
Figure 2.35: Activity of DL frameworks. Left: arXiv mentions as of March 3, 2018 (past
3 months); Right: Github aggregate activity April - July 2017
Framework | Core Language | Platform | Interface | Distributed training | Model Zoo | Multi-GPU | Multi-threaded CPU
Caffe | C++ | Linux, MacOS, Windows | Python, Matlab | No | yes | Only data-parallel | yes
Tensorflow | C++ | Linux, MacOS, Windows | Python, Java, Go | yes | yes | Most flexible | yes
MXNet | C++ | Linux, MacOS, Windows, Devices | Python, Scala, R, Julia, Perl | yes | yes | yes | yes
Pytorch | C++ | Linux, MacOS, Windows | Python | yes | yes | yes | yes
Chainer | Python | Linux | Python | yes | yes | yes | openblas
CNTK | C++ | Windows, Linux | Python, C# | yes | yes | yes | yes

Table 2.2: DL framework features
• cuDNN : https://developer.nvidia.com/cudnn
Figure 2.35 demonstrates the activity around several mainstream DL frameworks, which gives some insight into their impact. Google's Tensorflow [ABC+16] is without doubt currently the most popular DL framework, from both the research and the development perspective.
Table 2.2 shows the different features of selected DL frameworks, such as the core language, supported platforms, programming interfaces, GPU support, etc.

Table 2.3 shows a speed evaluation of selected mainstream DL frameworks, performed by Maladkar et al. (2018). They used the CIFAR-10 dataset with 50,000 training samples and 10,000 test samples, uniformly distributed over ten classes. The same CNN was run across the different frameworks with GPU support; two GPU types, Nvidia K80 and P100, with CUDA and cuDNN [CWV+14] support, were used in this evaluation. The numbers in the table are the average time in ms for feature extraction using a ResNet-50 model on 1000 test images.

DL Library | K80/CUDA8/cuDNN6 | P100/CUDA8/cuDNN6
Caffe2 | 148 | 54
Chainer | 162 | 69
CNTK | 163 | 53
Gluon | 152 | 62
Keras(CNTK) | 194 | 76
Keras(TensorFlow) | 241 | 76
Keras(Theano) | 269 | 93
Tensorflow | 173 | 57
Theano(Lasagne) | 253 | 65
MXNet | 145 | 51
PyTorch | 169 | 51
Julia-Knet | 159 | n/a

Table 2.3: DL frameworks benchmarking. Average time (ms) for 1000 images: ResNet-50 feature extraction (Source: analyticsindiamag.com)

We can see that MXNet and Caffe2 demonstrate good processing speed on both GPUs; by contrast, Theano(Lasagne), Keras(Theano), and Keras(TensorFlow) obtained relatively poor results in terms of efficiency.
2.3.4 Current Research Topics

In this section, I summarize some currently popular research topics in the DL community, including region-based CNNs, deep generative models, semi-supervised and weakly supervised models, interpretability research, energy-efficient models, multi-task and multi-module learning, and a novel deep model, the Capsule Network.
2.3.4.1 Region-based CNN

This research topic has made significant contributions to several AI applications, e.g., autonomous driving, mobile vision, drones, medical imaging, etc. Girshick et al. (2014) proposed the Region-based Convolutional Neural Network (R-CNN) [GDDM14] for object recognition. R-CNN consists of three modules: candidate region generation, visual feature extraction from the regions using CNNs, and a set of class-specific SVMs for object classification.
Due to the low computation speed of R-CNN, the same authors later proposed the Fast R-CNN framework [Gir15] (2015), which builds on the R-CNN architecture and achieves a much faster processing speed. Fast R-CNN consists of convolutional and pooling layers, a selective search module for region proposals, and a sequence of fully connected layers.
In the same year, Ren et al. proposed Faster R-CNN [RHGS15], which uses a Region Proposal Network (RPN) for real-time object detection. The RPN is a fully convolutional network that can efficiently generate region proposals. Moreover, the whole system, including the RPN and a recognition network, can be optimized end-to-end. Many AI applications, such as autonomous driving and medical image processing, use this approach as their object detection engine. Other well-known approaches such as SSD [LAE+16] and YOLO [RDGF16] also belong to this category and have further improved the processing speed.
He et al. (2017) proposed Mask R-CNN [HGDG17] for instance segmentation. Mask R-CNN extends the Faster R-CNN architecture with an extra branch for predicting object masks, which established a new state of the art.
2.3.4.2 Deep Generative Models

In machine learning, generative models are used for data modeling via a probability density function; generally, such a model is a probabilistic model of the joint probability distribution over observation and target (label) values. The recent development of deep generative models has enabled many promising applications: we can now generate different types of images, human speech, music, poems, texts, and many other things using deep generative models.
Van den Oord et al. proposed WaveNet, a deep neural network for generating raw audio [VDODZ+16]. WaveNet is composed of a stack of CNN layers and a softmax distribution layer for its outputs. Since WaveNet models audio at the level of the raw waveform, it can be applied to generating any acoustic signal, such as human speech or music.
Gregor et al. proposed DRAW, a recurrent neural network for image generation [GDG+15]. DRAW is based on the Variational Auto-Encoder (VAE) architecture and uses RNNs for both the encoder and the decoder. Moreover, it introduces a human-like dynamic attention mechanism, which further improves the system's performance.
Goodfellow et al. proposed Generative Adversarial Networks (GANs) for estimating generative models via an adversarial process [GPAM+14]. GANs are generally referred to as an unsupervised deep learning approach, offering an alternative to maximum likelihood estimation techniques. In the standard GAN framework, there are two different models: a generator G, which tries to capture the real data distribution in order to generate realistic-looking fake samples, and a discriminator D, which tries to get better at distinguishing real samples from fake ones. G maps a latent space to the data space: it receives Gaussian noise z as input and applies transformations to it to generate new samples, while D maps a given sample to the probability P that it comes from the real data distribution. In the ideal setting, given enough training epochs, G would eventually produce samples looking so realistic that D could no longer distinguish between real and fake samples; D would then assign P = 0.5 to every sample, no matter whether it comes from the real or the generated data distribution. However, given the training instability inherent to GANs, this equilibrium is hard to reach and is hardly ever achieved in practice.
As illustrated in Figure 2.36, the two players D and G play a minimax game with the value function V(D, G), expressed as follows:

min_G max_D V(D, G) = E_{x∼P_r(x)}[log(D(x))] + E_{z∼P_z(z)}[log(1 − D(G(z)))] (2.28)

where P_z(z) represents the noise distribution used to sample G's input, and G(z) represents its output, a fake sample obtained by mapping the transformed input noise into the data space. In contrast, P_r(x) represents the real data distribution, and D(x) represents the probability P that sample x is a real sample from the training set. To maximize Equation 2.28, D's goal is to maximize the probability of correctly classifying a sample as real or fake, i.e., to assign P close to 1 to real images and P close to 0 to generated images. Conversely, to minimize Equation 2.28, G tries to minimize the probability of its generated samples being classified as fake, by fooling D into assigning them a P value close to 1.
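A minibatch estimate of the value function in Equation 2.28 can be written down directly. The discriminator outputs below are hand-picked toy values; at the ideal equilibrium D(·) = 0.5 everywhere, the estimate equals the well-known −log 4:

```python
import math

def gan_value(d_real, d_fake):
    # empirical estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))],
    # where d_real are D's outputs on real samples and d_fake on G's samples
    term_real = sum(math.log(p) for p in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return term_real + term_fake

# at the ideal equilibrium the discriminator outputs P = 0.5 everywhere
v = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(round(v, 4))  # -1.3863, i.e. -log 4
```

A discriminator that separates the two distributions well (e.g., D(x) ≈ 0.9 on real and ≈ 0.1 on fake samples) raises this value, which is exactly what D maximizes and G counteracts.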
Figure 2.36: Architecture of Generative Adversarial Networks (image source: Gharakhanian)
Beyond the vanilla GAN [GPAM+14], a number of improved versions have been proposed in recent years [SGZ+16]. The newly introduced approaches mainly address two problems: first, meaningfully correlating the loss metric with the generator's convergence and the sample quality; second, improving the stability of the optimization process. In practice, GANs can produce photorealistic images for applications such as the visualization of interior or industrial designs, shoes, bags, and clothing items.
2.3.4.3 Weakly Supervised Model, e.g., Deep Reinforcement Learning

Reinforcement Learning (RL) is categorized as a weakly (or semi-) supervised learning method. It uses a reward and punishment system to guide the next action generated by the learning model; it is mostly used for games and robots and usually solves decision-making problems.
Recently, Deep Reinforcement Learning (DRL), a combination of DNNs and RL, has become a hot topic due to its great success in mastering games. AI bots are beating human world champions and grandmasters in strategy and other games, for instance AlphaGo and AlphaGo Zero for the game of Go [SSS+17], Atari games [MKS+15], Dota 2 [Blo18], and chess and shogi [SHS+17].
Many semi-supervised and unsupervised techniques have been implemented based on the concept of RL. In RL there is no clearly defined loss function, which makes the learning process more difficult compared to fully supervised approaches. The fundamental differences between RL and fully supervised methods can be summarized as follows: first, we do not have full access to the objective function being optimized; second, learning is an interactive querying process in which the agent interacts with a state-based environment; third, the input depends on the previous actions. It is foreseeable that DRL will remain an essential research direction for robotics and automation systems for a long time to come.
2.3.4.4 Interpretable Machine Learning Research

Interpretable Machine Learning is a growing research area in machine learning that aims to present the reasoning of an ML system to humans, so that we can verify it and understand it better. Although DNNs have shown superior performance on a broad range of tasks, interpretability has always been the Achilles' heel of deep learning models. In DL, people tend to convert a new problem into an objective function and solve the problem by optimizing that objective function in an end-to-end fashion. This end-to-end learning strategy makes DNN representations a "black box": except for the final output (mostly the prediction results), it is hard for a human to understand the logic of the DNN's intermediate computations hidden inside the network (the states of the hidden layers).
In recent years, a growing number of researchers have realized that high model interpretability is of significant value in both theory and practice. For instance, in a control system, interpreting the ML model can help us identify its incorrect decisions, which is especially meaningful for automated medical systems and robots. Interpretability is crucial for applications where a single wrong decision can be extremely costly, e.g., self-driving cars.
According to [ZZ18], the recent studies on visual interpretability for DL can be roughly categorized as follows:
• Visualizing the intermediate representations of a DNN. Gradient-based methods such as [ZF14, SVZ13, SDBR14] are the mainstream approaches in this category. These methods compute the gradients of the score of a given convolutional unit with respect to the input image and use these gradients to estimate the image appearance that maximizes the unit's score. Olah et al. (2017) [OMS17] proposed a toolbox of existing techniques for visualizing the feature patterns encoded in the different convolutional layers of a pre-trained DNN. The up-convolutional net [DB16] proposed by Dosovitskiy et al. (2016) is another typical technique for visualizing CNN representations: it inverts CNN feature maps back to images, so we can regard up-convolutional nets as a tool that indirectly illustrates the image appearance corresponding to a feature map. The mentioned methods mainly invert the feature maps of a convolutional layer back to the input image, or synthesize the image that maximizes the score of a given unit in a pre-trained CNN.
VisualBackProp (2017) [BCC+16] is another, similar approach, which backpropagates the image (feature-map) values instead of the gradients. With VisualBackProp we can obtain visualizations of the higher abstractions from the deeper layers at a higher resolution. Due to its high computational efficiency, VisualBackProp can serve as an efficient debugging tool for the development of CNN-based systems.
• Diagnosis of CNN representations. Some recent methods go beyond visualization and diagnose CNN representations in order to extract understandable insights about the features encoded in a deep model. For example, the work in [Lu15, AR15] studied the feature distributions of different categories/attributes in the feature space of a pre-trained CNN. The work in [SVK17, KL17] computes adversarial samples for pre-trained CNNs in order to estimate the vulnerable points in the feature space; in other words, these studies determine the minimum noise perturbation of the input image to which the final prediction is sensitive. Influence functions can also provide plausible ways to create training samples that attack the learning of CNN models. Such samples are so-called adversarial samples, which have become a derived research topic in the interdisciplinary field of internet security and ML.
• Disentanglement of the "mixture of patterns" encoded in the learned filters of CNNs. These studies mainly disentangle the complex representations in convolutional layers and transform network representations into interpretable graphs. Related work: [ZCS+17, FH17]
• Semantic-level middle-to-end learning via human-computer interaction. A
clear semantic disentanglement of CNN representations may further enable
middle-to-end learning of neural networks with weak supervision. Related
work: [ZCWZ17, ZCZ+17]
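As a toy illustration of the gradient-based family above, the sketch below estimates the gradient of a score function with respect to its input by central differences; the score function is a hypothetical hand-written stand-in, and in practice these gradients are obtained via backpropagation through the real network:

```python
def toy_score(x):
    # stand-in for the class score produced by a network
    return 2.0 * x[0] + x[1] ** 2

def saliency(score, x, eps=1e-6):
    # magnitude of d(score)/d(input), estimated by central differences;
    # large values mark the input components the score is most sensitive to
    grads = []
    for i in range(len(x)):
        up, dn = list(x), list(x)
        up[i] += eps
        dn[i] -= eps
        grads.append((score(up) - score(dn)) / (2.0 * eps))
    return [abs(g) for g in grads]

print(saliency(toy_score, [1.0, 3.0]))  # approximately [2.0, 6.0]
```

Applied per pixel, such sensitivity maps are exactly what the gradient-based visualization methods render as heat maps over the input image.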
2.3.4.5 Energy Efficient Models for Low-power Devices

State-of-the-art deep models are computationally expensive and consume large amounts of storage space. At the same time, deep learning is strongly demanded by numerous applications in areas such as mobile platforms, wearable devices, autonomous robots, and IoT devices. How to efficiently run deep models on such low-power devices has become a challenging research problem. Recent research in this field can be roughly divided into two groups. The first is lightweight DNNs, which reduce the number of parameters through a compact design of the network architecture; these designs rely on full-precision floating-point numbers and try to prevent loss of accuracy. The second possible solution is low-bit neural networks: the network information is compressed by avoiding the common use of full-precision floating-point weights, which require 32 bits of storage. Instead, quantized floating-point numbers with lower precision (e.g., 8 bits of storage) or even binary (1-bit) weights are used in these approaches.
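A weight binarization in the spirit of these approaches can be sketched in a few lines; this follows the XNOR-Net scaling idea, but simplified to a single scalar factor rather than the channel-wise factors of the original paper:

```python
def binarize(weights):
    # approximate full-precision weights w by alpha * sign(w), where alpha
    # is the mean absolute value of the weights (computed channel-wise in
    # XNOR-Net; a single scalar here for simplicity)
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0.0 else -1.0 for w in weights]
    return alpha, signs

alpha, signs = binarize([0.5, -1.5, 1.0])
print(alpha, signs)  # 1.0 [1.0, -1.0, 1.0]
```

Since the signs need only 1 bit each and alpha is a single float, the weight storage shrinks by roughly 32×, and the dot products reduce to sign operations plus one scaling multiplication.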
SqueezeNet was presented by Iandola et al. [IHM+16] in 2016. The authors
replace a significant portion of the 3×3 filters in convolutional layers with
smaller 1×1 filters and reduce the number of input channels to the remaining
3×3 filters, lowering the number of parameters. Additionally, they delay
downsampling to late layers in order to maximize accuracy given the smaller
number of weights. Further compression is achieved by applying deep compression
[HMD15] to the model, for an overall model size of 0.5 MB. MobileNet was
introduced by Howard et al. [HZC+17]. It applies depth-wise separable
convolutions, in which a single 3×3 filter is applied to each input channel and
a 1×1 convolution is then employed to combine the outputs. Zhang et al.
[ZZLS17] use channel shuffling to achieve group convolutions in addition to
depth-wise convolution; their ShuffleNet achieves a comparably lower error rate
than MobileNet for the same number of operations. In 2018, Tan et al. proposed
MnasNet [TCP+18], whose network structure was created by an automatic search
method inspired by NasNet [ZVSL17]; the authors extended NasNet to make it
suitable for mobile devices. In terms of accuracy, number of parameters and
running speed, it surpasses all previously hand-designed lightweight DNNs,
including the recently released MobileNet V2 [SHZ+18] from Google. The
mentioned approaches reduce memory requirements, but still require GPU hardware
for efficient training and inference; specific acceleration strategies for CPUs
still need to be developed for these methods.
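The parameter savings of the depth-wise separable convolutions used by MobileNet can be verified by simple counting (bias terms omitted; the 128-channel layer is an arbitrary example, not a layer from the paper):

```python
def conv_params(c_in, c_out, k):
    # standard convolution: one k x k filter per (input, output) channel pair
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # depthwise: one k x k filter per input channel,
    # followed by a pointwise 1 x 1 convolution combining the channels
    return c_in * k * k + c_in * c_out

std = conv_params(128, 128, 3)                  # 147456
sep = depthwise_separable_params(128, 128, 3)   # 1152 + 16384 = 17536
print(round(std / sep, 1))                      # 8.4x fewer parameters
```

The saving grows with the kernel size and the number of output channels, which is why the technique is so effective for 3×3 layers in wide networks.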
In contrast, approaches that use binary instead of full-precision weights can
achieve both compression and acceleration. However, the drawback is usually a
severe drop in accuracy. For instance, in the BNN presented by Hubara et al.
[HCS+16], both weights and activations are restricted to either +1 or -1.
XNOR-Net, proposed by Rastegari et al. [RORF16], builds on a similar idea; the
authors suggest a channel-wise scaling factor to improve the approximation of
the full-precision weights, but require the weights between layers to be stored
as full-precision numbers. DoReFa-Net, presented by Zhou et al. [ZWN+16],
focuses on quantizing the gradients as well, with different bit-widths (down to
binary values) for weights and activations. Lin et al. proposed ABC-Net
[LZP17], which achieves a drop in top-1 accuracy of only about 5% on the
ImageNet dataset compared to a full-precision network with the ResNet-18
architecture. However, this result relies on five binary weight bases, which
significantly increases model complexity and size. Finding a way to train an
accurate binary neural network therefore remains an unsolved task.
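The core binarization step with a channel-wise scaling factor, in the spirit of XNOR-Net [RORF16], can be sketched in a few lines. This is an illustration of the weight approximation only, not the full training procedure, and the tensor shape is an assumption:

```python
import numpy as np

def binarize(w):
    """Approximate each filter w by alpha * sign(w), with
    alpha = mean(|w|) per output channel, the best L2 approximation
    of w by a scaled binary tensor."""
    alpha = np.abs(w).reshape(w.shape[0], -1).mean(axis=1)  # one alpha per filter
    b = np.where(w >= 0, 1.0, -1.0).astype(np.float32)
    return b, alpha

w = np.random.randn(64, 3, 3, 3).astype(np.float32)
b, alpha = binarize(w)
w_hat = b * alpha[:, None, None, None]
# b needs 1 bit per weight instead of 32, plus one full-precision
# alpha per output channel; w_hat is the reconstructed approximation
```

During training, such schemes typically keep full-precision weights for the gradient update and binarize them on the forward pass.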
2.3.4.6 Multitask and Multi-module Learning
Multitask learning is inspired by biological neural networks, e.g., the human
vision system, which can simultaneously perform different tasks such as reading
text, recognizing objects and distinguishing facial expressions.
Figure 2.37: Multitask Learning Example
Figure 2.37 shows an example of multitask learning, where the tasks are
distinguishing between {dog or human} and {boy or girl}, respectively. We can
use a single neural
network to solve both tasks through a joint feature learning process. This way,
more general features can be learned from the layers shared by the two tasks,
including features that might not be easy to obtain from the original task
alone. In this example, the features learned for the {boy or girl} task are
beneficial for distinguishing between a girl and a long-haired dog in the other
task.
Furthermore, the parts of the input that are unrelated to a given task act as
noise during its learning process, and such noise can improve the generalization
ability of the model. Another benefit concerns the optimization objective: the
local minima of the different tasks lie at different locations, so the
interaction of multiple tasks can help the optimization escape poor local
minima.
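The shared-trunk idea can be sketched as a two-head network. The following is a toy NumPy forward pass; the layer sizes, the random inputs and the two-class heads are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# shared trunk: its features are updated by the gradients of BOTH tasks
W_shared = rng.standard_normal((16, 8))
# one small classification head per task
W_task_a = rng.standard_normal((8, 2))   # e.g. {dog, human}
W_task_b = rng.standard_normal((8, 2))   # e.g. {boy, girl}

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    h = np.maximum(x @ W_shared, 0.0)        # shared ReLU features
    return softmax(h @ W_task_a), softmax(h @ W_task_b)

x = rng.standard_normal((4, 16))             # a batch of four inputs
p_a, p_b = forward(x)
# joint training minimizes loss_a + loss_b, so W_shared receives
# learning signal from both tasks
```

Because the summed loss backpropagates through `W_shared`, the trunk is pushed toward features that serve both tasks at once.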
Many recently proposed DL applications apply multitask learning techniques,
for instance Faster R-CNN [RHGS15] and Mask R-CNN [HGDG17] for object detection
and instance segmentation, the Neural Visual Translator [WYBM16] for image
captioning, SEE [BYM18] for scene text detection and recognition, ACMR
[WYX+17] for cross-modal multimedia retrieval, and many others.
2. DEEP LEARNING: THE CURRENT HIGHLIGHTING APPROACH OF ARTIFICIAL INTELLIGENCE
2.3.4.7 Capsule Networks
Sabour et al. proposed Capsule Networks (CapsNet) [SFH17] in 2017, which can
be considered one of the most recent theoretical breakthroughs in DL. A CapsNet
usually contains several convolutional layers and one capsule layer at the end.
This work is intended to address several current limitations of CNNs. Standard
CNNs learn so-called invariant representations:
Represent(X) = Represent(Transform(X)) (2.29)
where X is the input to a CNN. If we want a CNN to recognize inputs with small
translations, rotations or distortions, we usually first need to augment the
input data with those transformations and then let the CNN learn the
transformed representations. Thus, the CNN itself cannot distinguish an input
with geometric transformations from one without. CapsNet, on the contrary, is
intended to learn so-called equivariant representations, formed as:
Transform(Represent(X)) = Represent(Transform(X)) (2.30)
where we will not lose the location and transformation information of an ob-
ject. It uses layers of capsules instead of layers of neurons, by which a capsule
consists of a set of neurons. In the context of computer vision, a capsule can
be considered as a visual entity. Active low-level entities make predictions, and
upon agreeing to multiple predictions, a high-level entity becomes active. Low-
level capsules are responsible for significant changes of object location in visual
content occurred. High-level capsules are accountable for the changes in visual
content. A routing-by-agreement mechanism is used for the communication of
different levels of capsule layers, instead of using the standard pooling method,
since the conventional pooling methods result in losing location information of
low-level entities. While the location information may express essential semantic
properties for visual recognition tasks.
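One concrete ingredient of CapsNet is the squashing non-linearity applied to each capsule's output vector [SFH17]; a direct NumPy transcription (the sample vector is illustrative):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squashing non-linearity from Sabour et al. [SFH17]: shrink a
    capsule's output vector s to length < 1 while keeping its
    orientation, so that the vector length can encode the probability
    that the entity exists."""
    norm2 = np.sum(s * s, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

s = np.array([[0.0, 3.0, 4.0]])       # capsule output of length 5
v = squash(s)
print(float(np.linalg.norm(v)))       # 25/26, direction unchanged
```

Long vectors are squashed to a length just below 1 and short vectors toward 0, while the direction, which encodes the entity's pose, is preserved.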
CapsNet achieved promising results on several small and medium-sized datasets,
such as MNIST and multi-digit MNIST. However, before it can become widely
adopted, its applicability still needs to be proven on large-scale datasets
such as ImageNet, and its value needs to be shown in real-world applications.
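The distinction between equations (2.29) and (2.30) can be made concrete with a toy 1-D convolution; the signal and filter are hypothetical, and the zero padding in the example avoids boundary effects:

```python
import numpy as np

def conv1d(x, k):
    # 'valid' cross-correlation, the core operation of a CNN layer
    return np.array([x[i:i + len(k)] @ k for i in range(len(x) - len(k) + 1)])

x = np.array([0., 1., 3., 1., 0., 0., 0., 0.])   # toy 1-D "image"
k = np.array([1., 2., 1.])                       # toy filter
shift = 2
x_shifted = np.concatenate([np.zeros(shift), x])[:len(x)]   # Transform(X)

# Equivariance: shifting the input shifts the feature map by the same amount,
# i.e. Transform(Represent(X)) == Represent(Transform(X))
print(np.allclose(conv1d(x_shifted, k)[shift:], conv1d(x, k)[:-shift]))  # True
# Invariance: after global max pooling the shift is no longer visible
print(conv1d(x, k).max() == conv1d(x_shifted, k).max())                  # True
```

Convolution alone is translation-equivariant; it is the pooling stages of a standard CNN that discard the location information, which is exactly what capsules try to retain.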
2.3 Recent Development in the Age of Deep Learning
2.3.5 Current Limitations of DL
The current limitations of DL approaches can be summarized along the following
aspects:
• Many hyperparameters, such as the network architecture, learning rate,
initialization scheme, loss function, activation function etc., have to be
determined when training a DL model, and tuning these hyperparameters
requires expert knowledge.
Although recently proposed techniques such as AutoML [TCP+18] allow us to
search for optimal hyperparameters automatically, the search process is
extremely time- and cost-consuming. For instance, Google researchers utilized
800 modern GPUs for their neural architecture search on the CIFAR-10 dataset
[ZL16], and the search took several months.
• As mentioned in section 2.3.4, adversarial samples are a class of samples
that are maliciously designed to attack machine learning models. It is almost
impossible to distinguish real from adversarial samples with the naked eye,
yet such adversarial samples lead a DL model, but not a human, to a wrong
judgment. Given its particular importance in many application areas, the
attack on and defense against adversarial samples has become a new and active
research field.
• DL models have very limited interpretability. Some recent methods have
achieved promising results on this problem (cf. section 2.3.4.4), and from a
long-term perspective, interpretability will remain an actively studied topic
in the community.
• Gary Marcus published a critical review of DL methods [Mar18], in which he
discusses their nature and limitations: requiring large amounts of data, not
being sufficiently transparent, not integrating well with prior knowledge,
struggling with open-ended inference, and being unable to distinguish
causation from correlation [Mar18]. He also notes that DL assumes a stable
world and works as an approximation, but its answers often cannot be fully
trusted; it is difficult to engineer with and carries the risk of excessive
hype. He suggests reconceptualizing DL by considering the possibilities of
unsupervised learning, symbol manipulation and hybrid models, drawing insights
from cognitive science and psychology, and taking on bolder challenges
[Mar18].
2.3.6 Applicable Scenarios of DL
In this sub-section, we summarize some applicable scenarios of DL according to
its advantages as well as its limitations.
DL can be employed in a number of situations where machine intelligence is
beneficial:
• Learning a function that maps well-defined inputs to well-defined outputs.
– Classification tasks, e.g., labeling images of bird breeds
– Regression tasks, e.g., analyzing a loan application to predict the likelihood of future default
• Absence of a human expert, e.g., navigation on Mars; or humans are unable
to explain their expertise, e.g., speech recognition, computer vision.
• The task provides clear feedback with clearly definable goals and metrics.
• No need for a detailed explanation of how the decision was made.
– For instance, an unexplainable model may not be suitable for medical
imaging tasks such as breast cancer prediction on MRI.
• The problem size is too broad for our limited reasoning capabilities; on
the other hand, large (digital) datasets containing input-output pairs exist
or can be created.
• DL (or ML) systems are less effective when the task requires long chains of
reasoning or complex planning that rely on common sense or background
knowledge unknown to the computer.
• Tasks that require no specialized dexterity, physical skills, or mobility.
This is due to the current limitations of robotics, which will certainly
improve in the future.
3
Assignment to the Research
Questions
In this chapter, the selected publications from Appendix B are assigned to the research questions.
• Q1: DL is data-hungry; how can we alleviate the reliance on substantial
data annotations, e.g., through synthetic data and/or through unsupervised
and semi-supervised learning methods?
Q2: How can we perform multiple computer vision tasks with a uniform
end-to-end neural network architecture?
Publication:
– “SceneTextReg: A Real-Time Video OCR System” [YWBM16], cf.
chapter 4
– “SEE: Towards Semi-Supervised End-to-End Scene text Recognition”
[BYM18], cf. chapter 5
• Q3: How can we apply DL models on low-power devices such as smartphones,
embedded devices, wearables, and IoT devices?
Publication:
– “BMXNet: An Open-Source Binary Neural Network Implementation
Based on MXNet” [YFBM17b], cf. chapter 6
– “Learning to Train a Binary Neural Network” [BYBM18] and “Back to
Simplicity: How to Train Accurate BNNs from Scratch?” [BYBM19],
cf. chapter 6
• Q4: Can DL models handle multimodal and cross-modal representation learning
tasks?
Publication:
– “Image Captioning with Deep Bidirectional LSTMs and Multi-Task
Learning” [WYM18], cf. chapter 7
– “A Deep Semantic Framework for Multimodal Representation Learning”
[WYM16a], cf. chapter 8
• Q5: Can we effectively and efficiently apply multimedia analysis and DL
algorithms in real-world applications?
Publication:
– “Automatic Online Lecture Highlighting Based on Multimedia Analysis”
[CYM18], cf. chapter 9
– “Recurrent Generative Adversarial Network for Learning Imbalanced
Medical Image Semantic Segmentation” [RYM19b], cf. chapter 10
4
SceneTextReg
In this paper, we present a system for real-time video text recognition. The
system is based on the standard workflow of a text spotting system, which
includes text detection and word recognition procedures. We apply deep neural
networks in both procedures. Our current implementation demonstrates real-time
performance for recognizing scene text using a standard laptop with a webcam.
The word recognizer achieves results quite competitive with state-of-the-art
methods while only using synthetic training data.
4.1 Contribution to the Work
• Main contributor to the formulation and implementation of research ideas
• Main contributor to the conceptual implementation
• Main contributor to the technical implementation
• Core maintainer of the software project
4.2 Manuscript
This manuscript is an extended version of the original paper, in which I
elaborated on the following sections: system design, evaluation and future
work, to further improve the legibility of the article. In addition to the
manuscript, I prepared a demo video to present the proposed real-time system.1
1https://youtu.be/fSacIqTrD9I
SceneTextReg: A Real-Time Video OCR System
Haojin Yang, Cheng Wang, Christian Bartz, Christoph Meinel
Hasso Plattner Institute (HPI), University of Potsdam, Germany
P.O. Box 900460, D-14440 Potsdam
{haojin.yang, cheng.wang, meinel}@hpi.de, {christian.bartz}@student.hpi.uni-potsdam.de
ABSTRACT
We present a system for real-time video text recognition. The system is based on the standard workflow of a text spotting system, which includes text detection and word recognition procedures. We apply deep neural networks in both procedures. In the text localization stage, textual candidates are roughly captured by using a Maximally Stable Extremal Regions (MSERs) detector with a high recall rate; false alarms are then eliminated by using a Convolutional Neural Network (CNN) verifier. For word recognition, we developed a skeleton-based method for segmenting text regions from their background; a CNN+LSTM (Long Short-Term Memory) based word recognizer is then utilized for recognizing the texts. Our current implementation demonstrates real-time performance for recognizing scene text by using a standard laptop with a webcam1. The word recognizer achieves competitive results to state-of-the-art methods by only using synthetic training data.
CCS Concepts
• Computing methodologies → Computer vision; Visual content-based indexing and retrieval; • Computer systems organization → Real-time systems;
Keywords
Video OCR; Multimedia Indexing; Deep Neural Networks
1. INTRODUCTION
The amount of video data available on the World Wide Web (WWW) is growing rapidly. According to official YouTube statistics, 100 hours of video are uploaded every minute. Therefore, how to efficiently retrieve video data on the WWW or within large video archives has become an essential and challenging task.
1A demo video is prepared: https://youtu.be/fSacIqTrD9I
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
MM '16, October 15-19, 2016, Amsterdam, Netherlands
© 2016 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-3603-1/16/10.
DOI: http://dx.doi.org/10.1145/2964284.2973811
On the other hand, due to the rapid popularization of smart mobile and wearable devices, large amounts of self-recorded “lifelogging” videos are created. Such video data generally lacks metadata for indexing, since the only searchable textual content is often the title given by the uploader, which is typically brief and subjective. A more general solution is highly desired for gathering video metadata automatically.
Text in video is one of the most important high-level semantic features, as it directly depicts the video content. In general, text displayed in a video can be categorized into scene text and overlay text (or artificial text). In contrast to overlay text, scene text is often more challenging to detect and recognize. Numerous problems affect the recognition results: for example, text appearing in a natural scene image can be very small with highly varying contrast, and motion of the camera may affect the size, shape and brightness of the text content and may lead to geometric distortion. All of these factors have to be considered in order to obtain a correct recognition result.
Most of the proposed scene-text recognition methods can be briefly divided into two categories, based either on connected components (CCs) or on sliding windows. CC-based approaches include the Stroke Width Transform (SWT) [5], MSERs [17], Oriented Stroke [18] etc. One of the significant benefits of CC-based methods is their computational efficiency, since detection is often a single pass across the image pixels. Sliding-window based methods, e.g., [20, 3, 19, 9], usually apply representative visual features to train a machine learning classifier for text detection. Here hand-crafted features [19, 14, 1] as well as deep features [20, 9] can be applied, and text regions are detected by scanning the whole image with a sub-window at multiple scales with potential overlap. In [20, 3, 9], sliding-window based methods with deep features achieved promising accuracy for end-to-end text recognition. [22] propose to treat scene text detection as a semantic segmentation problem, in which their Fully Convolutional Network (FCN) performs per-pixel prediction to classify text and background. However, these approaches can hardly achieve sufficient performance for real-time applications due to their expensive computation time.
In our approach, we intend to take advantage of both categories, i.e., the computational benefit of CC-based algorithms and the powerful text-classification ability of deep features. The demonstrated system achieves real-time performance2 on a standard laptop (3.2 GHz CPU×4, 8 GB RAM, NVIDIA GeForce 860M) with a webcam.
2Similar to [17], we consider a video text recognition system real-time capable if its response time is comparable to that of a human.
Figure 1: Network architecture of the verification CNN model. The numbers in the layer names, e.g., “conv1 64x5x5”, describe the following information: layer type, the number of output feature maps, and the convolution kernel width and height. “fc” indicates a fully-connected layer, followed by the number of outputs in the layer name.
Figure 2: CNN based text candidate verification.
The rest of the paper is organized as follows: section 2 demonstrates the overall system design, section 3 presents the evaluation results using open benchmark datasets, and section 4 concludes the paper with an outlook on future work.
2. SYSTEM DESIGN
In this section, we describe the detailed workflow of the proposed system and report evaluation results on the ICDAR 2015 Robust Reading Competition Challenge 2 - Task 3 “Focused Scene Word Recognition” [12].
2.1 Text Detection
In [20, 10, 9], the authors aimed to achieve better end-to-end text recognition accuracy. Therefore, in the text detection step, their systems were tuned to produce text candidates with high recall, and the subsequent recognition engines further eliminate false alarms. Since our goal is to design a text spotting system with real-time capability, and the recognition procedure is often time-consuming, we instead keep the text detection result as accurate as possible and only pass text candidates with high confidence to the recognition stage. We apply an MSERs [15] based detector to roughly detect character candidates from the input video frame with a high recall rate. In order to increase the recall rate, we apply both RGB and HSV color channels for ERs (Extremal Regions) detection. All candidate regions are further verified by using a grouping method and a Convolutional Neural Network (CNN) [13] classifier (cf. Figure 2). We utilize a CNN architecture inspired by [6], which consists of two convolutional and three fully-connected layers, as shown in Figure 1. The corresponding hyperparameters, such as the filter size and the number of filters, are also given in the figure; they were determined empirically.
To prepare the training samples, we applied a coarse-to-fine strategy. We created text image samples by using our sample generator, which will be discussed in detail in the subsequent sections. We first used 5000 samples per class to train an initial text verification model. For collecting non-text samples, we applied our initial model to ImageNet [4] images and manually collected the false positives. We iteratively updated the model by increasing both the positive and negative training samples until the model achieved good performance on the testing dataset (150k samples per class). We used around 2 million non-text samples and 4 million text samples to train our final model. The achieved classification accuracy and F1-score are about 99.3% and 99.0%, respectively.
2.2 Text Segmentation
We developed a novel skeleton-based approach for text segmentation, which simplifies the subsequent OCR process. In short, we determine the text gradient direction for each text candidate by analyzing the content distribution of its skeleton map. We then calculate the threshold value for seed selection by using the skeleton map created with the correct gradient direction. Subsequently, a seed-region growing procedure starts from each seed pixel and extends the seed region in its north, south, east and west directions. The region iteratively grows until it reaches the character boundary. This method achieved first place in the ICDAR 2011 text segmentation challenge for born-digital images. A more detailed description of this method can be found in [21].
2.3 Word Recognition
In this step, we first separate the verified text candidates into words. Word recognition is then accomplished by joint training of an RCNN (Recurrent Convolutional Neural Network) model, which consists of a CNN and a Long Short-Term Memory (LSTM) network [7], followed by a standard spell checker.
Figure 3 depicts the CNN part of the proposed recognition network, which serves as a visual feature extractor. This CNN consists of five conv-layers and three fully-connected layers. Max-pooling is used after the 1st, 2nd and 4th conv-layer; the ReLU activation function [16] and Batch Normalization (BN) [8] are applied after every conv- and fc-layer except the last one throughout the network. The final output feature vector has a dimension of 1794 and is further fed into an LSTM network, depicted in Figure 4.
Figure 3: Network architecture of the recognition CNN model, which serves as the visual feature extractor in the whole recognition network. The numbers in the layer names, e.g., “conv1 64x5x5”, describe the following information: layer type, the number of output feature maps, and the convolution kernel width and height. “fc” indicates a fully-connected layer, followed by the number of outputs in the layer name.
Figure 4: Network architecture of the recognition LSTM model. Both sentence (word embedding) and CNN visual features (visual embedding) are fed into the LSTM layers, which constitutes a multimodal (visual-language) learning problem.
In the LSTM network, the input “CNN image features” denotes the CNN network output, which is fed into the LSTM layer “lstm joint”. The word input is embedded using an embedding layer, and the embedded word vectors are fed into an LSTM layer, where a sequential word representation is obtained. Subsequently, a joint representation of the image and text modalities is learned in the LSTM layer “lstm joint”. The final output of the LSTM network is a softmax distribution over 78 entries, including English characters, numbers and other special characters.
To improve the performance of our recognition network, we developed a data engine which generated 9 million synthetic training samples by considering different scene factors, discussed in detail in the next section.
2.4 Synthetic Data Generation
The commonly used benchmark datasets for scene text recognition, such as the ICDAR dataset, consist of a training dataset containing images used for training the system and a test dataset that acts as the benchmark. These training datasets usually do not contain more than 2000 image samples, which is not enough for building a deep learning model with tens of millions of parameters. It is thus necessary to obtain more training data with the same visual complexity as real-world images. Image portals such as Flickr3 or Instagram4 provide many user-uploaded photos that may contain scene text, but none of these images has annotations. Therefore, ground truth usable for training the deep model would have to be created manually, which is a time- and cost-consuming task. A better solution is to generate synthetic data whose properties are very close to real-world data. The synthetic data can then serve as a good estimator for the data a trained network has to expect when being fed with real-world data. We developed an image data engine for this job that is capable of performing the following operations:
• Generating arbitrary text strings using fonts with a broad range of varieties that can also be found in real-world images.
3https://www.flickr.com/
4https://www.instagram.com
Figure 5: Comparison of generated samples with real-world images from the ICDAR dataset of the robust reading challenge. (Left) Images generated by our sample generation tool. (Right) Images taken from the ICDAR dataset.
• Enhancing the realism of the rendered texts by using different colors, sizes, shadows, and borders with varying displacements.
• Using transformations such as rotation and distortion so that the model is robust to small projective distortions.
• Using background blending, reflection, and other visual effects to further enhance the similarity to real-world images.
We randomly select a font from Google Fonts5 for creating texts. Words can be generated in two different ways: either by supplying a wordlist file that contains words or sentences, or by specifying which kinds of characters to use (lower-case characters, upper-case characters, numbers), in which case the wordlist generation script generates random words up to a specified maximum length. The sample generation process subsequently renders a text and a shadow image using the supplied font and font size. Text image, shadow image and background image have randomly selected colors, but the background and shadow colors are constrained to have a minimum contrast to the text color. This minimum contrast value can be configured and can also be set to zero to allow any color as background and shadow color. Next, the process blends the shadow image and text image using alpha compositing, which yields the base samples.
In the next step, the process applies image transformations in the following order:
• Random Gaussian Blur: The created text sample is blurred using Gaussian blur, where the value of each pixel is computed as the weighted average of the neighborhood of that pixel. The weighted average of all neighborhood pixels can be obtained by convolving the image with a Gaussian filter. The Gaussian filter used in this work uses a random blur radius.
5https://fonts.google.com/
• Random Distortion: We add random distortions to the image by applying perspective transformations. This creates text image samples where the text is not perfectly aligned but slightly skewed and incidentally oriented.
• Random Rotation: The data engine can further rotate the already blurred and distorted text images by arbitrary degrees.
• Random Reflection: We use predefined reflection overlays that are alpha-composited with the generated image and simulate a white glare on the image. It is safe to assume that a white glare is sufficient to model a natural reflection, as we are using grayscale images.
• Background Blending: The last step in the generation process is to add a background to the sample. This background can be a plain color or a natural image. The first blending step adds a plain colored background with a random opacity to the generated text sample. In the second step, the tool decides whether to further add a natural background or not. The opacity and probability of adding a natural background image can be pre-configured.
The data engine can very well approximate the distribution of real-world data samples, and our experimental results also confirm this conclusion. More details can be found in section 3. Figure 5 shows a comparison of samples generated with our tool and samples from the ICDAR dataset.
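A toy version of the generation steps above can be sketched with Pillow. The parameter choices (font, offsets, blur radius, rotation range) and the crude contrast rule are illustrative assumptions; the real data engine additionally renders shadows, reflections and natural-image backgrounds:

```python
import random
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def make_sample(word, size=(128, 48)):
    """Render a word on a random background, then apply random
    rotation and Gaussian blur (a simplified sketch of the pipeline
    described in the text)."""
    bg = tuple(random.randint(0, 255) for _ in range(3))
    fg = tuple((c + 128) % 256 for c in bg)          # crude minimum-contrast rule
    img = Image.new("RGB", size, bg)
    ImageDraw.Draw(img).text((8, 12), word, fill=fg,
                             font=ImageFont.load_default())
    img = img.rotate(random.uniform(-5.0, 5.0), fillcolor=bg)   # random rotation
    img = img.filter(ImageFilter.GaussianBlur(random.uniform(0.0, 1.5)))
    return img

sample = make_sample("HELLO")
```

A full engine would additionally composite a shadow layer, apply perspective distortion, and alpha-blend natural-image backgrounds as described above.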
2.5 Loss Function
To train the neural networks, the commonly used cross-entropy loss is applied in this paper. In binary classification, where the number of classes K equals 2, the cross-entropy can be calculated as:
L_cross-entropy = −(y log(p) + (1 − y) log(1 − p))
For K > 2, we calculate a separate loss for each class label per observation and sum the results, which is expressed as follows:
L_cross-entropy = − Σ_{k=1}^{K} y_{i,k} log(p_{i,k})
where K denotes the number of classes, log is the natural logarithm, y_{i,k} is the binary indicator (0 or 1) of whether class label k is the correct classification for observation i, and p_{i,k} is the predicted probability that observation i is of class k.
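The multi-class formula can be transcribed directly into NumPy (the sample values are illustrative):

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Multi-class cross-entropy as in the equation above: y_true is a
    one-hot matrix of shape (N, K), p_pred the predicted probabilities.
    eps guards against log(0)."""
    p = np.clip(p_pred, eps, 1.0)
    return -np.sum(y_true * np.log(p), axis=1)

y = np.array([[0.0, 0.0, 1.0]])       # observation belongs to the third class
p = np.array([[0.1, 0.2, 0.7]])       # predicted distribution
print(float(cross_entropy(y, p)[0]))  # -log(0.7), about 0.357
```

For one-hot targets, only the log-probability of the correct class contributes to the sum, so the loss penalizes low confidence in the true label.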
3. EVALUATION
3.1 Implementation Details
The model training in this paper was performed on an Ubuntu 14.04/64-bit workstation with an Intel(R) Core(TM) i7-6900K CPU, 64 GB RAM and 2 TITAN X GPUs. We applied the deep learning framework Caffe [11] to train the deep networks. The real-time demo ran on a standard laptop computer with an Intel CPU×4 3.2 GHz, 8 GB RAM and an NVIDIA GeForce 860M GPU.
Description                      WRR
I.C.P:
  PhotoOCR [2]                   0.876
  Our result                     0.857
  Jaderberg's JOINT-model [9]    0.818
IC05:
  SRC-B-TextProcessingLab*       0.874
  Baidu-IDL*                     0.872
  Megvii-Image++ [22]            0.8283
  PhotoOCR [2]                   0.8283
  Our result                     0.8237
  NESP                           0.642
  PicRead                        0.5799
  PLT                            0.6237
  MAPS                           0.6274
  Feild's Method                 0.4795
  PIONEER                        0.537
  Baseline                       0.453
  TextSpotter [17]               0.2685
Table 1: Evaluation result on IC05. The baseline method is a commercially available OCR system. We intend to include results that are not constrained to a pre-defined lexicon; however, the methods marked with * are unpublished and therefore not distinguishable in this respect. (Last access: 05/12/2016)
3.2 Experimental Results
We evaluated our word recognizer using the ICDAR 2015 Robust Reading Competition Challenge 2 - Task 3 “Focused Scene Word Recognition” dataset (referred to as IC05) in an unconstrained manner6. Our word recognizer can provide output in both case-sensitive and case-insensitive mode. For the ICDAR 2015 dataset, we strictly followed the evaluation metric in a case-sensitive manner (referred to as “Word Recognition Rate” (WRR)). We also created evaluation results ignoring capitalization and punctuation differences on this dataset (referred to as I.C.P) and compared them with the best-known methods [2, 9].
Table 1 shows the comparison with previous methods on the IC05 dataset. In the I.C.P evaluation, our current result outperforms the JOINT-model from [9], but comes slightly behind Google's PhotoOCR by 1.9%. We did not consider the DICT-model from [9], since its results are constrained to lexicons.
According to the IC05 ranking results, our approach iscurrently not able to outperform the results created by com-mercial organizations such as SRC-B, Baidu-IDL, Megvii,and Google, but improves on the next best one (NESP) by18% of WRR. Our result is still competitive to commercialorganizations, only 0.46% behind Google and Megvii, re-garding that we have only used synthetic training data, andapplied more succinct network architecture by taking intoaccount of the execution speed. This result confirms thatby using carefully implemented synthetic training data, wecan achieve excellent recognition results very close to themodels trained using real-world data, as e.g., [2] (Google’ssystem) applied several millions of manually labeled sam-ples, and the processing time of their system is around 1.4seconds per image. To process a 640 × 480 image, the sys-tem from [22] (Megvii) needs about 20 seconds on CPU or
6The OCR results are not constrained to a given lexicon.
Figure 6: Exemplary recognition result of our system, using a webcam in real time.
1 second on GPU for text localization alone. Our system is therefore superior with regard to running time, as demonstrated by our demo video captured with a laptop and a video camera7. The OCR analysis is performed on every input frame from the camera. Figure 6 shows an exemplary recognition result. The methods from "Baidu-IDL" and "SRC-B" (marked with ∗ in Table 1) have achieved excellent results in terms of WRR. However, since these methods are unpublished, we cannot make an accurate comparison. Based on experience, they likely use extremely sophisticated neural network architectures to ensure high accuracy, which would be less efficient than our method.
4. CONCLUSION
In this paper, we presented a real-time video text recognition system that takes advantage of efficient classical computer vision methods and highly accurate deep neural networks. In particular, we demonstrated that, by using carefully designed synthetic training data, we are able to successfully build our CRNN model, which achieves recognition results similar to those of systems developed on large-scale real-world datasets.
As a next step, we see two possible research directions. First, can we integrate the detection and recognition tasks into a unified neural network, and optimize the whole problem in an end-to-end manner? We believe that this goal could be accomplished using so-called multi-task learning techniques in a fully supervised manner. Second, can we optimize the detection task without bounding box annotations, in a semi-supervised way? This idea is inspired by the human vision system. We never provide fully supervised information, such as the exact location and bounding box size of text or other objects, to teach our children. The human vision system instead learns to find text or other objects from cross-modal supervision signals such as context information, attention information, etc. We will explore the possibility of implementing a semi-supervised detection network using visual attention information.
We will showcase the proposed video OCR system interactively. The input video stream will be captured by
7https://youtu.be/fSacIqTrD9I
a live camera, and the OCR result will be directly displayed on the computer screen.
5. REFERENCES
[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Comput. Vis. Image Underst., 110(3):346–359, June 2008.
[2] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. PhotoOCR: Reading text in uncontrolled conditions. In The IEEE International Conference on Computer Vision (ICCV), December 2013.
[3] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. J. Wu, and A. Y. Ng. Text detection and character recognition in scene images with unsupervised feature learning. In Proc. of International Conference on Document Analysis and Recognition, ICDAR '11, pages 440–445, Washington, DC, USA, 2011. IEEE Computer Society.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In Proc. of International Conference on Computer Vision and Pattern Recognition, pages 2963–2970, 2010.
[6] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082v4, 2014.
[7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
[8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[9] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recognition. In International Conference on Learning Representations, 2015.
[10] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In Proc. of European Conference on Computer Vision (ECCV). Springer, 2014.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), ICDAR '15, pages 1156–1160, Washington, DC, USA, 2015. IEEE Computer Society.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
[15] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10):761–767, 2004. British Machine Vision Computing 2002.
[16] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[17] L. Neumann and J. Matas. Real-time scene text localization and recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3538–3545, June 2012.
[18] L. Neumann and J. Matas. Scene text localization and recognition with oriented stroke detection. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 97–104, Dec 2013.
[19] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1457–1464, Nov 2011.
[20] T. Wang, D. Wu, A. Coates, and A. Ng. End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3304–3308, Nov 2012.
[21] H. Yang, B. Quehl, and H. Sack. A skeleton based binarization approach for video text recognition. In Image Analysis for Multimedia Interactive Services (WIAMIS), 2012 13th International Workshop on, pages 1–4. IEEE, 2012.
[22] C. Yao, J. Wu, X. Zhou, C. Zhang, S. Zhou, Z. Cao, and Q. Yin. Incidental scene text understanding: Recent progresses on ICDAR 2015 robust reading competition challenge. arXiv preprint arXiv:1511.09207v2, 2016.
SEE
In this paper, we present an approach towards semi-supervised neural networks for scene text detection and recognition that can be optimized end-to-end. We show the feasibility of this approach through a range of experiments on standard benchmark datasets, where we achieved state-of-the-art results.
Moreover, I have prepared additional experimental results of the proposed approach from another of our papers [BYM17b] in Section 5.3, which demonstrate the robust generalization ability of SEE.
5.1 Contribution to the Work
• Significantly contributed to the conceptual discussion and idea development.
• Guided and supervised the technical implementation.
5.2 Manuscript
In addition to the manuscript, we prepared demo videos that demonstrate the proposed approach on different datasets.1
1SVHN: https://youtu.be/GSq3_GeDZKk
FSNS: https://youtu.be/5lt6dAbbsu4
ICDAR: https://youtu.be/LNNrZ7kcmbU
SEE: Towards Semi-SupervisedEnd-to-End Scene Text Recognition
Christian Bartz, Haojin Yang, Christoph Meinel
Hasso Plattner Institute, University of Potsdam
Prof.-Dr.-Helmert-Straße 2-3, 14482 Potsdam, Germany
{christian.bartz, haojin.yang, meinel}@hpi.de
Abstract
Detecting and recognizing text in natural scene images is a challenging, yet not completely solved task. In recent years several new systems that try to solve at least one of the two sub-tasks (text detection and text recognition) have been proposed. In this paper we present SEE, a step towards semi-supervised neural networks for scene text detection and recognition that can be optimized end-to-end. Most existing works consist of multiple deep neural networks and several pre-processing steps. In contrast to this, we propose to use a single deep neural network that learns to detect and recognize text from natural images in a semi-supervised way. SEE is a network that integrates and jointly learns a spatial transformer network, which can learn to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We introduce the idea behind our novel approach and show its feasibility by performing a range of experiments on standard benchmark datasets, where we achieve competitive results.
Introduction
Text is ubiquitous in our daily lives. Text can be found on documents, road signs, billboards, and other objects like cars or telephones. Automatically detecting and reading text from natural scene images is an important part of systems used for several challenging tasks, such as image-based machine translation, autonomous cars or image/video indexing. In recent years the task of detecting and recognizing text in natural scenes has seen much interest from the computer vision and document analysis community. Furthermore, recent breakthroughs (He et al. 2016a; Jaderberg et al. 2015b; Redmon et al. 2016; Ren et al. 2015) in other areas of computer vision enabled the creation of even better scene text detection and recognition systems than before (Gomez and Karatzas 2017; Gupta, Vedaldi, and Zisserman 2016; Shi et al. 2016). Although the problem of Optical Character Recognition (OCR) can be seen as solved for text in printed documents, it is still challenging to detect and recognize text in natural scene images. Images containing natural scenes exhibit large variations in illumination, perspective distortions, image quality, text fonts, diverse backgrounds, etc.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 1: Schematic overview of our proposed system. The input image is fed to a single neural network that consists of a text detection part and a text recognition part. The text detection part learns to detect text in a semi-supervised way, by being jointly trained with the recognition part.
The majority of existing research works developed end-to-end scene text recognition systems that consist of complex two-step pipelines: the first step detects regions of text in an image, and the second step recognizes the textual content of the identified regions. Most of the existing works concentrate on only one of these two steps.
In this paper, we present a solution that consists of a single Deep Neural Network (DNN) that can learn to detect and recognize text in a semi-supervised way. In this setting the network only receives the image and the textual labels as input. We do not supply any groundtruth bounding boxes; the text detection is learned by the network itself. This is contrary to existing works, where text detection and text recognition systems are trained separately in a fully supervised way. Recent work (Dai, He, and Sun 2016) showed that Convolutional Neural Networks (CNNs) are capable of learning how to solve complex multi-task problems while being trained in an end-to-end manner. Our motivation is to use these capabilities of CNNs to create an end-to-end trainable scene text recognition system that can be trained on weakly labelled data. In order to create such a system, we learn a single DNN that is able to find single characters, words or even lines of text in the input image and recognize their content. This is achieved by jointly learning a localization network that uses a recurrent spatial transformer (Jaderberg et al. 2015b; Sønderby et al. 2015) as attention mechanism and a text recognition network. Figure 1 provides a schematic overview of our proposed system.
Our contributions are as follows: (1) We present a novel
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)
end-to-end trainable system for scene text detection and recognition by integrating spatial transformer networks. (2) We propose methods that can improve and ease the work with spatial transformer networks. (3) We train our proposed system end-to-end, in a semi-supervised way. (4) We demonstrate that our approach is able to reach competitive performance on standard benchmark datasets. (5) We provide our code1 and trained models2 to the research community.
This paper is structured in the following way: We first outline work of other researchers that is related to ours. Second, we describe our proposed system in detail. We then show and discuss our results on standard benchmark datasets and finally conclude our findings.
Related Work
Over the course of the years, a rich environment of different approaches to scene text detection and recognition has been developed and published. Nearly all systems use a two-step process for performing end-to-end recognition of scene text. The first step is to detect regions of text and extract these regions from the input image. The second step is to recognize the textual content and return the text strings of the extracted text regions.
It is further possible to divide these approaches into three broad categories: (1) Systems relying on hand-crafted features and human knowledge for text detection and text recognition. (2) Systems using deep learning approaches, together with hand-crafted features, or two different deep networks for each of the two steps. (3) Systems that do not consist of a two-step approach but rather perform text detection and recognition using a single deep neural network. For each category, we will discuss some of these systems.
Hand-Crafted Features In the beginning, methods based on hand-crafted features and human knowledge were used to perform text detection and recognition. These systems used features like MSERs (Neumann and Matas 2010), Stroke Width Transforms (Epshtein, Ofek, and Wexler 2010) or HOG features (Wang, Babenko, and Belongie 2011) to identify regions of text and provide them to the text recognition stage of the system. In the text recognition stage, sliding window classifiers (Mishra, Alahari, and Jawahar 2012), ensembles of SVMs (Yao et al. 2014) or k-Nearest-Neighbor classifiers using HOG features (Wang and Belongie 2010) were used. All of these approaches rely on hand-crafted features with a large variety of hyperparameters that require expert knowledge to tune correctly for the best results.
Deep Learning Approaches More recent systems replace approaches based on hand-crafted features in one or both steps of recognition systems by approaches using DNNs. Gomez and Karatzas (Gomez and Karatzas 2017)
1https://github.com/Bartzi/see
2https://bartzi.de/research/see
propose a text-specific selective search algorithm that, together with a DNN, can be used to detect (distorted) text regions in natural scene images. Gupta et al. (Gupta, Vedaldi, and Zisserman 2016) propose a text detection model based on the YOLO architecture (Redmon et al. 2016) that uses a fully convolutional deep neural network to identify text regions.
Bissacco et al. (Bissacco et al. 2013) propose a complete end-to-end architecture that performs text detection using hand-crafted features. Jaderberg et al. (Jaderberg et al. 2015a; Jaderberg, Vedaldi, and Zisserman 2014) propose several systems that use deep neural networks for text detection and text recognition. In (Jaderberg et al. 2015a) Jaderberg et al. propose to use a region proposal network with an extra bounding box regression CNN for text detection. A CNN that takes the whole text region as input is used for text recognition. The output of this CNN is constrained to a pre-defined dictionary of words, making this approach applicable to only one given language.
Goodfellow et al. (Goodfellow et al. 2014) propose a text recognition system for house numbers, which has been refined by Jaderberg et al. (Jaderberg, Vedaldi, and Zisserman 2014) for unconstrained text recognition. This system uses a single CNN, taking the whole extracted text region as input, and recognizing the text using one independent classifier for each possible character in the given word. Based on this idea, He et al. (He et al. 2016b) and Shi et al. (Shi, Bai, and Yao 2016) propose text recognition systems that treat the recognition of characters from the extracted text region as a sequence recognition problem. Shi et al. (Shi et al. 2016) later improved their approach, firstly by adding an extra step that utilizes the rectification capabilities of Spatial Transformer Networks (Jaderberg et al. 2015b) for rectifying extracted text lines. Secondly, they added a soft-attention mechanism to their network that helps to produce the sequence of characters in the input image. In their work, Shi et al. make use of Spatial Transformers as an extra pre-processing step to make it easier for the recognition network to recognize the text in the image. In our system we use the Spatial Transformer as a core building block for detecting text in a semi-supervised way.
End-to-End Trainable Approaches The presented systems always use a two-step approach for detecting and recognizing text from scene text images. Although recent approaches make use of deep neural networks, they still rely on a huge amount of hand-crafted knowledge in either of the steps or at the point where the results of both steps are fused together. Smith et al. (Smith et al. 2016) and Wojna et al. (Wojna et al. 2017) propose an end-to-end trainable system that is able to recognize text on French street name signs, using a single DNN. In contrast to our system, it is not possible for this system to provide the location of the text in the image; only the textual content can be extracted. Recently, Li et al. (Li, Wang, and Shen 2017) proposed an end-to-end system consisting of a single, complex DNN that is trained end-to-end and can perform text detection and text recognition in a single forward pass. This system is trained using
groundtruth bounding boxes and groundtruth labels for each word in the input images, which stands in contrast to our method, where we only use groundtruth labels for each word in the input image, as the detection of text is learned by the network itself.
Proposed System
A human trying to find and read text will do so in a sequential manner. The first action is to put attention on a word, read each character sequentially and then attend to the next word. Most current end-to-end systems for scene text recognition do not behave in that way. These systems rather try to solve the problem by extracting all information from the image at once. Our system first tries to attend sequentially to different text regions in the image and then recognizes their textual content. In order to do this, we created a single DNN consisting of two stages: (1) text detection and (2) text recognition. In this section we will introduce the attention concept used by the text detection stage and the overall structure of the proposed system.
Detecting Text with Spatial Transformers
A spatial transformer, proposed by Jaderberg et al. (Jaderberg et al. 2015b), is a differentiable module for DNNs that takes an input feature map I and applies a spatial transformation to this feature map, producing an output feature map O. Such a spatial transformer module is a combination of three parts. The first part is a localization network computing a function floc that predicts the parameters θ of the spatial transformation to be applied. These predicted parameters are used in the second part to create a sampling grid, which defines a set of points where the input map should be sampled. The third part is a differentiable interpolation method that takes the generated sampling grid and produces the spatially transformed output feature map O. We shortly describe each component in the following paragraphs.
Localization Network The localization network takes the input feature map $I \in \mathbb{R}^{C \times H \times W}$, with $C$ channels, height $H$ and width $W$, and outputs the parameters $\theta$ of the transformation that shall be applied. In our system we use the localization network $f_{loc}$ to predict $N$ two-dimensional affine transformation matrices $A^n_\theta$, where $n \in \{0, \ldots, N-1\}$:

$$f_{loc}(I) = A^n_\theta = \begin{bmatrix} \theta^n_1 & \theta^n_2 & \theta^n_3 \\ \theta^n_4 & \theta^n_5 & \theta^n_6 \end{bmatrix} \quad (1)$$
N is thereby the number of characters, words or text lines the localization network shall localize. The affine transformation matrices predicted in this way allow the network to apply translation, rotation, zoom and skew to the input image.
In our system the $N$ transformation matrices $A^n_\theta$ are produced by using a feed-forward CNN together with a Recurrent Neural Network (RNN). Each of the $N$ transformation matrices is computed using the globally extracted convolutional features $c$ and the hidden state $h_n$ of each time step of the RNN:

$$c = f^{conv}_{loc}(I) \quad (2)$$
$$h_n = f^{rnn}_{loc}(c, h_{n-1}) \quad (3)$$
$$A^n_\theta = g_{loc}(h_n) \quad (4)$$
where gloc is another feed-forward/recurrent network. We use a variant of the well-known ResNet architecture (He et al. 2016a) as the CNN for our localization network. We use this network architecture because we found that with this structure our system learns faster and more successfully than in experiments with other network structures, such as the VGGNet (Simonyan and Zisserman 2015). We argue that this is due to the fact that the residual connections of the ResNet help with retaining a strong gradient down to the very first convolutional layers. The RNN used in the localization network is a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) unit. This LSTM is used to generate the hidden states hn, which in turn are used to predict the affine transformation matrices. We used the same network structure for all experiments we report in the next section. Figure 2 provides a structural overview of this network.
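The recurrent prediction of the N affine matrices (Equations 2–4) can be sketched in plain NumPy. This is a minimal illustration only: the single-gate tanh recurrence stands in for the LSTM, the layer sizes are assumed, and the random weight matrices are placeholders for learned parameters.

```python
import numpy as np

def predict_affine_matrices(c, n_steps, rng=np.random.default_rng(0)):
    """Toy sketch of Eqs. (2)-(4): from global conv features c, a
    recurrent cell produces one hidden state per time step, and a
    linear layer (standing in for g_loc) maps each hidden state to a
    2x3 affine matrix. Weights here are random placeholders."""
    d_c, d_h = c.shape[0], 16                 # assumed feature/hidden sizes
    W_in = rng.standard_normal((d_h, d_c)) * 0.1
    W_rec = rng.standard_normal((d_h, d_h)) * 0.1
    W_out = rng.standard_normal((6, d_h)) * 0.1
    h = np.zeros(d_h)
    matrices = []
    for _ in range(n_steps):                  # one step per text region
        h = np.tanh(W_in @ c + W_rec @ h)     # simplified recurrence
        matrices.append((W_out @ h).reshape(2, 3))
    return matrices

# c would come from the conv tower f_loc^conv; here a dummy vector
A = predict_affine_matrices(np.ones(32), n_steps=3)
```

Because the same features c are fed at every step, only the evolving hidden state differentiates the N predicted regions, which mirrors the paper's design.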
Rotation Dropout During our experiments, we found that the network tends to predict transformation parameters that include excessive rotation. In order to mitigate such behavior, we propose a mechanism that works similarly to dropout (Srivastava et al. 2014), which we call rotation dropout. Rotation dropout works by randomly dropping the parameters of the affine transformation that are responsible for rotation. This prevents the localization network from outputting transformation matrices that perform excessive rotation. Figure 3 shows a comparison of the localization results of a localization network trained without rotation dropout (top) and one trained with rotation dropout (middle).
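Rotation dropout can be sketched as follows. Treating the off-diagonal entries θ2 and θ4 of the 2×3 affine matrix as the rotation-related parameters, and the drop probability of 0.5, are our illustrative assumptions; the paper does not fix these details.

```python
import numpy as np

def rotation_dropout(theta, p=0.5, rng=np.random.default_rng(0)):
    """Sketch of rotation dropout: with probability p, zero the
    rotation/shear entries of a 2x3 affine matrix during training,
    discouraging excessive rotation (assumed parameter choice)."""
    theta = theta.copy()
    if rng.random() < p:
        theta[0, 1] = 0.0   # theta_2: rotation/shear term
        theta[1, 0] = 0.0   # theta_4: rotation/shear term
    return theta

rotated = np.array([[0.9, -0.4, 0.0],
                    [0.4,  0.9, 0.0]])
out = rotation_dropout(rotated, p=1.0)   # p=1.0: always dropped
```

With the rotation terms zeroed, the remaining diagonal and translation entries still allow scaling and shifting, so the detector keeps localizing while rotation is suppressed.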
Grid Generator The grid generator uses a regularly spaced grid $G_o$ with coordinates $y_{h_o}, x_{w_o}$, of height $H_o$ and width $W_o$. The grid $G_o$ is used together with the affine transformation matrices $A^n_\theta$ to produce $N$ regular grids $G^n$ with coordinates $u^n_i, v^n_j$ of the input feature map $I$, where $i \in H_o$ and $j \in W_o$:

$$\begin{pmatrix} u^n_i \\ v^n_j \end{pmatrix} = A^n_\theta \begin{pmatrix} x_{w_o} \\ y_{h_o} \\ 1 \end{pmatrix} = \begin{bmatrix} \theta^n_1 & \theta^n_2 & \theta^n_3 \\ \theta^n_4 & \theta^n_5 & \theta^n_6 \end{bmatrix} \begin{pmatrix} x_{w_o} \\ y_{h_o} \\ 1 \end{pmatrix} \quad (5)$$
During inference we can extract the N resulting grids Gn, which contain the bounding boxes of the text regions found by the localization network. Height Ho and width Wo can be chosen freely.
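Equation 5 can be transcribed almost directly in NumPy. The normalized [-1, 1] coordinate convention below is an assumption carried over from the original spatial transformer formulation, not something the text above specifies.

```python
import numpy as np

def make_sampling_grid(theta, H_o, W_o):
    """Apply a 2x3 affine matrix to a regular H_o x W_o grid of
    normalized coordinates (Eq. 5), yielding sampling positions
    (u, v) in the input feature map's coordinate frame."""
    ys = np.linspace(-1.0, 1.0, H_o)
    xs = np.linspace(-1.0, 1.0, W_o)
    x_w, y_h = np.meshgrid(xs, ys)                  # regular grid G_o
    coords = np.stack([x_w.ravel(), y_h.ravel(),
                       np.ones(H_o * W_o)])         # (x, y, 1)^T columns
    uv = theta @ coords                             # Eq. (5), 2 x (H_o*W_o)
    u = uv[0].reshape(H_o, W_o)
    v = uv[1].reshape(H_o, W_o)
    return u, v

# identity transform: the sampling grid equals the regular grid
u, v = make_sampling_grid(np.array([[1., 0., 0.],
                                    [0., 1., 0.]]), 4, 6)
```

The four corners of the resulting (u, v) grid are exactly the bounding-box vertices mentioned in the text, so the same computation serves both extraction and localization output.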
Localization-Specific Regularizers The datasets used by us do not contain any samples where text is mirrored along the x- or y-axis. Therefore, we found it beneficial to add additional regularization terms that penalize grids which are mirrored along any axis. We furthermore found that the network tends to predict grids that get larger over
Figure 2: The network used in our work consists of two major parts. The first is the localization network that takes the input image and predicts N transformation matrices, which are used to create N different sampling grids. The generated sampling grids are used in two ways: (1) for calculating the bounding boxes of the identified text regions, and (2) for extracting N text regions. The recognition network then performs text recognition on these extracted regions. The whole system is trained end-to-end by only supplying information about the text labels for each text region.
the time of training; hence we included a further regularizer that penalizes large grids, based on their area. Lastly, we also included a regularizer that encourages the network to predict grids that have a greater width than height, as text is normally written horizontally and is typically wider than high. The main purpose of these localization-specific regularizers is to enable faster convergence. Without these regularizers the network will eventually converge, but it will take a very long time and might need several restarts of the training. Equation 7 shows how these regularizers are used in calculating the overall loss of the network.
Image Sampling The $N$ sampling grids $G^n$ produced by the grid generator are now used to sample values of the feature map $I$ at the coordinates $u^n_i, v^n_j$ for each $n \in N$. Naturally, these points will not always perfectly align with the discrete grid of values in the input feature map. Because of that we use bilinear sampling and define the values of the $N$ output feature maps $O^n$ at a given location $i, j$, where $i \in H_o$ and $j \in W_o$, to be:

$$O^n_{ij} = \sum_h^H \sum_w^W I_{hw} \max(0, 1 - |u^n_i - h|) \max(0, 1 - |v^n_j - w|) \quad (6)$$

This bilinear sampling is (sub-)differentiable, hence it is possible to propagate error gradients to the localization network using standard backpropagation.
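A direct (if deliberately inefficient) NumPy transcription of Equation 6 looks as follows; it evaluates the double sum over all input pixels exactly as written, with the triangular kernels max(0, 1 - |·|) reducing to nonzero weights only for the up-to-four neighboring pixels.

```python
import numpy as np

def bilinear_sample(I, u, v):
    """Eq. (6): sample a feature map I (H x W) at real-valued
    coordinates (u over rows, v over columns) with bilinear kernels."""
    H, W = I.shape
    wy = np.maximum(0.0, 1.0 - np.abs(u - np.arange(H)))  # row weights
    wx = np.maximum(0.0, 1.0 - np.abs(v - np.arange(W)))  # column weights
    return wy @ I @ wx          # the double sum of Eq. (6)

I = np.arange(12, dtype=float).reshape(3, 4)
# at integer coordinates the kernel picks out the pixel value exactly
val = bilinear_sample(I, 1.0, 2.0)   # equals I[1, 2]
```

Since each weight is a piecewise-linear function of u and v, the output is (sub-)differentiable with respect to the sampling coordinates, which is exactly what lets gradients flow back into the localization network.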
The combination of localization network, grid generator and image sampler forms a spatial transformer and can in general be used in every part of a DNN. In our system we use the spatial transformer as the first step of our network. Figure 4 provides a visual explanation of the operation of grid generator and image sampler.
Text Recognition Stage
The image sampler of the text detection stage produces a set of $N$ regions that are extracted from the original input image. The text recognition stage (a structural overview of this stage can be found in Figure 2) uses each of these $N$ different regions and processes them independently of each other. The processing of the $N$ different regions is handled by a CNN. This CNN is also based on the ResNet architecture, as we found that we could only achieve good results while using a variant of the ResNet architecture for our recognition network. We argue that using a ResNet in the recognition stage is even more important than in the detection stage, because the detection stage needs to receive strong gradient information from the recognition stage in order to successfully update the weights of the localization network. The CNN of the recognition stage predicts a probability distribution $y$ over the label space $L_\epsilon$, where $L_\epsilon = L \cup \{\epsilon\}$, with $L$ being the alphabet used for recognition and $\epsilon$ representing the blank label. The network is trained by running a LSTM for a fixed number of $T$ time steps and calculating the cross-entropy loss for the output of each time step. The number of time steps $T$ is chosen based on the number of characters of the longest word in the dataset. The loss $L$ is computed as follows:

$$L^n_{grid} = \lambda_1 \times L_{ar}(G^n) + \lambda_2 \times L_{as}(G^n) + L_{di}(G^n) \quad (7)$$

$$L = \sum_{n=1}^{N} \left( \sum_{t=1}^{T} P(l^n_t \mid O^n) + L^n_{grid} \right) \quad (8)$$

where $L_{ar}(G^n)$ is the regularization term based on the area of the predicted grid $n$, $L_{as}(G^n)$ is the regularization term based on the aspect ratio of the predicted grid $n$, and $L_{di}(G^n)$ is the regularization term based on the direction of the grid $n$, which penalizes mirrored grids. $\lambda_1$ and $\lambda_2$ are scaling parameters that can be chosen freely; their typical range is $0 < \lambda_1, \lambda_2 < 0.5$. $l^n_t$ is the label $l$ at time step $t$ for the $n$-th word in the image.
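The grid regularizer of Equation 7 can be sketched like this. The concrete penalty formulas below (area as the grid's bounding-box area, aspect ratio as a hinge on height exceeding width, direction as a hinge on coordinates decreasing along their axes) are our illustrative assumptions, since the text does not spell them out:

```python
import numpy as np

def grid_regularizer(u, v, lam1=0.1, lam2=0.1):
    """Sketch of Eq. (7): L_grid = lam1*L_ar + lam2*L_as + L_di for one
    predicted sampling grid with row coords u and column coords v.
    All three penalty formulas are assumed, not taken from the paper."""
    width = v.max() - v.min()
    height = u.max() - u.min()
    L_ar = width * height                    # penalize large grids
    L_as = max(0.0, height - width)          # prefer wide grids
    # penalize mirrored grids: coords should increase along their axes
    L_di = (max(0.0, -(u[-1, 0] - u[0, 0])) +
            max(0.0, -(v[0, -1] - v[0, 0])))
    return lam1 * L_ar + lam2 * L_as + L_di

# a well-behaved grid (wider than tall, not mirrored): only the small
# area term contributes
ys, xs = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 2, 6),
                     indexing="ij")
loss = grid_regularizer(ys, xs)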
Model Training
The training set X used for training the model consists of a set of input images I and a set of text labels LI for each
Figure 3: Top: predicted bounding boxes of a network trained without rotation dropout. Middle: predicted bounding boxes of a network trained with rotation dropout. Bottom: visualization of image parts that have the highest influence on the outcome of the prediction. This visualization has been created using VisualBackProp (Bojarski et al. 2016).
input image. We do not use any labels for training the text detection stage. The text detection stage learns to detect regions of text by using only the error gradients of the predictions and the textual labels, obtained by calculating the cross-entropy loss for each character of each word. During our experiments we found that, when trained from scratch, a network that shall detect and recognize more than two text lines does not converge. In order to overcome this problem we designed a curriculum learning strategy (Bengio et al. 2009) for training the system. The complexity of the supplied training images under this curriculum is gradually increased once the accuracy on the validation set has settled.
During our experiments we observed that the performance of the localization network stagnates as the accuracy of the recognition network increases. We found that restarting the training with the localization network initialized using the weights obtained from the last training, and the recognition network initialized with random weights, enables the localization network to improve its predictions and thus improve the overall performance of the trained network. We argue that this happens because the values of the gradients propagated to the localization network decrease as the loss decreases, leading to vanishing gradients in the localization network and hence nearly no improvement of the localization.
Experiments
In this section we evaluate our presented network architecture on standard scene text detection/recognition benchmark datasets. While performing our experiments we tried to answer the following questions: (1) Is the concept of letting the network automatically learn to detect text feasible? (2) Can we apply the method to a real-world dataset? (3) Can we get any insights into what kinds of features the network is trying to extract?

Figure 4: Operation of the grid generator and image sampler. First the grid generator uses the N affine transformation matrices A^n_θ to create N equally spaced sampling grids (red and yellow grids on the left side). These sampling grids are used by the image sampler to extract the image pixels at those locations, in this case producing the two output images O1 and O2. The corners of the generated sampling grids provide the vertices of the bounding box for each text region that has been found by the network.
In order to answer these questions, we used different datasets. On the one hand we used standard benchmark datasets for scene text recognition; on the other hand we generated some datasets on our own. First, we performed experiments on the SVHN dataset (Netzer et al. 2011), which we used to prove that our concept as such is feasible. Second, we generated more complex datasets based on SVHN images to see how our system performs on images that contain several words in different locations. The third dataset we experimented with was the French Street Name Signs (FSNS) dataset (Smith et al. 2016). This dataset is the most challenging one we used, as it contains a vast amount of irregular, low-resolution text lines that are more difficult to locate and recognize than the text lines from the SVHN datasets. We begin this section by introducing our experimental setup. We will then present the results and characteristics of the experiments for each of the aforementioned datasets. We conclude this section with a brief explanation of what kinds of features the network seems to learn.
Experimental Setup

Localization Network The localization network used in every experiment is based on the ResNet architecture (He et al. 2016a). The input to the network is the image where text shall be localized and later recognized. Before the first residual block the network performs a 3 × 3 convolution,
followed by batch normalization (Ioffe and Szegedy 2015), ReLU (Nair and Hinton 2010), and a 2 × 2 average pooling layer with stride 2. After these layers three residual blocks with two 3 × 3 convolutions, each followed by batch normalization and ReLU, are used. The number of convolutional filters is 32, 48 and 48, respectively. A 2 × 2 max-pooling with stride 2 follows after the second residual block. The last residual block is followed by a 5 × 5 average pooling layer, and this layer is followed by an LSTM with 256 hidden units. Each time step of the LSTM is fed into another LSTM with 6 hidden units. This layer predicts the affine transformation matrix, which is used to generate the sampling grid for the bilinear interpolation. We apply rotation dropout to each predicted affine transformation matrix, in order to overcome problems with excessive rotation predicted by the network.
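The final step of the localization network, turning a predicted 2 × 3 affine matrix into a sampling grid, can be sketched with NumPy. This is an illustrative sketch using the normalized-coordinate convention of spatial transformers; the helper name and grid resolution are ours, not the actual implementation:

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Turn a 2x3 affine matrix `theta` into an (out_h*out_w, 2) grid of
    source sampling coordinates, using normalized coordinates in [-1, 1]
    (the convention of spatial transformers)."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    ones = np.ones_like(xs)
    target = np.stack([xs.ravel(), ys.ravel(), ones.ravel()])  # (3, H*W)
    return (theta @ target).T  # one (x, y) source location per output pixel

# the identity transform samples the input unchanged
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
grid = affine_grid(theta, 2, 2)
```

Scaling or translating the entries of theta zooms or shifts the crop; the corners of the resulting grid give the bounding box vertices described in Figure 4.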
Recognition Network The inputs to the recognition network are N crops from the original input image, representing the text regions found by the localization network. In our SVHN experiments, the recognition network has the same structure as the localization network, but the number of convolutional filters is higher. The number of convolutional filters is 32, 64 and 128, respectively. We use an ensemble of T independent softmax classifiers as used in (Goodfellow et al. 2014) and (Jaderberg, Vedaldi, and Zisserman 2014) for generating our predictions. In our experiments on the FSNS dataset we found that using ResNet-18 (He et al. 2016a) significantly improves the obtained recognition accuracies.
Alignment of Groundtruth During training we assume that all groundtruth labels are sorted in western reading direction, that is, they appear in the following order: 1. from top to bottom, and 2. from left to right. We stress that it is currently very important to have a consistent ordering of the groundtruth labels, because if the labels are in a random order, the network rather predicts large bounding boxes that span all areas of text in the image. We hope to overcome this limitation in the future by developing a method that allows random ordering of groundtruth labels.
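Such an ordering can be produced from groundtruth boxes with a simple two-level sort. The following helper is illustrative only, not part of our training code, and the line tolerance is a hypothetical parameter:

```python
def reading_order(boxes, line_tol=10):
    """Sort word bounding boxes (x, y, w, h) into western reading order:
    top to bottom first, then left to right within each text line.
    `line_tol` is a hypothetical vertical tolerance (in pixels) for
    grouping boxes into one line."""
    boxes = sorted(boxes, key=lambda b: b[1])          # rough top-to-bottom
    lines, current = [], [boxes[0]]
    for box in boxes[1:]:
        if abs(box[1] - current[-1][1]) <= line_tol:   # same text line
            current.append(box)
        else:
            lines.append(current)
            current = [box]
    lines.append(current)
    ordered = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda b: b[0]))  # left-to-right
    return ordered

boxes = [(120, 12, 40, 20), (10, 10, 40, 20), (15, 60, 40, 20)]
order = reading_order(boxes)
```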
Implementation We implemented all our experiments using Chainer (Tokui et al. 2015). We conducted all our experiments on a workstation which has an Intel(R) Core(TM) i7-6900K CPU, 64 GB RAM and 4 TITAN X (Pascal) GPUs.
Experiments on the SVHN dataset

With our first experiments on the SVHN dataset (Netzer et al. 2011) we wanted to prove that our concept works. We therefore first conducted experiments, similar to the experiments in (Jaderberg et al. 2015b), on SVHN image crops with a single house number in each image crop, which is centered around the number and also contains background noise. Table 1 shows that we are able to reach competitive recognition accuracies.
Based on this experiment we wanted to determine whether our model is able to detect different lines of text that are arranged in a regular grid, or placed at random locations in the
Method 64px
(Goodfellow et al. 2014) 96.0%
(Jaderberg et al. 2015b) 96.3%
Ours 95.2%

Table 1: Sequence recognition accuracies on the SVHN dataset when recognizing house numbers on crops of 64 × 64 pixels, following the experimental setup of (Goodfellow et al. 2014).
Figure 5: Samples from our generated datasets, including bounding boxes predicted by our model. Left: sample from the regular grid dataset. Right: sample from the dataset with randomly positioned house numbers.
image. In Figure 5 we show samples from our two generated datasets, which we used for our other experiments based on SVHN data. We found that our network performs well on the task of finding and recognizing house numbers that are arranged in a regular grid.
During our experiments on the second dataset created by us, we found that it is not possible to train a model from scratch that can find and recognize more than two text lines scattered across the whole image. We therefore resorted to designing a curriculum learning strategy that starts with easier samples first and then gradually increases the complexity of the training images.
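A curriculum of this kind can be expressed as a simple epoch-to-stage schedule. The stage names and epoch boundaries below are placeholders; only the easy-to-hard progression reflects the actual strategy:

```python
def curriculum_stage(epoch, schedule=((0, "single_textline"),
                                      (10, "two_textlines"),
                                      (25, "scattered_textlines"))):
    """Return the name of the training subset to use at `epoch`.
    The schedule maps an epoch threshold to the hardest stage that has
    been unlocked so far (easy samples first, harder ones later)."""
    stage = schedule[0][1]
    for start_epoch, name in schedule:
        if epoch >= start_epoch:
            stage = name
    return stage
```

At each epoch the data loader would then draw samples only from the currently unlocked stage (or, in practice, from a mixture of all unlocked stages).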
Experiments on the FSNS dataset

Following our scheme of increasing the difficulty of the task that should be solved by the network, we chose the French Street Name Signs (FSNS) dataset by Smith et al. (Smith et al. 2016) as our next dataset to perform experiments on. The FSNS dataset contains more than 1 million images of French street name signs, which have been extracted from Google Streetview. This dataset is the most challenging dataset for our approach as it (1) contains multiple lines of text with varying length, which are embedded in natural scenes with distracting backgrounds, and (2) contains a lot of images where the text is occluded, incorrect, or nearly unreadable for humans.
During our first experiments with that dataset, we found that our model is not able to converge when trained on the supplied groundtruth. We argue that this is because our network was not able to learn the alignment of the supplied labels with the text in the images of the dataset. We therefore chose a different approach, and started with experiments
Figure 6: Samples from the FSNS dataset. These examples show the variety of different samples in the dataset and also how well our system copes with these samples. The bottom row shows two samples where our system fails to recognize the correct text. The right image is especially interesting, as the system here tries to mix information extracted from two different street signs that should not be together in one sample.
Method Sequence Accuracy
(Smith et al. 2016) 72.5%
(Wojna et al. 2017) 84.2%
Ours 78.0%

Table 2: Recognition accuracies on the FSNS benchmark dataset.
where we tried to find individual words instead of text lines with more than one word. Table 2 shows the performance of our proposed system on the FSNS benchmark dataset. We are currently able to achieve competitive performance on this dataset. We are still behind the results reported by Wojna et al. (Wojna et al. 2017). This is likely due to the fact that we used a weaker feature extractor (ResNet-18) compared to the one used by Wojna et al. (Inception-ResNet v2). Also recall that our method is not only able to determine the text in the images, but also able to extract the location of the text, although we never explicitly told the network where to find the text! The network learned this completely on its own in a semi-supervised manner.
Insights

During the training of our networks, we used VisualBackProp (Bojarski et al. 2016) to visualize the regions that the network deems to be the most interesting. Using this visualization technique, we could observe that our system seems to learn different types of features for each subtask. Figure 3 (bottom) shows that the localization network learns to extract features that resemble edges of text, and the recognition network learns to find strokes of the individual characters in each cropped word region. This is an interesting observation, as it shows that our DNN tries to learn features that are closely related to the features used by systems based on hand-crafted features.
Conclusion

In this paper we presented a system that can be seen as a step towards solving end-to-end scene text recognition using only a single multi-task deep neural network. We trained the text detection component of our model in a semi-supervised way and are able to extract the localization results of the text detection component. The network architecture of our system is simple, but it is not easy to train, as successful training requires a clever curriculum learning strategy. We also showed that our network architecture can be used to reach competitive results on different public benchmark datasets for scene text detection/recognition.
At the current state we note that our models are not fully capable of detecting text at arbitrary locations in the image, as we saw during our experiments with the FSNS dataset. Right now our model is also constrained to a fixed maximum number of words that can be detected with one forward pass. In our future work, we want to redesign the network in a way that makes it possible for the network to determine the number of text lines in an image by itself.
References

Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 41–48. New York, NY, USA: ACM.

Bissacco, A.; Cummins, M.; Netzer, Y.; and Neven, H. 2013. PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, 785–792.

Bojarski, M.; Choromanska, A.; Choromanski, K.; Firner, B.; Jackel, L.; Muller, U.; and Zieba, K. 2016. VisualBackProp: Efficient visualization of CNNs. arXiv:1611.05418 [cs].

Dai, J.; He, K.; and Sun, J. 2016. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3150–3158.

Epshtein, B.; Ofek, E.; and Wexler, Y. 2010. Detecting text in natural scenes with stroke width transform. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2963–2970.

Goodfellow, I.; Bulatov, Y.; Ibarz, J.; Arnoud, S.; and Shet, V. 2014. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In ICLR 2014.

Gupta, A.; Vedaldi, A.; and Zisserman, A. 2016. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2315–2324.

Gomez, L., and Karatzas, D. 2017. TextProposals: A text-specific selective search algorithm for word spotting in the wild. Pattern Recognition 70:60–74.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

He, P.; Huang, W.; Qiao, Y.; Loy, C. C.; and Tang, X. 2016b. Reading scene text in deep convolutional sequences. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 3501–3508. AAAI Press.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, 448–456.

Jaderberg, M.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2015a. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision 116(1):1–20.

Jaderberg, M.; Simonyan, K.; Zisserman, A.; and Kavukcuoglu, K. 2015b. Spatial transformer networks. In Advances in Neural Information Processing Systems 28, 2017–2025. Curran Associates, Inc.

Jaderberg, M.; Vedaldi, A.; and Zisserman, A. 2014. Deep features for text spotting. In Computer Vision – ECCV 2014, number 8692 in Lecture Notes in Computer Science, 512–528. Springer International Publishing.

Li, H.; Wang, P.; and Shen, C. 2017. Towards end-to-end text spotting with convolutional recurrent neural networks. arXiv:1707.03985 [cs].

Mishra, A.; Alahari, K.; and Jawahar, C. 2012. Scene text recognition using higher order language priors. In BMVC 2012 – 23rd British Machine Vision Conference, 127.1–127.11. British Machine Vision Association.

Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814.

Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, 5.

Neumann, L., and Matas, J. 2010. A method for text localization and recognition in real-world images. In Computer Vision – ACCV 2010, number 6494 in Lecture Notes in Computer Science, 770–783. Springer Berlin Heidelberg.

Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, 91–99. Curran Associates, Inc.

Shi, B.; Bai, X.; and Yao, C. 2016. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Shi, B.; Wang, X.; Lyu, P.; Yao, C.; and Bai, X. 2016. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4168–4176.

Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

Smith, R.; Gu, C.; Lee, D.-S.; Hu, H.; Unnikrishnan, R.; Ibarz, J.; Arnoud, S.; and Lin, S. 2016. End-to-end interpretation of the French Street Name Signs dataset. In Computer Vision – ECCV 2016 Workshops, 411–426. Springer, Cham.

Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

Sønderby, S. K.; Sønderby, C. K.; Maaløe, L.; and Winther, O. 2015. Recurrent spatial transformer networks. arXiv:1509.05329 [cs].

Tokui, S.; Oono, K.; Hido, S.; and Clayton, J. 2015. Chainer: A next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS).

Wang, K., and Belongie, S. 2010. Word spotting in the wild. In Computer Vision – ECCV 2010, number 6311 in Lecture Notes in Computer Science, 591–604. Springer Berlin Heidelberg.

Wang, K.; Babenko, B.; and Belongie, S. 2011. End-to-end scene text recognition. In 2011 International Conference on Computer Vision, 1457–1464.

Wojna, Z.; Gorban, A.; Lee, D.-S.; Murphy, K.; Yu, Q.; Li, Y.; and Ibarz, J. 2017. Attention-based extraction of structured information from street view imagery. arXiv:1704.03549 [cs].

Yao, C.; Bai, X.; Shi, B.; and Liu, W. 2014. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4042–4049.
5. SEE
5.3 Additional Experimental Results
In this section, we provide additional experimental results of our presented net-
work architecture on the ICDAR dataset [KGBN+15] for focused scene text recog-
nition, where we explored the performance of our model when it comes to finding
and recognizing single characters.
5.3.1 Experimental Setup
5.3.1.1 Localization Network
The localization network used in every experiment is based on the ResNet ar-
chitecture [HZRS16]. The input to the network is the image where text shall be
localized and later recognized. Before the first residual block, the network per-
forms a 3× 3 convolution followed by a 2× 2 average pooling layer with stride 2.
After these layers three residual blocks with two 3×3 convolutions, each followed
by batch normalization [IS15b], are used. The number of convolutional filters is
32, 48 and 48 respectively and ReLU is used as the activation function for each
convolutional layer. A 2 × 2 max-pooling with stride 2 follows after the second
residual block. The last residual block is followed by a 5×5 average pooling layer,
and this layer is followed by a BLSTM with 256 hidden units. Each time step of
the BLSTM is fed into a fully connected layer with 6 hidden units, which predicts
the affine transformation matrix that is used to generate the sampling grid for
the bilinear interpolation. As rectification of scene text is beyond the scope of
this work we disabled skew and rotation in the affine transformation matrices
by setting the corresponding parameters to 0. We will discuss the rectification
capabilities of Spatial Transformers for scene text detection in our future work.
5.3.1.2 Recognition Network
The inputs to the recognition network are N crops from the original input image
that represent the text regions found by the localization network. The recognition
network has the same structure as the localization network, but the number of
convolutional filters is higher. The number of convolutional filters is 32, 64 and
128 respectively. Depending on the experiment we either used an ensemble of
Method ICDAR 2013 SVT IIIT5K
PhotoOCR [BCNN13b] 87.6 78.0 -
CharNet [JVZ14b] 81.8 71.7 -
DictNet* [JSVZ15] 90.8 80.7 -
CRNN [SBY16] 86.7 80.8 78.2
RARE [SWL+16] 87.5 81.9 81.9
Ours 90.3 79.8 86.0
Table 5.1: Recognition accuracies on the ICDAR 2013, SVT and IIIT5K robust reading
benchmarks. Here we only report results that do not use per image lexicons. (*[JSVZ15]
is not lexicon-free in the strict sense as the outputs of the network itself are constrained to
a 90k dictionary.)
T independent softmax classifiers as used in [GBI+14] and [JVZ14a], where T
is the maximum length that a word may have, or we used CTC with best path
decoding as used in [HHQ+16] and [SBY16].
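Best path decoding simply takes the most likely label at every time step, collapses consecutive repeats, and removes CTC blanks. A minimal sketch (the alphabet and probabilities are toy values, not from our experiments):

```python
import numpy as np

def ctc_best_path(probs, alphabet, blank=0):
    """Greedy (best path) CTC decoding: take the most likely label at
    every time step, collapse consecutive repeats, then drop blanks."""
    best = np.argmax(probs, axis=1)  # (T,) label index per time step
    collapsed = [l for i, l in enumerate(best) if i == 0 or l != best[i - 1]]
    return "".join(alphabet[l] for l in collapsed if l != blank)

# toy example: index 0 is the CTC blank '-'
alphabet = "-ab"
probs = np.array([[0.1, 0.8, 0.1],    # 'a'
                  [0.1, 0.8, 0.1],    # 'a' (repeat -> collapsed)
                  [0.9, 0.05, 0.05],  # blank
                  [0.1, 0.1, 0.8]])   # 'b'
decoded = ctc_best_path(probs, alphabet)  # -> "ab"
```

Unlike beam search decoding, best path decoding needs only a single pass over the per-timestep probabilities, which keeps inference cheap.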
5.3.1.3 Implementation
We implemented all our experiments using MXNet [CLL+15]. We conducted
all our experiments on a workstation which has an Intel(R) Core(TM) i7-6900K
CPU, 64 GB RAM and 4 TITAN X (Pascal) GPUs.
5.3.2 Experiments on Robust Reading Datasets
In our next experiments, we used datasets where text regions are already cropped
from the input images. We wanted to see whether our text localization network
can be used as an intelligent sliding window generator that adapts to irregularities
of the text in the cropped text region. Therefore we trained our recognition model
using CTC on a dataset of synthetic cropped word images that we generated
using our data generator, which works similarly to the data generator introduced by
Jaderberg [JSVZ14].
In Table 5.1 we report the recognition results of our model on the ICDAR
2013 robust reading [KSU+13], the Street View Text (SVT) [WBB11] and the
IIIT5K [MAJ12] benchmark datasets. For evaluation on the ICDAR 2013 and
Figure 5.1: Samples from ICDAR, SVT and IIIT5K datasets that show how well our
model finds text regions and is able to follow the slope of the words.
SVT datasets, we filtered all images that contain non-alphanumeric characters
and discarded all images that have fewer than 3 characters, as done in [SWL+16,
WBB11]. We obtained our final results by post-processing the predictions using
the standard hunspell English (en-US) dictionary.
Overall we find that our model achieves state-of-the-art performance for un-
constrained recognition models on the ICDAR 2013 and IIIT5K dataset and
competitive performance on the SVT dataset. In Figure 5.1 we show that our
model learns to follow the slope of the individual text regions, proving that our
model intelligently produces sliding windows.
6

Learning Binary Neural Networks with BMXNet
In this work, we developed an open-source Binary Neural Network (BNN) library
based on Apache MXNet. The implemented approaches can drastically reduce the
memory size and memory accesses of a DL model by replacing standard arithmetic
operations with bit-wise operations. This significantly improves efficiency and
lowers energy consumption at runtime, which enables the application of
state-of-the-art deep learning models on low-power devices.
We further worked on increasing our understanding of the training process of
BNNs. We systematically evaluated different network architectures and hyper-
parameters to provide useful insights on how to train a binary neural network
based on BMXNet. Further, we present how we improved accuracy by increasing
the number of connections through the network.
6.1 Contribution to the Work
• Main contributor to the formulation and implementation of research ideas
• Main contributor of the conceptual and technical implementation
• Core maintainer of the software project
• Guidance and supervision of the further technical implementation
BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet
Haojin Yang, Martin Fritzsche, Christian Bartz, Christoph Meinel
Hasso Plattner Institute (HPI), University of Potsdam, Germany
Potsdam D-14480
{haojin.yang,christian.bartz,meinel}@hpi.de
martin.fritzsche@student.hpi.de
ABSTRACT

Binary Neural Networks (BNNs) can drastically reduce memory size and accesses by applying bit-wise operations instead of standard arithmetic operations. They can therefore significantly improve efficiency and lower energy consumption at runtime, which enables the application of state-of-the-art deep learning models on low power devices. BMXNet is an open-source BNN library based on MXNet, which supports both XNOR-Networks and Quantized Neural Networks. The developed BNN layers can be seamlessly applied with other standard library components and work in both GPU and CPU mode. BMXNet is maintained and developed by the multimedia research group at Hasso Plattner Institute and released under the Apache license. Extensive experiments validate the efficiency and effectiveness of our implementation. The BMXNet library, several sample projects, and a collection of pre-trained binary deep models are available for download at https://github.com/hpi-xnor
CCS CONCEPTS

• Software and its engineering → Software libraries and repositories; • Computer systems organization → Neural networks; • Computing methodologies → Computer vision;

KEYWORDS

Open Source, Computer Vision, Binary Neural Networks, Machine Learning

ACM Reference format:
Haojin Yang, Martin Fritzsche, Christian Bartz, Christoph Meinel. 2017. BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet. In Proceedings of MM '17, Mountain View, CA, USA, October 23–27, 2017, 4 pages. https://doi.org/10.1145/3123266.3129393
1 INTRODUCTION

In recent years, deep learning technologies achieved excellent performance and many breakthroughs in both academia and industry.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM '17, October 23–27, 2017, Mountain View, CA, USA
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-4906-2/17/10. . . $15.00
https://doi.org/10.1145/3123266.3129393
However, the state-of-the-art deep models are computationally expensive and consume large storage space. Deep learning is also strongly demanded by numerous applications from areas such as mobile platforms, wearable devices, autonomous robots and IoT devices. How to efficiently apply deep models on such low power devices becomes a challenging research problem. The recently introduced Binary Neural Networks (BNNs) could be one of the possible solutions for this problem.
Several approaches [4, 7, 13, 15] introduce the usage of BNNs. These BNNs can decrease the memory consumption and computational complexity of the neural network. On the one hand, the weights, typically stored as 32-bit floating point values, are binarized with the sign function to either {0, 1} or {−1, 1}, so that several of them can be stored in a single 32-bit float or integer. On the other hand, computational complexity is reduced by using xnor and popcount to perform the matrix multiplications used in convolutional and fully connected layers. Most of the publicly available BNN implementations do not store the weights in their binarized form [4, 7, 13, 15], nor do they use xnor and popcount [7, 15] while performing the matrix multiplications in convolutional and fully connected layers.
The deep learning library Tensorflow [8] tries to decrease the memory consumption and computational complexity of deep neural networks by quantizing the 32-bit floating point weights and inputs into 8-bit integers. Together with the minimum and maximum value of the weight/input matrix, this achieves 4× less memory usage and also decreased computational complexity, as all operations only need to be performed on 8-bit values rather than 32-bit values.
BMXNet stores the weights of convolutional and fully connected layers in their binarized format, which enables us to store 32/64 weights in a single 32/64-bit float/integer and use 32× less memory. During training and inference we binarize the input to each binary convolution and fully connected layer in the same way as the weights get binarized, and perform matrix multiplication using bit-wise operations (xnor and popcount). Our implementation is also prepared to use networks that store weights and use inputs with arbitrary bit widths, as proposed by Zhou et al. [15].
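The binarize-and-pack step can be sketched as follows. This is an illustrative Python sketch using NumPy's packbits, not BMXNet's actual C++ implementation; the helper name is ours:

```python
import numpy as np

def pack_weights(w):
    """Binarize float weights with the sign function (>= 0 -> bit 1,
    < 0 -> bit 0) and pack every 32 weights into one uint32 word."""
    bits = (np.asarray(w) >= 0).astype(np.uint8)
    pad = (-len(bits)) % 32                      # pad to whole 32-bit words
    bits = np.concatenate([bits, np.zeros(pad, dtype=np.uint8)])
    return np.packbits(bits).view(np.uint32)

w = np.random.randn(64).astype(np.float32)  # 64 floats = 256 bytes
packed = pack_weights(w)                    # 2 uint32 words = 8 bytes
```

Dividing the float byte count by the packed byte count recovers the 32× memory reduction stated above.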
The deep learning library MXNet [3] serves as a base for our code. MXNet is a high-performance, modular deep learning library written in C++. MXNet provides bindings for other popular programming languages like Python, R, Scala and Go, and is used by a wide range of researchers and companies.
Session: Open Source Software Competition MM’17, October 23-27, 2017, Mountain View, CA, USA
2 FRAMEWORK

BMXNet provides activation, convolution and fully connected layers that support quantization and binarization of input data and weights. These layers are designed as drop-in replacements for the corresponding MXNet variants and are called QActivation, QConvolution and QFullyConnected. They provide an additional parameter, act_bit, which controls the bit width the layers calculate with.
A Python example usage of our framework in comparison to MXNet is shown in Listings 1 and 2. We do not use binary layers for the first and last layer in the network, as we have confirmed the experiments of [13] showing that this greatly decreases accuracy. The standard block structure of a BNN in BMXNet is constructed as QActivation-QConv/QFC-BatchNorm-Pooling, as shown in Listing 2.
Listing 1: LeNet
def get_lenet():
data = mx.symbol.Variable('data')
# first conv layer
conv1 = mx.sym.Convolution (...)
tanh1 = mx.sym.Activation (...)
pool1 = mx.sym.Pooling (...)
bn1 = mx.sym.BatchNorm (...)
# second conv layer
conv2 = mx.sym.Convolution (...)
bn2 = mx.sym.BatchNorm (...)
tanh2 = mx.sym.Activation (...)
pool2 = mx.sym.Pooling (...)
# first fullc layer
flatten = mx.sym.Flatten (...)
fc1 = mx.symbol.FullyConnected (...)
bn3 = mx.sym.BatchNorm (...)
tanh3 = mx.sym.Activation (...)
# second fullc
fc2 = mx.sym.FullyConnected (...)
# softmax loss
lenet = mx.sym.SoftmaxOutput (..)
return lenet
Listing 2: Binary LeNet
def get_binary_lenet():
data = mx.symbol.Variable('data')
# first conv layer
conv1 = mx.sym.Convolution (...)
tanh1 = mx.sym.Activation (...)
pool1 = mx.sym.Pooling (...)
bn1 = mx.sym.BatchNorm (...)
# second conv layer
ba1 = mx.sym.QActivation(...)
conv2 = mx.sym.QConvolution(...)
bn2 = mx.sym.BatchNorm (...)
pool2 = mx.sym.Pooling (...)
# first fullc layer
flatten = mx.sym.Flatten (...)
ba2 = mx.symbol.QActivation(..)
fc1 = mx.symbol.QFullyConnected(..)
bn3 = mx.sym.BatchNorm (...)
tanh3 = mx.sym.Activation (...)
# second fullc
fc2 = mx.sym.FullyConnected (...)
# softmax loss
lenet = mx.sym.SoftmaxOutput (..)
return lenet
2.1 Quantization

Quantization with bit widths ranging from 2 to 31 bit is available for experiments with training and prediction using low precision weights and inputs. The quantized data is still stored in the default 32-bit float values and the standard MXNet dot product operations are applied.
We quantize the weights following the linear quantization scheme shown by [15]. Equation 1 quantizes a real-number input in the range [0, 1] to a number in the same range representable with a bit width of k bit.
quantize(input, k) = round((2^k − 1) · input) / (2^k − 1)    (1)
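Equation 1 can be transcribed directly (an illustrative Python sketch, not BMXNet code):

```python
def quantize(x, k):
    """Equation 1: map a real value x in [0, 1] to the nearest of 2**k
    evenly spaced levels in [0, 1]."""
    levels = (1 << k) - 1          # 2**k - 1
    return round(levels * x) / levels

q1 = quantize(0.7, 1)  # 1 bit: only the levels 0.0 and 1.0 remain
q2 = quantize(0.3, 2)  # 2 bit: levels {0, 1/3, 2/3, 1}
```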
2.2 Binarization

The extreme case of quantizing to 1-bit-wide values is binarization. Working with binarized weights and input data allows for highly performant matrix multiplications by utilizing the CPU instructions xnor and popcount.
2.2.1 Dot Product with xnor and popcount. Fully connected and convolution layers heavily rely on dot products of matrices, which in turn require massive floating point operations. Most modern CPUs are optimized for these types of operations. But especially for real-time applications on embedded or less powerful devices (cell phones, IoT devices) there are optimizations that improve performance, reduce memory and I/O footprint and lower power consumption [2].
To calculate the dot product of two binary matrices A ◦ B, no multiplication operation is required. The element-wise multiplication and summation of each row of A with each column of B can be approximated by first combining them with the xnor operation and then counting the number of bits set to 1 in the result, which is the population count [13].
Listing 3: Baseline xnor GEMM Kernel
void xnor_gemm_baseline_no_omp(int M, int N, int K,
                               BINARY_WORD *A, int lda,
                               BINARY_WORD *B, int ldb,
                               float *C, int ldc) {
    for (int m = 0; m < M; ++m) {
        for (int k = 0; k < K; ++k) {
            BINARY_WORD A_PART = A[m*lda+k];
            for (int n = 0; n < N; ++n) {
                C[m*ldc+n] += __builtin_popcountl(~(A_PART ^ B[k*ldb+n]));
            }
        }
    }
}
We can approximate the multiplication and addition of two times 64 matrix elements in very concise processor instructions on x64 CPUs, and two times 32 elements on x86 and ARMv7 processors. This is enabled by hardware support for the xnor and popcount operations, which translate directly into a single assembly command. The population count instruction is available on x86 and x64 CPUs supporting SSE4.2, while on the ARM architecture it is included in the NEON instruction set.
An unoptimized GEMM (General Matrix Multiplication) implementation utilizing these instructions is shown in Listing 3. The compiler intrinsic __builtin_popcount is supported by both gcc and clang compilers and translates into the machine instruction on supported hardware. BINARY_WORD is the packed data type storing 32 (x86 and ARMv7) or 64 (x64) matrix elements, each represented by a single bit. We implemented several optimized versions of the xnor GEMM kernel: we leverage processor cache hierarchies by blocking and packing the data, and use unrolling and parallelization techniques.
2.2.2 Training. We carefully designed the binarized layers (utilizing xnor and population count operations) to exactly match the output of the built-in layers of MXNet (computing with BLAS dot product operations) when limiting those to the discrete values −1 and +1. This enables massively parallel training with GPU support by utilizing CuDNN on high-performance clusters. The trained model can then be used on less powerful devices, where the forward pass for prediction calculates the dot product with the xnor and popcount operations instead of multiplication and addition.
The possible values after performing an xnor and popcount matrix multiplication A(m×n) ◦ B(n×k) are in the range [0, +n] with step size 1, whereas a normal dot product of matrices limited to the discrete values −1 and +1 lies in the range [−n, +n] with step size 2. To enable GPU-supported training we modify the training
Figure 1: Processing time comparison of GEMMmethods
Figure 2: Speedup comparison based on naive gemmmethodby varying filter number of the convolution layer. The inputchannel size is fixed to 256 while the kernel size and batchsize are set to 5×5 and 200 respectively.
process. After calculation of the dot product we map the result backto the range [0,+n] to match the xnor dot product, as in Equation 2.
output_xnor_dot = (output_dot + n) / 2    (2)
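As a sanity check, the mapping of Equation 2 can be written out as a one-line helper (illustrative only; the function name is ours):

```c
/* Map a +/-1 dot product (range [-n, +n], step size 2) to the value an
 * xnor+popcount kernel produces (range [0, +n], step size 1); Equation 2. */
static int dot_to_xnor(int dot, int n)
{
    return (dot + n) / 2;
}
```

For example, with n = 3 inputs a dot product of +1 (two matches, one mismatch) maps to a popcount of 2 matching bits.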
2.2.3 Model Converter. After training a network with BMXNet, the weights are stored in 32 bit float variables. This is also the case for networks trained with a bit width of 1. We provide a model converter1 that reads in a binary trained model file and packs the weights of QConvolution and QFullyConnected layers. After this conversion, only 1 bit of storage and runtime memory is used per weight. A ResNet-18 network with full precision weights has a size of 44.7MB. The conversion with our model converter achieves a 29× compression, resulting in a file size of 1.5MB (cf. Table 1).
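The packing step such a converter performs can be sketched as follows (a simplified illustration with assumed names and bit order, not the actual converter code):

```c
#include <stdint.h>
#include <stddef.h>

/* Pack n float weights (trained to be +1 or -1) into 64-bit words, one bit
 * per weight: the bit is set for +1 (w >= 0) and clear for -1. Storage per
 * weight drops from 32 bits to 1 bit, which is the source of the roughly
 * 29x file size reduction reported for the converted models. 'packed' must
 * have room for (n + 63) / 64 words. */
void pack_weights(const float *w, size_t n, uint64_t *packed)
{
    for (size_t i = 0; i < n; ++i) {
        if (i % 64 == 0)
            packed[i / 64] = 0;              /* start a fresh word */
        if (w[i] >= 0.0f)
            packed[i / 64] |= 1ULL << (i % 64);
    }
}
```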
3 EVALUATION
In this section we report the evaluation results of both the efficiency analysis and the classification accuracy over the MNIST [12], CIFAR-10 [11] and ImageNet [5] datasets using BMXNet.
1https://github.com/hpi-xnor/BMXNet/tree/master/smd_hpi/tools/model-converter
Figure 3: Speedup comparison based on the naive gemm method by varying the kernel size of the convolution layer. The input channel size, batch size and filter number are set to 256, 200 and 64 respectively.
3.1 Efficiency Analysis
All the experiments in this section were performed on an Ubuntu 16.04 64-bit platform with an Intel 2.50GHz × 4 CPU supporting the popcnt instruction (SSE4.2) and 8GB RAM.
In current deep neural network implementations, most of the fully connected and convolution layers are implemented using GEMM. According to the evaluation results from [9], over 90% of the processing time of the Caffe-AlexNet [10] model is spent on such layers. We thus conducted experiments to measure the efficiency of different GEMM methods. The measurements were performed within a convolution layer, where we fixed the parameters as follows: filter number = 64, kernel size = 5×5, batch size = 200, and the matrix sizes M, N, K are 64, 12800, and kernel_w × kernel_h × inputChannelSize, respectively. Figure 1 shows the evaluation results. The colored columns denote the processing time in milliseconds across varying input channel sizes; xnor_32 and xnor_64 denote the xnor_gemm operator in 32 bit and 64 bit; xnor_64_omp denotes the 64 bit xnor_gemm accelerated by using the OpenMP2 parallel programming library; binarize input and xnor_64_omp further accumulates the processing time of input data binarization. From the results we can see that xnor_64_omp achieved about 50× and 125× acceleration in comparison to the Cblas (Atlas3) and naive gemm kernels, respectively. Even when accounting for the binarization time of the input data, we still achieved about 13× acceleration compared with the Cblas method.
Figures 2 and 3 illustrate the speedup achieved by varying filternumber and kernel size based on the naive gemm method.
3.2 Classification Accuracy
We further conducted experiments with our BNNs on the MNIST, CIFAR-10 and ImageNet datasets. The experiments were performed on a workstation with an Intel(R) Core(TM) i7-6900K CPU, 64 GB RAM and 4 TITAN X (Pascal) GPUs.
By following the same strategy as applied in [7, 13, 15] wealways avoid binarization at the first convolution layer and the
2 http://www.openmp.org/
3 http://math-atlas.sourceforge.net/
          Architecture   Test Accuracy (Binary/Full Precision)   Model Size (Binary/Full Precision)
MNIST     LeNet          0.97/0.99                               206kB/4.6MB
CIFAR-10  ResNet-18      0.86/0.90                               1.5MB/44.7MB

Table 1: Classification test accuracy of binary and full precision models trained on the MNIST and CIFAR-10 datasets. No pre-training or data augmentation was used.
Full Precision Stage   Val-acc-top-1   Val-acc-top-5   Model Size
none                   0.42            0.66            3.6MB
1st                    0.48            0.73            4.1MB
2nd                    0.44            0.69            5.6MB
3rd                    0.49            0.73            11.3MB
4th                    0.47            0.71            36MB
1st, 2nd               0.49            0.73            6.2MB
All                    0.61            0.84            47MB

Table 2: Classification test accuracy of binary, partially binarized and full precision models trained on ImageNet. The ResNet-18 architecture was used in the experiment.
last fully connected layer. Table 1 depicts the classification test accuracy of our binary as well as full precision models trained on MNIST and CIFAR-10. The table shows that the size of the binary models is significantly reduced, while the accuracy is still competitive. Table 2 demonstrates the validation accuracy of our binary, partially-binarized and full precision models trained on ImageNet. The ResNet implementation in MXNet consists of 4 ResUnit stages; we thus also report the results of partially-binarized models with specific full precision stages. The partially-binarized model with the first stage in full precision shows a great accuracy improvement with only a minor increase in model size, compared to the fully binarized model.
4 EXAMPLE APPLICATIONS

4.1 Python Scripts
The BMXNet repository [1] contains python scripts that can train and validate binarized neural networks. The script smd_hpi/examples/binary_mnist/mnist_cnn.py will train a binary LeNet [14] with the MNIST [12] data set. To train a network with the CIFAR-10 [11] or ImageNet [5] data set there is a python script based on the ResNet-18 [6] architecture. Find it at smd_hpi/examples/binary-imagenet1k/train_cifar10/train_[dataset].py. For further information and example invocations see the corresponding README.md.
4.2 Mobile Applications

4.2.1 Image Classification. The Android application android-image-classification and the iOS application ios-image-classification can classify the live camera feed based on a binarized ResNet-18 model trained on the ImageNet dataset.
4.2.2 Handwritten Digit Detection. The iOS application ios-mnist can classify handwritten digits based on a binarized LeNet model trained on the MNIST dataset.
5 CONCLUSION
We introduced BMXNet, an open-source binary neural network implementation in C/C++ based on MXNet. The evaluation results show up to 29× model size reduction and much more efficient xnor GEMM computation. To demonstrate its applicability we developed sample applications for image classification on Android as well as iOS using a binarized ResNet-18 model. Source code, documentation, pre-trained models and sample projects are published on GitHub [1].
REFERENCES
[1] 2017. BMXNet: an open-source binary neural network library. https://github.com/hpi-xnor. (2017).
[2] Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. 2016. YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights. In VLSI (ISVLSI), 2016 IEEE Computer Society Annual Symposium on. IEEE, 236–241.
[3] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. CoRR abs/1512.01274 (2015).
[4] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28. 3123–3131.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[7] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized Neural Networks. In Advances in Neural Information Processing Systems 29. 4107–4115.
[8] Google Inc. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://tensorflow.org/. Software available from tensorflow.org.
[9] Yangqing Jia. 2014. Learning Semantic Image Representations at a Large Scale. Ph.D. Dissertation. EECS Department, University of California, Berkeley.
[10] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
[11] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. 2014. CIFAR-10 (Canadian Institute for Advanced Research). (2014).
[12] Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
[13] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In Computer Vision - ECCV 2016. 525–542.
[14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[15] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv:1606.06160 [cs] (2016).
Learning to Train a Binary Neural Network
Joseph Bethge, Haojin Yang, Christian Bartz, Christoph Meinel
Hasso Plattner Institute, University of Potsdam, Germany
P.O. Box 900460, Potsdam D-14480
{joseph.bethge,haojin.yang,christian.bartz,meinel}@hpi.de
Abstract. Convolutional neural networks have achieved astonishing results in different application areas. Various methods which allow us to use these models on mobile and embedded devices have been proposed. Especially binary neural networks seem to be a promising approach for these devices with low computational power. However, understanding binary neural networks and training accurate models for practical applications remains a challenge. In our work, we focus on increasing our understanding of the training process and making it accessible to everyone. We publish our code and models based on BMXNet for everyone to use1. Within this framework, we systematically evaluated different network architectures and hyperparameters to provide useful insights on how to train a binary neural network. Further, we present how we improved accuracy by increasing the number of connections in the network.
1 Introduction
Nowadays, research is making significant progress towards automating different tasks of our everyday lives. From vacuum robots in our homes to entire production facilities run by robots, many tasks in our world are already highly automated. Other advances, such as self-driving cars, are currently being developed and depend on strong machine learning solutions. Furthermore, more and more ordinary devices, such as smart home appliances, are equipped with embedded chips with limited resources. Even operating systems and apps on smartphones adopt deep learning techniques for tackling several problems and will likely continue to do so in the future. All these devices have limited computational power, often while trying to achieve minimal energy consumption, and might provide future applications for machine learning.
Consider a fully automated voice controlled coffee machine that identifies users by their face and remembers their favorite beverage. The machine could be connected to a cloud platform which runs the machine learning models and stores user information. The machine transfers the voice or image data to the server for processing, and receives the action to take or the settings to load.
There are a few requirements for this setup, which can be enumerated easily: a stable internet connection with sufficient bandwidth is required. Furthermore, the users have to agree to sending the required data to the company hosting the
1 https://github.com/Jopyth/BMXNet
arXiv:1809.10463v1 [cs.LG] 27 Sep 2018
Table 1: Comparison of available implementations for binary neural networks (columns: GPU, CPU, Python API, C++ API, Save Binary Model, Deploy on Mobile, Open Source, Cross Platform)

BNNs [1]:       X X X
DoReFa-Net [2]: X X X X X
XNOR-Net [3]:   X X
BMXNet [4]:     X X X X X X X X (all columns)
cloud platform. This not only requires trust from the users, but data privacy can be an issue, too, especially in other potential application areas, such as healthcare or finance.

All of these potential problems can be avoided by hosting the machine learning models directly on the coffee machine itself. However, there are other challenges, such as limited computational resources and limited memory, in addition to a possible reliance on battery power. We focus on solving these challenges by training a Binary Neural Network (BNN). In a BNN the commonly used full-precision weights of a convolutional neural network are replaced with binary weights. This results in a storage compression by a factor of 32× and allows for more efficient inference on CPU-only architectures. We discuss existing approaches, which have promising results, in Section 2. However, architectures, design choices, and hyperparameters are often presented without thorough explanation or experiments. Often, no source code of actual BNN implementations is available (see Table 1). This makes follow-up experiments and building actual applications based on BNNs difficult.
Therefore we provide our insights on existing network architectures and parameter choices, while striving to achieve a better understanding of BNNs (Section 3). We evaluate these choices and our novel ideas based on the open source framework BMXNet [4]. We discuss the results of a set of experiments on the MNIST, CIFAR-10 and ImageNet datasets (Section 4). Finally, we examine future ideas, such as quantized neural networks, wherein the binary weights are replaced with lower precision floating point numbers (Section 5).
Summarized, our contributions presented in this paper are:
– We provide novel empirical evidence for the choice of methods and parameters commonly used to train BNNs, such as how to deal with bottleneck architectures and the gradient clipping threshold.

– We found that dense shortcut connections can significantly improve the classification accuracy of BNNs and show how to create efficient models with this architecture.

– We offer our work as a contribution to the open source framework BMXNet [4], from which both academia and industry can take advantage. We share the code and models developed in this paper for research use.

– We present an overview of the performance of commonly used network architectures with binary weights.
2 Related Work
In this section we first present two network architectures, Residual Networks [5] and Densely Connected Networks [6], which focus on increasing information flow through the network. Afterwards we give an overview of networks and techniques which were designed to allow execution on mobile or embedded devices.
Residual Networks [5] combine the information of all previous layers with shortcut connections, leading to increased information flow. This is done by adding identity connections from the outputs of previous layers to the output of the current layer. Consequently, the shortcut connections add neither extra weights nor computational cost.
In Densely Connected Networks [6] the shortcut connections are instead built by concatenating the outputs of previous layers with the current layer. Therefore, new information gained in one layer can be reused throughout the entire depth of the network. To reduce the total model size, the original full-precision architecture includes a bottleneck design, which reduces the number of filters in transition layers. This effectively keeps the network at a very small total size, even though the concatenation adds new information into the network every few layers.
There are two main approaches which allow for execution on mobile devices: on the one hand, information in a CNN can be compressed through compact network design. These designs rely on full-precision floating point numbers, but reduce the total number of parameters with a clever network design, while preventing loss of accuracy. On the other hand, information can be compressed by avoiding the common usage of full-precision floating point weights, which use 32 bits of storage. Instead, these approaches use quantized floating-point numbers with lower precision (e.g. 8 bits of storage) or even binary (1 bit of storage) weights.
We first present a selection of techniques which utilize the former method. The first of these approaches, SqueezeNet, was presented by Iandola et al. [7] in 2016. The authors replace a large portion of 3×3 filters with smaller 1×1 filters in convolutional layers and reduce the number of input channels to the remaining 3×3 filters for a reduced number of parameters. Additionally, they facilitate late downsampling to maximize accuracy given the lower number of weights. Further compression is done by applying deep compression [8] to the model, for an overall model size of 0.5 MB.
A different approach, MobileNets, was implemented by Howard et al. [9]. They use a depth-wise separable convolution, where convolutions apply a single 3×3 filter to each input channel. Then, a 1×1 convolution is applied to combine their outputs. Zhang et al. [10] use channel shuffling to achieve group convolutions in addition to depth-wise convolution. Their ShuffleNet achieves a comparably lower error rate for the same number of operations needed by MobileNets. These approaches reduce memory requirements, but still require GPU hardware for efficient training and inference. Specific acceleration strategies for CPUs still need to be developed for these methods.
In contrast to this, approaches which use binary weights instead of full-precision weights can achieve both compression and acceleration. However, the drawback usually is a severe drop in accuracy. For example, the weights and activations in Binarized Neural Networks are restricted to either +1 or -1, as presented by Hubara et al. [1]. They further provide efficient calculation methods for the equivalent of a matrix multiplication by using XNOR and popcount operations. XNOR-Nets are built on a similar idea and were published by Rastegari et al. [3]. They include a channel-wise scaling factor to improve the approximation of full-precision weights, but require weights between layers to be stored as full-precision numbers. Another approach, called DoReFa-Net, was presented by Zhou et al. [2]. They focus on quantizing the gradients together with different bit-widths (down to binary values) for weights and activations, and replace the channel-wise scaling factor with one constant scalar for all filters. Another attempt to remove everything except binary weights is taken in ABC-Nets by Lin et al. [11]. This approach achieves a drop in top-1 accuracy of only about 5% on the ImageNet dataset compared to a full-precision network using the ResNet architecture. They suggest using between 3 and 5 binary weight bases to approximate full-precision weights, which increases model capacity, but also model complexity and size. Therefore, finding a way to accurately train a binary neural network still remains an unsolved task.
3 Methodology
In alignment with our goal to contribute to open-source frameworks, we publish the code and models and offer them as a contribution to the BMXNet framework. A few implementation details are provided here. We use the sign function for activation (and thus transform real-valued inputs into binary values):

sign(x) = +1 if x ≥ 0, −1 otherwise    (1)

The implementation uses a Straight-Through Estimator (STE) [12] which cancels the gradients when they get too large, as proposed by Hubara et al. [1]. Let c denote the objective function, r_i a real-valued input, and r_o ∈ {−1, +1} a binarized output. Furthermore, t_clip is a threshold for clipping gradients. In previous works the clipping threshold was set to t_clip = 1 [1]. Then, the straight-through estimator is:

Forward: r_o = sign(r_i)    Backward: ∂c/∂r_i = (∂c/∂r_o) · 1_{|r_i| ≤ t_clip}    (2)
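Per element, the forward and backward rules of Equations 1 and 2 can be sketched as follows (a minimal illustration with hypothetical helper names):

```c
#include <math.h>

/* Forward pass of the binary activation: sign(x) in {-1, +1} (Equation 1). */
float sign_forward(float x)
{
    return x >= 0.0f ? 1.0f : -1.0f;
}

/* Backward pass with the straight-through estimator (Equation 2): the
 * gradient grad_out arriving at the binarized output passes through
 * unchanged, but is canceled where |x| exceeds the clipping threshold. */
float ste_backward(float x, float grad_out, float t_clip)
{
    return fabsf(x) <= t_clip ? grad_out : 0.0f;
}
```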
Usually in full-precision networks a large amount of computation is spent on calculating dot products of matrices, as needed by fully connected and convolutional layers. The computational cost of binary neural networks can be highly reduced by using the XNOR and popcount CPU instructions. Both operations combined approximate the calculation of dot products of matrices. That
is because the element-wise multiplication and addition of a dot product can be replaced with the XNOR instruction followed by counting all bits which are set to 1 (popcount) [3]. Let x, w ∈ {−1, +1}^n denote the input and weights respectively (with n being the number of inputs). Then the matrix multiplication x · w can be replaced as follows:
x · w = 2 · bitcount(xnor(x, w)) − n    (3)
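For a single vector pair, Equation 3 can be checked with a small sketch (the function name and the n ≤ 64 restriction are our assumptions):

```c
#include <stdint.h>

/* Dot product of two {-1, +1} vectors of length n (n <= 64), computed via
 * Equation 3: bit i of xb/wb is set when the i-th element is +1. Each
 * matching bit contributes +1 and each mismatch -1 to the dot product,
 * hence 2 * popcount(xnor) - n equals the ordinary dot product. */
int binary_dot(uint64_t xb, uint64_t wb, int n)
{
    uint64_t mask = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
    int matches = __builtin_popcountll(~(xb ^ wb) & mask);
    return 2 * matches - n;
}
```

For example, x = (+1, −1, +1) and w = (+1, +1, +1) give xb = 0b101, wb = 0b111, two matching positions, and a dot product of 2 · 2 − 3 = 1.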
Preliminary experiments showed that an implementation as custom CUDA kernels was slower than using the highly optimized cuDNN implementation. But the above simplification means that we can still use normal training methods with GPU acceleration. We simply need to convert the weights from {−1, +1} to {0, 1} before deployment on a CPU architecture. Afterwards we can take advantage of the CPU implementation.
In the following sections we describe which parameters we evaluate and how we gain explanations about the whole system. First, we discuss common training parameters, such as including a scaling factor during training and the threshold for clipping the gradients. Secondly, we examine different deep neural network architectures, such as AlexNet [13], Inception [14,15], ResNet [5], and DenseNet [6]. During this examination, we focus on the effect of reducing weights in favor of increasing the number of connections, using the example of the DenseNet architecture. Thirdly, we determine the differences in learned features between binary neural networks and full-precision networks with feature visualization.
3.1 Network Architectures
Before thinking about model architectures, we must consider the main aspects which are necessary for binary neural networks. First of all, the information density is theoretically 32 times lower compared to full-precision networks. Research suggests that the difference between 32 bits and 8 bits seems to be minimal, and 8-bit networks can achieve almost identical accuracy to full-precision networks [8]. However, when decreasing the bit-width to four or even one bit (binary), the accuracy drops significantly [1]. Therefore, the precision loss needs to be alleviated through other techniques, for example by increasing information flow through the network. This can be successfully done through shortcut connections, which allow layers later in the network to access information gained in earlier layers despite the information loss through binarization. These shortcut connections were proposed for full-precision model architectures in Residual Networks [5] and Densely Connected Networks [6] (see Fig. 1a, c).
Following the same idea, network architectures including bottlenecks are always a challenge to adopt. The bottleneck architecture reduces the number of filters and values significantly between the layers, resulting in less information flow through binary neural networks. Therefore we hypothesize that we either need to eliminate the bottleneck parts or at least increase the number of filters in these bottleneck parts for accurate binary neural networks to achieve the best results (see Fig. 1b, d).
Fig. 1: Two (identical) building blocks of different network architectures. (a) The original ResNet design features a bottleneck architecture (the length of the bold black line represents the number of filters). A low number of filters reduces information capacity for binary neural networks. (b) A variation of the ResNet architecture without the bottleneck design. The number of filters is increased, but with only two convolutions instead of three. (c) The original DenseNet design with a bottleneck in the second convolution operation. (d) The DenseNet design without a bottleneck. The two convolution operations are replaced by one 3×3 convolution.
To increase the information flow, the blocks which add or derive new features in ResNet and DenseNet (see Fig. 1) have to be modified. In full-precision networks, the size of such a block ranges from 64 to 512 for ResNet [5]. The authors of DenseNet call this parameter the growth rate and set it to k = 32 [6]. Our preliminary experiments showed that reusing the full-precision DenseNet architecture for binary neural networks and only removing the bottleneck architecture does not achieve satisfactory performance. There are different possibilities to increase the information flow for a DenseNet architecture: the growth rate can be increased (e.g. k = 64, k = 128), we can use a larger number of blocks, or a combination of both (see Fig. 2). Both approaches add roughly the same amount of parameters to the network. It is not exactly the same, since other layers also depend on the growth rate parameter (e.g. the first fully-connected layer, which also changes the size of the final fully-connected layer and the transition layers). Our hypothesis of favoring an increased number of connections over simply adding more weights indicates that, in this case, increasing the number of blocks should provide better results (or a reduction of the total number of parameters for equal model performance) compared to increasing the growth rate.
3.2 Common Hyperparameters
One technique which was used in binary neural networks before is a scaling factor [2,3]. The result of a convolution operation is multiplied by this scaling
Fig. 2: Different ways to extract information with 3×3 convolutions. (a) A large block which generates a high number of features through one convolution. (b) Splitting one large block into two, which are half as large and generate half as many features respectively. This allows the features generated in the first block to be used by the second block. (c) This process can be repeated until a minimal desirable block size is found (e.g. 32 for binary neural networks).
factor. This should help binary weights act more similarly to full-precision weights by increasing the value range of the convolution operation. However, this factor was applied in different ways. We evaluate in Section 4.1 whether this scaling factor proves useful in all cases, because it adds additional complexity to the computation and the implementation.
Another parameter specific to binary neural networks is the clipping threshold t_clip. The value of this parameter determines which gradients are canceled and which are not. Therefore the parameter has a significant influence on the training result, and we evaluated different values for this parameter (also in Section 4.1).
3.3 Visualization of Trained Models
We used an implementation of the deep dream visualization [16] to visualize what the trained models had learned (see Fig. 5). The core idea is a normal forward pass followed by specifying an optimization objective, such as maximizing a certain neuron, filter, layer, or class during the backward pass.
Another tool we used for visualization is VisualBackProp [17]. It combines the high certainty about the relevancy of information in the later layers of the network with the higher resolution of earlier layers to efficiently identify those parts of the image which contribute most to the prediction.
4 Experiments and Results
Following the structure of the previous section, we provide our experimental results to compare the various parameters and techniques. First, we focus on classification accuracy as a measure to determine which parameter choices are better. Afterwards, we examine the results of our feature visualization techniques.
4.1 Classification Accuracy
In this section we apply classification accuracy as the general measurement to evaluate the different architectures, hyperparameters, etc. We use the MNIST [18], CIFAR-10 [19] and ImageNet [20] datasets, representing different levels of task complexity. The experiments were performed on a workstation with an Intel(R) Core(TM) i9-7900X CPU, 64 GB RAM and 4×Geforce GTX1080Ti GPUs.
As a general experiment setup, we use full-precision weights for the first (often a convolutional) layer and the last layer (often a fully connected layer with a number of output neurons equal to the number of classes) for all involved deep networks. We did not apply a scaling factor as proposed by Rastegari et al. [3] in our experiments. Instead we examined a (similar) scaling factor method proposed by Zhou et al. [2]. However, as shown in our hyperparameter evaluation (page 12), we chose not to apply this scaling factor in our other experiments. Further, the results of a binary LeNet for the MNIST dataset and a binary DenseNet with 21 layers can be seen in Table 2.
Popular Deep Architectures In this experiment our intention is to evaluate a selection of popular deep learning architectures using binary weights and activations. We wanted to discover positive and negative design patterns with respect to training binary neural networks. The first experiment is based on AlexNet [13], InceptionBN [21] and ResNet [5] (see Table 3). Using the AlexNet architecture, we were not able to achieve results similar to those presented by Rastegari et al. [3]. This might be due to us disregarding their scaling factor approach. Further, we were quite surprised that InceptionBN achieved even worse results than AlexNet. Our assumption for the bad result is that the Inception series applies "bottleneck" blocks intended to reduce the number of parameters and
Table 2: Evaluation of model performance on the MNIST and CIFAR-10 datasets.
Architecture Accuracy Model Size (Binary/Full Precision)
MNIST LeNet 99.3% 202KB/4.4MB
CIFAR-10 DenseNet-21 87.1% 1.9MB/51MB
Table 3: Classification accuracy (Top-1 and Top-5) of several popular deep learning architectures using binary weights and activations in their convolution and fully connected layers. Full-precision results are denoted with FP. ResNet-34-thin applies a lower number of filters (64, 64, 128, 256, 512), whereas ResNet-34-wide and ResNet-68-wide use a higher number of filters (64, 128, 256, 512, 1024).
Architecture   Top-1   Top-5   Epoch   Model Size (Binary/Full Precision)   Top-1 FP   Top-5 FP
AlexNet 30.2% 54.0% 70 22MB/233MB 62.5% 83.0%
InceptionBN 24.8% 48.0% 80 8MB/44MB - 92.1%
ResNet-18 42.0% 66.2% 37 3.4MB/45MB - -
ResNet-18 (from [11]) 42.7% 67.6% - - - -
ResNet-26 bottleneck 25.2% 47.1% 40 - - -
ResNet-34-thin 44.3% 69.1% 40 4.8MB/84MB 78.2% 94.3%
ResNet-34-wide 54.0% 77.2% 37 15MB/329MB - -
ResNet-68-wide 57.5% 80.3% 40 25MB/635MB - -
computational costs, which may negatively impact information flow. With this idea in mind, we continued the experiments with several ResNet models, and the results seem to verify our conjecture. If the ResNet architecture is used for full-precision networks, gradually increasing the width and depth of the network yields improvements in accuracy. On the contrary, when using binary neural networks, the bottleneck design seems to limit performance, as expected. We were not able to obtain higher accuracy with the ResNet-26 bottleneck architecture compared to ResNet-18. Additionally, if we only increase the depth without increasing the number of filters, we were not able to obtain a significant increase in accuracy (ResNet-34-thin compared to ResNet-18). To test our theory that the bottleneck design hinders information flow, we enlarged the number of filters throughout the network from (64, 64, 128, 256, 512) to (64, 128, 256, 512, 1024). This achieves almost 10% top-1 accuracy gain in a ResNet architecture with 34 layers (ResNet-34-wide). Further improvements can be obtained by using ResNet-68-wide with both increased depth and width. This suggests that network width and depth should be increased simultaneously for best results.
We also conducted experiments on further architectures such as VGG-Net [22], Inception-ResNet [23] and MobileNet [9]. Although we applied batch normalization, the VGG-style networks with more than 10 layers had to be trained accumulatively (layer by layer), since the models did not achieve any usable result when trained from scratch. Other networks such as Inception-ResNet and MobileNet are also not appropriate for binary training due to their architectural design (bottleneck blocks and a low number of filters). We assume that the shortcut connections of the ResNet architecture keep the information flow unobstructed during training, which is why we could directly train a binary ResNet model from scratch without additional support. Based on the results obtained in our experiments, we achieved the
10 Joseph Bethge, Haojin Yang, Christian Bartz, Christoph Meinel
Table 4: Classification accuracy of binary DenseNet and ResNet models on the ImageNet dataset. The number of parameters is kept at a similar level for both architectures to verify that the improvements are based on the increased number of connections and not on an increased number of parameters.

| Architecture | Top-1 | Top-5 | Epoch | Model Size (FP) | Number of Parameters |
|---|---|---|---|---|---|
| DenseNet-21 | 50.0% | 73.3% | 49 | 44MB | 11,498,086 |
| ResNet-18 | 42.0% | 66.2% | 37 | 45MB | 11,691,950 |
| DenseNet-45 | 57.9% | 80.0% | 52 | 250MB | 62,611,886 |
| ResNet-34-wide | 54.0% | 77.2% | 37 | 329MB | 86,049,198 |
| ResNet-68-wide | 57.5% | 80.3% | 35 | 635MB | 166,283,182 |
same level of classification accuracy compared to the latest result from ABC-Net [11] (ResNet-18 result with weight base 1 and activation base 1).
As we learned from the previous experiments, we consider shortcut connections a useful compensation for the reduced information flow. But this raised the following question: could we improve model performance further by simply increasing the number of shortcut connections? To answer this question, we conducted further experiments based on the DenseNet [6] architecture.
Shortcut Connections Driven Accuracy Gain In our first experiment we created binary models using both the DenseNet and ResNet architectures with similar complexities. We kept the number of parameters at a roughly equal level to verify that the improvements obtained by using the DenseNet architecture come from the increased number of connections and not from a general increase in parameters. Our evaluation results show that these dense connections can significantly compensate for the information loss caused by binarization (see Table 4). The improvement gained by using DenseNet-21 compared to ResNet-18 is up to 8%², although the number of utilized parameters is even lower. Furthermore, when we compare binary ResNet-68-wide to DenseNet-45, the latter has less than half the number of parameters of the former, but achieves a very similar result in terms of classification accuracy.
In our second set of experiments, we wanted to confirm our hypothesis that increasing the number of blocks is more efficient than just increasing the block size, using the DenseNet architecture as an example. We distinguish the four architectures through the two main parameters relevant for this experiment: the growth rate k per block and the number of blocks per unit b; the total number of layers is n = 8·b + 5. The four architectures we compare are: DenseNet-13 (k = 256, b = 1), DenseNet-21 (k = 128, b = 2), DenseNet-37 (k = 64, b = 4), and DenseNet-69 (k = 32, b = 8).
² We note that this is significantly more than the improvement between two full-precision models with a similar number of parameters (DenseNet-264 and ResNet-50), which is less than 2% (22.15% and 23.9% top-1 error rate, reported by [6]).
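As a quick sanity check, the layer-count relation n = 8·b + 5 reproduces the names of all four architectures:

```python
# Sanity check of the layer-count relation n = 8*b + 5 used in the text,
# where b is the number of blocks per unit of the binary DenseNet.
configs = [("DenseNet-13", 1), ("DenseNet-21", 2),
           ("DenseNet-37", 4), ("DenseNet-69", 8)]

for name, b in configs:
    n = 8 * b + 5
    assert n == int(name.split("-")[1]), (name, n)
    print(f"{name}: b = {b} -> n = {n} layers")
```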
Learning to Train a Binary Neural Network 11
[Figure 3 plots Top-1 and Top-5 accuracy over roughly 30 training epochs for four binary DenseNet variants. Final values: DenseNet-13 (k=256, b=1): 49.2%/72.8%, model 8.0 MB, 13.2 M weights; DenseNet-21 (k=128, b=2): 48.5%/72.2%, 6.5 MB, 9.7 M; DenseNet-37 (k=64, b=4): 47.4%/71.2%, 5.8 MB, 8.2 M; DenseNet-69 (k=32, b=8): 48.3%/71.7%, 5.5 MB, 7.4 M.]

Fig. 3: Model performance of binary DenseNet models with different growth rates k and numbers of blocks b. Increasing b while decreasing k leads to smaller models without a significant decrease in accuracy, since the reduction of weights is compensated by the increased number of connections.
Despite a 31% reduction in model size between DenseNet-13 and DenseNet-69, the accuracy loss is only 1% (see Fig. 3). We further conclude that this similarity is not coincidental, since all architectures perform very similarly over the whole training process. We note again that for all models the first convolutional layer and the final fully-connected layer use full-precision weights. Further, we set the size of the former layer depending on the growth rate k, with a number of filters equal to 2·k. Therefore, a large portion of the model size reduction comes from reducing the size of the first convolutional layer, which subsequently also reduces the size of the final fully-connected layer.
However, a larger fully-connected layer could simply add duplicate or similar features without affecting performance. This would mean that the reduction of model size in our experiments comes from a different, independent variable. To eliminate this possibility, we ran a post-hoc analysis to check whether we can reduce the size of the first layer without impacting performance. We used DenseNet-13 with a reduced first layer, which has the same size as for DenseNet-69 (which uses k = 32), i.e., 2·k = 64 filters. Even though the performance of the model is similar for the first few epochs, the accuracy does not reach comparable levels: after 31 epochs, its Top-1 accuracy is only 47.1% (2.1% lower) and its Top-5 accuracy is only 70.7% (2.1% lower). In addition to degrading the accuracy more than increasing the number of connections does, it only reduces the model size by 6% (0.4 MB), since the transition layers are unchanged. This confirms our hypothesis that, by increasing the number of connections, we can eliminate the usual reduction in accuracy that a binary neural network suffers when the number of weights is reduced.
In summary, we have learned two important findings from the previous experiments for training an accurate binary network:
– Increasing information flow through the network improves the classification accuracy of a binary neural network.
– We found two ways to realize this: increase the network width appropriately while increasing depth, or increase the number of shortcut connections.
[Figure 4 plots Top-5 accuracy over 40 training epochs: panel (a) shows curves for gradient clipping thresholds 0.1, 0.25, 0.5, 0.75, 1 and 2; panel (b) shows curves for ResNet-18 (N), ResNet-18 (FB) and ResNet-18 (B).]

Fig. 4: (a) Classification accuracy for varying gradient clipping thresholds. The validation model is trained on ImageNet with the ResNet-18 topology. (b) Accuracy evaluation using a scaling factor on network weights in three different modes: (N) no scaling, (B) scaling factor on weights only in the backward computation, and (FB) weight scaling in both the forward and backward pass.
Specific Hyperparameter Evaluation In this section we evaluate two specific hyperparameters for training a binary neural network: the gradient clipping threshold and the use of a scaling factor.
Using a gradient clipping threshold tclip was originally proposed by Hubara et al. [1] and reused in more recent work [3,11] (see Section 3.2). In short, when using the STE we only let gradients pass through if the input ri satisfies |ri| ≤ tclip. Setting tclip = 1 is presented in the literature with only cursory explanation. Thus, we evaluated it by exploring a suitable value range (see Fig. 4a). We used classification accuracy as the evaluation metric and selected thresholds empirically from the range [0.1, 2.0]. The validation model is trained on the ImageNet dataset with the ResNet-18 network architecture. From the results we can see that tclip = 1 is suboptimal; the optimum lies between 0.5 and 0.75. We therefore applied tclip = 0.5 in all other experiments in this paper.
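The effect of the threshold can be illustrated with a small sketch (our illustration, not from the paper): assuming roughly unit-variance pre-activations, tclip directly controls the fraction of gradients the STE lets through.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption: pre-activations are roughly standard-normal (e.g., after
# batch normalization); the real distribution depends on the network.
r = rng.standard_normal(100_000)

for t_clip in [0.1, 0.25, 0.5, 0.75, 1.0, 2.0]:
    kept = np.mean(np.abs(r) <= t_clip)  # fraction of gradients the STE keeps
    print(f"t_clip = {t_clip:<4}: {kept:.1%} of gradients pass")
```

For a standard normal distribution, tclip = 1 already keeps about two thirds of the gradients, so small changes of the threshold shift a substantial share of the gradient signal.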
Scaling factors were proposed by Rastegari et al. [3]. In their work, the scaling factor is the mean of the absolute values of each output channel of the weights. Subsequently, Zhou et al. [2] proposed a scaling factor that scales all filters instead of performing channel-wise scaling. The intuition behind both methods is to increase the value range of the weights, with the intention of mitigating the information loss during training of a binary network. We evaluated the accuracy in three running modes following the implementation of Zhou et al. [2]: (N) no scaling, (B) scaling factor on weights only in the backward computation, and (FB) weight scaling in both the forward and backward pass. The results indicate that no accuracy gain can be obtained by using a scaling factor with the ResNet-18 network architecture (see Fig. 4b). Therefore we did not apply a scaling factor in our other experiments.
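For reference, the channel-wise scaling of Rastegari et al. can be sketched in a few lines of numpy (an illustrative sketch, not the authors' implementation): α is the mean absolute weight per output channel, which is the L2-optimal scalar for approximating w by α·sign(w).

```python
import numpy as np

def channel_scale(w):
    """Mean absolute value per output channel; w has shape
    (out_channels, in_channels, kh, kw)."""
    return np.abs(w).mean(axis=(1, 2, 3), keepdims=True)

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 3, 3, 3))
alpha = channel_scale(w)                      # shape (4, 1, 1, 1)

err_scaled = np.linalg.norm(w - alpha * np.sign(w))
err_plain = np.linalg.norm(w - np.sign(w))
print(err_scaled < err_plain)  # scaling reduces the approximation error
```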
(a) DenseNet (FP) (b) DenseNet-21
(c) ResNet-18 (d) ResNet-68
Fig. 5: The deep dream [16] visualization of binary models with different complexity and size (best viewed digitally with zoom). The full-precision DenseNet model (a) is the only one which produces visualizations of animal faces and objects. Additional models and samples can be found in the supplementary material.
4.2 Visualization Results
To better understand the differences between binary and full-precision networks, and between the various binary architectures, we created several visualizations. The results show that the full-precision version of the DenseNet captures overall concepts, since rough objects, such as animal faces, can be recognized in the DeepDream visualization (see Fig. 5a). The binary networks perform much worse. Especially the ResNet architecture with 18 layers (see Fig. 5c) seems to learn much noisier and less coherent shapes. Further, we can see small and large areas of gray, which hint at missing information flow in certain parts of the network. This most likely comes from the loss of information through binarization, which stops neurons from activating. The issue is less visible for a larger architecture, but even there small areas of gray appear (see Fig. 5d). However, the DenseNet architecture with 21 layers (see Fig. 5b), which has a comparable number of parameters, produces more object-like pictures with less noise. Areas without any activations seem not to exist, indicating that information can be passed through this network more efficiently.
The visualization with VisualBackProp shows a similar difference in the quality of the learned features (see Fig. 6). It reflects the parts of the image which contributed to the final prediction of the model. The visualization of a full-precision ResNet-18 clearly highlights the remarkable features of the classes to be detected (e.g., the outline of a lighthouse, or the head of a dog). In contrast, the visualization of a binary ResNet-18 only highlights small relevant parts of the image, and considers other less relevant elements in the image (e.g., a horizon
Fig. 6: Two samples of the ImageNet dataset visualized with VisualBackProp for three network architectures (from top to bottom): full-precision ResNet-18, binary ResNet-18, binary DenseNet-21. Each depiction shows (from left to right): original image, activation map, composite of both (best viewed digitally with zoom). Additional samples can be found in the supplementary material.
behind a lighthouse). The binary DenseNet-21 model also achieves less clarity than the full-precision model, but highlights more of the relevant features (e.g., parts of the outline of a dog).
5 Conclusion
In this paper, we presented our insights on training binary neural networks. Our aim is to close the gap between the theoretical design of binary neural networks and their practical training, by communicating our insights in this work and providing access to our code and models, which can be used on mobile and embedded devices. We evaluated hyperparameters, network architectures and different methods of training a binary neural network. Our results indicate that increasing the number of connections between the layers of a binary neural network can improve its accuracy more efficiently than simply adding more weights.
Based on these results, we would like to explore more methods of increasing the number of connections in binary neural networks in future work. Additionally, similar ideas can be explored for quantized networks, for example, how networks with multiple binary bases compare to quantized low bit-width networks. The information density should be equal in theory, but are there differences in practice when training these networks?
References
1. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Advances in Neural Information Processing Systems. (2016) 4107–4115
2. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. (2016) 1–14
3. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision, Springer (2016) 525–542
4. Yang, H., Fritzsche, M., Bartz, C., Meinel, C.: BMXNet: An open-source binary neural network implementation based on MXNet. In: Proceedings of the 2017 ACM on Multimedia Conference, ACM (2017) 1209–1212
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 770–778
6. Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Volume 1. (2017) 3
7. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. (2016) 1–13
8. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. (2015) 1–14
9. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. (2017)
10. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: An extremely efficient convolutional neural network for mobile devices. (2017) 1–10
11. Lin, X., Zhao, C., Pan, W.: Towards accurate binary convolutional neural network. In: Advances in Neural Information Processing Systems. (2017) 344–352
12. Hinton, G.: Neural Networks for Machine Learning, Coursera. URL: http://coursera.org/course/neuralnets (last accessed 2018-03-13) (2012)
13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. (2012) 1097–1105
14. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions. In: CVPR (2015)
15. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Inception architecture for computer vision. (2015)
16. Mordvintsev, A., Olah, C., Tyka, M.: Inceptionism: Going deeper into neural networks. URL: https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html (last accessed 2018-03-13) (2015)
17. Bojarski, M., Choromanska, A., Choromanski, K., Firner, B., Jackel, L., Muller, U., Zieba, K.: VisualBackProp: efficient visualization of CNNs. (2016)
18. LeCun, Y., Cortes, C.: MNIST handwritten digit database. (2010)
19. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research). (2014)
20. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR09. (2009)
21. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. (2015) 448–456
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2015)
23. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI. Volume 4. (2017) 12
Back to Simplicity: How to Train Accurate BNNs from Scratch?
Joseph Bethge∗, Haojin Yang∗, Marvin Bornstein, Christoph Meinel
Hasso Plattner Institute, University of Potsdam, Germany
{joseph.bethge,haojin.yang,meinel}@hpi.de, {marvin.bornstein}@student.hpi.de
Abstract
Binary Neural Networks (BNNs) show promising progress in reducing computational and memory costs but suffer from substantial accuracy degradation compared to their real-valued counterparts on large-scale datasets, e.g., ImageNet. Previous work mainly focused on reducing quantization errors of weights and activations, whereby a series of approximation methods and sophisticated training tricks have been proposed. In this work, we make several observations that challenge conventional wisdom. We revisit some commonly used techniques, such as scaling factors and custom gradients, and show that these methods are not crucial in training well-performing BNNs. On the contrary, we suggest several design principles for BNNs based on the insights learned and demonstrate that highly accurate BNNs can be trained from scratch with a simple training strategy. We propose a new BNN architecture, BinaryDenseNet, which significantly surpasses all existing 1-bit CNNs on ImageNet without tricks. In our experiments, BinaryDenseNet achieves 18.6% and 7.6% relative improvement over the well-known XNOR-Network and the current state-of-the-art Bi-Real Net in terms of top-1 accuracy on ImageNet, respectively. https://github.com/hpi-xnor/BMXNet-v2
1. Introduction
Convolutional Neural Networks have achieved state-of-the-art results on a variety of computer vision tasks, for example, classification [17], detection [7], and text recognition [15]. There are two main approaches which, by reducing memory footprint and accelerating inference, allow for the execution of neural networks on devices with low computational power, e.g., mobile or embedded devices: On the one hand, information in a CNN can be compressed through compact network design. Such methods use full-precision floating point numbers as weights, but reduce the total number of parameters and operations through clever network design, while minimizing loss of accuracy, e.g.,
∗Authors contributed equally
SqueezeNet [13], MobileNets [10], and ShuffleNet [30]. On the other hand, information can be compressed by avoiding the common usage of full-precision floating point weights and activations, which use 32 bits of storage. Instead, quantized floating-point numbers with lower precision (e.g., 4 bits of storage) [31] or even binary (1 bit of storage) weights and activations [12, 19, 22, 23] are used in these approaches. A BNN achieves up to 32× memory saving and 58× speedup on CPUs by representing both weights and activations with binary values [23]. Furthermore, computationally efficient bitwise operations such as xnor and bitcount can be applied for the convolution computation instead of arithmetic operations. Despite these essential advantages in efficiency and memory saving, BNNs still suffer from a noticeable accuracy degradation that prevents their practical usage. To improve the accuracy of BNNs, previous approaches mainly focused on reducing quantization errors by using complicated approximation methods and training tricks, such as scaling factors [23], multiple weight/activation bases [19], fine-tuning a full-precision model, multi-stage pre-training, or custom gradients [22]. These works applied well-known real-valued network architectures such as AlexNet, GoogLeNet or ResNet to BNNs without thorough explanation of or experiments on the design choices. However, they do not answer the simple yet essential question: are those real-valued network architectures seamlessly suitable for BNNs? Therefore, appropriate network structures for BNNs should be adequately explored.
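To make the xnor/bitcount arithmetic concrete, here is a hypothetical sketch (our illustration, not from the paper) of a binary dot product, with ±1 vectors packed into the bits of a Python integer:

```python
import random

def binary_dot(x_bits, w_bits, n):
    """Dot product of two {-1, +1} vectors of length n, each packed into an
    integer (bit 1 encodes +1, bit 0 encodes -1). For a, b in {-1, +1}^n:
    sum(a_i * b_i) = n - 2 * popcount(a XOR b)."""
    diff = (x_bits ^ w_bits) & ((1 << n) - 1)
    return n - 2 * bin(diff).count("1")

def pack(v):
    """Pack a list of +1/-1 values into an integer bit mask."""
    return sum(1 << i for i, b in enumerate(v) if b == 1)

random.seed(0)
n = 64
x = [random.choice([-1, 1]) for _ in range(n)]
w = [random.choice([-1, 1]) for _ in range(n)]

# The bitwise version matches the arithmetic dot product.
assert binary_dot(pack(x), pack(w), n) == sum(a * b for a, b in zip(x, w))
```

In an optimized implementation the xor and popcount run on whole machine words, which is where the reported speedup over floating-point multiply-accumulates comes from.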
In this work, we first revisit some commonly used techniques in BNNs. Surprisingly, our observations do not match conventional wisdom. We found that most of these techniques are not necessary to reach state-of-the-art performance. On the contrary, we show that highly accurate BNNs can be trained from scratch by “simply” maintaining rich information flow within the network. We present how increasing the number of shortcut connections improves the accuracy of BNNs significantly and demonstrate this by designing a new BNN architecture, BinaryDenseNet. Without bells and whistles, BinaryDenseNet reaches state-of-the-art performance using a standard training strategy, which is much more efficient than previous approaches.
arXiv:1906.08637v1 [cs.LG] 19 Jun 2019
Table 1: A general comparison of the methods most related to this work. Essential characteristics such as the value spaces of inputs and weights, the number of multiply-accumulate operations (MACs), the number of binary operations, the theoretical speedup rate, and the operation types are depicted. The results are based on a single quantized convolution layer from each work. β and α denote the full-precision scaling factors used in the respective methods, whilst m, n, k denote the dimensions of the weights (W ∈ R^{n×k}) and the input (I ∈ R^{k×m}). The table is adapted from [28].
| Methods | Inputs | Weights | MACs | Binary Operations | Speedup | Operations |
|---|---|---|---|---|---|---|
| Full-precision | R | R | n×m×k | 0 | 1× | mul, add |
| BC [3] | R | {−1, 1} | n×m×k | 0 | ∼2× | sign, add |
| BWN [23] | R | {−α, α} | n×m×k | 0 | ∼2× | sign, add |
| TTQ [33] | R | {−α_n, 0, α_p} | n×m×k | 0 | ∼2× | sign, add |
| DoReFa [31] | {0, 1}×4 | {0, α} | n×k | 8×n×m×k | ∼15× | and, bitcount |
| HORQ [18] | {−β, β}×2 | {−α, α} | 4×n×m | 4×n×m×k | ∼29× | xor, bitcount |
| TBN [28] | {−1, 0, 1} | {−α, α} | n×m | 3×n×m×k | ∼40× | and, xor, bitcount |
| XNOR [23] | {−β, β} | {−α, α} | 2×n×m | 2×n×m×k | ∼58× | xor, bitcount |
| BNN [12] | {−1, 1} | {−1, 1} | 0 | 2×n×m×k | ∼64× | xor, bitcount |
| Bi-Real [22] | {−1, 1} | {−1, 1} | 0 | 2×n×m×k | ∼64× | xor, bitcount |
| Ours | {−1, 1} | {−1, 1} | 0 | 2×n×m×k | ∼64× | xor, bitcount |
Summarized, our contributions in this paper are:
• We show that highly accurate binary models can be trained using a standard training strategy, which challenges conventional wisdom. We analyze why applying common techniques (e.g., scaling methods, custom gradients, and fine-tuning a full-precision model) is ineffective when training from scratch and provide empirical proof.
• We suggest several general design principles for BNNs and further propose a new BNN architecture, BinaryDenseNet, which significantly surpasses all existing 1-bit CNNs for image classification without tricks.
• To guarantee reproducibility, we contribute to an open source framework for BNNs/quantized NNs. We share the code and models implemented in this paper for classification and object detection. Additionally, we implemented the most influential BNNs, including [12, 19, 22, 23, 31], to facilitate follow-up studies.
The rest of the paper is organized as follows: we describe related work in Section 2. We revisit common techniques used in BNNs in Section 3. Sections 4 and 5 present our approach and the main results.
2. Related work

In this section, we roughly divide the recent efforts for binarization and compression into three categories: (i) compact network design, (ii) networks with quantized weights, and (iii) networks with quantized weights and activations.

Compact Network Design. These methods use full-precision floating point numbers as weights, but reduce the total number of parameters and operations through compact network design, while minimizing loss of accuracy.
The commonly used techniques include replacing a large portion of 3×3 filters with smaller 1×1 filters [13], using depth-wise separable convolutions to reduce operations [10], and utilizing channel shuffling to achieve group convolutions in addition to depth-wise convolutions [30]. These approaches still require GPU hardware for efficient training and inference. A strategy to accelerate the computation of all these methods on CPUs has yet to be developed.

Quantized Weights and Real-valued Activations. Recent efforts in this category include, for instance, BinaryConnect (BC) [3], Binary Weight Network (BWN) [23], and Trained Ternary Quantization (TTQ) [33]. In these works, network weights are quantized to lower precision or even binary. Thus, considerable memory saving with relatively little accuracy loss has been achieved. However, no noteworthy acceleration can be obtained due to the real-valued inputs.

Quantized Weights and Activations. In contrast, approaches adopting quantized weights and activations can achieve both compression and acceleration. Remarkable attempts include DoReFa-Net [31], High-Order Residual Quantization (HORQ) [18] and SYQ [6], which reported promising results on ImageNet [4] with 1-bit weights and multi-bit activations.

Binary Weights and Activations. BNNs are the extreme case of quantization, where both weights and activations are binary. Hubara et al. proposed the Binarized Neural Network (BNN) [12], where weights and activations are restricted to +1 and −1. They provide efficient calculation methods for the equivalent of matrix multiplication by using xnor and bitcount operations. XNOR-Net [23] improved the performance of BNNs by introducing a channel-wise scaling factor to reduce the approximation error of full-precision parameters. ABC-Nets [19] used multiple
Table 2: The influence of using scaling, a full-precision downsampling convolution, and the approxsign function on the CIFAR-10 dataset, based on a binary ResNetE18. Using approxsign instead of sign slightly boosts accuracy, but only when training a model with scaling factors.

| Use scaling of [23] | Downsampl. convolution | Use approxsign of [22] | Accuracy Top-1/Top-5 |
|---|---|---|---|
| no | binary | yes | 84.9%/99.3% |
| no | binary | no | 87.2%/99.5% |
| no | full-precision | yes | 86.1%/99.4% |
| no | full-precision | no | 87.6%/99.5% |
| yes | binary | yes | 84.2%/99.2% |
| yes | binary | no | 83.6%/99.2% |
| yes | full-precision | yes | 84.4%/99.3% |
| yes | full-precision | no | 84.7%/99.2% |
weight bases and activation bases to approximate their full-precision counterparts. Despite the promising accuracy improvement, the significant growth in the number of weight and activation copies offsets the memory saving and speedup of BNNs. Wang et al. [28] attempted to use binary weights and ternary activations in their Ternary-Binary Network (TBN). They achieved a certain degree of accuracy improvement at the cost of more operations compared to fully binary models. In Bi-Real Net, Liu et al. [22] proposed several modifications to ResNet. They achieved state-of-the-art accuracy by applying an extremely sophisticated training strategy that consists of full-precision pre-training, multi-step initialization (ReLU→leaky clip→clip [21]), and custom gradients.
Table 1 gives a thorough overview of the recent efforts in this research domain. We can see that our work follows the most straightforward binarization strategy, as BNN [12] does, which achieves the highest theoretical speedup rate and the highest compression ratio. Furthermore, we directly train a binary network from scratch by adopting a simple yet effective strategy.
3. Study on Common Techniques
In this section, to ease understanding, we first provide a brief overview of the major implementation principles of a binary layer (see the supplementary materials for more details). We then revisit three commonly used techniques in BNNs: scaling factors [23, 31, 28, 27, 18, 33, 19], full-precision pre-training [31, 22], and the approxsign function [22]. We did not observe the accuracy gains expected from these techniques. We analyze why they are not as effective as previously presented when training from scratch and provide empirical proof. The findings from this study motivate us to explore more effective solutions for training accurate BNNs.
Figure 1: An exemplary calculation showing that normalization minimizes the difference between a binary convolution with scaling (right column) and one without (middle column). In the top row, the columns from left to right show the gemm results of the full-precision, binary, and binary-with-scaling cases, respectively. The bottom row shows their results after normalization. Errors are the absolute differences between the full-precision and binary results. The results indicate that normalization dilutes the effect of scaling.
3.1. Implementation of Binary Layers
We apply the sign function for binary activation, thus transforming floating-point values into binary values:

sign(x) = +1 if x ≥ 0, −1 otherwise.   (1)
The implementation uses a Straight-Through Estimator (STE) [1] with the addition that it cancels the gradients when the inputs get too large, as proposed by Hubara et al. [12]. Let c denote the objective function, ri a real-valued input, and ro ∈ {−1, +1} a binary output. Furthermore, tclip is the threshold for clipping gradients, which was set to tclip = 1 in previous works [31, 12]. The resulting STE is then:
Forward: ro = sign(ri).   (2)

Backward: ∂c/∂ri = (∂c/∂ro) · 1_{|ri| ≤ tclip}.   (3)
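Equations (1)–(3) translate into a few lines of numpy; this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def sign_forward(r_in):
    """Eq. (1)/(2): binarize the input; sign(0) maps to +1."""
    return np.where(r_in >= 0, 1.0, -1.0)

def sign_backward(r_in, grad_out, t_clip=1.0):
    """Eq. (3): straight-through estimator. The gradient passes through
    unchanged where |r_in| <= t_clip and is cancelled elsewhere."""
    return grad_out * (np.abs(r_in) <= t_clip)

r = np.array([-1.5, -0.3, 0.0, 0.7, 2.0])
print(sign_forward(r))                    # [-1. -1.  1.  1.  1.]
print(sign_backward(r, np.ones_like(r)))  # [0. 1. 1. 1. 0.]
```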
3.2. Scaling Methods
Binarization always introduces an approximation error compared to a full-precision signal. In their analysis, Zhou et al. [32] show that this error linearly degrades the accuracy of a CNN.
Consequently, Rastegari et al. [23] propose to scale the output of the binary convolution by the average absolute weight value per channel (α) and the average absolute activation over all input channels (K):

x ∗ w ≈ binconv(sign(x), sign(w)) · K · α   (4)
Table 3: The influence of using scaling, a full-precision downsampling convolution, and the approxsign function on the ImageNet dataset, based on a binary ResNetE18.

| Use scaling of [23] | Downsampl. convolution | Use approxsign of [22] | Accuracy Top-1/Top-5 |
|---|---|---|---|
| no | binary | yes | 54.3%/77.6% |
| no | binary | no | 54.5%/77.8% |
| no | full-precision | yes | 56.6%/79.3% |
| no | full-precision | no | 58.1%/80.6% |
| yes | binary | yes | 53.3%/76.4% |
| yes | binary | no | 52.7%/76.1% |
| yes | full-precision | yes | 55.3%/78.3% |
| yes | full-precision | no | 55.6%/78.4% |
The scaling factors should help binary convolutions to increase their value range, producing results closer to those of full-precision convolutions and reducing the approximation error. However, these different scaling values influence specific output channels of the convolution. Therefore, a BatchNorm [14] layer directly after the convolution (which is used in all modern architectures) theoretically minimizes the difference between a binary convolution with scaling and one without. Thus, we hypothesize that learning a useful scaling factor is made inherently difficult by BatchNorm layers. Figure 1 demonstrates an exemplary calculation supporting our hypothesis.
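This absorption can be checked numerically; the following sketch (our illustration, not the paper's code) applies the normalization step of BatchNorm to a convolution output with and without per-channel scaling:

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    """Normalization step of BatchNorm over the batch axis,
    without the learned affine parameters."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
conv_out = rng.standard_normal((128, 8))    # (batch, output channels)
alpha = rng.uniform(0.5, 2.0, size=(1, 8))  # per-channel scaling factors

plain = batchnorm(conv_out)
scaled = batchnorm(conv_out * alpha)        # scaling applied before BatchNorm

# Up to the small eps term, the per-channel scaling is absorbed entirely.
print(np.allclose(plain, scaled, atol=1e-3))
```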
We empirically evaluated the influence of scaling factors (as proposed by Rastegari et al. [23]) on the accuracy of our trained models based on the binary ResNetE architecture (see Section 4.2). First, the results of our CIFAR-10 [17] experiments verify our hypothesis that applying scaling when training a model from scratch does not lead to better accuracy (see Table 2). All models show a decrease in accuracy between 0.7% and 3.6% when applying scaling factors. Secondly, we evaluated the influence of scaling on the ImageNet dataset (see Table 3). The result is similar: scaling reduces model accuracy by 1.0% to 1.7%. We conclude that the BatchNorm layers following each convolution layer absorb the effect of the scaling factors. To avoid the additional computational and memory costs, we do not use scaling factors in the rest of the paper.
3.3. Full-Precision Pre-Training
Fine-tuning a full-precision model into a binary one is beneficial only if it yields better results in comparable total training time. We trained our binary ResNetE18 in three different ways: fully from scratch (1), and by fine-tuning a full-precision ResNetE18 with ReLU (2) or clip (proposed by [22]) (3) as the activation function (see Figure 2). The full-precision trainings followed the typical configuration of
[Figure 2: plot of top-1 validation accuracy over 40 epochs; the three runs reach final top-1 accuracies of 57.0%, 56.3%, and 55.1%.]
Figure 2: Top-1 validation accuracy per epoch of training binary ResNetE18 from scratch (red, 40 epochs, Adam) and from a full-precision pre-training (20 epochs, SGD) with clip (green) and ReLU (blue) as activation function. The degradation peak of the green and blue curves at epoch 20 depicts a heavy "re-learning" effect when we start fine-tuning a full-precision model into a binary one.
momentum SGD with weight decay over 20 epochs, with a learning rate decay of 0.1 after 10 and 15 epochs. For all binary trainings, we used Adam [16] without weight decay, with learning rate updates at epochs 10 and 15 for the fine-tuning and at epochs 30 and 38 for the full binary training. Our experiment shows that clip performs worse than ReLU, both for fine-tuning and in general. Additionally, training from scratch yields a slightly better result than pre-training. Pre-training inherently adds complexity to the training procedure, because the different architecture of binary networks does not allow using published ReLU models. Thus, we advocate avoiding the fine-tuning of full-precision models. Note that our observations are based on the architectures involved in this work; a more comprehensive evaluation of other networks remains future work.
3.4. Backward Pass of the Sign Function
Liu et al. [22] claim that a differentiable approximation function, called approxsign, can be obtained by replacing the backward pass with
∂c/∂r_i = ∂c/∂r_o · 1_{|r_i| ≤ t_clip} · { 2 − 2r_i  if r_i ≥ 0,  2 + 2r_i  otherwise }.    (5)
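A minimal sketch of the two backward alternatives for the sign function, assuming a clip threshold t_clip = 1 (function names are ours, not from the paper):

```python
import numpy as np

def ste_backward(r, t_clip=1.0):
    """Straight-through estimator: gradient 1 inside |r| <= t_clip, 0 outside."""
    return (np.abs(r) <= t_clip).astype(float)

def approxsign_backward(r, t_clip=1.0):
    """Backward pass of Eq. (5): triangular gradient 2 - 2|r| inside the
    clipping interval, 0 outside."""
    grad = np.where(r >= 0, 2 - 2 * r, 2 + 2 * r)
    return np.where(np.abs(r) <= t_clip, grad, 0.0)

r = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
print(ste_backward(r))         # [0. 1. 1. 1. 0.]
print(approxsign_backward(r))  # [0. 1. 2. 1. 0.]
```

The approxsign gradient concentrates weight near zero instead of being uniform over the clipping interval, which is the claimed closer match to the derivative of a smoothed sign.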
Since this could also be beneficial when training a binary network from scratch, we evaluated it in our experiments. We compared the regular backward pass of sign with approxsign. First, the results of our CIFAR-10 experiments seem to depend on whether we use scaling or not. If we use scaling, both functions perform similarly (see Table 2). Without scaling, the approxsign function leads to less accurate models on CIFAR-10. In our experiments on ImageNet, the performance difference between the two functions is minimal (see Table 3). We conclude that the benefit of applying approxsign instead of the sign function seems to be specific to
Table 4: Comparison of our binary ResNetE18 model to state-of-the-art binary models using ResNet18 on the ImageNet dataset. The top-1 and top-5 validation accuracies are reported. For the sake of fairness, we use the ABC-Net result with 1 weight base and 1 activation base in this table.
Downsampl. convolution | Size | Our result | Bi-Real [22] | TBN [28] | HORQ [18] | XNOR [23] | ABC-Net (1/1) [19]
full-precision | 4.0 MB | 58.1%/80.6% | 56.4%/79.5% | 55.6%/74.2% | 55.9%/78.9% | 51.2%/73.2% | n/a
binary | 3.4 MB | 54.5%/77.8% | n/a | n/a | n/a | n/a | 42.7%/67.6%
fine-tuning from full-precision models [22]. We thus do not use approxsign in the rest of the paper, for simplicity.
4. Proposed Approach

In this section, we present several essential design principles for training accurate BNNs from scratch. We then apply our design philosophy to the binary ResNetE model, where we believe that the shortcut connections are essential for an accurate BNN. Based on the insights gained, we propose a new BNN model, BinaryDenseNet, which reaches state-of-the-art accuracy without tricks.
4.1. Golden Rules for Training Accurate BNNs
As shown in Table 4, with a standard training strategy our binary ResNetE18 model outperforms other state-of-the-art binary models using the same network structure. We successfully train our model from scratch by following several general design principles for BNNs, summarized as follows:
• The core of our theory is maintaining a rich information flow through the network, which can effectively compensate for the precision loss caused by quantization.
• Not all well-known real-valued network architectures can be seamlessly applied to BNNs. Architectures from the compact network design category are not well suited for BNNs, since the design philosophies are mutually exclusive (eliminating redundancy ↔ compensating information loss).
• The bottleneck design [26] should be eliminated in your BNNs. We discuss this in detail in the following paragraphs (also confirmed by [2]).
• Seriously consider using full-precision downsampling layers in your BNNs to preserve the information flow.
• Using shortcut connections is a straightforward way to avoid bottlenecks of information flow, which is particularly essential for BNNs.
• To overcome bottlenecks of information flow, we should appropriately increase the network width (the dimension of the feature maps) while going deeper (see, e.g., BinaryDenseNet37/37-dilated/45 in Table 7). However, this may introduce additional computational costs.
• The previously proposed complex training strategies, e.g., scaling factors, the approxsign function, and full-precision pre-training, are not necessary to reach state-of-the-art performance when training a binary model directly from scratch.
Before thinking about model architectures, we must consider the main drawbacks of BNNs. First of all, the information density is theoretically 32 times lower compared to full-precision networks. Research suggests that the difference between 32 bits and 8 bits is minimal and that 8-bit networks can achieve almost the same accuracy as full-precision networks [8]. However, when decreasing the bit-width to four or even one bit (binary), the accuracy drops significantly [12, 31]. Therefore, the precision loss needs to be alleviated through other techniques, for example by increasing the information flow through the network. We describe three main methods in detail, which help to preserve information despite the binarization of the model:
First, a binary model should use as many shortcut connections as possible. These connections allow layers later in the network to access information gained in earlier layers despite the precision loss caused by binarization. Consequently, increasing the number of connections between layers should lead to better model performance, especially for binary networks.
Secondly, network architectures including bottlenecks are always a challenge to adopt. The bottleneck design reduces the number of filters and values significantly between the layers, resulting in less information flow through BNNs. Therefore, we hypothesize that we either need to eliminate the bottlenecks or at least increase the number of filters in these bottleneck parts for BNNs to achieve the best results.
The third way to preserve information is replacing certain crucial layers of a binary network with full-precision layers. The reasoning is as follows: if layers that do not have a shortcut connection are binarized, the information lost (due to binarization) cannot be recovered in subsequent layers of the network. This affects the first (convolutional) layer and the last layer (a fully connected layer with as many output neurons as classes), as learned from previous work [23, 31, 22, 28, 12]. These layers generate the initial information for the network or consume the final information for
Figure 3: A single building block of different network architectures (the length of the bold black lines represents the number of filters). (a) The original ResNet design features a bottleneck architecture. A low number of filters reduces the information capacity for BNNs. (b) A variation of ResNet without the bottleneck design. The number of filters is increased, but with only two convolutions instead of three. (c) The ResNet architecture with an additional shortcut, first introduced in [22]. (d) The original DenseNet design with a bottleneck in the second convolution operation. (e) The DenseNet design without a bottleneck. The two convolution operations are replaced by one 3 × 3 convolution. (f) Our suggested change to a DenseNet, where a convolution with N filters is replaced by two layers with N/2 filters each.
the prediction, respectively. Therefore, following previous work, the first and the final layer are always kept in full precision. Another crucial part of deep networks is the downsampling convolution, which converts all previously collected information of the network into smaller feature maps with more channels (this convolution often has stride two and twice as many output channels as input channels). Any information lost in this downsampling process is effectively no longer available. Therefore, it should always be considered whether these downsampling layers should be in full precision, even though this slightly increases model size and the number of operations.
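The storage side of these trade-offs is simple arithmetic; as a rough sketch, using an approximate (assumed) parameter count for ResNet18:

```python
def model_size_mb(num_params, bits_per_weight):
    """Weight storage in megabytes (1 MB = 10^6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6

resnet18_params = 11.7e6  # approximate parameter count of ResNet18 (assumption)

print(model_size_mb(resnet18_params, 32))  # 46.8 MB at full precision
print(model_size_mb(resnet18_params, 1))   # 1.4625 MB if every weight were binary
```

In practice, binary networks keep the first, last, and possibly the downsampling layers in full precision, which is why the binary models discussed here are 3-4 MB rather than roughly 1.5 MB.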
4.2. ResNetE
ResNet combines the information of all previous layers with shortcut connections. This is done by adding the input of a block to its output with an identity connection. As suggested in the previous section, we remove the bottleneck of a ResNet block by replacing the three convolution layers (kernel sizes 1, 3, 1) of a regular ResNet block with two 3 × 3 convolution layers with a higher number of filters (see Figure 3a, b). We subsequently increase the number
Figure 4: The downsampling layers of ResNet, DenseNet and BinaryDenseNet. The bold black lines mark the downsampling layers, which can be replaced with full-precision layers. If we use full-precision downsampling in a BinaryDenseNet, we increase the reduction rate to reduce the number of channels (the dashed lines depict the number of channels without reduction). We also swap the positions of the pooling and convolution layers, which effectively reduces the number of MACs.
of connections by reducing the block size from two convolutions per block to one convolution per block, as inspired by [22]. This leads to twice the number of shortcuts, since there are as many shortcuts as blocks if the number of layers is kept the same (see Figure 3c). However, [22] also incorporates other changes to the ResNet architecture. We therefore call this specific change in the block design ResNetE (short for Extra shortcut). The second change is using a full-precision downsampling convolution layer (see Figure 4a). In the following, we conduct an ablation study to test the exact accuracy gain and the impact on model size.
We evaluated the difference between using binary and full-precision downsampling layers, which has often been ignored in the literature. First, we examine the results of binary ResNetE18 on CIFAR-10. Using full-precision downsampling over binary leads to an accuracy gain between 0.2% and 1.2% (see Table 2). However, the model size also increases from 1.39 MB to 2.03 MB, which is arguably too much for this minor increase in accuracy. Our results show a significant difference on ImageNet (see Table 3). The accuracy increases by 3% when using full-precision downsampling. Similar to CIFAR-10, the model size increases by 0.64 MB, in this case from 3.36 MB to 4.0 MB. The larger base model size makes the relative size difference lower and provides a stronger argument for this trade-off. We conclude that the increase in accuracy is significant, especially for ImageNet.
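The reported size increase of roughly 0.64 MB can be approximately reproduced from the three 1×1 downsampling convolutions of a standard ResNet18; the channel counts below are assumptions based on that architecture, not values from the paper:

```python
# 1x1 downsampling convolutions in ResNet18: 64->128, 128->256, 256->512 channels
downsample_params = 64 * 128 + 128 * 256 + 256 * 512  # 172,032 weights
extra_bits = 32 - 1  # stored at full precision instead of 1 bit
extra_mb = downsample_params * extra_bits / 8 / 1e6
print(round(extra_mb, 2))  # 0.67, in the ballpark of the reported 0.64 MB
```

The small remaining gap is plausibly due to details of the exact architecture and size accounting.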
Inspired by the achievement of binary ResNetE, we naturally explored the DenseNet architecture further, which should benefit even more from the densely connected layer design.
4.3. BinaryDenseNet
DenseNets [11] apply shortcut connections that, contrary to ResNet, concatenate the input of a block to its output
Table 5: Performance of different BinaryDenseNet models when using different downsampling methods, evaluated on ImageNet.
Blocks, growth rate | Model size (binary) | Downsampl. convolution, reduction | Accuracy Top-1/Top-5
16, 128 | 3.39 MB | binary, low | 52.7%/75.7%
16, 128 | 3.03 MB | FP, high | 55.9%/78.5%
32, 64 | 3.45 MB | binary, low | 54.3%/77.3%
32, 64 | 3.08 MB | FP, high | 57.1%/80.0%
(see Figure 3d, b). Therefore, new information gained in one layer can be reused throughout the entire depth of the network. We believe this is a significant characteristic for maintaining information flow. Thus, we construct a novel BNN architecture: BinaryDenseNet.
The bottleneck design and transition layers of the original DenseNet effectively keep the network at a smaller total size, even though the concatenation adds new information to the network in every layer. However, as previously mentioned, we have to eliminate bottlenecks for BNNs. The bottleneck design can be modified by replacing the two convolution layers (kernel sizes 1 and 3) with one 3 × 3 convolution (see Figure 3d, e). However, our experiments showed that the DenseNet architecture does not achieve satisfactory performance even after this change. This is due to the limited representation capacity of binary layers. There are different ways to increase this capacity: we can increase the growth rate parameter k, which is the number of newly concatenated features from each layer, or we can use a larger number of blocks. Both approaches individually add roughly the same number of parameters to the network. To keep the number of parameters equal for a given BinaryDenseNet, we can halve the growth rate and double the number of blocks at the same time (see Figure 3f), or vice versa. We assume that, in this case, increasing the number of blocks should provide better results than increasing the growth rate. This assumption derives from our hypothesis: favor an increased number of connections over simply adding weights.
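The claim that halving the growth rate while doubling the number of blocks roughly preserves the parameter count can be checked with a quick estimate. The sketch counts only 3×3-convolution weights, and the initial channel count is an assumption for illustration:

```python
def dense_block_params(num_layers, growth_rate, in_channels=64):
    """Approximate 3x3-convolution weight count of a DenseNet-style block
    without bottlenecks: layer i sees in_channels + i * growth_rate input
    channels and adds growth_rate new feature maps."""
    total, channels = 0, in_channels
    for _ in range(num_layers):
        total += 3 * 3 * channels * growth_rate
        channels += growth_rate
    return total

wide = dense_block_params(num_layers=8, growth_rate=256)
deep = dense_block_params(num_layers=16, growth_rate=128)
print(deep / wide)  # ~1.07: the deeper variant is only slightly larger
```

The deeper, narrower configuration is marginally larger, which is consistent with the slightly growing model sizes reported in Table 6.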
Another characteristic difference of BinaryDenseNet compared to binary ResNetE is that the downsampling layer reduces the number of channels. To preserve the information flow in these parts of the network, we found two options. On the one hand, we can use a full-precision downsampling layer, similarly to binary ResNetE. Since the full-precision layer preserves more information, we can use a higher reduction rate in the downsampling layers. To reduce the number of MACs, we modify the transition block by swapping the positions of the pooling and convolution layers. We use MaxPool→ReLU→1×1-Conv instead of 1×1-Conv→AvgPool
Table 6: The accuracy of different BinaryDenseNet models obtained by successively splitting blocks, evaluated on ImageNet. As the number of connections increases, the model size (and the number of binary operations) changes only marginally, but the accuracy increases significantly.
Blocks | Growth rate | Model size (binary) | Accuracy Top-1/Top-5
8 | 256 | 3.31 MB | 50.2%/73.7%
16 | 128 | 3.39 MB | 52.7%/75.7%
32 | 64 | 3.45 MB | 55.5%/78.1%
in the transition block (see Figure 4c, b). On the other hand, we can use a binary downsampling convolution layer instead of a full-precision layer, with a lower reduction rate or even no reduction at all. We coupled the decision whether to use a binary or a full-precision downsampling convolution with the choice of the reduction rate. The two variants we compare in our experiments (see Section 4.3.1) are thus called full-precision downsampling with high reduction (halve the number of channels in all transition layers) and binary downsampling with low reduction (no reduction in the first transition, divide the number of channels by 1.4 in the second and third transitions).
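The MAC saving from swapping pooling and convolution follows from the quartered spatial size that the 1×1 convolution then operates on; the feature-map sizes below are illustrative:

```python
def macs_conv_then_pool(h, w, c_in, c_out):
    """1x1 convolution at full resolution, then 2x2 average pooling
    (original DenseNet transition)."""
    return h * w * c_in * c_out

def macs_pool_then_conv(h, w, c_in, c_out):
    """2x2 max pooling first, then the 1x1 convolution on the quartered
    feature map (BinaryDenseNet transition)."""
    return (h // 2) * (w // 2) * c_in * c_out

a = macs_conv_then_pool(28, 28, 256, 128)
b = macs_pool_then_conv(28, 28, 256, 128)
print(a // b)  # 4: the swapped order needs 4x fewer multiply-accumulates
```

Max pooling (rather than average pooling) is used before the convolution so that the nonlinearity still selects the strongest activations.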
4.3.1 Experiment
Downsampling Layers. In the following, we present our evaluation of a BinaryDenseNet when using full-precision downsampling with high reduction versus binary downsampling with low reduction. The results of a BinaryDenseNet21 with growth rate 128 on CIFAR-10 show an accuracy increase of 2.7%, from 87.6% to 90.3%. The model size increases from 673 KB to 1.49 MB. This is an arguably sharp increase in model size, but the model is still smaller than a comparable binary ResNet18 while having a much higher accuracy. The results of two BinaryDenseNet architectures (16 and 32 blocks, combined with growth rates 128 and 64, respectively) on ImageNet show an accuracy increase ranging from 2.8% to 3.2% (see Table 5). Further, because of the higher reduction rate, the model size decreases by 0.36 MB at the same time. This shows a higher effectiveness and efficiency of a full-precision downsampling layer for a BinaryDenseNet compared to a binary ResNet.
Splitting Layers. We tested our proposed architecture change (see Figure 3f) by comparing BinaryDenseNet models with varying growth rates and numbers of blocks (and thus layers). The results show that increasing the number of connections by adding more layers, instead of simply increasing the growth rate, improves accuracy in an efficient way (see Table 6). Doubling the number of blocks while halving the growth rate leads to an accuracy gain ranging from 2.5% to 2.8%. Since the training of a very deep Binary-
Table 7: Comparison of our BinaryDenseNet to state-of-the-art 1-bit CNN models on ImageNet.
Model size | Method | Top-1/Top-5 accuracy
~4.0 MB | XNOR-ResNet18 [23] | 51.2%/73.2%
~4.0 MB | TBN-ResNet18 [28] | 55.6%/74.2%
~4.0 MB | Bi-Real-ResNet18 [22] | 56.4%/79.5%
~4.0 MB | BinaryResNetE18 | 58.1%/80.6%
~4.0 MB | BinaryDenseNet28 | 60.7%/82.4%
~5.1 MB | TBN-ResNet34 [28] | 58.2%/81.0%
~5.1 MB | Bi-Real-ResNet34 [22] | 62.2%/83.9%
~5.1 MB | BinaryDenseNet37 | 62.5%/83.9%
~5.1 MB | BinaryDenseNet37-dilated∗ | 63.7%/84.7%
7.4 MB | BinaryDenseNet45 | 63.7%/84.8%
46.8 MB | Full-precision ResNet18 | 69.3%/89.2%
249 MB | Full-precision AlexNet | 56.6%/80.2%
∗ BinaryDenseNet37-dilated differs slightly from the other models, as it applies dilated convolution kernels while keeping the spatial dimensions of the feature maps unchanged in the 2nd, 3rd and 4th stage, which enables a broader information flow.
DenseNet becomes slow (this is less of a problem during inference, since no additional memory is needed then for storing intermediate results), we have not trained even more highly connected models, but we strongly suspect that this would increase accuracy even further. The total model size increases slightly, since the second half of a split block has slightly more inputs than a double-sized normal block. In conclusion, our technique of increasing the number of connections is highly effective and size-efficient for a BinaryDenseNet.
5. Main Results

In this section, we report our main experimental results on image classification and object detection using BinaryDenseNet. We further report the computational cost in comparison with other quantization methods. Our implementation is based on the BMXNet framework, first presented by Yang et al. [29]. Our models are trained from scratch using a standard training strategy. Due to space limitations, more details of the experiments can be found in the supplementary materials.
Image Classification. To evaluate the classification accuracy, we report our results on ImageNet [4]. Table 7 shows the comparison of our BinaryDenseNet to state-of-the-art BNNs of different sizes. For this comparison, we chose the growth and reduction rates of the BinaryDenseNet models to match the model size and complexity of the corresponding binary ResNet architectures as closely as possible. Our results show that BinaryDenseNet surpasses all existing 1-bit CNNs by a noticeable margin. In particular, BinaryDenseNet28, with 60.7% top-1 accuracy, is better than our binary ResNetE18 and achieves up to 18.6% and 7.6%
Table 8: Object detection performance (in mAP) on the VOC2007 test set of our BinaryDenseNet37/45 and other BNNs.
Method | Ours† (37/45) | TBN∗ (ResNet34) | XNOR-Net∗ (ResNet34)
Binary SSD | 66.4/68.2 | 59.5 | 55.1
Full-precision SSD512/Faster R-CNN/YOLO | 76.8/73.2/66.4 | |
∗ SSD300 results read from [28]; † SSD512 results.
[Figure 5 plot; models shown: BinaryResNetE18 (ours), XNOR-Net, Bi-Real Net, HORQ, BinaryDenseNet{28, 37, 45} (ours), ResNet18 (FP), ABC-Net {1/1, 5/5}, DoReFa (W:1, A:4), SYQ (W:1, A:8), TBN.]
Figure 5: The trade-off between top-1 validation accuracy on ImageNet and the number of operations. All binary/quantized models are based on ResNet18 except BinaryDenseNet.
relative improvement over the well-known XNOR-Network and the current state-of-the-art Bi-Real Net, even though they use a more complex training strategy and additional techniques, e.g., custom gradients and a scaling variant.
Preliminary Results on Object Detection. We adopted the off-the-shelf toolbox Gluon-CV [9] for the object detection experiment. We changed the base model of the adopted SSD architecture [20] to BinaryDenseNet, trained our models on the combination of PASCAL VOC2007 trainval and VOC2012 trainval, and tested on the VOC2007 test set [5]. Table 8 shows the results of the binary SSD as well as some full-precision detection models [20, 25, 24].
Efficiency Analysis. For this analysis, we adopted the same calculation method as [22]. Figure 5 shows that our binary ResNetE18 achieves higher accuracy at the same computational complexity compared to other BNNs, and that BinaryDenseNet28/37/45 achieve significant accuracy improvements with only a small additional computational overhead. For a more challenging comparison, we include models with 1-bit weights and multi-bit activations, DoReFa-Net (w:1, a:4) [31] and SYQ (w:1, a:8) [6], and a model with multiple weight and activation bases, ABC-Net {5/5}. Overall, our BinaryDenseNet models show superior performance when measuring both accuracy and computational efficiency.
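The operation count of [22] folds binary operations into a FLOP-equivalent number by assuming a 64× speedup from bit-packed execution; a sketch with illustrative (not measured) magnitudes:

```python
def effective_ops(fp_ops, binary_ops, speedup=64):
    """Combined operation count as in the Bi-Real Net analysis [22]:
    OPs = FLOPs + BOPs / 64, assuming 64 binary operations can be
    executed in the time of one floating-point operation."""
    return fp_ops + binary_ops / speedup

# Illustrative magnitudes for a binary ResNet18-like model (assumptions):
print(effective_ops(fp_ops=1.3e8, binary_ops=1.7e9))  # 156562500.0 (~1.6e8)
```

Because the remaining full-precision layers dominate this weighted count, reducing their cost (e.g., the swapped transition blocks above) matters as much as shrinking the binary part.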
In closing, although the task is still arduous, we hope theideas and results of this paper will provide new potentialdirections for the future development of BNNs.
References
[1] Y. Bengio, N. Léonard, and A. C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
[2] J. Bethge, H. Yang, C. Bartz, and C. Meinel. Learning to train a binary neural network. CoRR, abs/1809.10463, 2018.
[3] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303–338, June 2010.
[6] J. Faraone, N. Fraser, M. Blott, and P. H. Leong. SYQ: Learning symmetric quantization for efficient deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[8] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.
[9] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li. Bag of tricks for image classification with convolutional neural networks. arXiv preprint arXiv:1812.01187, 2018.
[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[11] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[12] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, 2016.
[13] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[15] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In Computer Vision – ECCV 2014, pages 512–528. Springer, 2014.
[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[17] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10. http://www.cs.toronto.edu/~kriz/cifar.html, 2010.
[18] Z. Li, B. Ni, W. Zhang, X. Yang, and W. Gao. Performance guaranteed network acceleration via high-order residual quantization. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[19] X. Lin, C. Zhao, and W. Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 344–352, 2017.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37, 2016.
[21] Z. Liu, W. Luo, B. Wu, X. Yang, W. Liu, and K. Cheng. Bi-Real Net: Binarizing deep network towards real-network performance. CoRR, abs/1811.01335, 2018.
[22] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In ECCV, September 2018.
[23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[24] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
[25] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91–99, 2015.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
[27] W. Tang, G. Hua, and L. Wang. How to train a compact binary neural network with high accuracy. In AAAI, 2017.
[28] D. Wan, F. Shen, L. Liu, F. Zhu, J. Qin, L. Shao, and H. Tao Shen. TBN: Convolutional neural network with ternary inputs and binary weights. In ECCV, September 2018.
[29] H. Yang, M. Fritzsche, C. Bartz, and C. Meinel. BMXNet: An open-source binary neural network implementation based on MXNet. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1209–1212. ACM, 2017.
[30] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[31] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[32] Y. Zhou, S.-M. Moosavi-Dezfooli, N.-M. Cheung, and P. Frossard. Adaptive quantization for deep neural network. 2017.
[33] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
7 Image Captioner
In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-
LSTM) model to address the image captioning problem. We demonstrate that
Bi-LSTM models achieve state-of-the-art performance on both caption generation
and image-sentence retrieval tasks. Our experiments also show that multi-task
learning increases model generality and improves performance. Our
model significantly outperforms previous methods on the Pascal1K dataset.
7.1 Contribution to the Work
• Contributor to the formulation and implementation of research ideas
• Significantly contributed to the conceptual discussion and implementation.
• Guidance and supervision of the technical implementation
7.2 Manuscript
In addition to the manuscript, we prepared a demo video to demonstrate the proposed
real-time system.1
1https://youtu.be/a0bh9_2LE24
Image Captioning with Deep Bidirectional LSTMs
and Multi-Task Learning
CHENG WANG, HAOJIN YANG, and CHRISTOPH MEINEL, Hasso Plattner Institute,
University of Potsdam
Generating a novel and descriptive caption of an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM (Long Short-Term Memory)) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent overfitting when training deep models. To understand how our models "translate" an image to a sentence, we visualize and qualitatively analyze the evolution of the Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval, even without integrating an additional mechanism (e.g., object detection, attention model). Our experiments also show that multi-task learning is beneficial for increasing model generality and gaining performance. We also demonstrate that the transfer-learning performance of the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
CCS Concepts: • Computing methodologies → Natural language generation; Neural networks; Computer vision representations;
Additional Key Words and Phrases: Deep learning, LSTM, multimodal representations, image captioning, multi-task learning
ACM Reference format:
Cheng Wang, Haojin Yang, and Christoph Meinel. 2018. Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning. ACM Trans. Multimedia Comput. Commun. Appl. 14, 2s, Article 40 (April 2018), 20 pages. https://doi.org/10.1145/3115432
1 INTRODUCTION
It is challenging to describe an image using sentence-level captions (Karpathy and Li 2015; Karpathy et al. 2014; Kiros et al. 2014b; Kuznetsova et al. 2012, 2014; Mao et al. 2015; Socher et al. 2014; Vinyals et al. 2015), where the task is to map the input image to a sentence output that possesses its own structure. Inspired by the success of machine translation (translating a source language into a target language), an image captioning system tries to "translate" an image into a sentence. It
Authors' addresses: C. Wang, H. Yang, and C. Meinel, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany; emails: {Cheng.Wang, Haojin.Yang, Christoph.Meinel}@hpi.de.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2018 ACM 1551-6857/2018/04-ART40 $15.00
https://doi.org/10.1145/3115432
ACM Trans. Multimedia Comput. Commun. Appl., Vol. 14, No. 2s, Article 40. Publication date: April 2018.
40:2 C. Wang et al.
Fig. 1. Deep Multimodal Bidirectional LSTM. L1: sentence embedding layer. L2: Text-LSTM (T-LSTM) layer, which receives text only. L3: Multimodal-LSTM (M-LSTM) layer, which receives both image and text input. L4: Softmax layer. We feed the sentence in both forward (blue arrows) and backward (red arrows) order, which allows our model to summarize context information from both the left and right sides for generating a sentence word by word over time. Our model is end-to-end trainable by minimizing a joint loss.
requires not only the recognition of visual objects in an image and the semantic interactions between objects, but also the ability to capture visual-language interactions and learn how to "translate" the visual understanding into sensible sentence descriptions. A general approach is to train a visual model using images and a language model using the provided captions. By learning a multimodal joint representation of images and captions, the semantic similarity of images and captions can be measured, and the most descriptive caption can thus be recommended for a given input image. The most important part at the center of this visual-language modeling is to capture the semantic correlations across the image and text modalities. While some previous works (Li et al. 2011; Kulkarni et al. 2013; Mitchell et al. 2012; Kuznetsova et al. 2012, 2014) have been proposed to address the problem of image captioning, they mostly use sentence templates or treat image captioning as a retrieval task, ranking the best matching sentence in a database as the caption. Those approaches usually have difficulty generating variable-length and novel sentences. Recent work (Karpathy and Li 2015; Karpathy et al. 2014; Kiros et al. 2014b; Mao et al. 2015; Socher et al. 2014; Vinyals et al. 2015) indicates that embedding vision and language into a common semantic space with a relatively shallow recurrent neural network (RNN) yields promising results.
In this work, we propose novel architectures to generate novel image descriptions. The overview of the architecture is shown in Figure 1. Different from previous approaches, we learn a visual-language space where sentence embeddings are encoded using a bidirectional Long Short-Term Memory (Bi-LSTM) network and visual embeddings are encoded with a Convolutional Neural Network (CNN). Typically, in unidirectional sentence generation, one general way of predicting the next word $w_t$ given the visual context $I$ and the history textual context $w_{1:t-1}$ is to maximize $\log P(w_t \mid I, w_{1:t-1})$. While the unidirectional model includes past context, it is still limited in retaining the future context $w_{t+1:T}$, which can be used for reasoning about the previous word $w_t$ by maximizing $\log P(w_t \mid I, w_{t+1:T})$. The bidirectional model tries to overcome the shortcomings that each unidirectional (forward and backward) model suffers on its own, and exploits both past and future dependence to give a prediction. As in Figure 2, two example images with bidirectionally generated sentences intuitively support our assumption that bidirectional captions are complementary; combining them can generate more sensible captions. Thus, our Bi-LSTM is able to summarize long-range visual-language interactions from forward and backward directions.
Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning 40:3
Fig. 2. Illustration of generated captions. Two example images from the Flickr8K dataset and their best matching captions generated in forward order (blue) and backward order (red). Bidirectional models capture different levels of visual-language interactions (for more evidence see Section 4.7). The final caption is the sentence with the higher probability (histogram under sentence). In both examples, the backward caption is selected as the final caption for the corresponding image.
Inspired by the architectural depth of the human brain, to learn higher-level visual-language embeddings we also explore deeper bidirectional LSTM architectures, in which we increase the nonlinearity by adding a hidden-to-hidden transformation layer. All of our proposed models can be trained in an end-to-end way by optimizing a joint loss in the forward and backward directions. In addition, we apply multi-task learning (Caruana 1998) and transfer learning (Pan and Yang 2010) to increase the generality of the proposed method on different datasets.
The core contributions of this work are fourfold:
—We propose an end-to-end trainable multimodal bidirectional LSTM and its deeper variant models (see Section 3.3) that embed image and sentence into a high-level semantic space by exploiting both long-term history and future context. The code, networks, and examples for this work can be found at our GitHub repository.1
—We evaluate the effectiveness of the proposed models on three benchmark datasets: Flickr8K, Flickr30K, and MSCOCO. Our experimental results show that bidirectional LSTM models achieve highly competitive performance on caption generation (Section 4.6).
—We explore the generality of multi-task/transfer learning models on Pascal1K (Section 4.5). We demonstrate that transferring a multi-task joint model trained on Flickr8K, Flickr30K, and MSCOCO to Pascal1K is beneficial and performs significantly better than recent methods (see Section 4.6).
—We visualize the evolution of the hidden states of the bidirectional LSTM units to qualitatively analyze and understand how a sentence is generated, conditioned on visual context information over time (see Section 4.7).
The rest of the article is organized as follows. In Section 2, we review related work on image captioning using deep architectures. In Section 3, we introduce the proposed deep multimodal bidirectional LSTM for image captioning and explore its deeper variant models. Section 4 presents several groups of experiments to illustrate the effectiveness of the proposed methods. In Section 4.6, we compare our models with state-of-the-art methods, showing that Bi-LSTM models achieve very competitive performance. In Section 4.7, we visualize the internal states of the LSTM hidden units and show how our methods generalize to new datasets with multi-task/transfer learning; we also provide some illustrative examples. Section 5 summarizes our methods and presents future work.
2 RELATED WORK
This section reviews the relevant background. It starts by introducing the Recurrent Neural Network (RNN), which equips neural networks with memory, followed by a review of recently proposed approaches to the image captioning task.
1https://github.com/deepsemantic/image_captioning.
ACM Trans. Multimedia Comput. Commun. Appl., Vol. 14, No. 2s, Article 40. Publication date: April 2018.
40:4 C. Wang et al.
2.1 RNN
RNN is a powerful network architecture for processing sequential data. It has been widely used in natural language processing (Socher et al. 2011), speech recognition (Graves et al. 2013), and handwriting recognition (Graves et al. 2009) in recent years. An RNN allows cyclical connections and the reuse of weights across different instances of neurons, each associated with a different timestep. This explicitly supports the network in learning the entire history of previous states and mapping them to the current state. With this property, an RNN is able to map an arbitrary-length sequence to a fixed-length vector.
LSTM (Long Short-Term Memory) (Hochreiter and Schmidhuber 1997) is a particular form of the traditional RNN. Compared to a traditional RNN, LSTM can learn long-term dependencies between inputs and outputs; it can also effectively prevent backpropagated errors from vanishing or exploding. LSTM has recently gained popularity in machine translation (Cho et al. 2014), speech recognition (Graves et al. 2013), and sequence learning (Sutskever et al. 2014). Another special type of RNN is the Gated Recurrent Unit (GRU) (Cho et al. 2014). The GRU simplifies LSTM by removing the memory cell and provides a different way to prevent the vanishing gradient problem. GRU has recently been explored in language modeling (Chung et al. 2015), face aging (Wang et al. 2016a), face alignment (Wang et al. 2016b), and speech synthesis (Wu and King 2016). Motivated by those works, in the context of automatic image captioning, our networks build on the bidirectional LSTM in order to learn long-term interactions across image and sentence from both history and future information.
2.2 Image Captioning
Multimodal representation learning (Ngiam et al. 2011; Srivastava and Salakhutdinov 2012; Wang et al. 2016c) has significant value in multimedia understanding and retrieval. The concept shared across modalities plays an important role in bridging the "semantic gap" of multimodal data (Rasiwasia et al. 2007; Yang et al. 2015, 2016). Image captioning falls into this general category of learning multimodal representations.
Recently, several approaches have been proposed for image captioning. We can roughly classify these methods into three categories. The first category comprises template-based approaches that generate caption templates by detecting objects and discovering attributes in an image. For example, Li et al. (2011) proposed parsing a whole sentence into several phrases and learning the relationships between phrases and objects in an image. In Kulkarni et al. (2013), a conditional random field (CRF) was used to relate objects, attributes, and prepositions of the image content and predict the best label. Other similar methods were presented in Mitchell et al. (2012) and Kuznetsova et al. (2012, 2014). These methods are typically hand-designed and rely on a fixed template, which mostly leads to poor performance in generating variable-length sentences. The second category comprises retrieval-based approaches. This sort of method treats image captioning as a retrieval task, leveraging a distance metric to retrieve similar captioned images, and then modifying and combining the retrieved captions to generate a caption (Kuznetsova et al. 2014). But these approaches generally need additional procedures, such as modification and generalization, to fit the image query.
Inspired by the recent success of CNNs (Krizhevsky et al. 2012; Zeiler and Fergus 2014) and RNNs (Mikolov et al. 2010, 2011; Bahdanau et al. 2015), the third category emerged as neural network based methods (Vinyals et al. 2015; Xu et al. 2015; Kiros et al. 2014b; Karpathy et al. 2014; Karpathy and Li 2015). Our work also belongs to this category. The work of Kiros et al. (2014a) can be seen as pioneering in using neural networks for image captioning with a multimodal neural language model. In their follow-up work (Kiros et al. 2014b), Kiros et al. introduced an encoder-decoder pipeline in which a sentence is encoded by an LSTM and decoded with a structure-content
neural language model (SC-NLM). Socher et al. (2014) presented a DT-RNN (Dependency Tree-Recursive Neural Network) to embed a sentence into a vector space in order to retrieve images. Later, Mao et al. (2015) proposed m-RNN, which replaces the feed-forward neural language model of Kiros et al. (2014b). Similar architectures were introduced in NIC (Vinyals et al. 2015) and LRCN (Donahue et al. 2015); both approaches use LSTM to learn text context. But NIC only feeds visual information at the first timestep, while Mao et al. (2015) and LRCN (Donahue et al. 2015) consider image context at each timestep. Another group of neural network based approaches was introduced in Karpathy et al. (2014) and Karpathy and Li (2015), where object detection with R-CNN (region-CNN) (Girshick et al. 2014) was used to infer the alignment between image regions and descriptions.
Most recently, Fang et al. (2015) used multi-instance learning and a traditional maximum-entropy language model for image description generation. Chen and Zitnick (2015) proposed learning a visual representation with an RNN for generating image captions. Xu et al. (2015) introduced an attention mechanism of the human visual system into an encoder-decoder framework (Cho et al. 2015). It has been shown that an attention model can visualize what the model "sees" and yields significant improvements in image caption generation. In You et al. (2016), the authors proposed a semantic attention model combining top-down and bottom-up approaches in a recurrent neural network framework. In the bottom-up approach, semantic concepts or attributes are used as candidates. In the top-down approach, visual features are employed to guide where and when attention should be activated.
Unlike those models, our model directly assumes that the visual-semantic mapping relationship is antisymmetric, and it dynamically learns long-term bidirectional and hierarchical visual-semantic interactions with deep LSTM models. This proves to be very effective for generation and retrieval tasks, as we demonstrate in Section 4.
3 MODEL
In this section, we describe our multimodal Bi-LSTM model and explore its deeper variants. We first briefly introduce LSTM; the LSTM variant we use is described in Zaremba and Sutskever (2014).
3.1 Long Short-Term Memory
Our model builds on the LSTM cell shown in Figure 3, where the reading and writing of the memory cell $c$ are controlled by a group of sigmoid gates. At a given timestep $t$, the LSTM receives inputs from different sources: the current input $x_t$, the previous hidden state of all LSTM units $h_{t-1}$, and the previous memory cell state $c_{t-1}$. The gates are updated at timestep $t$ for the given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$ as follows:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \quad (1)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \quad (2)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \quad (3)$$
$$g_t = \phi(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \quad (4)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \quad (5)$$
$$h_t = o_t \odot \phi(c_t), \quad (6)$$

where, without considering the optional peephole connections, $W$ are the weight matrices learned by the network and $b$ are the bias terms. $\sigma$ is the sigmoid activation function $\sigma(x) = \frac{1}{1+\exp(-x)}$
Fig. 3. Long Short-Term Memory (LSTM) cell. It consists of an input gate $i$, a forget gate $f$, a memory cell $c$, and an output gate $o$. The input gate decides whether to let an incoming signal through to the memory cell or to block it. The output gate can allow a new output or prevent it. The forget gate decides whether to remember or forget the cell's previous state. Updating the cell state is performed by feeding the previous cell output back to itself via recurrent connections over two consecutive timesteps.
and $\phi$ denotes the hyperbolic tangent $\phi(x) = \frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}$. $\odot$ denotes the element-wise product with a gate value. The LSTM hidden output $h_t = \{h_{tk}\}_{k=0}^{K}$, $h_t \in \mathbb{R}^K$, will be used to predict the next word by a Softmax function with parameters $W_s$ and $b_s$:

$$\mathcal{F}(p_{ti}; W_s, b_s) = \frac{\exp(W_s h_{ti} + b_s)}{\sum_{j=1}^{K} \exp(W_s h_{tj} + b_s)}, \quad (7)$$
where $p_{ti}$ is the probability distribution of the predicted word. Our key motivation for choosing LSTM is that it can learn long-term temporal dependencies and avoid the exploding and vanishing gradient problems that a traditional RNN suffers from during backpropagation optimization.
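Equations (1)–(6) can be sketched as a single NumPy step. This is an illustrative re-implementation, not the paper's actual Caffe code; the toy dimensions and the stacked-gate weight layout are our own choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (1)-(6).

    W maps the concatenated [x_t, h_prev] to the four gate
    pre-activations (i, f, o, g) stacked along the output axis.
    """
    z = np.concatenate([x_t, h_prev]) @ W + b   # all gate pre-activations at once
    H = h_prev.size
    i = sigmoid(z[0 * H:1 * H])                 # input gate, Eq. (1)
    f = sigmoid(z[1 * H:2 * H])                 # forget gate, Eq. (2)
    o = sigmoid(z[2 * H:3 * H])                 # output gate, Eq. (3)
    g = np.tanh(z[3 * H:4 * H])                 # cell candidate, Eq. (4)
    c = f * c_prev + i * g                      # memory update, Eq. (5)
    h = o * np.tanh(c)                          # hidden output, Eq. (6)
    return h, c

# toy dimensions: 4-d input, 3 hidden units (hypothetical, for illustration)
rng = np.random.default_rng(0)
W = rng.standard_normal((7, 12)) * 0.08         # (4 + 3) inputs -> 4 gates of size 3
b = np.zeros(12)
h, c = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), W, b)
```

Because $h_t = o_t \odot \phi(c_t)$ with $o_t \in (0,1)$ and $|\phi| < 1$, every entry of the hidden output stays strictly inside $(-1, 1)$.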
3.2 Bidirectional LSTM
In order to make use of both the past and future context of a word in sentence prediction, we propose a bidirectional model by feeding a sentence to the LSTM in forward and backward order. Figure 1 presents the overview of our model; it comprises three modules: a CNN for encoding image inputs, a Text-LSTM (T-LSTM) for encoding sentence inputs, and a Multimodal LSTM (M-LSTM) for embedding visual and textual vectors into a common semantic space and decoding them to a sentence. The bidirectional LSTM is implemented with two separate LSTM layers that compute the forward hidden sequences $\overrightarrow{h}$ and the backward hidden sequences $\overleftarrow{h}$. The forward LSTM starts at time $t = 1$ and the backward LSTM starts at time $t = T$. Formally, our model works as follows: for a given raw image input $I$, forward-order sentence $\overrightarrow{S}$, and backward-order sentence $\overleftarrow{S}$, the encoding performs as

$$I_t = \mathcal{C}(I; \Theta_v), \quad \overrightarrow{h}^1_t = \mathcal{T}(\overrightarrow{E}\overrightarrow{S}; \overrightarrow{\Theta}_l), \quad \overleftarrow{h}^1_t = \mathcal{T}(\overleftarrow{E}\overleftarrow{S}; \overleftarrow{\Theta}_l), \quad (8)$$

where $\mathcal{C}$ and $\mathcal{T}$ represent the CNN and T-LSTM, respectively, and $\Theta_v$, $\Theta_l$ are their corresponding weights. Following previous work (Mao et al. 2015; Donahue et al. 2015), $I_t$ is considered at all timesteps as visual context information. $\overrightarrow{E}$ and $\overleftarrow{E}$ are bidirectional embedding matrices learned from the network. The encoded visual and textual representations are then embedded into the multimodal LSTM by

$$\overrightarrow{h}^2_t = \mathcal{M}\left(\overrightarrow{h}^1_t, I_t; \overrightarrow{\Theta}_m\right), \quad \overleftarrow{h}^2_t = \mathcal{M}\left(\overleftarrow{h}^1_t, I_t; \overleftarrow{\Theta}_m\right), \quad (9)$$
Fig. 4. Illustrations of the proposed deep architectures for image captioning. The network in (a) is commonly used in previous work. (b) Our proposed Bidirectional LSTM (Bi-LSTM). (c) Our proposed Bidirectional Stacked LSTM (Bi-S-LSTM). (d) Our proposed Bidirectional LSTM with a fully connected (FC) transition layer (Bi-F-LSTM). T-LSTM receives text input only and M-LSTM receives both image and text input.
where $\mathcal{M}$ denotes the M-LSTM with weights $\Theta_m$. $\mathcal{M}$ aims to capture the correlation of visual context and words at different timesteps. We feed the visual vector $I_t$ to the model at each timestep to capture strong visual-word correlations. On top of the M-LSTM are Softmax layers with parameters $W_s$ and $b_s$, which compute the probability distribution of the next predicted word by

$$\overrightarrow{p}_{t+1} = \mathcal{F}\left(\overrightarrow{h}^2_t; \overrightarrow{W}_s, \overrightarrow{b}_s\right), \quad \overleftarrow{p}_{t+1} = \mathcal{F}\left(\overleftarrow{h}^2_t; \overleftarrow{W}_s, \overleftarrow{b}_s\right), \quad (10)$$

where $p \in \mathbb{R}^K$ and $K$ is the vocabulary size.
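The bidirectional wiring of Eqs. (8)–(10), one pass starting at $t = 1$ and one at $t = T$, can be sketched as follows. For brevity the LSTM cell is replaced here by a plain tanh recurrence and the two directions share weights; both are simplifications of the actual model:

```python
import numpy as np

def run_direction(xs, W, U):
    """Run a simple recurrent encoder h_t = tanh(W x_t + U h_{t-1}).
    The LSTM cell is replaced by a plain tanh recurrence for brevity;
    the bidirectional wiring is the point of this sketch."""
    h = np.zeros(U.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        out.append(h)
    return out

def bidirectional_encode(sentence_vecs, W, U):
    """Forward pass starts at t = 1, backward pass at t = T (Section 3.2).
    In the real model the two directions have separate weights; here they
    are shared to keep the example short."""
    fwd = run_direction(sentence_vecs, W, U)
    bwd = run_direction(sentence_vecs[::-1], W, U)[::-1]  # re-align to timestep t
    return fwd, bwd

# toy 5-word sentence of 4-d word vectors, 3 hidden units (illustrative)
rng = np.random.default_rng(2)
xs = [rng.standard_normal(4) for _ in range(5)]
W = rng.standard_normal((3, 4)) * 0.08
U = rng.standard_normal((3, 3)) * 0.08
fwd, bwd = bidirectional_encode(xs, W, U)
```

After re-alignment, `fwd[t]` summarizes words $1..t$ while `bwd[t]` summarizes words $t..T$, which is exactly the pairing consumed by the M-LSTM at each timestep.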
3.3 Deeper LSTM Architecture
The recent success of deep CNNs in image classification and object detection (Krizhevsky et al. 2012; Simonyan and Zisserman 2014b) demonstrates that deep, hierarchical models can be more efficient at learning representations than shallower ones. This motivated us to explore deeper LSTM architectures in the context of learning bidirectional visual-language embeddings. As argued in Pascanu et al. (2013), if we consider an LSTM as a composition of multiple hidden layers unfolded in time, the LSTM is already a deep network. But this is a way of increasing the "horizontal depth," in which the network weights $W$ are reused at each timestep; it is limited in learning more representative features compared with increasing the "vertical depth" of the network. To design a deep LSTM, one straightforward way is to stack multiple LSTM layers as hidden-to-hidden transitions. Alternatively, instead of stacking multiple LSTM layers, we propose adding a multilayer perceptron (MLP) as an intermediate transition between LSTM layers. This not only increases the LSTM network depth, but also prevents the parameter size from growing dramatically, because the number of recurrent connections at a hidden layer can be largely decreased.
Directly stacking multiple LSTMs on top of each other leads to Bi-S-LSTM (Figure 4(c)). In addition, we propose to use a fully connected layer as an intermediate transition layer. Our motivation comes from the finding of Pascanu et al. (2013), in which DT(S)-RNN (deep transition RNN with shortcut) is designed by adding a hidden-to-hidden multilayer perceptron (MLP) transition; such a network is arguably easier to train. Inspired by this, we extend Bi-LSTM (Figure 4(b)) with a fully connected layer, which we call Bi-F-LSTM (Figure 4(d)); a shortcut connection between the input and hidden states is introduced to make the model easier to train. The aim of the extension
Fig. 5. Transition for Bi-S-LSTM (left) and Bi-F-LSTM (right).
models is to learn an extra hidden transition function $F_h$. Formally, in Bi-S-LSTM,

$$h^{l+1}_t = F_h\left(h^{l-1}_t, h^l_{t-1}\right) = U h^{l-1}_t + V h^l_{t-1}, \quad (11)$$

where $h^l_t$ denotes the hidden state of the $l$-th layer at time $t$, and $U$ and $V$ are the matrices connecting to the transition layer (see also Figure 5 (left)). For readability, we consider one-direction training and suppress bias terms. Similarly, Bi-F-LSTM learns a hidden transition function $F_h$ by

$$h^{l+1}_t = F_h\left(h^{l-1}_t\right) = \phi_r\left(W h^{l-1}_t \oplus V\left(U h^{l-1}_t\right)\right), \quad (12)$$

where $\oplus$ is the operator that concatenates $h^{l-1}_t$ and its abstractions into a long hidden state (see also Figure 5 (right)). $\phi_r$ denotes the rectified linear unit (ReLU) activation function for the transition layer, which performs $\phi_r(x) = \max(0, x)$.
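The difference between the two transitions in Eqs. (11) and (12) can be made concrete with a small NumPy sketch; the hidden size and random weights are illustrative only:

```python
import numpy as np

H = 8                                       # hidden size (hypothetical)
rng = np.random.default_rng(1)
U, V, W = (rng.standard_normal((H, H)) * 0.08 for _ in range(3))

def stacked_transition(h_below, h_prev_time):
    """Bi-S-LSTM hidden-to-hidden transition, Eq. (11): combine the lower
    layer's current state with this layer's previous-timestep state."""
    return U @ h_below + V @ h_prev_time

def fc_transition(h_below):
    """Bi-F-LSTM transition, Eq. (12): ReLU over the concatenation of the
    transformed input and its abstraction V(U h) -- the shortcut connection
    keeps the original signal available alongside the deeper one."""
    z = np.concatenate([W @ h_below, V @ (U @ h_below)])
    return np.maximum(0.0, z)               # phi_r, the ReLU

h_s = stacked_transition(rng.standard_normal(H), rng.standard_normal(H))
h_f = fc_transition(rng.standard_normal(H))
```

Note how the stacked transition depends on two states (depth in layers *and* time), while the FC transition only deepens the per-timestep computation, which is why it adds fewer recurrent connections.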
3.4 Data Augmentation
One of the most challenging aspects of training deep bidirectional LSTM models is preventing overfitting. Since our largest dataset has only 80K images (Lin et al. 2014), which can easily cause overfitting, we adopted several techniques commonly used in previous work, such as fine-tuning a pre-trained visual model, weight decay, dropout, and early stopping. Additionally, it has been shown that data augmentation such as random cropping and horizontal mirroring (Simonyan and Zisserman 2014a; Lu et al. 2014), as well as adding noise, blur, and rotation (Wang et al. 2015), can effectively alleviate overfitting. Inspired by this, we designed new data augmentation techniques to increase the number of image-sentence pairs. Our implementation operates on the visual model, as follows:
—Multi-Crop: Instead of randomly cropping the input image, we crop at the four corners and the center region, because we found that random cropping tends to select the center region and easily causes overfitting. By cropping the four corners and the center, the variation of the network input can be increased to alleviate overfitting.
—Multi-Scale: To further increase the number of image-sentence pairs, we rescale the input image to multiple scales. Each input image I of size H × W is resized to 256 × 256; then we randomly select a region of size sH × sW, where s ∈ {1, 0.925, 0.875, 0.85} is the scale ratio. s = 1 means no multi-scale operation is applied to the given image. Finally, we resize the region to the AlexNet input size of 227 × 227 or the VGG-16 input size of 224 × 224.
—Vertical Mirror: Motivated by the effectiveness of the widely used horizontal mirror, it is natural to also consider the vertical mirror of an image for the same purpose.
These augmentation techniques are implemented in a real-time fashion: each input image is randomly transformed using one of the augmentations before being fed to the network for training. In principle, our data augmentation can increase the number of image-sentence training pairs by roughly 40 times (5 × 4 × 2). We report the evaluation of data augmentation in Section 4.4.
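The augmentation scheme above (5 crops × 4 scales × 2 mirror states = 40 variants per image) might be enumerated as in the following sketch. The corner-region "rescaling" is a cheap stand-in for true interpolation, and the array shapes are assumptions:

```python
import numpy as np

SCALES = [1.0, 0.925, 0.875, 0.85]          # multi-scale ratios from Section 3.4

def five_crops(img, size):
    """Corner and center crops instead of random crops (Multi-Crop)."""
    h, w = img.shape[:2]
    s = size
    return [img[:s, :s], img[:s, w - s:], img[h - s:, :s], img[h - s:, w - s:],
            img[(h - s) // 2:(h - s) // 2 + s, (w - s) // 2:(w - s) // 2 + s]]

def augment(img, crop_size=227):
    """Enumerate all crop/scale/mirror variants of one 256x256 image.
    Rescaling is approximated by taking a corner region for brevity;
    a real pipeline would resize with interpolation back to crop_size."""
    variants = []
    for s in SCALES:
        side = int(256 * s)
        scaled = img[:side, :side]          # stand-in for true resizing
        for crop in five_crops(scaled, min(crop_size, side)):
            variants.append(crop)           # original orientation
            variants.append(crop[::-1])     # vertical mirror (flip upside-down)
    return variants

img = np.zeros((256, 256, 3), dtype=np.uint8)
v = augment(img)
assert len(v) == 5 * 4 * 2                  # 40 variants per image, as in the text
```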
3.5 Multi-Task/Transfer Learning
Although our data augmentation can reduce overfitting in training a deep LSTM network, it only helps to a certain extent. Increasing the effective training size with fresh training examples can further enlarge the variation of the training data. This can effectively prevent the training loss from going down too quickly and reduce overfitting. On the other hand, it is also beneficial for increasing model robustness and generality. To address this issue, we propose to combine the training examples from different datasets; in our case, $D_{multi} = D_{flickr8K} \cup D_{flickr30K} \cup D_{mscoco}$. With the combined dataset $D_{multi}$, we train a multi-task joint model $M_{multi}$; then we evaluate model performance on the validation/test sets of the different datasets, respectively.
In order to further test the generality and performance of the multi-task joint model $M_{multi}$ in transferring knowledge learned on $D_{multi}$ to a new dataset, we use $M_{multi}$ to perform image captioning and image-sentence retrieval on the target dataset $D_{pascal1K}$. Here, we do not use any images from Pascal1K for training, only for validation. We report the evaluation of multi-task/transfer learning of Bi-LSTM in Section 4.5.
3.6 Training and Inference
Our model is end-to-end trainable using Stochastic Gradient Descent (SGD). The joint loss function $L = \overrightarrow{L} + \overleftarrow{L}$ is computed by accumulating the Softmax losses of the forward and backward directions. Our objective is to minimize $L$, which is equivalent to maximizing the probabilities of correctly generated sentences. We compute the gradient $\nabla L$ with the Back-Propagation Through Time (BPTT) algorithm (Werbos 1990).

The trained model is used to predict a word $w_t$ given the image context $I$ and previous word context $w_{1:t-1}$ by $P(w_t \mid w_{1:t-1}, I)$ in forward order, or by $P(w_t \mid w_{t+1:T}, I)$ in backward order. We set $w_1 = w_T = 0$ at the start point for the forward and backward directions, respectively. Ultimately, with generated sentences from the two directions, we decide the final sentence for a given image $p(w_{1:T} \mid I)$ according to the average word probability within the sentence:

$$p(w_{1:T} \mid I) = \max\left(\frac{1}{T}\sum_{t=1}^{T}\overrightarrow{p}(w_t \mid I),\ \frac{1}{T}\sum_{t=1}^{T}\overleftarrow{p}(w_t \mid I)\right), \quad (13)$$
$$\overrightarrow{p}(w_t \mid I) = \prod_{t=1}^{T} p(w_t \mid w_1, w_2, \ldots, w_{t-1}, I), \quad (14)$$
$$\overleftarrow{p}(w_t \mid I) = \prod_{t=1}^{T} p(w_t \mid w_{t+1}, w_{t+2}, \ldots, w_T, I). \quad (15)$$
Following previous work, we adopted beam search, which considers the best $k$ candidate sentences at time $t$ to infer the sentence at the next timestep. In our work, we fix $k = 1$ in all experiments, although an average gain of 2 BLEU (Papineni et al. 2002) points can be achieved with $k = 20$ compared to $k = 1$, as reported in Vinyals et al. (2015).
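Selecting the final caption between the two directional candidates, as in Eq. (13), reduces to comparing average per-word probabilities. The captions and probabilities below are hypothetical:

```python
def sentence_score(word_probs):
    """Average word probability within a sentence, as in Eq. (13)."""
    return sum(word_probs) / len(word_probs)

def pick_final_caption(fwd_caption, fwd_probs, bwd_caption, bwd_probs):
    """Keep whichever direction assigns its sentence the higher average
    probability; Figure 2 shows cases where the backward caption wins."""
    if sentence_score(fwd_probs) >= sentence_score(bwd_probs):
        return fwd_caption
    return bwd_caption

# hypothetical per-word probabilities for two candidate captions
fwd = ["a", "dog", "runs"]
fwd_p = [0.9, 0.4, 0.5]              # mean 0.6
bwd = ["a", "dog", "is", "running"]
bwd_p = [0.9, 0.6, 0.7, 0.6]         # mean 0.7 -> backward caption wins
final = pick_final_caption(fwd, fwd_p, bwd, bwd_p)
```

Averaging rather than multiplying the per-word probabilities avoids penalizing the longer of the two candidates, which matters because the forward and backward captions need not have the same length.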
4 EXPERIMENTS
In this section, we design several groups of experiments to accomplish the following objectives:
—Measure the benefits and performance of the proposed bidirectional model and its deeper variant models, in which we increase the nonlinearity depth in different ways.
—Examine the influence of data augmentation and multi-task/transfer learning on the bidirectional LSTM.
—Compare our approach with state-of-the-art methods in terms of sentence generation andimage-sentence retrieval tasks on popular benchmark datasets.
—Qualitatively analyze and understand how bidirectional multimodal LSTM learns to gener-ate a sentence conditioned by visual context information over time.
4.1 Datasets
To validate the effectiveness, generality, and robustness of our models, we conduct experiments on four benchmark datasets: Flickr8K (Hodosh et al. 2013), Flickr30K (Young et al. 2014), MSCOCO (Lin et al. 2014), and Pascal1K (Rashtchian et al. 2010) (used only for the transfer learning experiment).
Flickr8K. It consists of 8,000 images, each with five sentence-level captions. We follow the standard dataset division provided by the authors: 6,000/1,000/1,000 images for training/validation/testing, respectively.
Flickr30K. An extended version of Flickr8K. It contains 31,783 images, each with five captions. We follow the publicly accessible2 dataset division of Karpathy and Li (2015), in which 29,000/1,000/1,000 images are used for training/validation/testing, respectively.
MSCOCO. This is a recently released dataset covering 82,783 images for training and 40,504 images for validation. Each image has five sentence annotations. Since standard splits are lacking, we also follow the splits provided by Karpathy and Li (2015): 80,000 training images and 5,000 images each for validation and testing.
Pascal1K. This dataset is used only for evaluating the generality of our models in the transfer learning experiment. It is a subset of images from the PASCAL VOC challenge. It contains 1,000 images, each with five sentence descriptions. We do not use any images from this dataset for training. Following the protocol of Socher et al. (2014), we randomly selected 100 images for validation.
4.2 Implementation Details
Visual features. We use two visual models for encoding images: the Caffe (Jia et al. 2014) reference model, which is pre-trained with AlexNet (Krizhevsky et al. 2012), and the 16-layer VGG model (Simonyan and Zisserman 2014b). We extract features from the last fully connected layer and feed them to the visual-language LSTM model for training. Previous work (Vinyals et al. 2015; Mao et al. 2015) has demonstrated that more powerful image models such as GoogleNet (Szegedy et al. 2015) and ResNet (He et al. 2016) can achieve promising improvements. To make a fair comparison with recent work, we selected two widely used models for our experiments.
Textual features. We first represent each word $w$ within a sentence as a one-hot vector, $w \in \mathbb{R}^K$, where $K$ is the vocabulary size built on the training sentences of a given dataset. After basic tokenization and removing words that occur fewer than five times in the training set, we have 2,028, 7,400, and 8,801 words in the Flickr8K, Flickr30K, and MSCOCO vocabularies, respectively.
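The vocabulary construction described above (basic tokenization, dropping words that occur fewer than five times) might look like the following sketch; the whitespace tokenizer is a simplification of whatever tokenization the original pipeline used:

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Tokenize training captions and keep words occurring >= min_count
    times; each surviving word gets an index for its one-hot vector."""
    counts = Counter(w for c in captions for w in c.lower().split())
    words = sorted(w for w, n in counts.items() if n >= min_count)
    return {w: i for i, w in enumerate(words)}

def one_hot(word, vocab):
    """Map a word to its K-dimensional one-hot vector."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

# toy corpus: "a" and "dog" appear 5 times, "cat" only twice
caps = ["a dog"] * 5 + ["cat sat", "cat ran"]
vocab = build_vocab(caps, min_count=5)
```

Dropping rare words keeps $K$ (and hence the Softmax and embedding layers) small, at the cost of mapping rare words out of the vocabulary.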
Our work uses the LSTM implementation of Donahue et al. (2015) in the Caffe framework. All of our experiments were conducted on Ubuntu 14.04 with 16GB RAM and a single Titan X GPU with 12GB memory. Our LSTMs use 1,000 hidden units, and weights were initialized uniformly from [−0.08, 0.08]. The batch sizes are 150, 100, and 100 for Bi-LSTM, Bi-S-LSTM, and Bi-F-LSTM, respectively,
2http://cs.stanford.edu/people/karpathy/deepimagesent/.
Fig. 6. METEOR/CIDEr scores on data augmentation.
when we use AlexNet as the visual model. When we use VGG as the visual model, the batch size is set to 32. Models are trained with learning rates η = 0.01 (AlexNet-based training) and η = 0.005 (VGG-based training), weight decay λ = 0.0005, and momentum 0.9. Each model is trained for 18–35 epochs with early stopping.
4.3 Evaluation Metrics
We evaluate our models mainly on caption generation; we follow previous work in using BLEU-N (N = 1, 2, 3, 4) scores (Papineni et al. 2002):

$$B_N = \min\left(1,\ e^{1-\frac{r}{c}}\right) \cdot e^{\frac{1}{N}\sum_{n=1}^{N}\log p_n}, \quad (16)$$
where $r$ and $c$ represent the lengths of the reference sentence and the generated sentence, respectively, and $p_n$ are the modified n-gram precisions. We also report METEOR (Lavie 2014) and CIDEr (Vedantam et al. 2015) scores for further comparison. To evaluate the generality of our models, we conduct a transfer learning experiment on Pascal1K for image-sentence retrieval3 (image query sentence and vice versa). This is performed by computing the score of each image-sentence pair and ranking the scores to obtain the top-K (K = 1, 5, 10) retrieved results. We adopt R@K and Mean r as the evaluation metrics: R@K is the recall rate R at the top K candidates and Mean r is the mean rank. All mentioned metric scores are computed by the MSCOCO caption evaluation server,4 which is commonly used for the image captioning challenge.5
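Equation (16) combines a brevity penalty with the geometric mean of the modified n-gram precisions; given precomputed precisions, the score itself is a one-liner. The precision values below are hypothetical:

```python
import math

def bleu_n(p, r, c):
    """BLEU-N per Eq. (16): brevity penalty times the geometric mean of
    the modified n-gram precisions p = [p_1, ..., p_N].
    r: reference-sentence length, c: generated-sentence length."""
    brevity = min(1.0, math.exp(1.0 - r / c))   # penalize candidates shorter than reference
    return brevity * math.exp(sum(math.log(pn) for pn in p) / len(p))

# hypothetical precisions; candidate as long as reference -> no penalty
score = bleu_n([0.8, 0.6, 0.4, 0.3], r=10, c=10)
```

Note that computing the modified (clipped) n-gram precisions $p_n$ from raw sentences involves more bookkeeping than shown here; in practice the paper's scores come from the MSCOCO evaluation server.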
4.4 Experiments on Data Augmentation
In this subsection, we design a group of experiments to examine the effects of the data augmentation techniques. To this end, we use Bi-S-LSTM for the experiment, because it has a deeper LSTM architecture, and we believe that training a deeper LSTM network on limited data is more challenging and thus more helpful for measuring the benefits brought by data augmentation. In this experiment, we turn off the augmentation techniques introduced in Section 3.4 and keep other configurations unchanged. The BLEU performance is reported in Table 1 and Table 2; METEOR/CIDEr performance is reported in Figure 6 (shown as Bi-S-LSTM^{A,-D}). It is clear that without data augmentation, model performance drops significantly on all metrics. These results also reveal how data augmentation affects datasets of different scales. For example, model performance on the small-scale
3Although this work focuses on the image captioning task, we conduct an image-sentence retrieval experiment here to examine the generality of our models across datasets and tasks. The task has been discussed widely in our previous work (Wang et al. 2016d).
4https://github.com/tylin/coco-caption.
5http://mscoco.org/home/.
ACM Trans. Multimedia Comput. Commun. Appl., Vol. 14, No. 2s, Article 40. Publication date: April 2018.
40:12 C. Wang et al.
Table 1. BLEU-N Performance Comparison on Flickr8K and Flickr30K (Higher Is Better)

                                        |       Flickr8K          |       Flickr30K
Models                                  | B-1   B-2   B-3   B-4   | B-1   B-2   B-3   B-4
NIC (Vinyals et al. 2015)^G,‡           | 63    41    27.2  -     | 66.3  42.3  27.7  18.3
X. Chen et al. (Chen and Zitnick 2014)  | -     -     -     14.1  | -     -     -     12.6
LRCN (Donahue et al. 2015)^A,‡          | -     -     -     -     | 58.8  39.1  25.1  16.5
DeepVS (Karpathy and Li 2015)^V         | 57.9  38.3  24.5  16    | 57.3  36.9  24.0  15.7
m-RNN (Mao et al. 2015)^A,‡             | 56.5  38.6  25.6  17.0  | 54    36    23    15
m-RNN (Mao et al. 2015)^V,‡             | -     -     -     -     | 60    41    28    19
Hard-Attention (Xu et al. 2015)^V       | 67    45.7  31.4  21.3  | 66.9  43.9  29.6  19.9
ATT-FCN (You et al. 2016)^G             | -     -     -     -     | 64.7  46.0  32.4  23.0
C. Wang et al. (Wang et al. 2016d)^V    | 65.5  46.8  32.0  21.5  | 62.1  42.6  28.1  19.3
Bi-LSTM^A                               | 63.7  44.7  31    20.9  | 61.0  40.9  27.1  18.1
Bi-S-LSTM^A                             | 65.1  45.0  29.3  18.4  | 60.0  40.3  27.1  18.2
Bi-F-LSTM^A                             | 63.9  44.6  30.2  19.9  | 60.7  41.0  27.5  18.5
Bi-LSTM^V                               | 66.7  48.3  33.7  23    | 63.3  44.1  29.6  20.1
Bi-S-LSTM^V                             | 66.9  48.8  33.3  22.8  | 63.6  44.8  30.4  20.5
Bi-F-LSTM^V                             | 66.5  48.4  32.8  22.4  | 63.4  44.3  30.1  20.4
Bi-LSTM^A,+M                            | 58.4  42.1  28.6  18.2  | 61.0  41.4  27.8  18.5
Bi-S-LSTM^A,−D                          | 55.4  38.0  24.6  15.3  | 58.2  39.0  25.1  16.3

The superscript "A" means the visual model is AlexNet (or a similar network), "V" is VGG-16, and "G" is GoogleNet; "−D" means without the data augmentations of Section 3.4; "+M" means with the multi-task learning of Section 3.5; "-" indicates an unknown value; "‡" means different data splits.⁶ The best results are marked in bold and the second-best results with an underline (these superscripts also apply to Tables 2, 3, and 4).
dataset Flickr8K is worse than that on Flickr30K and MSCOCO. This confirms that data augmentation is beneficial in preventing overfitting and is particularly helpful on small-scale datasets.
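As a rough illustration of augmentation of this kind, random cropping plus horizontal flipping can be sketched as below; whether these two operations match the exact techniques of Section 3.4 is an assumption here.

```python
import numpy as np

def augment(image, crop_size, rng):
    """Random square crop followed by a random horizontal flip.

    `image` is an H x W x C array and `crop_size` the target side length;
    each call yields a slightly different view of the same training image.
    """
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop_size + 1))
    left = int(rng.integers(0, w - crop_size + 1))
    patch = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:  # mirror the patch half of the time
        patch = patch[:, ::-1]
    return patch
```

Applied on the fly during training, this multiplies the effective number of distinct views per image without storing extra data.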
4.5 Experiments on Multi-Task/Transfer Learning
In addition to using data augmentation to increase the variation of training examples and reduce overfitting, another effective approach is multi-task learning. Inspired by Simonyan and Zisserman (2014a) and Donahue et al. (2015), in which datasets were combined to train a joint model, we combine the training sets of Flickr8K, Flickr30K, and MSCOCO to increase the number of training examples. We then train a multi-task joint model on the combined training sets and evaluate on each validation set to examine its performance and generality. To save training time, we initialize the training of the multi-task joint model with the best-performing pre-trained MSCOCO model. We change the number of input units of the embedding layers and the number of output units of the last fully connected layer; both are set to the vocabulary size (11,557) of the combined training set.
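The layer-resizing step can be sketched framework-neutrally as below. Copying the pretrained rows for words already in the old vocabulary and freshly initializing the rest is an assumption about the warm-start strategy, not a detail stated in the text.

```python
import numpy as np

def resize_vocab_layers(emb, out_w, out_b, new_vocab, rng):
    """Grow the vocabulary-dependent matrices of a pretrained captioning model.

    emb has shape (old_vocab, dim); out_w and out_b are the last layer's
    weights (old_vocab, hidden) and biases (old_vocab,). Rows for the old
    vocabulary are copied over; rows for new words are freshly initialized.
    """
    old_vocab, dim = emb.shape
    hidden = out_w.shape[1]
    new_emb = rng.normal(0.0, 0.01, (new_vocab, dim))
    new_w = rng.normal(0.0, 0.01, (new_vocab, hidden))
    new_b = np.zeros(new_vocab)
    new_emb[:old_vocab] = emb      # keep pretrained word embeddings
    new_w[:old_vocab] = out_w      # keep pretrained output weights
    new_b[:old_vocab] = out_b
    return new_emb, new_w, new_b
```

Only the two vocabulary-dependent layers change shape; all remaining parameters can be taken unchanged from the pretrained MSCOCO model.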
To compare with baseline models that do not use multi-task learning, we select the best-performing models⁷ for the Flickr8K, Flickr30K, MSCOCO, and Pascal1K datasets, respectively. The comparison with baseline models in terms of BLEU scores is reported in Table 1 and Table 2. The results show that the multi-task joint model (shown as Bi-LSTM^A,+M) did not improve the BLEU
⁶ On the MSCOCO dataset, NIC uses 4K images for validation and test. LRCN randomly selects 5K images from the MSCOCO validation set for validation and test. m-RNN uses 4K images for validation and 1K for test.
⁷ The model from the 100,000th iteration has the best performance on Flickr8K; the model from the 90,000th iteration performs best on the rest of the datasets.
Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning 40:13
Table 2. BLEU-N, METEOR, and CIDEr Performance Comparison on MSCOCO

Models                                  | B-1   B-2   B-3   B-4   | METEOR | CIDEr
NIC (Vinyals et al. 2015)^G,‡           | 66.6  46.1  32.9  24.6  | -      | -
X. Chen et al. (Chen and Zitnick 2014)  | -     -     -     19.0  | 20.4   | -
LRCN (Donahue et al. 2015)^A,‡          | 62.8  44.2  30.4  -     | -      | -
DeepVS (Karpathy and Li 2015)^V         | 62.5  45    32.1  23    | 19.5   | 66.0
m-RNN (Mao et al. 2015)^V,‡             | 67    49    35    25    | -      | -
Hard-Attention (Xu et al. 2015)^V       | 71.8  50.4  35.7  25    | 23.0   | -
ATT-FCN (You et al. 2016)^G             | 70.9  53.7  40.2  30.4  | 24.3   | -
C. Wang et al. (Wang et al. 2016d)^V    | 67.2  49.2  35.2  24.4  | 21.6   | 71.0
Bi-LSTM^A                               | 65.1  45.0  29.3  18.4  | 20.0   | 64.1
Bi-S-LSTM^A                             | 64.1  45.4  31.3  21.1  | 20.7   | 68.1
Bi-F-LSTM^A                             | 64.0  45.5  31.5  21.5  | 20.5   | 67.5
Bi-LSTM^V                               | 68.5  50.5  36.0  25.3  | 22.1   | 73.0
Bi-S-LSTM^V                             | 68.7  50.9  36.4  25.8  | 22.9   | 73.9
Bi-F-LSTM^V                             | 68.2  50.6  36.1  25.6  | 22.6   | 73.5
Bi-LSTM^A,+M                            | 65.6  47.4  33.3  23.0  | 21.1   | 69.5
Bi-S-LSTM^A,−D                          | 62.8  44.4  30.2  20.0  | 19.7   | 60.7
Fig. 7. METEOR/CIDEr scores on multi-task learning.
score on the small dataset Flickr8K. We conjecture that, on the one hand, the multi-task joint model increases the diversity of training examples and model generality; on the other hand, it enlarges the differences between the training and validation data. These factors lead to worse BLEU performance on Flickr8K even though the generated sentences are highly descriptive and sensible (see also the examples in Figure 11). However, the multi-task joint model shows promising improvements on Flickr30K and MSCOCO. In addition, Table 1 and Table 2 show that the multi-task joint model tends to improve B-2, B-3, and B-4 performance (increases of 2.4, 4.0, and 4.6 points on MSCOCO). In Figure 7, the METEOR/CIDEr performance is improved with multi-task learning, except for METEOR on Flickr8K.
4.6 Comparison with State-of-The-Art Methods
4.6.1 Model Performance. We now compare with state-of-the-art methods. Table 1 and Table 2 summarize the comparison results in terms of BLEU-N. Our approach achieves very competitive
Table 3. Image Captioning Performance Comparison on Pascal1K

Methods                                   | BLEU  | METEOR
Midge (Mitchell et al. 2012)              | 2.89  | 8.80
Baby talk (Kulkarni et al. 2011)          | 0.49  | 9.69
RNN (Chen and Zitnick 2014)               | 2.79  | 10.08
RNN+IF (Chen and Zitnick 2014)            | 10.16 | 16.43
RNN+IF+FT (Chen and Zitnick 2014)         | 10.18 | 16.45
X. Chen et al. (Chen and Zitnick 2014)    | 10.48 | 16.69
X. Chen et al.+FT (Chen and Zitnick 2014) | 10.77 | 16.87
Bi-LSTM^A (transfer)                      | 16.4  | 18.30
performance on the evaluated datasets, despite using a less powerful visual model, AlexNet. Increasing the depth of the LSTM is beneficial for the generation task. The deeper variant models mostly obtain better performance compared to Bi-LSTM, but they are inferior to the latter in B-3 and B-4 on Flickr8K. We believe the reason is that Flickr8K is a relatively small dataset, which makes training deep models on limited data difficult. One interesting finding is that stacking multiple LSTM layers is generally superior to an LSTM with a fully connected transition layer, although Bi-S-LSTM needs more training time. Replacing AlexNet with VGG-16 results in significant improvement on all BLEU evaluation metrics. Note that a recent interesting work (Xu et al. 2015) achieves the best results on B-1 by integrating an attention mechanism (LeCun et al. 2015; Xu et al. 2015). Semantic attention (You et al. 2016) with GoogleNet achieves the best performance on B-2, B-3, and B-4.
Regarding METEOR and CIDEr performance, our baseline model (Bi-LSTM^A) outperforms DeepVS^V (Karpathy and Li 2015) by a clear margin. It achieves 19.1/51.8 on Flickr8K (compared to 16.7/31.8 for DeepVS^V) and 16.1/29.0 on Flickr30K (15.3/24.7 for DeepVS^V). On MSCOCO, our best results are 22.9/73.9; the METEOR score is slightly inferior to 23.0 in Xu et al. (2015) and 24.3 in You et al. (2016) but exceeds the remaining methods. Although we believe incorporating an attention mechanism into our framework could yield further improvements, our current model already achieves competitive results, with only a small gap to the attention-based models (Xu et al. 2015; You et al. 2016).
Compared to our prior work (Wang et al. 2016d), we use the mean probability in Equation (13), rather than the sum probability over all words, when selecting the final caption from the bidirectionally generated captions. This slightly improves our model performance on nearly all metrics, by an average of 1.7 points on Flickr8K, 1.2 points on Flickr30K, and 1.1 points on MSCOCO.
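The selection step can be sketched as follows. Averaging log-probabilities rather than raw probabilities is an assumption here (the exact form of Equation (13) is not reproduced in this excerpt), but either way the mean removes the bias toward shorter sentences that a summed score introduces.

```python
import math

def select_caption(forward, backward):
    """Pick the final caption from the two directional candidates.

    Each candidate is a (words, per_word_probs) pair; the one with the
    higher mean per-word log-probability wins, so a longer sentence is
    not penalized merely for containing more factors in its product.
    """
    def mean_logp(candidate):
        probs = candidate[1]
        return sum(math.log(p) for p in probs) / len(probs)
    return forward if mean_logp(forward) >= mean_logp(backward) else backward
```

For instance, a three-word candidate ending in a low-probability word loses to a confident two-word candidate, even though summed log-probabilities would compare lengths unfairly.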
4.6.2 Model Generality. To further evaluate the generality of our model on image captioning, we test our joint model on the Pascal1K validation dataset. Table 3 presents the comparison with related work on BLEU and METEOR. Even with our base model Bi-LSTM^A, the performance on the generation task exceeds previous approaches by a clear margin, without using any training images from Pascal1K.
On the same dataset, we also examine the generality of our model on a different task: image-sentence retrieval. The results are reported in Table 4. They show that, without using any training images from Pascal1K, our model substantially outperforms previous work on all metrics. On R@1 in particular, transfer learning gains more than 20 points on both the image-to-sentence and sentence-to-image retrieval tasks.
These experiments demonstrate that, even with a less powerful visual model, our simplest network (Bi-LSTM) achieves the best performance on both the image captioning and image-sentence retrieval tasks.
Table 4. Image-Sentence Retrieval Performance Comparison on Pascal1K
Image to Sentence Sentence to Image
Methods R@1 R@5 R@10 M_r R@1 R@5 R@10 M_r
Random Ranking 4.0 9.0 12.0 71.0 1.6 5.2 10.6 50.0
KCCA (Socher et al. 2014) 21.0 47.0 61.0 18.0 16.4 41.4 58.0 15.9
DeViSE (Frome et al. 2013) 17.0 57.0 68.0 11.9 21.6 54.6 72.4 9.5
SDT-RNN (Socher et al. 2014) 25.0 56.0 70.0 13.4 35.4 65.2 84.4 7.0
DeepFE (Karpathy et al. 2014) 39.0 68.0 79.0 10.5 23.6 65.2 79.8 7.6
RNN+IF (Chen and Zitnick 2014) 31.0 68.0 87.0 6.0 27.2 65.4 79.8 7.0
X. Chen et al. (Chen and Zitnick 2014) 25.0 71.0 86.0 5.4 28.0 65.4 82.2 6.8
X. Chen et al. (T+I) (Chen and Zitnick 2014) 30.0 75.0 87.0 5.0 28.0 67.4 83.4 6.2
Bi-LSTMA (transfer) 65.0 90.0 95.0 2.0 52.8 86.0 95.4 2.1
Fig. 8. Visualization of an LSTM cell. The horizontal axis corresponds to timesteps; the vertical axis is the cell index. Here we visualize the gates and cell states of the first 32 Bi-LSTM units of the T-LSTM in the forward direction over 11 timesteps.
ACM Trans. Multimedia Comput. Commun. Appl., Vol. 14, No. 2s, Article 40. Publication date: April 2018.
40:16 C. Wang et al.
Fig. 9. Patterns of the first 96 hidden units chosen at each layer of Bi-LSTM in both forward and backward directions. The vertical axis represents timesteps; the horizontal axis corresponds to different LSTM units. In this example, we visualize the T-LSTM layer for text only, the M-LSTM layer for both text and image, and the Softmax layer for word prediction. The model was trained on the Flickr30K dataset to generate a sentence word by word at each timestep. In (g), we provide the predicted words at different timesteps and their corresponding indices in the vocabulary, which can also be read from (e) and (f) (the highlighted point in each row). The word with the highest probability is selected as the predicted word.
Fig. 10. Examples of generated captions for a given query image on the MSCOCO validation set. Blue captions are generated in the forward direction and red captions in the backward direction. The final caption is selected according to Equation (13), which selects the sentence with the higher mean probability. The final captions are marked in bold.
4.7 Visualization and Qualitative Analysis
The aim of this set of experiments is to visualize the properties of the proposed bidirectional LSTM model and to explain how it generates a sentence word by word over time.
First, we examine the temporal evolution of the internal gate states to understand how bidirectional LSTM units retain valuable context information and attenuate unimportant information. Figure 8 shows the input and output data, the patterns of the three sigmoid gates (input, forget, and output), as well as the cell states. We can clearly see that dynamic states are periodically distilled into the units from timestep t = 0 to t = 11. At t = 0, the input data are sigmoid-modulated to the input gate i(t)
Fig. 11. Examples of generated captions for given query images on the Flickr8K, Flickr30K, MSCOCO, and Pascal1K validation sets. Left: input images. Right: → and ← denote the captions generated in the forward and backward direction, respectively. The superscript M or T marks captions generated with multi-task or transfer learning. The final captions are marked in bold.
where values lie within [0, 1]. At this step, the values of the forget gates f(t) of the different LSTM units are zero. As the timestep increases, the forget gate starts to decide which unimportant information should be forgotten while retaining useful information. The memory cell states c(t) and the output gate o(t) then gradually absorb the valuable context information over time and form a rich representation h(t) of the output data.
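The gate dynamics described here follow the standard LSTM equations (Hochreiter and Schmidhuber 1997); a minimal NumPy sketch is shown below. The packing of all four gate parameter blocks into one stacked matrix is an arbitrary implementation choice, not a detail of the paper's model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep exposing the gates visualized in Figure 8.

    W (4d x input), U (4d x d), and b (4d,) stack the parameters for the
    input gate i, forget gate f, output gate o, and candidate update g.
    """
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:d])           # input gate i(t)
    f = sigmoid(z[d:2 * d])      # forget gate f(t)
    o = sigmoid(z[2 * d:3 * d])  # output gate o(t)
    g = np.tanh(z[3 * d:])       # candidate cell content
    c = f * c_prev + i * g       # cell state c(t): retain + absorb
    h = o * np.tanh(c)           # hidden representation h(t)
    return h, c
```

With zero parameters, all three sigmoid gates sit at 0.5, so the cell state decays toward zero over timesteps, illustrating how the forget gate attenuates stale context.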
Next, we examine how visual and textual features are embedded into a common semantic space and used to predict words over time. Figure 9 shows the evolution of the hidden units at different layers. For the T-LSTM layer, where LSTM units are conditioned on textual context from the past
and future, it serves as the encoder of the forward and backward sentences. At the M-LSTM layer, LSTM units are conditioned on both visual and textual context; this layer learns the correlations between the input word sequence and the visual information encoded by the CNN. At a given timestep, by removing unimportant information that contributes little to correlating the input word with the visual context, the unit activations tend to become sparse, and the units learn more discriminative representations from the inputs. At the higher layer, the embedded multimodal representations are used to compute the probability distribution of the next predicted word with Softmax. It should be noted that, for a given image, the number of words in the sentences generated in the forward and backward directions can differ.
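The Softmax word-prediction step at the top layer can be sketched as below; the weight names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def predict_word(h, W_soft, b_soft):
    """Softmax word prediction from the top-layer multimodal hidden state.

    Returns the index of the most probable word and the full
    probability distribution over the vocabulary.
    """
    logits = W_soft @ h + b_soft
    e = np.exp(logits - logits.max())  # shift logits for numerical stability
    p = e / e.sum()
    return int(np.argmax(p)), p
```

At each timestep, the argmax word is emitted and fed back as the next input until an end-of-sentence token is produced.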
Figure 10 presents some example images with generated captions. We found that the bidirectionally generated captions cover different semantic information; for example, in (b) the forward sentence captures "couch" and "table" while the backward one describes "chairs" and "table." We also found that a significant proportion (88% of 1,000 randomly selected images on the MSCOCO validation set) of the generated sentences are novel (they do not appear in the training set), yet they remain highly similar to the ground-truth captions; for example, in (d), the forward caption is similar to one of the ground-truth captions ("A passenger train that is pulling into a station") and the backward caption is similar to the ground-truth caption ("a train is in a tunnel by a station"). This illustrates that our model has a strong capability for learning visual-language correlations and generating novel sentences.
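A novelty check of this kind can be sketched as follows; exact string matching after lowercasing is an assumption about how novelty was measured in the paper.

```python
def novelty_rate(generated, training_captions):
    """Fraction of generated sentences that never occur verbatim
    in the training captions (case-insensitive exact match)."""
    seen = {s.strip().lower() for s in training_captions}
    novel = sum(1 for s in generated if s.strip().lower() not in seen)
    return novel / len(generated)
```

Run over a random sample of generated captions against the full training corpus, this yields the kind of proportion reported above.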
More example sentence generations on Flickr8K, Flickr30K, MSCOCO, and Pascal1K can be found in Figure 11. These examples demonstrate that, without using an explicit language model pre-trained on an additional corpus, our models generate sentences that are highly descriptive and semantically relevant to the corresponding images.
5 CONCLUSIONS
We proposed a bidirectional LSTM model that generates a descriptive sentence for an image by taking both history and future context into account. We further designed deep bidirectional LSTM architectures to embed image and sentence in a high-level semantic space for learning a visual-language model. We showed that multi-task learning of Bi-LSTM is beneficial for increasing model generality, which was further confirmed by a transfer learning experiment. We also qualitatively visualized the internal states of the proposed model to understand how the multimodal bidirectional LSTM generates words at consecutive timesteps. The effectiveness, generality, and robustness of the proposed models were evaluated on numerous datasets across two different tasks: image captioning and image-sentence retrieval. Our models achieve highly competitive results on both tasks. Our future work will focus on exploring more sophisticated language representations (e.g., word2vec) and incorporating an attention mechanism into our model. It would also be interesting to explore the multilingual caption generation problem. We also plan to apply our models to other sequence learning tasks such as text recognition and video captioning.
REFERENCES
D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
Rich Caruana. 1998. Multitask learning. In Learning to Learn. Springer, 95–133.
Xinlei Chen and C. Lawrence Zitnick. 2014. Learning a recurrent visual representation for image caption generation. arXiv:1411.5654.
X. Chen and C. Lawrence Zitnick. 2015. Mind's eye: A recurrent visual representation for image caption generation. In CVPR. 2422–2431.
Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia 17, 11 (2015), 1875–1886.
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014.
Junyoung Chung, Caglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In ICML. 2067–2075.
J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In CVPR. 2625–2634.
H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, and J. Platt. 2015. From captions to visual concepts and back. In CVPR. 1473–1482.
A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS. 2121–2129.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition.
Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. 2009. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 5 (2009), 855–868.
A. Graves, A. Mohamed, and G. E. Hinton. 2013. Speech recognition with deep recurrent neural networks. In ICASSP. IEEE, 6645–6649.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research 47 (2013), 853–899.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACM MM. ACM, 675–678.
A. Karpathy, A. Joulin, and F.-F. Li. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS. 1889–1897.
A. Karpathy and F.-F. Li. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR. 3128–3137.
R. Kiros, R. Salakhutdinov, and R. Zemel. 2014a. Multimodal neural language models. In ICML. 595–603.
R. Kiros, R. Salakhutdinov, and R. Zemel. 2014b. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS. 1097–1105.
Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11). IEEE, 1601–1608.
G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. Berg. 2013. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 35, 12 (2013), 2891–2903.
P. Kuznetsova, V. Ordonez, A. C. Berg, T. Berg, and Y. Choi. 2012. Collective generation of natural image descriptions. In ACL, Vol. 1. ACL, 359–368.
P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. 2014. TreeTalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics (TACL) 2, 10 (2014), 351–362.
M. Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. ACL (2014), 376.
Y. LeCun, Y. Bengio, and G. E. Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. 2011. Composing simple image descriptions using web-scale n-grams. In CoNLL. ACL, 220–228.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV. Springer, 740–755.
Xin Lu, Zhe Lin, Hailin Jin, Jianchao Yang, and James Z. Wang. 2014. RAPID: Rating pictorial aesthetics using deep learning. In Proceedings of the ACM International Conference on Multimedia. ACM, 457–466.
J. H. Mao, W. Xu, Y. Yang, J. Wang, Z. H. Huang, and A. Yuille. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR 2015.
T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH. 1045–1048.
T. Mikolov, S. Kombrink, L. Burget, J. H. Černocky, and S. Khudanpur. 2011. Extensions of recurrent neural network language model. In ICASSP. IEEE, 5528–5531.
M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In ACL. ACL, 747–756.
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. 2011. Multimodal deep learning. In ICML. 689–696.
Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL. ACL, 311–318.
R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. 2013. How to construct deep recurrent neural networks. arXiv:1312.6026.
C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshop. Association for Computational Linguistics, 139–147.
Nikhil Rasiwasia, Pedro J. Moreno, and Nuno Vasconcelos. 2007. Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia 9, 5 (2007), 923–938.
K. Simonyan and A. Zisserman. 2014a. Two-stream convolutional networks for action recognition in videos. In NIPS. 568–576.
K. Simonyan and A. Zisserman. 2014b. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics (TACL) 2 (2014), 207–218.
Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In 28th International Conference on Machine Learning (ICML'11). 129–136.
N. Srivastava and R. Salakhutdinov. 2012. Multimodal learning with deep Boltzmann machines. In NIPS. 2222–2230.
I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. 3104–3112.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In CVPR. 1–9.
R. Vedantam, Z. Lawrence, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In CVPR. 4566–4575.
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and tell: A neural image caption generator. In CVPR. 3156–3164.
Cheng Wang, Haojin Yang, Christian Bartz, and Christoph Meinel. 2016d. Image captioning with deep bidirectional LSTMs. arXiv:1604.00790.
Cheng Wang, Haojin Yang, and Christoph Meinel. 2016c. A deep semantic framework for multimodal representation learning. Multimedia Tools and Applications (2016), 1–22.
Wei Wang, Zhen Cui, Yan Yan, Jiashi Feng, Shuicheng Yan, Xiangbo Shu, and Nicu Sebe. 2016a. Recurrent face aging. In IEEE Conference on Computer Vision and Pattern Recognition. 2378–2386.
Wei Wang, Sergey Tulyakov, and Nicu Sebe. 2016b. Recurrent convolutional face alignment. In Asian Conference on Computer Vision. Springer, 104–120.
Zhangyang Wang, Jianchao Yang, Hailin Jin, Eli Shechtman, Aseem Agarwala, Jonathan Brandt, and Thomas S. Huang. 2015. DeepFont: Identify your font from an image. In 23rd ACM International Conference on Multimedia (MM'15). ACM, New York, 451–459. DOI: http://dx.doi.org/10.1145/2733373.2806219
Paul J. Werbos. 1990. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE 78, 10 (1990), 1550–1560.
Zhizheng Wu and Simon King. 2016. Investigating gated recurrent networks for speech synthesis. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'16). IEEE, 5140–5144.
K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML 2015.
Xuyong Yang, Tao Mei, Ying-Qing Xu, Yong Rui, and Shipeng Li. 2016. Automatic generation of visual-textual presentation layout. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 12, 2 (2016), 33.
Xiaoshan Yang, Tianzhu Zhang, and Changsheng Xu. 2015. Cross-domain feature learning in multimedia. IEEE Transactions on Multimedia 17, 1 (2015), 64–78.
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In CVPR. 4651–4659.
P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL) 2 (2014), 67–78.
W. Zaremba and I. Sutskever. 2014. Learning to execute. arXiv:1410.4615.
M. D. Zeiler and R. Fergus. 2014. Visualizing and understanding convolutional networks. In ECCV. Springer, 818–833.
Received December 2016; revised March 2017; accepted March 2017
8 A Deep Semantic Framework for Multimodal Representation Learning
In this paper, inspired by the success of deep networks in multimedia computing,
we propose a novel unified deep neural framework for multimodal representation
learning. The extensive experiments on benchmark Wikipedia and MIR Flickr
25K datasets show that our approach achieves promising results compared to
both shallow and deep models in multimodal and cross-modal retrieval tasks.
8.1 Contribution to the Work
• Contributed to the formulation and implementation of the research ideas
• Significantly contributed to the conceptual discussion and implementation
• Guided and supervised the technical implementation
8.2 Manuscript
Multimed Tools Appl (2016) 75:9255–9276
DOI 10.1007/s11042-016-3380-8
A deep semantic framework for multimodalrepresentation learning
Cheng Wang1 ·Haojin Yang1 ·Christoph Meinel1
Received: 24 September 2015 / Revised: 21 December 2015 / Accepted: 18 February 2016 / Published online: 3 March 2016
© Springer Science+Business Media New York 2016
Abstract Multimodal representation learning has gained increasing importance in various real-world multimedia applications. Most previous approaches focused on exploring inter-modal correlation by learning a common or intermediate space in a conventional way, e.g., Canonical Correlation Analysis (CCA). These works neglected the exploration of fusing multiple modalities at a higher semantic level. In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation, respectively. For joint model learning, a 5-layer neural network is designed and enforced with a supervised pre-training in the first 3 layers for intra-modal regularization. The extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves state-of-the-art results compared to both shallow and deep models in multimodal and cross-modal retrieval.
Keywords Multimodal representation · Deep neural networks · Semantic feature ·Cross-modal retrieval
� Cheng Wangcheng.wang@hpi.de
Haojin Yanghaojin.yang@hpi.de
Christoph Meinelchristoph.meinel@hpi.de
1 Hasso Plattner Institute, University of Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam,Germany
1 Introduction
Multimodal data has been the subject of much attention in recent years due to the rapid increase of multimedia data on the Web. It brings new challenges for many multimedia analysis tasks such as multimedia retrieval and multimedia content recommendation. Conventional unimodal analytical frameworks that concentrate on domain-specific knowledge are limited in exploring the semantic dependency across modalities and bridging the "semantic gap" [35] between different modalities. In the field of multimedia, multimodal representation learning is becoming increasingly necessary and important, as different modalities typically carry different information. Furthermore, one modality can be a semantic complement for another modality [32] in expressing similar concepts. Many research works, such as [7, 20, 25, 36], have illustrated that multimodal representations can outperform unimodal-representation-based approaches in various applications and have achieved remarkable results.
Learning the correlations in multimodal data is a prevalent approach for handling problems such as multimodal and cross-modal retrieval. Recently, several approaches have been proposed to learn the correlation between image and text data. In image representation, the conventional approach is to represent the image as visual words with the SIFT [21] descriptor. In text representation, bag-of-words (BOW) features and topic features derived from Latent Dirichlet Allocation (LDA) [3] are usually used as textual features. For learning cross-modal and multimodal correlations, one popular approach is to build a joint model that fuses the text and image modalities and discovers the underlying shared concepts across them. Figure 1 shows two example images with associated text. Note that image and text are only loosely related to each other, because not all text words are representative of the corresponding image. The shared semantics between them, however, are key to learning the latent relationships between different content modalities. Although previous works have made significant progress, they remain limited in exploring intra-modal and inter-modal relationships in a higher semantic space. Recent advances in deep learning [16, 17, 25, 39] open up new opportunities in multimodal data representation and modeling. In this paper, we are interested in exploring the highly non-linear "semantic-level" relationships across modalities with deep networks. To perform semantic correlation learning, we utilize deep convolutional neural network (CNN) features as the image (visual) representation. CNN features have demonstrated a powerful ability in image representation compared to traditional hand-crafted visual features derived from SIFT [21], DSIFT [38], SURF [1], and Fisher vectors [29]. For the text modality, we adopt topic features derived from LDA as the textual representation.
Since the visual features are highly abstracted by deep CNNs and the textual features are obtained by computing topic distributions, both are high-level features. Based on these semantic features, we propose a 5-layer neural network that learns a joint model in which visual and textual features are fused. Our framework thus consists of three parts: a visual model for learning image features, a textual model for learning topic features, and a joint model for semantically correlating the features from different modalities. Note that during the training phase we use image-text pairs to train the joint model, which is then used as a feature extractor in multimodal and cross-modal retrieval.
As an instance of multi-task learning [5, 12, 51], our unified framework generalizes to both multimodal and cross-modal retrieval problems. In multimodal retrieval [19, 30, 31], both the image and text modalities are involved, and we use the pre-trained joint model to extract the representation shared across image-text pairs. In cross-modal (unimodal) retrieval, only a single modality is available; here we use the joint model to project images and texts into a common semantic space. Because the different modalities are mapped into this common space, cross-modal retrieval can be performed with standard distance metrics. Our approach can be applied to additional modalities such as audio and video; in this work, however, we investigate the image-text modality pair for multimodal representation learning.

Multimed Tools Appl (2016) 75:9255–9276 9257

Fig. 1 Two examples of image-text pairs: (a) is selected from the Wikipedia dataset ("geography" category), (b) is selected from the MIR Flickr 25K dataset. The shared concept between image and text is the key to multimodal and cross-modal retrieval tasks
Our main novelties and contributions can be summarized as follows:
1. We propose a supervised deep architecture for mapping different modalities into a common feature space. By imposing supervised pre-training as a regularizer, intra-modal and inter-modal relationships are captured better and higher performance is achieved.
2. We investigate deep CNN features as the image representation in multimodal representation learning and explore visual-textual fusion at a higher semantic level. Our work is complementary to existing approaches based on non-deep features.
3. Extensive experiments on open benchmark datasets demonstrate the effectiveness of the proposed framework in multimodal and cross-modal tasks. Detailed comparisons and discussion with related approaches (including deep and non-deep models) are provided. Our experiments show that the proposed approach achieves competitive results on multimodal retrieval and state-of-the-art results on the cross-modal retrieval task.
The rest of this paper is structured as follows. Section 2 reviews recent work on multimodal and cross-modal retrieval. Section 3 describes the learning architecture for image and text representation and the network we propose for learning the multimodal representation. Section 4 presents the training procedure of the proposed network. Section 5 reports the experiments and results verifying the effectiveness of our approach in various retrieval tasks. Section 6 concludes this work and gives an outlook.
2 Related work
Recently, many works have concentrated on multimodal/cross-modal problems. We can divide them into the following categories: (1) CCA-based approaches, (2) topic-model-based approaches, (3) hashing-based approaches, (4) deep-learning-based approaches, and (5) others.
One of the most popular ways to perform cross-modal retrieval is to match the text and image modalities via canonical correlation analysis (CCA) [37]. N. Rasiwasia, J. Costa Pereira et al. [27, 32] proposed semantic correlation matching (SCM), which builds on CCA to learn the maximally correlated subspace. In [33], A. Sharma et al. extended CCA to generalized multiview analysis (GMA), which is useful for cross-view classification
and cross-media retrieval. Unfortunately, CCA and its extensions have difficulty processing unpaired data during training, so such data are not appropriately taken into account.
In [18], a nonparametric Bayesian approach was proposed to learn upstream supervised topic models for analyzing multimodal data. Y. F. Wang et al. [42] introduced a supervised multimodal mutual topic reinforce modeling (M3R) approach that builds a joint cross-modal probabilistic graphical model, aiming to discover mutually consistent semantic topics via appropriate interactions between model factors. Nevertheless, topic-model-based approaches generally require complicated computations.
In [45], supervised coupled dictionary learning with group structures for multi-modal retrieval (SliM2) was proposed; it can be seen as a constrained dictionary learning problem. In [52], J. Zhou et al. proposed an approach named Latent Semantic Sparse Hashing (LSSH) that performs cross-modal similarity search by employing sparse coding and matrix factorization. In [48], a discriminative coupled dictionary hashing (DCDH) method was proposed that aims to preserve not only the intra-similarity but also the inter-correlation among multimodal data.
The recent success of deep learning has advanced the study of modeling multimodal data with deep neural networks. The power of deep learning has been demonstrated in many fields, including image classification [16], speech recognition [9], text detection [13], video classification [44] and multimodal data modeling [8, 24, 25, 34, 36, 41]. In [24], the authors proposed to correlate images and text captions with deep canonical correlation analysis (DCCA), showing that canonical correlation is a very competitive objective not only for shallowly learned features but also in the context of deep learning. In [25], J. Ngiam et al. applied multimodal deep learning to audio-visual speech classification by greedily training restricted Boltzmann machines (RBMs) and deep autoencoders. N. Srivastava et al. [36] proposed a deep Boltzmann machine (DBM) based approach to extract a unified representation from different data modalities; their evaluation shows that this representation is useful for both classification and information retrieval. In [41], the stacked autoencoder was extended to a multimodal stacked autoencoder (MSAE); compared to previous work, it is an effective mapping mechanism that requires little prior knowledge. In [34], weakly-shared Deep Transfer Networks (DTNs) were proposed to translate cross-domain information from text to image. Built on two basic unimodal autoencoders, a correspondence autoencoder (Corr-AE) [8] was proposed, showing that combining representation learning with correlation learning is very effective.
Besides CCA-, topic-model-, hashing- and deep-learning-based approaches, other methods have also been proposed to tackle multimodal/cross-modal problems. J. Yu et al. [47] designed a cross-modal retrieval system that considers image-text statistical correlations. In [49], X. Hua Zhai et al. proposed a cross-modality correlation propagation (CMCP) algorithm that exploits both positive and negative correlations for cross-modal retrieval. In another work, K. Y. Wang et al. [40] combined common subspace learning with coupled feature selection for cross-modal matching: after selecting features from the coupled modalities with an l21-norm, coupled linear regression is used to project the data into a common space. Recent work [43] introduced the Bi-directional Cross-Media Semantic Representation Model (Bi-CMSRM), which performs bi-directional ranking by learning a latent space for both image-query-text and text-query-image retrieval. One of the most recent approaches to cross-modal retrieval is Local Group based Consistent Feature Learning (LGCFL) [15], in which local group prior knowledge is introduced to improve block-based image features (HOG, GIST [26]).
Unfortunately, on the one hand, the features used in previous work are low-level or sophisticated variants of low-level features. We found that most works adopted BOVW (bag of visual words) features (e.g. 128-D [15, 18, 40, 41, 50, 52], 500-D [45], 512-D [23], 1000-D [43, 45, 48] and 4096-D [27] BOVW) to represent images and LDA-based 10-D feature vectors to represent text (limited to the Wikipedia dataset). Although some other features such as GIST [26] and PHOW [4, 38] have been explored in multimodal/cross-modal scenarios, none of those works considered deep CNN features (highly abstracted semantic features) [13, 16] as the image representation. For text representation, two features are commonly adopted: the first is the BOW (bag of words) feature, which generally has high dimensionality; the other is the LDA-based topic distribution, which can be computed for a new document using the parameters of a pre-trained model. On the other hand, most of the projection models are shallow models, which are limited in exploring the high-level semantic correlations across modalities. In this paper, we leverage deep CNN features and topic features as the visual and textual representations, respectively, and train a joint deep model to capture the correlations between them at a higher level of abstraction.
3 Learning architecture
An overview of the proposed architecture is shown in Fig. 2. The overall framework consists of three components: (1) a visual model for image representation learning, (2) a textual model for text representation learning and (3) a joint model for multimodal representation learning; components (1) and (3) involve deep neural networks. To represent the image and text modalities as semantic features, the first step is to train the visual model Mv and the textual model Mt separately. With the pre-trained models, each image-text pair document can then be represented as a pair of semantic features. For a given training dataset S containing N documents, S = {D1, D2, ..., DN}, each image-text pair D = {Ir, Tr} is represented as a feature-level pair D = {I, T}, where I and T denote the visual and textual semantic features extracted from the raw inputs Ir and Tr by the models Mv and Mt, respectively:
Mv(Ir) → I,  Mt(Tr) → T.  (1)
Fig. 2 Deep Semantic Multimodal Learning Framework. The input image-text pair D = {Ir, Tr} is represented as a 4096-D deep CNN feature I ∈ Ψ_I and a 20-D topic or 2000-D tag feature T ∈ Ψ_T, respectively. W^(l) denotes the weights between layers l and l + 1. Both visual and textual features are available during network learning as well as in the multimodal retrieval phase. In cross-modal retrieval, only a single modality is available and the other is initialized with zeros as input; e.g., with the pre-trained joint model J, mapping an image to Π can be performed by J(I) = J(I, 0) → π_i
A multimodal representation can be understood as a joint representation in which features from multiple modalities are combined. To learn such a representation, we propose a regularized deep neural network (RE-DNN) for fusing visual and textual features. The intention of RE-DNN is to learn a joint model J,
J : ΨI , ΨT → Π (2)
where Ψ_I and Ψ_T denote the image and text semantic feature spaces and Π denotes the common semantic space learned by RE-DNN for correlating the different modalities. We can then learn the shared representation π ∈ Π for a given image-text pair {I, T} by
J((I, T), W, b) → π_i  (3)
where I ∈ Ψ_I and T ∈ Ψ_T are the extracted visual and textual features, respectively, and W and b are the weights and biases learned by RE-DNN from the training data. In cross-modal mapping with J, only one modality is available, so, similar to [8, 25], we set the other modality to zero:
J((I, 0), W, b) → π_i,  J((0, T), W, b) → π_t.  (4)
where π_i and π_t denote the features projected into the common semantic space from the visual feature I and the textual feature T, respectively. In our work, π_i and π_t are the pre-activation values of the 5th layer. Since π_i ∈ Π and π_t ∈ Π lie in the same feature space, it is straightforward to calculate the similarity between different modalities.
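A minimal sketch of the zero-filling trick of Eq. (4): the layer sizes follow the Wikipedia setup, but the weights below are random stand-ins for the learned joint model, and the unimodal sub-networks are collapsed into the raw features for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4096-D CNN feature, 20-D topic feature,
# 100-D shared hidden layer, 10 output categories (Wikipedia setup).
D_I, D_T, H, K = 4096, 20, 100, 10

# Randomly initialized stand-ins for the learned joint-model weights.
W3 = rng.normal(0, 0.01, (D_I + D_T, H))   # fusion-layer weights
b3 = np.zeros(H)
W4 = rng.normal(0, 0.01, (H, K))           # output-layer weights
b4 = np.zeros(K)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_project(I=None, T=None):
    """Map an (image, text) pair into the common space Pi.
    A missing modality is replaced by zeros, as in Eq. (4)."""
    I = np.zeros(D_I) if I is None else I
    T = np.zeros(D_T) if T is None else T
    h = sigmoid(np.concatenate([I, T]) @ W3 + b3)
    return h @ W4 + b4          # pre-activation of the top layer

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pi_i = joint_project(I=rng.normal(size=D_I))   # image-only query
pi_t = joint_project(T=rng.normal(size=D_T))   # text-only query
score = cosine_sim(pi_i, pi_t)                 # cross-modal similarity
```

Because both projections land in the same space Π, any of the distance metrics discussed later can be applied to them directly.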
4 Methodology
This section elaborates the training and optimization procedure of the proposed RE-DNN for learning the image-text multimodal joint model.
In our proposed 5-layer DNN, one third of the network handles the image modality only, one third the text modality only, and the final third the multimodal joint modeling. The whole learning process is divided into two phases: (1) supervised pre-training for intra-modal regularization and (2) training of the whole network. Intra-modal regularization is mainly responsible for learning optimal parameters for the first three layers of the DNN. Y. Bengio et al. [2] pointed out that greedy pre-training generally works better than training without pre-training. Since our DNN has inputs from two different modalities, we adopt supervised pre-training to estimate the weights and biases for each modality separately; the training of the whole network is then initialized with these optimized weights and biases. Formally, given N training samples, each preprocessed as a triple {visual feature, textual feature, ground-truth label}, the training set is denoted as D = \{v^{(n)}, t^{(n)}, y^{(n)}\}_{n=1}^{N}, where v = \{v_p\}_{p=1}^{N_I}, t = \{t_q\}_{q=1}^{N_T} and y = \{y_k\}_{k=1}^{K}, i.e. single input vectors v ∈ R^{N_I}, t ∈ R^{N_T} and y ∈ R^K. Let x_i denote the i-th component of the input at layer l−1; the pre-activation value z of the j-th unit at layer l is

z_j^{(l)} = \sum_{i=1}^{N^{(l-1)}} W_{ij}^{(l-1)} x_i + b^{(l-1)}  (5)
where N^{(l-1)} is the number of units at layer l−1, W_{ij} is the weight between the i-th and j-th units, and b is the bias unit. The output of the j-th unit at layer l is then

y_j^{(l)} = f^{(l)}(z_j^{(l)}), \quad l > 1  (6)
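Equations (5) and (6) describe a single fully connected layer; together with the sigmoid and softmax activations used in this section, they can be sketched in NumPy (the layer sizes below are hypothetical stand-ins, not the paper's):

```python
import numpy as np

def sigmoid(z):
    # hidden-layer activation f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-z))

def stable_softmax(z):
    # output activation with the maximum subtracted (epsilon = max(x_k))
    # for numerical stability, as used at the output layer
    e = np.exp(z - np.max(z))
    return e / e.sum()

def layer_forward(x, W, b, activation):
    # Eq. (5): z_j = sum_i W_ij * x_i + b ; Eq. (6): y_j = f(z_j)
    return activation(x @ W + b)

rng = np.random.default_rng(1)
x = rng.normal(size=8)                          # hypothetical 8-D input
h = layer_forward(x, rng.normal(0, 0.1, (8, 5)), np.zeros(5), sigmoid)
y = layer_forward(h, rng.normal(0, 0.1, (5, 3)), np.zeros(3), stable_softmax)
```

Stacking such layers with the appropriate activations reproduces the forward pass of one unimodal branch of the network.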
In our work, the sigmoid function f(x) = 1/(1 + e^{-x}) is used as the activation function for all hidden layers (l = 2, 3, 4), and the softmax function

f(x_k) = \frac{e^{x_k - \epsilon}}{\sum_{k'=1}^{K} e^{x_{k'} - \epsilon}}, \quad \epsilon = \max_k(x_k),

is used to activate the output layer (l = 5). The intra-modal regularization problem is to minimize the overall error of the training samples for each modality:

\operatorname{argmin}_{\theta^*} C_m = \frac{1}{2N} \sum_{n=1}^{N} \| \hat{y}_m^{(n)} - y^{(n)} \|^2 + \frac{\lambda}{2} \sum_{l=1}^{L_m - 1} \| W^l \|_F^2  (7)

where the parameter space is \theta^* = \{W_m^{(l)}, b_m^{(l)}\}, m ∈ \{v, t\} and L_m = 3. In back propagation we use the ground-truth semantics to regulate the weights and to find the optimal parameters W_t^{(1)}, W_t^{(2)}, b_t^{(1)} and b_t^{(2)} for the text modality, and W_v^{(1)}, W_v^{(2)}, b_v^{(1)} and b_v^{(2)} for the image modality, at layers l2 and l3 respectively. The intra-modal regularization is introduced to reduce noisy features and to preserve the intrinsic, representative features of each modality before fusing them at the shared hidden layers. The second term of equation (7) is a weight-decay term used to prevent overfitting during training; λ is the weight-decay parameter. The whole network is then trained starting from the optimal weights and biases θ* learned by intra-modal regularization, while W^{(3)}, W^{(4)}, b^{(3)} and b^{(4)} are randomly initialized as in standard training. The objective is to learn a parameter space \Omega^* = \{W_m^{(l)}, W^{(l)}, b_m^{(l)}, b^{(l)}\} by

\operatorname{argmin}_{\Omega^*} C(\theta^*) = C_m + \frac{1}{2N} \sum_{n=1}^{N} \| \hat{y}^{(n)} - y^{(n)} \|^2 + \frac{\lambda}{2} \sum_{l=1}^{L-1} \| W^l \|_F^2  (8)
The back propagation of the whole network differs from standard back propagation in that the derivatives of the weights and biases for the different modalities must be considered separately. The supervised pre-training procedure for intra-modal regularization is summarized in Algorithm 1, and the procedure for whole-network learning in Algorithm 2.
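The regularized objectives (7) and (8) share the same shape: a mean squared error plus a Frobenius-norm weight-decay term. A minimal NumPy sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def objective(preds, targets, weights, lam):
    """Squared-error cost with Frobenius-norm weight decay, mirroring
    Eq. (7)/(8): C = 1/(2N) * sum_n ||y_hat^(n) - y^(n)||^2
                   + lam/2 * sum_l ||W_l||_F^2."""
    N = len(targets)
    data_term = sum(np.sum((p - t) ** 2) for p, t in zip(preds, targets)) / (2 * N)
    decay_term = (lam / 2) * sum(np.sum(W ** 2) for W in weights)
    return data_term + decay_term
```

For Eq. (7) the sum over weights covers only one unimodal branch (L_m = 3 layers), while for Eq. (8) it covers all layers of the full network.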
5 Experiments
This section describes the experiments conducted on two benchmark datasets, the Wikipedia dataset^1 and MIR Flickr 25K,^2 and compares the proposed approach with state-of-the-art methods for multimodal and cross-modal retrieval.
5.1 Experiment setups
5.1.1 Dataset descriptions
The first dataset used in our experiments is the Wikipedia dataset [32], drawn from Wikipedia's "featured articles". This is a continually updated collection of 2866 labeled documents, each composed of an image and a corresponding text description. The dataset covers 10 semantic categories such as sport, art and history, and is randomly split into a training set of 2173 documents and a test set of 693 documents. The authors of [32] also published extracted feature data, namely 10-D topic features for text and 128-D SIFT features for images. In this paper, we represent each text as a 20-D topic feature derived from LDA and each image as a 4096-D deep CNN feature.

1 http://www.svcl.ucsd.edu/project/crossmodal/
2 http://press.liacs.nl/mirflickr/
The second dataset is MIR Flickr 25K [11], recently introduced for evaluating image retrieval and multimodal/cross-modal retrieval. It consists of 25000 images downloaded from Flickr,^3 each accompanied by tags; the average number of tags per image is 8.94. However, some tags are weakly labeled and therefore not actually relevant to the image. The dataset covers 38 semantic categories such as sky, flower and food, and each image may belong to multiple categories. In this work, 15K images were randomly selected for training and the remaining 10K for testing; 5K images of the test set were randomly selected as the retrieval database and another 1K non-overlapping test images as queries. For the representation, we adopted the 2000-D tag features from [36] as the text feature and generated 4096-D deep CNN features as the image representation.
For the images in both datasets, we adopted the Caffe reference model^4 [14], trained on the ImageNet ILSVRC12 dataset [6] (1.2M training images), for deep CNN feature extraction; the features were taken from the model's 7th (fully connected) layer. To further verify the effectiveness of our approach and to compare with deep models, we also conducted a group of experiments on MIR Flickr 25K in which the deep CNN features were replaced by the 3857-D features used in [36] (a combination of PHOW [4], GIST [26] and MPEG-7 descriptors [22], among others).
5.1.2 Configurations
For visual feature extraction, we ran the Caffe framework on Ubuntu 12.04 with an Nvidia GTX 780 GPU with 3 GB of memory. Our textual model was trained on Ubuntu 12.04 with an Intel 3.20 GHz × 4 CPU and 8 GB of RAM. The joint model for multimodal representation learning was trained in Matlab on Windows 8 with an Intel 3.20 GHz × 4 CPU and 8 GB of RAM.
In the DNN, the first three layers are designed for intra-modal regularization of both the image and text modalities. The unimodal networks have the layout [N/100/M] (N: the dimensionality of the image or text feature input; M: the number of categories; e.g. the image network is [4096/100/10] for deep CNN features). The last three layers are set to [2M/100/M] for exploring inter-modal correlations across modalities. In our experiments, the learning rates α = α_m = 0.001 with momentum 0.9 achieved the best performance on the Wikipedia dataset, and α = α_m = 0.01 with momentum 0.9 on MIR Flickr 25K. Given the scale of our training data (2173 training samples), we adopted mini-batch gradient descent with batch size 41 for Wikipedia and 100 for MIR Flickr 25K. For both datasets, the number of epochs was fixed at K = 200 and the weight-decay parameter at λ = 10^{-4}.
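The optimization settings above can be illustrated with a generic mini-batch gradient descent loop with momentum; this is a sketch under our own naming, with the gradient computation left to a user-supplied function:

```python
import numpy as np

def minibatch_sgd(grad_fn, params, data, batch_size=41, lr=0.001,
                  momentum=0.9, epochs=200):
    """Mini-batch gradient descent with momentum, using the paper's
    Wikipedia defaults (alpha = 0.001, momentum = 0.9, batch size 41,
    K = 200 epochs). `grad_fn(params, batch)` is a user-supplied
    gradient function; everything else is an illustrative sketch."""
    velocity = [np.zeros_like(p) for p in params]
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            grads = grad_fn(params, batch)
            for p, v, g in zip(params, velocity, grads):
                v *= momentum                     # accumulate velocity
                v -= lr * g
                p += v                            # in-place parameter update
    return params
```

For MIR Flickr 25K the corresponding settings would be `lr=0.01` and `batch_size=100`.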
3 https://www.flickr.com/
4 https://github.com/BVLC/caffe/tree/master/models/
5.1.3 Evaluation metrics
To compare with previous work, we also adopt mean average precision (mAP) to evaluate retrieval performance. For each query, assume N documents are retrieved, among which N_test are relevant. The average precision is computed as

AP = \frac{1}{N_{test}} \sum_{i=1}^{N} p(i)\, r(i),  (9)

where p(i) is the precision of the top i retrieved documents and r(i) denotes the relevance at rank i: r(i) = 1 if the i-th retrieved document is relevant to the query and r(i) = 0 otherwise. The mAP over N_q queries is then

mAP = \frac{1}{N_q} \sum_{n=1}^{N_q} AP_n.  (10)
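Equations (9) and (10) can be computed directly from a ranked list of 0/1 relevance judgments; a small sketch with our own helper names:

```python
def average_precision(relevance, n_relevant):
    """AP as in Eq. (9): p(i) is the precision at rank i, r(i) the 0/1
    relevance of the i-th retrieved document, and the sum is normalized
    by the number of relevant documents."""
    hits, score = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / i          # p(i) * r(i), nonzero only when relevant
    return score / n_relevant

def mean_average_precision(queries):
    # Eq. (10): mean of the per-query AP values
    aps = [average_precision(rel, n) for rel, n in queries]
    return sum(aps) / len(aps)
```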
Because the different modalities are projected into a common semantic space by the deep architecture, it is natural to adopt distance metrics for multimodal and cross-modal retrieval. For the distance calculation, we adopted the four metrics used in [27]: Euclidean distance (Euclidean), Kullback-Leibler divergence (KL), cosine distance (Cosine) and normalized correlation (NC).
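The four metrics can be sketched as follows; note that the exact normalized-correlation formula of [27] is not restated in this paper, so the Pearson-style mean-centered version below is an assumption:

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def kl_divergence(p, q, eps=1e-12):
    # KL divergence for distribution-like features
    # (non-negative entries; clipped to avoid log(0))
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalized_correlation(a, b):
    # assumed Pearson-style correlation of mean-centered vectors
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

KL divergence is asymmetric and only meaningful for distribution-like features; the other three apply to arbitrary real-valued vectors.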
5.2 Experiments on the Wikipedia dataset
5.2.1 Effects of distance metrics
This set of experiments explores the effect of the distance metric on retrieval performance. Table 1 shows the mAP of different query types on the Wikipedia dataset. We primarily consider three query types: multimodal query (Q_{I+T}), image query (Q_I) and text query (Q_T), as well as the average performance of image and text queries ((Q_I + Q_T)/2). Since NC achieved the best performance on the Wikipedia dataset, we applied NC as the distance metric in the comparison with recent work.
5.2.2 Multimodal retrieval
We further tested our approach on multimodal image retrieval. Similar to [28], the text data within a document serve as complementary semantic information for improving image retrieval performance. In line with previous work, we adopt mean average precision (mAP) and Precision-Recall curves (P-R curves) to evaluate the overall performance. Figure 3 shows the per-category mAP obtained by
Table 1 mAP of different distance metrics (Wikipedia)

            Q_{I+T}   Q_I      Q_T      (Q_I+Q_T)/2
KL          0.6243    0.3439   0.272    0.308
Cosine      0.6384    0.3439   0.3175   0.3307
Euclidean   0.6146    0.3403   0.2581   0.2992
NC          0.6395    0.3404   0.3526   0.3465
Fig. 3 Per-category mAP (Wikipedia)
RE-DNN with different distance metrics. Table 2 presents the mAP of our approach and related work on the multimodal retrieval task. The best result of RE-DNN is 0.6395, obtained with NC, which is comparable to the state-of-the-art result on the Wikipedia dataset.
5.2.3 Unimodal retrieval
As mentioned before, in unimodal retrieval only one modality is available in the retrieval phase. The query image (text) and the text (image) database are mapped into the common semantic feature space, respectively. Our experiments consider (1) using text to query images and (2) vice versa.
Figure 4 presents the mAP performance against the retrieval scope, i.e. the value of k_t in image queries and k_i in text queries. We initialized k_t = k_i = 2 and increased them in steps of 2 up to 693 (all test samples). In text queries, the NC distance metric improves mAP more strongly with increasing k_i than the other distance metrics, whereas in image queries all distance metrics behave similarly as k_t increases. This indicates that NC is more appropriate for cross-modal retrieval based on high-level semantic features.
Table 3 summarizes the comparison between our approach and recent work at two retrieval scopes, k_t = k_i = 8 and k_t = k_i = 50. At k_t = k_i = 8, we achieve results comparable to PFAR [23], outperforming it on image queries and on the average mAP score; the best mAP is obtained with the NC distance metric. Similarly, we compared RE-DNN to SliM2 [45], for which the retrieval scope was set to k_i = k_t = 50. The best results of SliM2 [45] are 0.2548 for image queries and 0.2025 for text queries. At that scale,
Table 2 mAP of multimodalretrieval(Wikipedia)
QI+T
Multi-Modal SGM RF [46] 0.641
Multi-Modal SGM Gaussian [46] 0.581
RIS [28] 0.356
TTI [10] 0.323
RE-DNN 0.6395
Fig. 4 mAP performance against retrieval scope (Wikipedia)
our image query results range from 0.2803 (Euclidean) to 0.2854 (KL), outperforming SliM2. The text query mAP scores are 0.1455 (KL), 0.1983 (Cosine), 0.1428 (Euclidean) and 0.2416 (NC). RE-DNN thus outperforms SliM2 for all query types by a certain margin. Overall, the proposed RE-DNN with the NC metric achieves state-of-the-art results at both retrieval scopes. Figure 5a and b present Precision-Recall (P-R) curves for image queries and text queries, respectively. In both cases we considered the whole retrieval scope (k_i = k_t = 693) and compared with SCM [27], CMTC [47] and CMCP [49].^5 For both image and text queries, our approach obtains better precision than CMCP and CMTC at most recall levels. Although RE-DNN does not outperform SCM on image queries, it performs better than SCM on text queries, with higher precision at almost all recall levels.
Table 4 shows a further comparison of mAP performance between RE-DNN and SCM [32], CMTC [47], CMCP [49], LCFS [40], Bi-CMSRM [43] and Corr-Full-AE [8]. Our approach achieves competitive results on text queries (0.3526) and on the average mAP score (0.3465). It should be noted that an exact comparison with LCFS [40], Bi-CMSRM [43] and Corr-Full-AE [8] is difficult because the training/test splits in those works differ from ours. We strictly followed the scheme of [32, 47, 49] with 2173 training and 693 test documents, whereas [40] used a 1300/1566 training/test split, [43] used a 1500/500/866 training/validation/test split and Corr-Full-AE [8] used a 2173/462/231 training/validation/test split. In addition, the retrieval scope in [8] was set to k_i = k_t = 50, which differs from our setting (k_i = k_t = 693). The redivided datasets were unfortunately not published, so a direct comparison with [40, 43] is not possible. Nevertheless, our approach achieves results similar to [8] and outperforms [40, 43] by a large margin.
5.3 Experiments on MIR Flickr 25K dataset
This section presents the comparison with deep models on multimodal and unimodal retrieval. We first compare our approach to related work. Then, we conduct a group of
5 The P-R values were read from the graphs
Table 3 mAP performance comparison at different retrieval scopes (Wikipedia)

              k_t = k_i = 8                  k_t = k_i = 50
              Q_I      Q_T     (Q_I+Q_T)/2   Q_I      Q_T     (Q_I+Q_T)/2
PFAR [23]     0.298    0.273   0.286         −        −       −
SliM2 [45]    −        −       −             0.2548   0.2025  0.2287
RE-DNN        0.3519   0.2300  0.291         0.2815   0.2416  0.2616
experiments in which the deep CNN features are replaced by the image features used in [36], in order to verify the effectiveness of the proposed RE-DNN without deep CNN features. The features used in [36] are 3857-D features mixed from 2000-D Pyramid Histogram of Words (PHOW) features, 960-D GIST features, 256-D Color Structure Descriptor features and others; we refer to this mixture as the "PHOW feature" because of its dominant share. Since some images belong to more than one category, we follow [36] in the retrieval procedure: if a query and a retrieved item have overlapping category labels, the retrieved item is considered relevant to the query.
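The label-overlap relevance rule used on this multi-label dataset can be stated in a few lines (a sketch with hypothetical helper names):

```python
def is_relevant(query_labels, item_labels):
    """MIR Flickr relevance rule from [36]: a retrieved item counts as
    relevant if it shares at least one category label with the query."""
    return bool(set(query_labels) & set(item_labels))
```

This predicate supplies the r(i) values fed into the AP computation of Eq. (9).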
5.3.1 Effects of distance metrics
First, we report the results of our approach (RE-DNN with deep CNN features) with different distance metrics, to explore the sensitivity of the learned joint features to the choice of metric. As shown in Table 5, in multimodal queries Euclidean and KL perform slightly worse than Cosine and NC, while in unimodal queries all distance metrics achieve similar results. To make a fair comparison, we therefore selected the Cosine distance metric, as in [36], for the remaining experiments.
Fig. 5 Precision-Recall and average mAP performance comparison
Table 4 Comparison of mAP performance (Wikipedia)

                     Q_I      Q_T      (Q_I+Q_T)/2
CCA [27]             0.21     0.174    0.192
SM [27]              0.350    0.249    0.300
SCM [27]             0.362    0.273    0.318
CMTC [47]            0.293    0.232    0.266
CMCP [49]            0.326    0.251    0.289
LCFS [40]∗           0.2798   0.2141   0.2470
Bi-CMSRM [43]∗       0.2528   0.2123   0.2326
Corr-Full-AE [8]∗    0.335    0.368    0.352
RE-DNN               0.3404   0.3526   0.3465
5.3.2 Multimodal retrieval
This subsection compares RE-DNN with other deep models, namely the Autoencoder [25], DBM (Deep Boltzmann Machine) [36] and DBN (Deep Belief Network) [36], on multimodal retrieval. In contrast to those works, our architecture trains the joint model in a supervised manner.
Figure 6a shows the precision-recall curves of RE-DNN and the compared approaches in multimodal retrieval. Our approach consistently improves precision at all recall levels by a large margin compared to the other deep models. From Table 6a, we can see that RE-DNN achieves an mAP of 0.719, which significantly outperforms the other deep models and constitutes a state-of-the-art result for multimodal retrieval on the MIR Flickr 25K dataset. Since the Autoencoder and DBM are unsupervised architectures, this experiment also shows that a supervised DNN architecture is more capable of capturing the relationships across modalities and learning a joint representation without additional pre-training data.
5.3.3 Unimodal retrieval
Figure 6b compares RE-DNN with the deep models DBN and DBM [36] in terms of unimodal retrieval performance. It shows that RE-DNN consistently improves precision on image queries. The best mAP of RE-DNN is 0.677, which significantly outperforms image-DBN (0.578), image-DBM (0.587) and multimodal-DBM (0.614).
Table 5 mAP of different distance metrics (MIR Flickr 25K)

            Q_{I+T}   Q_I      Q_T      (Q_I+Q_T)/2
KL          0.7048    0.6548   0.5919   0.6234
Cosine      0.7191    0.6771   0.5760   0.6269
Euclidean   0.7029    0.6663   0.5816   0.624
NC          0.7178    0.6789   0.5755   0.6272
Fig. 6 Precision-Recall curves on MIR Flickr 25K
5.3.4 Experiment with PHOW feature
Deep CNN features serve as the image features for learning the joint model that captures the correlations across modalities. To verify the effectiveness and generality of the proposed RE-DNN, this experiment explores the performance of the RE-DNN architecture with traditional features. As mentioned above, we adopted the features used in [36], which we refer to as the "PHOW" feature. We replaced the 4096-D deep CNN feature with the 3857-D PHOW feature and kept the other configurations identical to the previous experiments (the only difference being the input layer of the image network, which is set to 3857). We re-trained the image network and the joint model for mapping the different modalities into the common semantic space. The results are reported in Fig. 6a for multimodal queries and Fig. 6b for unimodal queries. The precision-recall curves in Fig. 6a show that RE-DNN with PHOW achieves higher precision than DBM at most recall levels, but is inferior to RE-DNN with deep CNN features in multimodal queries. Similarly, in unimodal queries, RE-DNN with PHOW shows slight improvements over DBM. As shown in Table 6, with
Table 6 Comparison of mAP with deep models

(a) Multimodal query
Methods                     mAP
DBN [36]                    0.609
Autoencoder [25]            0.612
DBM [36]                    0.622
RE-DNN (PHOW feature)       0.648
RE-DNN (CNN feature)        0.719

(b) Unimodal query
Methods                     mAP
Image-DBN [36]              0.578
Image-DBM [36]              0.587
Multimodal-DBM [36]         0.614
RE-DNN (PHOW feature)       0.632
RE-DNN (CNN feature)        0.677
Fig. 7 Illustrative examples of multimodal image retrieval. Textual data serve as auxiliary, semantically complementary information in image retrieval. Each row presents an example; the query image is marked by a red bounding box and the top-4 retrieved results follow
the PHOW feature, RE-DNN achieves an mAP of 0.648 for multimodal queries and 0.632 for image queries. This further underlines the capability of deep CNN features for multimodal and unimodal retrieval, which had rarely been explored in previous work.
5.4 Illustrative examples
This subsection gives some examples of multimodal retrieval on the Wikipedia and MIR Flickr 25K datasets (see Figs. 7 and 8) and of cross-modal retrieval on the Wikipedia dataset (see Fig. 9).
Fig. 8 Multimodal queries on MIR Flickr 25K. The query images are marked with a red bounding box; the top-4 retrieved results are shown subsequently
Fig. 9 Illustrative examples of image and text queries. Top-left: image or text query and the corresponding semantic category probability distribution. Top-right: ground-truth image or text of the query with its corresponding semantic category probability distribution. Middle: top-4 retrieved images or texts. Bottom: semantic category probability distributions of the retrieved images or texts
In Fig. 7, two exemplary results of multimodal image retrieval are illustrated. In contrast to traditional content-based image retrieval, in which only visual features are involved, our approach can explore the deeper semantic relations between different modalities. It elevates retrieval to the "semantic level" by leveraging textual information as a semantic complement. We can see that some visually dissimilar images are found because they share similar hidden topics or concepts with the query image. More specifically, in the first example the images share a concept in biology, while in the second example they share the concepts of war and warfare.
Figure 8 presents two multimodal query examples. We can see that the top-4 retrieved results share similar visual or textual concepts with the query. For example, all entries in the first query share the concepts flower, nature and color. As mentioned before, we observed that some ground-truth tags are not exactly related to the corresponding image, such as d50, d80 and nikond50. This further demonstrates that our approach is able to retain the intrinsic essential information and to remove noise in the feature learning phase.
Figure 9 presents some examples of image queries and text queries. In both cases, the top-4 most relevant results are returned. We use red histograms to represent queries and green histograms to represent the ground-truth text corresponding to an image query. Wine-colored histograms represent the retrieved results. We also display the probability scores for the queries, the ground truth and the retrieved results. Each histogram is computed from features derived from the output of the 4th layer of RE-DNN. For the image query, we note that all displayed texts are correctly retrieved, since all of them are perceived as belonging to semantic category 9 (sport). Similarly, all results for the text query are also correct, because the probabilities of all results on semantic category 3 (geography) are around 0.6, which is consistent with the semantics of both the query text and the ground-truth image.
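The retrieval behind these examples is a nearest-neighbor search over category-probability vectors in the common semantic space. The following minimal sketch (the cosine-similarity choice and all function names are our own illustrative assumptions; the paper does not publish reference code) ranks candidates of either modality by their similarity to the query vector:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two category-probability vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve_top_k(query_vec, candidate_vecs, k=4):
    """Rank candidates in the common semantic space by similarity to the query.

    query_vec:      (M,) category-probability vector of the query modality
    candidate_vecs: (N, M) vectors of the other (or same) modality
    Returns the indices of the top-k most similar candidates.
    """
    scores = np.array([cosine_sim(query_vec, c) for c in candidate_vecs])
    return np.argsort(-scores)[:k]

# Toy example with M = 3 semantic categories and 5 candidates.
query = np.array([0.1, 0.2, 0.7])
cands = np.array([
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],    # semantically close to the query
    [0.3, 0.4, 0.3],
    [0.05, 0.15, 0.8],  # semantically closest to the query
    [0.9, 0.05, 0.05],
])
top = retrieve_top_k(query, cands, k=2)
```

Because both modalities are mapped into the same M-dimensional space, the identical ranking routine serves multimodal, image-to-text and text-to-image queries.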
5.5 Discussion
5.5.1 Relations to shallow models
Different from the previous shallow models in [23, 27, 40, 43, 45, 47, 48], RE-DNN captures two relevant relationships with a deep neural network: the intra-modal relationship (between feature and semantics) and the inter-modal relationship (between image and text). This work is also an attempt to adopt deep CNN features for the cross-modal problem at a higher semantic level. The image representations used in previous work are hand-crafted features such as SIFT [21], GIST [26] and PHOW [4]. We propose to apply a robust CNN model trained on a large-scale dataset such as ImageNet (1.2M images) to extract image features at a higher semantic level (7th layer). The visual and textual features are then fused in a supervised way, which achieves state-of-the-art results on both evaluation datasets.
Another advantage of RE-DNN is that it handles the modality-missing problem robustly, whereas most of the compared approaches have difficulty processing unpaired data. We suggest that by setting the other modality (e.g. text) to zero (see Section 3), we are able to initialize the whole network without this modality. By doing this, for a unimodal input, RE-DNN can infer the missing modality from the pre-trained joint model in which the correlations between the modalities have been stored. By exploiting this property of RE-DNN, we perform cross-modal retrieval and achieve the best result to date.
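The zero-filling strategy for unpaired data can be sketched as follows (the dimensions match the features used in this paper; the function name and the concatenation layout are our own assumptions for illustration):

```python
import numpy as np

def joint_input(image_feat=None, text_feat=None, img_dim=4096, txt_dim=2000):
    """Build the joint-network input, zero-filling whichever modality is absent.

    For a unimodal query, the missing modality is replaced by a zero vector,
    so the pre-trained joint model can still be evaluated and its stored
    cross-modal correlations used to infer the missing side.
    """
    img = np.zeros(img_dim) if image_feat is None else np.asarray(image_feat)
    txt = np.zeros(txt_dim) if text_feat is None else np.asarray(text_feat)
    return np.concatenate([img, txt])

# Image-only query: the text part of the joint input is all zeros.
x = joint_input(image_feat=np.ones(4096))
```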
5.5.2 Relations to deep models
Different from the recently introduced deep model in [25], which focuses on video-audio data fusion, our work addresses a very different pair of modalities: the image-long-text (or text-tag) data
fusion problem. Furthermore, we train our network in a supervised manner, while the works of [8, 25, 36] are unsupervised architectures. For instance, in [36], the image network is designed as [3857/1024/1024], the text network as [2000/1024/1024], and the joint layer has 2048 units. Our architecture, in contrast, is designed as follows: the image network is [N/100/M] and the text network is [N/100/M], where N denotes the dimensionality of the input feature and M the number of categories. The two joint layers are designed as [2M/100/M]; for more details about the network configuration, refer to Section 5.1.2. Compared to the network structure in [36], we use fewer hidden units, which leads to fewer parameters to learn. Overall, our approach is superior in both model complexity and retrieval performance.
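Under the layer sizes just described, the RE-DNN topology can be sketched in plain NumPy (this is our own illustrative reconstruction; the paper does not publish reference code, and the tanh activation and random initialization here are assumptions):

```python
import numpy as np

def init_layers(sizes, rng):
    # One (W, b) pair per consecutive layer pair, e.g. sizes = [N, 100, M].
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    # Simple MLP forward pass (the tanh activation is an assumption).
    for W, b in layers:
        x = np.tanh(W @ x + b)
    return x

rng = np.random.default_rng(0)
N_img, N_txt, M = 4096, 2000, 10               # feature dims, category count
image_net = init_layers([N_img, 100, M], rng)  # image network [N/100/M]
text_net  = init_layers([N_txt, 100, M], rng)  # text network  [N/100/M]
joint_net = init_layers([2 * M, 100, M], rng)  # joint layers  [2M/100/M]

# Each modality is mapped to an M-dimensional semantic code, then fused.
img_code = forward(rng.standard_normal(N_img), image_net)
txt_code = forward(rng.standard_normal(N_txt), text_net)
joint_out = forward(np.concatenate([img_code, txt_code]), joint_net)
```

The shape bookkeeping makes the complexity argument concrete: with 100-unit hidden layers and M output units per sub-network, the parameter count is far below that of the [3857/1024/1024]-style networks in [36].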
6 Conclusion
We have introduced a unified DNN framework for multimodal representation learning. By extracting 4096-D visual features with a deep CNN and 20-D (or 2000-D) textual features, the correlation across modalities is explored with the proposed RE-DNN. Through supervised pre-training, RE-DNN can capture both intra-modal and inter-modal relationships at a higher semantic level. Our experimental studies on open benchmarks show that RE-DNN outperforms the alternative approaches and achieves state-of-the-art performance on multimodal image retrieval and cross-modal retrieval.
Our future work will focus on the optimization problems of learning DNNs for various multimodal modeling tasks. A more reasonable and effective framework will be developed based on the current study. Besides, considering that our framework can easily be applied to other multimodal scenarios, it would also be interesting to apply RE-DNN to bi-modal fusion or cross-modal mapping for video-text and audio-video modality pairs.
References
1. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-up robust features (SURF). Comput Vis Image Underst 110(3):346–359
2. Bengio Y, Lamblin P, Popovici D, Larochelle H et al (2007) Greedy layer-wise training of deep networks. Adv Neural Inf Process Syst 19:153
3. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
4. Bosch A, Zisserman A, Munoz X (2007) Image classification using random forests and ferns. In: IEEE 11th international conference on computer vision, ICCV 2007. IEEE, pp 1–8
5. Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
6. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255
7. Escalante HJ, Hernadez CA, Sucar LE, Montes M (2008) Late fusion of heterogeneous methods for multimedia image retrieval. In: Proceedings of the 1st ACM international conference on multimedia information retrieval. ACM, pp 172–179
8. Feng F, Wang X, Li R (2014) Cross-modal retrieval with correspondence autoencoder. In: Proceedings of the ACM international conference on multimedia. ACM, pp 7–16
9. Hinton G, Deng L, Yu D, Dahl GE, Mohamed A-R, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97
10. Hoffman J, Rodner E, Donahue J, Darrell T, Saenko K (2013) Efficient learning of domain-invariant image representations. arXiv:1301.3224
11. Huiskes MJ, Lew MS (2008) The MIR Flickr retrieval evaluation. In: Proceedings of the 1st ACM international conference on multimedia information retrieval. ACM, pp 39–43
12. Jacob L, Vert J-P, Bach FR (2009) Clustered multi-task learning: a convex formulation. Adv Neural Inf Process Syst, pp 745–752
13. Jaderberg M, Vedaldi A, Zisserman A (2014) Deep features for text spotting. In: ECCV. Springer, Berlin, pp 512–528
14. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: convolutional architecture for fast feature embedding. arXiv:1408.5093
15. Kang C, Xiang S, Liao S, Xu C, Pan C (2015) Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Trans Multimedia 17(3):370–381
16. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst, pp 1097–1105
17. Lee H, Grosse R, Ranganath R, Ng AY (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 609–616
18. Liao R, Zhu J, Qin Z (2014) Nonparametric bayesian upstream supervised multi-modal topic models. In: Proceedings of the 7th ACM international conference on Web search and data mining. ACM, pp 493–502
19. Lienhart R, Romberg S, Horster E (2009) Multilayer pLSA for multimodal image retrieval. In: Proceedings of the ACM international conference on image and video retrieval. ACM, p 9
20. Liu D, Lai K-T, Ye G, Chen M-S, Chang S-F (2013) Sample-specific late fusion for visual category recognition. In: IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 803–810
21. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
22. Manjunath BS, Ohm J-R, Vasudevan VV, Yamada A (2001) Color and texture descriptors. IEEE Trans Circuits Syst Video Technol 11(6):703–715
23. Mao X, Lin B, Cai D, He X, Pei J (2013) Parallel field alignment for cross media retrieval. In: Proceedings of the 21st ACM international conference on multimedia. ACM, pp 897–906
24. Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: IEEE conference on computer vision and pattern recognition (CVPR). IEEE
25. Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 689–696
26. Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175
27. Pereira JC, Coviello E, Doyle G, Rasiwasia N, Lanckriet GRG, Levy R, Vasconcelos N (2014) On the role of correlation and abstraction in cross-modal multimedia retrieval. IEEE Trans Pattern Anal Mach Intell 36(3):521–535
28. Pereira JC, Vasconcelos N (2014) Cross-modal domain adaptation for text-based regularization of image semantics in image retrieval systems. Comput Vis Image Underst 124:123–135
29. Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR 2007). IEEE, pp 1–8
30. Pham T-T, Maillot NE, Lim J-H, Chevallet J-P (2007) Latent semantic fusion model for image retrieval and annotation. In: Proceedings of the sixteenth ACM conference on information and knowledge management. ACM, pp 439–444
31. Pulla C, Jawahar CV (2010) Multi modal semantic indexing for image retrieval. In: Proceedings of the ACM international conference on image and video retrieval. ACM, pp 342–349
32. Rasiwasia N, Pereira JC, Coviello E, Doyle G, Lanckriet GRG, Levy R, Vasconcelos N (2010) A new approach to cross-modal multimedia retrieval. In: Proceedings of the international conference on multimedia. ACM, pp 251–260
33. Sharma A, Kumar A, Daume H, Jacobs DW (2012) Generalized multiview analysis: a discriminative latent space. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 2160–2167
34. Shu X, Qi G-J, Tang J, Wang J (2015) Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation. In: Proceedings of the 23rd annual ACM conference on multimedia. ACM, pp 35–44
35. Smeulders AWM, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 22(12):1349–1380
36. Srivastava N, Salakhutdinov R (2012) Multimodal learning with deep boltzmann machines. Adv Neural Inf Process Syst, pp 2222–2230
37. Thompson B (2005) Canonical correlation analysis. In: Encyclopedia of statistics in behavioral science
38. Vedaldi A, Fulkerson B (2010) Vlfeat: an open and portable library of computer vision algorithms. In: Proceedings of the international conference on multimedia. ACM, pp 1469–1472
39. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408
40. Wang K, He R, Wang W, Wang L, Tan T (2013) Learning coupled feature spaces for cross-modal matching. In: IEEE international conference on computer vision (ICCV). IEEE, pp 2088–2095
41. Wang W, Ooi BC, Yang X, Zhang D, Zhuang Y (2014) Effective multi-modal retrieval based on stacked auto-encoders. Proceedings of the VLDB Endowment 7(8):649–660
42. Wang Y, Wu F, Song J, Li X, Zhuang Y (2014) Multi-modal mutual topic reinforce modeling for cross-media retrieval. In: Proceedings of the ACM international conference on multimedia. ACM, pp 307–316
43. Wu F, Lu X, Zhang Z, Yan S, Rui Y, Zhuang Y (2013) Cross-media semantic representation via bi-directional learning to rank. In: Proceedings of the 21st ACM international conference on multimedia. ACM, pp 877–886
44. Wu Z, Jiang Y-G, Wang J, Pu J, Xue X (2014) Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In: Proceedings of the ACM international conference on multimedia. ACM, pp 167–176
45. Wu F, Zhang Y, Lu WM, Zhuang YT, Wang YF (2013) Supervised coupled dictionary learning with group structures for multi-modal retrieval. In: Twenty-seventh AAAI conference on artificial intelligence
46. Xie L, Pan P, Lu Y (2013) A semantic model for cross-modal and multi-modal retrieval. In: Proceedings of the 3rd ACM conference on international conference on multimedia retrieval. ACM, pp 175–182
47. Yu J, Cong Y, Qin Z, Wan T (2012) Cross-modal topic correlations for multimedia retrieval. In: Proceedings of the 21st international conference on pattern recognition (ICPR). IEEE, pp 246–249
48. Yu Z, Wu F, Yang Y, Tian Q, Luo J, Zhuang Y (2014) Discriminative coupled dictionary hashing for fast cross-media retrieval. In: Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval. ACM, pp 395–404
49. Zhai X, Peng Y, Xiao J (2012) Cross-modality correlation propagation for cross-media retrieval. In: IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2337–2340
50. Zhai X, Peng Y, Xiao J (2013) Cross-media retrieval by intra-media and inter-media correlation mining. Multimedia Systems 19(5):395–406
51. Zhang Y, Yeung D-Y (2012) A convex formulation for learning task relationships in multi-task learning. In: UAI
52. Zhou J, Ding G, Guo Y (2014) Latent semantic sparse hashing for cross-modal similarity search. In: Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval. ACM, pp 415–424
Cheng Wang is currently a Ph.D. student at the Chair of Internet Technologies and Systems, Hasso-Plattner-Institute (HPI), University of Potsdam, Germany. Prior to HPI, he received his Master of Science degree from Sichuan University, China, in 2013 and his Bachelor of Management from Shandong Jianzhu University, China, in 2010. In 2015, he had short-term research visits at the Dept. of CS, University of Cape Town, South Africa, and at the Israel Institute of Technology, Israel. His research interests include deep learning, artificial intelligence, machine learning and multimedia retrieval.
Haojin Yang received the Diploma Engineering degree from the Technical University Ilmenau, Germany, in 2008. In 2013, he received his doctorate degree from the Hasso-Plattner-Institute for IT-Systems Engineering (HPI) at the University of Potsdam, Germany. His current research interests revolve around multimedia analysis, information retrieval, deep learning technologies, computer vision and content-based video search technologies.
Christoph Meinel studied mathematics and computer science at Humboldt University in Berlin. He received his doctorate degree in 1981 and was habilitated in 1988. After visiting positions at the University of Paderborn and the Max-Planck-Institute for Computer Science in Saarbrücken, he became a full professor of computer science at the University of Trier. He is now the president and CEO of the Hasso-Plattner-Institute for IT-Systems Engineering at the University of Potsdam, where he is a full professor of computer science with a chair in Internet technologies and systems. He is a member of acatech, the German National Academy of Science and Engineering, and of numerous scientific committees and supervisory boards. His research focuses on IT-security engineering, teleteaching, telemedicine and multimedia retrieval. He has published more than 500 papers in high-profile scientific journals and at international conferences.
9
Automatic Lecture Highlighting
In this paper, we propose a novel solution to highlight online lecture videos at both the sentence and segment level, just as is done with paper books. The solution is based on automatic analysis of multimedia lecture materials, such as speeches, transcripts, and slides, in order to support online learners in the current era of e-learning, especially with MOOCs. With ground truth created by massive numbers of users, an evaluation process shows that the general accuracy can reach 70%, which is reasonably promising. Finally, we also attempt to find a potential correlation between these two types of lecture highlights.
9.1 Contribution to the Work
• Contributor to the formulation and implementation of research ideas
• Significant contribution to the conceptual discussion and implementation
• Guidance and supervision of the technical implementation
• Maintainer of the software project
9.2 Manuscript
1939-1382 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TLT.2017.2716372, IEEE Transactions on Learning Technologies
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING & TRANSACTIONS ON LEARNING TECHNOLOGIES 1
Automatic Online Lecture Highlighting Based on Multimedia Analysis
Xiaoyin Che, Haojin Yang and Christoph Meinel, Member, IEEE
Abstract—Textbook highlighting is widely considered to be beneficial for students. In this paper, we propose a comprehensive solution to highlight online lecture videos at both the sentence and segment level, just as is done with paper books. The solution is based on automatic analysis of multimedia lecture materials, such as speeches, transcripts and slides, in order to support online learners in this era of e-learning, especially with MOOCs. Sentence-level lecture highlighting mainly uses acoustic features from the audio, and the output is implemented in the subtitle files of the corresponding MOOC videos. In comparison with ground truth created by experts, the precision is over 60%, which is better than the baseline works and was also welcomed in user feedback. Segment-level lecture highlighting, on the other hand, works with statistical analysis, mainly by exploring the speech transcripts, the lecture slides and their connections. With ground truth created by massive numbers of users, an evaluation process shows that the general accuracy can reach 70%, which is fairly promising. Finally, we also attempt to find a potential correlation between these two types of lecture highlights.
Index Terms—Lecture Highlighting, Acoustic Analysis, Statistical Analysis, MOOC
1 INTRODUCTION
Many people like using a marker to highlight books while reading, especially students with textbooks in hand [1]. Research shows that properly highlighted contents indeed support understanding [2]. Perhaps this is the reason why quite a lot of book authors already highlight the key concepts, features or equations in their books, and more are requested to do so [3]. Generally, there are two types of highlighting: content highlighting and table-of-contents highlighting (see Fig. 1). The former mostly emphasizes sentences, while the latter works on a larger scale, indicating which section should be given special attention.
Not only a widespread practice with traditional paper books, the highlighting function is also welcomed in the era of e-books [4, 5]. It is widely implemented in many e-book applications. In this case, while a marker is no longer required, highlighting is still based on book-like textual materials. However, what if there is nothing textual, as when attending a lecture without a textbook? Does it make sense to highlight the lecture?
We believe the answer is yes. In a lecture, there are always some key-points, such as definitions, illustrations, functions, applications, etc., which are more important to students than the other contents of the lecture. Fortunately, good teachers always know these key-points in their lectures and emphasize them while teaching [6]. These emphases may draw attention from students through tone changes in speech and further improve the teaching performance [7]. Once captured and presented to students, particularly to self-learning students, they could be very helpful. A good teacher's emphasis should also be the students' learning focus [8].
In recent years, with the rapid development of distance learning technology, especially in the form of the MOOC (Massive Open Online Course), numerous lectures are recorded on video, uploaded to the Internet and can be accessed freely online. However, research shows that for MOOC learners, the median engagement time when watching a lecture video is at most 6 minutes [9]. Unfortunately, many online lectures are much longer than that. After
All authors are with the Hasso Plattner Institute and can be contacted by email: {xiaoyin.che, haojin.yang, christoph.meinel}@hpi.de, or by post: Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany.
6 minutes, the learners become less concentrated or effective, and sometimes even close the video without finishing it, causing the phenomenon of "in-video dropout" [10].
Lecture highlighting may help in this situation. We could offer some highlighted sentences with alerts, which might work as refreshments when the learners gradually get distracted. Perhaps these alerts cannot keep the learners concentrated all the time, but at least the learners know when the key-points are presented in the video. We could also highlight some video segments covering specific subtopics emphasized by the teacher, which would make it easy for learners to jump directly to these key segments before "in-video dropout" occurs. With this effort, at least they would encounter the most important knowledge in the video before quitting. And if a key segment arouses the interest of some learners, they may even decide not to drop out at all.
Meanwhile, the big improvements achieved in video displaying and lecture recording systems make the potential implementation of key sentences and segments much easier. Enabling live transcripts or subtitles, also known as CC (Closed Captions), is popular not only with Internet video service providers like YouTube and Vimeo, but also in traditional TV services, such as ARD or ZDF in Germany [11], as well as on the majority of MOOC platforms [12, 13]. These additional synchronized textual data are very suitable for sentence-level highlight implementation. Fig. 2 shows a screenshot of a slide-inclusive lecture recorded by the tele-TASK system [14]. In this kind of modern recording, the slide content can be thoroughly obtained, and transition detection can further be applied to logically segment the lecture video [15]. With the visual navigation bar in the bottom-right of Fig. 2, key segments can easily be marked up by simply adding a sign or changing the color. Besides, the textual segment list on the bottom-left is also applicable. Some other online lecture archives (e.g. LectureVideo.NET) offer similar functions.
Based on all the above motives, we propose two technically independent but practically related approaches to highlight online lectures at both the sentence and segment level automatically. Please note that we intend to finish the highlighting before the lecture
Fig. 1. Two examples of book highlighting. (a) Content highlighting; copyright of the image belongs to Maryellen Weimer on http://www.facultyfocus.com. (b) Table-of-contents highlighting; image published on http://backtoluther.blogspot.de, copyright belongs to the original author(s).
officially goes online, so both real-time MOOC users and archive lecture learners can benefit from it. However, enabling users to create personalized highlights is not the topic of this paper; we concentrate on "we highlight for you".
Sentence-level highlighting focuses on acoustic emphasis detection; the result is presented in lecture transcripts, or more specifically subtitles, and evaluated by both subjective and objective standards. The major technical contribution of the sentence-level approach is that we innovatively manifest the speaking rate by syllable duration and pause rate at the sentence level and combine it with the frequently used features, pitch and energy, in a general decision scheme. But more importantly, as far as we know, we are the first to automatically detect acoustic-based highlights in lecture videos and to implement a practical application in the educational domain. Segment-level highlighting, on the other hand, mainly depends on exploring the correlation between speech and slides, with a brand-new form of statistical analysis oriented to the characteristics of online lecture videos. User feedback and forum threads are used for evaluation.
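As an illustration of the kind of sentence-level acoustic features just listed, the following sketch computes a mean frame energy and a pause rate for one sentence from a mono waveform. The frame sizes, the RMS-based silence threshold and all names are our own simplified assumptions for illustration, not the paper's exact feature definitions:

```python
import numpy as np

def sentence_acoustic_features(samples, sr=16000, frame_ms=25, hop_ms=10,
                               pause_thresh=0.02):
    """Toy sentence-level features: mean frame energy and pause rate.

    samples:      mono waveform of one sentence (float array)
    pause_thresh: RMS level below which a frame counts as a pause
                  (the threshold value is an assumption for illustration)
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    rms = []
    for start in range(0, len(samples) - frame + 1, hop):
        chunk = samples[start:start + frame]
        rms.append(np.sqrt(np.mean(chunk ** 2)))
    rms = np.array(rms)
    return {
        "mean_energy": float(rms.mean()),
        "pause_rate": float((rms < pause_thresh).mean()),  # fraction of silent frames
    }

# Synthetic sentence: 0.5 s of a 220 Hz tone followed by 0.5 s of silence,
# so roughly half the frames should be classified as pauses.
sr = 16000
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
speech = 0.1 * np.sin(2 * np.pi * 220 * t)
feats = sentence_acoustic_features(np.concatenate([speech, np.zeros(sr // 2)]), sr=sr)
```

In a real system, pitch contours and syllable durations would be added alongside these two values before the per-sentence decision scheme is applied.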
The rest of this paper is organized as follows: Section 2 discusses related work. Sections 3 and 4 introduce sentence-level lecture transcript highlighting and segment-level lecture video highlighting in detail, respectively. Section 5 compares key sentences and key segments and attempts to find a connection between them. This is followed by the conclusion.
2 RELATED WORK
Detecting emphasis in speech is a long-term research topic. Early attempts aimed to segment speech recordings or summarize spoken discourses based on the detected acoustic emphasis [16, 17]. From then on, almost all approaches took pitch as the indispensable feature for this task, since it is widely acknowledged that the pitch value changes as the speaker's status changes [18, 19]. However, Kochanski et al. argued that loudness and duration are more crucial than pitch for classifying acoustic prominences, and conducted a syllable-level experiment with positive results to support their argument [20]. Since then, more approaches prefer to take pitch, loudness and duration into consideration together.
Syllable-level prominence detection is fundamental in this topic. As the most microscopic linguistic element, the stress of a syllable in a word can be decisive in stress languages like English: "RE-cord" (noun) and "re-CORD" (verb) are semantically different. Therefore, there are already many successful systems that automatically classify them in different languages [21–23]. The research interest then moved upwards from the syllable level to the word level. Acoustically, no new feature is introduced, although discussions have been made about whether to sample on syllables or directly on whole words [24, 25]. Meanwhile, lexical features start to be included at the word level [26, 27].
It is natural to make a similar extension from words to utterances or, as we say, sentences. Fairly promising results have been reported in locating "hot spots" in meeting recordings, which are a kind of conversational speech [28, 29]. However, lecture speech in our case is generally a kind of solo speech. As far as we know, the performance of sentence-level emphasis detection on solo speech has not been reported before, and we would like to be the first to do so.
Although research on sentence-level speech emphasis detection is quite limited, there is a highly related and well-researched topic: speech emotion recognition [30, 31]. It shares the same research foundation with emphasis detection approaches by using the same features (pitch, energy, duration, etc.) [32], and emotions are believed to be more suitable for describing phonetic structures in longer time frames than emphases [33]. More specifically, observation finds that some acoustic phenomena of speech emphasis are highly similar to those of the widely used emotion states "happy/joy" and "angry/anger", such as higher pitch and energy [34]. Sometimes emphasis is even taken as an independent emotion state, "emphatic" [35]. These facts suggest that not only the technical insights, but also the experimental results from emotion recognition approaches could be transferable to our purpose of emphasis detection.
Another related research topic is the social tendency analysis of language. Speech emphasis is considered by some researchers as a preliminary element for constructing social signals [36, 37]. And as we already introduced, teachers use emphasis in their lecture speeches in order to draw attention from the listeners. This behavior is functionally a typical manifestation of high "extraversion" in the classic "Big Five" personality model [38]. Social tendencies can be classified by text analysis [39], as can emotions [40]. These lexical approaches may also provide baseline results for us.

Fig. 2. A screenshot of a slide-inclusive lecture recorded by the tele-TASK system. A visual navigation bar with slide previews can be seen in the bottom area of the "desktop stream" on the right, while the textual segment list is under the "lecturer stream" on the left.
Once moving forward from the sentence level to the segment level, video becomes the major carrier in emphasis analysis research, and the term "highlight" is more frequently used to denote a key video segment. Highlight detection in broadcast sports videos is the most well-researched subarea, but the features used there, such as specific scenes, commentators' keywords or replay sessions, are only available in the context of sports video [41, 42]. For other types of video, highlight detection is generally taken as a step in video summarization or abstraction. The video is deconstructed into shots, from which key-frames are extracted and further evaluated according to their visual similarity, timing information, and the features of the synchronized audio signal [43–45].
Lecture video, on the other hand, is different [46, 47]. It has very limited scene changes, which makes almost all the extracted key-frames visually similar. However, lecture videos sometimes include external multimedia data, such as slides, which enabled an early attempt by He et al. on slide-inclusive lectures [48]. They fed audio features, slide transition information and user statistics into their model, but focused more on content integrity than on segment importance, for the purpose of lecture video abstraction. Taskiran et al. also contributed to the summarization of lectures [49]. They used pauses in speech to segment the video and calculated importance scores for segments by detecting word co-occurrence based on transcripts. Inspired by these ideas, we plan to further explore the connection between slides and transcripts in segment-level lecture video highlighting.
3 SENTENCE-LEVEL LECTURE HIGHLIGHTING
3.1 Sentence Units Acquisition

The development and popularization of online lectures, especially in the form of MOOCs, has contributed greatly to breaking the geographical barrier of knowledge dissemination. Subtitles, whether translated into other languages or only available in the original language, are by far considered the best means of breaking the language barrier [50, 51]. To meet this need of the learners, many course providers offer subtitles as supplementary material or facilitate the potential integration of subtitles, as already mentioned in Section 1. In this case, the subtitles are manually generated by professional production teams or volunteer groups: fully punctuated, well synchronized and properly segmented into subtitle items in a user-friendly way. We can directly take these subtitle items as sentence units for our purpose.
However, if a course has no existing subtitle files, we can create them automatically. Starting with ASR (Automated Speech Recognition), unpunctuated or under-segmented transcripts of the lecture videos can be obtained. Then SBD (Sentence Boundary Detection) can be employed, in which a deep neural network model classifies whether a punctuation mark should be inserted after the k-th word of a continuous n-word sequence, with word vectors and pauses as major features [52, 53]. The transcripts with restored punctuation marks can then be reasonably segmented into subtitle items, completing the automatic subtitle generation procedure.
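As a rough illustration of this setup (not the authors' actual model; the function name, the fixed window size, and the use of raw words as stand-ins for word vectors are our assumptions), the context construction for deciding on a punctuation mark after the k-th word could be sketched as:

```python
def sbd_window(words, pauses, k, n=5):
    """Collect the n-word context around position k, paired with the
    pause duration (in seconds) following each word. In a real SBD
    system the words would be replaced by their word vectors and the
    pairs fed to a neural classifier that decides whether a
    punctuation mark belongs after word k.
    """
    lo = max(0, k - n // 2)       # clamp the window at the transcript start
    hi = min(len(words), lo + n)  # ...and at the transcript end
    return list(zip(words[lo:hi], pauses[lo:hi]))
```

A long pause after the k-th word (for example 0.8 s versus 0.05 s elsewhere) would then be a strong cue for inserting a sentence-final punctuation mark.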
Unfortunately, errors such as improperly recognized words can hardly be avoided in automatically generated subtitles. But since the further processing in this section focuses on acoustic features in audio rather than lexical information in text, the potential negative influence is minimized. In the end, a sentence unit contains a segment of audio, obtained via the time tags of the corresponding subtitle item, along with its textual content.
3.2 Voiced/Unvoiced Sound Classification

Lecture speech is typically solo speech. Especially when preparing videos for a MOOC, many lecturers prefer to talk to a camera in a studio rather than in a classroom with real students [54, 55]. As a result, the audio signal of lecture videos is of high quality with a quite low level of noise. We therefore consider a denoising process unnecessary and treat all acoustic information as deriving from the speaker. Taking sentence units as input, our analysis process starts with voiced/unvoiced sound classification.

Fig. 3. The short-term energy and zero-crossing rate of the example sentence unit, along with its voiced/unvoiced deconstruction result.
Typically, speech consists of three categories of elements: voiced sound (V), unvoiced sound (U) and silence (S). For example, the pronunciation of the English word “breakfast” theoretically has the structure “U-V-U-V-U-U”, corresponding to “b-rea-k-fa-s-t”, and in a sentence of actual speech, “breakfast” is very likely to be surrounded by two “S”. In many speech analysis tasks, V/U/S classification is an important pre-processing step [56]; emphasis detection is no exception. In this work, we classify voiced and unvoiced sound with Short-Term Energy and Zero-Crossing Rate.
Short-Term Energy (addressed as energy or E hereafter) is a basic acoustic feature, commonly used to measure the instantaneous loudness of the audio signal. Zero-Crossing Rate (ZCR or Z) is the rate of sign changes along the signal, which can be seen as a simple measurement of frequency within a small time window. It is widely acknowledged that voiced sounds have high energy and low ZCR, while unvoiced sounds are the opposite: low energy but high ZCR [57–59]. Silence fragments are easy to classify because both energy and ZCR approach 0.
Fig. 3 shows the energy and ZCR level of an example sentence unit with the content “this is the speed, with which the machine is working.” This example comes from the MOOC “In-Memory Data Management” in 2012¹, whose audio signal has a sample rate of 48 kHz. In this work, both energy and ZCR are sampled with a window size of 0.02 s and a step size of 0.01 s, which means each sample covers 960 sampling points in total, over which the average value is taken. Both features are extracted with the Yaafe toolkit².
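As a minimal sketch of this frame-wise extraction (assuming the audio is available as a list of mono samples; the function name is ours, not Yaafe's), the 0.02 s window and 0.01 s step described above could be implemented as:

```python
def short_term_features(signal, sr=48000, win=0.02, hop=0.01):
    """Compute per-frame short-term energy and zero-crossing rate.

    Window/step sizes follow the paper: a 0.02 s window with a 0.01 s
    step, i.e. 960 sampling points per window at 48 kHz, averaged.
    """
    w = int(win * sr)  # 960 samples per window at 48 kHz
    h = int(hop * sr)  # 480-sample step
    energies, zcrs = [], []
    for start in range(0, len(signal) - w + 1, h):
        frame = signal[start:start + w]
        # average squared amplitude as instantaneous loudness
        energies.append(sum(x * x for x in frame) / w)
        # fraction of adjacent sample pairs whose sign changes
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        zcrs.append(crossings / (w - 1))
    return energies, zcrs
```

A loud, low-frequency (voiced-like) frame then yields high energy and low ZCR, while a quiet, rapidly alternating (unvoiced-like) frame yields the opposite, matching the classification rule of thumb cited above.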
We use a heuristic-adaptive decision scheme for the V/U/S classification. First the average energy value of the whole sentence (E) is calculated. The i-th sample in the sentence unit is taken as a voiced sample if Ei > E and Ei > Zi. Adjacent voiced samples are then connected to form voiced sounds, while single isolated voiced samples are considered accidental and discarded.
Unvoiced sound classification is more complicated: the challenge lies in separating unvoiced sound from both voiced sound and noisy silence. After observing the speech signals of several different lecturers, we set the following requirements:
1. https://open.hpi.de/courses/imdb2012
2. http://yaafe.sourceforge.net/
• The sample is NOT a voiced sample.
• Ei < max{E, Zi}
• Zi > Z × 1.5 or Ei + Zi < E
These requirements demand that Ei be comparatively small, but not too small, and that Zi be large. Only when a sample meets all three requirements is it taken as an unvoiced sample. Similarly, continuous unvoiced samples are gathered together as unvoiced sounds. All samples involved in neither voiced nor unvoiced sounds are considered silence, although some of them might be isolated voiced or unvoiced samples.
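The decision scheme above can be sketched as a simplified frame-wise routine (our reading of the heuristic: the thresholds and the removal of isolated voiced samples follow the text, while the grouping of frames into whole voiced/unvoiced sounds is omitted):

```python
def vus_classify(E, Z):
    """Heuristic V/U/S labeling per frame.

    E, Z: per-frame energy and zero-crossing-rate lists on comparable
    scales. Returns a list of labels 'V', 'U' or 'S'.
    """
    n = len(E)
    E_mean = sum(E) / n  # sentence-level average energy
    Z_mean = sum(Z) / n  # sentence-level average ZCR
    labels = ['S'] * n
    voiced = [E[i] > E_mean and E[i] > Z[i] for i in range(n)]
    for i in range(n):
        if voiced[i]:
            # keep only voiced frames with a voiced neighbour;
            # isolated ones are treated as accidental and dropped
            left = voiced[i - 1] if i > 0 else False
            right = voiced[i + 1] if i < n - 1 else False
            if left or right:
                labels[i] = 'V'
    for i in range(n):
        if labels[i] == 'V':
            continue
        # energy comparatively small, but not too small; ZCR large
        small_enough = E[i] < max(E_mean, Z[i])
        zcr_large = Z[i] > Z_mean * 1.5 or E[i] + Z[i] < E_mean
        if small_enough and zcr_large:
            labels[i] = 'U'
    return labels
```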
The voiced/unvoiced deconstruction result of the example sentence unit can also be found in Fig. 3. Theoretically, there should be 12 voiced sounds and 7 unvoiced sounds based on the textual content. Our deconstruction scheme successfully classifies 11 voiced sounds and 6 unvoiced sounds, missing the unvoiced “-d” in “speed” and mistaking the voiced “ma-” in “machine” for unvoiced sound. Generally, this scheme keeps the accuracy around 85∼90% when the lecturer speaks calmly and fluently. For the purpose of emphasis analysis, this accuracy is acceptable.
3.3 Acoustic Emphasis Analysis
We believe that when speakers emphasize something, they speak louder and/or raise the tone of their voice, directly affecting acoustic features such as pitch and loudness. Moreover, the speaker definitely wants the audience to clearly catch every emphasized word and might give the audience some extra time to respond, which may result in longer pauses between words. In order to catch these possible clues, we measure the following features to analyze emphases acoustically:
• Loudness. As in the V/U/S classification, we measure the loudness of a sentence unit by the short-term energy, but with a different method. Only the samples in voiced sounds are included in calculating an average value E. Each sample is treated equally, regardless of its position in the voiced sound it belongs to or the position of that voiced sound in the sentence unit. E represents the loudness level of the sentence unit; the average energy of each individual voiced sound is not calculated.
• Pitch. Similar to loudness, we calculate the average pitch of the sentence unit (addressed as P) using only the voiced sounds. It is widely believed that males often speak at 65 to 260 Hz, while females speak in the 100 to 525 Hz range. In our experiments, however, the pitch value within unvoiced sound periods can easily reach 1000 Hz, which has to be excluded from the calculation. The pitch level in our work is extracted with the Aubio toolkit³.
• Syllable Duration. It is reasonable to measure speaking rate by words when the speech sample is long enough. However, in our task we have already segmented the speech into sentence units of only a few words, in which case the length of a word matters more. For example, the German words “Ja” and “Immatrikulationsbescheinigung”, which mean “yes” and “confirmation of enrollment” respectively, should not equally be counted as one word. A better measurement is by syllables: “Ja” has only 1 syllable, while “Immatrikulationsbescheinigung” has 11. Ideally, the syllables in the transcript should match the voiced sounds in the speech one by one [37]. In practice, however, because of hesitations (“eh. . . ”) by the speaker or mistakes like the one shown in Fig. 3 with “machine”, the numbers may differ. Therefore, we calculate both the average syllable duration and the average voiced sound duration and apply the smaller one, addressed as D.
• Pause Rate. The pause rate (Rp) is the percentage of silence in the whole sentence unit. It is expected to be larger when the speaker emphasizes speech elements with extra pauses, as described previously. Practically speaking, we sum up the total time of the previously classified voiced and unvoiced sounds and deduct it from the sentence unit duration. It can be seen as an additional speaking-rate feature.
With all the above features, the acoustic importance value Aj of the j-th sentence unit in a lecture video with n sentence units in total is defined as
Aj = E × (1 + (j − 1)/(n − 1) × 0.1) + P × λ + D × µ + Rp × η    (1)
where λ, µ and η are weights balancing the influences of the different features. They are necessary because the absolute value ranges of the features differ greatly; based on our observation, for instance, E ∈ [0, 0.5) while P ∈ (100, 500). In practice, we calculate the average values of the features for each lecture and use these averages as benchmarks to tune λ, µ and η per lecture, so that the influences of all four features are basically the same. The adjustment of E is designed because, as the lecture proceeds, the speaker gradually gets tired and the loudness level unconsciously decreases. The timeline-based adjustment compensates for this general energy decay and gives sentence units in the later phase of the lecture a fairer chance to be detected as emphasis.
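Equation (1) translates directly into a function (a sketch; the weight values in the usage below match those reported in the evaluation, λ = 0.001, µ = 0.02, η = 1):

```python
def importance(E, P, D, Rp, j, n, lam, mu, eta):
    """Acoustic importance of the j-th sentence unit, Eq. (1).

    j is 1-indexed, n >= 2 is the total number of sentence units.
    The timeline term (1 + (j - 1)/(n - 1) * 0.1) compensates for the
    speaker's gradual energy decay over the lecture; weights lam, mu,
    eta are tuned per lecture so all four features contribute
    comparably.
    """
    return E * (1 + (j - 1) / (n - 1) * 0.1) + P * lam + D * mu + Rp * eta
```

For identical feature values, a unit at the end of the lecture (j = n) receives a 10% boost on its energy term relative to the first unit, implementing the energy-decay compensation described above.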
In this section, we aim to highlight a certain proportion of sentences with acoustic emphases from a lecture, for the purpose of facilitating learning. Thus there is no need to set a hard threshold to decide whether a sentence unit is acoustically emphasized or not. Instead, all sentence units of a lecture are sorted by their calculated importance values in descending order, and the top ones are marked as emphasized. Since a sentence unit might not be a complete sentence, if only part of a sentence is considered a highlight, the highlight is extended to the complete sentence.

3. https://aubio.org/manpages/latest/aubiopitch.1.html
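The threshold-free selection just described amounts to a simple ranking (a sketch; extending a partial hit to its complete sentence is left out here):

```python
def select_highlights(scores, proportion=1/6):
    """Indices of the top-scoring sentence units.

    No hard threshold: units are sorted by importance in descending
    order and roughly the top `proportion` are marked as emphasized.
    """
    k = max(1, round(len(scores) * proportion))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # return chosen indices in timeline order
```

With the default proportion of 1/6, a lecture of 58 sentence units yields 10 acoustic highlights, matching the first example in Section 3.5.1.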
3.4 Experimental Implementation

After confirming which sentence units should be highlighted, the next step is to figure out how to present them in a user-friendly way in the MOOC context. We could literally do “transcript highlighting” by re-formatting the subtitle file into a pure textual transcript file, highlighting the selected sentence units with bold font, background color or underlining, just as in traditional paper books, and making it downloadable. However, watching the video while simultaneously checking external reading material would be an unpleasant experience for the learners, so we have no reason to be optimistic about how that would work.
Alternatively, we need to implement highlighted sentences in a more easily accessible way, preferably simultaneous with the video display. Beeping or flashing could be an option, since these are common ways to arouse attention in various scenarios, but we fear they would be too aggressive in an educational setting. We consider the best way to be a visual sign for the highlighted sentences in the subtitle file. Such signs make the user instantly aware of these sentences.
Since we have no previous experience in designing such signs, we make an experimental attempt. Each highlighted sentence is surrounded by a pair of solid star pentagons, as shown in Fig. 4. Additionally, a pair of empty star pentagons marks the subtitle item preceding a highlighted sentence unit, as a reminder. We collect user feedback on this implementation in a later chapter.
The importance analysis method we apply at the sentence level is mostly based on acoustic features and is therefore theoretically language independent. The importance analysis can also be performed on the original teaching language while its result is offered in a translated target language, as in Fig. 4, which shows a screenshot of the MOOC “Internetworking.” This MOOC is taught in English but offered to Chinese-speaking users with subtitles in simplified Chinese⁴.
3.5 Evaluation

In order to evaluate the performance of the proposed scheme, we offered the highlighting result in lectures 4.5 and 4.6 of “Internetworking.” It is a comparatively limited-scale evaluation because we did not know how well this new “we highlight for you” feature would be accepted among learners. The evaluation consists of three aspects: first we demonstrate a few highlighted examples and explain the rationale behind them, then we analyze the precision based on ground-truth created by multiple experts, and finally we present user feedback. In this experiment, the coefficients in equation (1) are set as follows: λ = 0.001, µ = 0.02, η = 1, according to the principle introduced in Section 3.3, and the selection proportion is around 1/6.
4. https://openhpi.cn/courses/internetworking2016

Fig. 4. The “highlighted” subtitles in the MOOC environment, with star pentagon signs.

3.5.1 Example Demonstration

The first example is a fraction of lecture 4.5, which talks about the “Neighbor Discovery Protocol (NDP)”. The total length is 7:28 and it is segmented into 58 sentence units in the subtitle file. 10 units among them are acoustically highlighted, plus 2 added by the “complete sentence” policy. Here we extract the textual content as a transcript and mark the highlighted part in bold font (please note that this differs from what learners actually see):
“. . . And the first, I want to mention is the neighbor discovery protocol. The task of the neighbor discovery protocol, NDP, is to facilitate the interaction between adjacent nodes. What are the adjacent nodes? They are neighbor nodes. In IPv6 nodes are considered adjacent, if they are located on the same ‘link’. And the IPv6 link is the network area that is bounded by a router . . . ”
Grammatically, the highlighted content in this example is the answer to a “hypophora,” also addressed as “anthypophora.” In plain English, a hypophora is a self-answering question, generally believed to be used to draw attention or arouse curiosity from the audience [60, 61], in order to heighten the effect of what is being spoken, in other words, to create emphasis. This phenomenon widely exists in educational contexts [62, 63]. Semantically, the content of the highlighted sentences is the explanation of an important technical term, the neighbor node, in a lecture about NDP. All these facts strongly suggest that highlighting this sentence is highly logical.
As mentioned before, “Internetworking” is a MOOC recorded in English but offered in Chinese. The subtitles prepared are completely in simplified Chinese, and the above example is actually presented as:
. . .首先我要提的是邻机发现协议。邻机发现协议NDP的任务是协助邻近节点之间的互动,什么是邻近节点?这些是邻近节点。在IPv6中如果节点位于相同的“链路”,则它们被认为是邻近节点。IPv6链路是指一台路由器覆盖的网络区域. . .
Here we quote the Chinese text because, due to considerations of word order and fluency, the second and the third highlighted sentence units exchange positions when translated from English to Chinese (non-Chinese speakers may focus on the different positions of the quotation marks in the example). However, the highlighting is not affected, thanks to the “complete sentence” policy.
The second example derives from lecture 4.6, which talks about the “Dynamic Host Configuration Protocol” under the framework of IPv6 (DHCPv6). This lecture lasts 7:45, with 66 sentence units in total, 12 of them acoustically highlighted and 1 added for “complete sentence”. This example corresponds to Fig. 4:
“. . . In IPv4, this was only possible with the DHCP protocol, the Dynamic Host Configuration Protocol. The DHCP protocol was responsible to dynamically allocate IP address to the host, to allocate the host names, to provide information about default gateway, and information about responsible DNS server (Domain name service). See DHCP protocol works in a stateful mode. That means the respective DHCP server knows which host uses which configuration and keeps track of all the interactions . . . ”
Comparing with the slide recorded in the right section of Fig. 4, we find that the highlighted part in this example is the same as what is written on the slide. People use slides as the outline of the talk; in other words, the slide is the collection of important terms the speaker wants to mention. Detecting them as the key content seems to be a good option.
Despite these successful examples, we clearly know that our result is far from perfect. Therefore we conduct a precision analysis in a more general way.
TABLE 1
Precision Analysis on Sentence-Level Highlighting

               All Sentences    Highlighted Sentences
Method         Num    Ave       Num   Ave    Hit   Precision
ToneAnalyzer   124    0.86      22    0.86   8     36.4%
Vokaturi       124    0.86      22    1.11   12    54.5%
Proposed       124    0.86      22    1.14   14    63.6%
3.5.2 Precision Analysis
In this subsection we evaluate the general accuracy of the automatically highlighted sentences in a fairly objective way. Unlike in a typical classification problem, it is impossible to obtain absolutely objective ground-truth in this case, because people always measure differently whether a sentence in a speech is more important; there is only recommending or not, no right and wrong. Alternatively, we invited multiple experts on the corresponding topics of the test lectures to give their opinions on potential lecture highlights, and took the comprehensive opinion as the ground-truth.
The test videos are again lectures 4.5 and 4.6 of “Internetworking.” We asked 10 different experts in total, 5 for each lecture, who graduated in IT-related majors from different universities and still work in this profession, to rate the importance of each sentence unit on three levels: 2 (recommend as highlight), 1 (neutral) and 0 (not important). An importance score for each sentence unit is then calculated by averaging these 5 ratings. If the score of a sentence unit is strictly greater than 1, it is taken as a ground-truth key sentence. Since we did not restrict how many sentences the experts could recommend as highlights, the total number of ground-truth key sentences is significantly larger than the number of automatically highlighted sentences (38 vs 22), so we only measure precision in this experiment, not recall.
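The ground-truth construction and precision computation just described can be sketched as (the function name and example values are ours):

```python
def precision_against_experts(ratings, highlighted):
    """Precision of automatic highlights against expert ground-truth.

    ratings: per sentence unit, a list of expert scores in {0, 1, 2}.
    A unit is a ground-truth key sentence iff its average rating is
    strictly greater than 1. highlighted: indices chosen automatically.
    """
    key = {i for i, r in enumerate(ratings) if sum(r) / len(r) > 1}
    hits = sum(1 for i in highlighted if i in key)  # "Hit" in Table 1
    return hits / len(highlighted)
```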
Unfortunately, research effort on sentence-level emphasis detection in the educational domain is still limited. We therefore failed to find any publicly available toolkit to test on our data or any public dataset to run our approach on. However, as already mentioned in Section 2, emphasis is one of the fundamental elements for classifying “happy/joy” and “angry/anger” in emotion analysis, and is functionally similar to the social tendency “extraversion”. Thus we run the state-of-the-art audio-based emotion detection approach Vokaturi⁵ and the highly reputable linguistic-based IBM Watson ToneAnalyzer⁶ for comparison. For each baseline approach, the measurement of emphasis or highlight is defined as the sum of the respective confidence values of being classified as “happy/joy”, “angry/anger” and, when applicable, “extraversion”. Similar to our method, all sentence units are sorted by this measurement in descending order and the top 1/6 are selected as highlights.
Please note that the “complete sentence” policy is not applied here for the proposed approach. As can be seen in Table 1, “Ave” represents the average importance score given by the experts to all corresponding sentences, while “Hit” means the number of sentence units that are both highlighted by our acoustic analysis and rated as highlights by the experts. The results show that 14 of the 22 sentence units highlighted by our method are correct, a precision of 63.6%, which is higher than both baselines. We must stress again, however, that neither Vokaturi nor ToneAnalyzer is specifically designed for emphasis detection, so this task may not reflect their full potential.

5. https://developers.vokaturi.com/downloads/sdk
6. https://tone-analyzer-demo.mybluemix.net/

TABLE 2
Statistics about the Survey

Q1: Do you think the feature “we highlight for you” is meaningful in the context of MOOCs?
    (1) Yes                                            56    76.7%
    (2) No                                              6     8.2%
    (3) I’m not sure                                   11    15.1%
Q2: Have you noticed the highlighted sentences in previous lectures?
    (1) Yes                                            56    77.8%
    (2) No                                             16    22.2%
Q3: Do you think our current implementation, with star polygon pairs, is appropriate?
    (1) Yes, it’s completely appropriate.              32    47.1%
    (2) It’s OK, but the reminder is unnecessary.      14    20.6%
    (3) It’s OK, but the sign should be more obvious.  13    19.1%
    (4) It’s OK, but the sign is too garish.            5     7.4%
    (5) No, it’s terrible.                              4     5.9%
Q4: Please rate the current accuracy of the highlights offered (5 stars is the highest, 1 the lowest).
    (1) ★★★★★                                          20    28.2%
    (2) ★★★★                                           22    31.0%
    (3) ★★★                                            21    29.6%
    (4) ★★                                              1     1.4%
    (5) ★                                               7     9.9%
Q5: With the current level of accuracy, do you want us to formally apply “we highlight for you” in following lectures and courses?
    (1) Yes                                            52    76.5%
    (2) No                                              7    10.3%
    (3) I’m not sure                                    9    13.2%
3.5.3 User Feedback

Besides the subjective and objective evaluations from the developer’s side, opinions directly from the user’s side are also crucial. In MOOCs, a new feature can only be truly beneficial if it is welcomed and used by learners. We set up a survey about the general acceptance of the proposed sentence-level “we highlight for you” approach in the form of highlighted subtitles. Since all survey items are in principle optional and independent of each other, the total number of replies per item can differ.
When discussing the prospects of newly developed techniques, the Technology Acceptance Model (TAM) is frequently referenced [64], in which “perceived usefulness” and “perceived ease-of-use” are considered the basic reactions of users who encounter new technology; these then form the “attitude towards using” and finally affect actual use. As illustrated in Table 2, 76.7% of survey respondents acknowledge the positive meaning of the proposed “we highlight for you” feature (Q1), while 77.8% did notice the existence of highlighted sentences (Q2). These numbers indicate the potential usefulness of our work and the ease of accessing it.
Regarding technical details, users expressed different and somewhat contradictory opinions about the way we implemented lecture highlights (Q3), and many of them are basically satisfied with the current accuracy (Q4): an average rating of 3.66 is achieved, which corresponds to 66.5% and is similar to the objective precision obtained (63.6%). Finally, 76.5% of users explicitly indicated a “Yes” attitude towards our new feature by encouraging us to formally adopt subtitles with highlighted items in follow-up lectures and courses, while another 13.2% are not against this idea either (Q5).
Generally speaking, the users accept the feature “we highlight for you” and are basically content with the technical aspects of the proposed sentence-level highlighting approach. However, we clearly know that the scale of our survey is limited, not only because of the comparatively small number of participants, but also because of their homogeneity: all of them are native Chinese speakers. Moreover, the structure of the above survey and the methodology of the feedback analysis are also relatively simple. We will attempt to improve on these points when we have the chance.
4 SEGMENT-LEVEL LECTURE HIGHLIGHTING
4.1 Segment Units Preparation

In segment-level lecture highlighting, the first task is to define the segment. Lecture video segmentation has been researched for years [65–67]. It differs from traditional natural video segmentation because lecture videos generally contain only very few scene changes, which are the most important feature in natural video segmentation [68]. However, many lecturers use slides as additional teaching material [69, 70]. The slide transitions can be detected and applied as the boundaries of lecture segments [71, 72]. In this section, we work only with slide-inclusive videos and address the lecture segments as SUs (Slide Units).
Each SU has beginning and ending time tags, and a textual outline can be created from the corresponding screenshotted slide image by OCR (Optical Character Recognition) and TOG (Tree-Structure Outline Generation) [73, 74]. If the digital slide file (in .pdf or .pptx format) is available, the slide content can also be parsed from the file with better accuracy, which further improves the quality of the textual outline for each SU. Meanwhile, the subtitle files of the lecture videos can also be split by the time tags, so that each SU possesses a paragraph of the lecture transcript. The following direct parameters are then available for each SU:

• Type: T-SU (pure textual slide), NT-SU (except for the title, there is no text in the slide but only illustrations, such as charts, images, etc.) and HT-SU (mixed).
• Duration (d): counted in seconds.
• O-Words (WO): total number of words in the slide outline.
• O-Items (I): total number of textual items in the slide outline, including title, topics and subtopics.
• S-Words (WS): total number of words in the speech paragraph.
• Co-Occur (C): total number of words shared by both the slide outline and the speech paragraph.
Based on these direct parameters, we define several indirect parameters to better represent the characteristics of the SUs:

• Speaking Rate: RS = WS / (d/60)
• Matching Rate: RM = C / WO
• Explanation Rate: RE = WS / WO
• Average O-Item Length: LI = WO / I
• Average O-Item Duration: dI = d / I
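The five indirect parameters follow directly from the direct ones (a sketch; the example values in the usage are hypothetical):

```python
def indirect_parameters(d, W_O, I, W_S, C):
    """Derive the indirect SU parameters from the direct ones.

    d: duration in seconds; W_O / I: words / items in the slide
    outline; W_S: words in the speech paragraph; C: words shared by
    both slide outline and speech paragraph.
    """
    return {
        'speaking_rate': W_S / (d / 60),   # R_S, words per minute
        'matching_rate': C / W_O,          # R_M
        'explanation_rate': W_S / W_O,     # R_E
        'avg_item_length': W_O / I,        # L_I, words per outline item
        'avg_item_duration': d / I,        # d_I, seconds per outline item
    }
```

For instance, a 2-minute SU with a 40-word, 8-item outline and a 300-word speech paragraph sharing 20 words with the slide has a speaking rate of 150 words per minute and an explanation rate of 7.5.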
With slide-inclusive videos as input, the complete segment preparation process is fully automated. Extra-long lectures containing too many SUs are first cut into several clips by exploring the inter-slide logic [75]; each clip is then considered an independent lecture.
Fig. 5. Ascending trend of RE while SU duration increases.
4.2 Importance Analysis of T-SU
For educational purposes, slides generally serve as an outline of the textbook: the lecturer lists the titles of the subtopics one by one, in addition to some short explanations. In our approach, such a textual-slide-based SU is considered a T-SU. The lecturer should, however, offer some extra information in the lecture speech; otherwise learners could simply read the slides by themselves. The importance evaluation of a T-SU therefore mainly focuses on the connections between the information in the speech and the slide, involving the following factors:
4.2.1 Expected Explanation Rate
The idea here is based on a simple assumption: the lecturer will explain in more detail when talking about something important. Naturally, a T-SU in such conditions will have a comparatively higher explanation rate. Meanwhile, we notice that the absolute value of RE might not be suitable as a direct measurement, because when we collected data from a complete course (“Web Technologies”), there was an apparent ascending trend of RE with increasing SU duration. Fig. 5 illustrates this trend clearly. Therefore we introduce the concept of the expected explanation rate, which is estimated from the SU duration d based on the linear trend line fitted in Fig. 5, and addressed as RE(d).
Similarly, another expected explanation rate can be estimated based on the course-scale observation of the SU parameter “average item length (LI)”. A smaller LI indicates more key-words or key-phrases in the slide, while a larger LI indicates that there might be more complete sentences. It is then quite understandable that a lecturer needs to add more extra information in the speech when LI decreases. Fig. 6 captures this trend with a descending trend line, from which we calculate the second expected explanation rate RE(LI).
We can now take the difference between the expected explanation rates and the actual one as the measurement. The first evaluation factor of a T-SU, fE, is calculated as

fE = RE − (RE(d) + RE(LI)) / 2    (2)
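The fitting of the two trend lines and the computation of fE in Eq. (2) can be sketched as follows. The (d, RE) and (LI, RE) observations below are hypothetical placeholders for the real course-scale data, and `numpy.polyfit` stands in for whatever line-fitting tool was actually used:

```python
import numpy as np

# Hypothetical course-scale observations (not the real "Web Technologies" data).
durations = np.array([30.0, 60.0, 90.0, 120.0, 180.0])   # SU duration d (seconds)
re_by_d   = np.array([1.2, 1.6, 1.9, 2.3, 3.0])          # RE rises with duration (Fig. 5)
item_lens = np.array([2.0, 4.0, 6.0, 8.0, 10.0])         # average item length LI
re_by_li  = np.array([3.0, 2.4, 1.9, 1.5, 1.1])          # RE falls as LI grows (Fig. 6)

# Linear trend lines, as fitted in Fig. 5 and Fig. 6.
slope_d, intercept_d = np.polyfit(durations, re_by_d, 1)
slope_li, intercept_li = np.polyfit(item_lens, re_by_li, 1)

def expected_re_d(d):
    """Expected explanation rate RE(d) from the duration trend line."""
    return slope_d * d + intercept_d

def expected_re_li(li):
    """Expected explanation rate RE(LI) from the item-length trend line."""
    return slope_li * li + intercept_li

def factor_e(re_actual, d, li):
    """Eq. (2): fE = RE - (RE(d) + RE(LI)) / 2."""
    return re_actual - (expected_re_d(d) + expected_re_li(li)) / 2.0
```

An SU whose actual RE exceeds both expectations gets a positive fE; one that falls below them gets a negative fE.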
1939-1382 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TLT.2017.2716372, IEEETransactions on Learning Technologies
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING & TRANSACTIONS ON LEARNING TECHNOLOGIES 9
Fig. 6. Descending trend of RE while LI increases.
fE can be either positive or negative; we expect fE to be large and positive when the content of the corresponding T-SU is important.
4.2.2 Hypothesis on Speaking Rate and Matching Rate

As already mentioned in Section 3.2, lecture speech, especially for MOOCs, is generally solo speech recorded in a studio with very limited interference. In such scenarios the lecturer is uninterrupted and can easily keep the speaking rate stable. However, well-experienced teachers know that a lecture should not be delivered like a lullaby: when, where and how to place emphasis is very important for teaching quality. When emphasizing, intentionally slowing down is a frequently used and effective technique [76, 77].
But we cannot simply take a low speaking rate as evidence of emphasis. Pauses caused by hesitation may also result in a low speaking rate [78], and these are very difficult to avoid, even for experienced teachers. In this case, we cannot tell whether a slow-down is intentional or accidental by checking the speaking rate alone.
We therefore make a hypothesis: when the lecturer slows down intentionally, the content of the speech should be closely related to the slide, because, just like the outline of a textbook, the slide should only include the key points, which are the potential targets of emphasis. In this case the matching rate should be comparatively high. Based on this hypothesis, we introduce the second evaluation factor fH for T-SUs:
fH = ((RM − R̄M) × 100 − (RS − R̄S)) / 2    (3)
where R̄M and R̄S denote the average values of RM and RS over the whole course. We also expect fH of important SUs to be large and positive.
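Eq. (3) reduces to a one-line computation. A minimal sketch, where the argument names are ours and the scale factor 100 compensates for the matching rate being a small fraction:

```python
def factor_h(rm, rs, rm_avg, rs_avg):
    """Eq. (3): fH = ((RM - avg(RM)) * 100 - (RS - avg(RS))) / 2.

    rm, rs: matching rate and speaking rate of the SU;
    rm_avg, rs_avg: course-wide averages of RM and RS.
    """
    return ((rm - rm_avg) * 100.0 - (rs - rs_avg)) / 2.0
```

An SU with an above-average matching rate and a below-average speaking rate, i.e. the intentional slow-down pattern, yields a positive fH.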
4.2.3 Overview Bonus

Many MOOCs are designed for the purpose of popularizing science. A lecture in such a course is very likely an initial introduction to a specific topic: not academically advanced, but covering as many subtopics as possible. For example, a lecture giving a first glance at programming languages may introduce C, C++, Java, Python, etc., separately and briefly. For these lectures, there is often an overview slide placed at the beginning of the corresponding video, which works as an abstract and is actually the most important part of the video. In our approach, if a video clip is short (less than 10 minutes), contains only a few slide pages (less than 10) and its first slide is independent, i.e., its title is discontinuous with that of the second slide, then we acknowledge the first slide of this video as an overview page and give the corresponding SU a bonus (BO).
We can now summarize all three factors into a final importance value of the T-SU: VT = fE + λ × fH + BO + µ, where λ is a weight adjusting the influence of fH and µ is a course-based fixed offset keeping VT always positive. We expect VT to be larger for SUs of higher importance.
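The overview-page heuristic and the combination into VT can be sketched as follows. The defaults λ = 0.1 and µ = 3.5 are the values later reported in Section 4.6; the magnitude of the bonus BO is our placeholder, since its value is not fixed in the text:

```python
def overview_bonus(video_minutes, n_slides, first_slide_independent, bonus=1.0):
    """Heuristic of Sec. 4.2.3: short clip, few slides, standalone first slide.

    `bonus` is a placeholder magnitude; the paper does not fix BO's value here.
    """
    if video_minutes < 10 and n_slides < 10 and first_slide_independent:
        return bonus
    return 0.0

def importance_t_su(f_e, f_h, bo, lam=0.1, mu=3.5):
    """VT = fE + lam * fH + BO + mu."""
    return f_e + lam * f_h + bo + mu
```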
4.3 Importance Analysis of NT-SU

In an NT-SU, the slide structure is generally simple: a title and a full-page illustration, which might be a chart, a diagram, an image or, in IT-related courses, a code block. Since the O-Words (WO) are meaningless in such slides, several features we used for T-SUs are no longer available, such as the explanation rate and the matching rate. We therefore adopt a simple measure here: the total amount of information contained in an NT-SU, which depends on the S-Words (WS). We suppose that if a full-page illustration is introduced for a key procedure of a technique or a significant exhibition of an important system, the lecturer will explain it in detail in the speech, logically yielding a large WS. Illustrations that the lecturer only briefly mentions in a few words are considered less important. We simply define the importance value of an NT-SU as VNT = WS.
4.4 Importance Analysis of HT-SU

The situation of an HT-SU lies between T-SU and NT-SU. With illustrations occupying half the page, there is still a considerable portion of text, which makes all SU parameters available. But since it is very difficult to quantify the proportion of information carried by text versus illustrations, the explanation rate becomes much less convincing. Alternatively, we use the average item duration (dI) as a measurement of how thoroughly the lecturer teaches within the given HT-SU. Each illustration is counted as one additional item on the slide.
On the other hand, similar to NT-SUs, we also suggest that the importance of an HT-SU is positively related to the amount of information the lecturer gives, including both WS and WO. The importance value of an HT-SU is set as
VHT = (WS + WO − C) / 2 + dI    (4)
where C is the co-occurrence, which we intend to remove as redundancy. VHT should be large when the HT-SU is a key segment.
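The two remaining importance values are then straightforward. A minimal sketch of VNT and Eq. (4), with argument names of our choosing:

```python
def importance_nt_su(w_s):
    """VNT = WS: total number of S-Words in the speech of the NT-SU."""
    return float(w_s)

def importance_ht_su(w_s, w_o, c, d_i):
    """Eq. (4): VHT = (WS + WO - C) / 2 + dI,
    where C is the S-Word/O-Word co-occurrence removed as redundancy
    and dI is the average item duration."""
    return (w_s + w_o - c) / 2.0 + d_i
```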
4.5 Ground-Truth Acquisition

In order to evaluate whether the segments highlighted by our approach are correct, we added survey questions to the self-tests of the MOOC "Web Technologies," a 6-week course instructed in English on the openHPI platform7. 10022 learners enrolled in this course during the opening time, 1328 participants
7. https://open.hpi.de/courses/webtech2015
Fig. 7. General trend illustration for all 3 types of SUs: (a) T-SU, (b) NT-SU, (c) HT-SU.
took the final exam and 1179 of them successfully earned the certificate.
We set survey questions in 43 video clips with a total length of 632 minutes; 348 SUs were obtained automatically from these videos. In the survey we asked learners to select the one segment they considered most important in the corresponding lecture video. Over 5000 replies were received for the first video and, despite users dropping out, over 1000 users still took part in the survey of the last video.
For the i-th SU in a video with n SUs in total, if ui users choose it as the most important segment, its basic importance factor IFi is set as

IFi = (ui / Σj uj) × n,  j = 1, ..., n    (5)
This calculation lets the importance factor better represent how important the corresponding SU is from the users' point of view. It is based on the proportion of users who select a certain SU as most important, not on the absolute number, which avoids the negative influence of the varying number of survey participants. It is also related to the total number of SUs in the video: obviously, earning 33% of the votes in a 10-SU video is already high for an SU, but earning 33% in a 3-SU video is just average. This property is also well represented in (5). Mathematically, the average value of IFi over either a lecture or the whole course is 1.
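Eq. (5) can be sketched directly; the vote counts below are invented for illustration:

```python
def importance_factors(votes):
    """Eq. (5): IF_i = u_i / sum(u) * n for a video with n SUs.

    `votes` holds the number of users who picked each SU as most important.
    By construction the mean of the returned factors is 1.
    """
    n = len(votes)
    total = float(sum(votes))
    return [u / total * n for u in votes]
```

For example, 33% of the votes yields IF = 3.3 in a 10-SU video but only 0.99 in a 3-SU video, reflecting the scaling described above.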
Moreover, we also followed the discussion forum attached to the course. We believe that the important parts of a lecture prompt learners to ask more questions, so we counted the total number of questions related to each SU by content. Each related question earns a small bonus for the importance factor of the SU, and the bonus is balanced, since there are obviously more questions in the early stage of the course than in the later stage. If the i-th SU has qi related questions, the final importance factor IF'i is set to
IF'i = IFi + (qi / √(Σj uj)) × η    (6)
where η is a coefficient keeping the bonus value proper; it can only be set manually, based on how many forum threads are created. For "Web Technologies", η is set to 10, which makes each question worth 0.1–0.2. Since this coefficient is only used in the evaluation, it does not affect the automatic process of detecting highlights. In the end, IF'i is taken as the ground truth for the i-th SU in the following evaluation.
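Eqs. (5) and (6) combine into a short function. The helper name and its arguments are ours; η = 10 is the value reported for "Web Technologies":

```python
import math

def final_importance_factors(votes, questions, eta=10.0):
    """Eqs. (5)-(6): IF'_i = IF_i + q_i / sqrt(sum(u)) * eta.

    `votes`: users picking each SU as most important;
    `questions`: forum questions related to each SU by content.
    """
    n, total = len(votes), float(sum(votes))
    base = [u / total * n for u in votes]       # Eq. (5)
    bonus = eta / math.sqrt(total)              # worth of one related question
    return [f + q * bonus for f, q in zip(base, questions)]
```

With 2500 total votes, each question is worth 10/√2500 = 0.2, matching the 0.1–0.2 range mentioned above.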
Fig. 8. A ratio of 1/k means that, when sorting all SUs by calculated importance in descending order, the top 1/k SUs are selected as "key segments". Precision generally increases as the selection ratio gets smaller.
TABLE 3
Precision Analysis on Top 1/6 Selection

        |  All Segments  |  Selected Key Segments (Top 1/6)
Type    |  Num    A-IF'  |  Num   A-IF'   Correct   Accuracy
--------|----------------|---------------------------------
T-SU    |  268    1.15   |   44   1.58      31       70.5%
NT-SU   |   42    0.82   |    7   1.40       5       71.4%
HT-SU   |   38    0.83   |    6   1.13       4       66.7%
All     |  348     -     |   57    -        40       70.2%
4.6 Evaluation

Based on the data collected from "Web Technologies," we set λ = 0.1 to keep the influences of fE and fH on the same level for T-SUs, and µ = 3.5 to shift all VT beyond zero. Since VT, VNT and VHT have different definitions, we first show their evaluation results separately. As shown in Fig. 7, with the calculated importance value (VT, VNT or VHT) on the x-axis and the ground-truth importance IF' on the y-axis, the ascending trend, i.e., the positive relation between the two, is sound.
More specifically, for T-SUs, which form the majority of all SUs, the hypothesis from Section 4.2.2 (fH) does not meet our initial expectations on its own, but it is effective at eliminating T-SUs with a low matching rate and a high speaking rate. The assumption about the explanation rate in Section 4.2.1 acts as the foundation of the final result, and the overview bonus of Section 4.2.3 proves to be a positive boost.
After the general evaluation, we also turn to a potential application. If we simply sort all SUs in descending order of their calculated importance value, select a certain portion from the top as the highlighted segments and offer them to the learners,
precision is the vital measurement. We define correctness as follows: a highlighted segment is correct if its ground-truth IF' is greater than 1. Again, the three types of SUs are treated separately; taking T-SUs as an example, Fig. 8 shows the precision as the selection ratio changes.
We further take 1/6 as the selection ratio, just as in sentence-level highlighting, to see more details. As listed in Table 3, the precisions for selected T-SUs, NT-SUs and HT-SUs are 70.5%, 71.4% and 66.7%, respectively. A-IF' denotes the average IF' of the SUs included. For every class, the average of the highlighted segments is clearly larger than the average of all segments. The overall precision is 70.2%, which is quite promising.
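The selection-and-precision procedure described above can be sketched as follows; function and argument names are ours:

```python
def precision_at_ratio(importance, ground_truth, k=6):
    """Sort SUs by computed importance (descending), select the top 1/k,
    and count a selected SU as correct when its ground-truth IF' > 1."""
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    selected = order[: max(1, len(order) // k)]
    correct = sum(1 for i in selected if ground_truth[i] > 1.0)
    return correct / len(selected)
```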
5 POTENTIAL CONNECTION BETWEEN KEY SENTENCES AND SEGMENTS
Since we achieved fairly positive results in both sentence-level and segment-level lecture highlighting, it is natural to ask: is there any connection between the highlighted key sentences and key segments? To find out, we ran the sentence-level analysis introduced in Section 3 on all the test data used in Section 4. The sentence units are assigned to the slide units based on their time tags, on average 18 to 1, and the following features are collected per SU:
• The normalized mean acoustic importance value. First we obtain the acoustic importance value for each sentence unit and then calculate a simple mean value (Ai for the i-th SU in the lecture), with every sentence unit weighted equally. Since the importance values of SUs from different lectures may differ considerably, due to possible differences in recording conditions or lecturer status, we cannot simply use the absolute value of Ai as the measurement. Instead, we further calculate a lecture-average acoustic importance value Ā and take the normalized value Ai/Ā.
• The standard deviation of the acoustic importance value, calculated exactly according to the mathematical definition and denoted Di.
• Highlighting rate: the simple ratio of highlighted sentence units among all sentence units, denoted Ri.
If there were a positive relation between key segments and key sentences, a highlighted segment should be supported by more highlighted sentence units, i.e., larger values of Ai and Ri. Meanwhile, emphasis needs contrast between peaks and troughs and would result in a larger value of Di. Again we use static coefficients to balance the influences of these features based on their course-range average values, following the principle introduced in Section 3.3: we sum them up as VA = Ai + λ × Di + Ri + µ, with λ = 10 and µ = −1.5, and plot the sum in Fig. 9 against the ground-truth importance factor IF'. However, the data points are in general randomly distributed and no ascending trend line can be found. Based on the data from "Web Technologies," we must conclude that there is no evidence supporting the theory that key segments are constructed from key sentences.
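The acoustic summary value and the trend check can be sketched as follows. The defaults λ = 10 and µ = −1.5 are the values stated above; the Pearson correlation is our stand-in for the paper's visual trend-line inspection of Fig. 9, with values near zero indicating the absence of a linear trend:

```python
import numpy as np

def acoustic_value(a_norm, d, r, lam=10.0, mu=-1.5):
    """VA = Ai + lam * Di + Ri + mu, with Ai already lecture-normalized."""
    return a_norm + lam * d + r + mu

def trend_strength(va_values, ground_truth):
    """Pearson correlation between VA and IF'. Values near 0 mean no
    linear trend, as observed for the "Web Technologies" data."""
    return float(np.corrcoef(va_values, ground_truth)[0, 1])
```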
Although this result is not ideal, it is actually logical. Acoustically selected key sentences derive from acoustic prominences. Whether it concerns loudness, tone or speaking rate, a prominence is a short-time phenomenon and focuses
Fig. 9. The distribution of data points and simulated linear trend line ofacoustic importance value and ground-truth importance factor.
only on the local context. Generally, it comes from the lecturer's subconscious reaction when attempting to draw attention.
A segment, however, consists on average of 18 sentence units and lasts 109 seconds in "Web Technologies." If a speaker consistently produced acoustic prominences over such a long period, the delivery would sound overexcited and quickly lead to fatigue, which is exactly what an experienced teacher wants to avoid in a lecture. It is therefore understandable that key segments are not acoustically significant and thus not positively related to key sentences. Key segments instead mainly depend on a high explanation rate, a large amount of information and the overview bonus, as explained in Section 4.6. All of these are structural elements and originate from thoughtful decisions made while the teacher prepares the lecture beforehand.
As introduced in Section 1, although both sentence-level and segment-level lecture highlighting aim to improve the online learning experience, their detailed goals differ. Sentence highlighting in subtitles, or in other kinds of real-time transcripts, works as a reminder to keep learners focused. Highlighted segments act more like a selector, giving learners better navigation. From this point of view, the acoustic-based key sentences and the structure-related key segments can accomplish their tasks separately.
6 CONCLUSION
In this paper we proposed to highlight online lectures at both the sentence and the segment level. In the sentence-level approach, we decompose the audio signal into voiced/unvoiced sounds and apply a new scheme to integrate different acoustic features into an importance value. Sentences with larger importance values are highlighted and embedded in subtitle files. Based on expert-generated ground truth, the general precision reached 63.6%, better than the baselines. Example demonstrations and user feedback also support the outcome.
Segment-level highlighting is based on a novel form of statistical analysis. Slide transitions are used to create SUs, which are further divided into three types according to the slide
content. Taking structural features such as the explanation rate, matching rate and speaking rate as measurements, the connection between lecture speech and slides is explored and the segment-level importance value is calculated. Evaluation based on massive user-created ground truth shows quite promising results, with the precision reaching 70.2%.
Besides these achievements, some limitations remain to be discussed. Our attempt to find a correlation between highlighted sentences and segments remains unsuccessful. The evaluation scale of the sentence-level approach is quite limited, owing to the difficulty of ground-truth creation; perhaps we could provide an interactive highlighting tool for voluntary users to collect ground truth from the users' perspective. At the segment level, experiments on different courses with different lecturers are needed to test the robustness of the proposed approach.
In the future, we intend to further improve the quality of the highlights. Since the majority of parameters in the approach are currently set heuristically, they can be further optimized as the research proceeds. An attempt with machine learning, especially deep learning, could also be made once more labelled data are available. If possible, we also want to quantitatively evaluate how the highlighted content actually helps online learners, by setting up experimental and control groups in a suitable context.
ACKNOWLEDGMENT
The authors would like to thank all the participants in the surveys, especially the experts who helped generate the ground truth for sentence-level highlighting.
REFERENCES
[1] K. E. Bell and J. E. Limber, "Reading skill, textbook marking, and course performance," Literacy Research and Instruction, vol. 49, no. 1, pp. 56–67, 2009.
[2] R. L. Fowler and A. S. Barker, "Effectiveness of highlighting for retention of text material," Journal of Applied Psychology, vol. 59, no. 3, p. 358, 1974.
[3] R. V. Hogg and J. Ledolter, Applied Statistics for Engineers and Physical Scientists. Macmillan New York, 1992, vol. 59.
[4] J. R. Huffman, R. D. Cruickshank, S. N. Jambhekar, J. Van Myers, and R. L. Collins, "Electronic book having highlighting feature," Sep. 2, 1997, US Patent 5,663,748.
[5] E. H. Chi, L. Hong, M. Gumbrecht, and S. K. Card, "ScentHighlights: highlighting conceptually-related sentences during reading," in Proceedings of the 10th International Conference on Intelligent User Interfaces. ACM, 2005, pp. 272–274.
[6] P. Scott, "Teacher talk and meaning making in science classrooms: A Vygotskian analysis and review," Studies in Science Education, vol. 32, no. 1, pp. 45–80, 1998.
[7] L. Pickering, "The role of tone choice in improving ITA communication in the classroom," TESOL Quarterly, pp. 233–255, 2001.
[8] G. Gibbs and M. Coffey, "The impact of training of university teachers on their teaching skills, their approach to teaching and the approach to learning of their students," Active Learning in Higher Education, vol. 5, no. 1, pp. 87–100, 2004.
[9] P. J. Guo, J. Kim, and R. Rubin, "How video production affects student engagement: An empirical study of MOOC videos," in Proceedings of the First ACM Conference on Learning@Scale. ACM, 2014, pp. 41–50.
[10] J. Kim, P. J. Guo, D. T. Seaton, P. Mitros, K. Z. Gajos, and R. C. Miller, "Understanding in-video dropouts and interaction peaks in online lecture videos," in Proceedings of the First ACM Conference on Learning@Scale. ACM, 2014, pp. 31–40.
[11] A. Kurch, N. Mälzer, and K. Münch, "Qualitätsstudie zu Live-Untertitelungen – am Beispiel des "TV-Duells"," 2015.
[12] A. Ng and J. Widom, "Origins of the modern MOOC (xMOOC)," in F. M. Hollands and D. Tirthali (Eds.), MOOCs: Expectations and Reality: Full Report, pp. 34–47, 2014.
[13] N. Mamgain, A. Sharma, and P. Goyal, "Learner's perspective on video-viewing features offered by MOOC providers: Coursera and edX," in MOOC, Innovation and Technology in Education (MITE), 2014 IEEE International Conference on. IEEE, 2014, pp. 331–336.
[14] F. Grünewald, H. Yang, E. Mazandarani, M. Bauer, and C. Meinel, "Next generation tele-teaching: Latest recording technology, user engagement and automatic metadata retrieval," in Human Factors in Computing and Informatics. Springer, 2013, pp. 391–408.
[15] H. Yang, C. Oehlke, and C. Meinel, "An automated analysis and indexing framework for lecture video portal," in International Conference on Web-Based Learning. Springer, 2012, pp. 285–294.
[16] B. Arons, "Pitch-based emphasis detection for segmenting speech recordings," in ICSLP, 1994.
[17] F. R. Chen and M. Withgott, "The use of emphasis to automatically summarize a spoken discourse," in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, vol. 1. IEEE, 1992, pp. 229–232.
[18] K. E. A. Silverman, "The structure and processing of fundamental frequency contours," Ph.D. dissertation, University of Cambridge, 1987.
[19] J. Hirschberg and B. Grosz, "Intonational features of local and global discourse structure," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 441–446.
[20] G. Kochanski, E. Grabe, J. Coleman, and B. Rosner, "Loudness predicts prominence: Fundamental frequency lends little," The Journal of the Acoustical Society of America, vol. 118, no. 2, pp. 1038–1054, 2005.
[21] R. Silipo and S. Greenberg, "Automatic transcription of prosodic stress for spontaneous English discourse," in Proc. of the XIVth International Congress of Phonetic Sciences (ICPhS), vol. 3, 1999, p. 2351.
[22] F. Tamburini, "Automatic prosodic prominence detection in speech using acoustic features: an unsupervised system," in INTERSPEECH, 2003.
[23] G. Christodoulides and M. Avanzi, "An evaluation of machine learning methods for prominence detection in French," in INTERSPEECH, 2014, pp. 116–119.
[24] M. Heldner, E. Strangert, and T. Deschamps, "A focus detector using overall intensity and high frequency emphasis," in Proc. of ICPhS, vol. 99, 1999, pp. 1491–1494.
[25] R. Fernandez and B. Ramabhadran, "Automatic exploration of corpus-specific properties for expressive text-to-speech: A case study in emphasis," in 6th ISCA Workshop on Speech
Synthesis, 2007.
[26] J. M. Brenier, D. M. Cer, and D. Jurafsky, "The detection of emphatic words using acoustic and lexical features," in INTERSPEECH, 2005, pp. 3297–3300.
[27] S. Kakouros, J. Pelemans, L. Verwimp, P. Wambacq, and O. Räsänen, "Analyzing the contribution of top-down lexical and bottom-up acoustic cues in the detection of sentence prominence," Interspeech 2016, pp. 1074–1078, 2016.
[28] L. S. Kennedy and D. P. Ellis, "Pitch-based emphasis detection for characterization of meeting recordings," in Automatic Speech Recognition and Understanding, 2003. ASRU'03. 2003 IEEE Workshop on. IEEE, 2003, pp. 243–248.
[29] B. Wrede and E. Shriberg, "Spotting "hot spots" in meetings: human judgments and prosodic cues," in INTERSPEECH, 2003.
[30] F. Dellaert, T. Polzin, and A. Waibel, "Recognizing emotion in speech," in Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, vol. 3. IEEE, 1996, pp. 1970–1973.
[31] S. G. Koolagudi and K. S. Rao, "Emotion recognition from speech: a review," International Journal of Speech Technology, vol. 15, no. 2, pp. 99–117, 2012.
[32] L.-c. Yang and N. Campbell, "Linking form to meaning: the expression and recognition of emotions through prosody," in 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, 2001.
[33] O. Niebuhr, "On the phonetics of intensifying emphasis in German," Phonetica, vol. 67, no. 3, pp. 170–198, 2010.
[34] D. Ververidis and C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Communication, vol. 48, no. 9, pp. 1162–1181, 2006.
[35] D. Erickson, O. Fujimura, and B. Pardo, "Articulatory correlates of prosodic control: Emotion and emphasis," Language and Speech, vol. 41, no. 3-4, pp. 399–417, 1998.
[36] A. Pentland, "Social dynamics: Signals and behavior," in International Conference on Developmental Learning, Salk Institute, San Diego, CA, 2004.
[37] W. T. Stoltzman, "Toward a social signaling framework: Activity and emphasis in speech," Ph.D. dissertation, Massachusetts Institute of Technology, 2006.
[38] M. R. Barrick and M. K. Mount, "The big five personality dimensions and job performance: a meta-analysis," Personnel Psychology, vol. 44, no. 1, pp. 1–26, 1991.
[39] Y. R. Tausczik and J. W. Pennebaker, "The psychological meaning of words: LIWC and computerized text analysis methods," Journal of Language and Social Psychology, vol. 29, no. 1, pp. 24–54, 2010.
[40] Y. Wang and A. Pal, "Detecting emotions in social media: A constrained optimization approach," in IJCAI, 2015, pp. 996–1002.
[41] N. H. Bach, K. Shinoda, and S. Furui, "Robust highlight extraction using multi-stream hidden Markov models for baseball video," in IEEE International Conference on Image Processing 2005, vol. 3. IEEE, 2005, pp. III–173.
[42] Y.-F. Huang and W.-C. Chen, "Rushes video summarization by audio-filtering visual features," International Journal of Machine Learning and Computing, vol. 4, no. 4, p. 359, 2014.
[43] Y. Zheng, G. Zhu, S. Jiang, Q. Huang, and W. Gao, "Visual-aural attention modeling for talk show video highlight detection," in 2008 IEEE International Conference on Acoustics,
Speech and Signal Processing. IEEE, 2008, pp. 2213–2216.
[44] F. Wang and B. Merialdo, "Multi-document video summarization," in 2009 IEEE International Conference on Multimedia and Expo. IEEE, 2009, pp. 1326–1329.
[45] S. Lu, Z. Wang, T. Mei, G. Guan, and D. D. Feng, "A bag-of-importance model with locality-constrained coding based feature learning for video summarization," IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1497–1509, 2014.
[46] H.-P. Chou, J.-M. Wang, C.-S. Fuh, S.-C. Lin, and S.-W. Chen, "Automated lecture recording system," in System Science and Engineering (ICSSE), 2010 International Conference on. IEEE, 2010, pp. 167–172.
[47] A. R. Ram and S. Chaudhuri, "Media for distance education," in Video Analysis and Repackaging for Distance Education. Springer, 2012, pp. 1–9.
[48] L. He, E. Sanocki, A. Gupta, and J. Grudin, "Auto-summarization of audio-video presentations," in Proceedings of the Seventh ACM International Conference on Multimedia (Part 1). ACM, 1999, pp. 489–498.
[49] C. M. Taskiran, Z. Pizlo, A. Amir, D. Ponceleon, and E. J. Delp, "Automated video program summarization using speech transcripts," IEEE Transactions on Multimedia, vol. 8, no. 4, pp. 775–791, 2006.
[50] T. Beaven, A. Comas-Quinn, M. Hauck, B. de los Arcos, and T. Lewis, "The open translation MOOC: creating online communities to transcend linguistic barriers," Journal of Interactive Media in Education, vol. 2013, no. 3, 2013.
[51] X. Che, S. Luo, C. Wang, and C. Meinel, "An attempt at MOOC localization for Chinese-speaking users," International Journal of Information and Education Technology, vol. 6, no. 2, p. 90, 2016.
[52] X. Che, C. Wang, H. Yang, and C. Meinel, "Punctuation prediction for unsegmented transcript based on word vector," in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 2016.
[53] X. Che, S. Luo, H. Yang, and C. Meinel, "Sentence boundary detection based on parallel lexical and acoustic models," Interspeech 2016, pp. 2528–2532, 2016.
[54] D. Garcia, M. Ball, and A. Parikh, "L@S 2014 demo: best practices for MOOC video," in Proceedings of the First ACM Conference on Learning@Scale. ACM, 2014, pp. 217–218.
[55] W. Krauth, "Coming home from a MOOC," Computing in Science & Engineering, vol. 17, no. 2, pp. 91–95, 2015.
[56] Y. Qi and B. R. Hunt, "Voiced-unvoiced-silence classifications of speech using hybrid features and a network classifier," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 2, pp. 250–255, 1993.
[57] B. Atal and L. Rabiner, "A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 3, pp. 201–212, 1976.
[58] H. Deng and D. O'Shaughnessy, "Voiced-unvoiced-silence speech sound classification based on unsupervised learning," in 2007 IEEE International Conference on Multimedia and Expo. IEEE, 2007, pp. 176–179.
[59] R. Bachu, S. Kopparthi, B. Adapa, and B. D. Barkana, "Voiced/unvoiced decision for speech signals based on zero-crossing rate and energy," in Advanced Techniques in Computing Sciences and Software Engineering. Springer, 2010,
pp. 279–282.
[60] O. Luzanova, "Means of speech for the creation of a positive image of man in a panegyric discourse. Typical deviations in English speech made by non-native speakers (considering English personal advertisements)," 2014.
[61] A. Crines and T. Heppell, "Rhetorical style and issue emphasis within the conference speeches of UKIP's Nigel Farage 2010–2014," British Politics, 2016.
[62] H. Pedrosa-de Jesus and B. da Silva Lopes, "Exploring the relationship between teaching and learning conceptions and questioning practices, towards academic development," Higher Education Research Network Journal, p. 37, 2012.
[63] M. Li and X. Jiang, "Art appreciation instruction and changes of classroom questioning at senior secondary school in visual culture context," Cross-Cultural Communication, vol. 11, no. 1, p. 43, 2015.
[64] F. D. Davis, "Perceived usefulness, perceived ease of use, and user acceptance of information technology," MIS Quarterly, pp. 319–340, 1989.
[65] M. Onishi, M. Izumi, and K. Fukunaga, "Blackboard segmentation using video image of lecture and its applications," in Pattern Recognition, 2000. Proceedings. 15th International Conference on, vol. 4. IEEE, 2000, pp. 615–618.
[66] M. Lin, J. F. Nunamaker Jr, M. Chau, and H. Chen, "Segmentation of lecture videos based on text: a method combining multiple linguistic features," in System Sciences, 2004. Proceedings of the 37th Annual Hawaii International Conference on. IEEE, 2004, pp. 9–pp.
[67] T. Tuna, M. Joshi, V. Varghese, R. Deshpande, J. Subhlok, and R. Verma, "Topic based segmentation of classroom videos," in Frontiers in Education Conference (FIE), 2015. IEEE, 2015, pp. 1–9.
[68] I. Koprinska and S. Carrato, "Temporal video segmentation: A survey," Signal Processing: Image Communication, vol. 16, no. 5, pp. 477–500, 2001.
[69] A. Hill, T. Arford, A. Lubitow, and L. M. Smollin, ""I'm ambivalent about it": The dilemmas of PowerPoint," Teaching Sociology, vol. 40, no. 3, pp. 242–256, 2012.
[70] D. G. Levasseur and J. Kanan Sawyer, "Pedagogy meets PowerPoint: A research review of the effects of computer-generated slides in the classroom," The Review of Communication, vol. 6, no. 1-2, pp. 101–123, 2006.
[71] K. Li, J. Wang, H. Wang, and Q. Dai, "Structuring lecture videos by automatic projection screen localization and analysis," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 37, no. 6, pp. 1233–1246, 2015.
[72] H. J. Jeong, T.-E. Kim, H. G. Kim, and M. H. Kim, "Automatic detection of slide transitions in lecture videos," Multimedia Tools and Applications, vol. 74, no. 18, pp. 7537–7554, 2015.
[73] H. Yang and C. Meinel, "Content based lecture video retrieval using speech and video text information," IEEE Transactions on Learning Technologies, vol. 7, no. 2, pp. 142–154, 2014.
[74] X. Che, H. Yang, and C. Meinel, "Adaptive e-lecture video outline extraction based on slides analysis," in Advances in Web-Based Learning – ICWL 2015. Springer, 2015, pp. 59–68.
[75] ——, "Lecture video segmentation by automatically analyzing the synchronized slides," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp.
345–348.[76] U. Natke, J. Grosser, and K. T. Kalveram, “Fluency, funda-
mental frequency, and speech rate under frequency-shiftedauditory feedback in stuttering and nonstuttering persons,”Journal of Fluency Disorders, vol. 26, no. 3, pp. 227–241,2001.
[77] H. Quene, “Multilevel modeling of between-speaker andwithin-speaker variation in spontaneous speech tempo,” TheJournal of the Acoustical Society of America, vol. 123, no. 2,pp. 1104–1113, 2008.
[78] D. O’Shaughnessy, “Timing patterns in fluent and disfluentspontaneous speech,” in Acoustics, Speech, and Signal Pro-cessing, 1995. ICASSP-95., 1995 International Conferenceon, vol. 1. IEEE, 1995, pp. 600–603.
Xiaoyin Che was born on January 2, 1987 in Beijing, China. He entered university in 2005 and received his bachelor degree from the College of Computer Science, Beijing University of Technology (BJUT) in 2009, majoring in computer science and technology. He then started his master program in the Multimedia and Intelligent Software Technology Laboratory, College of Computer Science, BJUT, and received the degree in 2012. He is currently a PhD student in the Chair of Internet Technologies and Systems, Hasso Plattner Institute, Potsdam, Germany. He previously researched video coding standards and image processing. His current research interests include multimedia analysis, natural language processing and their applications in e-Learning.
Haojin Yang received the Diploma Engineering degree from the Technical University Ilmenau, Germany, in 2008. In 2013, he received the doctorate degree from the Hasso-Plattner-Institute for IT-Systems Engineering (HPI) at the University of Potsdam. His current research interests revolve around multimedia analysis, information retrieval, computer vision and deep learning technology.
Christoph Meinel was born in 1954. He studied mathematics and computer science at the Humboldt University of Berlin from 1974 to 1979. In 1981 he received his PhD degree with the title Dr. rer. nat. From 1981 to 1991 he served as a research assistant at Humboldt University and at the Institute for Mathematics at the Berlin Academy of Sciences. He completed his habilitation in 1988, earning the title Dr. sc. nat.
He has been the Scientific Director and CEO of the Hasso Plattner Institute for Software Systems Engineering GmbH (HPI), Potsdam, Germany, since 2004. Previously he worked at the University of Saarbrücken and the University of Paderborn, and became full professor (C4) for computer science at the University of Trier. His areas of research focus on Internet and information security, Web 3.0, the Semantic Web, the social and service Web, and the domains of e-learning, tele-teaching and tele-medicine. Christoph Meinel is author/co-author of 9 books and 4 anthologies, as well as editor of various conference proceedings. More than 400 of his papers have been published in high-profile scientific journals and at international conferences. Prof. Meinel is a member of acatech, the German "National Academy of Science and Engineering", and also a member of the IEEE.
10 Medical Image Semantic Segmentation
In this work, we introduce a fully automatic conditional generative adversarial network (cGAN) for medical image semantic segmentation. The proposed framework consists of three components: a generator, a discriminator and a refinement model. The three models are trained jointly, and the final segmentation masks are composed from the outputs of all three models. Our experimental results show that the proposed framework can be successfully applied to different types of medical images of varied sizes.
10.1 Contribution to the Work
• Contributed to the formulation and implementation of the research ideas
• Significantly contributed to the conceptual discussion and implementation
• Guided and supervised the technical implementation
10.2 Manuscript
Multimedia Tools and Applications
https://doi.org/10.1007/s11042-019-7305-1
Recurrent generative adversarial network for learning imbalanced medical image semantic segmentation
Mina Rezaei · Haojin Yang · Christoph Meinel
Received: 1 October 2018 / Revised: 18 December 2018 / Accepted: 29 January 2019 /
© Springer Science+Business Media, LLC, part of Springer Nature 2019
Abstract
We propose a new recurrent generative adversarial architecture, named RNN-GAN, to mitigate the imbalanced-data problem in medical image semantic segmentation, where the number of pixels belonging to the desired object is significantly lower than the number belonging to the background. A model trained with imbalanced data tends to be biased towards healthy data, which is not desired in clinical applications, and the outputs predicted by such networks have high precision but low recall. To mitigate the impact of imbalanced training data, we train RNN-GAN with the proposed complementary segmentation masks in addition to the ordinary segmentation masks. RNN-GAN consists of two components: a generator and a discriminator. The generator is trained on sequences of medical images to learn the corresponding segmentation label map plus the proposed complementary label, both at the pixel level, while the discriminator is trained to distinguish a segmentation image coming from the ground truth from one produced by the generator network. Both the generator and the discriminator are equipped with bidirectional LSTM units to enhance temporal consistency and capture inter- and intra-slice representations of the features. We show evidence that the proposed framework is applicable to different types of medical images of varied sizes. In our experiments on the ACDC-2017, HVSMR-2016, and LiTS-2017 benchmarks we find consistently improved results, demonstrating the efficacy of our approach.
Keywords Imbalanced medical image semantic segmentation · Recurrent generative adversarial network
1 Introduction
Medical imaging plays an important role in disease diagnosis, treatment planning, and clinical monitoring [4, 24]. One of the major challenges in medical image analysis is imbalanced training data, where the desired class pixels (lesion or body organ) are often much lower in number than non-lesion pixels. A model learned from class-imbalanced training data is biased towards the majority class, and the predictions of such networks have low sensitivity, i.e., a reduced ability to correctly predict the non-healthy classes. In medical applications, the cost of misclassifying the minority class can be considerably higher than the cost of misclassifying the majority class; for example, the risk of not detecting a tumor is much higher than the risk of referring a healthy subject to a doctor.
Corresponding author: Mina Rezaei (mina.rezaei@hpi.de). Extended author information is available on the last page of the article.
The class-imbalance problem has recently been addressed in disease classification, tumor localization, and tumor segmentation. Two types of approaches have been proposed in the literature: data-level approaches and algorithm-level approaches.
At the data level, the objective is to balance the class distribution by re-sampling the data space [35, 52], either by over-sampling the positive class with SMOTE (Synthetic Minority Over-sampling Technique) [10] or by under-sampling the negative class [23]. However, these approaches often remove important samples from, or add redundant samples to, the training set.
Algorithm-level solutions address the class-imbalance problem by modifying the learning algorithm to alleviate the bias towards the majority class. Examples are cascaded training [8, 11] and training with cost-sensitive functions [47], such as the Dice coefficient loss [11, 13, 41] and the asymmetric similarity loss [18], which reweight the training distribution with regard to the misclassification cost.
In this paper, we mitigate the problem of imbalanced training samples on both levels. At the data level, we explore the advantage of training the network with inverse-class-frequency segmentation masks, named complementary segmentation masks, in addition to the ground-truth segmentation masks (ordinary masks), which can then be used to improve the overall quality of the predicted segmentation. Assume Y is the true segmentation label annotated by an expert and Ȳ is the synthesized complementary label paired with the corresponding image. In the complementary mask Ȳ, the majority and minority pixel values are swapped to skew the bias away from the majority pixels: the major class receives the negative label and the remaining c − 1 classes receive a positive label. Our network is then trained with both the ordinary segmentation mask Y and the complementary segmentation mask Ȳ at the same time, but with multiple losses. The final segmentation masks are refined by considering both the ordinary and the complementary mask predictions.
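For the binary case, the complementary-mask construction described above can be illustrated with a small numpy sketch (toy data and our own variable names, not the authors' code): the minority and majority labels are swapped so that the former background becomes the positive class.

```python
import numpy as np

# Toy ground-truth ("ordinary") mask: 1 = lesion (minority), 0 = background (majority).
ordinary = np.array([[0, 0, 0, 0],
                     [0, 1, 1, 0],
                     [0, 0, 0, 0]])

# Complementary mask: swap majority and minority labels, so the network also
# receives a target in which the (former) background is the positive class.
complementary = 1 - ordinary

# The positive-class frequency is inverted between the two masks.
pos_ratio_ordinary = (ordinary == 1).mean()
pos_ratio_complement = (complementary == 1).mean()
```

Training against both targets at once gives the otherwise-dominant background pixels a loss term in which they are the "rare" class, which is the bias-skewing effect the text describes.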
At the algorithm level, we study the advantage of mixing an adversarial loss with a categorical accuracy loss, compared to traditional losses such as the ℓ1 loss. Image segmentation is an important task in medical imaging that attempts to identify the exact boundaries of objects such as organs or abnormal regions (e.g. tumors). Automating medical image segmentation is a challenging task due to the high diversity in the appearance of tissues among different patients and, in many cases, the similarity between healthy and non-healthy tissues. Numerous automatic approaches have been developed to speed up medical image segmentation [32]. We can roughly divide the current automated algorithms into two categories: those based on generative models and those based on discriminative models.
Generative probabilistic approaches build the model based on prior domain knowledge about the appearance and spatial distribution of the different tissue types. Traditionally, generative probabilistic models have been popular, using simple conditionally independent Gaussian models [14] or Bayesian learning [33] for tissue appearance. On the contrary, discriminative probabilistic models directly learn the relationship between the local features of images [3] and segmentation labels without any domain knowledge. Traditional discriminative approaches such as SVMs [2, 9], random forests [27], and guided random walks [12] have been used in medical image segmentation. Deep neural networks (DNNs) are one of the most popular discriminative approaches, where the machine learns a hierarchical representation of features without any handcrafted features [26, 51]. In the field of medical image segmentation, Ronneberger et al. [38] presented a fully convolutional neural network, named UNet, for segmenting neuronal structures in electron microscopic stacks.
Recently, GANs [15] have gained a lot of momentum in the research community. Mirza et al. [28] extended the GAN framework to the conditional setting by making both the generator and the discriminator network class-conditional. Conditional GANs (cGANs) have the advantage of being able to provide better representations for multi-modal data generation, since there is control over the modes of the data being generated. This makes cGANs suitable for the image semantic segmentation task, where we condition on an observed image and generate a corresponding output image.
Unlike previous works on cGANs [22, 29, 48], we translate a 2D sequence of medical images into a 2D sequence of semantic segmentations. In our method, 3D bio-medical images are represented as a sequence of 2D slices (i.e. as z-stacks). We use bidirectional LSTM units [16], which are an extension of classical LSTMs and are able to improve model performance on sequence processing by enhancing temporal consistency. We use time distribution between the convolutional layers and bidirectional LSTM units at the bottleneck of the generator and the discriminator to capture inter- and intra-slice representations of the features.
Summarizing, the main contributions of this paper are:
– We introduce RNN-GAN, a new adversarial framework that improves semantic segmentation accuracy. The proposed architecture shows promising results for the segmentation of small lesions as well as anatomical regions.
– Our proposed method mitigates imbalanced training data with biased complementary masks in the task of semantic segmentation.
– We study the effect of different losses and architectural choices that improve semantic segmentation.
The rest of the paper is organized as follows: in the next section, we review recent methods for handling imbalanced training data and semantic segmentation tasks. Section 3 explains the proposed approach for semantic segmentation, while detailed experimental results are presented in Section 4. We conclude the paper and give an outlook on future research in Section 5.
2 Related work
This section briefly reviews previous studies, mostly from recent years, in the areas of learning from imbalanced datasets, generative adversarial networks, and medical image semantic segmentation.
Handling imbalanced training datasets. Cascade architectures [8] and ensemble approaches [43] have provided the best performance on highly imbalanced medical datasets like LiTS-2017 for the segmentation of very small lesions. Some works have focused on balancing recall and precision with an asymmetric loss [18]; others used an accuracy loss [41] or weighted the imbalanced classes according to their frequency in the dataset [8, 36]. Similar to some recent work [39, 41], we mitigate the negative impact of class imbalance by mixing an adversarial loss with a categorical accuracy loss and training the deep model with complementary masks.
Learning with complementary labels. Recently, complementary labels have been used in the context of machine learning [21], assuming that the transition probabilities are identical and modifying traditional one-versus-all and pairwise-comparison losses for multi-class classification. Ishida et al. [21] theoretically prove that an unbiased estimator of the classification risk can be obtained from complementary labels. Yu et al. [50] study how learning from both complementary and ordinary labels can provide a useful application for the multi-class classification task. Inspired by these recent successes [21, 50], we train the proposed RNN-GAN with both complementary and ordinary labels for the task of semantic segmentation, in order to skew the bias away from the majority pixels.
Generative adversarial networks. Previous works [22, 54] show the success of conditional GANs as a general setting for image-to-image translation. Some recent works applied GANs unconditionally for image-to-image translation, forcing the generator to predict the desired output under ℓ1 [48] or ℓ2 [31, 53] regression. Here, we study mixing the adversarial loss in a conditional setting with a traditional loss and an accuracy loss, motivated by the goal of attenuating the effect of an imbalanced training dataset. Our method also differs from prior works [22, 25, 29, 55] in the architectural setting of the generator and the discriminator: we use bidirectional LSTM units on top of the generator and discriminator architectures to capture temporal consistency between 2D slices.
Medical image semantic segmentation. The UNet has achieved promising results in medical image segmentation [38], since it concatenates low-level features with high-level features, which provides a better learning representation. Later, UNet in combination with residual networks [6], or in a cascade of 2D and 3D networks [20], was used for cardiac image segmentation and for heterogeneous liver segmentation [8]. The generator network in RNN-GAN is a modified UNet in which high-resolution features are concatenated with up-sampled global low-resolution features to help the network learn both local and global information.
3 Method
In this section we present the recurrent generative adversarial network for medical image semantic segmentation. To tackle the misclassification cost and mitigate imbalanced pixel labels, we mix the adversarial loss with a categorical accuracy loss (Section 3.1). Moreover, we explain our intuition for skewing the bias away from the majority pixels with the proposed complementary labels (Section 3.2).
3.1 Recurrent generative adversarial network
In a conventional generative adversarial network, the generative model G tries to learn a mapping from a random noise vector z to an output image y; G : z → y. Meanwhile, a discriminative model D estimates the probability of a sample coming from the training data (x_real) rather than from the generator (x_fake). The GAN objective function is a two-player mini-max game with value function V(D, G):

$\min_G \max_D V(D, G) = \mathbb{E}_{y}[\log D(y)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]$  (1)
In a conditional GAN, the generative model learns the mapping from an observed image x and a random vector z to the output image y; G : x, z → y. On the other hand, D attempts to discriminate between the generator's output images and the training-set images. According to (2), in the cGAN training procedure both G and D are conditioned on the desired output y.

$\min_G \max_D V(D, G) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$  (2)
More specifically, in our proposed RNN-GAN, the generative model learns a mapping from a given sequence of 2D medical images x_i to the corresponding semantic segmentation labels y_i_seg; G : x_i, z → {y_i_seg} (where i refers to the 2D slice index, between 1 and 20, from a total of 20 slices acquired from ACDC-2017). The training procedure for the semantic segmentation task is the two-player mini-max game of (3). While the generator predicts the segmentation at the pixel level, the discriminator takes the ground truth and the generator's output to determine whether the predicted label map is real or fake.

$\mathcal{L}_{adv} \leftarrow \min_G \max_D V(D, G) = \mathbb{E}_{x,y_{seg}}[\log D(x, y_{seg})] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$  (3)
We mix the adversarial loss with the ℓ1 distance (4) to minimize the absolute difference between the predicted and the true values. Hence the ℓ1 objective function takes into account CNN features and the differences between the predicted segmentation and the ground truth, resulting in less noise and smoother boundaries.

$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,z}\big[\, \| y_{seg} - G(x, z) \|_1 \,\big]$  (4)
$\mathcal{L}_{acc}(G) = \frac{1}{c} \sum_{j=1}^{c} \sum_{i} \frac{y_{seg}^{ij} \cap G(x^{ij}, z)}{y_{seg}^{ij} \cup G(x^{ij}, z)}$  (5)
where j and i indicate the index of the semantic class and the index of the 2D slice for each patient, respectively.
Moreover, we mix in the categorical accuracy loss ℓ_acc (5) in order to mitigate imbalanced training data by assigning a higher cost to the less-represented set of pixels, boosting its importance during the learning process. The categorical accuracy loss checks whether the maximal true value is equal to the maximal predicted value for each category of the segmentation.
The final adversarial loss for the semantic segmentation task of RNN-GAN is then calculated through (6).

$\mathcal{L}_{RNN\text{-}GAN}(D, G) = \mathcal{L}_{adv}(D, G) + \mathcal{L}_{L1}(G) + \mathcal{L}_{acc}(G)$  (6)
In this work, similar to Isola et al. [22], we use Gaussian noise z in the generator alongside the input data x. As discussed by Isola et al. [22], when training a conditional generative model of the conditional distribution P(y|x), it is desirable that the trained model can produce more than one sample y for each input x. When the generator G takes the input image x plus a random vector z, then G(x, z) can generate as many different values for each x as there are values of z. Especially for medical image segmentation, the diversity of image acquisition methods (e.g., MRI, fMRI, CT, ultrasound), their settings (e.g., echo time, repetition time), geometry (2D vs. 3D), and differences in hardware (e.g., field strength, gradient performance) can result in variations in the appearance of body organs and tumor shapes [19]; thus learning a random vector z together with the input image x makes the network robust against noise and improves the output samples. This has been confirmed by our experimental results on datasets with a large range of variation.
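As a rough illustration of the two auxiliary terms in Eqs. (4) and (5), the following numpy sketch computes an ℓ1 term and an IoU-style per-class overlap standing in for the categorical accuracy check. The adversarial term and all network details are omitted, and the function names are ours, not from the paper's code.

```python
import numpy as np

def l1_loss(y_true, y_pred):
    # Eq. (4)-style term: mean absolute difference between prediction and truth.
    return np.abs(y_true - y_pred).mean()

def overlap_acc(y_true, y_pred, n_classes):
    # Eq. (5)-style term: per-class intersection over union of the binarized
    # masks, averaged over classes (hard labels stand in for the argmax check).
    total = 0.0
    for c in range(n_classes):
        t = (y_true == c)
        p = (y_pred == c)
        union = np.logical_or(t, p).sum()
        inter = np.logical_and(t, p).sum()
        total += inter / union if union else 1.0
    return total / n_classes

# Toy 1x4 label maps with two classes.
y_true = np.array([[0, 0, 1, 1]])
y_pred = np.array([[0, 1, 1, 1]])

# Mixed auxiliary penalty: l1 distance plus the overlap shortfall.
mixed = l1_loss(y_true, y_pred) + (1.0 - overlap_acc(y_true, y_pred, 2))
```

Because the overlap term is normalized per class, a rare foreground class contributes as much to the penalty as the abundant background, which is the re-weighting effect the paragraph above attributes to ℓ_acc.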
3.2 Complementary label
In order to mitigate the impact of imbalanced pixel labels in medical images, the proposed RNN-GAN, as described in Fig. 1, is trained with complementary masks (Fig. 2, third column) in addition to the ordinary masks (Fig. 2, columns 4-6). Similar to Yu et al. [50], we assume the transition probabilities are identical, so that the adversarial loss (i.e. the categorical cross-entropy loss) provides an unbiased estimator for minimizing the risk.

Fig. 1 The architecture of RNN-GAN consists of two deep networks: a generative network G and a discriminative network D. G takes a sequence of 2D images as a condition and generates the sequence of 2D semantic segmentation outputs; D determines whether those outputs are real or fake. RNN-GAN captures inter- and intra-slice feature representations with bidirectional LSTM units at the bottleneck of both the G and D networks. Here, G is a modified UNet architecture and D is a fully convolutional encoder.

Since we make the same assumption, we skip the theoretical proof and instead show experimentally that complementary labels, in addition to the ordinary losses, provide more accurate results for the semantic segmentation task.
3.3 Network architecture
The proposed architecture is shown in Fig. 1, with the generator network G on the left followed by the discriminator network D on the right side of the figure. We place bidirectional LSTM units at the bottleneck of both G and D to capture the non-linear relationships between the previous, current, and next 2D slices, which is key to processing sequential data.
3.3.1 Recurrent generator
The recurrent generator takes a random vector z plus a sequence of 2D medical images. Similar to the UNet architecture, we add skip connections between each layer r and the corresponding layer t − 1 − r, where t is the total number of layers; each skip connection simply concatenates all channels at layer r with those at layer t − 1 − r.

Fig. 2 Chest MR images from ACDC-2017 after pre-processing. The first column is the semantic segmentation mask corresponding to the MR images in the second column. Columns 3-6 present the complementary label mask, right ventricle, myocardium vessel, and left ventricle, where we map the 2D images from the second column into the four segmentation masks presented in columns 3-6.

Fig. 3 Cardiac MR images from ACDC-2017 after pre-processing; the left side shows an end-systolic sample and the right side an end-diastolic phase. We extract the complementary mask from the inverse of the ground-truth file annotated by a medical expert, presented in the second and seventh columns. The other binary masks extracted from the ground-truth file, in columns 3-5 and 8-10, are the right ventricle, myocardium vessel, and left ventricle respectively, which are used by the discriminator. The first and sixth columns are example inputs of the generator.

Feature maps from the convolution part in the down-sampling step are fed into the up-convolution part in the up-sampling step. The generator is trained on a sequence of input images from the same patient and the same acquisition plane. We use convolutional layers with kernel size 5 × 5 and stride 2 for down-sampling, and perform up-sampling by an image-resize layer with a factor of 2 followed by a convolutional layer with kernel size 3 × 3 and stride 1.
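The skip-connection pairing and shape bookkeeping described above can be sketched as follows; the layer count and feature shapes here are illustrative choices of ours, not values from the paper.

```python
import numpy as np

t = 8  # hypothetical total number of layers in the generator (our choice)
# UNet-style skips: encoder layer r is concatenated with decoder layer t - 1 - r.
skip_pairs = [(r, t - 1 - r) for r in range(t // 2)]

# Stride-2 down-sampling halves the spatial size; resize-by-2 restores it.
size = 256
down_sizes = [size // 2 ** k for k in range(4)]  # 256 -> 128 -> 64 -> 32

# Concatenating skip features along the channel axis doubles the channel count.
enc_feat = np.zeros((1, 32, 32, 64))  # encoder features at layer r
dec_feat = np.zeros((1, 32, 32, 64))  # upsampled decoder features at t - 1 - r
merged = np.concatenate([enc_feat, dec_feat], axis=-1)
```

The doubled channel count after concatenation is why each decoder stage needs a convolution to fuse the local (encoder) and global (decoder) information.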
3.3.2 Recurrent discriminator
The discriminator network is a classifier and has a similar structure to the encoder of the generator network. Hierarchical features are extracted by the fully convolutional encoder of the discriminator and used to classify between the generator's segmentation output and the ground truth. More specifically, the discriminator is trained to minimize the average negative cross-entropy between the predicted and the true labels.
The two models are then trained through back-propagation in a two-player mini-max game (see (3)). We use categorical cross-entropy [30] as the adversarial loss. In this work, the recurrent architecture selected for both the discriminator and the generator is a bidirectional LSTM [16].
4 Experiments
We validated the performance of RNN-GAN on three recent public medical imaging challenges with real patient data: the MICCAI 2017 automated cardiac MRI segmentation challenge (ACDC-2017) [5], the CT liver tumor segmentation challenge (LiTS-2017), and the 2016 whole-heart and great vessel segmentation challenge (HVSMR).
4.1 Datasets and pre-processing
Our experiments are based on three independent datasets: two cardiac MR image datasets and an abdominal CT dataset, all segmented manually by radiologists at the pixel level.
ACDC. The ACDC dataset1 comprises 150 patients with 3D cine-MR images acquired in clinical routine. The training database is composed of 100 patients, for all of whom the corresponding manual references were given by a clinical expert. The testing database consists of 50 patients without manual references. Figure 3 shows cardiac MR images from the ACDC dataset.
1 https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html
Fig. 4 Abdominal CT images from LiTS-2017. The first and second columns show a slice before and after pre-processing. Our generator takes the pre-processed slices (second column) and learns to map them to the third and fourth columns by getting feedback from the discriminator.
HVSMR. Thirty training cine-MRI scans from 10 patients were provided by the organizers of the HVSMR challenge.2 Three images were provided for each patient: a complete axial cine MRI, the same image cropped around the heart and the thoracic aorta, and a cropped short-axis reconstruction.
LiTS. In the third experiment, we used the LiTS-2017 benchmark,3 which comprises 130 CT training and 70 test subjects. The examined patients were suffering from different liver cancers. The challenging part is the segmentation of very small lesion targets in a highly unbalanced dataset. Here, pre-processing is carried out in a slice-wise fashion. We windowed the Hounsfield unit (HU) values to the range [100, 400] to exclude irrelevant organs and objects, as shown in Fig. 4. Furthermore, we applied histogram equalization to increase the contrast for better differentiation of abnormal liver tissue.
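The HU windowing and contrast-enhancement steps might look roughly like this numpy sketch; the window limits come from the text, while the equalization routine is a generic stand-in rather than the authors' implementation.

```python
import numpy as np

def window_hu(ct_slice, lo=100.0, hi=400.0):
    # Clip Hounsfield units to the [100, 400] window stated in the text,
    # then rescale the surviving range to [0, 1].
    clipped = np.clip(ct_slice, lo, hi)
    return (clipped - lo) / (hi - lo)

def hist_equalize(img, bins=256):
    # Plain histogram equalization on a [0, 1] image: map each intensity
    # through the empirical CDF to spread out the contrast.
    hist, edges = np.histogram(img.ravel(), bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum() / img.size
    return np.interp(img.ravel(), edges[:-1], cdf).reshape(img.shape)

# Toy 2x2 CT slice in HU: air, two soft-tissue values, dense bone.
slice_hu = np.array([[-1000.0, 150.0], [250.0, 900.0]])
windowed = window_hu(slice_hu)
equalized = hist_equalize(windowed)
```

Windowing maps everything below 100 HU to 0 and above 400 HU to 1, which is how irrelevant structures (air, bone) are suppressed before the network sees the slice.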
Pre-processing of MR images. The gray-scale distribution of MR images depends on the acquisition protocol and the hardware. This makes learning difficult, since we expect to have the same data distribution from one subject to another. Therefore, pre-processing is an important step towards bringing all subjects to similar distributions. We applied a bias-field correction to the MR images from the HVSMR and ACDC datasets to correct the intensity non-uniformity using N4ITK [42]. Lastly, we applied histogram-matching normalization to all 2D slices from the sagittal, coronal, and axial planes.
4.2 Implementation and configuration
The RNN-GAN architecture is implemented with the Keras [7] and TensorFlow [1] libraries. The implementation is available on the authors' GitHub.4 All training was conducted on a workstation equipped with an NVIDIA TITAN X GPU.
The model was trained for up to 120 epochs with batch size 10, 450 iterations, and an initial learning rate of 0.001 on the ACDC dataset. Similarly, on HVSMR, we used an initial learning rate of 0.001, batch size 10, 2750 iterations, and 100 epochs, using all 2D slices from the coronal, sagittal, and axial planes with size 256 × 256. All layers of the generator and the discriminator use the tanh activation function, except the output layer, which uses softmax. We use categorical cross-entropy as the adversarial loss, mixed with categorical accuracy and ℓ1. The RMSprop optimizer was used for both the generator and the discriminator; RMSprop divides the learning rate by an exponentially decaying average of squared gradients.
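The RMSprop behaviour described above corresponds to an update of roughly the following form; the decay and epsilon values here are common defaults, not necessarily those used in the experiments.

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    # Keep an exponentially decaying average of squared gradients, then
    # divide the learning rate by its square root (plus eps for stability).
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# One toy step on a single scalar parameter with gradient 2.0.
w = np.array([1.0])
c = np.zeros(1)
w, c = rmsprop_step(w, np.array([2.0]), c)
```

Because the effective step is grad / sqrt(cache), parameters with persistently large gradients take smaller steps, which keeps the adversarial training of G and D from diverging as easily as plain SGD would.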
2 http://segchd.csail.mit.edu/
3 https://competitions.codalab.org/competitions/17094
4 https://github.com/HPI-DeepLearning/Recurrent-GAN
Table 1 Comparison of the achieved accuracy in terms of the Dice metric on the ACDC benchmark with related approaches and top-ranked methods; the best performance in each cardiac phase and region of interest is shown in bold
Methods Phases Left ventricle Right ventricle Myocardium
RNN-GAN ED 0.968 0.940 0.933
ES 0.951 0.919 0.925
cGAN ED 0.934 0.906 0.899
ES 0.918 0.874 0.870
Isensee et al. [20] ED 0.955 0.925 0.865
ES 0.905 0.834 0.882
Wolterink et al. [46] ED 0.96 0.92 0.86
ES 0.91 0.84 0.88
Rohe et al. [37] ED 0.94 0.96 0.90
ES 0.92 0.95 0.90
Zotti et al. [56] ED 0.96 0.94 0.89
ES 0.94 0.87 0.90
U-Net [38] ED 0.96 0.88 0.78
ES 0.92 0.79 0.76
Poudel et al. [34] 0.90 − −
The network was trained with both the ground-truth and complementary masks, and the adversarial loss was mixed with ℓ1 and categorical accuracy
Table 2 Comparison of the achieved accuracy in terms of Hausdorff distance on the ACDC benchmark with top-ranked participant approaches and related work; the best performance in each cardiac phase and region of interest is shown in bold
Methods Phases Left ventricle Right ventricle Myocardium
RNN-GAN ED 6.82 8.95 8.08
ES 8.02 12.17 8.69
cGAN ED 8.62 12.16 9.04
ES 9.44 13.2 9.50
Isensee et al. [20] ED 7.38 10.12 8.72
ES 6.90 12.14 8.67
Wolterink et al. [46] ED 7.47 11.87 11.12
ES 9.6 13.39 10.06
Rohe et al. [37] ED 7.04 14.04 11.50
ES 10.92 15.92 13.03
Zotti et al. [56] ED 5.96 13.48 8.68
ES 6.57 16.66 8.99
U-Net [38] ED 6.17 20.51 15.25
ES 8.29 21.20 17.92
Here, RNN-GAN was trained with the ground-truth and complementary masks, and the adversarial loss was mixed with ℓ1 and categorical accuracy
Fig. 5 Cardiac segmentation results at test time by RNN-GAN on Patient084 from the ACDC-2017 benchmark. The red, green, and blue contours present the right ventricle, myocardium, and left ventricle regions, respectively. The top two rows show the diastolic phase in different slices from t = 0 to t = 9; the third and fourth rows present the systolic cardiac phase from t = 0 to t = 9
Training took eight hours on ACDC for a total of 120 epochs on parallel NVIDIA TITAN X GPUs, and 12 hours on the HVSMR dataset with the same configuration. With this implementation, we are able to produce a cardiac segmentation mask in 500-700 ms per patient for the same cardiac phase from the ACDC dataset on an axial plane.
The proposed approach is trained on 75% of the training data released by the HVSMR-2016 and LiTS-2017 benchmarks. We used all provided images from the sagittal, coronal, and axial axes for training, validation, and testing. For the ACDC dataset, we trained our system on 75 exams from the axial, coronal, and sagittal planes and validated it on the remaining 25 exams.
In both the training and testing phases, a mini-batch consists of 2D images from the same patient, the same acquisition plane, and the same cardiac phase. We initially normalize the inputs, where the mean and variance are computed per patient from the same acquisition plane and from all available images in the same cardiac phase (ED, ES). This normalization helps to restrict the effect of outliers. With batch normalization, we normalize the inputs (activations coming from the previous layer) going into each layer using the mean and variance of the activations for the entire mini-batch.
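The per-patient input normalization can be sketched as follows (toy data; the epsilon guard against zero variance is our addition, not from the paper).

```python
import numpy as np

def normalize_patient(slices):
    # Normalize all 2D slices of one patient (same plane, same cardiac phase)
    # with a single shared mean and standard deviation, as described above.
    mean = slices.mean()
    std = slices.std()
    return (slices - mean) / (std + 1e-8)

# Toy stack of two 1x2 slices from one patient.
stack = np.array([[[1.0, 3.0]],
                  [[5.0, 7.0]]])
normed = normalize_patient(stack)
```

Because the statistics are shared across the whole stack rather than computed per slice, relative intensity differences between slices (which carry anatomical information) are preserved while the overall scale is standardized.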
Table 3 Dice scores for different losses, evaluated on the ACDC benchmark for segmentation of cardiac MR images
Methods Phases Left ventricle Right ventricle Myocardium
RNN-GAN ED 0.968 0.940 0.933
(adv + ℓ1 + acc + CL) ES 0.951 0.919 0.925
RNN-GAN ED 0.965 0.938 0.933
(adv + CL) ES 0.950 0.917 0.921
RNN-GAN ED 0.961 0.931 0.927
(adv + ℓ1) ES 0.949 0.913 0.917
RNN-GAN ED 0.952 0.94 0.929
(adv + acc) ES 0.946 0.907 0.913
cGAN ED 0.934 0.906 0.899
(adv) ES 0.918 0.874 0.870
The best performance was achieved when the RNN-GAN was trained with complementary labels (CL) in addition to the ℓ1 and accuracy (acc) losses
Note that Wolterink's method (an ensemble of six trained CNNs) took 4 seconds to compute the prediction mask per patient on a system equipped with an NVIDIA TITAN X GPU on the ACDC benchmark, as reported in [46], while RNN-GAN took 500 ms on average per patient on a system with a single NVIDIA TITAN X GPU.
4.3 Evaluation criteria
The evaluation and comparison were performed using the quality metrics introduced by each challenge organizer. Semantic segmentation masks were evaluated with five-fold cross-validation. For each patient, corresponding images for the End Diastolic (ED) and End Systolic (ES) instants were provided. As described by ACDC-2017, the cardiac regions are labeled 1, 2, and 3, representing the right ventricle, myocardium, and left ventricle, respectively. In order to standardize the computation of the different error measures, the Python scripts for the Dice coefficient (7) and Hausdorff distance (8) were obtained from ACDC for all participants.
The average boundary distance (ADB), in addition to Dice and Hausdorff, was considered for evaluating the blood pool and myocardium in HVSMR-2016, and similarly for validating liver lesion segmentation on LiTS-2017. Besides these metrics, we calculated sensitivity and specificity, since they are good indicators of the misclassification rate (false positives and false negatives) (see Tables 5 and 6).
Dice(P, T) = |P ∧ T| / ((|P| + |T|) / 2)   (7)

Haus(P, T) = max{ sup_{p∈P} inf_{t∈T} d(p, t), sup_{t∈T} inf_{p∈P} d(t, p) }   (8)
where P and T indicate the output predicted by our proposed method and the ground truth annotated by a medical expert, respectively.
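Equations (7) and (8) can be computed directly on binary masks and boundary point sets. The sketch below is a minimal numpy illustration on toy masks, not the official ACDC evaluation script:

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient (Eq. 7): |P ∧ T| / ((|P| + |T|) / 2)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return inter / ((pred.sum() + gt.sum()) / 2.0)

def hausdorff(ppts, tpts):
    """Symmetric Hausdorff distance (Eq. 8) between two point sets:
    the maximum of the two directed sup-inf distances."""
    d = np.linalg.norm(ppts[:, None, :] - tpts[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy 8x8 binary masks: two overlapping 4x4 squares
p = np.zeros((8, 8), dtype=int); p[2:6, 2:6] = 1
t = np.zeros((8, 8), dtype=int); t[3:7, 3:7] = 1
d_score = dice(p, t)                              # 9 / ((16 + 16)/2) = 0.5625
h_dist = hausdorff(np.argwhere(p).astype(float),
                   np.argwhere(t).astype(float))  # sqrt(2)
print(d_score, h_dist)
```

In practice the Hausdorff distance is computed on the mask contours and scaled by the voxel spacing; the toy version above uses all foreground pixels for brevity.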
4.4 Comparison with related methods and discussion
As shown in Table 1, our method outperforms other top-ranked approaches on the ACDC benchmark. In terms of the Dice coefficient, our method achieved slightly better results than
Fig. 6 The ACDC 2017 challenge results using the RNN-GAN and cGAN architectures. Sub-figure (a), left, shows the Dice coefficient in the two cardiac phases: the y-axis codes the Dice metric and the x-axis the segmentation performance of cGAN and RNN-GAN in the ED and ES phases. Sub-figure (b), right, presents the Hausdorff distance: the y-axis codes the distance in mm and the x-axis again shows the performance of cGAN and RNN-GAN in the ED and ES phases. In each sub-figure, the mean is shown in red
Wolterink et al. [46] on the ACDC challenge in left ventricle and myocardium segmentation. However, Rohe et al. [37] achieved outstanding performance for right ventricle segmentation, since they applied multi-atlas registration and segmentation at the same time. Poudel et al. [34] achieved competitive results on left ventricle segmentation, with an overall Dice of 0.93, based on recurrent fully convolutional networks.
Based on Tables 1 and 2, the right ventricle is a difficult organ for all the participants, mainly because of its complicated shape, the partial volume effect close to the free wall, and
Table 4 Comparison of segmentation results on the HVSMR dataset, in terms of the Dice metric and average boundary distance, with other participants; the best performance in each metric is in bold
Methods Dice1 Dice2 Adb1 Adb2
RNN-GAN 0.86 0.94 0.92 0.84
cGAN 0.74 0.91 1.19 1.07
Yu et al. [49] 0.84 0.93 0.99 0.86
Wolterink et al. [45] 0.80 0.93 0.89 0.96
Shahzad et al. [40] 0.75 0.89 1.10 1.15
U-Net [38] 0.68 0.81 2.04 1.82
For all columns, index 1 denotes the myocardium and index 2 the blood pool
intensity inhomogeneity. Our accuracy in terms of Hausdorff distance is, on average, 1.2 ± 0.2 mm lower than that of the other participants. This boundary precision is a strong indicator that the RNN-GAN architecture, with bidirectional LSTM units substituted in, is a suitable solution for capturing the temporal consistency between slices. Compared to cGAN (Tables 1 and 2), RNN-GAN provides better results, also in sensitivity and precision, when the network is trained with complementary segmentation masks.
Compared to the expert-annotated files for the original ED phase instants, individual Dice scores of 0.968 for the left ventricle (LV), 0.933 for the myocardium (MYO), and 0.940 for the right ventricle (RV) (see Table 1) were achieved at test time on 25 patients. Qualitatively, the RNN-GAN segmentation results are promising (see Figs. 5 and 7), where we can see robust and smooth boundaries for all substructures.
We report the effect of different losses for RNN-GAN in Table 3. As expected, the best performance was obtained when the network was trained with a mix of categorical cross-entropy (as the adversarial loss), the ℓ1 loss, and categorical accuracy. Using an ℓ1 loss encourages the output to respect the input, since it penalizes the distance between the ground truth outputs, which match the input, and the synthesized outputs. Using categorical accuracy forces the network to assign a higher cost to less represented sets of objects, boosting their importance during the learning process.
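The loss mixing described above can be sketched as follows. This is an illustrative numpy version; the weighting factors (lam_l1, lam_acc) and the concrete form of the accuracy term are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def mixed_generator_loss(pred_probs, gt_onehot, adv_term, lam_l1=1.0, lam_acc=1.0):
    """Illustrative mix of the three generator terms: an adversarial term,
    an l1 term tying the output to the ground truth, and a categorical-
    accuracy term penalizing mislabeled pixels. The lambda weights are
    hypothetical, not the values used in the paper."""
    l1 = np.abs(pred_probs - gt_onehot).mean()                        # l1 distance term
    acc_err = (pred_probs.argmax(-1) != gt_onehot.argmax(-1)).mean()  # 1 - accuracy
    return adv_term + lam_l1 * l1 + lam_acc * acc_err

# Two-class toy example on a 2x2 image (last axis = class probabilities)
probs = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.6, 0.4], [0.3, 0.7]]])
onehot = np.array([[[1, 0], [0, 1]],
                   [[1, 0], [0, 1]]], dtype=float)
loss = mixed_generator_loss(probs, onehot, adv_term=0.5)
print(loss)  # 0.5 (adv) + 0.25 (l1) + 0.0 (all pixels correctly labeled)
```

The accuracy term is zero here because every pixel's argmax matches the ground truth; misclassified pixels would add a penalty proportional to their fraction.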
As depicted in Fig. 5 and Table 1, the right ventricle is a complex organ to segment; most failures happened in the systolic phase. Based on the accuracy achieved at test time on the ACDC benchmark (Fig. 5), we observed that the average results in the diastolic phase (first and second rows) are better than those in the systolic phase (third and fourth
Table 5 Comparison of segmentation errors on the HVSMR dataset, in terms of Hausdorff distance, sensitivity, and specificity, with other participant approaches; the best performance in each metric is in bold
Methods HD1 HD2 Sen1 Sen2 Spec1 Spec2
RNN-GAN 5.84 6.35 0.89 0.92 0.97 0.99
cGAN 6.79 9.2 0.82 0.88 0.94 0.99
Yu et al. [49] 6.41 7.03 − − − −
Wolterink et al. [45] 6.13 7.07 − − − −
Shahzad et al. [40] 6.05 7.49 − − − −
U-Net [38] 8.86 11.2 0.78 0.74 0.91 0.99
For all columns, index 1 denotes the myocardium and index 2 the blood pool
Fig. 7 The cardiac segmentation results at test time by RNN-GAN on the HVSMR 2016 benchmark. The top row shows the output predicted by RNN-GAN and the second row presents the corresponding ground truth annotated by a medical expert. The cyan contour describes the blood pool and the dark blue contour the myocardium region
rows). We evaluated the results quantitatively using the Hausdorff distance and Dice, as shown in Fig. 6. As expected, the achieved Hausdorff distance for the left ventricle (median of 6.82/8.02 mm for the ED/ES frames) tends to be lower than for the two other regions of interest, with the myocardium at 8.08/8.69 mm and the right ventricle at 8.95/12.07 mm for ED/ES.
Based on Tables 4 and 5 and Fig. 7, the results correspond well to the ground truth for the blood pool; the average value of the Dice index is around 0.94. The main source of error here is the inability of the method to completely segment all the great vessels, where the average Dice score is 0.86. Regarding the results in Tables 4 and 5, comparing the first and second rows shows that the achieved accuracy is better when the conditional GAN is augmented with bidirectional LSTM units. This architecture provides a better representation of features by capturing spatio-temporal information in forward and backward dependencies. In this context, Poudel et al. [34] designed unidirectional LSTMs on top of a U-Net architecture to capture inter- and intra-slice features and achieved competitive results for segmentation of the left ventricle.
The qualitative results of liver tumour segmentation are presented in Fig. 8. Based on Fig. 8 and Table 6, RNN-GAN is able to detect the complex and heterogeneous structure of all lesions. The RNN-GAN architecture trained with complementary masks yielded better results and a better trade-off between Dice and sensitivity. The Dice score is a good measure for class
Fig. 8 LiTS-2017 test results for liver tumour segmentation using RNN-GAN. The predicted liver tumour region is overlaid on the CT images in blue. Compared to the green contour annotated by a medical expert in the ground truth file, we achieved a Dice score of 0.83 and a sensitivity of 0.74
Table 6 Quantitative results of liver lesion segmentation on the LiTS-2017 dataset
Architecture Dice Sen VOE RVD ASD HD
RNN-GAN 0.83 0.74 14 −6 6.4 40.1
RNN-GAN * 0.80 0.68 20 −2 9.7 52.3
cGAN 0.76 0.57 21 −1 10.8 87.1
UNet [8] 0.72 − 22 −3 9.5 165.7
ResNet+Fusion [6] − − 16 −6 5.3 48.3
H-Dense+ UNet [17] − − 39 7.8 1.1 7.0
FCN [44] − − 35 12 1.0 7.0
The first and second rows show the achieved accuracy for liver lesion segmentation when our network was trained with (RNN-GAN) and without (RNN-GAN *) complementary segmentation masks, respectively
imbalance, since it indicates the true positive rate while accounting for false negative and false positive pixels. The effect of class balancing can be seen by comparing the first and second rows of Table 6. As expected, the RNN-GAN trained with complementary segmentation labels in addition to binary segmentation masks computed more accurate results, with average improvements of 3% and 6% in Dice and sensitivity, respectively.
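The complementary-mask construction underlying this comparison can be sketched as follows. This is a minimal numpy illustration; the function name and toy label map are hypothetical:

```python
import numpy as np

def complementary_masks(label_map, num_classes):
    """For each class c, the ordinary mask marks the pixels belonging to c,
    and the complementary mask marks every pixel that does NOT belong to c.
    Training on both is the class-rebalancing trick described above."""
    ordinary = np.stack([(label_map == c).astype(np.uint8)
                         for c in range(num_classes)])
    return ordinary, 1 - ordinary

labels = np.array([[0, 1],
                   [2, 0]])            # toy 3-class label map
ordinary, complement = complementary_masks(labels, 3)
print(ordinary[1])    # pixels of class 1
print(complement[1])  # everything that is not class 1
```

For a rare foreground class, the complementary mask is large and balanced against the ordinary one, which gives the discriminator a stronger signal for the minority class.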
We compared the results predicted by RNN-GAN at test time with other top-ranked and related approaches on LiTS-2017 in terms of volume overlap error (VOE), relative volume difference (RVD), average symmetric surface distance (ASD), and maximum surface distance or Hausdorff distance (HD), as introduced by the challenge organizer. As the results in Table 6 show, cascaded U-Net [8] and ensemble-network [6, 17] architectures achieved better performance than training only a fully convolutional network (FCN) [44]. In contrast to prior work such as [6, 8, 17], our proposed method generalizes to segmenting very small lesions as well as multiple organs in medical data of different modalities.
5 Conclusion
In this paper, we introduced a new deep architecture to mitigate the issue of imbalanced pixel labels in medical image segmentation. To this end, we developed a recurrent generative adversarial architecture named RNN-GAN, consisting of two components: a recurrent generator and a recurrent discriminator. To mitigate imbalanced pixel labels, we mixed the adversarial loss with a categorical accuracy loss and trained the RNN-GAN with ordinary and complementary masks. Moreover, we analyzed the effects of different losses and architectural choices that help to improve semantic segmentation results. Our proposed method shows outstanding results for segmentation of anatomical regions (i.e., cardiac image semantic segmentation). Based on the segmentation results on two cardiac benchmarks, RNN-GAN is robust against slice misalignment and different CMRI protocols. Experimental results reveal that our method produces an average Dice score of 0.95. Given its high accuracy and fast processing speed, we think it has the potential to be used in routine clinical tasks. We also validated RNN-GAN on tumor segmentation based on abdominal CT images and achieved competitive results on the LiTS benchmark.
The impact of learning from complementary labels under different imbalance ratios may also be useful in the context of semantic segmentation. We will investigate this issue in the
future. In terms of applications, we plan to investigate the potential of the RNN-GAN network for learning multiple clinical tasks, such as disease classification and semantic segmentation.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org
2. Afshin M, Ayed IB, Punithakumar K, Law M, Islam A, Goela A, Peters T, Li S (2014) Regional assessment of cardiac left ventricular myocardial function via MRI statistical features. IEEE Trans Med Imaging 33(2):481–494
3. Avola D, Cinque L (2008) Encephalic NMR image analysis by textural interpretation. In: Proceedings of the 2008 ACM symposium on applied computing, pp 1338–1342. ACM
4. Avola D, Cinque L, Di Girolamo M (2011) A novel T-CAD framework to support medical image analysis and reconstruction. In: International conference on image analysis and processing, pp 414–423. Springer
5. Bernard O, Lalande A, Zotti C, Cervenansky F, Yang X, Heng PA, Cetin I, Lekadir K, Camara O, Ballester MAG et al (2018) Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Transactions on Medical Imaging
6. Bi L, Kim J, Kumar A, Feng D (2017) Automatic liver lesion detection using cascaded deep residual networks. arXiv:1704.02703
7. Chollet F et al (2015) Keras
8. Christ PF, Ettlinger F, Grun F, Elshaer MEA, Lipkova J, Schlecht S, Ahmaddy F, Tatavarty S, Bickel M, Bilic P, Rempfler M, Hofmann F, D'Anastasi M, Ahmadi S, Kaissis G, Holch J, Sommer WH, Braren R, Heinemann V, Menze BH (2017) Automatic liver and tumor segmentation of CT and MRI volumes using cascaded fully convolutional neural networks. arXiv:1702.05970
9. Ciecholewski M (2011) Support vector machine approach to cardiac SPECT diagnosis. In: International workshop on combinatorial image analysis, pp 432–443. Springer
10. Douzas G, Bacao F (2018) Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst Appl 91:464–471
11. Drozdzal M, Chartrand G, Vorontsov E, Shakeri M, Di Jorio L, Tang A, Romero A, Bengio Y, Pal C, Kadoury S (2018) Learning normalized inputs for iterative estimation in medical image segmentation. Med Image Anal 44:1–13
12. Eslami A, Karamalis A, Katouzian A, Navab N (2013) Segmentation by retrieval with guided random walks: application to left ventricle segmentation in MRI. Med Image Anal 17(2):236–253
13. Fidon L, Li W, Garcia-Peraza-Herrera LC, Ekanayake J, Kitchen N, Ourselin S, Vercauteren T (2017) Generalised Wasserstein Dice score for imbalanced multi-class segmentation using holistic convolutional networks. In: International MICCAI Brainlesion workshop, pp 64–76. Springer
14. Fischl B, Salat DH, Van Der Kouwe AJ, Makris N, Segonne F, Quinn BT, Dale AM (2004) Sequence-independent segmentation of magnetic resonance images. Neuroimage 23:S69–S84
15. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial networks. ArXiv e-prints
16. Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5-6):602–610
17. Han X (2017) Automatic liver lesion segmentation using a deep convolutional neural network method. arXiv:1704.07239
18. Hashemi SR, Salehi SSM, Erdogmus D, Prabhu SP, Warfield SK, Gholipour A (2018) Tversky as a loss function for highly unbalanced image segmentation using 3D fully convolutional deep networks. arXiv:1803.11078
19. Inda MM, Bonavia R, Seoane J (2014) Glioblastoma multiforme: A look inside its heterogeneous nature. Cancers 6(1):226–239
20. Isensee F, Jaeger PF, Full PM, Wolf I, Engelhardt S, Maier-Hein KH (2017) Automatic cardiac disease assessment on cine-MRI via time-series segmentation and domain specific features. In: International workshop on statistical atlases and computational models of the heart, pp 120–129. Springer
21. Ishida T, Niu G, Hu W, Sugiyama M (2017) Learning from complementary labels. In: Advances in neural information processing systems, pp 5639–5649
22. Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: The IEEE conference on computer vision and pattern recognition (CVPR)
23. Jang J, Eo T, Kim M, Choi N, Han D, Kim D, Hwang D (2014) Medical image matching using variable randomized undersampling probability pattern in data acquisition. In: 2014 international conference on electronics, information and communications (ICEIC), pp 1–2. https://doi.org/10.1109/ELINFOCOM.2014.6914453
24. Kaur R, Juneja M, Mandal A (2018) A comprehensive review of denoising techniques for abdominal CT images. Multimedia Tools and Applications, pp 1–36
25. Kohl S, Bonekamp D, Schlemmer H, Yaqubi K, Hohenfellner M, Hadaschik B, Radtke J, Maier-Hein KH (2017) Adversarial networks for the detection of aggressive prostate cancer. arXiv:1702.08014
26. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
27. Mahapatra D (2014) Automatic cardiac segmentation using semantic information from random forests. J Digit Imaging 27(6):794–804
28. Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv:1411.1784
29. Moeskops P, Veta M, Lafarge MW, Eppenhof KAJ, Pluim JPW (2017) Adversarial training and dilated convolutions for brain MRI segmentation. arXiv:1707.03195
30. Nasr GE, Badr E, Joun C (2002) Cross entropy error function in neural networks: Forecasting gasoline demand. In: FLAIRS conference, pp 381–384
31. Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA (2016) Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2536–2544
32. Peng P, Lekadir K, Gooya A, Shao L, Petersen SE, Frangi AF (2016) A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging. Magn Reson Mater Phys, Biol Med 29(2):155–195
33. Pohl KM, Fisher J, Grimson WEL, Kikinis R, Wells WM (2006) A Bayesian model for joint segmentation and registration. Neuroimage 31(1):228–239
34. Poudel RP, Lamata P, Montana G (2016) Recurrent fully convolutional neural networks for multi-slice MRI cardiac segmentation. In: Reconstruction, segmentation, and analysis of medical images, pp 83–94. Springer
35. Prabhu V, Kuppusamy P, Karthikeyan A, Varatharajan R (2018) Evaluation and analysis of data driven in expectation maximization segmentation through various initialization techniques in medical images. Multimed Tools Appl 77(8):10375–10390
36. Qiu Q, Song Z (2018) A nonuniform weighted loss function for imbalanced image classification. In: Proceedings of the 2018 international conference on image and graphics processing, pp 78–82. ACM
37. Rohe MM, Sermesant M, Pennec X (2017) Automatic multi-atlas segmentation of myocardium with SVF-Net. In: Statistical atlases and computational modeling of the heart (STACOM) workshop
38. Ronneberger O, Fischer P, Brox T (2015) U-Net: Convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention, pp 234–241. Springer International Publishing
39. Rota Bulo S, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2126–2135
40. Shahzad R, Gao S, Tao Q, Dzyubachyk O, van der Geest R (2016) Automated cardiovascular segmentation in patients with congenital heart disease from 3D CMR scans: combining multi-atlases and level-sets. In: Reconstruction, segmentation, and analysis of medical images, pp 147–155
41. Sudre CH, Li W, Vercauteren T, Ourselin S, Cardoso MJ (2017) Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep learning in medical image analysis and multimodal learning for clinical decision support, pp 240–248. Springer
42. Tustison NJ, Avants BB, Cook PA, Zheng Y, Egan A, Yushkevich PA, Gee JC (2010) N4ITK: improved N3 bias correction. IEEE Trans Med Imaging 29(6):1310–1320
43. Vorontsov E, Tang A, Pal C, Kadoury S (2018) Liver lesion segmentation informed by joint liver segmentation. In: 15th IEEE international symposium on biomedical imaging (ISBI 2018), pp 1332–1335
44. Vorontsov E, Tang A, Pal C, Kadoury S (2018) Liver lesion segmentation informed by joint liver segmentation. In: 15th IEEE international symposium on biomedical imaging (ISBI 2018), pp 1332–1335
45. Wolterink JM, Leiner T, Viergever MA, Isgum I (2016) Dilated convolutional neural networks for cardiovascular MR segmentation in congenital heart disease. In: Reconstruction, segmentation, and analysis of medical images, pp 95–102. Springer
46. Wolterink JM, Leiner T, Viergever MA, Isgum I (2017) Automatic segmentation and disease classification using cardiac cine MR images. arXiv:1708.01141
47. Xu J, Schwing AG, Urtasun R (2014) Tell me what you see and I will show you where it is. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3190–3197
48. Xue Y, Xu T, Zhang H, Long LR, Huang X (2017) SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation. arXiv:1706.01805
49. Yu L, Yang X, Qin J, Heng PA (2016) 3D FractalNet: dense volumetric segmentation for cardiovascular MRI volumes. In: Reconstruction, segmentation, and analysis of medical images, pp 103–110. Springer
50. Yu X, Liu T, Gong M, Tao D (2018) Learning with biased complementary labels. In: The European conference on computer vision (ECCV)
51. Zhang YD, Muhammad K, Tang C (2018) Twelve-layer deep convolutional neural network with stochastic pooling for tea category classification on GPU platform. Multimedia Tools and Applications, pp 1–19
52. Zhang YD, Zhao G, Sun J, Wu X, Wang ZH, Liu HM, Govindaraj VV, Zhan T, Li J (2017) Smart pathological brain detection by synthetic minority oversampling technique, extreme learning machine, and Jaya algorithm. Multimedia Tools and Applications, pp 1–20
53. Zhou Y, Berg TL (2016) Learning temporal transformations from time-lapse videos. In: European conference on computer vision, pp 262–277
54. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: The IEEE international conference on computer vision (ICCV)
55. Zhu W, Xie X (2016) Adversarial deep structural networks for mammographic mass segmentation. arXiv:1612.05970
56. Zotti C, Luo Z, Humbert O, Lalande A, Jodoin PM (2017) GridNet with automatic shape prior registration for automatic MRI cardiac segmentation. arXiv:1705.08943
Mina Rezaei is currently a Ph.D. student at the Chair of Internet Technologies and Systems, Hasso Plattner Institute (HPI), University of Potsdam, Germany. Prior to HPI, she received her master's degree in artificial intelligence from Shiraz University in 2013 and her bachelor's degree in software engineering from Arak University in 2008. She worked for more than five years as a software developer at the Statistical Center of Iran. In 2013, she was a visiting researcher in the Dept. of CAMP, Technical University of Munich, Germany, and in 2017 she had short-term research visits in the Dept. of CS, University of Cape Town, South Africa, and at Nanjing University, China. Her research interests include deep learning, generative models, learning from imbalanced data, and medical image analysis.
Haojin Yang received the Diploma Engineering degree from the Technical University Ilmenau, Germany, in 2008. In 2013, he received his doctorate from the Hasso Plattner Institute for IT-Systems Engineering (HPI) at the University of Potsdam, Germany. His current research interests revolve around multimedia analysis, information retrieval, deep learning technologies, computer vision, and content-based video search technologies.
Christoph Meinel studied mathematics and computer science at Humboldt University in Berlin. He received his doctorate in 1981 and was habilitated in 1988. After visiting positions at the University of Paderborn and the Max Planck Institute for Computer Science in Saarbrücken, he became a full professor of computer science at the University of Trier. He is now the president and CEO of the Hasso Plattner Institute for IT-Systems Engineering at the University of Potsdam, where he is a full professor of computer science with a chair in Internet technologies and systems. He is a member of acatech, the German National Academy of Science and Engineering, and of numerous scientific committees and supervisory boards. His research focuses on IT-security engineering, tele-teaching, telemedicine, and multimedia retrieval. He has published more than 500 papers in high-profile scientific journals and at international conferences.
Affiliations
Mina Rezaei1 ·Haojin Yang1 ·Christoph Meinel1
Haojin Yang haojin.yang@hpi.de
Christoph Meinel christoph.meinel@hpi.de
1 Hasso Plattner Institute, Prof. Dr. Helmert Street 2-3, Potsdam, Germany
11 Discussion
In the previous chapters, some representative papers of my research work have
been presented. I will give a short discussion of the achievements regarding the
pre-defined research questions described in section 1.1.1.
• Q1: DL is data-hungry; how can we alleviate the reliance on substantial data annotations? Through synthetic data, and/or through unsupervised and semi-supervised learning methods?
Q2: How can we perform multiple computer vision tasks with a uniform
end-to-end neural network architecture?
Discussion: In [YWBM16], we successfully developed a real-time scene text recognition system, SceneTextReg, by taking advantage of both classical computer vision techniques (e.g., the MSER detector) and highly accurate deep learning models. In order to address the lack of training data, we developed a synthetic data engine which can produce large amounts of text images with a broad range of variety. We achieved accuracy similar to that of methods trained on large-scale real-world samples. Through this paper, we thus show that sufficient model accuracy can be obtained with a carefully implemented synthetic data engine. Given these advantages, synthetic data engines have recently become one of the most efficient solutions for training DL models.
As mentioned in our paper, synthetic data generation works for some use cases, such as image text recognition, object detection, and segmentation, but it does not work for many other tasks, for which a highly accurate data generator is hard to obtain. Therefore, semi-supervised as well as unsupervised methods play a more crucial role in those use cases.
In the context of scene text recognition, we considered two possible research ideas. First, can we integrate the detection and recognition tasks into a uniform neural network and optimize the whole network end-to-end? It is clearly feasible to accomplish this using so-called multi-task learning techniques in a fully supervised manner. However, can we solve the problem in a semi-supervised way? And if solving the whole task with an unsupervised or semi-supervised method is too complicated, could we solve the intermediate task in a semi-supervised way? This is the motivation of our work SEE [BYM18]. To our knowledge, SEE is the first work that attempts to solve the text detection and recognition task with one end-to-end neural network, training the text detection part using only the semi-supervised signal delivered by the text recognition part. This idea is inspired by the human vision system: when we teach children to recognize an object, we never provide fully supervised information such as the exact location and bounding-box size of the object. The human vision system learns to find target objects using weak supervision signals such as context or attention information. In this work, we successfully trained deep models for end-to-end scene text recognition and focused scene word recognition, and achieved state-of-the-art results on different benchmark datasets. Every approach has its shortcomings, and ours is no exception; the current limitations as well as future working directions are discussed in chapter 12.1.
• Q3: How can we apply DL models on low-power devices such as smartphones, embedded devices, wearables, and IoT devices?
Discussion: State-of-the-art deep models are computationally expensive and consume large amounts of storage. On the other hand, DL is also strongly demanded by numerous applications in areas such as mobile platforms, wearable devices, autonomous robots, and IoT devices. How to efficiently
apply deep models on such low-power devices thus becomes a challenging research problem. In this thesis, we presented two works addressing this issue that follow the recently introduced Binary Neural Networks (BNNs).

We developed BMXNet and published it as open-source software, from which both the research community and industry can benefit. We conducted an extensive study on the training strategy and execution efficiency of BNNs. The results show that BMXNet achieves excellent performance regarding inference speed and memory usage, and that we can easily reduce the model size with a large compression ratio (a 32× theoretical compression rate).
After successfully building this foundation framework, we aimed to further address the general accuracy issue of BNNs. We systematically evaluated different network architectures and hyperparameters to provide useful insights on how to train a BNN, which can benefit further research. We introduced the following insights about binary neural networks that had not been reported in previous work:

– We evaluated the importance of removing the bottleneck design.
– Increasing the number of shortcut connections increases accuracy and reduces model size.
– Changing the clipping threshold can have a significant influence on training.
– Increasing the bandwidth of the information flow appears to be one of the most important factors for accuracy gains.
We presented meaningful scientific insights and made our models and code publicly available, which can serve as a solid foundation for future research. The biggest remaining issue of BNNs is the large accuracy gap to their full-precision counterparts. We describe our ideas for further enhancing the information flow in the future work chapter.
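The sign binarization and the clipping threshold mentioned in these insights can be sketched as follows. This is a generic numpy illustration of the standard BNN forward pass and straight-through estimator, not BMXNet's implementation, and the threshold value of 1.0 is only illustrative:

```python
import numpy as np

def binarize_forward(w):
    """Forward pass of a BNN layer: sign binarization of weights to {-1, +1},
    which is what enables the 32x storage compression of binary models."""
    return np.where(w >= 0, 1.0, -1.0)

def ste_backward(grad_out, w, clip=1.0):
    """Straight-through estimator: gradients pass through unchanged but are
    zeroed where |w| exceeds the clipping threshold, the tunable value whose
    influence on training is noted above (clip=1.0 is an assumption)."""
    return grad_out * (np.abs(w) <= clip)

w = np.array([-1.7, -0.3, 0.0, 0.4, 2.1])
print(binarize_forward(w))                # -1 for negative weights, +1 otherwise
print(ste_backward(np.ones_like(w), w))   # gradient blocked where |w| > 1
```

Raising or lowering the clip value changes how many weights keep receiving gradient updates, which is one way the threshold influences training dynamics.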
• Q4: Can DL models master multimodal and cross-modal representation learning tasks?
Discussion: According to this research question, we conducted two sub-
topics: visual-textual feature fusion in multimodal and cross-modal retrieval
task [WYM16a] and visual-language feature learning with its use case Image
Captioning [WYBM16, WYM18].
For the former, we proposed a hybrid deep model, RE-DNN, which captures
the correlations between image and text pairs. In previous work on the
image retrieval task, the image representations were hand-crafted features
such as SIFT, GIST, or PHOW. We proposed instead to apply high-level
CNN features extracted from an AlexNet model trained on ImageNet. Sub-
sequently, visual and textual features are fused in a supervised manner.
Overall, we achieved state-of-the-art results on two evaluation datasets.
Another advantage of RE-DNN is that it handles the missing-modality
problem robustly; by contrast, most comparable approaches have difficulty
processing unpaired data. In our paper, we showed that RE-DNN solves
this problem robustly and outperformed the alternative approaches with
state-of-the-art performance on multimodal image retrieval and cross-modal
retrieval tasks.
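As an illustration of supervised visual-textual fusion, the following toy sketch projects both modalities into a shared space; the dimensions and weights are hypothetical, and this is a deliberate simplification of the idea behind RE-DNN, not its actual architecture:

```python
import math
import random

random.seed(0)

def matvec(vec, mat):
    """Multiply a length-n vector with an n x m weight matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def fuse(visual_feat, text_feat, W_v, W_t):
    """Project each modality into a shared space and fuse with tanh.
    A missing modality is replaced by zeros -- one simple way such a
    joint model can remain usable on unpaired data."""
    dim = len(W_v[0])
    v = matvec(visual_feat, W_v) if visual_feat is not None else [0.0] * dim
    t = matvec(text_feat, W_t) if text_feat is not None else [0.0] * dim
    return [math.tanh(a + b) for a, b in zip(v, t)]

# Hypothetical toy dimensions (a real system might use e.g. a 4096-d CNN
# feature and a 300-d text vector): 6-d visual, 4-d text, 3-d shared space.
W_v = [[random.gauss(0, 0.1) for _ in range(3)] for _ in range(6)]
W_t = [[random.gauss(0, 0.1) for _ in range(3)] for _ in range(4)]
joint = fuse([random.gauss(0, 1) for _ in range(6)],
             [random.gauss(0, 1) for _ in range(4)], W_v, W_t)
print(len(joint))  # 3
```

Calling `fuse(None, text_feat, W_v, W_t)` still yields a valid shared-space vector, which sketches why a joint embedding can cope with a missing modality.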
For the latter, the image captioning task, we developed an end-to-end
trainable deep bidirectional LSTM network to capture the semantic corre-
spondences between images and their caption sentences. We studied several
different architectural designs and investigated the corresponding activation
visualizations, which improved our understanding of how fused representa-
tions are learned when both visual image features and sentence features are
fed into one LSTM network. The effectiveness and generalization ability
of the proposed model have been evaluated on mainstream benchmark
datasets, including Flickr8K [RYHH10], Flickr30K [YLHH14], MSCOCO
[LMB+14], and Pascal1K [RYHH10]. The experimental results show that
our models outperformed related work on almost all image captioning and
image-sentence retrieval tasks. However, even though we achieved very
promising results on multiple benchmark datasets, we are still far from
completely solving the multimodal representation learning task: the aca-
demic datasets have many limitations and are far from comprehensively
approximating the real-world data distribution. Therefore, many challenges
remain to be addressed in future work.
• Q5: Can we effectively and efficiently apply multimedia analysis and DL
algorithms in real-world applications?
Discussion: Regarding this question, we studied the feasibility and prac-
ticality of the developed techniques in two practical use cases: Automatic
Online Lecture Analysis and Medical Image Segmentation.
In the first use case, we developed, on top of our automatic analysis meth-
ods, a solution that highlights online lecture videos at different levels of
granularity. We analyze a wide range of learning materials, including lecture
speeches, transcripts, and lecture slides in both file and video format1.
In this approach, we applied analytical methods as well as DL models to
gather several different lecture insights, to create highlighting information
based on them, and to further analyze learner behavior. Our user study
showed that the extracted highlighting information can assist learners, es-
pecially in the context of MOOCs. In the qualitative evaluation, our ap-
proach achieves satisfactory precision, outperforms the baseline methods,
and was also welcomed in user feedback.
In the second use case, we investigated the practicability of state-of-the-art
DL techniques for instance-level object detection and segmentation on the
medical image segmentation task. We proposed a novel end-to-end DL
architecture to address the brain tumor and liver tumor segmentation
tasks. The proposed architecture achieves promising results on popular
medical image benchmark datasets and generalizes very well to medical
images of different types and varied sizes.
The applied benchmark datasets include the BraTS 2017 dataset for brain
tumor segmentation [20117a] (MRI images), the LiTS 2017 dataset for liver
cancer segmentation [20117b] (Computed Tomography (CT) images), and
the MDA231 dataset for microscopic cell segmentation [BE15]. Overall, the
achieved results demonstrate a strong generalization ability of the proposed
method for the medical image segmentation task.
1The slide screen is captured during the presentation by using a dedicated recording system.
From our experimental results, we can draw the preliminary conclusion
that DL technologies have huge application potential in the field of medical
image processing. DL models can achieve outstanding performance given
sufficient, high-quality labeled training data. In the future, DL models
could be an excellent aid to radiologists, effectively alleviating the shortage
of experienced doctors and also helping them to ensure the accuracy of
diagnoses.
12
Conclusion
In this thesis, I have presented several completed as well as ongoing studies
on deep representation learning using multimedia data. The research topics
involved cover a broad range: scene text recognition, a typical computer
vision problem, using fully supervised and semi-supervised DL methods;
multimodal retrieval and multimodal feature fusion for image captioning;
binary neural networks with BMXNet; and two application use cases, on-
line lecture highlighting and medical image segmentation using DL tech-
nologies. Furthermore, I wrote a relatively comprehensive overview sum-
marizing the history, development, vision, and technical fundamentals of
DL techniques, in order to give readers a better understanding of the
manuscripts included.
There is still large room for improvement in the current work, and several
exciting research directions derived from it invite further pursuit. A com-
prehensive outlook on future work is therefore provided in the next section,
where I also discuss some of my opinions on the future development of DL
technologies.
12.1 Future Work
In current scene text recognition research, most approaches focus on im-
proving text localization and word recognition performance using fully
supervised methods; far less work addresses unsupervised or semi-supervised
solutions. Our work SEE attempts to open up this more challenging but
meaningful research direction. This goal motivated us to release our code,
trained models, and compiled datasets as open source to the research com-
munity, in order to encourage more researchers to join us in this direction.
However, it matches our intuition that a semi-supervised method such as
SEE cannot yet offer a text localizer as strong as those based on fully
supervised methods. How to further improve both localization recall and
precision using weakly supervised methods remains an open research ques-
tion. We achieved some promising results in semi-supervised text localiza-
tion on the FSNS dataset; however, the appearance of the text in this
dataset is somewhat monotonous: there is only a small variety of font sizes
and styles (due to the nature of the “street sign” dataset), and the degree
of geometric distortion of the text is limited. Moreover, in the current state,
our models are not fully capable of detecting scene text at arbitrary loca-
tions in an image, as we saw during our experiments with the FSNS dataset.
Currently, our model is also constrained to a fixed maximum number of
words that can be detected in one forward pass. In future work, we want
to redesign the network so that it can determine the number of text lines
in an image by itself. We therefore still need to prove our approach in more
challenging scene text localization scenarios. A straightforward idea is to
use an additional weak supervision signal to give the localization network
more guidance.
On the other hand, beyond the specific object type “text”, we also aim to
apply the idea of SEE to the general object detection task. Several challeng-
ing questions need to be answered in this direction. To measure the proba-
bility that a captured image region is an object, we need a new metric, an
Objectness Score; but how do we define and evaluate the objectness of
detected regions? And once we obtain regions with solid objectness scores,
how can we assign the most probable classes to them? The recently pro-
posed “model distillation” technique [HVD15], which can be combined with
the SEE network, might be a good direction to follow. Model distillation is
an effective technique for transferring knowledge from a teacher to a student
network. The typical application is to transfer from a powerful large network
or an ensemble to a small network, in order to meet low-memory or fast-
execution requirements. The idea is to use the distillation method and a
teacher model trained on different object classes to train a student model
in a semi-supervised fashion. The model should be guided to learn the
latent correlation between two visually correlated objects. Consider, for
instance, a football and a Pomeranian puppy of the same white color, both
moving on a lawn: they are obviously two different objects, but seen from
a distance they exhibit similar visual characteristics to an object detector
or object tracker. Whether we can exploit this visual correlation informa-
tion to develop a more general object detector is an exciting research
problem.
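The distillation idea of [HVD15] can be sketched as a KL divergence between temperature-softened teacher and student outputs; the following minimal example is a generic formulation for illustration, not our planned implementation:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a higher T yields a softer distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between the softened teacher and student outputs,
    following [HVD15]: a high temperature T exposes the teacher's
    'dark knowledge' about class similarities."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

teacher = [6.0, 2.0, 1.0]
print(distillation_loss([6.0, 2.0, 1.0], teacher))       # 0.0 -- identical outputs
print(distillation_loss([1.0, 6.0, 2.0], teacher) > 0)   # True -- mismatch is penalized
```

The student is rewarded not just for picking the teacher's top class but for reproducing the teacher's full soft distribution, which is exactly the kind of supervision signal that could guide a student toward latent correlations between visually similar objects.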
In BNN research, almost all recent methods focus on enhancing accuracy,
because this drawback significantly limits the applicability of BNNs and
needs special attention. In our work [BYBM18, BYBM19], we systemati-
cally evaluated different network architectures and hyperparameters to pro-
vide useful insights on how to train a BNN. Building on that, our future
work will also focus on reducing the precision gap between binary networks
and their full-precision counterparts. Specifically, both the network archi-
tecture and the binary layer design should be improved. I will mainly follow
two ideas: first, going beyond the existing approaches, a more efficient
method for approximating the full-precision weights and activations using
scaling factors is urgently needed; second, we will try to find a better way
to further enhance the capacity of the information flow, because our study
found that the information flow can significantly affect the optimization
results. I believe that a better-optimized information flow path is one of
the critical factors that can ease the training process and significantly im-
prove the overall performance. A starting point here might be to further
improve the shortcut connections beyond DenseNet.
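One widely used form of such scaling factors is the XNOR-Net-style approximation; the following sketch illustrates the general idea on a toy weight vector and is not taken from our BMXNet code:

```python
def binarize_with_scale(weights):
    """XNOR-Net-style approximation W ~= alpha * sign(W): the scaling
    factor alpha = mean(|W|) is the choice that minimizes the L2 error
    between the full-precision weights and their scaled binary version."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs

alpha, signs = binarize_with_scale([0.3, -0.1, 0.2, -0.4])
print(alpha)                       # ~0.25
print([alpha * s for s in signs])  # ~[0.25, -0.25, 0.25, -0.25]
```

The reconstruction `alpha * sign(W)` keeps only one full-precision scalar per filter while the bulk of the computation stays binary; finding a more expressive yet still cheap approximation than this single mean is precisely the open problem mentioned above.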
For multimodal representation learning, we proposed a bidirectional LSTM
model that can generate a caption sentence for an image by taking both
historical and future context into account. We studied several deep bidirec-
tional LSTM architectures for embedding images and sentences in a high-
level semantic space in order to learn the visual-language model. In this
work, we also showed that multi-task learning with bidirectional LSTMs is
beneficial for increasing the model's generalization ability, which was fur-
ther confirmed by our transfer learning experiments. We qualitatively visu-
alized the internal states of the proposed model to understand how a bidi-
rectional LSTM with multimodal information generates words at consecu-
tive time steps. The robustness of the proposed models has been evaluated
on numerous datasets for two different tasks: image captioning and image-
sentence retrieval.
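The core mechanism of reading a sequence in both directions can be sketched as follows; a simple tanh-RNN cell stands in for an LSTM to keep the example short, so this is an illustration of the principle rather than our actual architecture:

```python
import math

def rnn_step(x, h, w_x, w_h):
    """Single tanh-RNN step on scalars -- a stand-in for an LSTM cell."""
    return math.tanh(w_x * x + w_h * h)

def bidirectional_states(xs, w_x=0.5, w_h=0.3):
    """Run the sequence forward and backward and pair both hidden states,
    so the representation of each position sees past *and* future context."""
    fwd, h = [], 0.0
    for x in xs:
        h = rnn_step(x, h, w_x, w_h)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):
        h = rnn_step(x, h, w_x, w_h)
        bwd.append(h)
    bwd.reverse()
    return list(zip(fwd, bwd))

# Four hypothetical word-embedding values (scalars for brevity).
states = bidirectional_states([1.0, -2.0, 0.5, 3.0])
print(len(states))  # 4
```

Each position ends up with a (forward, backward) pair, which is why a bidirectional model can condition the word generated at a given time step on both the preceding and the following context.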
As a future direction, we will focus on exploring more efficient language
representations, e.g. word2vec [MSC+13b], and on incorporating an atten-
tion mechanism [VTBE15] into our model. Furthermore, multilingual cap-
tion generation is another interesting research problem to address. The
proposed bidirectional models can also be applied to other sequence learn-
ing tasks such as text recognition and video captioning.
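Such an attention mechanism can be sketched in a few lines: each image-region annotation is scored against the decoder state and the context vector is their softmax-weighted sum. The region vectors below are hypothetical toy values, and this is a generic soft-attention formulation in the spirit of [VTBE15], not a description of our model:

```python
import math

def attention(query, annotations):
    """Soft attention: score each region annotation against the decoder
    state (dot product), normalize with a softmax, and return the weights
    plus the weighted context vector."""
    scores = [sum(q * a for q, a in zip(query, ann)) for ann in annotations]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(annotations[0])
    context = [sum(w * ann[j] for w, ann in zip(weights, annotations))
               for j in range(dim)]
    return weights, context

# Three hypothetical 2-d region annotations and a decoder state.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attention([2.0, 0.0], regions)
print(round(sum(weights), 6))  # 1.0
print(len(context))            # 2
```

The region most aligned with the decoder state receives the largest weight, so at each time step the caption generator can attend to a different part of the image.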
In the medical image segmentation paper, we successfully applied a novel
deep architecture to the brain and liver tumor segmentation tasks and
achieved promising results in several popular medical imaging challenges.
As future work, we plan to cooperate with doctors to further evaluate the
generalization ability of the developed framework on more medical image
data in the clinical context. Moreover, we will investigate the applicability
of the current model to learning multiple clinical tasks, such as disease
diagnosis, besides semantic segmentation.
I briefly presented the basic idea and theoretical foundations of GANs in
chapter 2.3.4.2. Although GANs have achieved state-of-the-art results on
a large variety of unsupervised learning tasks, training them is considered
highly unstable, very difficult, and sensitive to hyperparameters; all the
while, they may miss modes of the data distribution or even collapse large
amounts of probability mass onto a few modes. Successful GAN training
usually requires large amounts of human and computing effort to fine-tune
the hyperparameters in order to stabilize training and avoid mode collapse.
Practitioners typically rely on their own experience and tend to publish
hyperparameters and recipes instead of a systematic method for training
GANs. In our recent work [MYM18], we extensively studied the mode-
collapse problem of GANs and proposed incorporating adversarial dropout
into generative multi-adversarial networks. Our approach forces the single
generator not to constrain its output to satisfy a single discriminator but,
instead, to fulfill a dynamic ensemble of discriminators. We showed that
this approach leads to a more generalized generator, promoting variety in
the generated samples and avoiding the mode-collapse problem commonly
experienced with GANs. In future work, we will apply the proposed ap-
proach to medical image segmentation as well as to the language genera-
tion task, to provide more evidence of its ability to eliminate mode collapse
and stabilize training.
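The adversarial dropout idea can be sketched as follows; the per-discriminator losses and the aggregation by simple averaging are illustrative assumptions, not the exact formulation of [MYM18]:

```python
import random

def generator_feedback(disc_losses, keep_prob=0.5, rng=None):
    """Adversarial dropout over a discriminator ensemble: at each step
    the generator is trained against a random subset of discriminators,
    so it cannot overfit (and mode-collapse) to any single one.
    At least one discriminator is always kept."""
    rng = rng if rng is not None else random.Random()
    kept = [loss for loss in disc_losses if rng.random() < keep_prob]
    if not kept:  # never drop the whole ensemble
        kept = [rng.choice(disc_losses)]
    return sum(kept) / len(kept)

# Hypothetical per-discriminator losses for the current generator batch.
losses = [0.2, 0.9, 0.4, 0.7]
fb = generator_feedback(losses, rng=random.Random(0))
print(min(losses) <= fb <= max(losses))  # True
```

Because the surviving subset changes every step, the generator faces a dynamic ensemble rather than a single fixed critic, which is the mechanism that discourages it from collapsing onto the modes one discriminator happens to reward.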
12.2 Some Concerns about DL
Regarding future development: although DL has achieved many record-
breaking results in a large variety of perception tasks, it is still far from
capable of common-sense reasoning. Yann LeCun has even said that if,
within his lifetime, DL reaches the level of a mouse in common-sense rea-
soning, that would be enough to meet his expectations. Some researchers,
such as Gary Marcus, believe that DL should learn more from how humans
explore the cognitive world and should apply more cognitive representa-
tions of objects, datasets, spaces, etc. But researchers from the DL camp,
such as LeCun, argue that DL does not have to simulate human cognitive
behavior.
I personally, like many others, would prefer to replace the term Artificial
Intelligence with Machine Intelligence, based on the following thoughts: the
steam engine freed up human strength, but it does not mimic human
strength; cars run faster than humans, but they do not imitate human legs.
Future intelligent machines may likewise free up some of the brain power
of humans, but computers do not think like the human brain; machines
should have their own way of thinking. Moreover, our understanding of
the human brain itself is extremely limited. Human beings need to learn
to respect machine intelligence technologies, and machines should be al-
lowed their own unique thinking and logic. Intelligent machines should
serve as assistants for humans rather than as replacements for human
beings.
Researchers like Ali Rahimi believe that many of the methods currently
used in ML lack theoretical understanding, especially in the field of DL.
Being able to understand them is undoubtedly of great significance, and
interpretable ML is one of the most important topics in the research com-
munity. But, meanwhile, we also have another important goal, which is to
develop new technologies and applications. Here I am more inclined toward
LeCun's opinion. In the history of science and technology, engineering prod-
ucts have always preceded theoretical understanding: lenses and telescopes
came before the theory of optics, steam engines before thermodynamics,
aircraft before flight aerodynamics, radio and data communication before
information theory, and computers before computer science. The reason is
that theoretical researchers spontaneously study the “simple” phenomena
and only divert their attention to complex problems once those problems
begin to have important practical implications.
If we want DL to have a longer-lasting and more sustainable future, close
collaboration between industry and academia is required. Generally, indus-
try lacks the latest algorithms and the talent for algorithm engineering,
while the academic community lacks large-scale datasets representing real-
world problems, as well as computing resources. Good cooperation between
these two groups can therefore significantly promote the overall develop-
ment of the field. On the other hand, we should be vigilant about the
excessive hype from the media, venture capitalists, and some startup com-
panies, which may cause suspicion of and resentment toward the artificial
intelligence industry across society as a whole.
Appendix A
Ph.D. Publications
• Ph.D. thesis: Haojin Yang, Automatic Video Indexing and Retrieval Us-
ing Video OCR Technology, Hasso-Plattner-Institute (HPI), Uni-Potsdam,
2013. Grade: “summa cum laude”1,2
• In Journals (3):
– Haojin Yang and Christoph Meinel, Content Based Lecture Video Re-
trieval Using Speech and Video Text Information. IEEE Transactions
on Learning Technologies (TLT), DOI: 10.1109/TLT.2014.2307305, online ISSN: 1939-1382, pp. 142-154, volume 7, number 2, Publisher:
IEEE Computer Society and IEEE Education Society, April-June 2014
– Haojin Yang, Bernhard Quehl and Harald Sack, A Framework for Im-
proved Video Text Detection and Recognition. International Journal
of Multimedia Tools and Applications (MTAP), Print ISSN:1380-7501,
online ISSN:1573-7721, Publisher: Springer Netherlands, DOI:10.1007
/s11042-012-1250-6, 2012
– Haojin Yang, Harald Sack and Christoph Meinel, Lecture Video Index-
ing and Analysis Using Video OCR Technology. International Journal
of Multimedia Processing and Technologies (JMPT), Volume: 2, Is-
sue: 4, pp. 176-196, Print ISSN: 0976-4127, Online ISSN: 0976-4135,
December 2011
1https://de.wikipedia.org/wiki/Dissertation#Bewertungsstufen_einer_Dissertation
2https://de.wikipedia.org/wiki/Promotion_(Doktor)#Deutschland
• In Conferences (10):
– Haojin Yang, Franka Grunewald, Matthias Bauer and Christoph Meinel,
Lecture Video Browsing Using Multimodal Information Resources. 12th
International Conference on Web-based Learning (ICWL 2013), Octo-
ber 6-9, 2013, Kenting, Taiwan. Springer lecture notes
– Franka Grunewald, Haojin Yang, Elnaz Mazandarani, Matthias Bauer
and Christoph Meinel, Next Generation Tele-Teaching: Latest Recording
Technology, User Engagement and Automatic Metadata Retrieval. In-
ternational Conference on Human Factors in Computing and Informat-
ics (southCHI), Lecture Notes in Computer Science (LNCS) Springer,
01–03 July, 2013 Maribor, Slovenia
– Haojin Yang, Christoph Oehlke and Christoph Meinel, An Automated
Analysis and Indexing Framework for Lecture Video Portal. 11th Inter-
national Conference on Web-based Learning (ICWL 2012), September
2-4, 2012, Sinaia, Romania. Springer lecture notes, pp. 285–294, Vol-
ume 7558, 2012 (best student paper award)
– Haojin Yang, Bernhard Quehl, Harald Sack, A skeleton based binariza-
tion approach for video text recognition. 13th International Workshop
on Image analysis for multimedia interactive services (WIAMIS 2012),
IEEE Press, pp. 1–4, Dublin Ireland, May. 23-25, 2012
– C. Hentschel, J. Hercher, M. Knuth, J. Osterhoff, B. Quehl, H. Sack,
N. Steinmetz, J. Waitelonis, H.-J. Yang (authors in alphabetical order): Open Up Cultural Heritage in Video Archives with Mediaglobe.
12th International Conference on Innovative Internet Community Ser-
vices (I2CS 2012), Trondheim (Norway), June. 13-15, 2012 (best pa-
per award)
– Haojin Yang, Franka Grunewald and Christoph Meinel, Automated ex-
traction of lecture outlines from lecture videos: a hybrid solution for
lecture video indexing. 4th International Conference on Computer Sup-
ported Education (CSEDU 2012), SciTePress, Porto Portugal, pp. 13–
22, Publisher: SciTePress, April. 16-18, 2012
– Haojin Yang, Bernhard Quehl and Harald Sack, Text detection in video
images using adaptive edge detection and stroke width verification. 19th
International Conference on Systems, Signals and Image Processing
(IWSSIP 2012), IEEE Press, Vienna, Austria, pp. 9–12, April. 11-13,
2012
– Haojin Yang, Maria Siebert, Patrick Löhne, Harald Sack and Christoph
Meinel, Lecture Video Indexing and Analysis Using Video OCR Tech-
nology. 7th International Conference on Signal Image Technology and
Internet Based Systems (SITIS 2011), Track Internet Based Computing
and Systems, IEEE Press, Dijon (France), pp. 54–61, November. 28 -
December. 1, 2011
– Haojin Yang, Maria Siebert, Patrick Löhne, Harald Sack and Christoph
Meinel, Automatic Lecture Video Indexing Using Video OCR Technol-
ogy. IEEE International Symposium on Multimedia 2011 (ISM 2011),
IEEE Press, Dana Point, CA, USA, December. 5-7, 2011
– Haojin Yang, Christoph Oehlke and Christoph Meinel, A Solution for
German Speech Recognition for Analysis and Processing of Lecture Videos.
10th IEEE/ACIS International Conference on Computer and Informa-
tion Science (ICIS 2011) , IEEE Press, ISBN 9783642336423, pp. 285–
294, Sanya, Hainan Island, China, May 2011
Appendix B
Publications After Ph.D.
• In Journals (5):
– Mina Rezaei, Haojin Yang and Christoph Meinel: Recurrent generative
adversarial network for learning imbalanced medical image semantic
segmentation. International Journal of Multimedia Tools and Appli-
cations (MTAP), Special Issue: “Deep Learning for Computer - aided
Medical Diagnosis”, http://dx.doi.org/10.1007/s11042-019-7305-1
Feb. 2019
– Cheng Wang, Haojin Yang and Christoph Meinel, Image Captioning
with Deep Bidirectional LSTMs and Multi-Task Learning. ACM Trans-
actions on Multimedia Computing, Communications, and Applications
(TOMM), Volume 14 Issue 2s, No. 40, May 2018
– Xiaoyin Che, Haojin Yang, Christoph Meinel, Automatic Online Lec-
ture Highlighting Based on Multimedia Analysis, IEEE Transactions on
Learning Technologies (TLT), Publisher: IEEE Computer Society and
IEEE Education Society, Volume: PP, Issue: 99, Print ISSN: 1939-
1382, 2017
– Cheng Wang, Haojin Yang and Christoph Meinel, A Deep Seman-
tic Framework for Multimodal Representation Learning, International
Journal of Multimedia Tools and Applications (MTAP), online ISSN:1573-
7721, Print ISSN:1380-7501, Special Issue: Representation Learning for
Multimedia Data Understanding, March 2016
– Xiaoyin Che, Haojin Yang, Christoph Meinel, The Automated Gen-
eration and Further Application of Tree-Structure Outline for Lecture
Videos with Synchronized Slides, International Journal of Technology
and Educational Marketing, Volume 4, Number 1, IGI Global, 2014
• In Conferences (>40):
– 2019
∗ Jonathan Sauder, Xiaoyin Che, Goncalo Mordido, Ting Hu, Hao-
jin Yang and Christoph Meinel, Best Student Forcing: A Novel
Training Mechanism in Adversarial Language Generation, the 57th
Annual Meeting of the Association for Computational Linguistics
(ACL 2019) (under review)
∗ Joseph Bethge, Haojin Yang, Marvin Bornstein, Christoph Meinel,
Back to Simplicity: How to Train Accurate BNNs from Scratch?.
International Conference on Computer Vision (ICCV 2019) (under
review)
∗ Mina Rezaei, Haojin Yang, Christoph Meinel, Medical Image Se-
mantic Segmentation using Conditional Refinement Generative Ad-
versarial Networks. IEEE Winter Conference on Applications of
Computer Vision (WACV) IEEE, 2019
∗ Mina Rezaei, Haojin Yang, Christoph Meinel: Learning Imbalanced
Semantic Segmentation through Cross-Domain Relations of Multi-
Agent Generative Adversarial Networks. Accepted by SPIE Medi-
cal Imaging - Computer Aided Diagnosis (SPIE19)
– 2018
∗ Christian Bartz, Haojin Yang, Christoph Meinel, SEE: Towards
Semi-Supervised End-to-End Scene text Recognition, the Thirty-
Second AAAI Conference on Artificial Intelligence (AAAI-18), February 2-7, 2018, New Orleans, Louisiana, USA
∗ Joseph Bethge, Haojin Yang, Christian Bartz, Christoph Meinel
Learning to Train a Binary Neural Network. In: arXiv preprint
arXiv:1809.10463, 2018
∗ Joseph Bethge, Marvin Bornstein, Adrian Loy, Haojin Yang, Christoph
Meinel Training Competitive Binary Neural Networks from Scratch.
In: arXiv preprint arXiv:1812.01965, 2018
∗ Goncalo Mordido, Haojin Yang and Christoph Meinel, Dropout-
GAN: Learning from a Dynamic Ensemble of Discriminators ACM
KDD’18 Deep Learning Day (KDD DLDay 2018), London UK,
2018
∗ Mina Rezaei, Haojin Yang and Christoph Meinel Instance Tumor
Segmentation using Multitask Convolutional Neural Network Inter-
national Joint Conference on Neural Networks (IJCNN) 2018
∗ Mina Rezaei, Haojin Yang, Christoph Meinel Whole Heart and
Great Vessel Segmentation with Context-aware of Generative Adversarial Networks. Bildverarbeitung für die Medizin (BVM) 2018
∗ Christian Bartz, Haojin Yang and Christoph Meinel, LoANs: Weakly
Supervised Object Detection with Localizer Assessor Networks In-
ternational Workshop on Advanced Machine Vision for Real-life
and Industrially Relevant Applications (AMV’18), Perth Australia,
2018
∗ Mina Rezaei, Haojin Yang, Christoph Meinel: voxel-GAN: Adver-
sarial Framework for Learning Imbalanced Brain Tumor Segmen-
tation. BrainLes@MICCAI 2018
∗ Mina Rezaei, Haojin Yang and Christoph Meinel, Generative Ad-
versarial Framework for Learning Multiple Clinical Tasks. Digital
Image Computing: Techniques and Applications (DICTA 2018)
∗ Mina Rezaei, Haojin Yang, Christoph Meinel, ”Automatic Cardiac
MRI Segmentation via Context-aware Recurrent Generative Adver-
sarial Neural Network”, Computer Assisted Radiology and Surgery
(CARS18)
∗ Jonathan Sauder, Xiaoyin Che, Goncalo Mordido, Haojin Yang and
Christoph Meinel. Pseudo-Ground-Truth Training for Adversarial
Text Generation with Reinforcement Learning. Deep Reinforce-
ment Learning Workshop at NeurIPS18
∗ Mina Rezaei, Haojin Yang, Christoph Meinel Recurrent Generative
Adversarial Network for Learning Multiple Clinical Tasks. Machine
Learning for Health Workshop at NeurIPS 2018 (ML4H)
– 2017
∗ Haojin Yang, Martin Fritzsche, Christian Bartz, Christoph Meinel,
BMXNet: An Open-Source Binary Neural Network Implementation
Based on MXNet ACM International Conference on Multimedia
(ACM MM), October 23-27, 2017, Mountain View, CA USA
∗ Christian Bartz, Haojin Yang, Christoph Meinel, STN-OCR: A
single Neural Network for Text Detection and Text Recognition,
arXiv:1707.08831v1 2017
∗ Xiaoyin Che, Nico Ring, Willi Raschkowski, Haojin Yang and Christoph
Meinel, Traversal-Free Word Vector Evaluation in Analogy Space,
RepEval workshop at EMNLP 17 (Empirical Methods in Natural
Language Processing), September 7-11, 2017, Copenhagen, Denmark
∗ Christian Bartz, Tom Herold, Haojin Yang and Christoph Meinel
Language Identification Using Deep Convolutional Recurrent Neu-
ral Networks, 24th International Conference on Neural Information
Processing (ICONIP 2017), November 14-18, 2017, Guangzhou,
China
∗ Mina Rezaei, Haojin Yang and Christoph Meinel Deep Neural Net-
work with l2-norm Unit for Brain Lesions Detection, 24th Inter-
national Conference on Neural Information Processing (ICONIP
2017), November 14-18, 2017, Guangzhou, China
∗ Xiaoyin Che, Nico Ring, Willi Raschkowski, Haojin Yang and Christoph
Meinel Automatic Lecture Subtitle Generation and How It Helps,
17th IEEE International Conference on Advanced Learning Tech-
nologies (ICALT 2017), July 3-7, 2017, Timisoara, Romania
– 2016
∗ Haojin Yang, Cheng Wang, Christian Bartz, Christoph Meinel Scene-
TextReg: A Real-Time Video OCR System, ACM international con-
ference on Multimedia (ACM MM 2016), system demonstration
session, 15-19 October 2016, Amsterdam, The Netherlands
∗ Cheng Wang, Haojin Yang, Christian Bartz, Christoph Meinel Im-
age Captioning with Deep Bidirectional LSTMs, ACM international
conference on Multimedia (ACM MM 2016), full paper (oral pre-
sentation), 15-19 October 2016, Amsterdam, The Netherlands
∗ Xiaoyin Che, Sheng Luo, Haojin Yang and Christoph Meinel, Sen-
tence Boundary Detection Based on Parallel Lexical and Acoustic
Models, INTERSPEECH 2016, San Francisco, California, USA in
September 8-12, 2016
∗ Cheng Wang, Haojin Yang and Christoph Meinel, Exploring Mul-
timodal Video Representation for Action Recognition, the annual
International Joint Conference on Neural Networks (IJCNN 2016),
Vancouver, Canada, July 24-29, 2016
∗ Haojin Yang, Real-Time Video OCR System, system demonstration
at 41st IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP 2016), Show&Tell session, Shanghai
China, 20-25 March 2016
∗ Xiaoyin Che, Cheng Wang, Haojin Yang and Christoph Meinel,
Punctuation Prediction for Unsegmented Transcript Based on Word
Vector, the 10th International Conference on Language Resources
and Evaluation (LREC 2016), Portorož (Slovenia), 23-28 May 2016
∗ Sheng Luo, Haojin Yang, Cheng Wang, Xiaoyin Che, and Christoph
Meinel, Action Recognition in Surveillance Video Using ConvNets
and Motion History Image, International Conference on Artificial
Neural Networks (ICANN 2016), Barcelona Spain, 6th-9th of Septem-
ber 2016
∗ Sheng Luo, Haojin Yang, Cheng Wang, Xiaoyin Che and Christoph
Meinel, Real-time action recognition in surveillance videos using
ConvNets, in the 23rd International Conference on Neural Information Processing (ICONIP 2016), in Kyoto (Japan), 16th-21st of
October 2016
∗ Hannes Rantzsch, Haojin Yang and Christoph Meinel Signature
Embedding: Writer Independent Offline Signature Verification with
Deep Metric Learning in 12th International Symposium on Visual
Computing (ISVC’16), Las Vegas USA, December 12-14, 2016
∗ Xiaoyin Che, Sheng Luo, Haojin Yang, Christoph Meinel Sentence-
Level Automatic Lecture Highlighting Based on Acoustic Analysis
16th IEEE International Conference on Computer and Information
Technology (IEEE CIT 2016), Shangri-La’s Fijian Resort, Fiji, 7-10
December 2016
∗ Xiaoyin Che, Thomas Staubitz, Haojin Yang and Christoph Meinel,
Pre-Course Key Segment Analysis of Online Lecture Videos, 16th
IEEE International Conference on Advancing Learning Technolo-
gies (ICALT-2016), Austin, Texas, USA, July 25-28, 2016
– 2015
∗ Cheng Wang, Haojin Yang, Christoph Meinel, Deep Semantic Map-
ping for Cross-Modal Retrieval, the 27th IEEE International Con-
ference on Tools with Artificial Intelligence (ICTAI 2015), Vietri
sul Mare, Italy, November 9-11, 2015
∗ Cheng Wang, Haojin Yang and Christoph Meinel, Does Multilevel
Semantic Representation Improve Text Categorization?, the 26th
International Conference on Database and Expert Systems Appli-
cations (DEXA 2015), Valencia, Spain, September 1-4, 2015
∗ Haojin Yang, Cheng Wang, Xiaoyin Che, Sheng Luo and Christoph Meinel,
An Improved System For Real-Time Scene Text Recognition, ACM
International Conference on Multimedia Retrieval (ICMR 2015),
system demonstration session, Shanghai, June 23-26, 2015
∗ Cheng Wang, Haojin Yang, Xiaoyin Che and Christoph Meinel,
Concept-Based Multimodal Learning for Topic Generation, the 21st
MultiMedia Modelling Conference (MMM2015), Sydney, Australia,
Jan 5-7, 2015
∗ Sheng Luo, Haojin Yang and Christoph Meinel, Reward-based In-
termittent Reinforcement in Gamification for E-learning, 7th Inter-
national Conference on Computer Supported Education (CSEDU),
Lisbon, Portugal, May 23-25, 2015
∗ Xiaoyin Che, Haojin Yang and Christoph Meinel, Table Detection
from Slide Images, 7th Pacific Rim Symposium on Image and Video
Technology (PSIVT2015), 23-27 November, 2015, Auckland, New
Zealand
∗ Xiaoyin Che, Haojin Yang and Christoph Meinel, Adaptive E-Lecture
Video Outline Extraction Based on Slides Analysis, the 14th Inter-
national Conference on Web-based Learning (ICWL 2015), Guangzhou,
China, November 5-8, 2015
∗ Cheng Wang, Haojin Yang and Christoph Meinel, Visual-Textual
Late Semantic Fusion Using Deep Neural Network for Document
Categorization, the 22nd International Conference on Neural In-
formation Processing (ICONIP2015), Istanbul, Turkey, November
9-12, 2015
– 2014
∗ Bernhard Quehl, Haojin Yang and Harald Sack, Improving text
recognition by distinguishing scene and overlay text, the 7th Inter-
national Conference on Machine Vision (ICMV 2014), Milan, Italy,
November 19-21, 2014
– 2013
∗ Xiaoyin Che, Haojin Yang, Christoph Meinel, Lecture Video Seg-
mentation by Automatically Analyzing the Synchronized Slides, The
21st ACM International Conference on Multimedia (ACM MM),
October 21-25, 2013, Barcelona, Spain
∗ Franka Grunewald, Haojin Yang, Christoph Meinel, Evaluating the
Digital Manuscript Functionality - User Testing For Lecture Video
Annotation Features, 12th International Conference on Web-based
Learning (ICWL 2013), 6 - 9th October 2013, Kenting, Taiwan.
Springer lecture notes, 2013. (best student paper award)
∗ Xiaoyin Che, Haojin Yang, Christoph Meinel, Tree-Structure Out-
line Generation for Lecture Videos with Synchronized Slides, The
Second International Conference on E-Learning and E-Technologies
in Education (ICEEE2013), 23-25th September 2013, Lodz, Poland
Appendix C
Deep Learning Applications
An incomplete list of DL applications:
• document processing [HS11]
• image classification and recognition [SZ14b, KSH12, HZRS16]
• video classification [KTS+14]
• sequence generation [Gra13]
• text, speech, image and video processing [LBH15]
• speech recognition and spoken language understanding [HDY+12, ZCY+16]
• text-to-speech generation [WSRS+17, ACC+17]
• sentence classification and modelling [Kim14, KGB14]
• premise selection [ISA+16]
• document and sentence processing [LM14]
• generating image captions [VTBE15, WYBM16]
• photographic style transfer [LPSB17]
• natural image manifold [ZKSE16]
• image colorization [ZIE16]
• visual question answering [AAL+15]
• generating textures and stylized images [ULVL16]
• visual recognition and description [DAHG+15]
• object detection [SMH+11]
• character motion synthesis and editing [HSK16]
• word representation [MCCD13]
• singing synthesis [BB17]
• person identification [LZXW14]
• face recognition and verification [TYRW14]
• action recognition in videos [SZ14a]
• classifying and visualizing motion capture sequences [CC14]
• handwriting generation and prediction [CHJO16]
• machine translation [BCB14, WSC+16]
• named entity recognition [LBS+16]
• conversational agents [GBC+17]
• cancer detection [EKN+17]
• audio generation [VDODZ+16]
• X-ray CT reconstruction [KMY17]
• hardware acceleration [HLM+16]
• robotics [LLS15]
• autonomous driving [CSKX15]
• pedestrian detection [OW13]
References
[20117a] BraTS: BraTS 2017. https://www.med.upenn.edu/sbia/
brats2017.html, 2017 8, 13, 223
[20117b] LiTS: LiTS 2017. https://competitions.codalab.org/
competitions/15595, 2017 8, 13, 223
[AAL+15] Antol, Stanislaw ; Agrawal, Aishwarya ; Lu, Jiasen ;
Mitchell, Margaret ; Batra, Dhruv ; Lawrence Zitnick, C
; Parikh, Devi: Vqa: Visual question answering. In: Proceedings
of the IEEE international conference on computer vision, 2015, S.
2425–2433 246
[ABC+16] Abadi, Martín ; Barham, Paul ; Chen, Jianmin ; Chen, Zhifeng
; Davis, Andy ; Dean, Jeffrey ; Devin, Matthieu ; Ghemawat,
Sanjay ; Irving, Geoffrey ; Isard, Michael u. a.: Tensorflow: a
system for large-scale machine learning. In: OSDI Bd. 16, 2016,
S. 265–283 9, 64
[ACC+17] Arik, Sercan O. ; Chrzanowski, Mike ; Coates, Adam ; Di-
amos, Gregory ; Gibiansky, Andrew ; Kang, Yongguo ; Li,
Xian ; Miller, John ; Ng, Andrew ; Raiman, Jonathan u. a.:
Deep voice: Real-time neural text-to-speech. In: arXiv preprint
arXiv:1702.07825 (2017) 245
[AR15] Aubry, Mathieu ; Russell, Bryan C.: Understanding deep
features with computer-generated imagery. In: Proceedings of
the IEEE International Conference on Computer Vision, 2015, S.
2875–2883 70
[B+09] Bengio, Yoshua u. a.: Learning deep architectures for AI. In:
Foundations and Trends in Machine Learning 2 (2009), Nr. 1, S.
1–127 21, 37, 60
[BB+95] Bishop, Chris ; Bishop, Christopher M. u. a.: Neural networks
for pattern recognition. Oxford university press, 1995 24
[BB17] Blaauw, Merlijn ; Bonada, Jordi: A neural parametric singing
synthesizer. In: arXiv preprint arXiv:1704.03809 (2017) 246
[BCB14] Bahdanau, Dzmitry ; Cho, Kyunghyun ; Bengio, Yoshua: Neu-
ral machine translation by jointly learning to align and translate.
In: arXiv preprint arXiv:1409.0473 (2014) 63, 246
[BCC+16] Bojarski, Mariusz ; Choromanska, Anna ; Choromanski,
Krzysztof ; Firner, Bernhard ; Jackel, Larry ; Muller, Urs ;
Zieba, Karol: Visualbackprop: visualizing cnns for autonomous
driving. In: arXiv preprint (2016) xv, 59, 60, 70
[BCNN13a] Bissacco, A. ; Cummins, M. ; Netzer, Y. ; Neven, H.: Pho-
toOCR: Reading Text in Uncontrolled Conditions. In: 2013 IEEE
International Conference on Computer Vision, 2013. – ISSN 1550–
5499, S. 785–792 9, 39
[BCNN13b] Bissacco, Alessandro ; Cummins, Mark ; Netzer, Yuval ;
Neven, Hartmut: PhotoOCR: Reading Text in Uncontrolled Con-
ditions. In: Proceedings of the IEEE International Conference on
Computer Vision, 2013, 785-792 101
[BCV13a] Bengio, Y. ; Courville, A. ; Vincent, P.: Representation
Learning: A Review and New Perspectives. In: IEEE Transactions
on Pattern Analysis and Machine Intelligence 35 (2013), Aug, Nr.
8, S. 1798–1828. http://dx.doi.org/10.1109/TPAMI.2013.50. –
DOI 10.1109/TPAMI.2013.50. – ISSN 0162–8828 7
[BCV13b] Bengio, Yoshua ; Courville, Aaron ; Vincent, Pascal: Repre-
sentation learning: A review and new perspectives. In: IEEE trans-
actions on pattern analysis and machine intelligence 35 (2013), Nr.
8, S. 1798–1828 20
[BDTD+16] Bojarski, Mariusz ; Del Testa, Davide ; Dworakowski,
Daniel ; Firner, Bernhard ; Flepp, Beat ; Goyal, Prasoon
; Jackel, Lawrence D. ; Monfort, Mathew ; Muller, Urs ;
Zhang, Jiakai u. a.: End to end learning for self-driving cars. In:
arXiv preprint arXiv:1604.07316 (2016) vi, 5
[BE15] Massachusetts Institute of Technology, Biological Engineering:
MDA231 human breast carcinoma cell dataset. http://www.
celltrackingchallenge.net/datasets.html, 2015 8, 13, 223
[BGV92] Boser, Bernhard E. ; Guyon, Isabelle M. ; Vapnik, Vladimir N.:
A Training Algorithm for Optimal Margin Classifiers. In: Pro-
ceedings of the Fifth Annual Workshop on Computational Learning
Theory. New York, NY, USA : ACM, 1992 (COLT ’92). – ISBN
0–89791–497–X, 144–152 19
[Blo18] Blog, OpenAI: OpenAI Five has started to defeat amateur human
teams at Dota 2. 2018 23, 68
[BQ15] Quehl, Bernhard ; Yang, Haojin ; Sack, Harald: Improving text
recognition by distinguishing scene and overlay text. In: Proc.
SPIE 9445 (2015) 9
[BYBM18] Bethge, Joseph ; Yang, Haojin ; Bartz, Christian ; Meinel,
Christoph: Learning to Train a Binary Neural Network. In: arXiv
preprint arXiv:1809.10463 (2018) 10, 14, 57, 82, 227
[BYBM19] Bethge, Joseph ; Yang, Haojin ; Bornstein, Marvin ; Meinel,
Christoph: Back to Simplicity: How to Train Accurate BNNs from
Scratch? In: arXiv preprint arXiv:1906.08637 (2019) 10, 14, 82,
227
[BYM17a] Bartz, Christian ; Yang, Haojin ; Meinel, Christoph: STN-
OCR: A single Neural Network for Text Detection and Text Recog-
nition. In: CoRR abs/1707.08831 (2017). http://arxiv.org/
abs/1707.08831 9
[BYM17b] Bartz, Christian ; Yang, Haojin ; Meinel, Christoph: STN-
OCR: A single Neural Network for Text Detection and Text Recog-
nition. In: arXiv preprint arXiv:1707.08831 (2017) 91
[BYM18] Bartz, Christian ; Yang, Haojin ; Meinel, Christoph: SEE:
Towards Semi-Supervised End-to-End Scene Text Recognition. In:
Proceedings of the 2018 Conference on Artificial Intelligence, 2018
(AAAI ’18) vi, 6, 9, 14, 73, 81, 220
[CC14] Cho, Kyunghyun ; Chen, Xi: Classifying and visualizing motion
capture sequences using deep neural networks. In: Computer Vi-
sion Theory and Applications (VISAPP), 2014 International Con-
ference on Bd. 2 IEEE, 2014, S. 122–130 246
[CHJO16] Carter, Shan ; Ha, David ; Johnson, Ian ; Olah, Chris: Exper-
iments in handwriting with a neural network. In: Distill 1 (2016),
Nr. 12, S. e4 246
[Cho17] Chollet, Francois: Xception: Deep learning with depthwise sep-
arable convolutions. In: arXiv preprint arXiv:1610.02357 (2017) 58
[Cis17] Cisco: Cisco Visual Networking Index: Forecast
and Methodology, 2016–2021. Version: 2017. https:
//www.cisco.com/c/en/us/solutions/collateral/
service-provider/visual-networking-index-vni/
complete-white-paper-c11-481360.pdf. Cisco public, 2017, 3
vi, 3
[CLL+15] Chen, Tianqi ; Li, Mu ; Li, Yutian ; Lin, Min ; Wang, Naiyan
; Wang, Minjie ; Xiao, Tianjun ; Xu, Bing ; Zhang, Chiyuan ;
Zhang, Zheng: Mxnet: A flexible and efficient machine learning
library for heterogeneous distributed systems. In: arXiv preprint
arXiv:1512.01274 (2015) vii, 6, 9, 10, 101
[CLYM16] Che, Xiaoyin ; Luo, Sheng ; Yang, Haojin ; Meinel, Christoph:
Sentence Boundary Detection Based on Parallel Lexical and Acous-
tic Models. In: Interspeech, 2016, S. 2528–2532 12
[CN15] Chiu, Jason P. ; Nichols, Eric: Named entity recognition with
bidirectional LSTM-CNNs. In: arXiv preprint arXiv:1511.08308
(2015) 35
[CPC16] Canziani, Alfredo ; Paszke, Adam ; Culurciello, Eugenio: An
analysis of deep neural network models for practical applications.
In: arXiv preprint arXiv:1605.07678 (2016) xv, 57, 58
[CSKX15] Chen, Chenyi ; Seff, Ari ; Kornhauser, Alain ; Xiao, Jianx-
iong: Deepdriving: Learning affordance for direct perception in
autonomous driving. In: Proceedings of the IEEE International
Conference on Computer Vision, 2015, S. 2722–2730 246
[CUH15] Clevert, Djork-Arne ; Unterthiner, Thomas ; Hochreiter,
Sepp: Fast and accurate deep network learning by exponential
linear units (elus). In: arXiv preprint arXiv:1511.07289 (2015) 44
[CVMG+14] Cho, Kyunghyun ; Van Merrienboer, Bart ; Gulcehre,
Caglar ; Bahdanau, Dzmitry ; Bougares, Fethi ; Schwenk,
Holger ; Bengio, Yoshua: Learning phrase representations us-
ing RNN encoder-decoder for statistical machine translation. In:
arXiv preprint arXiv:1406.1078 (2014) 35
[CWV+14] Chetlur, Sharan ; Woolley, Cliff ; Vandermersch, Philippe
; Cohen, Jonathan ; Tran, John ; Catanzaro, Bryan ; Shel-
hamer, Evan: cudnn: Efficient primitives for deep learning. In:
arXiv preprint arXiv:1410.0759 (2014) 64
[Cyb89] Cybenko, George: Approximation by superpositions of a sig-
moidal function. In: Mathematics of control, signals and systems
2 (1989), Nr. 4, S. 303–314 25
[CYM13] Che, Xiaoyin ; Yang, Haojin ; Meinel, Christoph: Lecture
video segmentation by automatically analyzing the synchronized
slides. In: Proceedings of the 21st ACM international conference
on Multimedia ACM, 2013, S. 345–348 12
[CYM15] Che, Xiaoyin ; Yang, Haojin ; Meinel, Christoph: Adaptive
e-lecture video outline extraction based on slides analysis. In: In-
ternational Conference on Web-Based Learning Springer, 2015, S.
59–68 12
[CYM18] Che, Xiaoyin ; Yang, Haojin ; Meinel, Christoph: Automatic
Online Lecture Highlighting Based on Multimedia Analysis. In:
IEEE Transactions on Learning Technologies 11 (2018), Nr. 1, S.
27–40 12, 14, 82
[DAHG+15] Donahue, Jeffrey ; Anne Hendricks, Lisa ; Guadarrama,
Sergio ; Rohrbach, Marcus ; Venugopalan, Subhashini ;
Saenko, Kate ; Darrell, Trevor: Long-term recurrent convo-
lutional networks for visual recognition and description. In: Pro-
ceedings of the IEEE conference on computer vision and pattern
recognition, 2015, S. 2625–2634 246
[DB16] Dosovitskiy, Alexey ; Brox, Thomas: Inverting visual represen-
tations with convolutional networks. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, S.
4829–4837 70
[DDS+09] Deng, J. ; Dong, W. ; Socher, R. ; Li, L.-J. ; Li, K. ; Fei-Fei,
L.: ImageNet: A Large-Scale Hierarchical Image Database. In:
CVPR09, 2009 17, 19, 62
[DHS11] Duchi, John ; Hazan, Elad ; Singer, Yoram: Adaptive sub-
gradient methods for online learning and stochastic optimization.
In: Journal of Machine Learning Research 12 (2011), Nr. Jul, S.
2121–2159 46
[DT05] Dalal, Navneet ; Triggs, Bill: Histograms of oriented gradients
for human detection. In: Computer Vision and Pattern Recogni-
tion, 2005. CVPR 2005. IEEE Computer Society Conference on
Bd. 1 IEEE, 2005, S. 886–893 19
[EKN+17] Esteva, Andre ; Kuprel, Brett ; Novoa, Roberto A. ; Ko,
Justin ; Swetter, Susan M. ; Blau, Helen M. ; Thrun, Sebas-
tian: Dermatologist-level classification of skin cancer with deep
neural networks. In: Nature 542 (2017), Nr. 7639, S. 115 vi, 5, 63,
246
[Fac18] Facebook: Facebook statistics. http://facebook.com/, 2018 vi,
3
[FH17] Frosst, Nicholas ; Hinton, Geoffrey: Distilling a neural net-
work into a soft decision tree. In: arXiv preprint arXiv:1711.09784
(2017) 71
[Fuk88] Fukushima, Kunihiko: Neocognitron: A hierarchical neural net-
work capable of visual pattern recognition. In: Neural networks 1
(1988), Nr. 2, S. 119–130 28
[FV91] Felleman, Daniel J. ; Van Essen, David C.: Distributed hierarchical
processing in the primate cerebral cortex. In: Cerebral cortex (New
York, NY: 1991) 1 (1991), Nr. 1, S. 1–47 20
[Gar93] Garofolo, John S.: TIMIT acoustic phonetic continuous speech
corpus. In: Linguistic Data Consortium, 1993 (1993) 62
[GB10] Glorot, Xavier ; Bengio, Yoshua: Understanding the difficulty
of training deep feedforward neural networks. In: Proceedings of
the thirteenth international conference on artificial intelligence and
statistics, 2010, S. 249–256 39, 40
[GBC+17] Ghazvininejad, Marjan ; Brockett, Chris ; Chang, Ming-
Wei ; Dolan, Bill ; Gao, Jianfeng ; Yih, Wen-tau ; Galley,
Michel: A knowledge-grounded neural conversation model. In:
arXiv preprint arXiv:1702.01932 (2017) 246
[GBCB16] Goodfellow, Ian ; Bengio, Yoshua ; Courville, Aaron: Deep
learning. Bd. 1. MIT Press Cambridge, 2016
22, 24, 37
[GBI+14] Goodfellow, Ian ; Bulatov, Yaroslav ; Ibarz, Julian ;
Arnoud, Sacha ; Shet, Vinay: Multi-digit Number Recogni-
tion from Street View Imagery using Deep Convolutional Neural
Networks. In: ICLR2014, 2014 101
[GDDM14] Girshick, Ross ; Donahue, Jeff ; Darrell, Trevor ; Malik,
Jitendra: Rich feature hierarchies for accurate object detection and
semantic segmentation. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, 2014, S. 580–587 65
[GDG+15] Gregor, Karol ; Danihelka, Ivo ; Graves, Alex ; Rezende,
Danilo J. ; Wierstra, Daan: Draw: A recurrent neural network
for image generation. In: arXiv preprint arXiv:1502.04623 (2015)
66
[Gir15] Girshick, Ross: Fast r-cnn. In: Proceedings of the IEEE inter-
national conference on computer vision, 2015, S. 1440–1448 66
[GPAM+14] Goodfellow, Ian ; Pouget-Abadie, Jean ; Mirza, Mehdi ;
Xu, Bing ; Warde-Farley, David ; Ozair, Sherjil ; Courville,
Aaron ; Bengio, Yoshua: Generative adversarial nets. In: Ad-
vances in neural information processing systems, 2014, S. 2672–
2680 13, 23, 67, 68
[Gra13] Graves, Alex: Generating sequences with recurrent neural net-
works. In: arXiv preprint arXiv:1308.0850 (2013) 245
[GS05] Graves, Alex ; Schmidhuber, Jurgen: Framewise phoneme clas-
sification with bidirectional LSTM and other neural network archi-
tectures. In: Neural Networks 18 (2005), Nr. 5-6, S. 602–610 35
[HCS+16] Hubara, Itay ; Courbariaux, Matthieu ; Soudry, Daniel ; El-
Yaniv, Ran ; Bengio, Yoshua: Binarized neural networks. In:
Advances in neural information processing systems, 2016, S. 4107–
4115 9, 10, 72
[HDY+12] Hinton, Geoffrey ; Deng, Li ; Yu, Dong ; Dahl, George E. ;
Mohamed, Abdel-rahman ; Jaitly, Navdeep ; Senior, Andrew
; Vanhoucke, Vincent ; Nguyen, Patrick ; Sainath, Tara N.
u. a.: Deep neural networks for acoustic modeling in speech recog-
nition: The shared views of four research groups. In: IEEE Signal
processing magazine 29 (2012), Nr. 6, S. 82–97 245
[HGDG17] He, Kaiming ; Gkioxari, Georgia ; Dollar, Piotr ; Girshick,
Ross: Mask r-cnn. In: Computer Vision (ICCV), 2017 IEEE
International Conference on IEEE, 2017, S. 2980–2988 66, 73
[HHQ+16] He, Pan ; Huang, Weilin ; Qiao, Yu ; Loy, Chen C. ; Tang,
Xiaoou: Reading scene text in deep convolutional sequences. In:
Proceedings of the Thirtieth AAAI Conference on Artificial Intel-
ligence, AAAI Press, 2016, S. 3501–3508 101
[HL08] Huiskes, Mark J. ; Lew, Michael S.: The MIR flickr retrieval
evaluation. In: Proceedings of the 1st ACM international confer-
ence on Multimedia information retrieval ACM, 2008, S. 39–43 11
[HLM+16] Han, Song ; Liu, Xingyu ; Mao, Huizi ; Pu, Jing ; Pedram,
Ardavan ; Horowitz, Mark A. ; Dally, William J.: EIE: efficient
inference engine on compressed deep neural network. In: Computer
Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International
Symposium on IEEE, 2016, S. 243–254 246
[HMD15] Han, Song ; Mao, Huizi ; Dally, William J.: Deep Compression:
Compressing Deep Neural Networks with Pruning, Trained Quan-
tization and Huffman Coding. In: arXiv preprint arXiv:1510.00149
(2015), 1–14
58, 71
[HS97] Hochreiter, Sepp ; Schmidhuber, Jurgen: Long short-term
memory. In: Neural computation 9 (1997), Nr. 8, S. 1735–1780 28,
34
[HS06] Hinton, Geoffrey E. ; Salakhutdinov, Ruslan R.: Reducing
the dimensionality of data with neural networks. In: science 313
(2006), Nr. 5786, S. 504–507 23, 61
[HS11] Hinton, Geoffrey ; Salakhutdinov, Ruslan: Discovering binary
codes for documents by learning deep generative models. In: Topics
in Cognitive Science 3 (2011), Nr. 1, S. 74–91 245
[HSK16] Holden, Daniel ; Saito, Jun ; Komura, Taku: A deep learning
framework for character motion synthesis and editing. In: ACM
Transactions on Graphics (TOG) 35 (2016), Nr. 4, S. 138 246
[HSW89] Hornik, Kurt ; Stinchcombe, Maxwell ; White, Halbert: Mul-
tilayer feedforward networks are universal approximators. In: Neu-
ral networks 2 (1989), Nr. 5, S. 359–366 25
[HVD15] Hinton, Geoffrey ; Vinyals, Oriol ; Dean, Jeff: Distill-
ing the knowledge in a neural network. In: arXiv preprint
arXiv:1503.02531 (2015) 226
[HZC+17] Howard, Andrew G. ; Zhu, Menglong ; Chen, Bo ;
Kalenichenko, Dmitry ; Wang, Weijun ; Weyand, Tobias
; Andreetto, Marco ; Adam, Hartwig: MobileNets: Effi-
cient Convolutional Neural Networks for Mobile Vision Applica-
tions. In: arXiv preprint arXiv:1704.04861 (2017) 58, 71
[HZRS15] He, Kaiming ; Zhang, Xiangyu ; Ren, Shaoqing ; Sun, Jian:
Delving deep into rectifiers: Surpassing human-level performance
on imagenet classification. In: Proceedings of the IEEE interna-
tional conference on computer vision, 2015, S. 1026–1034 40, 44
[HZRS16] He, Kaiming ; Zhang, Xiangyu ; Ren, Shaoqing ; Sun, Jian:
Deep residual learning for image recognition. In: Proceedings of
the IEEE conference on computer vision and pattern recognition,
2016, S. 770–778 29, 54, 100, 245
[IHM+16] Iandola, Forrest N. ; Han, Song ; Moskewicz, Matthew W.
; Ashraf, Khalid ; Dally, William J. ; Keutzer, Kurt:
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters
and <0.5MB model size. In: arXiv preprint arXiv:1602.07360
(2016), 1–13 58, 71
[IS15a] Ioffe, Sergey ; Szegedy, Christian: Batch normalization: Accel-
erating deep network training by reducing internal covariate shift.
In: arXiv preprint arXiv:1502.03167 (2015) 37, 40, 41
[IS15b] Ioffe, Sergey ; Szegedy, Christian: Batch normalization: Accel-
erating deep network training by reducing internal covariate shift.
In: International conference on machine learning, 2015, S. 448–456
100
[ISA+16] Irving, Geoffrey ; Szegedy, Christian ; Alemi, Alexander A.
; Een, Niklas ; Chollet, Francois ; Urban, Josef: Deepmath-
deep sequence models for premise selection. In: Advances in Neural
Information Processing Systems, 2016, S. 2235–2243 245
[IZZE17] Isola, Phillip ; Zhu, Jun-Yan ; Zhou, Tinghui ; Efros,
Alexei A.: Image-to-image translation with conditional adversarial
networks. In: arXiv preprint (2017) 63
[JSD+14] Jia, Yangqing ; Shelhamer, Evan ; Donahue, Jeff ; Karayev,
Sergey ; Long, Jonathan ; Girshick, Ross ; Guadarrama, Ser-
gio ; Darrell, Trevor: Caffe: Convolutional architecture for fast
feature embedding. In: Proceedings of the 22nd ACM international
conference on Multimedia ACM, 2014, S. 675–678 9
[JSVZ14] Jaderberg, M. ; Simonyan, K. ; Vedaldi, A. ; Zisserman, A.:
Synthetic Data and Artificial Neural Networks for Natural Scene
Text Recognition. In: Workshop on Deep Learning, NIPS, 2014
101
[JSVZ15] Jaderberg, Max ; Simonyan, Karen ; Vedaldi, Andrea ; Zis-
serman, Andrew: Reading Text in the Wild with Convolu-
tional Neural Networks. In: International Journal of Computer
Vision 116 (2015), Nr. 1, 1-20. http://dx.doi.org/10.1007/
s11263-015-0823-z. – DOI 10.1007/s11263–015–0823–z. – ISSN
0920–5691, 1573–1405 xvii, 101
[JVZ14a] Jaderberg, Max ; Vedaldi, Andrea ; Zisserman, Andrew:
Deep Features for Text Spotting. In: Computer Vision - ECCV
2014, Springer International Publishing, 2014 (Lecture Notes in
Computer Science 8692). – ISBN 978–3–319–10592–5 978–3–319–
10593–2, 512-528 101
[JVZ14b] Jaderberg, Max ; Vedaldi, Andrea ; Zisserman, Andrew:
Deep Features for Text Spotting. In: Fleet, David (Hrsg.) ; Pa-
jdla, Tomas (Hrsg.) ; Schiele, Bernt (Hrsg.) ; Tuytelaars,
Tinne (Hrsg.): Computer Vision - ECCV 2014, Springer Interna-
tional Publishing, 2014 (Lecture Notes in Computer Science 8692).
– ISBN 978–3–319–10592–5 978–3–319–10593–2, 512-528 101
[KB14] Kingma, Diederik P. ; Ba, Jimmy: Adam: A method for stochas-
tic optimization. In: arXiv preprint arXiv:1412.6980 (2014) 37,
47
[KGB14] Kalchbrenner, Nal ; Grefenstette, Edward ; Blunsom,
Phil: A convolutional neural network for modelling sentences. In:
arXiv preprint arXiv:1404.2188 (2014) 245
[KGBN+15] Karatzas, Dimosthenis ; Gomez-Bigorda, Lluis ; Nicolaou,
Anguelos ; Ghosh, Suman ; Bagdanov, Andrew ; Iwamura,
Masakazu ; Matas, Jiri ; Neumann, Lukas ; Chandrasekhar,
Vijay R. ; Lu, Shijian ; Shafait, Faisal ; Uchida, Seiichi ; Val-
veny, Ernest: ICDAR 2015 Competition on Robust Reading. In:
Proceedings of the 2015 13th International Conference on Docu-
ment Analysis and Recognition (ICDAR). Washington, DC, USA :
IEEE Computer Society, 2015 (ICDAR ’15). – ISBN 978–1–4799–
1805–8, 1156–1160 100
[KH09] Krizhevsky, Alex ; Hinton, Geoffrey: Learning multiple layers
of features from tiny images / Citeseer. 2009. – Forschungsbericht
19
[Kim14] Kim, Yoon: Convolutional neural networks for sentence classifica-
tion. In: arXiv preprint arXiv:1408.5882 (2014) 245
[KL17] Koh, Pang W. ; Liang, Percy: Understanding black-box predic-
tions via influence functions. In: arXiv preprint arXiv:1703.04730
(2017) 70
[KMY17] Kang, Eunhee ; Min, Junhong ; Ye, Jong C.: A deep convolu-
tional neural network using directional wavelets for low-dose X-ray
CT reconstruction. In: Medical physics 44 (2017), Nr. 10 246
[KSH12] Krizhevsky, Alex ; Sutskever, Ilya ; Hinton, Geoffrey E.:
Imagenet classification with deep convolutional neural networks.
In: Advances in neural information processing systems, 2012, S.
1097–1105 17, 36, 48, 245
[KSU+13] Karatzas, Dimosthenis ; Shafait, Faisal ; Uchida, Seiichi ;
Iwamura, Masakazu ; Bigorda, Lluis G. ; Mestre, Sergi R.
; Mas, Joan ; Mota, David F. ; Almazan, Jon A. ; Heras,
Lluis P. l.: ICDAR 2013 robust reading competition. In: 2013 12th
International Conference on Document Analysis and Recognition,
IEEE, 2013, S. 1484–1493 101
[KTS+14] Karpathy, Andrej ; Toderici, George ; Shetty, Sanketh ; Le-
ung, Thomas ; Sukthankar, Rahul ; Fei-Fei, Li: Large-scale
video classification with convolutional neural networks. In: Pro-
ceedings of the IEEE conference on Computer Vision and Pattern
Recognition, 2014, S. 1725–1732 245
[L+15] LeCun, Yann u. a.: LeNet-5, convolutional neural networks. In:
URL: http://yann.lecun.com/exdb/lenet (2015), S. 20 xiii, 29, 48
[LAE+16] Liu, Wei ; Anguelov, Dragomir ; Erhan, Dumitru ; Szegedy,
Christian ; Reed, Scott ; Fu, Cheng-Yang ; Berg, Alexander C.:
Ssd: Single shot multibox detector. In: European conference on
computer vision Springer, 2016, S. 21–37 66
[LBBH98] LeCun, Yann ; Bottou, Leon ; Bengio, Yoshua ; Haffner,
Patrick: Gradient-based learning applied to document recognition.
In: Proceedings of the IEEE 86 (1998), Nr. 11, S. 2278–2324 28
[LBH15] LeCun, Yann ; Bengio, Yoshua ; Hinton, Geoffrey: Deep learn-
ing. In: nature 521 (2015), Nr. 7553, S. 436 20, 24, 37, 245
[LBS+16] Lample, Guillaume ; Ballesteros, Miguel ; Subramanian,
Sandeep ; Kawakami, Kazuya ; Dyer, Chris: Neural ar-
chitectures for named entity recognition. In: arXiv preprint
arXiv:1603.01360 (2016) 246
[LLS15] Lenz, Ian ; Lee, Honglak ; Saxena, Ashutosh: Deep learning for
detecting robotic grasps. In: The International Journal of Robotics
Research 34 (2015), Nr. 4-5, S. 705–724 246
[LM14] Le, Quoc ; Mikolov, Tomas: Distributed representations of sen-
tences and documents. In: International Conference on Machine
Learning, 2014, S. 1188–1196 245
[LMB+14] Lin, Tsung-Yi ; Maire, Michael ; Belongie, Serge ; Hays, James
; Perona, Pietro ; Ramanan, Deva ; Dollar, Piotr ; Zitnick,
C L.: Microsoft coco: Common objects in context. In: European
conference on computer vision Springer, 2014, S. 740–755 11, 222
[LNZ+17] Li, Zefan ; Ni, Bingbing ; Zhang, Wenjun ; Yang, Xiaokang ;
Gao, Wen: Performance Guaranteed Network Acceleration via
High-Order Residual Quantization. (2017). http://arxiv.org/
abs/1708.08687 9
[LOD18] Li, He ; Ota, Kaoru ; Dong, Mianxiong: Learning IoT in edge:
deep learning for the internet of things with edge computing. In:
IEEE Network 32 (2018), Nr. 1, S. 96–101 247
[Low04] Lowe, David G.: Distinctive image features from scale-invariant
keypoints. In: International journal of computer vision 60 (2004),
Nr. 2, S. 91–110 19
[LPSB17] Luan, Fujun ; Paris, Sylvain ; Shechtman, Eli ; Bala, Kavita:
Deep photo style transfer. In: CoRR abs/1703.07511 (2017)
245
[Lu15] Lu, Yao: Unsupervised learning on neural network outputs:
with application in zero-shot learning. In: arXiv preprint
arXiv:1506.00990 (2015) 70
[LW+02] Liaw, Andy ; Wiener, Matthew u. a.: Classification and regres-
sion by randomForest. In: R news 2 (2002), Nr. 3, S. 18–22 19
[LZP17] Lin, Xiaofan ; Zhao, Cong ; Pan, Wei: Towards Accurate Binary
Convolutional Neural Network. In: Advances in Neural Informa-
tion Processing Systems, 2017, S. 344–352 9, 72
[LZXW14] Li, Wei ; Zhao, Rui ; Xiao, Tong ; Wang, Xiaogang: Deepreid:
Deep filter pairing neural network for person re-identification. In:
Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, 2014, S. 152–159 246
[MAJ12] Mishra, Anand ; Alahari, Karteek ; Jawahar, C. V.: Scene Text
Recognition using Higher Order Language Priors. In: BMVC 2012-
23rd British Machine Vision Conference, British Machine Vision
Association, 2012. – ISBN 1–901725–46–4, 127.1-127.11 101
[Mar18] Marcus, Gary: Deep learning: A critical appraisal. In: arXiv
preprint arXiv:1801.00631 (2018) 75, 76
[MCCD13] Mikolov, Tomas ; Chen, Kai ; Corrado, Greg ; Dean, Jeffrey:
Efficient estimation of word representations in vector space. In:
arXiv preprint arXiv:1301.3781 (2013) 246
[MHN13] Maas, Andrew L. ; Hannun, Awni Y. ; Ng, Andrew Y.: Rectifier
nonlinearities improve neural network acoustic models. In: Proc.
icml Bd. 30, 2013, S. 3 44
[MKS+13] Mnih, Volodymyr ; Kavukcuoglu, Koray ; Silver, David ;
Graves, Alex ; Antonoglou, Ioannis ; Wierstra, Daan ;
Riedmiller, Martin: Playing atari with deep reinforcement learn-
ing. In: arXiv preprint arXiv:1312.5602 (2013) 23
[MKS+15] Mnih, Volodymyr ; Kavukcuoglu, Koray ; Silver, David ;
Rusu, Andrei A. ; Veness, Joel ; Bellemare, Marc G. ;
Graves, Alex ; Riedmiller, Martin ; Fidjeland, Andreas K. ;
Ostrovski, Georg u. a.: Human-level control through deep rein-
forcement learning. In: Nature 518 (2015), Nr. 7540, S. 529 23,
68
[MLA+13] Malik, M. I. ; Liwicki, M. ; Alewijnse, L. ; Ohyama, W. ;
Blumenstein, M. ; Found, B.: ICDAR 2013 Competitions on
Signature Verification and Writer Identification for On- and Offline
Skilled Forgeries (SigWiComp 2013). In: 2013 12th International
Conference on Document Analysis and Recognition, 2013. – ISSN
1520–5363, S. 1477–1483 9
[MP43] McCulloch, Warren S. ; Pitts, Walter: A logical calculus of
the ideas immanent in nervous activity. In: The bulletin of math-
ematical biophysics 5 (1943), Nr. 4, S. 115–133 25
[MP69] Minsky, Marvin ; Papert, Seymour A.: Perceptrons: An intro-
duction to computational geometry. MIT press, 1969 25
[MSC+13a] Mikolov, Tomas ; Sutskever, Ilya ; Chen, Kai ; Corrado,
Greg S. ; Dean, Jeff: Distributed representations of words and
phrases and their compositionality. In: Advances in neural infor-
mation processing systems, 2013, S. 3111–3119 19
[MSC+13b] Mikolov, Tomas ; Sutskever, Ilya ; Chen, Kai ; Corrado,
Greg S. ; Dean, Jeff: Distributed representations of words and
phrases and their compositionality. In: Advances in neural infor-
mation processing systems, 2013, S. 3111–3119 228
[MYM18] Mordido, Goncalo ; Yang, Haojin ; Meinel, Christoph:
Dropout-GAN: Learning from a Dynamic Ensemble of Discrimi-
nators. In: arXiv preprint arXiv:1807.11346 (2018) 228
[MYM19a] Mordido, Goncalo ; Yang, Haojin ; Meinel, Christoph:
Dropout-GAN: Learning from a Dynamic Ensemble of Discrimi-
nators (under review). In: Proceedings of the 2019 Conference on
Artificial Intelligence, 2019 (AAAI ’19) 13
[MYM19b] Mordido, Goncalo ; Yang, Haojin ; Meinel, Christoph: Mi-
croGAN: Promoting Variety through Micro-Batch Discrimination
(under review). In: Proceedings of the International Conference on
Learning Representations, 2019 (ICLR ’19) 13
[NH10] Nair, Vinod ; Hinton, Geoffrey E.: Rectified linear units im-
prove restricted boltzmann machines. In: Proceedings of the 27th
international conference on machine learning (ICML-10), 2010, S.
807–814 29, 37, 43
[OMS17] Olah, Chris ; Mordvintsev, Alexander ; Schubert, Ludwig:
Feature visualization. In: Distill 2 (2017), Nr. 11, S. e7 70
[OPM02] Ojala, Timo ; Pietikainen, Matti ; Maenpaa, Topi: Multires-
olution gray-scale and rotation invariant texture classification with
local binary patterns. In: IEEE Transactions on pattern analysis
and machine intelligence 24 (2002), Nr. 7, S. 971–987 19
[OW13] Ouyang, Wanli ; Wang, Xiaogang: Joint deep learning for pedes-
trian detection. In: Proceedings of the IEEE International Confer-
ence on Computer Vision, 2013, S. 2056–2063 246
[PCC+] Paszke, Adam ; Chintala, Soumith ; Collobert, Ronan ;
Kavukcuoglu, Koray ; Farabet, Clement ; Bengio, Samy ;
Melvin, Iain ; Weston, Jason ; Mariethoz, Johnny: PyTorch:
Tensors and dynamic neural networks in Python with strong GPU
acceleration, May 2017 9
[Qia99] Qian, Ning: On the momentum term in gradient descent learning
algorithms. In: Neural networks 12 (1999), Nr. 1, S. 145–151 45
[RCPC+10] Rasiwasia, Nikhil ; Costa Pereira, Jose ; Coviello,
Emanuele ; Doyle, Gabriel ; Lanckriet, Gert R. ; Levy, Roger
; Vasconcelos, Nuno: A new approach to cross-modal multi-
media retrieval. In: Proceedings of the 18th ACM international
conference on Multimedia ACM, 2010, S. 251–260 11
[RDGF16] Redmon, Joseph ; Divvala, Santosh ; Girshick, Ross ;
Farhadi, Ali: You only look once: Unified, real-time object de-
tection. In: Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, S. 779–788 66
[RDS+15] Russakovsky, Olga ; Deng, Jia ; Su, Hao ; Krause, Jonathan
; Satheesh, Sanjeev ; Ma, Sean ; Huang, Zhiheng ; Karpathy,
Andrej ; Khosla, Aditya ; Bernstein, Michael u. a.: Imagenet
large scale visual recognition challenge. In: International Journal
of Computer Vision 115 (2015), Nr. 3, S. 211–252 54, 62
[RHGS15] Ren, Shaoqing ; He, Kaiming ; Girshick, Ross ; Sun, Jian:
Faster r-cnn: Towards real-time object detection with region pro-
posal networks. In: Advances in neural information processing
systems, 2015, S. 91–99 66, 73
[RHW86a] Rumelhart, David E. ; Hinton, Geoffrey E. ; Williams,
Ronald J.: Learning representations by back-propagating errors.
In: Nature 323 (1986), Nr. 6088, S. 533 20
[RHW86b] Rumelhart, David E. ; Hinton, Geoffrey E. ; Williams,
Ronald J.: Learning representations by back-propagating errors.
In: Nature 323 (1986), Nr. 6088, S. 533 26
[RORF16] Rastegari, Mohammad ; Ordonez, Vicente ; Redmon, Joseph
; Farhadi, Ali: Xnor-net: Imagenet classification using binary
convolutional neural networks. In: European Conference on Com-
puter Vision Springer, 2016, S. 525–542 9, 10, 72
[Ros58] Rosenblatt, Frank: The perceptron: a probabilistic model for
information storage and organization in the brain. In: Psychologi-
cal review 65 (1958), Nr. 6, S. 386 25
[RVRK16] Richter, Stephan R. ; Vineet, Vibhav ; Roth, Stefan ;
Koltun, Vladlen: Playing for Data: Ground Truth from Com-
puter Games. In: Leibe, Bastian (Hrsg.) ; Matas, Jiri (Hrsg.)
; Sebe, Nicu (Hrsg.) ; Welling, Max (Hrsg.): European Con-
ference on Computer Vision (ECCV) Bd. 9906, Springer Interna-
tional Publishing, 2016 (LNCS), S. 102–118 xiv, 39
[RYHH10] Rashtchian, Cyrus ; Young, Peter ; Hodosh, Micah ; Hock-
enmaier, Julia: Collecting image annotations using Amazon’s
Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Work-
shop on Creating Speech and Language Data with Amazon’s Me-
chanical Turk Association for Computational Linguistics, 2010, S.
139–147 11, 222
[RYM16] Rantzsch, Hannes ; Yang, Haojin ; Meinel, Christoph: Signa-
ture Embedding: Writer Independent Offline Signature Verification
with Deep Metric Learning. In: Bebis, George (Hrsg.) ; Boyle,
Richard (Hrsg.) ; Parvin, Bahram (Hrsg.) ; Koracin, Darko
(Hrsg.) ; Porikli, Fatih (Hrsg.) ; Skaff, Sandra (Hrsg.) ; En-
tezari, Alireza (Hrsg.) ; Min, Jianyuan (Hrsg.) ; Iwai, Daisuke
(Hrsg.) ; Sadagic, Amela (Hrsg.) ; Scheidegger, Carlos (Hrsg.)
; Isenberg, Tobias (Hrsg.): Advances in Visual Computing. Cham
: Springer International Publishing, 2016. – ISBN 978–3–319–
50832–0, S. 616–625 9, 247
[RYM18] Rezaei, Mina ; Yang, Haojin ; Meinel, Christoph: Instance Tu-
mor Segmentation using Multitask Convolutional Neural Network.
In: IEEE Joint Conference on Neural Networks (IJCNN) IEEE,
2018 13
[RYM19a] Rezaei, Mina ; Yang, Haojin ; Meinel, Christoph: Conditional
Generative Adversarial Refinement Networks for Unbalanced Med-
ical Image Semantic Segmentation, 2019 (WACV ’19) 13
[RYM19b] Rezaei, Mina ; Yang, Haojin ; Meinel, Christoph: Recurrent
generative adversarial network for learning imbalanced medical im-
age semantic segmentation. In: Multimedia Tools and Applications
(2019), Feb. http://dx.doi.org/10.1007/s11042-019-7305-1.
– DOI 10.1007/s11042–019–7305–1. – ISSN 1573–7721 13, 14, 82
[SBY16] Shi, Baoguang ; Bai, Xiang ; Yao, Cong: An end-to-end train-
able neural network for image-based sequence recognition and its
application to scene text recognition. In: IEEE Transactions on
Pattern Analysis and Machine Intelligence (2016) 101
[Sch15] Schmidhuber, Jürgen: Deep learning in neural networks: An
overview. In: Neural networks 61 (2015), S. 85–117 20
[SDBR14] Springenberg, Jost T. ; Dosovitskiy, Alexey ; Brox, Thomas
; Riedmiller, Martin: Striving for simplicity: The all convolu-
tional net. In: arXiv preprint arXiv:1412.6806 (2014) 69
[SFH17] Sabour, Sara ; Frosst, Nicholas ; Hinton, Geoffrey E.: Dy-
namic routing between capsules. In: Advances in Neural Informa-
tion Processing Systems, 2017, S. 3856–3866 74
[SGZ+16] Salimans, Tim ; Goodfellow, Ian ; Zaremba, Wojciech ;
Cheung, Vicki ; Radford, Alec ; Chen, Xi: Improved tech-
niques for training gans. In: Advances in Neural Information Pro-
cessing Systems, 2016, S. 2234–2242 68
[SHK+14] Srivastava, Nitish ; Hinton, Geoffrey ; Krizhevsky, Alex ;
Sutskever, Ilya ; Salakhutdinov, Ruslan: Dropout: a simple
way to prevent neural networks from overfitting. In: The Journal
of Machine Learning Research 15 (2014), Nr. 1, S. 1929–1958 37,
42
[SHS+17] Silver, David ; Hubert, Thomas ; Schrittwieser, Julian ;
Antonoglou, Ioannis ; Lai, Matthew ; Guez, Arthur ; Lanc-
tot, Marc ; Sifre, Laurent ; Kumaran, Dharshan ; Grae-
pel, Thore u. a.: Mastering chess and shogi by self-play with
a general reinforcement learning algorithm. In: arXiv preprint
arXiv:1712.01815 (2017) 23, 68
[SHZ+18] Sandler, Mark ; Howard, Andrew ; Zhu, Menglong ; Zhmogi-
nov, Andrey ; Chen, Liang-Chieh: Inverted residuals and linear
bottlenecks: Mobile networks for classification, detection and seg-
mentation. In: arXiv preprint arXiv:1801.04381 (2018) 72
[SLJ+15] Szegedy, Christian ; Liu, Wei ; Jia, Yangqing ; Sermanet,
Pierre ; Reed, Scott ; Anguelov, Dragomir ; Erhan, Dumitru ;
Vanhoucke, Vincent ; Rabinovich, Andrew u. a.: Going
deeper with convolutions. In: CVPR, 2015 37, 50, 51
[SMDH13] Sutskever, Ilya ; Martens, James ; Dahl, George ; Hinton,
Geoffrey: On the importance of initialization and momentum in
deep learning. In: International conference on machine learning,
2013, S. 1139–1147 39
[SMH+11] Susskind, Joshua ; Mnih, Volodymyr ; Hinton, Geoffrey u. a.:
On deep generative models with applications to recognition. In:
Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on IEEE, 2011, S. 2857–2864 246
[SSS+17] Silver, David ; Schrittwieser, Julian ; Simonyan, Karen ;
Antonoglou, Ioannis ; Huang, Aja ; Guez, Arthur ; Hubert,
Thomas ; Baker, Lucas ; Lai, Matthew ; Bolton, Adrian u. a.:
Mastering the game of go without human knowledge. In: Nature
550 (2017), Nr. 7676, S. 354 vi, 5, 23, 63, 68
[SVK17] Su, Jiawei ; Vargas, Danilo V. ; Sakurai, Kouichi: One
pixel attack for fooling deep neural networks. In: arXiv preprint
arXiv:1710.08864 (2017) 70
[SVZ13] Simonyan, Karen ; Vedaldi, Andrea ; Zisserman, Andrew:
Deep inside convolutional networks: Visualising image classifica-
tion models and saliency maps. In: arXiv preprint arXiv:1312.6034
(2013) 69
[SWL+16] Shi, Baoguang ; Wang, Xinggang ; Lyu, Pengyuan ; Yao, Cong
; Bai, Xiang: Robust scene text recognition with automatic rec-
tification. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016, S. 4168–4176 101, 102
[SZ14a] Simonyan, Karen ; Zisserman, Andrew: Two-stream convolu-
tional networks for action recognition in videos. In: Advances in
neural information processing systems, 2014, S. 568–576 246
[SZ14b] Simonyan, Karen ; Zisserman, Andrew: Very deep convolutional
networks for large-scale image recognition. In: arXiv preprint
arXiv:1409.1556 (2014) 245
[SZ15] Simonyan, Karen ; Zisserman, Andrew: Very deep convolutional
networks for large-scale image recognition. In: ICLR, 2015 37, 50
[TCP+18] Tan, Mingxing ; Chen, Bo ; Pang, Ruoming ; Vasudevan, Vijay
; Le, Quoc V.: MnasNet: Platform-Aware Neural Architecture
Search for Mobile. In: arXiv preprint arXiv:1807.11626 (2018)
72, 75
[TOHC15] Tokui, Seiya ; Oono, Kenta ; Hido, Shohei ; Clayton, Justin:
Chainer: a next-generation open source framework for deep learn-
ing. In: Proceedings of workshop on machine learning systems
(LearningSys) in the twenty-ninth annual conference on neural in-
formation processing systems (NIPS) Bd. 5, 2015, S. 1–6 9
[TYRW14] Taigman, Yaniv ; Yang, Ming ; Ranzato, Marc’Aurelio ; Wolf,
Lior: Deepface: Closing the gap to human-level performance in
face verification. In: Proceedings of the IEEE conference on com-
puter vision and pattern recognition, 2014, S. 1701–1708 246
[ULVL16] Ulyanov, Dmitry ; Lebedev, Vadim ; Vedaldi, Andrea ; Lem-
pitsky, Victor S.: Texture Networks: Feed-forward Synthesis of
Textures and Stylized Images. In: ICML, 2016, S. 1349–1357 246
[VDODZ+16] Van Den Oord, Aaron ; Dieleman, Sander ; Zen, Heiga ;
Simonyan, Karen ; Vinyals, Oriol ; Graves, Alex ; Kalch-
brenner, Nal ; Senior, Andrew W. ; Kavukcuoglu, Koray:
WaveNet: A generative model for raw audio. In: SSW, 2016, S.
125 66, 246
[VLBM08] Vincent, Pascal ; Larochelle, Hugo ; Bengio, Yoshua ; Man-
zagol, Pierre-Antoine: Extracting and composing robust features
with denoising autoencoders. In: Proceedings of the 25th interna-
tional conference on Machine learning ACM, 2008, S. 1096–1103
23, 61
[VRD+15] Venugopalan, Subhashini ; Rohrbach, Marcus ; Donahue,
Jeffrey ; Mooney, Raymond ; Darrell, Trevor ; Saenko, Kate:
Sequence to sequence-video to text. In: Proceedings of the IEEE
international conference on computer vision, 2015, S. 4534–4542
63
[VTBE15] Vinyals, Oriol ; Toshev, Alexander ; Bengio, Samy ; Erhan,
Dumitru: Show and tell: A neural image caption generator. In:
Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, S. 3156–3164 228, 245
[WBB11] Wang, Kai ; Babenko, B. ; Belongie, S.: End-to-end scene
text recognition. In: 2011 International Conference on Computer
Vision, 2011, S. 1457–1464 101, 102
[WSC+16] Wu, Yonghui ; Schuster, Mike ; Chen, Zhifeng ; Le, Quoc V. ;
Norouzi, Mohammad ; Macherey, Wolfgang ; Krikun, Maxim
; Cao, Yuan ; Gao, Qin ; Macherey, Klaus u. a.: Google’s neural
machine translation system: Bridging the gap between human and
machine translation. In: arXiv preprint arXiv:1609.08144 (2016)
63, 246
[WSRS+17] Wang, Yuxuan ; Skerry-Ryan, RJ ; Stanton, Daisy ; Wu,
Yonghui ; Weiss, Ron J. ; Jaitly, Navdeep ; Yang, Zongheng ;
Xiao, Ying ; Chen, Zhifeng ; Bengio, Samy u. a.: Tacotron: A
fully end-to-end text-to-speech synthesis model. In: arXiv preprint
(2017) 245
[WYBM16] Wang, Cheng ; Yang, Haojin ; Bartz, Christian ; Meinel,
Christoph: Image Captioning with Deep Bidirectional LSTMs.
In: Proceedings of the 2016 ACM on Multimedia Conference. New
York, NY, USA : ACM, 2016 (MM ’16). – ISBN 978–1–4503–3603–
1, 988–997 vi, 7, 12, 35, 63, 73, 222, 245
[WYM15a] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: Deep se-
mantic mapping for cross-modal retrieval. In: Tools with Artificial
Intelligence (ICTAI), 2015 IEEE 27th International Conference on
IEEE, 2015, S. 234–241 11
[WYM15b] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: Visual-
Textual Late Semantic Fusion Using Deep Neural Network for
Document Categorization. In: International Conference on Neural
Information Processing Springer, 2015, S. 662–670 11
[WYM16a] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: A deep
semantic framework for multimodal representation learning. In:
Multimedia Tools and Applications 75 (2016), Aug, Nr. 15, 9255–
9276. http://dx.doi.org/10.1007/s11042-016-3380-8. – DOI
10.1007/s11042–016–3380–8 vi, 7, 11, 14, 82, 222
[WYM16b] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: Exploring
multimodal video representation for action recognition. In: Neural
Networks (IJCNN), 2016 International Joint Conference on IEEE,
2016, S. 1924–1931 11
[WYM18] Wang, Cheng ; Yang, Haojin ; Meinel, Christoph: Image Cap-
tioning with Deep Bidirectional LSTMs and Multi-Task Learning.
In: ACM Trans. Multimedia Comput. Commun. Appl. 14 (2018),
April, Nr. 2s, 40:1–40:20. http://dx.doi.org/10.1145/3115432.
– DOI 10.1145/3115432. – ISSN 1551–6857 vi, 7, 12, 14, 82, 222
[WYX+17] Wang, Bokun ; Yang, Yang ; Xu, Xing ; Hanjalic, Alan ; Shen,
Heng T.: Adversarial cross-modal retrieval. In: Proceedings of the
2017 ACM on Multimedia Conference ACM, 2017, S. 154–162 73
[WZZ+13] Wan, Li ; Zeiler, Matthew ; Zhang, Sixin ; Le Cun, Yann ;
Fergus, Rob: Regularization of neural networks using dropcon-
nect. In: International Conference on Machine Learning, 2013, S.
1058–1066 42
[XWCL15] Xu, Bing ; Wang, Naiyan ; Chen, Tianqi ; Li, Mu: Empirical
evaluation of rectified activations in convolutional network. In:
arXiv preprint arXiv:1505.00853 (2015) 37, 45
[YFBM17a] Yang, Haojin ; Fritzsche, Martin ; Bartz, Christian ; Meinel,
Christoph: Bmxnet: An open-source binary neural network imple-
mentation based on mxnet. In: Proceedings of the 2017 ACM on
Multimedia Conference ACM, 2017, S. 1209–1212 10
[YFBM17b] Yang, Haojin ; Fritzsche, Martin ; Bartz, Christian ; Meinel,
Christoph: BMXNet: An Open-Source Binary Neural Network
Implementation Based on MXNet. In: Proceedings of the 2017
ACM on Multimedia Conference. New York, NY, USA : ACM,
2017 (MM ’17). – ISBN 978–1–4503–4906–2, 1209–1212 14, 57,
58, 81
[YLHH14] Young, Peter ; Lai, Alice ; Hodosh, Micah ; Hockenmaier,
Julia: From image descriptions to visual denotations: New sim-
ilarity metrics for semantic inference over event descriptions. In:
Transactions of the Association for Computational Linguistics 2
(2014), S. 67–78 11, 222
[You18] YouTube: YouTube statistics. http://youtube.com/, 2018 vi,
3
[YWBM16] Yang, Haojin ; Wang, Cheng ; Bartz, Christian ; Meinel,
Christoph: SceneTextReg: A Real-Time Video OCR System. In:
Proceedings of the 2016 ACM on Multimedia Conference. New
York, NY, USA : ACM, 2016 (MM ’16). – ISBN 978–1–4503–
3603–1, 698–700 vi, 6, 9, 14, 38, 81, 219
[YWC+15] Yang, Haojin ; Wang, Cheng ; Che, Xiaoyin ; Luo, Sheng ;
Meinel, Christoph: An Improved System For Real-Time Scene
Text Recognition. In: Proceedings of the 5th ACM on International
Conference on Multimedia Retrieval. New York, NY, USA : ACM,
2015 (ICMR ’15). – ISBN 978–1–4503–3274–3, 657–660 9
[ZCS+17] Zhang, Quanshi ; Cao, Ruiming ; Shi, Feng ; Wu, Ying N. ;
Zhu, Song-Chun: Interpreting cnn knowledge via an explanatory
graph. In: arXiv preprint arXiv:1708.01785 (2017) 71
[ZCWZ17] Zhang, Quanshi ; Cao, Ruiming ; Wu, Ying N. ; Zhu, Song-
Chun: Mining object parts from cnns via active question-
answering. In: Proc IEEE Conf on Computer Vision and Pattern
Recognition, 2017, S. 346–355 71
[ZCY+16] Zhang, Yu ; Chen, Guoguo ; Yu, Dong ; Yao, Kaisheng ;
Khudanpur, Sanjeev ; Glass, James: Highway long short-term
memory rnns for distant speech recognition. In: Acoustics, Speech
and Signal Processing (ICASSP), 2016 IEEE International Con-
ference on IEEE, 2016, S. 5755–5759 245
[ZCZ+17] Zhang, Quanshi ; Cao, Ruiming ; Zhang, Shengming ; Ed-
monds, Mark ; Wu, Ying N. ; Zhu, Song-Chun: Interactively
transferring CNN patterns for part localization. In: arXiv preprint
arXiv:1708.01783 (2017) 71
[ZF14] Zeiler, Matthew D. ; Fergus, Rob: Visualizing and understand-
ing convolutional networks. In: European conference on computer
vision Springer, 2014, S. 818–833 69
[ZIE16] Zhang, Richard ; Isola, Phillip ; Efros, Alexei A.: Colorful
image colorization. In: European Conference on Computer Vision
Springer, 2016, S. 649–666 245
[ZKSE16] Zhu, Jun-Yan ; Krähenbühl, Philipp ; Shechtman, Eli ;
Efros, Alexei A.: Generative visual manipulation on the natu-
ral image manifold. In: European Conference on Computer Vision
Springer, 2016, S. 597–613 245
[ZL16] Zoph, Barret ; Le, Quoc V.: Neural architecture search with re-
inforcement learning. In: arXiv preprint arXiv:1611.01578 (2016)
75
[ZVSL17] Zoph, Barret ; Vasudevan, Vijay ; Shlens, Jonathon ; Le,
Quoc V.: Learning transferable architectures for scalable image
recognition. In: arXiv preprint arXiv:1707.07012 (2017) 72
[ZWN+16] Zhou, Shuchang ; Wu, Yuxin ; Ni, Zekun ; Zhou, Xinyu ; Wen,
He ; Zou, Yuheng: DoReFa-Net: Training Low Bitwidth Convo-
lutional Neural Networks with Low Bitwidth Gradients. In: arXiv
preprint arXiv:1606.06160 (2016), S. 1–14 9, 10, 72
[ZZ18] Zhang, Quanshi ; Zhu, Song-Chun: Visual interpretability for
deep learning: a survey. In: Frontiers of Information Technology
& Electronic Engineering 19 (2018), Nr. 1, S. 27–39 69
[ZZLS17] Zhang, Xiangyu ; Zhou, Xinyu ; Lin, Mengxiao ; Sun, Jian:
ShuffleNet: An Extremely Efficient Convolutional Neural Network
for Mobile Devices. In: arXiv preprint arXiv:1707.01083 (2017),
S. 1–10 58, 72
Acronyms
ACM Association for Computing Machin-
ery
AI Artificial Intelligence
AM Acoustic Model
ASR Automated Speech Recognition
BIC Bayesian Information Criterion
CC Connected Component
CMU Carnegie Mellon University
CNN Convolutional Neural Network
CPU Central Processing Unit
CV Computer Vision
DCT Discrete Cosine Transform
DL Deep Learning
DNN Deep Neural Network
eLBP edge-Based Local Binary Pattern
ERSB Energy Ratio in Subband
FTP File Transfer Protocol
GPU Graphics Processing Unit
GUI Graphical User Interface
HOG Histogram of Oriented Gradients
HPI Hasso Plattner Institute
HSV Hue Saturation Value
HTTP Hypertext Transfer Protocol
ICDAR International Conference on Docu-
ment Analysis and Recognition
IDC International Data Corporation
IEEE Institute of Electrical and Electron-
ics Engineers
IoT Internet of Things
IPv6 Internet Protocol Version 6
IR Information Retrieval
KFS Key-Frames Selection
LALM Line Against Line Matching
LBP Local Binary Pattern
LM Language Model
LOD Linked Open Data
LSTM Long Short-Term Memory
MFCC Mel Frequency Cepstral Coefficient
MSE Mean Square Error
NPR Non-Pitch Ratio
NSR Non-Silence Ratio
OCR Optical Character Recognition
OpenCV Open Source Computer Vision Li-
brary
OS Operating System
RBF Radial Basis Function
RNN Recurrent Neural Network
SEMEX SEmantic Media EXplorer
SIFT Scale-Invariant Feature Transform
SMIL Synchronized Multimedia Integra-
tion Language
SOAP Simple Object Access Protocol
SPARQL SPARQL Protocol and RDF Query
Language
SPR Smooth Pitch Ratio
Stanford POS Tagger Stanford Log-linear
Part-Of-Speech Tagger
SVM Support Vector Machine
TCP Transmission Control Protocol
TED Technology Entertainment and De-
sign
tele-TASK tele-Teaching Anywhere Solution
Kit
TFIDF Term Frequency Inverse Document
Frequency
TREC Text REtrieval Conference
UDP User Datagram Protocol
UIMA Unstructured Information Manage-
ment Architecture
VDR Volume Dynamic Range
VSM Vector Space Model
VSTD Volume Standard Deviation
WER Word Error Rate
WWW World Wide Web
ZCR Zero Crossing Rate
ZSTD Standard Deviation of Zero Crossing
Rate