KAUR-DISSERTATION-2020.pdf - Treasures @ UT Dallas
EFFICIENT COMBINATION OF NEURAL AND SYMBOLIC LEARNING
FOR RELATIONAL DATA
by
Navdeep Kaur
APPROVED BY SUPERVISORY COMMITTEE:
Sriraam Natarajan, Chair
Gopal Gupta
Nicholas Ruozzi
Gautam Kunapuli
Kristian Kersting
EFFICIENT COMBINATION OF NEURAL AND SYMBOLIC LEARNING
FOR RELATIONAL DATA
by
NAVDEEP KAUR, BTech, MTech, MS
DISSERTATION
Presented to the Faculty of
The University of Texas at Dallas
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY IN
COMPUTER SCIENCE
THE UNIVERSITY OF TEXAS AT DALLAS
December 2020
ACKNOWLEDGMENTS
I would like to extend my gratitude to my PhD advisor, Dr. Sriraam Natarajan, for having my back,
especially in this last year amid the pandemic. Although I decided to pursue a futuristic topic in my
PhD dissertation, you always ensured that I followed my research interests and that my PhD met
its successful end. I am thankful to you for your continuous support.
I offer my sincerest thanks to Dr. Gautam Kunapuli who has selflessly helped me during the entire
time he was a part of the StARLinG lab. I am indebted to you for spending hours and hours of your
valuable time teaching research to me. I grew immensely as a researcher under your mentorship.
Behind every successful woman is a father who believed in the power of her dreams. I wish to
thank my father for being the wind beneath my wings; I hope I have made you proud. Further, I
am immensely thankful to my mother for providing me the strength and motivation that I needed
to keep going during the lowest phases of my PhD through hours-long phone calls. I am thankful
to my sister for her constant love and care and my brother for being my “ATM” during the times
I went broke as a graduate student. I would also like to acknowledge my extended family: my
brother-in-law and my sister-in-law, for always being there for me; and especially my niece and my
nephew, for teaching me the meaning of love all over again.
My gratitude list would not be complete without thanking my peers at StARLinG lab, especially
Phillip Odom and Mayukh Das who helped me so much during my initial days in the lab when
I was still finding my feet. You both have taught me an important lesson of a lifetime: to come
forward and help others when they are struggling with their research. Finally, I wish to extend my
love to Nandini and Srijita for being such good friends.
I would like to acknowledge the support of AFOSR award FA9550-18-1-0462 for generously funding
my research. Any opinions, findings, and conclusions or recommendations are those of the
authors and do not necessarily reflect the views of the US government or AFOSR.
November 2020
EFFICIENT COMBINATION OF NEURAL AND SYMBOLIC LEARNING
FOR RELATIONAL DATA
Navdeep Kaur, PhD
The University of Texas at Dallas, 2020
Supervising Professor: Sriraam Natarajan, Chair
Much has been achieved in AI, but to realize its true potential, it is imperative that AI systems
be able to learn generalizable and actionable higher-level knowledge from the lowest-level
percepts. Inspired by this goal, neuro-symbolic systems have been developed for the past four
decades. These systems encompass the complementary strengths of the fast, adaptive learning of neural
networks from low-level input signals and the deliberative, generalizable models of symbolic
systems. The advent of deep networks has accelerated the development of these neuro-symbolic
systems. While successful, there are several open problems to be addressed in these systems, a
few of which we tackle in this dissertation. These include: (i) several primitive neural network
architectures have not been well studied in the symbolic context; (ii) there is a lack of generic
neuro-symbolic architectures that do not make distributional assumptions; and (iii) the generalization
abilities of many such systems are limited. The objective of this dissertation is to develop novel
neuro-symbolic models that (i) induce symbolic reasoning capabilities in fundamental yet unexplored
neural network architectures, and (ii) provide unique solutions to the generalization issues that
occur during neuro-symbolic integration.
More specifically, we consider one of the primitive models, Restricted Boltzmann Machines, which
were originally employed for pre-training deep neural networks, and propose two unique solutions
to lift them to the relational setting. For the first solution, we employ relational random walks to
generate relational features for Boltzmann machines. We train the Boltzmann machines by passing
these resulting features through a novel transformation layer. For the second solution, we employ
the mechanism of functional gradient boosting to learn the structure and the parameters of the
lifted Restricted Boltzmann Machines simultaneously. Next, most of the neuro-symbolic models
designed to date have focused on incorporating neural capabilities into specific models, resulting in
the lack of a general relational neural network architecture. To overcome this, we develop a generic
neuro-symbolic architecture that exploits the concepts of relational parameter tying and combining
rules to incorporate first-order logic rules into the hidden layers of the proposed architecture.
One of the prevalent classes of neuro-symbolic models, knowledge graph embedding models, encodes
symbols as learnable vectors in Euclidean space and, in doing so, loses an important characteristic:
generalizability to new symbols. We propose two unique solutions that circumvent this
problem by exploiting the text descriptions of entities in addition to the knowledge graph triples.
In our first model, we train both the text and knowledge graph data in a generative
setting, while in the second model, we place the two data sources in an adversarial setting. Our
broad results across these several directions demonstrate the efficacy and efficiency of the proposed
approaches on benchmark and novel data sets.
In summary, this dissertation takes one of the first steps towards realizing the grand vision of
neuro-symbolic integration by proposing novel models that allow for symbolic reasoning capabilities
inside neural networks.
TABLE OF CONTENTS
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Neuro-Symbolic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Aim of the dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Dissertation Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
CHAPTER 2 TECHNICAL BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Relational Random Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Functional Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Generative Knowledge Graph Embeddings . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
PART I NEURAL STATISTICAL RELATIONAL LEARNING MODELS . . . . . . . . . 20
CHAPTER 3 RELATIONAL RESTRICTED BOLTZMANN MACHINES . . . . . . . . . 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Statistical Relational Learning Models . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Structure Learning Approaches . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Propositionalization Approaches . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Why study Relational Boltzmann Machines? . . . . . . . . . . . . . . . . . . . . 25
3.4 Relational Restricted Boltzmann Machines: The Proposed Approach . . . . . . . . 26
3.4.1 Step 1: Relational data representation . . . . . . . . . . . . . . . . . . . . 27
3.4.2 Step 2: Relational transformation layer . . . . . . . . . . . . . . . . . . . 28
3.4.3 Step 3: Learning Relational RBMs . . . . . . . . . . . . . . . . . . . . . . 30
3.4.4 Relation to Statistical Relational Learning Models . . . . . . . . . . . . . 31
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
CHAPTER 4 BOOSTING RELATIONAL RESTRICTED BOLTZMANN MACHINES . . 42
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Relational Functional Gradient Boosting based models . . . . . . . . . . . 44
4.2.2 Neuro-Symbolic models . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Boosting of Lifted RBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Functional Gradient Boosting of Lifted RBMs . . . . . . . . . . . . . . . 50
4.3.2 Representation of Functional Gradients for LRBMs . . . . . . . . . . . . . 53
4.3.3 Learning Relational Regression Trees . . . . . . . . . . . . . . . . . . . . 54
4.3.4 LRBM-Boost Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Experimental Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.2 Comparison of LRBM-Boost to other neuro-symbolic models . . . . . . . 57
4.4.3 Comparison of LRBM-Boost to other relational gradient-boosting models 59
4.4.4 Effectiveness of boosting relational ensembles . . . . . . . . . . . . . . . 60
4.4.5 Interpretability of LRBM-Boost . . . . . . . . . . . . . . . . . . . . . . 61
4.4.6 Inference in a Lifted RBM . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
CHAPTER 5 NEURAL NETWORKS WITH RELATIONAL PARAMETER TYING . . . 68
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.1 Lifted Relational Neural Networks . . . . . . . . . . . . . . . . . . . . . . 70
5.2.2 Relational Random Walks . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.3 Tensor Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.4 Other Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3 Neural Networks with Relational Parameter Tying: The proposed approach . . . . 72
5.3.1 Generating Lifted Random Walks . . . . . . . . . . . . . . . . . . . . . . 74
5.3.2 Network Instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4.2 Baselines and Experimental Details . . . . . . . . . . . . . . . . . . . . . 82
5.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5 Relation with Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . 88
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
PART II KNOWLEDGE GRAPH EMBEDDING MODELS . . . . . . . . . . . . . . . . 89
CHAPTER 6 TOPIC AUGMENTED KNOWLEDGE GRAPH EMBEDDINGS . . . . . . 90
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2.1 Knowledge graph embeddings models . . . . . . . . . . . . . . . . . . . . 94
6.2.2 Text-aware Knowledge graph embeddings models . . . . . . . . . . . . . 95
6.2.3 Gaussian Embeddings in Knowledge graphs . . . . . . . . . . . . . . . . . 97
6.2.4 LDA based models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Topic Augmented Knowledge Graph Embeddings: the proposed TAKE approach . 99
6.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.3.2 Learning the model parameters . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3.3 TAKE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.4.1 Knowledge Graph Completion . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4.2 Entity Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.4.3 Interpretability of the proposed model . . . . . . . . . . . . . . . . . . . . 119
6.4.4 Effect on sparsely occurring entities . . . . . . . . . . . . . . . . . . . . . 120
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
CHAPTER 7 TEXT AUGMENTED ADVERSARIAL KNOWLEDGE GRAPH EMBEDDINGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.3 Adversarial Approach to learning KB embedding model . . . . . . . . . . . . . . 126
7.3.1 The Generator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.3.2 The Discriminator Design . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.4 The proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
CHAPTER 8 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8.1.1 Knowledge Graph Alignment . . . . . . . . . . . . . . . . . . . . . . . . 134
8.2 Closing Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
CURRICULUM VITAE
LIST OF FIGURES
2.1 Discriminative Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 11
2.2 Relational random walks on a variablized relational graph. The background file contains the schema of the dataset, which is represented as a graph. After performing constrained random walks on it, we convert each random walk into a first-order logic clause. We use −1 to denote the inverse of a relation, which is considered a unique relation in itself. . . . . . . . . . . 12
2.3 Functional Gradient Boosting, where the loss function is mean squared error. . . . . . 14
2.4 The generative process of triples T in a given knowledge graph K = {E, R, T}. The embeddings h and t are generated by the zero-mean spherical Gaussian prior N(0, λ_e^{-1} I), the relation r is generated by the zero-mean spherical Gaussian prior N(0, λ_r^{-1} I), and the triple (h, r, t) is generated with probability 0.5 ∗ (softmax1(score(h, r, t)) ∗ softmax2(score(h, r, t))). . . . . . . . . . . 17
3.1 Lifted random walks are converted into feature vectors by explicitly grounding every random walk for every training example. Nodes and edges of the graph in (a) represent types and predicates, and an underscore (_Pr) represents the inverted predicates. The random walk counts (b) are then used as feature values for learning a discriminative RBM (DRBM). An example of a random walk represented as a clause is shown in (c). . . . . . . . . . . 28
3.2 Weights learned by Alchemy and RRBMs for a clause vs. size of the domain. . . . . . 33
3.3 The number of RRBM features grows exponentially with the maximum path length of random walks. We set λ = 6 to balance tractability with performance. . . . . . . . . . . 36
3.4 (Q1): Results show that RRBMs generally outperform baseline MLN and decision-tree (Tree-C) models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 (Q2) Results show better or comparable performance of RRBM-C and RRBM-CE to MLN-Boost, which all use counts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 (Q2) Results show better or comparable performance of RRBM-E and RRBM-CE to RDN-Boost, which all use existentials. . . . . . . . . . . 40
3.7 (Q4) Results show better or comparable performance of our random-walk-based feature generation approach (RRBM) compared to propositionalization (BCP-RBM). . . . . . . . . . . 40
4.1 An example of a lifted RBM. The atomic predicates each have a corresponding node in the visible layer (fi). Atomic predicates can be used to create richer features as conjunctions, which are represented as hidden nodes (hj); the connections between the visible and hidden layers are sparse and only exist when the predicate corresponding to fi appears in the compound feature hj. The output layer is a one-hot vectorization of a multi-class label y, and has one node for each class yk. The connections between the hidden and output layers are dense and allow all features to contribute to reasoning over all the classes. . . . . . . . . . . 48
4.2 Weights in a lifted RBM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 A general relational regression tree for lifted RBMs when learning a target predicate t(x). Each path from root to leaf is a compound feature (also a logical clause Clause_r) that enters the RBM as a hidden node h_r. The leaf node contains the weights θ_r = {d_r, c_r, W_r, U_r0, U_r1} of all edges introduced into the lifted RBM when this hidden node/discovered feature is introduced into the RBM structure. . . . . . . . . . . 53
4.4 Comparing LRNN, RRBM-C, MLN-Boost and LRBM-Boost on AUC-ROC. . . . . . . . . . 58
4.5 Comparing LRNN, RRBM-C, MLN-Boost and LRBM-Boost on AUC-PR. . . . . . . . . . . 59
4.6 An example of a combined lifted tree learned from an ensemble of trees. To construct this tree, we compute the regression value of each training example by traversing through all the boosted trees. A single large tree is overfit to this (modified) training set to generate a single tree. . . . . . . . . . . 62
4.7 Lifted RBM obtained from the combined tree in Figure 4.6. Each path along the tree in that figure represents the corresponding hidden node of the LRBM. . . . . . . . . . . 62
4.8 Ensemble of trees learned during training of LRBM-Boost. The ensemble of trees is generated in the SPORTS domain, where predicates P, T, Z represent plays(sports, team), teamplaysagainstteam(team, team) and athleteplaysforteam(athlete, team) respectively, and target R represents teamplayssport(team, sports). . . . . . . . . . . 63
4.9 Demonstration of the conversion of the two lifted trees in Figure 4.8 to an LRBM. We create one hidden node for each path in each regression tree. . . . . . . . . . . 63
4.10 LRBM inference for Example 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.1 The relational neural network is unrolled in three stages, ensuring that the output is a function of facts through two hidden layers: the combining rules layer (with lifted random walks) and the grounding layer (with instantiated random walks). Weights are tied between the input and grounding layers based on which fact/feature ultimately contributes to which rule in the combining rules layer. . . . . . . . . . . 76
5.2 Example: unrolling the network with relational parameter tying. . . . . . . . . . . . . 79
6.1 An example of entity descriptions in Freebase . . . . . . . . . . . . . . . . . . . . . . 91
6.2 The proposed TAKE approach. Both the entities h and t in the triple (h, r, t) are drawn from the distribution N(θ, λ_e^{-1}) and the relation r is drawn from N(0, λ_r^{-1}), whereas the probability of the triple (h, r, t) being true, P(y_{h,r,t} = 1), is drawn from Equation 6.14, where P(1) and P(0) refer to the true part and the three false terms in the equation respectively. . . . . . . . . . . 99
6.3 Interpretability in Knowledge Graph embeddings on the FB15K dataset. We randomly pick 10 entities from the dataset, represent each entity as a mixture of its top two topics, and further pick the two most probable words in each topic. . . . . . . . . . . 119
6.4 Table displays the top two topics learnt along each of the first 10 dimensions of a 100-dimensional FB15K entity. . . . . . . . . . . 120
6.5 The effect of the proposed model on sparsely occurring entities' embeddings. The Y-axis plots the average offset = (e − θ)ᵀ(e − θ) value of each embedding, while the X-axis plots the number of times an embedding occurs in the KG. . . . . . . . . . . 121
8.1 A Finite State Transducer. The operation a : b represents that the finite state transducer reads an input character a ∈ x and outputs a character b ∈ y. . . . . . . . . . . 137
8.2 Knowledge graph alignment by string-edit distance in embedding space. . . . . . . . . 139
LIST OF TABLES
4.1 Comparison of LRBM-Boost and RDN-Boost. . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Comparison of (a) an ensemble of trees learned by LRBM-Boost, (b) an explainable Lifted RBM constructed from the ensemble of trees learned by LRBM-Boost, and (c) learning a single, large, relational probability tree (LRBM-NoBoost). . . . . . . . . . . 61
5.1 Data sets used in our experiments to answer Q1–Q3. The last column shows the number of sampled groundings of random walks per example for NNRPT. . . . . . . . . . . 81
5.2 Comparison of different learning algorithms based on AUC-ROC and AUC-PR. NNRPT is comparable or better than standard SRL methods across all data sets. . . . . . . . . . . 84
5.3 Comparison of NNRPT with propositionalization-based approaches. NNRPT is significantly better on a majority of data sets. . . . . . . . . . . 85
5.4 Comparison of NNRPT and LRNN on AUC-ROC and AUC-PR on different data sets. Both models were provided expert hand-crafted rules from Sourek et al. (Sourek et al., 2018). NNRPT is capable of employing rules to improve performance in some data sets. . . . . . . . . . . 86
5.5 Comparison of LRNN and NNRPT using relational random walk features. Across all the domains, NNRPT could better exploit the power of relational random walks. . . . . . . . . . . 86
5.6 Comparison of NNRPT and LRNN on AUC-ROC and AUC-PR on different data sets. Both models were provided clauses learnt by PROGOL (Muggleton, 1995). NNRPT is capable of employing rules to improve performance in some data sets. . . . . . . . . . . 87
6.1 Data sets used in our experiments on TAKE model (Xie et al., 2016) . . . . . . . . . . 115
6.2 Mean Rank and Hits@10 (entity prediction) for models tested on FB15K dataset . . . 116
6.3 Mean Rank and Hits@1 (relation prediction) for models tested on FB15K dataset . . . 117
6.4 The MAP Results for entity classification in FB15K and FB20K datasets . . . . . . . . 118
CHAPTER 1
INTRODUCTION
Developing AI agents that can mimic the human cognitive system has been a long-cherished goal
of AI. To realize this goal, such an agent must act in the presence of real-world data, which
is inherently relational as it captures the interactions between objects in a domain through specific
relations. This necessitates learning methods that can faithfully learn from relational
data without first representing it as fixed-length feature vectors, as is typically
required by standard machine learning models (Cristianini and Shawe-Taylor, 2000; Quinlan, 1993).
Fueled by this, the field of Inductive Logic Programming (ILP, (Lavrac and Dzeroski, 1993)) was
born. Given background knowledge about a domain and positive and negative
examples of the task to be learned, ILP models learn a set of first-order logic rules that entail
all the positive examples and none of the negative examples. One of the major strengths of
ILP models is that they possess symbolic reasoning capabilities: their representation employs
first-order logical rules that can perform deductive reasoning to answer queries.
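As a concrete sketch of this coverage criterion, the check that a candidate clause entails all the positive examples and none of the negative examples can be written in a few lines. The parent/grandparent predicates and the constants below are hypothetical illustrations, not drawn from any dataset used in this dissertation.

```python
# Background facts, as a set of ground atoms (predicate, arg1, arg2).
facts = {("parent", "ann", "bob"), ("parent", "bob", "carl")}
entities = {e for (_, a, b) in facts for e in (a, b)}

def grandparent(x, z):
    # Candidate clause: grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    return any(("parent", x, y) in facts and ("parent", y, z) in facts
               for y in entities)

positives = [("ann", "carl")]
negatives = [("bob", "ann")]

def consistent(hypothesis, pos, neg):
    # ILP acceptance criterion: entail every positive example, no negative one.
    return all(hypothesis(*p) for p in pos) and not any(hypothesis(*n) for n in neg)

print(consistent(grandparent, positives, negatives))  # True
```

An ILP learner searches a space of such candidate clauses and retains only hypotheses for which this check succeeds.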
Though compelling, one of the major drawbacks of ILP models is their inability to deal with
the noise and uncertainty intrinsic to relational data. To overcome this limitation, the field of Statistical
Relational Learning (SRL, (Getoor and Taskar, 2007; De Raedt et al., 2016)) emerged as a powerful
machine learning paradigm that can exploit the rich structure present among objects while
handling the uncertainty in the data. In these models, the complex interactions between objects
are typically modeled by first-order logic clauses, while the uncertainty in them is quantified by
annotating these clauses with either probability distributions (Raedt et al., 2007; Natarajan et al.,
2012) or weights (Richardson and Domingos, 2006; Khot et al., 2011; Ramanan et al., 2018).
These models range from directed models (Getoor et al., 2001; Jaeger, 1997; Kersting and Raedt,
2007) to bi-directed models (Richardson and Domingos, 2006; Taskar, 2002) and sampling-based
approaches (Kameya and Sato, 2011; Poole, 1993). Though expressive, scalability has been a
major issue in full model learning, i.e., when both the rules and the parameters are learned from data.
Over the past decade, deep learning (Goodfellow et al., 2016; Bengio, 2009) has deservedly
attracted significant attention in major research fields such as speech recognition (Hinton et al.,
2012), computer vision (Krizhevsky et al., 2012; He et al., 2016), natural language processing
(Sutskever et al., 2014) and reinforcement learning (Silver et al., 2017; Mnih et al., 2015). The
success of deep learning can be attributed to multiple factors: the automated feature engineering
performed by the hidden layers of a given model, where each successive layer learns combinations
of the features of its preceding layer, resulting in improved performance; the accessibility
of the large datasets that are needed to train deep models as function approximators; and finally
the availability of advanced hardware architectures like GPUs and HPCs that can parallelize
execution and thus facilitate the training of deep models.
We now briefly compare the strengths and the weaknesses of neural/deep models and symbolic
models (ILP/SRL models) along six dimensions. First, the majority of deep models
proposed to date operate at the signal level, where their input is in the form of pixels, speech
signals, or text characters, whereas ILP/SRL models function at the symbolic level, where the models
succinctly represent probabilistic dependencies among the attributes of different related objects.
Second, while it is hard to understand the meaning of the parameters in the hidden layers of deep
models, the first-order logic rules learnt by ILP models are interpretable. For instance, the first-order
clause bornInCity(A, B) ∧ cityInCountry(B, C) ⇒ bornInCountry(A, C) can be interpreted as:
every person A who is born in a city B, which is further located in a country C, is born in country C.
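The deductive use of such a clause can be made concrete with a single step of forward chaining; the person and city names below are hypothetical facts introduced only for illustration.

```python
# Hypothetical ground facts as (predicate, arg1, arg2) triples.
facts = {("bornInCity", "alice", "paris"),
         ("cityInCountry", "paris", "france")}

# One forward-chaining step of the clause
#   bornInCity(A, B) ∧ cityInCountry(B, C) ⇒ bornInCountry(A, C):
# join the two body atoms on the shared variable B, then emit the head.
derived = {("bornInCountry", a, c)
           for (p1, a, b) in facts if p1 == "bornInCity"
           for (p2, b2, c) in facts if p2 == "cityInCountry" and b2 == b}

print(derived)  # {('bornInCountry', 'alice', 'france')}
```

Because the clause is lifted (universally quantified), the same code derives a conclusion for any entities that satisfy its body, without being rewritten per individual.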
Third, deep learning models are data-hungry and require millions of training examples to train
effectively. The major reason for this is that deep models generally have a large number of parameters
in their hidden layers, which require a sufficient number of examples in order to move them
into regions corresponding to the optimal solutions. On the other hand, symbolic methods can
effectively leverage domain knowledge both as search bias and as inductive bias. This can potentially
let them learn with fewer examples compared to deep models (Evans and Grefenstette,
2018). Fourth, as a flip side of the above feature, neural network models are scalable and can train
with massive datasets without difficulty. The scalability to train on domains with large data has
been a major bottleneck of ILP/SRL systems. The structure learning methods employed by these
models learn locally by greedily adding one literal at a time to the partially built clause, inferring
the coverage of this new clause, and finally selecting the literal with the best coverage to be added
to the clause. This process of performing inference in the inner loop of learning slows down the
learning of probabilistic graphical models.
Fifth, the performance of neural networks deteriorates when the test data is significantly larger
than the training data (Evans and Grefenstette, 2018), while the efficiency of symbolic models is
unaffected by the size of the test data because they learn lifted clauses, which endow them with the
generalization ability to reason over any number of new objects introduced at test time. Finally,
deep models are efficient. This claim is bolstered by the fact that they have yielded state-of-the-art
results in all the major application domains, including (a) recommender systems (Zhang et al., 2019),
(b) question-answering systems (Xiong et al., 2017), and (c) games (Mnih et al., 2015),
whereas symbolic models have yet to prove themselves at the same level of success.
1.1 Neuro-Symbolic Systems
Both symbolic and neural network models have complementary strengths and weaknesses.
Consequently, it is natural to design systems that bridge the gap between symbolic and neural
models such that the resulting models have the best of both worlds. This field of integrating
symbolic reasoning and neural network models is called neuro-symbolic systems (Raedt et al.,
2020; Garcez et al., 2002) and is the theme of this dissertation. Neuro-symbolic integration has
been a longstanding goal of AI, where an ideal model would operate analogously to the human cognitive
system. One important goal of such neuro-symbolic systems is that the neural network component
functions at the perception level, analogous to human eyes viewing a scene,
while the symbolic component acts analogously to the human mind/cognition, performing higher-level
logical reasoning in order to explain the viewed scene (Besold et al., 2017).
While successful, primitive deep models were limited in their application to relational data.
This led to significant growth in neuro-symbolic models specifically designed for relational data.
Neuro-symbolic models proposed in the past decade can be divided into two major sub-categories:
• The first set of models brings symbols into a form (i.e., flat feature vectors) that is readily
acceptable to neural networks. The key idea here is that the objects and relations present in
the given relational data are represented as learnable vectors (called knowledge graph embeddings,
or simply embeddings) in a k-dimensional Euclidean space (Bordes et al., 2013; Lin et al.,
2015; Ma et al., 2017; Trouillon et al., 2016; Yang et al., 2015). The plausibility of a relation
between objects is expressed as a scoring function, which is obtained by different types of
algebraic operations among the relations and the objects. The major appeal of this sub-field is
scalability, as one can learn embeddings over the millions or billions of facts present in a given
knowledge graph.
• Very recently, another set of models has taken existing ILP/SRL models and brought neural
networks into them by introducing differentiable counterparts of the symbolic operations in
classical logic models. Unlike knowledge graph embeddings, these models operate more at
the symbolic level. For instance, DeepProbLog (Manhaeve et al., 2018) learns the probability
distribution of a predicate by employing a neural network while leaving the rest of the standard
ProbLog model (Raedt et al., 2007) unchanged. Similarly, Neural Markov Logic Networks
(Marra and Kuzelka, 2019) learn the potential function of a standard MLN (Richardson and
Domingos, 2006) by utilizing neural networks. RelNN (Kazemi and Poole, 2018) stacks
multiple layers of the standard RLR (Kazemi et al., 2014) model in order to learn latent
properties of the target object.
• Another model in this category is the Neural Theorem Prover (NTP) (Rocktaschel and Riedel,
2017), which performs inference over first-order logic clauses by the standard backward-chaining
procedure, except that soft unification between the goal and the head of a given clause is
performed in embedding space. In order to make ILP models robust to noise and uncertainty,
∂ILP (Evans and Grefenstette, 2018) recently proposed a differentiable version of the ILP model
that can deduce a fact by performing forward chaining on definite clauses. We call this
subset of neuro-symbolic models neural SRL models in this dissertation.
1.2 Aim of the Dissertation
Motivated by the successes of neuro-symbolic integration, we aim to develop novel models that
complement existing research by lifting relatively unexplored neural models, by designing
a generic neuro-symbolic architecture, and by proposing solutions to problems that have
emerged as a side effect of introducing neural networks into symbolic models.
This thesis spans the two sub-fields of neuro-symbolic systems discussed in the
previous section. The first half of the thesis focuses on proposing novel neural SRL models. In our
proposed models, instead of using a neural network as a differentiable component inside an existing
standard SRL model, as done in DeepProbLog (Manhaeve et al., 2018), Neural MLN (Marra and
Kuzelka, 2019) or NTP (Rocktaschel and Riedel, 2017), we take the inverse approach. We build
upon an existing neural network model, namely Restricted Boltzmann Machines (Rumelhart and
McClelland, 1987; Larochelle and Bengio, 2008), and propose two novel models, RRBM and
LRBM-Boost, that instill relational capabilities into the model through first-order logic rules. The
motivation to study Boltzmann machines in a relational context arises from the fact that they were
employed as a pre-training model in each layer of one of the primitive deep models: Deep Belief
Networks (Bengio et al., 2006; Hinton and Osindero, 2006). Lifting a model that exists at one layer
of deep architectures may eventually lead us towards the final goal of designing stacked
architectures inside neuro-symbolic systems.
Further, we propose a general neuro-symbolic architecture, which we call NNRPT, that is
inspired by two concepts in standard SRL. First, all the instances of a given logical rule share
the same parameters, a concept known as parameter tying. Second, a logical variable can have a varying
number of parents (known as the multiple-parents problem) in its ground network (Natarajan et al.,
2008). Such models can be described by independently considering the probability of each logical
variable conditioned on each parent variable; these conditional probabilities can then be combined
by combining rules. We propose a neuro-symbolic model that exploits relational parameter tying
and combining rules to incorporate first-order logic rules into the hidden layers of the proposed
architecture. The parameters of the model are trained by backpropagation. As
shown in our experimental evaluations, the three neural SRL models proposed in this dissertation
are efficient, generalize to new objects encountered at test time, and can perform complex
reasoning inside the neural architecture.
The second half of the dissertation concentrates on the sub-field of knowledge graph embedding
models and proposes two novel solutions to one of the fundamental problems faced by embedding
models: the generalizability of embeddings. Most knowledge graph embedding models learn
over ground atoms, making them unsuitable for reasoning over new objects encountered at
test time. We propose two unique solutions to the problem of generalizability in knowledge
graph embeddings: TAKE and TAAKE. In both models, we utilize the supplementary
text information available alongside the knowledge graphs (KG). In the TAKE model, we exploit topic
modeling to extract hidden topic information about entities from the text, which serves as embeddings
for new entities encountered in the knowledge graphs. We also utilize the interpretability
of topic models to assign a human-readable topic to each dimension of a given embedding.
Conversely, the TAAKE model employs the text information and the knowledge graph data as two
adversaries, such that the sub-modules handling the two types of data satisfy opposing
constraints. The goal of the text-based sub-module is to bring the text and the KG
embeddings closer to each other, whereas the goal of the KG-based sub-module is to drive the KG
embeddings away from the text embeddings. This competition generates high-quality embeddings.
Collectively, all the models proposed in this thesis are our effort towards a tighter integration of
the symbolic and the deep models in order to harness the strengths of both, resulting
in neuro-symbolic models that are effective, scalable, and capable of complex reasoning.
1.3 Dissertation Statement
This dissertation aims at developing novel neuro-symbolic models that lift neural networks
to relational domains in order to induce symbolic reasoning capabilities in them, and further
solves the specific problems that are encountered during neuro-symbolic integration.
1.4 Dissertation Contributions
I. Proposing novel architectures in the neural SRL sub-field, where the goals are to:
(i) lift a primitive neural network, Restricted Boltzmann Machines, to relational domains.
(ii) propose a neuro-symbolic model that does not make any distributional assumptions.
(iii) retain symbolic reasoning capability in the proposed neural architectures.
(iv) take a first step towards structure learning of neuro-symbolic systems.
II. Solving the problems encountered in the knowledge graph embedding sub-field, including:
(i) proposing efficient solutions to the generalizability issue encountered in embeddings.
(ii) endowing the embeddings with interpretability along each dimension.
1.5 Dissertation Outline
As discussed previously, this dissertation is divided into two high-level parts. Part I outlines
our approaches in the neural SRL sub-field. Part II describes our approaches to unresolved
challenges in the knowledge graph embedding sub-field.
Chapter 2 presents the necessary technical background, which lays the foundation for
understanding all the models proposed in this dissertation. We first introduce Restricted Boltzmann
Machines. This is followed by an introduction to relational random walks and the concept of
functional gradient boosting, the two mechanisms utilized to lift Restricted Boltzmann Machines.
Next, we introduce the concepts of generative knowledge graph embeddings, latent Dirichlet
allocation, and generative adversarial networks: the three concepts that lay the groundwork for our
two unique solutions to generalizability in knowledge graph embeddings.
Part I
Chapter 3 details our first proposed approach, RRBM, for learning Boltzmann machine
classifiers from relational data. We use lifted random walks to generate features for predicates,
which are then used to construct the observed features of the RBM in a manner similar to Markov Logic
Networks. We empirically evaluate our proposed model on six relational domains to show that it
is comparable to or better than state-of-the-art probabilistic relational learning methods.
Chapter 4 presents our second approach to lifting Boltzmann machines, which employs gradient
boosting to learn the structure and the parameters of Relational Restricted Boltzmann
Machines simultaneously (LRBM-Boost). Here, we learn a set of weak relational regression trees
whose paths from root to leaf represent the model structure and whose leaves represent the
model parameters. These trees are compiled into lifted Restricted Boltzmann Machines, where the
paths along a tree form the hidden layers of the resultant model and the leaves of the trees represent
its connections, resulting in an explainable model.
Chapter 5 proposes a generic neural network architecture, NNRPT, for relational data. We learn
relational random-walk-based features to capture local structural interactions in the relational data.
These relational features form a template network architecture shared across all the examples, which is
unrolled for each example by exploiting parameter tying of the network weights, where
instances of the same example share parameters.
Part II
Chapter 6 develops a novel solution to the issue of generalizability of knowledge graph embeddings
and proposes a model, TAKE, that exploits two sources of data, knowledge graph triples
and the text descriptions of entities, and considers the generative modeling of both sources to
learn the knowledge graph embeddings. The topics learnt from the text act as substitutes for
embeddings when new data is encountered at test time. As another contribution, we employ text
topics to interpret the significance of each embedding dimension.
Chapter 7 posits a first-of-its-kind solution (TAAKE) to the generalizability of embeddings by
positioning the text and knowledge graph data in an adversarial setting. The two sources of data form
two independent sub-modules competing against each other. The text-based module aims at driving
the text embeddings and the knowledge graph embeddings of entities closer to each other, while
the knowledge-graph-based module intends to drive the text-based embeddings away from the
knowledge graph embeddings. We hypothesize that the competition to stay ahead of the other
module results in superior embeddings.
Chapter 8 presents our concluding remarks and introduces open problems (Kaur et al., 2020a)
that remain to be addressed in order to achieve tighter neuro-symbolic integration.
CHAPTER 2
TECHNICAL BACKGROUND
In this chapter, we present the necessary technical background for the dissertation. We begin by
introducing Restricted Boltzmann Machines in Section 2.1. Next, we outline the two mechanisms
that we employ to lift them to relational domains: we introduce relational
random walks in Section 2.2 and relational functional gradient boosting in Section 2.3. Then, we
introduce generative knowledge graph embeddings in Section 2.4 and the LDA model in Section 2.5,
which is exploited to learn generative knowledge graph embeddings in the presence of text in
Chapter 6. Finally, we introduce generative adversarial networks (GANs) in Section 2.6, the
foundation upon which our proposed model for learning from multi-modal data in an adversarial
setting is built in Chapter 7.
2.1 Restricted Boltzmann Machines
In Chapters 3 and 4, we introduce two novel neuro-symbolic models that combine the rich structural
information present in relational data with a specific connectionist model, namely Restricted
Boltzmann Machines (RBMs, (Rumelhart and McClelland, 1987)). We introduce them here.
Restricted Boltzmann Machines are stochastic neural networks that consist of two layers: a layer
of visible units and a layer of hidden units. The restriction imposed on the model is that
nodes within the same layer are not connected; they only interact with the nodes in the other
layer. Although RBMs were originally proposed as generative models, we consider the Discriminative
Restricted Boltzmann Machines of Larochelle and Bengio (2008) in this dissertation, since
many relational tasks, such as entity resolution and link prediction, are naturally
discriminative. Mathematically, RBMs use a Bernoulli input layer (visible layer, x), a Bernoulli
hidden layer (h) and a softmax output layer y. The joint configuration (y, x, h) of the model has
the following energy function:
E(y, x, h) = −hᵀWx − bᵀx − cᵀh − dᵀy − hᵀUy,   (2.1)
Figure 2.1: Discriminative Restricted Boltzmann Machine
where W are the weights connecting the visible and hidden layers, U are the weights connecting the
hidden and output layers, and b, c, d are, respectively, the biases of the visible, hidden and output
layers of the model. In a multi-class setting with C classes to be predicted, y_l = (1_{i=l})_{i=1}^{C}
represents the one-hot vectorization of the target class l. The joint probability of the RBM is defined
as p(y, x, h) = (1/Z) e^{−E(y,x,h)}, where Z is the normalization constant Z = ∑_{y,x,h} e^{−E(y,x,h)}.
Though computing p(y, x, h) is intractable, the conditional version p(y|x) can be computed exactly
as follows:

p(y = l | x) = exp(d_l + ∑_j ζ(c_j + U_{jl} + ∑_k W_{jk} x_k)) / ∑_{l* ∈ {1,2,...,C}} exp(d_{l*} + ∑_j ζ(c_j + U_{jl*} + ∑_k W_{jk} x_k)),   (2.2)
where ζ(a) = log(1 + e^a) is the softplus function.
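To make Equations (2.1) and (2.2) concrete, the following sketch evaluates the energy function and the exact class posterior p(y | x) of a small discriminative RBM. All dimensions and parameter values here are illustrative assumptions, not taken from any experiment in this dissertation.

```python
import numpy as np

def softplus(a):
    # zeta(a) = log(1 + e^a), computed stably
    return np.logaddexp(0.0, a)

def energy(y, x, h, W, U, b, c, d):
    """E(y, x, h) = -h'Wx - b'x - c'h - d'y - h'Uy (Eq. 2.1)."""
    return -(h @ W @ x + b @ x + c @ h + d @ y + h @ U @ y)

def class_posterior(x, W, U, c, d):
    """Exact p(y | x) of a discriminative RBM (Eq. 2.2)."""
    # Unnormalized log-probability of class l:
    #   d_l + sum_j softplus(c_j + U_jl + sum_k W_jk x_k)
    pre = c + W @ x                                       # (n_hidden,)
    logits = d + softplus(pre[:, None] + U).sum(axis=0)   # (n_classes,)
    logits -= logits.max()                                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Toy model: 4 visible units, 3 hidden units, 2 classes.
rng = np.random.default_rng(0)
W, U = rng.normal(size=(3, 4)), rng.normal(size=(3, 2))
b, c, d = rng.normal(size=4), rng.normal(size=3), rng.normal(size=2)
probs = class_posterior(np.array([1.0, 0.0, 1.0, 1.0]), W, U, c, d)
```

Marginalizing the hidden units analytically is what turns the intractable joint p(y, x, h) into this closed-form conditional: each hidden unit contributes one softplus term.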
Next, we explain relational random walks in detail. We leverage them for structure learning in the
Relational Restricted Boltzmann Machines (RRBM) of Chapter 3 and the Neural Network with
Relational Parameter Tying (NNRPT) model of Chapter 5.
Figure 2.2: Relational random walks on a variablized relational graph. The background file contains the schema of the dataset, which is represented as a graph. After performing constrained random walks on it, we convert each random walk into a first-order logic clause. We use −1 to denote the inverse of a relation, which is considered a unique relation in itself.
2.2 Relational Random Walks
We assume a graphical representation of the schema of the relational data, where nodes represent
object types or variables (e.g., person, venue or course) and an edge represents a relation between
two object types (see Figure 2.2). A relational random walk on such a lifted graph comprises
randomly following a path along a sequence of edges of the graph (Lao and Cohen, 2010):

Type_0 −Relation_1→ Type_1 −Relation_2→ Type_2 · · · −Relation_t→ Type_t
In this dissertation, we constrain our random walks in two ways: (i) we set the maximum length
of a random walk to a predefined parameter t, and (ii) we constrain the end of each random walk
to coincide with the object types of the target Target(Type_0, Type_t) under consideration. Consequently,
we obtain Horn clauses by representing each random walk as the body of a clause and the target
under consideration as the head of the clause. For instance, the walk Type_0 −Relation_1→ Type_1 −Relation_2→ Type_2 −Relation_3→ Type_3 will be converted into the clause:

Relation_1(Type_0, Type_1) ∧ Relation_2(Type_1, Type_2) ∧ Relation_3(Type_2, Type_3) ⇒ Target(Type_0, Type_3)
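The conversion from a typed walk to a Horn clause can be sketched as follows; the relation, type and target names here are hypothetical illustrations, not drawn from any dataset used in this dissertation.

```python
def walk_to_clause(walk, target):
    """Convert a typed random walk into a Horn clause string.

    walk: list of (relation, src_type, dst_type) edges forming a chain
          Type_0 -Relation_1-> Type_1 -> ... -> Type_t
    target: name of the target predicate Target(Type_0, Type_t)
    """
    body = [f"{rel}({src}, {dst})" for rel, src, dst in walk]
    head = f"{target}({walk[0][1]}, {walk[-1][2]})"
    return " ^ ".join(body) + " => " + head

# Hypothetical walk: Person -AuthorOf-> Paper -PublishedIn-> Venue
clause = walk_to_clause(
    [("AuthorOf", "Person", "Paper"), ("PublishedIn", "Paper", "Venue")],
    "PublishesIn")
# -> "AuthorOf(Person, Paper) ^ PublishedIn(Paper, Venue) => PublishesIn(Person, Venue)"
```

The head's arguments are simply the first and last node types of the walk, which is exactly the constraint (ii) above: the walk must end at the target's argument types.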
The resulting first-order logic clauses obtained from relational random walks form the observed
layer of our proposed neural SRL models. The advantages of leveraging relational random walks
for neural network learning are:
(a) Random walks are, in general, a faster mechanism for performing structure learning in relational
domains than, say, an ILP learner (Quinlan, 1990; Lavrac and Dzeroski, 1993). In an ILP
learner, each potential clause is scored in order to finally obtain the clauses that offer the best
coverage of the examples in a given domain. Though effective, this scoring of clauses is
a bottleneck in these models. Random walks, on the other hand, are faster because they do not
involve scoring clauses.
(b) We acquire a large number of random walks on the relational data to perform structure learning;
even though the vast majority of random walks might not be highly predictive, some
will capture meaningful structure present in the data, endowing the model with the
power to discriminate between positive and negative examples. The argument is similar to
classical ensemble methods, where a large number of weak classifiers form a strong classifier.
This hypothesis is further validated by our experimental evaluation in Sections 3.5 and 5.4.
Hence, structure learning by performing random walks is both efficient and effective. We now
introduce relational functional gradient boosting, the mechanism employed to boost the
relational RBM model in Chapter 4.
2.3 Functional Gradient Boosting
Functional gradient boosting (FGB), introduced by Friedman (2001), has emerged as a state-
of-the-art ensemble method. Functional gradient boosting aims to learn a model f(·) by optimizing
Figure 2.3: Functional Gradient Boosting, where the loss function is mean squared error.
a loss function L[f] by emulating gradient descent. At iteration m, however, instead of explicitly
computing the gradient ∂L[f_{m−1}](x_i, y_i), FGB approximates the gradient using a weak regression
tree¹, ∆_m.
For a probabilistic model, the loss function is replaced by a (log-)likelihood function L[ψ],
described in terms of a potential function ψ(·), which FGB aims to learn. FGB begins
with an initial potential ψ_0; intuitively, ψ_0 represents the prior of the probability distribution of the
target atom. This initial potential can be any function: a constant, a prior probability distribution,
or any function that incorporates background knowledge available prior to learning.
At iteration m, FGB approximates the true gradient by a functional gradient ∆_m. That is,
gradient boosting attempts to identify an approximate gradient ∆_m that corrects the errors of
¹A weak base estimator is any model that is "simple" and underfits (hence, weak). From a machine-learning standpoint, such weak learners have high bias and low variance and are easy to learn. Shallow decision trees are a popular choice for weak base estimators in ensemble learning, owing to their algorithmic efficiency and interpretability.
the current potential, ψ_{m−1}. This ensures that the new potential ψ_m = ψ_{m−1} + ∆_m continues to
improve. Like most boosting algorithms, FGB learns ∆_m as a weak regression tree and ensembles
several such weak trees to learn a final potential function. Thus, the final model is
a sum of regression trees: ψ_m = ψ_0 + ∆_1 + . . . + ∆_m (Figure 2.3). In relational models, regression
trees are replaced by relational regression trees (RRTs, (Blockeel and De Raedt, 1998)). Past
models, including Natarajan et al. (2011), Khot et al. (2011), Natarajan et al. (2012), Yang et
al. (2016), Natarajan et al. (2017), Ramanan et al. (2018), and Das et al. (2020), have utilized this
technique in order to learn efficient relational models.
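The FGB loop can be sketched with a hand-rolled depth-1 regression stump as the weak learner. This is a propositional, squared-error illustration of ψ_m = ψ_{m−1} + ∆_m on toy 1-D data, not the relational regression trees used in later chapters.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression stump (the weak learner) to the residuals."""
    best_sse, best = np.inf, None
    for thr in np.unique(x)[:-1]:          # candidate split points
        left, right = residual[x <= thr], residual[x > thr]
        lm, rm = left.mean(), right.mean()
        sse = ((left - lm) ** 2).sum() + ((right - rm) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (thr, lm, rm)
    return best

def stump_predict(x, stump):
    thr, lm, rm = stump
    return np.where(x <= thr, lm, rm)

def fgb_fit(x, y, n_iters=20):
    """Functional gradient boosting with mean-squared-error loss."""
    psi0 = y.mean()                        # initial potential psi_0 (a constant)
    pred, deltas = np.full_like(y, psi0), []
    for _ in range(n_iters):
        residual = y - pred                # pointwise functional gradient of MSE
        delta = fit_stump(x, residual)     # weak approximation Delta_m
        deltas.append(delta)
        pred = pred + stump_predict(x, delta)   # psi_m = psi_{m-1} + Delta_m
    return psi0, deltas

def fgb_predict(x, psi0, deltas):
    return psi0 + sum(stump_predict(x, d) for d in deltas)

x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
psi0, deltas = fgb_fit(x, y)
```

Each stump fits only the residuals of the current ensemble, so the sum of weak learners steadily drives down the training loss, mirroring the relational setting where each RRT plays the role of ∆_m.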
We now provide the necessary background for the proposed models in Part II of the
dissertation on knowledge graph embeddings. We begin by introducing two concepts, generative
knowledge graph embeddings (Section 2.4) and latent Dirichlet allocation (Section 2.5), both of
which will be utilized to learn the proposed multi-modal knowledge graph embedding model in
Chapter 6.
2.4 Generative Knowledge Graph Embeddings
A standard knowledge graph is represented as K = (E, R, T), consisting of a set E of entities, a set R
of relations and a set T = {(h_n, r_n, t_n)}_{n=1}^{|T|} of knowledge graph triples. Further, h ∈ ℝ^K, t ∈ ℝ^K and
r ∈ ℝ^K represent the K-dimensional embeddings of the head, tail and relation, respectively, of a given
triple (h, r, t) in the KG. Additionally, we use the symbol e ∈ ℝ^K to denote both head and tail
embeddings. Our generative model is inspired by the Bayesian matrix factorization of Salakhutdinov
and Mnih (2007). As a first step, the prior probability of an entity e ∈ E (which could represent either
a head h or a tail t) is drawn from a zero-mean spherical Gaussian prior with variance σ_e²:
P(E | σ_e²) = ∏_{i=1}^{|E|} N(e_i | 0, σ_e² I)   (2.3)
Similarly, the prior probability of a relation r is drawn from a zero-mean spherical Gaussian prior
with variance σ_r²:
P(R | σ_r²) = ∏_{p=1}^{|R|} N(r_p | 0, σ_r² I)   (2.4)
The likelihood over all the triples T in the KG K is defined as:

P(T | E, R) = ∏_{n=1}^{|T|} P(y_{h_n,r_n,t_n} = 1 | h_n, r_n, t_n)   (2.5)
In the above expression, the probability that a given triple (h, r, t) is true, i.e., P(y_{h,r,t} = 1 | h, r, t),
is defined as the product of two softmax functions, generated by corrupting
either the head or the tail of the triple in their respective denominators, and is mathematically defined
by the following expression (Lacroix et al., 2018):
P(y_{h,r,t} = 1 | h, r, t) = softmax_1(score(h, r, t)) · softmax_2(score(h, r, t))   (2.6)

= [exp(score(h, r, t)) / ∑_{t̄∈E} exp(score(h, r, t̄))] · [exp(score(h, r, t)) / ∑_{h̄∈E} exp(score(h̄, r, t))]   (2.7)
where (h, r, t̄) (or (h̄, r, t)) is a corrupted (false) triple generated by corrupting the tail entity t
(or head entity h). Although one could consider several existing
models (Bordes et al., 2013; Lin et al., 2015; Trouillon et al., 2016) to score the relation triples
(h, r, t) in the knowledge graph K, we employ the DistMult model (Salehi et al., 2018; Yang et al., 2015)
in Chapter 6. This allows us to define the scoring function of a triple as:
score(h, r, t) = ∑_{l=1}^{K} h_l r_l t_l   (2.8)
where h_l represents the value of the embedding h along the l-th dimension. Consequently, the log of the
posterior distribution over the entity and relation embeddings given the triples T is:
log P(E, R | T, σ_r², σ_e²) = log P(T | E, R) + log P(E | σ_e²) + log P(R | σ_r²)   (2.9)

= ∑_{n=1}^{|T|} log P(y = 1 | h_n, r_n, t_n) − (λ_e/2) ∑_{i=1}^{|E|} e_iᵀe_i − (λ_r/2) ∑_{j=1}^{|R|} r_jᵀr_j + C   (2.10)
Figure 2.4: The generative process of triples T in a given knowledge graph K = {E, R, T}. The embeddings h and t are generated by the zero-mean spherical Gaussian prior N(0, λ_e⁻¹ I), the relation r is generated by the zero-mean spherical Gaussian prior N(0, λ_r⁻¹ I), and the triple (h, r, t) is generated with probability softmax_1(score(h, r, t)) · softmax_2(score(h, r, t)).

where C represents the constant terms that do not depend on the model parameters, λ_e = 1/σ_e² and
λ_r = 1/σ_r². Now, the generative process of knowledge graph triples can be described as follows
(see Figure 2.4):
1. For each entity e, draw its corresponding embedding e ∼ N(0, λ_e⁻¹ I).

2. For each relation r, draw its corresponding embedding r ∼ N(0, λ_r⁻¹ I).
3. Draw a triple (h, r, t) according to the probability P(y_{h,r,t} = 1 | h, r, t) in Equation (2.7).
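The three steps above can be simulated directly. The dimensions and regularization constants below are arbitrary choices for illustration; the scoring function is the DistMult score of Equation (2.8), and entities and relations are identified by integer indices.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_entities, n_relations = 8, 5, 3       # illustrative sizes
lam_e, lam_r = 4.0, 2.0                    # lambda = 1 / sigma^2

# Steps 1-2: draw every embedding from the zero-mean spherical Gaussian
# prior N(0, lambda^{-1} I), i.e. per-dimension std = 1/sqrt(lambda).
E = rng.normal(0.0, 1.0 / np.sqrt(lam_e), size=(n_entities, K))
R = rng.normal(0.0, 1.0 / np.sqrt(lam_r), size=(n_relations, K))

def score(h, r, t):
    """DistMult score of a triple (Eq. 2.8): sum_l h_l r_l t_l."""
    return float(np.sum(E[h] * R[r] * E[t]))

def triple_prob(h, r, t):
    """Step 3 / Eq. (2.7): product of two softmaxes, over corrupted
    tails and corrupted heads respectively."""
    s = score(h, r, t)
    tails = np.array([score(h, r, e) for e in range(n_entities)])
    heads = np.array([score(e, r, t) for e in range(n_entities)])
    return (np.exp(s) / np.exp(tails).sum()) * (np.exp(s) / np.exp(heads).sum())

# The Gaussian priors contribute the L2 penalties of Eq. (2.10)
# to the log-posterior: -(lam_e/2) sum_i e_i'e_i - (lam_r/2) sum_j r_j'r_j.
prior_term = -0.5 * lam_e * np.sum(E * E) - 0.5 * lam_r * np.sum(R * R)
p = triple_prob(0, 1, 2)
```

Note that maximizing the log-posterior of Equation (2.10) is exactly L2-regularized maximum likelihood, with the regularization weights λ_e and λ_r coming from the prior variances.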
Next, we discuss the latent Dirichlet allocation model, which is utilized in Chapter 6 to learn a
generative model over the text descriptions of knowledge graph entities.
2.5 Latent Dirichlet Allocation
In the text mining literature (Feldman and Sanger, 2006; Blei and Lafferty, 2009), a topic is defined
as a probability distribution over a fixed set of vocabulary words. The goal of topic modeling (Blei
et al., 2003; Blei and Lafferty, 2005; Hofmann, 1999) is to automatically uncover the underlying
topics (or themes) being discussed in a given document by analyzing the original text. Once the
topics are discovered, topic modeling can act as a powerful technique for clustering documents that
have similar topics, for exploring how different topics are connected and how they trend over time, and for
performing document classification and information retrieval (Blei and Lafferty, 2009). Though
various models have been developed for discovering the topics of a document, the most seminal
work in topic modeling has been Latent Dirichlet Allocation (LDA, (Blei et al., 2003)). LDA is
a hierarchical Bayesian model which posits that each document is generated as a mixture of
topics and each topic, in turn, is characterized by a distribution over the words present in the vocabulary.
In order to capture the topics in a document, LDA is formulated as a hidden variable model such
that the words in the document represent the visible data; the topic distribution of a given document
and the topic of each word in a document are learnt as the hidden variables of the model.
Let D = {d_i}_{i=1}^{M} be the set of documents under consideration, and let each document d_i be
represented as a sequence d_i = (w_ij)_{j=1}^{N_i} of N_i words. Further, let K be the number of topics present
in any document and V be the size of the vocabulary. Note that when we formulate our
new model in Chapter 6, the number of hidden topics K in each document is the same as the
dimensionality K of the knowledge graph embeddings in Section 2.4. Let θ ∈ ℝ^K be
the topic distribution of a given document and β ∈ ℝ^{K×V} be the word distributions of the
K topics. Finally, the indices i, j, k represent the i-th document, j-th word and k-th topic, respectively. An LDA model
generates a document by the following generative process:
1. For each document d_i:
(a) draw a vector of topic distribution θ_i ∼ Dir(α⃗)
(b) for each word w_ij in the document:
i. draw the topic assignment of the j-th word in d_i as z_ij ∼ Mult(θ_i), z_ij ∈ {1, 2, . . . , K}
ii. draw a word w_ij ∼ Mult(β_{z_ij}), w_ij ∈ {1, 2, . . . , V}
Here z_ij is a hidden variable that represents the topic of the j-th word in the i-th document, Dir(α⃗)
is the Dirichlet distribution with parameter α⃗, a K-dimensional positive vector, and Mult(θ_i)
is the multinomial distribution with parameter θ_i. The central problem in the LDA model is to infer the
posterior distribution of the hidden variables, i.e., {θ, z}, given a text document d. However, the
exact solution of this problem is intractable (Blei et al., 2003); thus, the model relies on a variational
EM algorithm to learn variational parameters corresponding to the document-topic distribution θ and
the topic of each word z.
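The generative process above is easy to simulate; the corpus sizes and hyperparameter values below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N_i = 3, 10, 40                 # topics, vocabulary size, words in d_i
alpha = np.full(K, 0.5)               # Dirichlet parameter (K-dim, positive)
beta = rng.dirichlet(np.ones(V), size=K)   # K topic-word distributions

# Generate one document d_i following the LDA generative process:
theta_i = rng.dirichlet(alpha)        # (a)    topic distribution of the document
z = rng.choice(K, size=N_i, p=theta_i)                 # (b)i.  per-word topics
w = np.array([rng.choice(V, p=beta[zj]) for zj in z])  # (b)ii. observed words
```

Only w would be observed in practice; θ_i and z are precisely the hidden variables whose posterior the variational EM algorithm must approximate.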
We now describe generative adversarial networks (GANs, (Goodfellow et al., 2014)), the first
model to pose two sub-models in an adversarial setting. This forms the basis for Chapter 7.
2.6 Generative Adversarial Networks
Generative Adversarial Networks (GANs, (Goodfellow et al., 2014)) are among the most influential
generative models put forward by the deep learning community. In this model, two sub-models,
namely a generator G and a discriminator D, play a minimax game against each other. The goal
of the generator is to transform noise z ∼ p_z(z) into data that looks as real as samples from the true
distribution x ∼ p_data(x), while the opposing goal of the discriminator is to learn to discern between
the generated and the true distributions. The mutual competition drives both models to optimize their
opposing goals simultaneously until, at the global optimum, the generator becomes capable of
reproducing the true data distribution, which is the end goal of GANs. The aim of the discriminator D
is to optimize the following objective function:
max_D ( E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))] )   (2.11)

while the opposing goal of the generator is to minimize the following objective function:

min_G ( E_{z∼p_z(z)}[log(1 − D(G(z)))] )   (2.12)
In the original work, both G and D were represented and trained as multi-layer perceptron models.
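The two objectives can be written down directly as functions of the discriminator's outputs. The sketch below only evaluates the losses on illustrative, hand-picked probabilities, leaving the alternating gradient updates of G and D to any autodiff framework.

```python
import numpy as np

def discriminator_objective(d_real, d_fake):
    """Eq. (2.11), to be maximized over D:
    E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def generator_objective(d_fake):
    """Eq. (2.12), to be minimized over G: E_{z~p_z}[log(1 - D(G(z)))]."""
    return np.mean(np.log(1.0 - d_fake))

# Illustrative discriminator outputs D(x) on real and D(G(z)) on fake batches:
d_real = np.array([0.9, 0.8, 0.95])      # confident the real data is real
d_fake = np.array([0.1, 0.2, 0.05])      # confident the fakes are fake
good_d = discriminator_objective(d_real, d_fake)

# As the generator improves, D(G(z)) rises and the generator's loss falls:
better_fakes = np.array([0.6, 0.7, 0.5])
```

In training, the two players alternate: D ascends Equation (2.11) while G descends Equation (2.12); in practice, G is often trained to maximize log D(G(z)) instead, which gives stronger gradients early on (Goodfellow et al., 2014).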
One major drawback of standard GANs is that they exhibit training instability, which is ameliorated
by works like Wasserstein GAN (Arjovsky et al., 2017; Gulrajani et al., 2017) that propose
novel optimization objectives for training GANs. We develop a new model in Chapter 7 that is
motivated by this concept of positioning two models as adversaries against each other.
CHAPTER 3
RELATIONAL RESTRICTED BOLTZMANN MACHINES
In this chapter, we present our first proposed approach (Kaur et al., 2017) for learning Boltzmann
machines from relational data (RRBM) and show that our method of constructing RBMs is comparable
to or better than state-of-the-art probabilistic relational learning approaches.
3.1 Introduction
Restricted Boltzmann machines (RBMs, (Rumelhart and McClelland, 1987; Lecun et al., 2006))
are popular models for learning probability distributions due to their expressive power. Consequently,
they have been applied to various tasks such as collaborative filtering (Salakhutdinov et al.,
2007), motion capture (Taylor et al., 2007) and video sequences (Sutskever and Hinton, 2007).
Similarly, there has been significant research on the theory of RBMs: approximating the log-likelihood
gradient by contrastive divergence (CD) (Hinton, 2002), persistent CD (Tieleman, 2008), parallel
tempering (Desjardins et al., 2010), extensions to handle real-valued variables, and
discriminative versions of RBMs.
While these models are powerful, they make the standard assumption of using flat feature vectors
to represent the problem. On the other hand, general Statistical Relational Learning (SRL,
(Getoor and Taskar, 2007; De Raedt et al., 2016)) methods use richer symbolic features during
learning; however, these features have not been fully exploited in deep-learning methods. Learning SRL
models is computationally intensive (Natarajan et al., 2016), however, particularly learning the model
structure (qualitative relationships). This is because structure learning requires searching over
objects, their attributes, and the attributes of related objects. Hence, the state-of-the-art learning
method for SRL models learns a series of weak relational rules that are combined during prediction.
Another limitation is that these methods lead to rules that are dependent on each other, making
them uninterpretable, since weak rules cannot always model the rich relationships that exist in the
domain. For instance, a weak rule could say something like: "a professor is popular if he teaches
a course". When learning discriminatively, this rule could have been true if some popular professors teach
at least one course, while at least one not-so-popular professor did not teach a course in the
current data set. Our first contribution is to use a set of interpretable rules based on the successful
Path Ranking Algorithm (PRA, (Lao and Cohen, 2010)).
Path Ranking Algorithm (PRA, (Lao and Cohen, 2010)).
Our second contribution is to employ these relational rules in learning RBMs. Recently, Hu
et al. (2016) employed logical rules to enhance the representation of neural networks.
There has also been work on lifting neural networks to relational settings (Blockeel and Uwents,
2004; DiMaio and Shavlik, 2004; Sourek et al., 2018). While the specific methodologies differ, at
a higher level all these methods employ relational and logical rules as features of neural networks
and train them on relational data. In this spirit, we propose a methodology for lifting RBMs to
relational data. While previous methods for lifting relational networks employed logical constraints
or templates, we use relational random walks to construct relational rules, which are then used as
features in an RBM. Specifically, we consider random walks constructed by the PRA approach
of Lao and Cohen (2010) to develop features that can be trained using RBMs. We consider the
formalism of discriminative RBMs as our base classifier and use these relational walks with them.
We propose two approaches to instantiating RBM features: (1) similar to the approach of
Markov Logic Networks (MLNs, (Domingos and Lowd, 2009)) and Relational Logistic Regression
(RLR, (Kazemi et al., 2014)), we instantiate features with counts of the number of times a
random walk is satisfied for every training example; and (2) similar to Relational Dependency
Networks (RDNs, (Natarajan et al., 2012)), we instantiate features with existentials (1 if there exists at least
one instantiation of the path in the data, otherwise 0). Given these features, we train a discriminative
RBM with the following assumptions: the input layer is multinomial (to capture counts and
existentials), the hidden layer is sigmoidal, and the output layer is Bernoulli.
To summarize, we make the following contributions: (1) we combine the powerful formal-
ism of RBMs with the representation ability of relational logic; (2) we develop a relational RBM
(RRBM) that does not fully propositionalize the data; (3) we show the connection between our
proposed neuro-symbolic method and standard SRL approaches such as RDNs, MLNs and RLR,
and (4) we demonstrate the effectiveness of this novel approach by empirically comparing against
state-of-the-art methods that also learn from relational data.
The rest of the chapter is organized as follows: Section 3.2 presents past research closely related to our work, and Section 3.3 describes the significance of studying the relational counterpart of RBMs. Section 3.4 presents our RRBM approach and algorithm in detail, and explores its connections to some well-known probabilistic relational models. Section 3.5 presents the experimental
results on standard relational data sets. Finally, the last section concludes the chapter by outlining
future research directions.
3.2 Related Work
Our related work touches in general on standard Statistical Relational Learning models and specif-
ically focuses on structure learning approaches in SRL, followed by propositionalization-based models, and finally Restricted Boltzmann machines.
3.2.1 Statistical Relational Learning Models
Markov Logic Networks (Domingos and Lowd, 2009) are relational undirected models, where
first-order logic formulas correspond to cliques of a Markov network, and formula weights corre-
spond to the clique potentials. An MLN can be instantiated as a Markov network with a node for
each ground predicate (atom) and a clique for each ground formula. All groundings of the same
formula are assigned the same weight leading to the following joint probability distribution over
all atoms: P(X = x) = (1/Z) exp(∑_i w_i n_i(x)), where n_i(x) is the number of times the i-th formula is satisfied by possible world x, and Z is a normalization constant. Intuitively, a possible world where formula f_i is true one more time than a different possible world is e^{w_i} times as probable,
all other things being equal. While typical MLN learning methods can learn the full joint model
of all the relations (predicates) in the domain, we focus on discriminative learning of MLNs in the
next subsection where the goal is to learn a conditional distribution of one relation given all the
other relations. One discriminative model that explicitly models the conditional distribution of one
relation given the others is relational logistic regression (RLR) (Kazemi et al., 2014). RLR extends
logistic regression to relational settings to handle varying population sizes of the feature space for
different examples. An interesting observation is that RLR can be considered as an aggregator
when there are multiple values for the same set of features.
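As a concrete toy illustration of the MLN distribution above, the following sketch (with made-up formula weights and satisfaction counts; not an actual MLN system) computes the unnormalized weights of two possible worlds and verifies that a world satisfying one more grounding of a formula with weight w is e^w times as probable, all other things being equal.

```python
import math

def unnormalized_weight(weights, counts):
    """exp(sum_i w_i * n_i(x)): unnormalized probability of a possible world."""
    return math.exp(sum(w * n for w, n in zip(weights, counts)))

# Two hypothetical possible worlds that differ only in how often
# formula f0 (weight w0 = 1.5) is satisfied: 3 times vs. 2 times.
weights = [1.5, 0.7]
world_a = [3, 4]   # n_i(x) for world a
world_b = [2, 4]   # n_i(x) for world b

# The normalization constant Z cancels in the ratio, leaving e^{w0}.
ratio = unnormalized_weight(weights, world_a) / unnormalized_weight(weights, world_b)
```

The ratio equals e^{1.5}, exactly as the intuition in the text predicts; the shared normalization constant Z cancels.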
3.2.2 Structure Learning Approaches
Many structure learning approaches for Statistical Relational Learning (SRL), including MLNs,
use graph representations. For example, Learning via Hypergraph Lifting (LHL) (Kok and Domin-
gos, 2009) builds a hypergraph over ground atoms; LHL then clusters the atoms to create a “lifted”
hypergraph, and traverses this graph to obtain rules. Specifically, depth-first traversal generates paths in this “lifted” hypergraph, and potential clauses are formed by using the conjunction of predicates along a path as the body of a clause.
Learning with Structural Motifs (LSM) (Kok and Domingos, 2010) performs random walks
over the graph to cluster nodes and performs depth-first traversal to generate potential clauses. We
use random walks over a lifted graph to generate all possible clauses, and then use a non-linear
combination (through the hidden layer) of ground clauses, as opposed to the linear combination in
MLNs. Our hypothesis space includes the clauses generated by both these approaches without the
additional complexity of clustering the nodes.
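To make the path-to-clause construction concrete, here is a minimal sketch (a hypothetical helper, not the LHL or LSM implementation) that conjoins the predicates along a graph path into the body of a candidate clause:

```python
def path_to_clause(path, head):
    """Conjoin the predicates along a graph path into the body of a
    candidate clause, in the spirit of LHL/LSM-style structure learners."""
    body = " ∧ ".join(f"{pred}({a}, {b})" for pred, a, b in path)
    return f"{body} → {head}"

# Hypothetical path through a lifted graph for the university domain.
path = [("takes", "S", "C"), ("taughtBy", "C", "P")]
clause = path_to_clause(path, "advisedBy(S, P)")
# → "takes(S, C) ∧ taughtBy(C, P) → advisedBy(S, P)"
```

Every path found by the traversal yields one such candidate clause; the clause head is fixed to the target predicate.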
3.2.3 Propositionalization Approaches
To learn powerful deep models on relational data, propositionalization is used to convert ground
atoms into a fixed-length feature vector. For instance, kFoil (Landwehr et al., 2010) uses a dy-
namic approach to learn clauses to propositionalize relational examples for SVMs. Each clause is converted into a Boolean feature that is 1 if an example satisfies the clause body, and each clause is scored based on the improvement of the SVM learned using the clause features. Alternately, the
Path Ranking Algorithm (PRA) (Lao and Cohen, 2010), which has been used to perform knowl-
edge base completion, creates features for a pair of entities by generating random walks from a
graph. We use a similar approach to perform random walks on the lifted relational graph to learn
the structure of our relational model.
3.3 Why Study Relational Boltzmann Machines?
The motivation to study Boltzmann machines in a relational context is twofold, inspired by two different perspectives (Fischer and Igel, 2012):

(a) they can be viewed as undirected graphical models, particularly as Markov random fields (MRFs, (Pearl, 1988)). When considered as MRFs, Boltzmann machines have two sets of variables: the visible variables, as in a standard MRF, plus an additional set of hidden variables.
(b) Boltzmann machines can be viewed through the lens of feed-forward neural networks where
they are interpreted as stochastic neural networks with one hidden layer of non-linear pro-
cessing units.
We now consider each perspective in detail. Markov Logic Networks, one of the most popular SRL models, are defined as sets of weighted first-order logic clauses. When instantiated, the resulting clauses represent an MRF in which each feature corresponds to one possible grounding of a first-order logic formula. As discussed previously, Boltzmann machines are also MRFs, with latent variables as an additional component of the graph. This motivates the need for relational Boltzmann machines that could, potentially, perform as efficiently as MLNs, based on the intuition that both models originate from MRFs. Furthermore, relational Boltzmann machines would also leverage the hidden variables of MRFs, enabling them to capture complex latent features present in relational data; such latent structure is not easily captured by visible features alone, as in the case of MLNs.
We now consider the alternative view of Boltzmann machines as feed-forward neural networks to better understand the need for lifting them. Among the first models employed to show that deep neural networks can be trained without getting stuck in local optima were Deep Belief Networks (Bengio et al., 2006; Hinton and Osindero, 2006). The idea was to perform greedy layer-wise training of the DBN, considering one layer at a time while keeping the parameters of all other layers fixed. It was shown mathematically that training each layer of a DBN is equivalent to optimizing the parameters of a Boltzmann machine at that layer.
relational Boltzmann machines, as they can serve as the starting point for further lifting complex,
deep models to relational domains. The advantage of learning such relational deep architectures is
that, like standard deep architectures, the hidden layer of the resulting deep neuro-symbolic models
will capture the higher order abstractions present in the relational data.
3.4 Relational Restricted Boltzmann Machines: The Proposed Approach
Reconsider MLNs, arguably one of the leading relational approaches unifying logic and probabil-
ity. The use of relational formulas as features within a log-linear model allows the exploitation of
“deep” knowledge. Nevertheless, this is still a shallow architecture as there are no “hierarchical”
formulas defined from lower levels. The hierarchical stacking of layers, however, is the essence
of deep learning and, as we demonstrate in this work, critical for relational data, even more than
for propositional data. This is due to one of the key features of relational modeling: predictions of
the model may depend on the number of individuals, that is, the population size. Sometimes this
dependence is desirable, and in other cases, model weights may need to change. In either case,
it is important to understand how predictions change with population size when modeling or even
learning the relational model (Kazemi et al., 2014).
We now introduce Relational RBMs (RRBMs), a relational classifier that can learn hierarchical
relational features through its hidden layer and model non-linear decision boundaries. The idea is
to use lifted random walks to generate relational features for predicates that are then counted (or
used as existentials) to become RBM features. Of course, more than one RBM could be trained,
stacking them on top of each other. For the sake of simplicity, we focus on a single layer; however,
our approach is easily extended to multiple layers. Our learning task can be defined as follows:
Given: Relational data, D; Target Predicate, T .
Learn: Relational Restricted Boltzmann Machine (RRBM) in a discriminative fashion.
We are given data, D = {(x_i, y_i)}_{i=1}^ℓ, where each training example is a vector x_i ∈ R^m with a multi-class label, y_i ∈ {1, . . . , C}. The training labels are represented by a one-hot vectorization: y_i ∈ {0, 1}^C with y_i^k = 1 if y_i = k and zero otherwise. For instance, in a three-class problem, if y_i = 2, then y_i = [0, 1, 0]. The goal is to train a classifier by maximizing the log-likelihood, L = ∑_{i=1}^ℓ log p(y_i | x_i). In this work, we employ discriminative RBMs, for which we make
some key modeling assumptions:
1. input layers (relational features) are modeled using a multinomial distribution, for counts or
existentials;
2. the output layer (target predicate) is modeled using a Bernoulli distribution; and
3. hidden layers are continuous, with a range in [0, 1].
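The one-hot vectorization of labels described above is straightforward to sketch (a hypothetical helper, shown for the three-class example in the text):

```python
def one_hot(y, num_classes):
    """One-hot vectorization of a class label y ∈ {1, ..., C}:
    a length-C binary vector with a single 1 at position y."""
    vec = [0] * num_classes
    vec[y - 1] = 1
    return vec

# For the three-class example in the text: y_i = 2 → [0, 1, 0].
label = one_hot(2, 3)
```

This is the representation fed to the Bernoulli output layer of the discriminative RBM.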
3.4.1 Step 1: Relational data representation
We use a lifted-graph representation to model relational data, D. Each type corresponds to a node
in the graph and the predicate r(t1, t2) is represented by a directed edge from the node t1 to t2
in the graph. For N-ary predicates, say r(t1, ..., tn), we introduce a special compound value type (CVT)1, rCVT, for each n-ary predicate. For each argument tk, an edge erk is added between the nodes rCVT and tk. Similarly, for a unary predicate r(t), we create a binary predicate isa(t, r).

Figure 3.1: Lifted random walks are converted into feature vectors by explicitly grounding every random walk for every training example. Nodes and edges of the graph in (a) represent types and predicates, and an underscore prefix (_Pr) denotes an inverted predicate. The random walk counts in (b) are then used as feature values for learning a discriminative RBM (DRBM). An example of a random walk represented as a clause is shown in (c).
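A minimal sketch of this lifted-graph construction follows; the helper name and the CVT edge labels (`e_teach1`, etc.) are illustrative assumptions, not our actual implementation.

```python
from collections import defaultdict

def build_lifted_graph(predicates):
    """Build a lifted graph over types: binary predicates become directed
    edges, n-ary predicates get a compound value type (CVT) node, and a
    unary predicate r(t) is rewritten as isa(t, r)."""
    edges = defaultdict(list)  # node -> [(edge_label, neighbor)]
    for name, args in predicates:
        if len(args) == 1:                      # unary: r(t) => isa(t, r)
            edges[args[0]].append(("isa", name))
        elif len(args) == 2:                    # binary: edge from t1 to t2
            edges[args[0]].append((name, args[1]))
        else:                                   # n-ary: introduce a CVT node
            cvt = name + "CVT"
            for k, t in enumerate(args, start=1):
                # hypothetical edge label for the k-th argument
                edges[cvt].append((f"e_{name}{k}", t))
    return edges

# Hypothetical type signatures for a small university domain.
graph = build_lifted_graph([
    ("advisedBy", ["student", "professor"]),
    ("teach", ["professor", "course", "semester"]),
    ("student", ["person"]),
])
```

The ternary `teach` predicate becomes a `teachCVT` node with one edge per argument, mirroring the CVT construction in the text.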
3.4.2 Step 2: Relational transformation layer
Now, we generate the input feature vector x_j from a relational example, T(a1j, a2j).
the Path Ranking Algorithm (Lao and Cohen, 2010), we use random walks on our lifted relational
graph to encode the local relational structure for each example. We generate m unique random
walks connecting the argument types for the target predicate to define the m dimensions of x.
Specifically, starting from the node for the first argument’s type, we repeatedly perform random
walks until we reach the node for the second argument. For further details of the relational random walk generation process, refer to Section 2.2. Since random walks also correspond to the set of candidate clauses considered by structure-learning approaches for MLNs (Kok and Domingos, 2009, 2010), this transformation function can be viewed as the structure of our relational model.
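The walk-generation step can be sketched as follows; this is a simplified stand-in under stated assumptions (uniform edge choice, a fixed sampling budget, hypothetical function names), not the PRA implementation itself.

```python
import random

def random_walk(graph, start, goal, max_len):
    """One type-restricted random walk on the lifted graph: start at the
    type of the target's first argument and follow random edges until the
    type of the second argument is reached (or give up after max_len steps)."""
    walk, node = [], start
    for _ in range(max_len):
        if not graph.get(node):
            return None
        label, node = random.choice(graph[node])
        walk.append(label)
        if node == goal:
            return walk
    return None

def generate_walks(graph, start, goal, m, max_len=6, seed=0):
    """Collect m unique walks: these define the m feature dimensions of x."""
    random.seed(seed)
    walks = set()
    for _ in range(10000):          # fixed sampling budget (an assumption)
        w = random_walk(graph, start, goal, max_len)
        if w:
            walks.add(tuple(w))
        if len(walks) == m:
            break
    return sorted(walks)

# Tiny hypothetical graph: student -takes-> course -taughtBy-> professor.
graph = {"student": [("takes", "course")],
         "course": [("taughtBy", "professor")]}
walks = generate_walks(graph, "student", "professor", m=1)
# → [("takes", "taughtBy")]
```

Each returned walk is a sequence of predicate labels connecting the two argument types of the target predicate.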
1wiki.freebase.com/wiki/Compound Value Type
A key feature of an RBM trained on standard i.i.d. data is that the feature set x is defined
in advance and is finite. With relational data, this set can potentially be infinite, and feature size
can vary with each training instance. For instance, if the random walk is a paper written by a
professor–student combination, not all professor–student combinations will have the same number of feature values. This is commonly referred to as the multiple-parent problem (Natarajan et al., 2008). To alleviate this problem, SRL methods consider one of two approaches – aggregators
or combining rules. Aggregators combine multiple values to a single value, while combining
rules combine multiple probability distributions into one. While these solutions are reasonable for
traditional probabilistic models that estimate distributions, they are not computationally feasible
for the current task.
Our approach to the multiple-parent problem is to consider existential semantics: if there ex-
ists at least one instance of the random walk that is satisfied for an example, the feature value
corresponding to that random walk is set to 1 (otherwise, to 0). This approach was also recently
(and independently of our work) used by Wang and Cohen (2016) for ranking via matrix factoriza-
tion. This leads to our first model: RRBM-Existentials, or RRBM-E, where E denotes the existential
semantics used to construct the RRBM. One limitation of RRBM-E is that it does not differentiate
between a professor–student combination that has only one paper and another that has 10
papers, that is, it does not take into account how often a relationship is true in the data. Inspired
by MLNs, we also consider counts of the random walks as feature values, a model we denote
RRBM-Counts or RRBM-C (Figure 3.1). For example, if a professor–student combination has
written 10 papers, the feature value corresponding to this random walk for that combination is 10.
To summarize, we define two transformation functions, x_j[p] = g(a1j, a2j, p):
• ge(a1j, a2j, p) = 1, if ∃ a grounding of the pth random walk connecting object a1j to object
a2j, otherwise 0 (RRBM-E);
• gc(a1j, a2j, p) = #groundings of pth random walk connecting object a1j to a2j (RRBM-C).
For example, consider that the walk takes(S, C) ∧ taughtBy(C, P) is used to generate a feature
for advisedBy(s1, p1). The function gc: |{C | takes(s1, C) ∧ taughtBy(C, p1)}| would generate
the required count feature. On the other hand, with the function ge, this feature would be set to 1,
if ∃C, takes(s1, C) ∧ taughtBy(C, p1).
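The two transformation functions can be sketched for exactly this example; the fact representation and the restriction to length-two walks are simplifications for brevity, not our general implementation.

```python
def gc(facts, a1, a2, walk):
    """Count groundings of a two-step walk (r1, r2) connecting a1 to a2,
    e.g. |{C | takes(s1, C) ∧ taughtBy(C, p1)}|."""
    r1, r2 = walk
    mids = {y for (r, x, y) in facts if r == r1 and x == a1}
    return sum(1 for (r, x, y) in facts if r == r2 and x in mids and y == a2)

def ge(facts, a1, a2, walk):
    """Existential semantics: 1 if at least one grounding exists, else 0."""
    return 1 if gc(facts, a1, a2, walk) > 0 else 0

# Ground facts for the example in the text: s1 takes two courses, both
# taught by p1.
facts = {("takes", "s1", "c1"), ("takes", "s1", "c2"),
         ("taughtBy", "c1", "p1"), ("taughtBy", "c2", "p1")}
walk = ("takes", "taughtBy")
count = gc(facts, "s1", "p1", walk)   # two courses connect s1 to p1
exists = ge(facts, "s1", "p1", walk)  # at least one grounding exists
```

Here gc yields the RRBM-C feature value 2, while ge collapses it to the RRBM-E value 1.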
These transformation functions also allow us to relate our approach to other well-known rela-
tional models. For instance, gc uses counts similar to MLNs, while ge uses existential semantics
similar to RDNs (Natarajan et al., 2012). Using features from ge to learn weights for a logistic
regression model would lead to an RLR model, while using features from gc would correspond to
learning an MLN (as we show later). One could also imagine using RLR as an aggregator from
these random walks, but this is beyond the scope of our work. While counts are more informative
and connect to existing SRL formalisms such as MLNs, exact counting is computationally expen-
sive in relational domains. This can be mitigated by using approximate counting approaches, such
as that of Das et al. (2016), which leverages the power of graph databases. Our empirical evaluation did
not require count approximations; we defer integration of approximate counting to future research.
3.4.3 Step 3: Learning Relational RBMs
The output of the relational transformation layer is fed into a multilayered discriminative RBM
(DRBM) to learn a regularized, non-linear, weighted combination of features. The relational trans-
formation layer stacked on top of the DRBM forms the Relational RBM model. Due to non-
linearity, we are able to learn a much more expressive model than traditional MLNs and RLRs.
Recall that the DRBM as defined by Larochelle and Bengio (2008) consists of n hidden units,
h, and the joint probability is modeled as p(y,x,h) ∝ e−E(y,x,h), where the energy function is
parameterized by Θ ≡ (W, b, c, d, U):

E(y, x, h) = −h^T Wx − b^T x − c^T h − d^T y − h^T Uy. (3.1)
As with most generative models, computing the joint probability p(y,x) is intractable, but the
conditional distribution p(y|x) can be computed exactly (Salakhutdinov et al., 2007) as

p(y|x) = exp(d_y + ∑_{j=1}^n ζ(c_j + U_{jy} + ∑_{f=1}^m W_{jf} x_f)) / ∑_{k=1}^C exp(d_k + ∑_{j=1}^n ζ(c_j + U_{jk} + ∑_{f=1}^m W_{jf} x_f)). (3.2)
In Equation 3.2, ζ(z) = log(1 + e^z) is the softplus function, and the index f sums over all the features x_f of example x. During learning, the log-likelihood function is maximized to compute the DRBM
parameters Θ. The gradient of the conditional probability (Equation 3.2) can be computed as:
∂
∂θlog p(yi|xi) =
n∑j=1
σ (oyj(xi))∂oyj(xi)
∂θ+
C∑k=1
n∑j=1
σ (okj(xi)) p(k|xi)∂okj(xi)
∂θ. (3.3)
In Equation 3.3, oyj(xi) = cj + Ujy +∑m
f=1 Wjfxif , where x refers to random-walk features for
every training example. As mentioned earlier, we assume that input features are modeled using a
multinomial distribution. To consider counts as multinomials, we use an upper bound on counts:
2·max_i(count(x_i^j)) for every feature; the bounds are the same for both train and test sets to avoid overfitting. In other words, the bound is simply twice the maximum feature count over all the examples of the training set. The scaling factor could be chosen through cross-validation, but a value of 2 proved to be a reasonable choice in our experiments. For the test examples, we can use the random walks to
generate the features and the RBM layers to generate predictions from these features.
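As a sanity check of Equation 3.2, the following minimal sketch (with made-up weights and a tiny architecture; not our actual implementation) computes p(y|x) for a small DRBM using the softplus function:

```python
import math

def softplus(z):
    """ζ(z) = log(1 + e^z)."""
    return math.log1p(math.exp(z))

def drbm_conditional(x, W, U, c, d):
    """p(y|x) for a discriminative RBM (Larochelle & Bengio, 2008):
    p(y|x) ∝ exp(d_y + Σ_j softplus(c_j + U_{jy} + Σ_f W_{jf} x_f))."""
    n, C = len(c), len(d)
    scores = []
    for y in range(C):
        s = d[y] + sum(
            softplus(c[j] + U[j][y] + sum(W[j][f] * xf for f, xf in enumerate(x)))
            for j in range(n))
        scores.append(s)
    m = max(scores)                       # log-sum-exp trick for stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

# Hypothetical model: 2 hidden units, 3 random-walk features, 2 classes.
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.1]]
U = [[1.0, -1.0], [0.2, -0.2]]
c = [0.0, 0.1]
d = [0.0, 0.0]
probs = drbm_conditional([1.0, 2.0, 0.0], W, U, c, d)
# probs sums to 1; class 0 is favored here because of the signs in U.
```

Exact computation of this conditional is what makes the discriminative RBM tractable to train, even though the joint p(y, x) is not.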
RRBM Algorithm: The complete approach to learn Relational RBMs is shown in Algorithm 1.
In Step 1, we generate type-restricted random walks using PRA. These random walks (rw) are
used to construct the feature matrix. For each example, we obtain exact counts for each random
walk, which becomes the corresponding feature value for that example (Step 7). A DRBM can be
trained on the features (Step 12) as explained in Section 3.4.3.
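The feature-construction loop of Algorithm 1 (Steps 3–10) can be sketched as follows, with a toy stand-in for the count function gc; the helper names and the hard-coded counts are assumptions for illustration only.

```python
def build_feature_matrix(examples, walks, gc):
    """Steps 3-10 of Algorithm 1: one row per training example,
    one column (count feature) per lifted random walk."""
    return [[gc(a1, a2, w) for w in walks] for (a1, a2) in examples]

# Hypothetical stand-ins: two walks and a toy count function backed by
# a lookup table instead of real groundings.
counts = {("s1", "p1", "w0"): 2, ("s1", "p1", "w1"): 0,
          ("s2", "p1", "w0"): 1, ("s2", "p1", "w1"): 3}
toy_gc = lambda a1, a2, w: counts.get((a1, a2, w), 0)
X = build_feature_matrix([("s1", "p1"), ("s2", "p1")], ["w0", "w1"], toy_gc)
# → [[2, 0], [1, 3]]
```

The resulting matrix X is exactly the input handed to LearnDRBM in Step 12.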
3.4.4 Relation to Statistical Relational Learning Models
The random walks can be interpreted as logical clauses (that are used to generate features) and
the DRBM input feature weights b in Equation 3.1 can be interpreted as clause weights (wp).
Algorithm 1 LearnRRBM(T, G, m): Relational Restricted Boltzmann Machines
Input: T(t1, t2): target predicate, G: lifted graph over types, m: number of features
1:  ▷ Generate m random walks between t1 and t2
2:  rw := PerformRandomWalks(G, t1, t2, m)
3:  for 0 ≤ j < ℓ do            ▷ Iterate over all training examples
4:      ▷ Generate features for T(a1j, a2j)
5:      for 0 ≤ p < m do        ▷ Iterate over all the paths
6:          ▷ pth feature computed from the arguments of xj
7:          xj[p] := gc(a1j, a2j, rw[p])
8:      end for
9:  end for
10: x := {xj}                   ▷ Input matrix
11: ▷ Learn DRBM from the features and examples
12: Θ := LearnDRBM(x, y)
13: return RRBM(Θ, rw)
This interpretation highlights connections between our approach and Markov logic networks. In-
tuitively, the relational transformation layer captures the structure of MLNs and the RBM layer
captures the weights of the MLNs. More concretely, exp(b^T x) in Equation 3.1 can be viewed as exp(∑_p w_p n_p(x)) in the probability distribution for MLNs. To verify this intuition, we compare
the weights learned for clauses in MLNs to weights learned by RRBM-C. We generated a synthetic
data set for a university domain with a varying number of objects (professors and students). We picked a subset of professor–student pairs to have an advisedBy relationship and added common papers or common courses based on the following two clauses:
1. author(A, P) ∧ author(B, P)→ advisedBy(A, B)
2. teach(A, C) ∧ registered(B, C)→ advisedBy(A, B)
The first clause states that if a professor A co-authors a paper P with the student B, then A
advises B. The second states that if a student B registers for a course C taught by professor A
then A advises B. Figure 3.2 shows the weights learned by discriminative and generative weight
learning in Alchemy and RRBM for these two clauses as a function of the number of objects in
the domain.

Figure 3.2: Weights learned by Alchemy and RRBMs for a clause vs. size of the domain: (a) co-author clause; (b) course clause.

Recall that in MLNs, the weight of a rule captures the confidence in that rule — the
higher the number of instances satisfying a rule, the higher is the weight of the rule. As a result, the
weight of the rule learned by Alchemy also increases in Figure 3.2. We observe a similar behavior
with the weight learned for this feature in our RRBM formulation as well. While the exact values
differ due to difference in the model formulation, this illustrates clearly that the intuitions of the
model parameters from standard SRL models are still applicable.
In contrast to standard SRL models, RRBMs are not a shallow architecture. This can be better
understood by looking at the rows of the weights W in the energy function (Equation 3.1): they
act as additional filter features, combining different clause counts. That is, E(y,x,h) looks at how
well the usage profile of a clause aligns with different filters associated with rows W_j·. These filters
are shared across different clauses, but different clauses will make comparisons with different
filters by controlling clause-dependent biases U_{jy} in the σ terms. Notice also that two similar
clauses could share some filters in W , that is, both could simultaneously have large positive values
of Ujy for some rows Wj·. This can be viewed as a form of statistical predicate invention as it
discovers new concepts and is akin to (discriminative) second-order MLNs (Kok and Domingos,
2007). In contrast to second-order MLNs, however, no second-order rules are required as input
to discover new concepts. While MLNs can learn arbitrary N -ary target predicates, due to the
definition of random walks in the original work, we are restricted to learning binary relations.
3.5 Experiments
To compare RRBM approaches to state-of-the-art algorithms, we consider RRBM-E, RRBM-C and
RRBM-CE. The last approach, RRBM-CE combines features from both existential and count RRBMs
(i.e., the union of count and existential features). Our experiments answer the following questions:
Q1: How do RRBM-E and RRBM-C compare to baseline MLNs and Decision Trees?
Q2: How do RRBM-E and RRBM-C compare to the state-of-the-art SRL approaches?
Q3: How do RRBM-E, RRBM-C, and RRBM-CE generalize across all domains?
Q4: How do random-walk generated features compare to propositionalization?
To answer Q1, we compare RRBMs to Learning with Structural Motifs (LSM) (Kok and Domin-
gos, 2010). Specifically, we perform structure learning with LSM followed by weight learning with
Alchemy (Kok et al., 2010) and denote this as MLN. We would also like to answer the question:
how crucial is it to use an RBM, and not some other ML algorithm? We use decision trees (Quinlan, 1993) as a proof of concept to demonstrate that a good probabilistic model, when combined with our random walk features, can potentially yield better results than a naive combination of an ML
algorithm with features. We denote the decision tree model as Tree-C. For LSM, we used the
parameters recommended by Kok and Domingos (2010). However, we set the maximum path
length of random walks of LSM structure learning to 6 to be consistent with the maximum path
length used in RRBM. We used both discriminative and generative weight-learning for Alchemy
and present the best-performing result.
To answer Q2, we compare RRBM-C to MLN-Boost (Khot et al., 2011), and RRBM-E to RDN-
Boost (Natarajan et al., 2012) both of which are SRL models that learn the structure and param-
eters simultaneously. For MLN-Boost and RDN-Boost, we used default settings and 20 gradient
steps for all data sets. For RRBM, since path-constrained random walks (Lao and Cohen, 2010)
are performed on binary predicates, we convert unary and ternary predicates into binary predi-
cates. For example, predicates such as teach(a1, a2, a3) are converted to three binary predicates:
teachArg1(id, a1), teachArg2(id, a2), teachArg3(id, a3) where id is the unique identifier for
a predicate. As another example, unary predicates such as student(s1) are converted to binary
predicates of the form isa(s1, student). To ensure fairness, we used binary predicates as inputs
to all the methods considered here. We also allow inverse relations in random walks, that is, we
consider a relation and its inverse to be distinct relations. For one-to-one and one-to-many rela-
tions, this sometimes leads to uninteresting random walks of the form relation → relation⁻¹ → relation. To avoid this situation, we add sanity constraints on walks that
prevent relations and their inverses from immediately following one another and avoid loops.
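The predicate binarization described above can be sketched directly from the examples in the text (the helper name and the unique-identifier argument are assumptions):

```python
def binarize(name, args, uid):
    """Convert unary/ternary predicates to binary ones, as done before
    running path-constrained random walks on binary predicates."""
    if len(args) == 1:                      # student(s1) -> isa(s1, student)
        return [("isa", args[0], name)]
    if len(args) == 2:                      # already binary: keep as-is
        return [(name, args[0], args[1])]
    # teach(a1, a2, a3) -> teachArg1(id, a1), teachArg2(id, a2), ...
    return [(f"{name}Arg{k}", uid, a) for k, a in enumerate(args, start=1)]

unary = binarize("student", ["s1"], "id0")
# → [("isa", "s1", "student")]
ternary = binarize("teach", ["a1", "a2", "a3"], "id7")
# → [("teachArg1", "id7", "a1"), ("teachArg2", "id7", "a2"),
#    ("teachArg3", "id7", "a3")]
```

The shared identifier `id7` is what ties the three binary facts back to one original ternary ground atom.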
To answer Q4, we compare our method with Bottom Clause Propositionalization (Franca et al.,
2014) (BCP-RBM), which generates one bottom clause for each example and considers each atom
in the body of the bottom clause to be a unique feature. We utilize Progol (Muggleton, 1995) to
generate bottom clauses by using its default configuration but setting variable depth = 1 to handle
large data sets. Contrary to the original work (Franca et al., 2014) that uses a neural network, we
use an RBM as the learning model, as our goal is to demonstrate the usefulness of random walks to
generate features.
In our experiments, we subsample training examples at a 2:1 ratio of negatives to positives. The number of RBM hidden nodes is set to 60% of the visible nodes, the learning rate to η = 0.05, and the number of epochs to 5. These hyperparameters were optimized by line search.
A Note On Hyperparameter Selection: An important hyperparameter for RRBMs is the maxi-
mum path length of random walks, which influences the number of RRBM features. Figure 3.3
shows that the number of features generated grows exponentially with maximum path length. We
restricted the maximum path length of random walks to λ = 6 in order to strike a balance between
tractability and performance; λ = 6 demonstrated consistently good performance across a variety of data sets, while keeping the feature size tractable.

Figure 3.3: The number of RRBM features grows exponentially with the maximum path length of random walks. We set λ = 6 to balance tractability with performance.

As mentioned above, other benchmark methods such as LSM were also restricted to a maximum random walk length of 6 for fairness.
Hyperparameter selection is an open issue in both relational learning as well as deep learning;
in the latter, careful tuning of hyperparameters and architectures such as regularization constants
and number of layers is critical. Recent work on automated hyperparameter selection can also
be used with RRBMs, if a more systematic approach to hyperparameter selection for RRBMs is
desired, especially in practical settings. Bergstra and Bengio (2012) demonstrated that random
search is more efficient for hyperparameter optimization than grid search or manual tuning. This
approach can be used to select optimal η and λ jointly. Snoek et al. (2012) used Bayesian
optimization for automated hyperparameter tuning. While this approach was shown to be highly
effective across diverse machine learning formalisms including for support vector machines (Cris-
tianini and Shawe-Taylor, 2000), Latent Dirichlet Allocation (Blei et al., 2003) and convolutional
neural networks (Goodfellow et al., 2016), it requires powerful computational capabilities and
parallel processing to be feasible in practical settings.
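A minimal sketch of random search over the two RRBM hyperparameters discussed above follows; the search ranges and the toy scoring function are assumptions for illustration, not values from our experiments.

```python
import random

def random_search(evaluate, n_trials=20, seed=0):
    """Random search (Bergstra & Bengio, 2012) over learning rate η and
    maximum walk length λ: sample configurations, keep the best scorer."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {"eta": 10 ** rng.uniform(-3, -1),   # η in [0.001, 0.1], log scale
               "lam": rng.randint(2, 8)}           # λ in {2, ..., 8}
        score = evaluate(cfg)
        if score > best_score:
            best, best_score = cfg, score
    return best

# Hypothetical validation score that peaks near η = 0.05 and λ = 6.
toy_score = lambda cfg: -abs(cfg["eta"] - 0.05) - 0.1 * abs(cfg["lam"] - 6)
best = random_search(toy_score)
```

In practice `evaluate` would train an RRBM with the candidate configuration and return a held-out validation metric such as AUC-PR.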
3.5.1 Data Sets
We used several benchmark data sets to evaluate the performance of our algorithms. We compare
several approaches using conditional log-likelihood (CLL), area under ROC curve (AUC-ROC),
and area under precision-recall curve (AUC-PR). Measuring PR performance on skewed relational
data sets yields a more conservative view of learning performance (Davis and Goadrich, 2006). As
a result, we use this metric to report statistically significant improvements at p = 0.05. We employ
5-fold cross validation across all data sets.
UW-CSE: The UW-CSE data set (Richardson and Domingos, 2006) is a standard benchmark that
consists of predicates and relations such as professor, student, publication, hasPosition
and taughtBy. The data set contains information from five different areas of computer science
about professors, students and courses, and the task is to predict the advisedBy relationship be-
tween a professor and a student. For MLNs, we present results from generative weight learning as
it performed better than discriminative weight learning.
Mutagenesis: The MUTAGENESIS data set2 has two entities: atom and molecule, and consists of
predicates that describe attributes of atoms and molecules, as well as the types of relationships that
exist between atom and molecule. The target predicate is moleatm(aid, mid), to predict whether a
molecule contains a particular atom. For MLNs, we present results from generative weight learning
as it had better results than discriminative learning.
Cora Entity Resolution is a citation matching data set (Poon and Domingos, 2007); in the citation-
matching problem, a “group” is a set of citations that refer to the same publication. Here, a large
fraction of publications belong to non-trivial groups, that is, groups that have more than one ci-
tation; the largest group contains as many as 54 citations, which makes this a challenging prob-
lem. It contains the predicates such as Author, Title, Venue, HasWordAuthor, HasWordTitle,
SameAuthor and the target predicate is SameVenue. Alchemy did not complete running after 36
2cs.sfu.ca/~oschulte/BayesBase/input
hours and therefore we report results from Khot et al. (2011).
IMDB: This data set was first created by Mihalkova and Mooney (2007) and contains nine predi-
cates: gender, genre, movie, samegender, samegenre, samemovie, sameperson, workedunder,
actor and director; we predict workedUnder as the target relation. Since actor and director
are unary predicates, we converted them to one binary predicate isa(person, designation)
where designation can take two values - actor and director. For MLNs, we report the gen-
erative weight learning results here.
Yeast: This data set contains millions of facts (Lao and Cohen, 2010) from papers published between 1950 and
2012 on the yeast organism Saccharomyces cerevisiae. It includes predicates like gene, journal,
author, title, chem, etc. The target predicate is cites, that is, we predict the citation link be-
tween papers. As in the original paper, we need to prevent models from using information obtained
later than the publication date. While calculating features for a citation link, we only considered
facts that were earlier than a publication date. Since we cannot enforce this constraint in LSM, we
do not report Alchemy results for Yeast.
Sports: NELL (Carlson et al., 2010) is an online3 Never-Ending Learning system that extracts in-
formation from online text data, and converts this into a probabilistic knowledge base. We consider
NELL data from the sports domain consisting of information about players and teams. The task
is to predict whether a team plays a particular sport or not. Alchemy did not complete its run after
36 hours; thus, we do not report its results for this data set.
3.5.2 Results
Q1: Figure 3.4 compares our approaches to baseline MLNs and decision trees to answer Q1.
RRBM-E and RRBM-C show significant improvements over Tree-C on the UW-CSE and Yeast data sets,
with comparable performance on the other four. Across all data sets (except Cora) and all metrics,
RRBM-E and RRBM-C beat the baseline MLN approach. Thus, we can answer Q1 affirmatively:
3rtw.ml.cmu.edu/rtw/
38
Figure 3.4: (Q1): Results show that RRBMs generally outperform baseline MLN and decision-tree(Tree-C) models.
Figure 3.5: (Q2) Results show better or comparable performance of RRBM-C and RRBM-CE to MLN-Boost, all of which use counts.
RRBM models outperform baseline approaches in most cases.
Q2: We compare RRBM-C to MLN-Boost (count-based models) and RRBM-E to RDN-Boost (existential-based models) in Figures 3.5 and 3.6. Compared to MLN-Boost on CLL, RRBM-C shows a statistically significant improvement or is comparable on all data sets. RRBM-E is comparable to RDN-Boost on all the data sets, with a statistically significant CLL improvement on Cora. We also see significant AUC-ROC improvements for RRBM-C on Cora and RRBM-E on IMDB. Thus, we confirm that RRBM-E and RRBM-C are better than or comparable to the current best structure learning methods.
Figure 3.6: (Q2) Results show better or comparable performance of RRBM-E and RRBM-CE to RDN-Boost, all of which use existentials.
Figure 3.7: (Q4) Results show better or comparable performance of our random-walk-based feature generation approach (RRBM) compared to propositionalization (BCP-RBM).
Q3: Broadly, the results show that RRBM approaches generalize well across different data sets. The results also indicate that RRBM-CE generally improves upon RRBM-C and has comparable performance to RRBM-E. This suggests that existential features are sufficient, and often better, for modeling these domains. This is also seen in the boosting approaches, where RDN-Boost (existential semantics) generally outperforms MLN-Boost (count semantics).
Q4: Since BCP-RBM only generates existential features, we compare BCP-RBM with RRBM-E to answer Q4. Figure 3.7 shows that RRBM-E has statistically significantly better CLL performance than BCP-RBM on three data sets. Further, RRBM-E demonstrates significantly better performance than BCP-RBM on four data sets (Cora, Mutagenesis, IMDB and Sports) on both AUC-ROC and AUC-PR. This allows us to state positively that random-walk features yield better or comparable performance than propositionalization. For IMDB, BCP-RBM generated identical bottom clauses for all positive examples, resulting in an extreme case of just a single positive example being fed into the RBM. This results in a huge skew (distinctly observable in the AUC-PR of IMDB for BCP-RBM).
3.6 Conclusion
Relational data and knowledge graphs are useful in many tasks, but feeding them to deep learners
is a challenge. To address this problem, we have presented a combination of deep and symbolic
learning, which gives rise to a powerful deep architecture for relational classification tasks, called
Relational Restricted Boltzmann Machines. In contrast to propositional approaches that use deep
learning features as inputs to log-linear models (e.g. (Deng, 2015)), we proposed and explored
a paradigm connecting relational features as inputs to deep learning. While statistical relational
models depend much more on the discriminative quality of the clauses that are fed as input, Re-
lational RBMs can learn useful hierarchical relational features through its hidden layer and model
non-linear decision boundaries. The benefits were illustrated on several SRL benchmark data sets,
where RRBMs outperformed state-of-the-art structure learning approaches—showing the tight in-
tegration of deep learning and symbolic learning models.
CHAPTER 4
BOOSTING RELATIONAL RESTRICTED BOLTZMANN MACHINES
The Relational Restricted Boltzmann Machine (RRBM) approach discussed in the previous chapter employs a rule learner (for structure learning) and a weight learner (for parameter learning) sequentially. In this chapter, we develop a novel gradient-boosted approach for learning Relational RBMs (LRBM-Boost) (Kaur et al., 2020b) that performs both tasks simultaneously.
4.1 Introduction
Restricted Boltzmann Machines (RBMs) (Rumelhart and McClelland, 1987) have emerged as one of the most popular probabilistic learning methods. Coupled with advances in the theory of learning RBMs, such as contrastive divergence (CD; Hinton, 2002), persistent CD (Tieleman, 2008), and parallel tempering (Desjardins et al., 2010), their applicability has been extended to a variety of tasks (Taylor et al., 2007). While successful, most of these models have been used with a flat feature representation (vectors, matrices, tensors) and not necessarily in the context of relational data. In problems where the data is relational, these approaches typically flatten the data, either by propositionalizing it or by constructing embeddings, to allow them to employ standard RBMs. This results in the loss of the "natural" interpretability that is inherent to relational representations, as well as a possible decline in performance due to imperfect propositionalization/embedding.
Consequently, there has been recent interest in developing neural models that directly operate
on relational data. Specifically, significant research has been conducted on developing graph con-
volutional neural networks (Schlichtkrull et al., 2018) that model graph data (a restricted form of
relational data). Most traditional truly relational/logical learning methods (De Raedt et al., 2016;
Getoor and Taskar, 2007) are capable of learning with data of significantly greater complexity,
including hypergraphs. Such representations have also been recently adapted to learning neural
models (Pham et al., 2017; Kazemi and Poole, 2018; Sourek et al., 2018). One recent approach in
this direction is our Relational RBMs (Kaur et al., 2017) discussed in the previous chapter, where
relational random walks were learned over data (effectively, randomized compound relational fea-
tures) and then employed as input layer to an RBM.
While reasonably successful, this method still propositionalized relational features by con-
structing two forms of data aggregates: counts and existentials, which results in loss of valuable
information. Motivated by this limitation, we propose a fully Lifted Restricted Boltzmann Machine (LRBM), whose inherent representation is relational. Additionally, the LRBM can be learned without significant feature engineering; indeed, a key component of our approach is discovering the structure of lifted RBMs. We propose a gradient-boosting approach for learning both the structure and parameters of LRBMs simultaneously. The resulting hidden nodes are newly discovered features, represented as conjunctions of logical predicates.
These hidden layers are learned using the machinery of functional-gradient boosting (Fried-
man, 2001) on relational data. The idea is to learn a sequence of relational regression trees (RRTs)
and then transform them to an LRBM by identifying appropriate transformations. There are a few
salient features of our approach: (1) in addition to being well-studied and widely used (Natarajan
et al., 2011; Khot et al., 2011; Natarajan et al., 2012; Gutmann and Kersting, 2006), RRTs can be
parallelized and adapted easily to new, real-world domains; (2) our approach can handle hybrid
data easily, which is an issue for many logical learners; (3) perhaps most important, our approach
is explainable, unlike other neural models. This is due to the fact that the hidden layers of the
LRBM are simple conjunctions (paths in a tree), and can be easily interpreted as opposed to com-
plex embeddings1. Finally, (4) due to the nature of our learning method, we learn sparser LRBMs
compared to employing random walks.
1 Embedding approaches transform data from the input space to a feature space. A familiar example of this is Principal Components Analysis, which transforms input features to compound features via linear combination; the new features are no longer naturally interpretable. This is also the case with deep learning, which diminishes interpretability by chaining increasingly complex feature combinations across successive layers (for example, autoencoders).
We make a few key contributions in this work: (1) as far as we are aware, this is the first
principled approach to learning truly lifted RBMs from relational data; (2) our representation en-
sures that the resulting RBM is interpretable and explainable (due to the hidden layer being simple
conjunctions of logical predicates). We present (3) a gradient-boosting algorithm for simultane-
ously learning the structure and parameters of LRBMs as well as (4) a transformation process
to construct a sparse LRBM from an ensemble of relational regression trees produced by gradi-
ent boosting. Finally, (5) our empirical evaluation clearly demonstrates three aspects: efficacy,
efficiency and explainability of our approach compared to the state-of-the-art on several data sets.
The rest of the chapter is organized as follows: we review related work in Section 4.2, followed by our proposed model in Section 4.3. We then present our empirical evaluations in Section 4.4 before concluding the chapter in Section 4.5.
4.2 Related Work
We categorize our related work into two groups: past functional gradient boosting models and the
neuro-symbolic systems proposed so far. Each of them is discussed below.
4.2.1 Relational Functional Gradient Boosting based models
Since our proposed model relies on the mechanism of relational Functional Gradient Boosting,
we review the past models that have utilized this technique in order to learn efficient models. The
most popular among them are Relational Dependency Networks (Natarajan et al., 2012), Relational
Logistic Regression (Ramanan et al., 2018), discriminative training of undirected models (Khot
et al., 2011), temporal models (Yang et al., 2016) and learning relational policies (Natarajan et al.,
2011). Inspired by the success of these methods, we propose to learn the hidden layer of an LRBM
using functional gradient boosting.
4.2.2 Neuro-Symbolic models
Because the LRBM is a neural model developed for relational data, we review the recent neuro-symbolic models here. Among them, relational embeddings (Nickel et al., 2011; Bordes et al.,
2013; Socher et al., 2013; Yang et al., 2015; Nickel et al., 2016; Trouillon et al., 2016) have gained
popularity recently. A common theme among current approaches is to learn a vector representation,
that is, an embedding for each relation and each entity present in the knowledge base. Most
of these approaches also assume binary relations, which is a rather restrictive assumption that
cannot capture the richness of real-world relational domains. Further, millions of embedding parameters must be learned to train these models. Finally, and possibly most concerning:
many embedding approaches cannot easily generalize to new data, and the entire set of embeddings
has to be relearned with new data, or for every new task.
Approaches closest to our proposed work are what we call neural SRL models in Chapter 1 (Kazemi and Poole, 2018; Sourek et al., 2018; Franca et al., 2014; DiMaio and Shavlik, 2004; Lodhi, 2013); these approaches also represent the structure of a neural network as first-order clauses, as we do. The key difference, however, is that in all these models, the clauses have already been obtained either from an expert or an independent ILP system. That is to say, the domain rules
that make up its structure and the resulting neural network architectures are manually specified,
and these approaches typically only perform parameter learning.
Recently, relational neural networks have been proposed for vision tasks (Santoro et al., 2017;
Sung et al., 2018; Hu et al., 2018). While promising, these networks have fixed, manually-specified
structures, and the nature of the relations captured between objects is not interpretable or explainable. In contrast, our model learns the structure and parameters of the neural network simultaneously. One common theme among all these models is that they learn latent features of relational
data in their hidden layers, but our model, being still in its nascent stage, cannot do so yet.
A few approaches for learning neural networks on graphs exist. Graph convolutional networks (Niepert et al., 2016) enable graph data to be trained directly on convolutional networks. Another
set of popular approaches (Scarselli et al., 2009) train a recurrent neural network on each node of
the graph by accepting the input from neighboring nodes until a fixed point is reached. The work
of Schlichtkrull et al. (2018) extends this by learning embeddings for entities and relations in the
relational graph.
Recently, Pham et al. (2017) proposed a neural network architecture where the connections between the different nodes of the network are encoded according to a given graph structure. RBMs have also been considered in the context of relational data. For instance, two tensor-based models (Huang et al., 2015; Li et al., 2014) proposed to lift RBMs by incorporating a fourth-order tensor into their architecture that captures the interaction among a quartet consisting of two objects, the relation between them, and the hidden layer. Finally, our previous approach (Chapter 3) learns relational random walks and uses the counts of their groundings as the observed layer of an RBM.
4.3 Boosting of Lifted RBMs
In our proposed model, scalars are denoted in lower-case (y, w), vectors in bold face (y, w), and
matrices in upper case (Y , W ). uᵀv denotes the dot product between u and v.
Recall that our goal is to learn a truly lifted RBM. Consequently, both the hidden and observed layers of the RBM should be lifted (parameterized, as opposed to propositional RBMs). That is to say, the observed layer consists of the predicates (logical relations describing interactions) in the domain, while the hidden layer consists of conjunctions of predicates (logical rules) learned from data. Instead of a complete network, connections exist only between predicates and the hidden nodes whose conjunctions contain them. We illustrate RBM lifting with the following example.
Example. Consider a movie domain that contains the entity types (variables) Person(P), Movie(M)
and Genre(G). Predicates in this domain describe relationships between the various entities, such
as DirectedBy(M, P), ActedIn(P, M), InGenre(M, G) and entity resolution predicates such as
SamePerson(P1, P2) and SameGenre(G1, G2). These predicates are the atomic domain features,
fi. The task is to predict the nature of the collaboration between two persons P1 and P2; this task
can be represented via the target predicate:
Collaborated(P1, P2) =
    0, if P1 and P2 never collaborated,
    1, if P1 worked under P2,
    2, if P2 worked under P1,
    3, if P1 and P2 collaborated at the same level.
To perform this 4-class classification task, we can construct more complex lifted features through conjunctions of the atomic domain features. For example, consider the following lifted feature, h1:

DirectedBy(M1, P1) ∧ InGenre(M1, G1) ∧ ActedIn(P2, M2) ∧ InGenre(M2, G2) ∧ ¬SameGenre(G1, G2) ⇒ (Collab(P1, P2) = 0).   (h1)
This lifted feature is a compound domain rule (essentially a typical conjunction in logic models) made up of several atomic domain features, and it describes one possible classification condition of the target predicate. Specifically, the lifted feature h1 expresses the situation where two persons P1 and P2 are unlikely to have collaborated if they work in different genres. Every such compound domain rule becomes a lifted feature with a corresponding hidden node. In this example, we introduce two others:
DirBy(M1, P1) ∧ ActedIn(P3, M1) ∧ SamePer(P3, P2)⇒ ( Collab(P1, P2) = 1 ) , (h2)
ActedIn(P1, M) ∧ ActedIn(P2, M)⇒ ( Collab(P1, P2) = 3 ). (h3)
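To make the semantics of such a rule concrete, the sketch below evaluates the existential conjunction in h3 against a toy fact base; the facts and helper are illustrative assumptions, not the dissertation's actual representation.

```python
# Toy fact base; each fact is a (relation, person, movie) triple.
facts = {("ActedIn", "anna", "m1"), ("ActedIn", "bob", "m1"),
         ("ActedIn", "carl", "m2")}

def h3(p1, p2):
    """Existential check for h3: does some movie M satisfy
    ActedIn(p1, M) ∧ ActedIn(p2, M)?"""
    movies_p1 = {m for (rel, p, m) in facts if rel == "ActedIn" and p == p1}
    movies_p2 = {m for (rel, p, m) in facts if rel == "ActedIn" and p == p2}
    return bool(movies_p1 & movies_p2)
```

Grounding the rule amounts to checking whether the shared logical variable M can be bound consistently across both conjuncts.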
The key intuition is that these rules, or lifted features, capture the latent structure of the domain
and are a critical component of lifting RBMs. The layers of lifted RBM are as follows (Figure 4.1):
• Visible layer, atomic domain predicates: We create a visible node vi for each lifted atomic
domain predicate fi. Thus, we can express any possible structure that can be enumerated as a conjunction of these atomic features. In Figure 4.1, the visible layer consists of the five
atomic predicates introduced above, f1, . . . , f5.
Figure 4.1: An example of a lifted RBM. The atomic predicates each have a corresponding node in the visible layer (fi). Atomic predicates can be used to create richer features as conjunctions, which are represented as hidden nodes (hj); the connections between the visible and hidden layers are sparse and only exist when the predicate corresponding to fi appears in the compound feature hj. The output layer is a one-hot vectorization of a multi-class label y, and has one node for each class yk. The connections between the hidden and output layers are dense and allow all features to contribute to reasoning over all the classes.
• Hidden layer, compound domain rule: Each of the compound features can be represented
as a node in the hidden layer, hi. In this manner, the lifted RBM is able to construct and
use complex structural rules to reason over the domain. This is similar to classical neural
networks, propositional RBMs and deep learning, where the hidden layer neurons represent
rich and complex feature combinations.
The key difference from existing architectures is that the connections between the visible
and hidden layers are not dense; rather, they are extremely sparse and depend only on the
atomic predicates that appear in the corresponding lifted compound features. In Figure 4.1,
the hidden node h1 is connected to the atomic predicate nodes f1, f2, f3 and f5, while the
hidden node h3 is connected to only the atomic predicate node f2. This allows the lifted RBM
to represent the domain structure in a compact manner. Furthermore, such “compression”
can enable acceleration of weight learning as unnecessary edges are not introduced into the
model structure.
• Output layer, one-hot vectorization: As mentioned above, the lifted RBM formulation can
easily handle multi-class classification. In this example, the target predicate can take 4 val-
ues as it corresponds to a 4-class classification problem. This can be modelled with four
output nodes y1, . . . , y4 through one-hot vectorization of the labels. Note that the connec-
tions between the hidden and output layers are dense. This is to ensure that all features can
contribute to the classification of all the labels.
Furthermore, this enables the lifted RBM to reason with uncertainty. For example, consider
the compound domain feature h1, which describes a condition for two persons to have never
collaborated. By ensuring that the hidden-to-output connections are dense, we allow for the
contribution of this rule to the final prediction to be soft rather than hard. This is similar to
how Markov logic networks learn different rule weights to quantify the relative importance
of the domain rules/lifted features. In a similar manner, the lifted RBM allows for reason-
ing under uncertainty by learning the network weights to reflect the relative significance of
various features to different labels.
Our task now is to learn such lifted RBMs. Specifically, we propose to learn the structure
(compound features as hidden nodes) as well as the parameters (weights on all the edges and
biases within the nodes). This is a key novelty as our approach uses gradient boosting to learn
sparser LRBMs, unlike the fully connected propositional ones. To learn an LRBM, we need to
(1) formulate the (lifted) potential definitions, (2) derive the functional gradients, (3) transform the
gradients to explainable hidden units of the RBM, and (4) learn the parameters of the RBM. We
now present each of these steps in detail.
4.3.1 Functional Gradient Boosting of Lifted RBMs
The conditional distribution in Equation 2.2, which is the basis of a discriminative RBM, is formulated for propositional data, where each feature of a training example xi is modeled as a node in the input layer x. We now extend this definition of the RBM to handle logical predicates (i.e., parameterized relations). Note
that these lifted features (conjunctions) can be obtained in several different ways: (i) as in much existing work on neuro-symbolic reasoning, they could be provided by a domain expert; (ii) they can be learned from data, in the spirit of Inductive Logic Programming (Muggleton and Raedt, 1994); or (iii) they can be generated by performing random walks in the domain that result in rule structures (Chapter 3), to name a few. Any rule induction technique could be employed in this context. In
this work, we adapt a gradient-boosting technique. Given such lifted features (or rules) fk(x) on
training examples x, we can rewrite Equation 2.2 as
p(y | x) = exp(dy + Σj ζ(cj + Ujy + Σk Wjk fk(x))) / Σ_{y* ∈ {1,2,...,C}} exp(dy* + Σj ζ(cj + Ujy* + Σk Wjk fk(x))).   (4.1)
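As a sanity check on Equation 4.1, the following sketch computes p(y | x) for given lifted-feature values fk(x); the weight containers are hypothetical shapes chosen for illustration, and ζ is the softplus function.

```python
import math

def softplus(a):  # ζ(a) = log(1 + e^a)
    return math.log1p(math.exp(a))

def lifted_rbm_conditional(f, W, U, c, d):
    """Evaluate Equation 4.1 with f[k] = fk(x), weights W[j][k], U[j][y],
    hidden biases c[j], and output biases d[y]."""
    scores = []
    for y in range(len(d)):
        s = d[y] + sum(
            softplus(c[j] + U[j][y] + sum(W[j][k] * f[k] for k in range(len(f))))
            for j in range(len(c)))
        scores.append(s)
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

With all weights zero the distribution is uniform over the C classes; raising U[j][y] for one class shifts probability mass toward it.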
Contrast this expression to the propositional discriminative RBM (in Equation 2.2), which models p(y | x). The key difference is that the propositional features Σk Wjk xk are replaced with lifted features Σk Wjk fk(x); while features in a propositional data set are just the data columns/attributes, the features in a relational data set are typically represented in predicate logic (as shown in the example above) and are rich and expressive conjunctions of objects, their attributes and the relations between them.
We now introduce some additional functional notation to simplify Equation 4.1. Without loss of generality, we restrict our discussion to the case of binary targets (with labels ℓ ∈ {0, 1}) and note that this exposition can easily be extended to the case of multiple classes. For each label ℓ, we define the functional

E(xi | c, W, dℓ, Uℓ) := dℓ + Σj ζ(cj + Ujℓ + Σk Wjk fk(xi)).
This functional represents the "energy" of the combination (xi, yi = ℓ). For binary classification, Equation 4.1 simplifies to

p(yi = 1 | xi) = e^{E(xi | c, W, d1, U1)} / (e^{E(xi | c, W, d0, U0)} + e^{E(xi | c, W, d1, U1)}).   (4.2)

This reformulation is critical for the extension of the discriminative RBM framework to relational domains, as it allows us to rewrite the probability p(yi = 1 | xi) in terms of a functional that represents the potential and OF(xi), the observed features of the training example xi.
One of our goals is to learn lifted features from the set of all possible features. In simpler terms, if X is the set of all predicates in the domain and x is the current target, then the goal is to identify the set of features OF(x) such that P(x | X) = P(x | OF(x)). In Markov network terminology, this refers to the Markov blanket of the corresponding variable. In a discriminative MLN framework, OF(x) is the set of weighted clauses in which the predicate x appears. We can now define the probability in Equation 4.2 as

pψ(yi = 1 | OF(xi)) = e^{ψ(yi = 1 | OF(xi))} / (1 + e^{ψ(yi = 1 | OF(xi))}),   (4.3)

where

ψ(yi = 1 | OF(xi)) = E(xi | c, W, d1, U1) − E(xi | c, W, d0, U0).   (4.4)
Note that OF(xi) does not include all the features in the domain, but only the specific features that
are present in the hidden layer. An example of this can be observed in Figure 4.1. This LRBM
consists of three lifted features 〈h1, h2, h3〉 that correspond to the three rules mentioned earlier. We
can thus explicitly write the potential function for a lifted RBM (Equation 4.4) in functional form:

ψ(yi = 1 | OF(xi)) = d + Σj log[(1 + exp(cj + Uj1 + Σk Wjk fk(xi))) / (1 + exp(cj + Uj0 + Σk Wjk fk(xi)))],   (4.5)
where d = d1−d0. This potential functional is parameterized by θ = {d, c,W, U0, U1}, consisting
of (see Figure 4.2) edge weights and biases. The edge weights to be learned are Wjk (between the visible node corresponding to feature fk(xi) and hidden node hj) and Ujℓ (between hidden node hj and output node yℓ). The biases to be learned are cj on the hidden nodes and dℓ on the output nodes. However, instead of learning two biases d1 and d0, we can learn a single bias d = d1 − d0, as the functional ψ only depends on the difference (see Equation 4.5).

Figure 4.2: Weights in a lifted RBM.

Given this functional form,
we can now derive a functional gradient that maximizes the overall log-likelihood of the data
L({xi, yi}^n_{i=1} | ψ) = log Π^n_{i=1} pψ(yi = 1 | OF(xi)) = Σ^n_{i=1} log pψ(yi = 1 | OF(xi)).

The (pointwise) functional gradient of L({xi, yi}^n_{i=1} | ψ) with respect to ψ(yi = 1 | OF(xi)) can be computed as follows:

∂ log pψ(yi = 1 | OF(xi)) / ∂ψ(yi = 1 | OF(xi)) = I(yi = 1) − P(yi = 1 | OF(xi)) := ∆i,
where I(yi = 1) is an indicator function. The pointwise functional gradient has an elegant inter-
pretation. For a positive example (I(yi = 1) = 1), the functional gradient ∆i aims to improve the
model such that 1 − P (yi = 1) is as small as possible, in effect pushing P (yi = 1) → 1. For a
negative example, (I(yi = 1) = 0), the functional gradient ∆i aims to improve the model such that
0 − P (yi = 1) is as small as possible, in effect pushing P (yi = 1) → 0. Thus, the gradient of
each training example xi is simply the adjustment required for the probabilities to match the true
observed labels yi. The functional gradient derived here has a similar form to the functional gradients in other relational tasks, such as boosting relational dependency networks (Natarajan et al., 2012) and Markov logic networks (Khot et al., 2011), to name a few.
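The potential of Equation 4.5 and its pointwise gradient ∆i can be sketched directly; the parameter containers below are hypothetical shapes, and the log-ratio in Equation 4.5 is computed as a difference of softplus terms.

```python
import math

def psi(features, theta):
    """ψ(y=1 | OF(x)) from Equation 4.5; theta = (d, c, W, U0, U1)."""
    d, c, W, U0, U1 = theta
    total = d
    for j in range(len(c)):
        a = sum(W[j][k] * fk for k, fk in enumerate(features))
        # log((1+e^{c+U1+a}) / (1+e^{c+U0+a})) = softplus(c+U1+a) - softplus(c+U0+a)
        total += math.log1p(math.exp(c[j] + U1[j] + a))
        total -= math.log1p(math.exp(c[j] + U0[j] + a))
    return total

def pointwise_gradient(label, features, theta):
    """∆i = I(yi = 1) − P(yi = 1 | OF(xi)), with P the sigmoid of ψ."""
    p = 1.0 / (1.0 + math.exp(-psi(features, theta)))
    return (1.0 if label == 1 else 0.0) - p
```

At an all-zero parameter setting, ψ = 0 and P = 0.5, so the gradient is +0.5 for a positive example and −0.5 for a negative one, matching the interpretation above.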
Figure 4.3: A general relational regression tree for lifted RBMs when learning a target predicate t(x). Each path from root to leaf is a compound feature (also a logical clause Clause_r) that enters the RBM as a hidden node h_r. The leaf node contains the weights θr = {d^r, c^r, W^r, U^r_0, U^r_1} of all edges introduced into the lifted RBM when this hidden node/discovered feature is introduced into the RBM structure.
4.3.2 Representation of Functional Gradients for LRBMs
Our goal now is to approximate the true functional gradient by fitting a regression function ψ̂(x) that minimizes the squared error over the pointwise gradients of all the individual training examples:

ψ̂(x) = argmin_ψ Σ^n_{i=1} (ψ(xi | OF(xi)) − ∆i)².   (4.6)
We consider learning the representation of ψ as a sum of relational regression trees. The key advantage is that a relational tree can be easily interpreted and explained. To learn a tree that models the functional gradients, we need to modify the typical tree learner. Specifically, the splitting criterion at each node is changed: to identify the next literal r(x) to add to the tree, we greedily search for the literal that minimizes the squared error (Equation 4.6).
For a tree-based representation, we employ a relational regression-tree (RRT) learner (Blockeel
and De Raedt, 1998) to learn a function to approximately fit the gradients on each example. If we
learn an RRT to fit ψ(xi | OF(xi)) in Equation (4.5), each path from the root to a leaf can be viewed
as a clause, and the leaf nodes correspond to an RBM that evaluates to the weight of the clause for
that training example. Figure 4.3 shows an RRT when learning a lifted RBM via gradient boosting
for some target predicate t(x). The node q(x) can be any subtree that has been learned thus far, and a new predicate r(x) has been identified as a viable candidate for splitting. On splitting, we obtain two new compound features, as evidenced by the two distinct paths from the root to the two new leaf nodes. These clauses (paths), along with their corresponding leaf nodes, identify a new structural component of the RBM, along with corresponding parameters:

(Clause1) θ1 : q(x) ∧ r(x) ⇒ t(x),
(Clause2) θ2 : q(x) ∧ ¬r(x) ⇒ t(x).
Note that the clause q(·) and the predicate r(·) are expressed generally, and their arguments are
denoted broadly as x. In practice, q(·) and r(·) can be of different arities and take any possible
entity types in the domain.
4.3.3 Learning Relational Regression Trees
Let us assume that we have learned a relational regression tree up to q(x) in Figure 4.3, and that we are now adding a literal r(x) to the tree at the left-most node of the subtree q(x). Let the feature corresponding to the left branch (Clause1) be f1(x) = I(q(x) ∧ r(x)); that is, feature f1(x) = 1 for all training examples x that end up at the leaf θ1 and zero otherwise. Similarly, let the feature corresponding to the right branch (Clause2) be f2(x) = I(q(x) ∧ ¬r(x)). The potential function ψ(yi = 1 | OF(xi)) can then be written using Equation 4.5 as:
ψ(yi = 1 | OF(xi)) = Π_{k=1,2} [d^k + log((1 + exp(c^k + U^k_1 + W^k fk(xi))) / (1 + exp(c^k + U^k_0 + W^k fk(xi))))]^{fk(xi)}.   (4.7)
In this expression, when a training example xi satisfies Clause1, it reaches leaf node θ1 and
consequently, we have f1(xi) = 1 and f2(xi) = 0. When a training example xi satisfies Clause2,
the converse is true and we have f1(xi) = 0 and f2(xi) = 1. Thus, only one term is active in
the expression above and delivers the potential corresponding to whether the training example xi
is classified to the left leaf θ1 or the right leaf θ2. We can now substitute this expression for the
potential in Equation 4.7 into the loss function (Equation 4.6). The loss function is used in two
ways to grow the RRTs:
1. First, we identify the next literal to add to the tree, r(x), by greedily searching for the atomic
domain predicate that minimizes the squared error. This is similar to the splitting criterion
used in other lifted gradient boosting models such as MLN-boosting (Khot et al., 2011).
2. Next, after splitting, we learn parameters for the newly introduced leaf nodes. That is, for each split of the tree at r(x), we learn θ1 = [d^1, c^1, W^1, U^1_0, U^1_1] for the left subtree and θ2 = [d^2, c^2, W^2, U^2_0, U^2_1] for the right subtree. We perform parameter learning via coordinate descent (Wright, 2015).
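The two steps above can be sketched for a single leaf: fit the leaf's parameters to the pointwise gradients and score the candidate split by the squared error of Equation 4.6. The numeric-gradient coordinate updates below are an illustrative stand-in for the dissertation's exact coordinate-descent rule.

```python
import math

def leaf_psi(theta, f=1.0):
    """Per-leaf potential from Equation 4.7 when this leaf's feature is active."""
    d, c, W, U0, U1 = theta
    a = W * f
    return d + math.log1p(math.exp(c + U1 + a)) - math.log1p(math.exp(c + U0 + a))

def fit_leaf(deltas, rounds=50, lr=0.05, eps=1e-5):
    """Cyclic coordinate descent on theta = [d, c, W, U0, U1], minimizing
    the squared error sum_i (psi - delta_i)^2 over this leaf's examples."""
    theta = [0.0] * 5
    for _ in range(rounds):
        for i in range(len(theta)):
            def sse(t):
                th = theta[:i] + [t] + theta[i + 1:]
                return sum((leaf_psi(th) - dlt) ** 2 for dlt in deltas)
            grad = (sse(theta[i] + eps) - sse(theta[i] - eps)) / (2 * eps)
            theta[i] -= lr * grad
    return theta
```

Fitting to the gradients [0.4, 0.6], for example, drives the leaf potential toward their mean, since a constant potential minimizes the squared error at the average of the targets.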
4.3.4 LRBM-Boost Algorithm
We now describe the LRBM-Boost algorithm (Algorithm 2) for learning the structure and parameters of an LRBM. The algorithm takes instantiated ground facts (Data) and training examples of the target T as input, and learns N regression trees that together fit the example gradients. The algorithm starts from the prior potential ψ0 as F0; to learn a new tree, it first generates the regression examples S = [(xi, yi), ∆i] (line 4), where each regression value ∆i = I − P is computed by performing inference over the previously learned trees. These regression examples S serve as input to the FITREGRESSIONTREE function (line 5), along with the maximum number of leaves L in the tree. The new tree Fn is then added to the set of existing trees (line 6). The final probability of the LRBM can be computed by performing inference on all N trees to obtain ψ.
The FITREGRESSIONTREE function (line 10) generates a relational regression tree with L leaf nodes. It starts with an empty tree and greedily adds one node at a time. To add the next node, it selects the current node N to expand as the one with the best score in the beam (line 14). The potential children C of this node N (line 15) are constructed by greedily considering and scoring candidate clauses, whose parameters are learned using coordinate descent. Once the best child c is determined, it is added as a leaf to the tree and the process is repeated.
Algorithm 2 LRBM-Boost: Relational FGB for Lifted RBMs

 1: function LRBM-BOOST(Data, T, N)
 2:   F0 = ψ0                                        ▷ set prior of potential function
 3:   for 1 ≤ n ≤ N do
 4:     S := GENERATEEXAMPLES(Fn−1, Data, T)         ▷ examples for next tree
 5:     Fn := FITREGRESSIONTREE(S, L, T)             ▷ learn regression tree
 6:     Fn = Fn + Fn−1                               ▷ add new tree to existing set
 7:   end for
 8:   P(yi = 1 | OF(xi)) ∝ exp(ψ(yi = 1 | OF(xi)))
 9: end function

10: function FITREGRESSIONTREE(S, L, T)
11:   Tree := CREATETREE(T(X))                       ▷ create empty Tree
12:   Beam := {Root(Tree)}
13:   while (i ≤ L) do                               ▷ until max clauses L is reached
14:     N := CURRENTNODETOEXPAND(Beam)
15:     C := GENERATEPOTENTIALCHILDREN(N)
16:     for each c in C do                           ▷ greedily search for best child
17:       [SL, ∆L] := EXAMPLESATISFACTION(N ∧ c, S)  ▷ left subtree
18:       Θc_L := COORDINATEDESCENT([SL, ∆L])        ▷ learn LRBM params
19:       [SR, ∆R] := EXAMPLESATISFACTION(N ∧ ¬c, S) ▷ right subtree
20:       Θc_R := COORDINATEDESCENT([SR, ∆R])        ▷ learn LRBM params
21:       scorec := COMPUTESSE(Θc_L, Θc_R, ∆L, ∆R)   ▷ using Eq. (4.6)
22:     end for
23:     c := argmin_c(scorec)
24:     ADDCHILD(Tree, N, c)
25:     INSERT(Beam, c.left, c.left.score)
26:     INSERT(Beam, c.right, c.right.score)
27:   end while
28:   return Tree
29: end function
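The outer loop of Algorithm 2 can be sketched as follows; `fit_tree` is a hypothetical stand-in for FITREGRESSIONTREE, and the toy `mean_stump` fits a constant to the residuals just to make the loop runnable end to end.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lrbm_boost(examples, n_trees, fit_tree):
    """Fit each new tree to the pointwise gradients
    ∆i = I(yi = 1) − P(yi = 1 | xi), then add it to the ensemble (line 6)."""
    trees = []
    def psi(x):                                  # F_n = F_{n-1} + new tree
        return sum(tree(x) for tree in trees)
    for _ in range(n_trees):
        residuals = [(x, (1.0 if y == 1 else 0.0) - sigmoid(psi(x)))
                     for x, y in examples]       # regression examples S (line 4)
        trees.append(fit_tree(residuals))        # line 5
    return lambda x: sigmoid(psi(x))             # P(yi = 1 | OF(xi)) (line 8)

def mean_stump(residuals):
    """Toy regression 'tree': a constant equal to the mean residual."""
    m = sum(r for _, r in residuals) / len(residuals)
    return lambda x: m
```

Even with this trivial tree learner, the boosted probability converges to the empirical positive rate of the training examples, illustrating how each round's tree corrects the previous ensemble's predictions.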
4.4 Experimental Section
We aim to answer the following questions in our experiments:
Q1 How does LRBM-Boost2 compare to other neuro-symbolic models?
Q2 How does LRBM-Boost compare to other relational functional gradient boosting models?
2https://github.com/navdeepkjohal/LRBM-Boost
Q3 Is an ensemble of weak relational regression trees more effective than a single strong rela-
tional regression tree for constructing Lifted RBMs?
Q4 Can we generate an interpretable lifted RBM from the ensemble of weak relational regres-
sion trees learned by LRBM-Boost?
4.4.1 Experimental setup
To answer these questions, we employ seven standard SRL data sets, six of which – UW-CSE,
IMDB, CORA, SPORTS, MUTAGENESIS, YEAST2 – have already been described in Chapter 3.
The target predicates to be predicted – AdvisedBy, WorkedUnder, SameVenue, TeamPlaysSport,
MoleAtm and Cites, respectively – are also the same as in Chapter 3. The seventh data set is
described below.
WEBKB (Mihalkova and Mooney, 2007) contains information about the webpages of students,
professors, courses, etc., from four universities. We aim to predict whether a person is the CourseTA
of a given course.
For all data sets, we generate positive and negative examples in a 1:2 ratio, perform 5-fold cross
validation for every method being compared, and report AUC-ROC and AUC-PR over the resulting
folds. For all baseline methods, we use the default settings provided by their respective authors.
For our model, we learn 20 RRTs, each with a maximum of 4 leaf nodes. The learning rate of
online coordinate descent was 0.05.
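As a concrete, purely propositional illustration of this protocol, the sketch below performs 5-fold cross validation with a 1:2 positive-to-negative ratio and reports AUC-ROC and AUC-PR per fold; the synthetic data and logistic-regression classifier are placeholders for the relational pipeline:

```python
# 5-fold cross validation reporting AUC-ROC and AUC-PR, mirroring the
# evaluation protocol; data and model are synthetic stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# weights=[2/3, 1/3] gives roughly a 1:2 positive:negative ratio
X, y = make_classification(n_samples=300, weights=[2 / 3, 1 / 3], random_state=0)
rocs, prs = [], []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    p = LogisticRegression(max_iter=1000).fit(X[tr], y[tr]).predict_proba(X[te])[:, 1]
    rocs.append(roc_auc_score(y[te], p))           # AUC-ROC on this fold
    prs.append(average_precision_score(y[te], p))  # AUC-PR on this fold
print(round(float(np.mean(rocs)), 3), round(float(np.mean(prs)), 3))
```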
4.4.2 Comparison of LRBM-Boost to other neuro-symbolic models
To answer Q1, we compare our model to two recent relational neural models. The first baseline is
Relational RBM (RRBM-C) (Kaur et al., 2017); as explained in the previous chapter, this approach uses
relational random walks to generate relational features that describe the structure of the domain.
In fact, it propositionalizes and aggregates counts on these relational random walks as features to
Figure 4.4: Comparing LRNN, RRBM-C, MLN-Boost and LRBM-Boost on AUC-ROC.
describe each training example. It should be noted that a key limitation of RRBM-C is that it can
only handle binary predicates; LRBM-Boost, on the other hand, can handle predicates of any arity.
The second baseline is Lifted Relational Neural Networks (LRNN) (Sourek et al., 2018). LRNN
mainly focuses on parameter optimization; the structure of the network is identified using a clause
learner: PROGOL (Muggleton, 1995). PROGOL generated four, eight, six, three, ten, five rules for
CORA, IMDB, MUTAGENESIS, SPORTS, UW-CSE and WEBKB respectively. As LRNN cannot
handle the temporal restrictions of YEAST2, we do not evaluate LRNN on it.
Figures 4.4 and 4.5 present the results of this comparison on AUC-ROC and AUC-PR. LRBM-
Boost is significantly better than LRNN for MUTAGENESIS and CORA on both AUC-ROC and
AUC-PR. Further, it also achieves better AUC-ROC and AUC-PR than LRNN on the SPORTS and UW-
CSE data sets. Compared to RRBM-C, LRBM-Boost performs better on SPORTS and WEBKB on
both AUC-ROC and AUC-PR. Also, our proposed model performs better on YEAST2 on AUC-
Figure 4.5: Comparing LRNN, RRBM-C, MLN-Boost and LRBM-Boost on AUC-PR.
ROC. Q1 can now be answered affirmatively: LRBM-Boost either performs comparably to or
outperforms state-of-the-art relational neural networks.
4.4.3 Comparison of LRBM-Boost to other relational gradient-boosting models
Since LRBM-Boost is a neuro-symbolic model as well as a relational boosting model, we next
compare it to two state-of-the-art relational functional gradient-boosting baselines: MLN-Boost
(Khot et al., 2011) and RDN-Boost (Natarajan et al., 2012). Figures 4.4 and 4.5 compare
LRBM-Boost to MLN-Boost. LRBM-Boost performs better than MLN-Boost for CORA and WE-
BKB on the AUC-ROC metric. Also, it performs better than MLN-Boost for IMDB, UW-CSE, SPORTS
and WEBKB on AUC-PR. For all other data sets, both models have comparable performance.
We compare LRBM-Boost to RDN-Boost in a separate experiment, owing to a key difference
in experimental setting: we do not convert the arity of predicates to binary; rather, we compare
RDN-Boost and LRBM-Boost while maintaining the original arity of all predicates.
Table 4.1: Comparison of LRBM-Boost and RDN-Boost.
Data Set    Target        Measure    LRBM-Boost    RDN-Boost
UW-CSE      AdvisedBy     AUC-ROC    0.9719        0.9731
                          AUC-PR     0.9158        0.9049
IMDB        WorkedUnder   AUC-ROC    0.9610        0.9499
                          AUC-PR     0.8789        0.8537
CORA        SameVenue     AUC-ROC    0.9469        0.8985
                          AUC-PR     0.9207        0.8451
WEBKB       CourseTA      AUC-ROC    0.6142        0.6057
                          AUC-PR     0.4553        0.4490
The results of this experiment on four domains are reported in Table 4.1. LRBM-Boost outperforms
RDN-Boost across the board, and substantially so on larger domains such as CORA.
These comparisons allow us to answer Q2 affirmatively: LRBM-Boost performs comparably or
outperforms state-of-the-art SRL boosting baselines.
4.4.4 Effectiveness of boosting relational ensembles
To understand the importance of boosting trees to construct an LRBM, we compared the perfor-
mance of the ensemble of relational trees learned by LRBM-Boost to a single relational tree,
similar to trees produced by the TILDE tree learner (Blockeel and De Raedt, 1998; Neville et al.,
2003). For the latter, we learn a large lifted tree (of depth 10), construct an RBM whose hidden
layer consists of every root-to-leaf path of this tree, and refer to it as LRBM-NoBoost.
Table 4.2 compares the performance of an ensemble (first row) vs. a single large tree (last
row). LRBM-Boost is statistically superior on SPORTS, YEAST2 and CORA on both AUC-ROC
and AUC-PR and is comparable on others. This asserts the efficacy of learning ensembles of
relational trees by LRBM-Boost rather than learning a single tree, affirmatively answering Q3.
Table 4.2: Comparison of (a) an ensemble of trees learned by LRBM-Boost, (b) an explainable Lifted RBM constructed from the ensemble of trees learned by LRBM-Boost and (c) learning a single, large, relational probability tree (LRBM-NoBoost).
Model              AUC    Sports       IMDB         UW-CSE       Yeast2       Cora         WebKB
Ensemble LRBM      ROC    0.78±0.03    0.95±0.05    0.98±0.02    0.77±0.02    0.86±0.14    0.63±0.05
                   PR     0.64±0.03    0.86±0.11    0.94±0.06    0.64±0.03    0.82±0.21    0.46±0.08
Explainable LRBM   ROC    0.75±0.01    0.95±0.05    0.95±0.04    0.65±0.05    0.80±0.19    0.61±0.13
                   PR     0.57±0.01    0.85±0.14    0.89±0.06    0.53±0.06    0.70±0.29    0.46±0.10
NoBoost LRBM       ROC    0.75±0.03    0.95±0.05    0.98±0.02    0.64±0.12    0.75±0.21    0.66±0.09
                   PR     0.61±0.01    0.86±0.11    0.94±0.05    0.50±0.14    0.61±0.30    0.48±0.07
4.4.5 Interpretability of LRBM-Boost
While Q1–Q3 can be answered quantitatively, Q4 requires a qualitative analysis. It should be
noted that boosted relational models (here, boosted LRBMs) learn and represent the underlying
relational model as a sum of relational trees. When performing inference, this ensemble of trees
is not converted to a large SRL model as it is far more efficient to aggregate the predictions of the
individual relational trees in the ensemble.
For LRBM-Boost, however, it is possible to convert the ensemble-of-trees representation into
a single LRBM. This step is typically performed to endow the model with interpretability, ex-
plainability or for relationship discovery. For LRBM-Boost, this procedure is not exact, and the
resulting single large LRBM is almost, but not exactly, equivalent to the ensemble of trees repre-
sentation. The procedure itself is fairly straightforward:
• learn a single RRT from the set of boosted RRTs (Craven and Shavlik, 1995) that make up
the LRBM, that is, we empirically learn a single RRT by overfitting it to the predictions of
the weak RRTs (Figure 4.6).
Figure 4.6: An example of a combined lifted tree learned from the ensemble of trees. To construct this tree, we compute the regression value of each training example by traversing through all the boosted trees. A single large tree is then overfit to this (modified) training set.
Figure 4.7: Lifted RBM obtained from the combined tree in Figure 4.6. Each path along the tree in that figure corresponds to a hidden node of the LRBM.
• transform this single RRT to a LRBM (Figure 4.7); each path from root to leaf is a conjunc-
tion of relational features and enters LRBM as a hidden node, with connections to all the
output nodes and to the input nodes corresponding to the predicates that appear in that path.
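The path-to-hidden-node construction can be sketched as follows; the tree encoding and predicate names are hypothetical, chosen only to illustrate how each root-to-leaf path becomes one hidden unit:

```python
# Map every root-to-leaf path of a (hypothetical) combined relational
# tree to one hidden node: the predicates tested along a path define
# the input connections of that hidden unit.
def paths(node, prefix=()):
    if isinstance(node, str):                      # leaf marker
        yield prefix
    else:
        pred, left, right = node                   # internal test node
        yield from paths(left, prefix + ((pred, True),))
        yield from paths(right, prefix + ((pred, False),))

# toy combined tree for a target such as WorkedUnder(P1, P2)
tree = ("DirectedBy(M, P2)",
        ("ActedIn(P1, M)", "leaf1", "leaf2"),
        "leaf3")

hidden = [{"conjunction": p, "inputs": sorted({pred for pred, _ in p})}
          for p in paths(tree)]
print(len(hidden))  # one hidden node per path
```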
Figure 4.8: Ensemble of trees learned during training of LRBM-Boost. The ensemble of trees is generated in the SPORTS domain, where predicates P, T, Z represent plays(sports, team), teamplaysagainstteam(team, team) and athleteplaysforteam(athlete, team) respectively, and the target R represents teamplayssport(team, sports).
Figure 4.9: Demonstration of the conversion of two lifted trees in Figure 4.8 to an LRBM. We create one hidden node for each path in each regression tree.
This construction leads to sparsity, as it allows only one hidden node to be activated for each
example. Of course, using clauses instead of trees, as with boosted MLNs (Khot et al., 2011), could
relax this sparsity as needed. For our current domains, this restriction does not significantly affect
performance, as seen in Table 4.2, which shows the quantitative results of comparing the explainable
LRBM with the original ensemble LRBM. There is no noticeable loss in performance, as the AUC
values decrease marginally, if at all.
A simpler approach to constructing an explainable LRBM is to skip aggregating the RRTs into a
large tree and directly map every path in every tree to a hidden node in the LRBM. For instance,
if the ensemble learned 20 balanced trees with 4 paths in each of them, the resulting LRBM has
80 lifted features. An example transformation from the two trees in Figure 4.8 is shown in Figure
4.9. Note that the corresponding LRBM has 10 hidden features, which are conjunctions along the
paths of the original trees. While in principle this results in an interpretable LRBM, it can also
produce a large number of hidden units, so pruning strategies need to be employed, a direction
that we will explore in the near future. In summary, our LRBM is effective and explainable when
compared to state-of-the-art approaches on several tasks.
4.4.6 Inference in a Lifted RBM
Figure 4.10: LRBM inference for Example 1.
The lifted RBM is a template that is grounded for each example during inference. We first
unify the example with the head of the clause (present at the output layer of the LRBM) to obtain a
partial grounding of the body of the clause. The full grounding is then obtained by unifying the
partially-ground clause with the evidence to find at least one instantiation of the body of the clause.
We illustrate the inference procedure for a Lifted RBM with three hidden nodes, each hidden
node corresponding to one of the rules (h1)–(h3).
Example 1 We are given facts: ActedIn(p1, m1), ActedIn(p1, m2), ActedIn(p2, m1), ActedIn
(p2, m2). The number of substitutions depends on the query. Let us assume that the query
is Collab(p1, p2) (did p1 and p2 collaborate?), which results in the partial substitution: θ =
{P1/p1, P2/p2}. The inference procedure will proceed as follows:
• The bodies of the clauses (h1)–(h3) are partially grounded using θ = {P1/p1, P2/p2}:

DirectedBy(M1, p1) ∧ InGenre(M1, G1) ∧ ActedIn(p2, M2) ∧ InGenre(M2, G2) ∧ ¬SameGenre(G1, G2)   (h1)
DirBy(M1, p1) ∧ ActedIn(P3, M1) ∧ SamePerson(P3, p2)   (h2)
ActedIn(p1, M) ∧ ActedIn(p2, M)   (h3)
• Next, since the facts do not contain any information about DirectedBy or SamePerson, h1
and h2 will not be satisfied.
• In order to prove the satisfiability of h3, we look at all the available facts and attempt to
unify each fact with the partially-grounded clause. Say we first unify ActedIn(p1, m1) with
h3, which gives us:

ActedIn(p1, m1), ActedIn(p2, M), (h3)

resulting in the grounding θ = {P1/p1, M/m1, P2/p2}. The second fact ActedIn(p1, m2)
does not unify with this partially-grounded clause. However, the third fact ActedIn(p2, m1)
unifies with h3 giving us a fully-grounded clause:
ActedIn(p1, m1), ActedIn(p2, m1). (h3)
The input nodes corresponding to the unified facts ActedIn(p1, m1) and ActedIn(p2, m1) are
activated. As soon as the clause is satisfied once, the model does not check for another
satisfaction; it terminates the search and returns true.
• The inputs are then propagated through the RBM, and the class output probabilities are
computed based on the RBM edge parameters/weights. The activation paths for inference
given this query and facts are shown in Figure 4.10.
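A minimal sketch of this satisfy-once search, using the facts and clause h3 of Example 1 (the encoding of literals and facts is our own simplification, not the actual LRBM-Boost code):

```python
# Minimal sketch of the satisfy-once search: find a single grounding of
# a partially-ground clause body against the facts, terminating at the
# first satisfying substitution (illustrated with h3 from Example 1).
def unify(literal, fact, theta):
    pred, args = literal
    fpred, fargs = fact
    if pred != fpred or len(args) != len(fargs):
        return None
    theta = dict(theta)
    for a, f in zip(args, fargs):
        if a[0].isupper():              # logical variable
            if theta.get(a, f) != f:    # bound to a different constant?
                return None
            theta[a] = f
        elif a != f:                    # constants must match exactly
            return None
    return theta

def satisfy(body, facts, theta):
    if not body:
        return theta                    # every literal ground: success
    for fact in facts:
        t = unify(body[0], fact, theta)
        if t is not None:
            r = satisfy(body[1:], facts, t)
            if r is not None:
                return r                # stop at the first satisfaction
    return None

facts = [("ActedIn", ("p1", "m1")), ("ActedIn", ("p1", "m2")),
         ("ActedIn", ("p2", "m1")), ("ActedIn", ("p2", "m2"))]
h3 = [("ActedIn", ("p1", "M")), ("ActedIn", ("p2", "M"))]
print(satisfy(h3, facts, {}))  # → {'M': 'm1'}
```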
Example 2 We are given facts: DirectedBy(m1, p1), InGenre(m1, g1), ActedIn (p2, m2),
InGenre(m2, g2), DirectedBy(m01, p01), ActedIn(p03, m01), SamePerson(p03, p02). Recall
that the number of substitutions depends on the query. Let us assume that the query is Collab(p01,
p02) (did p01 and p02 collaborate?), which results in the partial substitution θ = {P1/p01, P2/p02}.
The inference procedure will proceed as follows:
• The bodies of the clauses (h1)–(h3) are partially grounded using θ = {P1/p01, P2/p02}:

DirectedBy(M1, p01) ∧ InGenre(M1, G1) ∧ ActedIn(p02, M2) ∧ InGenre(M2, G2) ∧ ¬SameGenre(G1, G2)   (h1)
DirBy(M1, p01) ∧ ActedIn(P3, M1) ∧ SamePerson(P3, p02)   (h2)
ActedIn(p01, M) ∧ ActedIn(p02, M)   (h3)
• Unifying the partially-grounded clauses with the facts, h1 and h3 will not be satisfied.
However, unification yields one fully-grounded instance of h2:
DirBy(m01, p01) ∧ ActedIn(p03, m01) ∧ SamePerson(p03, p02), (h2)
which has the substitution θ = {P1/p01, M1/m01, P2/p02, P3/p03}. As before, once a satisfied
grounding is obtained, the search is terminated.
• The RBM is unrolled as in Example 1, and the appropriate facts that appear in this ground-
ing are activated in the input layer. The prediction is obtained by propagating these inputs
through the network.
4.5 Conclusion
We presented the first algorithm for learning the structure of a lifted RBM from data. Motivated
by the success of gradient boosting, our method learns a set of RRTs via boosting and then
transforms them into a lifted RBM. The advantage of this approach is that it yields a fully lifted
model that is not propositionalized using any standard approaches. We also demonstrated how
to induce a single explainable RBM from the ensemble of trees. Experimental evaluation on
several data sets demonstrated the efficacy, effectiveness and explainability of the proposed
approach.
CHAPTER 5
NEURAL NETWORKS WITH RELATIONAL PARAMETER TYING
Although the two models proposed in Chapters 3 and 4 successfully lift RBMs to relational
domains, a potential bottleneck is that both are centered on one specific type of connectionist
model: the Boltzmann machine. In this chapter, we propose a general relational neural network
architecture (NNRPT) that is independent of any specific probability distribution (Kaur et al.,
2019).
5.1 Introduction
While successful, deep networks have a few important limitations. Apart from the key issue of
interpretability, the other major limitation is the requirement of flat inputs (vectors, matrices,
tensors), which limits applications to tabular, propositional representations. On the other hand,
symbolic and structured representations (Getoor and Taskar, 2007; De Raedt et al., 2016; Getoor
et al., 2001; Richardson and Domingos, 2006; Bach et al., 2017) have the advantage of being
interpretable, while also supporting rich representations that allow for learning and reasoning with
multiple levels of abstraction. This representability allows them to model complex data structures
such as graphs far more easily and interpretably than basic propositional representations. While
expressive, these models do not incorporate or discover latent relationships between features as
effectively as deep networks.
Consequently, there has been focus on achieving the dream team of logical and statistical learn-
ing methods such as relational neural networks (Kazemi and Poole, 2018; Sourek et al., 2016).
While specific architectures differ, these methods generally employ hand-coded relational rules
or Inductive Logic Programming (Lavrac and Dzeroski, 1993) to identify the domain’s structural
rules; these rules are then used with the observed data to unroll and learn a neural network. We
improve upon these methods in two specific ways: (1) we employ a recently successful rule learner
to automatically extract interpretable rules that are then employed as the hidden layer of the neural
network; (2) we exploit the notion of parameter tying from statistical relational learning, which
allows multiple instances of the same rule to share the same parameter. These two extensions
significantly improve the adaptation of neural networks (NNs) to relational data.
We employ Relational Random Walks (Lao and Cohen, 2010) to extract relational rules from a
database, which are then used as the first layer of the NN. These random walks have the advantages
of being learned from data (rather than laboriously hand-coded) and of being interpretable (as walks
are rules over the database schema). Given evidence (facts), relational random walks are instantiated
(grounded); parameter tying ensures that groundings of the same random walk share the same
parameters, so that far fewer network parameters need to be learned during training.
For combining outputs from different groundings of the same clause, we employ combina-
tion functions (Natarajan et al., 2008; Jaeger, 2007). For instance, given a rule: Professor(P),
Author(P, U), Author(S, U), Student(S), the ana-bob Professor-Student pair could have co-
authored 6 papers, while the cam-dan pair could have coauthored 10 publications (U). Combi-
nation functions are a natural way to compare such relational features arising from rules. Our
network handles this in two steps: first, by ensuring that all instances (papers) of a particular
Professor− Student pair share the same weights. Second, by combining predictions from each
of these instances (papers) using a combination function. We explore the use of the Average
combination function. Once the network weights are appropriately constrained by parameter tying and
combination functions, they can be learned using standard techniques.
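The interaction of parameter tying and the Average combining function can be sketched numerically as follows (the weights, activations and grounding counts are arbitrary illustrative values, not learned quantities):

```python
# Sketch of rule-based parameter tying with the Average combining
# function: all groundings of lifted rule j share one weight w[j], and
# the rule neuron averages over its grounding activations.
import numpy as np

rng = np.random.default_rng(0)
M = 3                                    # number of lifted rules
w = rng.normal(size=M)                   # one tied weight per rule
u = rng.normal(size=M)                   # rule-to-output weights

# groundings[j]: 0/1 satisfaction of each instantiation of rule j
groundings = [np.array([1.0, 0.0, 1.0]),           # rule 0: 3 instances
              np.array([1.0]),                     # rule 1: 1 instance
              np.array([0.0, 1.0, 1.0, 1.0])]      # rule 2: 4 instances

rule_out = np.array([np.tanh(w[j] * g).mean()      # Average combination
                     for j, g in enumerate(groundings)])
prob = 1.0 / (1.0 + np.exp(-float(u @ rule_out)))  # output neuron
print(len(w) + len(u), round(prob, 3))
```

Note that the number of learnable parameters is 2M, independent of how many groundings each rule has; that is precisely the saving that parameter tying provides.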
We make the following contributions: (1) we learn a NN that can be fully trained from data
and with no significant engineering, unlike previous approaches; (2) we combine the successful
paradigms of relational random walks and parameter tying from SRL methods; this allows the
resulting NN to faithfully model relational data while being fully learnable; (3) we evaluate the
proposed approach against recent relational NN approaches and demonstrate its efficacy.
The rest of the chapter is organized as follows: we review the related work in Section 5.2
followed by discussing the proposed architecture in Section 5.3. This is followed by experimental
details in Section 5.4. We finally conclude in Section 5.6 after explaining connection between our
model and convolutional neural network in Section 5.5.
5.2 Related Work
The related work in this chapter is categorized into four classes, each of which is discussed in
the following Subsections 5.2.1–5.2.4.
5.2.1 Lifted Relational Neural Networks
Our work is closest to Lifted Relational Neural Networks (LRNN) (Sourek et al., 2016) in terms
of the architecture. LRNN uses expert hand-crafted relational rules as input, which are then in-
stantiated (based on data) and rolled out as a ground network. While at a high-level, our approach
appears similar to the LRNN framework, there are significant differences. First, while Sourek et
al. exploit tied parameters across examples within the same rule, there is no parameter tying across
multiple instances; our model, however, ensures parameter tying across multiple ground instances of
a rule (in our case, a relational random walk). Second, since they adopt a fuzzy notion, their
system supports weighted facts (called ground atoms in the logic literature). We take a more standard
approach, and our observations are Boolean. Third, while this difference may appear limiting
in our case, note that it leads to a reduction in the number of network weight parameters.
Sourek et al. (2016) have extended their work to learn network structure using predicate
invention (Sourek et al., 2017); our work instead learns relational random walks as rules for the network
structure. As we show in our experiments, NNs can not only easily handle such a large number of
random walks, but can also use them effectively as a bag of weakly predictive intermediate layers
capturing local features. This allows for learning a more robust model than the induced rules,
which take a more global view of the domain.
Another recent approach is due to Kazemi and Poole (2018), as discussed in Section 1.1, who
proposed a relational neural network by adding hidden layers to their Relational Logistic Regression
(Kazemi et al., 2014) model. A key limitation of their work is that it is restricted to unary
relation predictions, that is, it can only predict attributes of objects rather than relations between
objects. In contrast, ours is a general framework that can be used to predict relations between objects.
Some of the earliest neuro-symbolic systems, such as KBANN (Towell et al., 1990), date back to
the early 90s; KBANN also rolls out its network architecture from rules, though it supports only
propositional rules. Current work, including ours, instead explores relational rules which serve
as templates to roll out more complex architectures. Other recent approaches such as CILP++
(Franca et al., 2014) and Deep Relational Machines (Lodhi, 2013) incorporate relational information
as network layers. However, such models propositionalize relational data into flat feature
vectors and hence cannot be seen as truly relational models. A rather distinctive approach in this
vein is due to Hu et al. (2016), where two independent networks incorporating rules and data are
trained together. Finally, NNs have also been trained to approximate ILP clause evaluation (Di-
Maio and Shavlik, 2004), perform SLD-resolution in first-order logic (Komendantskaya, 2007),
and approximate entailment operators in propositional logic (Evans et al., 2018).
5.2.2 Relational Random Walks
The Path Ranking Algorithm (Lao and Cohen, 2010) is a key framework in which a combination of
random walks replaces exhaustive search in order to answer queries. Recently, Das et al. (2017)
considered random walks between query entities to perform composition of embeddings of rela-
tions on each walk with recurrent neural networks. DeepWalks (Perozzi et al., 2014) performs
random walks on graphs by treating each node as a word, which results in learning embeddings
for each node of the graph. Kaur et al. (2017), discussed in Chapter 3, use relational random walks
to generate count and existential features to train a relational restricted Boltzmann machine (Larochelle
and Bengio, 2008). This feature transformation induces propositionalization that could potentially
result in loss of information, as we show in our experiments.
5.2.3 Tensor Based Models
As already discussed in Section 4.2, recently, several tensor-based models (Nickel et al., 2011;
Bordes et al., 2013; Socher et al., 2013; Bordes et al., 2012; Wang et al., 2014b) have been proposed
to learn embeddings of objects and relations. Such models have been very effective for large-
scale knowledge-base construction. However, they are computationally expensive as they learn
parameters for each object and relation in the knowledge base. Furthermore, the embedding into
some ambient vector space makes the models more difficult to interpret. Though rule distillation
can yield human-readable rules (Yang et al., 2015), it is another computationally intensive post-
processing step, which limits the size of the interpreted rules. We will explore these models further
in Part II of the dissertation.
5.2.4 Other Models
Several NNs have been utilized with relational database schemas (Blockeel and Uwents, 2004;
Ramon and Raedt, 2000). These models differ in how they handle 1-to-N joins, cyclicity, and
indirect relationships between relations. However, they all learn one network per relation, which
makes them computationally expensive. In the same vein, graph-based models take graph structure
into consideration during training (Pham et al., 2017; Niepert et al., 2016; Scarselli et al., 2009),
as already discussed in Section 4.2. Finally, with the rapid growth of deep learning, relational
counterparts of most existing connectionist models have also been proposed (Schlichtkrull et al.,
2018; Palm et al., 2018; Wang et al., 2015; Zeng et al., 2014).
5.3 Neural Networks with Relational Parameter Tying: The proposed approach
We first introduce some notation for relational logic, which is used for relational representation,
with the domain being represented using constants, variables and predicates. We adopt the follow-
ing conventions: (1) constants used to represent entities in the domain are written in lower-case
(e.g., ana, bob); (2) variables and entity types are capitalized (e.g., Student, Professor); and
(3) relations and predicate symbols between entities and attributes are represented as Q(·, ·). A
grounding is a predicate applied to a tuple of terms (i.e., either a full or partial instantiation);
e.g., AdvisedBy(Student, ana) is a partial instantiation.
Rules are constructed from atoms using logical connectives (∧, ∨) and quantifiers (∃, ∀). Due
to the use of relational random walks, the relational rules that we employ are universally quantified
conjunctions of the form h ⇐ b1 ∧ . . . ∧ bℓ, where the head h is the target of prediction and the body
b1 ∧ . . . ∧ bℓ corresponds to the conditions that make up the rule (that is, each literal bi in the body is
a predicate Q(·, ·)). We do not consider negations in this work.
An example rule could be AdvisedBy(S, P) ⇐ Professor(P) ∧ WorksIn(P, T) ∧ PartOf(T,
S) ∧ Student(S). This rule states that if a Student is a part of the project that the Professor
works on, then the Student is advised by that Professor. The body of the rule is learned as a
random walk that starts with Professor and ends with Student. Such a random walk represents
a chain of relations that could possibly connect a Professor to a Student and is a relational
feature that could help in the prediction. The rule head is the target that we are interested in
predicting. Since these rules are essentially “soft” rules, we can also associate clauses with weights,
i.e., weighted rules: (R, w).
A relational neural network N is a set of M weighted rules describing interactions in the domain,
{(Rj, wj) : j = 1, . . . , M}. We are given a set of atomic facts F known to be true, and labeled relational
training examples {(xi, yi) : i = 1, . . . , ℓ}. In general, labels yi can take multiple values corresponding to a
multi-class problem. We seek to learn a relational neural network model N ≡ {(Rj, wj) : j = 1, . . . , M} to
predict a Target relation, given relational examples x, that is: y = Target(x).
Given: Set of instances F , Target relation, relational data set (x, y) ∈ D;
Construct (structure learning): Rj, relational random walk rules (relational features describing
the network structure of N);

Train (parameter learning): wj, rule weights learned via gradient descent with rule-based parameter
tying to identify a sparse set of network weights of N.
Example. The movie domain contains the entity types (variables) Person(P), Movie(M) and
Genre(G). In addition to this, there are relations (features): Directed(P, M), ActedIn(P, G) and
InGenre(M, G). The domain also has relations for entity resolution: SamePerson(P1, P2) and
SameGenre(G1, G2). The task is to predict if P1 worked under P2, with the target predicate (label):
WorkedUnder(P1, P2).
5.3.1 Generating Lifted Random Walks
The core component of a neural network model is the architecture, which determines how the var-
ious neurons are connected to each other, and ultimately how all the input features interact with
each other. In a relational neural network, the architecture is determined by the domain structure,
or the set of relational rules that determines how various relations, entities and attributes interact in
the domain as shown earlier with the AdvisedBy example. While previous approaches employed
carefully hand-crafted rules, we, instead, use relational random walks to define the network archi-
tecture and model the local relational structure of the domain. A similar approach was also used
by us in Chapter 3 (Kaur et al., 2017), though there the random walk features were used to instantiate a
restricted Boltzmann machine, which has a limited architecture; moreover, that work is not lifted,
since it instantiates the entire network before learning.
Relational data is often represented using a lifted graph, which defines the domain's schema; in
such a representation, a relation Predicate(Type1, Type2) is a predicate edge between two type
nodes: Type1 --Predicate--> Type2. A relational random walk through a graph, as discussed in
Section 2.2, is a chain of such edges corresponding to a conjunction of predicates. For a random
walk to be semantically sound, we must ensure that the input type (argument domain) of the
(i + 1)-th predicate is the same as the output type (argument range) of the i-th predicate.
Example (continued). The body of the rule

ActedIn(P1, G1) ∧ SameGenre(G1, G2) ∧ ActedIn⁻¹(G2, P2) ∧ SamePerson(P2, P3) ∧ ActedIn⁻¹(P3, M) ∧ Directed(M, P4) ⇒ WorkedUnder(P1, P4)

can be represented graphically as

P1 --ActedIn--> G1 --SameGenre--> G2 --ActedIn⁻¹--> P2 --SamePerson--> P3 --ActedIn⁻¹--> M --Directed--> P4.

This is a lifted random walk between the entities P1 → P4 in the target predicate WorkedUnder(P1, P4).
It is semantically sound as it is possible to chain the second argument of each predicate to the first
argument of the succeeding predicate. This walk also contains an inverse predicate ActedIn⁻¹,
which is distinct from ActedIn (since the argument types are reversed).
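This type-chaining constraint can be sketched over a toy movie schema; the schema, the inverse-edge encoding and the walk generator below are illustrative simplifications, not the path-constrained random walk implementation of Lao and Cohen:

```python
# Sketch of generating type-consistent ("semantically sound") random
# walks over a toy schema: a predicate edge (or its inverse) extends a
# walk only if its input type matches the current node's type.
import random

schema = [("ActedIn", "Person", "Movie"),
          ("Directed", "Person", "Movie"),
          ("InGenre", "Movie", "Genre")]
# inverse predicates let walks traverse edges in the reverse direction
edges = schema + [(p + "^-1", t2, t1) for p, t1, t2 in schema]

def random_walk(start, end, max_len, rng):
    walk, cur = [], start
    for _ in range(max_len):
        options = [e for e in edges if e[1] == cur]  # type-sound next edges
        if not options:
            return None
        pred, _, cur = rng.choice(options)
        walk.append(pred)
        if cur == end:
            return walk
    return None                                      # target type not reached

rng = random.Random(0)
walks = set()
for _ in range(200):
    w = random_walk("Person", "Genre", 4, rng)
    if w:
        walks.add(tuple(w))
print(sorted(walks))
```

Every walk returned necessarily ends with an edge into the target type, so all generated chains are semantically sound by construction.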
We use the path-constrained random walks approach (Lao and Cohen, 2010) to generate M lifted
random walks Rj, j = 1, . . . , M. These random walks form the backbone of the lifted neural
network, as they are templates for various feature combinations in the domain. They can also be
interpreted as domain rules as they impart localized structure to the domain model, that is, they
provide a qualitative description of the domain. When these rules, or lifted random walks have
weights associated with them, we are then able to endow the rules with a quantitative influence
on the target predicate. We now describe a novel approach to network instantiation using these
random-walk-based relational features. A key component of the proposed instantiation is rule-based
parameter tying, which reduces the number of network parameters to be learned significantly, while
effectively maintaining the quantitative influences as described by the relational random walks.
Figure 5.1: The relational neural network is unrolled in three stages, ensuring that the output is a function of the facts through two hidden layers: the combining rules layer (with lifted random walks) and the grounding layer (with instantiated random walks). Weights are tied between the input and grounding layers based on which fact/feature ultimately contributes to which rule in the combining rules layer.
5.3.2 Network Instantiation
The relational random walks (Rj) generated in the previous subsection are the relational features
of the lifted relational neural network, N . Our goal is to unroll and ground the network with
several intermediate layers that capture the relationships expressed by the random walks. A key
difference in network construction between our proposed work and recent approaches such as that
of Sourek et al. (2018) is that we do not perform an exhaustive grounding to generate all possible
instances before constructing the network. Instead, we ground only as needed, leading to a much
more compact network. We unroll the network in the following manner (cf. Figure 5.1).
Output Layer: For the Target, which is also the head h in all the rules Rj, introduce an output
neuron called the target neuron, Ah. With one-hot encoding of the target labels, this architecture
can handle multi-class problems. The target neuron uses the softmax activation function. Without
loss of generality, we describe the rest of the network unrolling assuming a single output neuron.
Combining Rules Layer: The target neuron is connected to M lifted rule neurons, each corre-
sponding to one of the lifted relational random walks, (Rj, wj). Each rule Rj is a conjunction of
predicates defined by random walks:
Q^j_1(X, ·) ∧ · · · ∧ Q^j_L(·, Z) ⇒ Target(X, Z),  j = 1, . . . , M,  (5.1)
and corresponds to the lifted rule neuron Aj . This layer of neurons is fully connected to the output
layer to ensure that all the lifted random walks (that capture the domain structure) influence the
output. The extent of their influence is determined by learnable weights, uj between Aj and the
output neuron Ah.
In Figure 5.1, we see that the rule neuron Aj is connected to the neurons Aji; these neurons
correspond to Nj instantiations of the random-walk Rj. The lifted rule neuron Aj aims to combine
the influence of the groundings/instantiations of the random-walk feature Rj that are true in the
evidence. Thus, each lifted rule neuron can also be viewed as a rule combination neuron. The
activation function of a rule combination neuron can be any aggregator or combining rule (Natarajan
et al., 2008). This can include value aggregators, such as weighted mean or max, or distribution
aggregators (if the inputs to this layer are probabilities), such as Noisy-Or. Many such aggregators
can be incorporated into the combining rules layer with appropriate weights (vji) and activation
functions of the rule neurons. For instance, combining rule instantiations out(Aji) with a weighted
mean will require learning vji, with the nodes using unit functions for activation. The formulation
of this layer is much more general and subsumes the approach of Sourek et al. (2018), which uses
a max combination layer.
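As a concrete sketch of the combining rules layer, the aggregators named above can be written as simple functions over the outputs of a rule's ground instantiations. This is an illustrative stand-alone snippet, not the NNRPT implementation:

```python
import numpy as np

def combine_rule_instantiations(ground_outputs, combining_rule="average"):
    """Aggregate the outputs out(A_ji) of the N_j ground rule neurons of one lifted rule R_j."""
    x = np.asarray(ground_outputs, dtype=float)
    if combining_rule == "average":      # weighted mean with v_ji = 1/N_j, unit activations
        return x.mean()
    if combining_rule == "max":          # max combining rule (as used by LRNN)
        return x.max()
    if combining_rule == "noisy_or":     # distribution aggregator; inputs must be probabilities
        return 1.0 - np.prod(1.0 - x)
    raise ValueError(f"unknown combining rule: {combining_rule}")

print(combine_rule_instantiations([1.0, 0.0], "average"))   # 0.5
print(combine_rule_instantiations([0.5, 0.5], "noisy_or"))  # 0.75
```

Swapping the aggregator only changes this one function, which is what makes the combining rules layer strictly more general than a fixed max layer.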
Grounding Layer: For each instantiated (ground) random walk Rjθi, i = 1, . . . , Nj , we introduce
a ground rule neuron, Aji. This ground rule neuron represents the i-th instantiation (grounding) of
the body of the j-th rule, Rjθi: Q^j_1θi ∧ · · · ∧ Q^j_Lθi (cf. Equation 5.1). The activation function of a
ground rule neuron is a logical AND (∧); it is only activated when all its constituent inputs are true
(that is, only when the entire instantiation is true in the evidence).
This requires all the constituent facts Q^j_1θi, . . . , Q^j_Lθi to be in the evidence. Thus, the (j, i)-th
ground rule neuron is connected to all the fact neurons that appear in its corresponding instantiated
rule body. A key novelty of our approach is regarding relational parameter tying: the weights
of connections between the fact and grounding layers are tied by the rule these facts appear in
together. This is described in detail further below.
Input Layer: Each instantiated (grounded) predicate that appears as a part of an instantiated rule
body is a fact, that is, Q^j_kθi ∈ F. For each such instantiated fact, we create a fact neuron Af,
ensuring that each unique fact in evidence has only one single neuron associated with it. Every
example is a collection of facts, that is, example xi ≡ Fi ⊂ F . Thus, an example is input into the
system by simply activating its constituent facts in the input layer.
Relational Parameter Tying: The key feature of this construction is that we
employ rule-based parameter tying for the weights between the grounding layer and the input/facts
layer. Parameter tying ensures that instances corresponding to an example all share the same
weight wj if they occur in the same lifted rule Rj. The shared weights wj are propagated through
the network in a bottom-up fashion, ensuring that weights in the succeeding hidden layers are
influenced by them.
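A minimal sketch of the parameter-count effect of rule-based tying, using a toy setup borrowed from the running example; the counts and the forward-pass helper are illustrative assumptions, not the actual NNRPT code:

```python
import numpy as np

# Hypothetical setup: rule j has N_j groundings, each with L_j body predicates.
rule_groundings = {0: 2, 1: 1}   # N_j as in the running example (R1 has 2, R2 has 1)
rule_lengths    = {0: 2, 1: 3}   # L_j: number of body facts per rule

# With rule-based tying, ALL fact->grounding edges of rule j share one weight w_j.
w = np.random.randn(len(rule_groundings))   # one tied weight per lifted rule

tied_params   = len(rule_groundings)
untied_params = sum(n * rule_lengths[j] for j, n in rule_groundings.items())
print(tied_params, untied_params)   # 2 tied weights vs 7 untied edge weights

def grounding_layer_output(fact_truth, j, i):
    """out(A_ji): logical AND over the facts of grounding i of rule j, scaled by
    the tied weight w_j (a sketch, not the exact NNRPT forward pass)."""
    return w[j] * float(all(fact_truth[(j, i, k)] for k in range(rule_lengths[j])))
```

Even in this tiny example, tying collapses seven edge weights into two rule weights; the gap grows with the number of groundings per rule.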
Our approach to parameter tying is in sharp contrast to that of Sourek et al. (2018), who learn
the weights of the network edges between the output layer and the combining rules layer. Fur-
thermore, they also use fuzzy facts (weighted instances), whereas in our case, the facts/instances
are Boolean, though their edge weights are tied. This approach also differs from our approach in
Chapter 3 which also used relational random walks. From a parametric standpoint, Chapter 3 used
relational random walks as features for a restricted Boltzmann machine, where the instance neu-
rons and the rule neurons form a bipartite graph. Thus, the RRBM formulation has significantly
more edges, and commensurately many more parameters to optimize during learning.
Figure 5.2: Example: unrolling the network with relational parameter tying.
Example (continued, see Figure 5.2). Consider two lifted random walks (R1, w1) and (R2, w2) for
the target predicate WorkedUnder(P1, P2)
WorkedUnder(P1, P2)⇐ActedIn(P1, M) ∧ Directed−1(M, P2),
WorkedUnder(P1, P2)⇐SamePerson(P1, P3) ∧ ActedIn(P3, M) ∧ Directed−1(M, P2).
Note that while the inverse predicate Directed−1(M, P) is syntactically different from Directed(P, M)
(the argument order is reversed), they are semantically the same. The output layer consists of a
single neuron Ah corresponding to the binary target WorkedUnder. The lifted rule layer (also
known as combining rules layer) has two lifted rule nodes A1 corresponding to rule R1 and A2
corresponding to rule R2. These rule nodes combine inputs corresponding to instantiations that
are true in the evidence. The network is unrolled based on the specific training example, for
instance: WorkedUnder(Leo, Marty). For this example, the rule R1 has two instantiations that
are true in the evidence. Then, we introduce a ground rule node for each such instantiation:
A11 :ActedIn(Leo, “The Departed”) ∧ Directed−1(“The Departed”, Marty),
A12 :ActedIn(Leo, “The Aviator”) ∧ Directed−1(“The Aviator”, Marty).
The rule R2 has only one instantiation, and consequently only one node:
A21 :SamePerson(Leo, Leonardo) ∧ ActedIn(Leo, “The Departed”)
∧ Directed−1(“The Departed”, Marty).
The grounding layer consists of ground rule nodes corresponding to instantiations of rules that
are true in the evidence. The edges Aji → Aj have weights vji that depend on the combining rule
implemented in Aj. In this example, the combining rule is average, so we have v11 = v12 = 1/2
and v21 = 1. The input layer consists of the atomic facts in evidence: f ∈ F. The fact nodes
ActedIn(Leo, “The Aviator”) and Directed−1(“The Aviator”, Marty) appear in the ground-
ing R1θ2 and are connected to the corresponding ground rule neuron A12. Finally, parameters are
tied on the edges between the facts layer and the grounding layer. This ensures that all facts that
ultimately contribute to a rule are pooled together, which increases the influence of the rule during
weight learning. This ensures that a rule that holds strongly in the evidence gets a higher weight.
Once the network Nθ is instantiated, the weights wj and uj can be learned using standard
techniques such as backpropagation. We denote our approach Neural Networks with Relational
Parameter Tying (NNRPT). The tied parameters incorporate the structure captured by the relational
features (lifted random walks), leading to a network with significantly fewer weights, while also
endowing it with semantic interpretability regarding the discriminative power of the relational
features. We now demonstrate the importance of parameter tying and the use of relational random
walks as compared to previous frameworks.
5.4 Experiments
Our empirical evaluation aims to answer the following questions explicitly1: Q1: How does NNRPT
compare to state-of-the-art SRL models, i.e., what is the value of learning a neural network over
standard models? Q2: How does NNRPT compare to propositionalization models, i.e., what is the need
1https://github.com/navdeepkjohal/NNRPT
for parameterization of standard neural networks? Q3: How does NNRPT compare to other neuro-
symbolic models in literature?
Table 5.1: Data sets used in our experiments to answer Q1–Q3. The last column shows the number of sampled groundings of random walks per example for NNRPT.

Domain        Target          #Facts  #Pos  #Neg  #RW   #Samp/RW
UW-CSE        advisedBy       2817    90    180   2500  1000
MUTAGENESIS   MoleAtm         29986   1000  2000  100   100
CORA          SameVenue       31086   2331  4662  100   100
IMDB          WorkedUnder     914     305   710   80    -
SPORTS        TeamPlaysSport  7824    200   400   200   100
5.4.1 Data Sets
We use six standard data sets to evaluate our algorithm (see Table 5.1). We experiment with UW-
CSE, IMDB, CORA, MUTAGENESIS, and SPORTS datasets, details of which are already explained
in Section 3.5 in Chapter 3. We predict AdvisedBy relationship between a professor and a student
in UW-CSE, whether an actor has WorkedUnder a director in IMDB, if one venue is SameVenue as
another in CORA, and which sport a particular team plays (TeamPlaysSport) in SPORTS. We used
two variants of MUTAGENESIS in this work: one to answer Q1 and the other to answer Q3. For Q1,
we formulated all atom and molecule properties as binary predicates, and performed relation pre-
diction of whether an atom is a constituent of a molecule or not (MoleAtm(AtomID, MolID)). For
Q3, we considered various properties of atoms and molecules as unary predicates (as described in
the experimental framework of LRNN, (Sourek et al., 2018)) and performed the binary classification
of whether a compound is mutagenetic or not.
The last dataset in our experiments is the Predictive Toxicology Challenge (PTC, (Helma et al.,
2001)) dataset. It further consists of four data sets where the aim in each is to predict the carcino-
genicity of a chemical based on its properties, constituent atoms and properties of the constituent
atoms. The true toxicity labels of the chemicals were generated by exposure of female rats (fr),
female mice (fm), male rats (mr) and male mice (mm) to these chemicals. This resulted in four data
sets, which we used to devise an experiment specifically to answer Q3.
5.4.2 Baselines and Experimental Details
To answer Q1, we compare NNRPT with recent state-of-the-art relational gradient-boosting
methods, RDN-Boost (Natarajan et al., 2012) and MLN-Boost (Khot et al., 2011), and with relational
restricted Boltzmann machines, RRBM-E and RRBM-C (Kaur et al., 2017). As the random walks chain
binary predicates in our model, we convert unary and ternary predicates into binary predicates
for all data sets. Further, to maintain consistency in experimentation, we use the same resulting
predicates across all our baselines as well. We run RDN-Boost and MLN-Boost with their default
settings and learn 20 trees for each model. Also, we train RRBM-E and RRBM-C according to the
settings recommended in Chapter 3.
For NNRPT, we generate random walks by considering each predicate and its inverse to be two
distinct predicates. Also, we avoid loops in the random walks by enforcing sanity constraints on the
random walk generation. We consider 100 random walks for MUTAGENESIS, CORA, 80 random
walks for IMDB, 200 random walks for SPORTS and 2500 random walks for UW-CSE as suggested
in Chapter 3 (Kaur et al., 2017) (see Table 5.1). Since we use a large number of random walks,
exhaustive grounding becomes prohibitively expensive. To overcome this, we sample groundings
for each random walk for large data sets. Specifically, we sample 100 groundings per random
walk per example for CORA, SPORTS, MUTAGENESIS, and 1000 groundings per random walk per
example for UW-CSE (see Table 5.1).
For all experiments, we set the positive-to-negative example ratio to 1:2 for training, set
the combining function to average, and perform 5-fold cross-validation. For NNRPT, we set the
learning rate to 0.05, the batch size to 1, and the number of epochs to 1. We train our model with
L1-regularized AdaGrad (Duchi et al., 2011). Since these are relational data sets where the data is
skewed, AUC-PR and AUC-ROC are better measures than likelihood and accuracy.
To answer Q2, we generated flat feature vectors by Bottom Clause Propositionalization (BCP;
Franca et al., 2014), which generates one bottom clause for each example. BCP
considers each predicate in the body of the bottom clause as a unique feature when it propositionalizes
bottom clauses to flat feature vectors. We use Progol (Muggleton, 1995) to generate these
bottom clauses. After propositionalization, we train two connectionist models: a propositionalized
Restricted Boltzmann Machine (BCP-RBM) and a propositionalized neural network (BCP-NN). The
NN has two hidden layers in our experiments, which makes the BCP-NN model a modified version of
CILP++ (Franca et al., 2014), which has one hidden layer. The hyper-parameters of both models
were optimized by line search on a validation set.
To answer Q3, we compare our model with Lifted Relational Neural Networks (LRNN; Sourek
et al., 2018) through five sub-experiments. Specifically, we obtain the structure of the neural
network in three ways: (a) expert-provided (hand-coded) rules, as used in Sourek et al. (2018);
(b) lifted random walks, as proposed in our model, used as the structure for both models; and
(c) structure learned by a third, independent ILP system, PROGOL (Muggleton, 1995), with the
same clauses input to both LRNN and NNRPT.
For the first experiment, we employed the hand-crafted rules of Sourek et al. (2018) with both LRNN
and NNRPT and predicted mutagenicity and carcinogenicity on the MUTAGENESIS and PTC
data sets, respectively. The hand-crafted rules consider chains of atoms to predict the target label;
for instance, two-chain rules (2c) consider the properties of two atoms. Since we do not perform
soft clustering at the hidden layer of our model, it is necessary to modify their chain rules to run
within our system. Specifically, LRNN first considers rules that represent the cluster of atom-types,
and then provides the resulting cluster predicate as input to chain rules. However, we formulate
the rules directly in terms of atom-type and bond-type. For example, a 2-chain rule in our model
looks like:
AtomType(B) ∧ AtomType(C) ∧ Bond(B, C, D) ∧ BondType(D)
∧ Contains(A, B) ∧ Contains(A, C)⇒ Mutagenetic(A)
where atoms B and C are connected by bond D, and A is the chemical whose mutagenicity is being
predicted by both models.
Table 5.2: Comparison of different learning algorithms based on AUC-ROC and AUC-PR. NNRPT is comparable to or better than standard SRL methods across all data sets.

Data Set  Measure   RDN-Boost    MLN-Boost    RRBM-E       RRBM-C       NNRPT
UW-CSE    AUC-ROC   0.973±0.014  0.968±0.014  0.975±0.013  0.968±0.011  0.959±0.024
          AUC-PR    0.931±0.036  0.916±0.035  0.923±0.056  0.924±0.040  0.896±0.063
IMDB      AUC-ROC   0.955±0.046  0.944±0.070  1.000±0.000  0.997±0.006  0.984±0.025
          AUC-PR    0.863±0.112  0.839±0.169  1.000±0.000  0.992±0.017  0.951±0.082
CORA      AUC-ROC   0.895±0.183  0.835±0.035  0.984±0.009  0.867±0.041  0.952±0.043
          AUC-PR    0.833±0.259  0.799±0.034  0.948±0.042  0.825±0.050  0.899±0.070
MUTAG.    AUC-ROC   0.999±0.000  0.999±0.000  0.999±0.000  0.998±0.001  0.981±0.024
          AUC-PR    0.999±0.000  0.999±0.000  0.999±0.000  0.997±0.002  0.970±0.039
SPORTS    AUC-ROC   0.801±0.026  0.806±0.016  0.760±0.016  0.656±0.071  0.780±0.026
          AUC-PR    0.670±0.028  0.652±0.032  0.634±0.020  0.648±0.085  0.668±0.070
We perform five sub-experiments to answer Q3: (i) hand-coded chain rules with LRNN (HCRules-
LRNN); (ii) hand-coded chain rules (modified) with our approach (HCRules-NNRPT); (iii) lifted ran-
dom walks with LRNN; and (iv) lifted random walks with our approach (proposed) (v) PROGOL
clauses as structure for both the models.
5.4.3 Results
Table 5.2 compares our NNRPT to MLN-Boost, RDN-Boost, RRBM-E and RRBM-C to answer Q1. As
we see, NNRPT is significantly better than RRBM-C for CORA and SPORTS on both AUC-ROC and
AUC-PR, and performs comparably on the other data sets. It also performs better than MLN-Boost and
RDN-Boost on the IMDB and CORA data sets, and comparably on the others. Similarly, it performs
better than RRBM-E on SPORTS, on both AUC-ROC and AUC-PR, and comparably on the other data
sets. Broadly, Q1 can be answered affirmatively in that NNRPT performs comparably to or better
than state-of-the-art SRL models.
Table 5.3 shows the comparison of NNRPT with two propositionalization models: BCP-RBM and
BCP-NN in order to answer Q2. NNRPT performs better than BCP-RBM on all the data sets except
MUTAGENESIS, where the two models have similar performance. NNRPT also performs better than
Table 5.3: Comparison of NNRPT with propositionalization-based approaches. NNRPT is significantly better on a majority of data sets.

Data Set  Measure   BCP-RBM      BCP-NN       NNRPT
UW-CSE    AUC-ROC   0.951±0.041  0.868±0.053  0.959±0.024
          AUC-PR    0.860±0.114  0.869±0.033  0.896±0.063
IMDB      AUC-ROC   0.780±0.164  0.540±0.152  0.984±0.025
          AUC-PR    0.367±0.139  0.536±0.231  0.951±0.082
CORA      AUC-ROC   0.801±0.017  0.670±0.064  0.952±0.043
          AUC-PR    0.647±0.050  0.658±0.064  0.899±0.070
MUTAG.    AUC-ROC   0.991±0.003  0.945±0.019  0.981±0.024
          AUC-PR    0.995±0.001  0.973±0.012  0.970±0.039
SPORTS    AUC-ROC   0.664±0.021  0.543±0.037  0.780±0.026
          AUC-PR    0.532±0.041  0.499±0.065  0.668±0.070
BCP-NN on all data sets. It should be noted that BCP feature generation sometimes introduces a
large positive-to-negative example skew (for example, in the IMDB data set), which can some-
times gravely affect the performance of the propositional model, as we observe in Table 5.3. This
emphasizes the need for designing models that can handle relational data directly, without
propositionalization; our proposed model is an effort in this direction. Q2 can now be answered
affirmatively: that NNRPT performs better than propositionalization models.
Table 5.4 shows the comparison of NNRPT with LRNN with both approaches using expert hand-
coded rules (Sourek et al., 2018). NNRPT performs better than LRNN on the mut-3c data subset and
similarly on the mut-2c data subset. Furthermore, it can be observed that, while both LRNN and
NNRPT do not exhibit good performance on the PTC data sets, NNRPT is often better than LRNN. This
leads us to infer that the chain rules proposed in Sourek et al. (2018) may not be an effective choice in
some domains. So how well do these approaches perform using relational random walk features?
Table 5.5 shows the results of this experiment, where, instead of employing expert hand-crafted
rules, we use lifted random walks. We restrict ourselves to two domains: IMDB and UW-CSE, as
the full grounding of the random walks generated for the other domains was too large for LRNN.
For UW-CSE, we varied the number of random walks as {100, 300, 400}. For IMDB, we provided
Table 5.4: Comparison of NNRPT and LRNN on AUC-ROC and AUC-PR on different data sets. Both models were provided expert hand-crafted rules from Sourek et al. (2018). NNRPT is capable of employing rules to improve performance on some data sets.

Data Set  Measure   HCRules-LRNN    HCRules-NNRPT
mut-2C    AUC-ROC   0.8296±0.0589   0.7756±0.0636
          AUC-PR    0.9203±0.0302   0.8922±0.0408
mut-3C    AUC-ROC   0.8359±0.0679   0.8389±0.0421
          AUC-PR    0.9182±0.0255   0.9293±0.0252
fm-2C     AUC-ROC   0.5±0           0.5788±0.0761
          AUC-PR    0.4038±0.0028   0.5146±0.1149
fr-2C     AUC-ROC   0.5±0           0.5198±0.0972
          AUC-PR    0.3449±0.0027   0.4025±0.0881
mm-2C     AUC-ROC   0.5±0           0.5790±0.0299
          AUC-PR    0.3879±0.0045   0.5054±0.0532
mr-2C     AUC-ROC   0.5±0           0.5623±0.0377
          AUC-PR    0.4478±0.0033   0.4862±0.0831
Table 5.5: Comparison of LRNN and NNRPT using relational random walk features. Across all the domains, NNRPT could better exploit the power of relational random walks.

Model   Measure   imdb-20RW       UWCSE-100RW     UWCSE-300RW     UWCSE-400RW
LRNN    AUC-ROC   0.6493±0.1480   0.6054±0.1206   0.7902±0.1771   0.6962±0.1745
        AUC-PR    0.5255±0.1898   0.5462±0.1758   0.7146±0.1747   0.6177±0.1539
NNRPT   AUC-ROC   0.7773±0.3299   0.9014±0.1155   0.9166±0.0413   0.9459±0.0376
        AUC-PR    0.7423±0.3598   0.8215±0.1926   0.8327±0.0827   0.8778±0.0965
20 random walks, as LRNN could not work with a larger set. For each of these settings, our proposed
framework significantly outperforms the LRNN framework. Also, while our framework can easily
scale to 2500 random walks for UW-CSE, the other frameworks cannot achieve this scale.
Tables 5.4 and 5.5 offer a deeper insight into the potential of our NNRPT approach. While
NNRPT can exploit expert hand-crafted rules when available, its true strength emerges on domains
like PTC, where even experts have a limited understanding of the domain; there, NNRPT can help
identify viable features and discover new relationships. In contrast, LRNN cannot scale to as many
relational random-walk features.
Table 5.6: Comparison of NNRPT and LRNN on AUC-ROC and AUC-PR on different data sets. Both models were provided clauses learned by PROGOL (Muggleton, 1995). NNRPT is capable of employing rules to improve performance on some data sets.

Model   Measure   UW-CSE       IMDB         CORA         MUTAGEN.     SPORTS
LRNN    AUC-ROC   0.923±0.027  0.995±0.004  0.503±0.003  0.500±0.000  0.741±0.016
        AUC-PR    0.826±0.056  0.985±0.013  0.356±0.006  0.335±0.000  0.527±0.036
NNRPT   AUC-ROC   0.700±0.186  0.997±0.007  0.968±0.022  0.532±0.019  0.657±0.014
        AUC-PR    0.910±0.072  0.992±0.017  0.943±0.032  0.412±0.032  0.658±0.056
Finally, for the fifth sub-experiment in Q3, Table 5.6 compares the performance of NNRPT and
LRNN when both use clauses learned by PROGOL (Muggleton, 1995). PROGOL learned 4 clauses
for CORA, 8 clauses for IMDB, 3 clauses for SPORTS, 10 clauses for UW-CSE and 11 clauses for
MUTAGENESIS in our experiment. NNRPT performs better on UW-CSE and SPORTS when evaluated
using AUC-PR. This result is especially significant because these data sets are considerably skewed.
NNRPT also outperforms LRNN on CORA and MUTAGENESIS. Lastly, NNRPT has comparable per-
formance on IMDB on both AUC-ROC and AUC-PR. The reason for this big performance gap
between the two models on CORA is likely that LRNN could not build effective models with the
few clauses (i.e., four) typically learned by PROGOL. In contrast, even with very few
clauses, NNRPT is able to outperform LRNN. This helps us answer Q3, affirmatively, that: NNRPT
offers many advantages over state-of-the-art relational neural networks.
In summary, our experiments clearly show the benefits of parameter tying as well as the ex-
pressivity of relational random walks in tightly integrating with a neural network model across a
wide variety of domains and settings. The key strengths of NNRPT are that it can (1) efficiently
incorporate a large number of relational features, (2) capture local qualitative structure through
relational random walk features, and (3) tie feature weights (parameter tying) in a manner that captures
the global quantitative influences.
5.5 Relation with Convolutional Neural Network
A typical convolutional neural network (CNN) is composed of three layers: convolution, max-
pooling and (fully-connected) output layers. NNRPT can be considered a special instance of a con-
volutional network in relational domains, where the fact-grounding layer edges are the equivalent
of convolution, combining rules layer represents pooling, and softmax layer is the fully-connected
layer. If we perform a full and exhaustive grounding of the neural network in NNRPT, M is the
number of lifted random walks (template rules), N is the number of grounded random walks (in-
stances of a template rule) and |F| is the number of all facts (atomic instances). The data can be
represented as a three-dimensional tensor B of size M × N × |F|, whose elements are precisely
B_ijk = Q^j_kθi (see the discussion of the Input Layer in Section 5.3.2). In addition, if we consider
the rule layer as a tensor T of size M × 1 × |F|, where parameters are tied across |F|, then
[w_m1f], m = 1, . . . , M, constitutes the convolving filter that is repeatedly applied to each of the
|F| ground instances. The resulting tensor G of size M × N × 1, obtained by composing G = B ◦ T
and representing the output of the grounding layer, passes through a pooling layer (which is the
rule-combination layer here) to downsample the data and produce a new tensor C of size M × 1 × 1.
The tensor C, when composed with the fully-connected non-linear layer F of size M × |O| of our
model, produces a tensor of size 1 × |O| that represents the probability of each class in the output O.
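The tensor composition above can be mimicked in a few lines of numpy. The shapes and random values below are illustrative; this is an analogy sketch, not the NNRPT forward pass:

```python
import numpy as np

# Hypothetical shapes from the CNN analogy: M lifted rules, N groundings per rule,
# |F| facts, |O| output classes.
M, N, F, O = 3, 4, 6, 2
rng = np.random.default_rng(0)

B = rng.integers(0, 2, size=(M, N, F)).astype(float)  # B[j,i,k] = 1 if fact k occurs in grounding i of rule j
T = rng.standard_normal((M, F))                       # tied "filter": one weight row per rule, shared by its groundings

# "Convolution": apply rule j's filter row to every one of its groundings.
G = np.einsum("jif,jf->ji", B, T)       # grounding-layer outputs, shape (M, N)
C = G.mean(axis=1)                      # pooling = average combining rule, shape (M,)
W_out = rng.standard_normal((M, O))     # fully-connected output layer
logits = C @ W_out
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the |O| classes
print(probs.shape)                      # (2,)
```

The key correspondence is that the same filter row of T is reused across the N groundings of a rule, exactly as a CNN filter is reused across spatial positions.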
5.6 Conclusion
We considered the problem of learning neural networks from relational data. Our proposed ar-
chitecture was able to exploit parameter tying i.e., different instances of the same rule shared the
same parameters inside the same training example. In addition, we explored the use of relational
random walks to create relational features for training these neural nets. Our extensive experi-
ments on standard relational domains demonstrated that the proposed NNRPT is on par with the
state-of-the-art SRL models and outperforms recent relational neural network methods as well as
propositionalization-based learning.
CHAPTER 6
TOPIC AUGMENTED KNOWLEDGE GRAPH EMBEDDINGS
Knowledge graph embedding models have shown remarkable growth in the past few years. The
reason for their popularity can be ascribed to their “closer to the metal” representation (as learnable
flat feature vectors) compared to neural SRL models, which makes them amenable to neural network
architectures. As every success comes at a price, however, knowledge graph embedding models
are not without flaws. One major drawback of these models is that they are unable to reason about
newer data encountered at test time. This issue is not prevalent in neural SRL models, as can be
seen from our previous chapters. Inspired by this, in this chapter we propose our first solution to
make knowledge graph embeddings generalizable.
6.1 Introduction
A Knowledge Graph (KG) is typically represented as triples (h, r, t) where r is the relation that
exists between entities h and t. For instance, triple (Washington DC, capital, USA) describes
the fact that “Washington, D.C. is the capital of USA”. A KG encompasses everyday facts that are
crucial in solving advanced AI problems such as question answering (Bordes et al., 2014), relation
extraction (Wang et al., 2014a) and web search (Szumlanski and Gomez, 2010). Consequently, the past
decade has seen a surge in the curation of huge knowledge graphs such as Freebase (Bollacker et al.,
2008), DBpedia (Lehmann et al., 2014), YAGO (Suchanek et al., 2007), WordNet (Miller, 1995) and
NELL (Carlson et al., 2010). These knowledge graphs are ever-expanding, with newer facts
being added to them everyday (Shi and Weninger, 2018). Though gigantic, the major drawback of
most of these knowledge graphs is that they are necessarily incomplete and have important links
missing in them. For instance, in Freebase, 71% of people are missing their place of birth and 75%
have unknown nationality (Dong et al., 2014).
Knowledge Graph Embeddings (KGE) have emerged as a promising solution to this problem
of missing link prediction in a given knowledge graph. A standard KGE model
Figure 6.1: An example of entity descriptions in Freebase
represents each entity or relation as a learnable vector in a low-dimensional feature space. These
vectors, also known as embeddings, encode the global and local KG properties in their parameters.
The link prediction task is then accomplished by considering each entity as a point in the embedding
space and each relation as a geometric operation between entities, generating a score whose value
decides the presence or absence of the relation between the entities. Recent years have seen a major
upsurge in such models, which mainly differ from each other in their scoring function (Bordes
et al., 2013; Nickel et al., 2016, 2011; Socher et al., 2013; Trouillon et al., 2016; Yang et al., 2015).
Undoubtedly, they have been enormously successful in solving the link prediction problem, resulting
in state-of-the-art performance (Wang et al., 2017).
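For concreteness, DistMult (Yang et al., 2015), the scoring function adopted later in this chapter, scores a triple as a trilinear product of the head, relation, and tail embeddings. The entity names and the dimensionality below are illustrative assumptions:

```python
import numpy as np

def distmult_score(h, r, t):
    """DistMult (Yang et al., 2015): score(h, r, t) = <h, r, t> = sum_d h_d * r_d * t_d."""
    return float(np.sum(h * r * t))

# Toy embeddings for the (Washington DC, capital, USA) example; dimension 8 is arbitrary.
rng = np.random.default_rng(42)
d = 8
emb = {name: rng.standard_normal(d) for name in ["WashingtonDC", "capital", "USA"]}

s = distmult_score(emb["WashingtonDC"], emb["capital"], emb["USA"])
# A higher score indicates the triple is more likely to hold; training pushes
# observed triples above corrupted (negative) ones.
print(round(s, 3))
```

Note that the trilinear form is symmetric in h and t, a well-known property (and limitation) of DistMult when modeling asymmetric relations.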
Although embedding-based models are effective in modeling the link prediction task between
existing entities in a KB, most of them suffer from a fundamental limitation: predicting links
involving newer (out-of-KB) entities introduced at test time (Shi and Weninger, 2018). One
workaround for this problem is to exploit the supplementary information in the form of a concise
textual description of entities that KGs are equipped with. For instance, Figure 6.1 shows the
textual descriptions of two entities Friends and Comedy along with the knowledge graph triple
that depicts the genre of the sitcom Friends as Comedy in Freebase. The features extracted from the
textual description can act as a surrogate for the embedding when a newer entity is encountered in the
knowledge graph link prediction task.
To this end, some recent research (Xiao et al., 2017; Xie et al., 2016) has successfully
harnessed the semantic content present in the textual descriptions of entities. For instance,
one of the earliest models in this direction, DKRL (Xie et al., 2016), introduced two encoders,
CBOW and a deep convolutional neural network, to learn text embeddings of entities from
their corresponding descriptions. These encoders, combined with the standard TransE model (Bordes
et al., 2013) that learns embeddings from KG triples, can perform link prediction between newer
entities. This work has inspired several directions (Shah et al., 2019; Shi and Weninger, 2018;
Wang and Li, 2016; Xiao et al., 2017) to perform link prediction in the presence of newer entities.
It should be noted that all these models rely on some variant of deep learning models to exploit
the textual descriptions of entities. For instance, Shi and Weninger (2018) utilize a relation-aware
attention model, while others (Shah et al., 2019; Xiao et al., 2017) use two different feature spaces to
learn embeddings from text and knowledge graphs respectively, and further utilize transformation
matrices to project one kind of entity onto the feature space of the other.
Our proposed work tackles the problem of handling newer entities at test time from a different
perspective. Rather than using advanced deep models to exploit the semantic content present in the
entity descriptions, we rely on a variant of the LDA model (Blei et al., 2003) to extract the hidden
topics present in the text and utilize them as substitutes for the embeddings of the newer entities
encountered at test time. The major advantage of the proposed approach is that it is a step
towards learning interpretable embeddings, because we utilize the document-topic and topic-word
distributions learnt from the text to assign meaning to each dimension of an entity embedding.
Specifically, we are inspired by the idea of out-of-matrix prediction proposed in the Collaborative
Topic Regression model (Wang and Blei, 2011) and reformulate that model for relational
data in order to bring it to knowledge graph embeddings. The key idea is that we consider a
generative model that models both knowledge graph triples and the text description. We derive
the prior probability distribution of an entity from two sources: a Dirichlet distribution that accounts
for the textual description of the entity, and a zero-mean spherical Gaussian prior that
accounts for the interactions of the entity with the triples present in the knowledge graph. Next, we
propose the likelihood of the triple based on the scoring function in DistMult (Yang et al., 2015).
We further derive a solution for learning the embeddings of entities and relations that encompasses
both sources of data. To summarize, the contribution of our work is threefold:
• We propose a novel knowledge graph embedding model, Topic Augmented Knowledge
Graph Embeddings (TAKE), that elegantly incorporates both textual information and knowledge
graph triples into its embeddings. The proposed model is an attempt to deal with the zero-shot
scenario, where the topics obtained by the model are used as substitutes for entity embeddings.
We further show that our proposed model can assign topics to each dimension of an embedding,
opening up the possibility of interpreting the model.
• In addition to newly occurring entities, scarcely occurring entities also benefit from our proposed model. As will be presented in the main section, the embedding of an entity (h or t) is learnt as a combination of embeddings obtained from the knowledge graph and the topic model. As a result, sparsely occurring entities benefit more from the topics learned from the text descriptions, whereas the embeddings of frequently occurring entities are dominated by the knowledge graph triple information.
• Experimental results on two widely used datasets demonstrate that our model performs comparably to or better than the baseline models. We also obtain negative results for some questions, and we reason about such cases in detail.
The rest of this chapter is organized as follows. Section 6.2 reviews related work. Next, we explain the proposed TAKE model in detail and present its algorithm in Section 6.3. Finally, we present our extensive experimental evaluations on standard KB datasets in Section 6.4 and conclude the chapter by outlining areas for future research in Section 6.5.
6.2 Related Work
Our study of related work is organized along four axes. We first discuss the most popular models for standard knowledge graph embeddings. Then, we overview the various approaches that have been proposed to date for combining text with knowledge graph embedding models. Since the priors of entities and relations in our proposed model are derived from Gaussian distributions, we survey past models on Gaussian embeddings in knowledge graphs. Finally, we review LDA-based models along the fourth dimension.
6.2.1 Knowledge graph embeddings models
Even though there have been numerous knowledge graph embedding models for missing link prediction, for brevity, we focus on the most popular ones here. The most influential approach among the embedding-based models has been TransE (Bordes et al., 2013). In this model, the relation embedding r is considered a translation operation from the head entity h to the tail entity t, i.e., h + r ≈ t. The plausibility of a triple is obtained from the L2 norm of this expression. Though useful in the case of 1-to-1 relations, this model fails to model 1-to-N, N-to-1 and N-to-N relations. To overcome this flaw, the TransH (Wang et al., 2014b) model projects both entities in a triple onto a relation-specific hyperplane before performing the translation operation between them. Likewise, the TransR (Lin et al., 2015) model considers a different feature space for each relation r and projects entities into the corresponding relation space before performing the translation operation between them.
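As a concrete illustration (a minimal sketch with random vectors, not a trained model), the TransE plausibility score is the negative L2 norm of h + r − t:

```python
import numpy as np

def transe_score(h, r, t):
    """Negative L2 distance ||h + r - t||: higher means more plausible."""
    return -np.linalg.norm(h + r - t)

rng = np.random.default_rng(0)
h, r = rng.normal(size=50), rng.normal(size=50)
t = h + r + rng.normal(scale=0.01, size=50)   # a near-perfect translation
t_bad = rng.normal(size=50)                   # an unrelated tail

# A tail that matches the translation scores far higher than a random tail.
assert transe_score(h, r, t) > transe_score(h, r, t_bad)
```

A perfect translation (t = h + r exactly) attains the maximal score of zero, which is what the model's training objective pushes true triples towards.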
In addition to translation-based embedding models, another interesting approach is composition-based embedding models. The earliest model among them is DistMult (Yang et al., 2015), where the score of a triple is computed by element-wise composition of the embeddings of the given triple. Though it has a simple scoring function, the DistMult model cannot handle anti-symmetric relations. Two more advanced models were proposed to handle anti-symmetric relations in KGs. The holographic embedding model, HolE (Nickel et al., 2016), employs circular correlation to compose the embeddings, which measures the covariance between embeddings at different dimension shifts. The ComplEx model (Trouillon et al., 2016), on the other hand, handles anti-symmetry in relation triples by performing tensor factorization of the relational data in a complex feature space. In addition to the above two classes of single-hop models, there is another line of work, e.g., PTransE (Lin et al., 2015), that considers paths between two entities and computes their score by composing the embeddings of all the relations along the path.
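The symmetry limitation of DistMult mentioned above is easy to see from its tri-linear scoring function (a sketch with random, illustrative embeddings):

```python
import numpy as np

def distmult_score(h, r, t):
    """Tri-linear product: sum_k h_k * r_k * t_k."""
    return float(np.sum(h * r * t))

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=20), rng.normal(size=20), rng.normal(size=20)

# The score is symmetric in h and t, so DistMult assigns the same
# plausibility to (h, r, t) and (t, r, h) -- it cannot distinguish a
# relation from its inverse, hence cannot model anti-symmetric relations.
assert np.isclose(distmult_score(h, r, t), distmult_score(t, r, h))
```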
6.2.2 Text-aware Knowledge graph embeddings models
All the models discussed in the previous section focus on link prediction in knowledge graphs where the representations of all entities have already been learnt during training. These models fail to handle newer entities encountered at test time. A relatively unexplored area in knowledge graph completion is to assert a triple in which at least one entity is novel and has not been encountered before. The first model to address this issue was Jointly (Wang et al., 2014a). This model jointly learnt the embeddings of knowledge graph entities, relations, and the words provided in supplementary text in the same continuous vector space, where entity and word embeddings were aligned by utilizing entity names and Wikipedia anchors. Zhong et al. (2015) improved this work by proposing a newer alignment approach between entities and words that, instead, utilized the textual descriptions of entities.
Starting with the work of Zhong et al. (2015), utilizing entity descriptions as auxiliary knowledge in order to infer missing links involving newer entities has become a major trend in the research that followed. For instance, the DKRL model (Xie et al., 2016) learns two types of embedding representations for a given entity: a structure-based representation that captures the entity's interactions in the triples of the KB by learning a standard TransE model (Bordes et al., 2013), and a description-based representation that captures the textual information about the entity by utilizing one of two encoders: CBOW (Mikolov et al., 2013) or a deep convolutional neural network. Another model, SSP (Xiao et al., 2017), introduces the novel concept of a semantic hyperplane for each triple, obtained from topic models of the textual entity descriptions. The error obtained from the triples learned by a standard knowledge graph embedding model (TransE) is projected onto the semantic hyperplane, and the goal of this model is to minimize this newer error, which captures the semantic relevance between the entities in a given triple.
Another popular model, ConMask (Shi and Weninger, 2018), proposed a novel relation-aware attention mechanism that extracts, from the entire entity description, only the text snippet relevant to the relation under consideration. It then passes the word embeddings of the chosen text snippet through a fully convolutional neural network in order to generate a unique entity embedding for the given description and employs it as a substitute for the new entity. Finally, the recent OWE model (Shah et al., 2019) considers two vector spaces: a word space, where the word embeddings of the words present in a given entity description reside, and a triple space, where the structural embeddings of the knowledge graph triples are located. These two types of embeddings are trained independently in their respective feature spaces. The model then aggregates the word embeddings of the entity description in the word space to generate an entity embedding, and further trains a novel transformation function that projects this entity embedding from the word space to the triple space in order to attain the desired objective.
Though aimed at solving problems other than the zero-shot scenario, other research works have leveraged text data to learn more effective knowledge graph embedding models. For instance, the NTN model (Socher et al., 2013) represents an entity vector as the average of the word embeddings occurring in its entity name, which allows entities sharing common words in their names to lie close to each other in the vector space, thereby improving the model's performance. A more recent model, TEKE (Wang and Li, 2016), utilizes the textual context information of entities to overcome low performance on 1-to-N, N-to-1, and N-to-N relations and KG sparseness. This model starts by annotating the entities in the given text, then learns the word embeddings of each word in the text through the word2vec model (Mikolov et al., 2013). It obtains the context embedding of an entity (or pair of entities) by aggregating the word embeddings residing in its textual context, projects these context embeddings into the knowledge graph vector space by incorporating transformation matrices into the model, and finally trains a standard KG embedding model (TransE/TransR/TransH) to optimize the knowledge graph embeddings.
Another model with a similar aim to TEKE was proposed by Xu et al. (2017), where the text embeddings were extracted from entity descriptions by employing three encoders: (i) NBOW, (ii) a Bi-LSTM model, and (iii) a novel attention-based LSTM encoder that selects the most relevant information from the text depending upon the context (relation) under consideration. Further, they proposed a gating mechanism that strikes a balance between structural and text embeddings when combining them in the final objective function. The TransConv model (Lai et al., 2019) proposed a novel knowledge graph embedding model specifically designed for social networks like Facebook or Twitter. This model augments the scoring function of TransH (Wang et al., 2014b) with two novel conversational factors derived from the textual communication between users on social media. As shown empirically, incorporating the textual conversations between users into the model improves its performance.
Finally, closest to our work is SSP (Xiao et al., 2017), as both models exploit the hidden LDA topics present in the textual information in order to infer newer entities. However, the two models have different objective functions: whereas the SSP model utilizes semantic hyperplane projection to minimize the error captured from the knowledge graph embeddings, our proposed model exploits LDA as a Dirichlet prior on the knowledge graph embeddings.
6.2.3 Gaussian Embeddings in Knowledge graphs
Owing to the fact that we employ Gaussian prior distributions for entities and relations in our proposed model, in this section we survey the knowledge graph models based on Gaussian distributions proposed in the past. Please note that these are standard KG embedding models, similar to the ones discussed in Section 6.2.1, that discover links between existing entities and are unable to tackle newer entities at test time. The first model along this direction was KG2E (He et al., 2015), which learns density-based knowledge graph embeddings by modeling each entity (h and t) and relation r as a multivariate Gaussian distribution. It proposed a scoring function based on the KL divergence between the distributions of the vectors h − t and r (inspired by the TransE model). The unique feature of this model was that it could account for the uncertainties present in entities and relations by capturing them in the variances of their Gaussian distributions.
Motivated by the observation that a given relation can have multiple semantic sub-clusters hidden within it, depending upon the entity pairs it participates in, the TransG model (Xiao et al., 2016) proposed a generative model for knowledge graph embeddings that employed a Bayesian non-parametric infinite mixture model for drawing the embeddings. This model generated multiple translation components for a relation by utilizing a Chinese Restaurant Process as the relation's prior distribution, which accounted for its multiple sub-clusters. Likewise, in order to incorporate semantic interpretability into knowledge graph embeddings, the KSR model (Xiao et al., 2019) proposed a novel multi-view clustering framework that leveraged a two-level hierarchical generative process to represent entities and relations: the first level of the model produced the semantic knowledge view that the entities belonged to, and the second level provided the cluster within that view from which each entity is drawn.
6.2.4 LDA based models
While topic models (Blei et al., 2003) are widely used for text modeling, their applicability to the zero-shot scenario in KBs has been relatively limited. To the best of our knowledge, SSP (Xiao et al., 2017) is the only model to have done that. Towards the other end of the spectrum lies the KGE-LDA model (Yao et al., 2017), which employed knowledge graph embeddings inside LDA topic modeling in order to learn more coherent topics. This model uses the same topic distribution to generate both the words and the entities inside a document. However, while the words are drawn from the standard multinomial topic-word distribution of LDA, a novel von Mises-Fisher (Gopal and Yang, 2014) topic-embedding distribution was proposed to draw embeddings from a given topic. Finally, Collaborative Topic Regression (CTR) (Wang and Blei, 2011), which inspired this work, is another relevant model that proposed to use topics as substitutes for article embeddings when recommending newer articles to users. However, whereas CTR mainly focused on learning user and article vectors based on user-item interactions and the abstracts of articles, this work focuses on learning embeddings of multi-relational data in knowledge graphs.
Given the related work, we now focus on the proposed model in the next section.
Figure 6.2: The proposed TAKE approach. Both entities h and t in the triple (h, r, t) are drawn from the distribution $\mathcal{N}(\theta, \lambda_e^{-1} I)$ and the relation r is drawn from $\mathcal{N}(0, \lambda_r^{-1} I)$, whereas the probability of the triple (h, r, t) being true, $P(y_{h,r,t} = 1)$, is drawn from Equation 6.14, where P(1) and P(0) refer to the true term and the three false terms in that equation, respectively.
6.3 Topic Augmented Knowledge Graph Embeddings: the proposed TAKE approach
In this section, we describe in detail our proposed Topic Augmented Knowledge Graph Embeddings (TAKE) framework. We consider a knowledge graph $\mathcal{K} = \{\mathcal{E}, \mathcal{R}, \mathcal{T}, \mathcal{D}\}$ where $\mathcal{E}$, $\mathcal{R}$ and $\mathcal{T} = \{(h_n, r_n, t_n)\}_{n=1}^{|\mathcal{T}|}$ are the sets of entities, relations and knowledge graph triples, as defined in any standard knowledge graph embedding model (similar to Section 2.4). In addition, the model has access to supplementary data in the form of a set of documents $\mathcal{D} = \{d_i\}_{i=1}^{|\mathcal{E}|}$, where each document $d_i$ is a concise textual description of an entity $e_i$ present in the knowledge graph.
6.3.1 Problem Formulation
The underlying idea behind this framework is that the embedding of an entity in a knowledge graph is generated by leveraging two data sources: the semantic information present in the concise textual description of the entity, and the interactions of the entity with the other entities and relations present in the knowledge graph. Specifically, the information from the textual description is captured as the semantic topics present in a document by employing topic modeling, and the interactions of the entity with other entities and relations in the knowledge graph are acquired through a zero-mean spherical Gaussian prior. Mathematically, the prior probability of any entity embedding variable $e \in \mathbb{R}^K$ in $\mathcal{E}$ is the sum of two variables: a variable $\theta_e \in \mathbb{R}^K$ that captures the topics present in the document description $d_e \in \mathcal{D}$ of the entity $e$, and another variable $kb_e \in \mathbb{R}^K$ that represents the interactions of the entity with the triples $\mathcal{T}$ in the knowledge graph (see Figure 6.2). Furthermore, the variable $\theta_e$ is generated from a Dirichlet distribution and the variable $kb_e$ is generated from a zero-mean spherical Gaussian prior with variance $\lambda_e^{-1}$ (please refer to Sections 2.4 and 2.5 for a detailed introduction to the generative embeddings formulation (without topics) and topic modeling):
$$e = \theta_e + kb_e, \qquad e, \theta_e, kb_e \in \mathbb{R}^K, \quad \text{where} \qquad (6.1)$$
$$\theta_e \sim \text{Dirichlet}(\vec{\alpha}) \quad \text{and} \qquad (6.2)$$
$$kb_e \sim \mathcal{N}(0, \lambda_e^{-1} I) \qquad (6.3)$$
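The generative story of Equations 6.1-6.3 can be sketched as follows (the values of K, α and λ_e here are illustrative placeholders, not the settings used in our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha, lam_e = 10, 1.0, 2.0      # illustrative hyper-parameters

theta_e = rng.dirichlet(np.full(K, alpha))        # topic proportions, Eq 6.2
kb_e = rng.normal(0.0, 1.0 / np.sqrt(lam_e), K)   # KG offset, Eq 6.3 (std = 1/sqrt(lam_e))
e = theta_e + kb_e                                # entity embedding, Eq 6.1

assert np.isclose(theta_e.sum(), 1.0)   # a Dirichlet sample lies on the simplex
assert e.shape == (K,)
```

Note that the Gaussian's variance $\lambda_e^{-1}$ translates to a standard deviation of $1/\sqrt{\lambda_e}$ when sampling.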
In the case of entities that participate in a large number of knowledge graph triples, the variable $kb_e$ will contribute heavily to $e$ in $e = \theta_e + kb_e$. However, entities that occur in fewer facts will mostly be determined by the content information present in $\theta_e$. Altogether, the information acquired from the two vectors complements each other and helps the proposed model learn better entity embeddings than those obtained from either source alone. We can integrate the two sources of data explained in Equations 6.1-6.3 into one distribution and conclude that an entity embedding is drawn from the distribution below:
$$e \sim \mathcal{N}(\theta_e, \lambda_e^{-1} I) \qquad (6.4)$$
While the generation of an entity relies on two sources, the relation embeddings are drawn from a zero-mean spherical Gaussian prior with variance $\lambda_r^{-1}$, as discussed before in Equation 2.4:

$$r \sim \mathcal{N}(0, \lambda_r^{-1} I) \qquad (6.5)$$
And finally, the conditional probability of a triple is drawn from the softmax defined in Equations 2.6-2.8. As a next step, in order to learn the model parameters, we define the complete log-likelihood of the model as a linear combination of the contributions from the KG data and the text description:

$$\mathcal{A} = \alpha\,\mathcal{A}(KG) + (1 - \alpha)\,\mathcal{A}(text) \qquad (6.6)$$

where $\mathcal{A}(KG)$ represents an approximate joint obtained from the knowledge graph triples and $\mathcal{A}(text)$ is the joint probability of the text descriptions of entities and the hidden topic parameters, obtained by employing topic modeling. Further, $\mathcal{A}(KG)$ can be defined as follows:

$$\mathcal{A}(KG) = \mathcal{A}(P(e)) + \mathcal{A}(P(r)) + \mathcal{A}(P(h, r, t)) \qquad (6.7)$$
We consider each of the above log-terms in Equations 6.6 and 6.7 individually and in detail. The $\mathcal{A}(P(e))$ and $\mathcal{A}(P(r))$ terms are generated from Equations 6.4 and 6.5 respectively, as follows:
$$\mathcal{A}(P(e)) = \log \prod_{i=1}^{|\mathcal{E}|} \mathcal{N}(\theta_i, \lambda_e^{-1} I) = -\frac{\lambda_e}{2} \sum_{i=1}^{|\mathcal{E}|} (e_i - \theta_i)^\top (e_i - \theta_i) + C_1 \qquad (6.8)$$

$$\mathcal{A}(P(r)) = \log \prod_{p=1}^{|\mathcal{R}|} \mathcal{N}(0, \lambda_r^{-1} I) = -\frac{\lambda_r}{2} \sum_{p=1}^{|\mathcal{R}|} r_p^\top r_p + C_2 \qquad (6.9)$$
After considering entity and relation generation, we now focus on the $\mathcal{A}(P(h, r, t))$ term, which specifies the score function of the triples $\mathcal{T}$ in the knowledge graph and is expressed as the log of the probability term $P(y_{h,r,t} = 1 \mid h, r, t)$ defined in Equation 2.7. As can be observed from that equation, the softmax functions have very cumbersome normalization terms in their denominators, hence we wish to replace them with manageable terms instead. Inspired by Wang et al. (2014a), we introduce the $\mathcal{A}(P(h, r, t))$ term as follows:
$$\mathcal{A}(P(h, r, t)) \approx \sum_{n=1}^{|\mathcal{T}|} \Big( \log P(h_n \mid r_n, t_n) + \log P(r_n \mid h_n, t_n) + \log P(t_n \mid h_n, r_n) \Big) \qquad (6.10)$$
The above expression represents the cyclic dependency between the three elements of a triple (h, r, t), where one element can be approximated when the other two are given. Such cyclic dependency in relational domains has also been exploited at the triple level in past SRL models, where one triple can be inferred when the other triples in the domain are given (Heckerman et al., 2001; Khot et al., 2011; Lowd and Davis, 2010). However, even with the above design of $\mathcal{A}(P(h, r, t))$, the inconvenient summation term still exists in the denominators of all three probabilities. For instance, the probability $P(h \mid r, t)$ is defined as:
$$P(h \mid r, t) = \frac{\exp(\text{score}(h, r, t))}{\sum_{h' \in \mathcal{E}} \exp(\text{score}(h', r, t))} \qquad (6.11)$$
To overcome the intractable denominator in the above equation, the model samples C negative examples by corrupting the head term in $P(h \mid r, t)$ for each positive triple (h, r, t). This results in a corrupted head $h_c$ in the negative example $(h_c, r, t)$, and instead of optimizing $\log P(h \mid r, t)$, the model optimizes the following term (Mikolov et al., 2013; Wang et al., 2014a):
$$\log P(1 \mid h, r, t) + \frac{1}{C} \sum_{c=1}^{C} \log P(0 \mid h_c, r, t) \qquad (6.12)$$
The probability $P(1 \mid h, r, t)$ in the above term is defined as:

$$P(1 \mid h, r, t) = \sigma(a \cdot \text{score}(h, r, t)) \qquad (6.13)$$

where the sigmoid function $\sigma(ax) = 1/(1 + \exp(-ax))$ has a scaling hyper-parameter (aka temperature) $a$, and the scoring function is the same as defined in Equation 2.8. The modified terms for $P(r \mid h, t)$ and $P(t \mid h, r)$ can be constructed in a similar manner by corrupting the relation r and the tail t, C times each. Our final approximate log-likelihood term, $\mathcal{A}(P(h, r, t))$, turns out to be:
$$\mathcal{A}(P(h, r, t)) \approx \sum_{n=1}^{|\mathcal{T}|} \Big( 3 \log P(1 \mid h_n, r_n, t_n) + \frac{1}{C} \sum_{c=1}^{C} \log P(0 \mid h_c, r_n, t_n) + \frac{1}{C} \sum_{c=1}^{C} \log P(0 \mid h_n, r_c, t_n) + \frac{1}{C} \sum_{c=1}^{C} \log P(0 \mid h_n, r_n, t_c) \Big) \qquad (6.14)$$
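Under the assumption of a DistMult-style score and randomly initialized embedding tables (all values here are illustrative, and for simplicity the sampler does not exclude the true entity or relation when corrupting), one summand of Equation 6.14 can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, K, C, a = 100, 10, 16, 5, 1.0   # illustrative sizes and temperature

E = rng.normal(size=(n_ent, K))   # entity embedding table (hypothetical values)
R = rng.normal(size=(n_rel, K))   # relation embedding table

def score(h, r, t):
    return float(np.sum(E[h] * R[r] * E[t]))   # DistMult score

def log_p(label, h, r, t):
    # Stable log sigma(a*s) and log(1 - sigma(a*s)) via logaddexp, Eq 6.13.
    x = a * score(h, r, t)
    return -np.logaddexp(0.0, -x) if label == 1 else -np.logaddexp(0.0, x)

def triple_loglik(h, r, t):
    """One summand of Equation 6.14: true term plus averaged corruption terms."""
    ll = 3.0 * log_p(1, h, r, t)
    ll += np.mean([log_p(0, rng.integers(n_ent), r, t) for _ in range(C)])  # corrupt head
    ll += np.mean([log_p(0, h, rng.integers(n_rel), t) for _ in range(C)])  # corrupt relation
    ll += np.mean([log_p(0, h, r, rng.integers(n_ent)) for _ in range(C)])  # corrupt tail
    return ll

ll = triple_loglik(3, 2, 7)
assert np.isfinite(ll) and ll < 0.0   # a sum of log-probabilities is negative
```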
We now consider the $\mathcal{A}(text)$ term, which represents the log of the joint distribution of the hidden variables $\{\theta, z\}$ and the known text data $\mathcal{D} = \{d_i\}_{i=1}^{|\mathcal{E}|}$. Here, the document description of the i-th entity, $d_i$, consists of $N_i$ words; $\theta_i \in \mathbb{R}^K$ represents the topic mixture being discussed in document $d_i$; and the variable $z_i$ is a topic vector of length $N_i$ that assigns a topic to each word generated in document $d_i$. Given the parameters $\{\vec{\alpha}, \beta\}$, $\mathcal{A}(text)$ is expressed as:
$$\mathcal{A}(text) = \log \prod_{i=1}^{|\mathcal{E}|} P(\theta_i, d_i, z_i \mid \vec{\alpha}, \beta) \qquad (6.15)$$

$$= \sum_{i=1}^{|\mathcal{E}|} \log \Big( P(d_i \mid z_i, \beta) \cdot P(z_i \mid \theta_i) \cdot P(\theta_i \mid \vec{\alpha}) \Big) \qquad (6.16)$$
We discuss each probability component in Equation 6.16 individually here. First, we assume the value of the positive vector $\vec{\alpha}$ to be one, which consequently sets the Dirichlet distribution $P(\theta_i \mid \vec{\alpha})$ to a constant value. Second, the variable $z_i$ draws the topic of each word in the document from a multinomial distribution ($z_{ij} \sim \text{Mult}(\theta_i)$). Therefore, once the topic of a given word is known (say $z_{ij} = k$), the probability becomes $P(z_{ij} = k \mid \theta_i) = \theta_{ik}$. However, the topic of the j-th word in the i-th document could be any of the K topics, requiring us to sum over all the topics that a word may be drawn from. Finally, the probability $P(d_i \mid z_i, \beta)$ of observing a document $d_i$, given that the parameters $z_i$ and $\beta$ are known, can be decomposed into the probabilities of the individual words and is defined as:
$$P(d_i \mid z_i, \beta) = \prod_{j=1}^{N_i} \beta_{z_{ij}, w_{ij}} \qquad (6.17)$$
By bringing all the above considerations together, Equation 6.16 simplifies to:
$$\mathcal{A}(text) = \sum_{i=1}^{|\mathcal{E}|} \log \Big( \prod_{j=1}^{N_i} \sum_{k=1}^{K} \theta_{ik} \beta_{k, w_{ij}} \Big) = \sum_{i=1}^{|\mathcal{E}|} \sum_{j=1}^{N_i} \log \sum_{k=1}^{K} \theta_{ik} \beta_{k, w_{ij}} \qquad (6.18)$$
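A single document's contribution to Equation 6.18 can be sketched with toy values of θ_i and β (the sizes and the random distributions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 4, 30                                 # topics, vocabulary size
theta_i = rng.dirichlet(np.ones(K))          # document-topic mixture
beta = rng.dirichlet(np.ones(V), size=K)     # K topic-word distributions, each over V words
doc = rng.integers(V, size=12)               # word ids of one entity description

# One summand of Equation 6.18: sum_j log sum_k theta_ik * beta_k,w_ij.
# theta_i @ beta[:, w] marginalizes the hidden topic of word w.
loglik = sum(np.log(theta_i @ beta[:, w]) for w in doc)

assert np.isfinite(loglik) and loglik < 0.0
```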
This formulation sums up the generative process of LDA as follows: the j-th word of the i-th document could be generated from any of the K topics with probability $\theta_{ik}$, and for a given topic k under consideration, the j-th word is generated by following the distribution $\beta_k \in \mathbb{R}^V$.
As the last step, we substitute $\mathcal{A}(P(e))$, $\mathcal{A}(P(r))$, $\mathcal{A}(P(h, r, t))$ and $\mathcal{A}(text)$, derived in Equations 6.8, 6.9, 6.14 and 6.18 respectively, into Equation 6.6 in order to obtain the final expression of the approximate complete log-likelihood of $e, r, \theta$ given $\lambda_e, \lambda_r, \alpha$ and $\beta$:
$$\begin{aligned} \mathcal{A} = \alpha \Big( &-\frac{\lambda_e}{2} \sum_{i=1}^{|\mathcal{E}|} (e_i - \theta_i)^\top (e_i - \theta_i) - \frac{\lambda_r}{2} \sum_{p=1}^{|\mathcal{R}|} r_p^\top r_p + \sum_{n=1}^{|\mathcal{T}|} \Big( 3 \log P(1 \mid h_n, r_n, t_n) \\ &+ \frac{1}{C} \sum_{c=1}^{C} \log P(0 \mid h_c, r_n, t_n) + \frac{1}{C} \sum_{c=1}^{C} \log P(0 \mid h_n, r_c, t_n) + \frac{1}{C} \sum_{c=1}^{C} \log P(0 \mid h_n, r_n, t_c) \Big) \Big) \\ &+ (1 - \alpha) \Big( \sum_{i=1}^{|\mathcal{E}|} \sum_{j=1}^{N_i} \log \sum_{k=1}^{K} \theta_{ik} \beta_{k, w_{ij}} \Big) \end{aligned} \qquad (6.19)$$
The above Equation 6.19 represents the final formulation of the proposed TAKE model. Having discussed the model formulation for TAKE in this section, we now turn to learning the parameters $\{e, r, \theta, \beta\}$ of the model in the next section.
6.3.2 Learning the model parameters
The parameters of the proposed model are learnt iteratively by optimizing the knowledge graph parameters $\{e, r\}$ while fixing the topic parameters $\{\theta, \beta\}$, and vice versa. We now compute the gradient expressions for all four model parameters $\{e, r, \theta, \beta\}$, beginning with the derivative of the approximate complete log-likelihood $\mathcal{A}$ with respect to the entity embedding $e$.
Derivative of update expression for parameter e
In order to compute the derivative of $\mathcal{A}$ in Equation 6.19 with respect to a specific entity embedding $e_i \in \mathcal{E}$, we must take into account the fact that a given entity $e_i$ might play the part of: (i) the head $h$ (i.e., $\mathbb{I}(e_i = h)$) or (ii) the tail $t$ (i.e., $\mathbb{I}(e_i = t)$) in a given positive triple, and (iii) the corrupted head $\bar{h}$ (i.e., $\mathbb{I}(e_i = \bar{h})$) or (iv) the corrupted tail $\bar{t}$ (i.e., $\mathbb{I}(e_i = \bar{t})$) in a negative example generated by corrupting a different positive triple (the three expectation terms in Equation 6.19). Further, we should consider the entity as is (i.e., $\mathbb{I}(e_i = e_i)$) when it contributes as the prior following Equation 6.8. To summarize, the approximate log-likelihood $\mathcal{A}$ considered earlier in Equation 6.19 is further sub-divided for a given entity $e_i$ as follows:
$$\mathcal{A} = \mathcal{A}_1 + \mathcal{A}_2 + \mathcal{A}_3 + \mathcal{A}_4 + \mathcal{A}_5 + \mathcal{A}_6$$
$$\mathcal{A}_1 = -\frac{\lambda_e}{2} \sum_{i=1}^{|\mathcal{E}|} (e_i - \theta_i)^\top (e_i - \theta_i)$$
$\mathcal{A}_2$ = subset $T_1$ of triples in the approximate log-likelihood $\mathcal{A}$ where $e_i$ participates as the head of a true triple, i.e., $(h, r, t) \equiv (e_i, r, t)$
$\mathcal{A}_3$ = subset $T_2$ of triples in the approximate log-likelihood $\mathcal{A}$ where $e_i$ participates as the tail of a true triple, i.e., $(h, r, t) \equiv (h, r, e_i)$
$\mathcal{A}_4$ = subset $T_3$ of triples in the approximate log-likelihood $\mathcal{A}$ where $e_i$ participates as the corrupted head of a false triple, i.e., $(\bar{h}, r, t) \equiv (e_i, r, t)$
$\mathcal{A}_5$ = subset $T_4$ of triples in the approximate log-likelihood $\mathcal{A}$ where $e_i$ participates as the corrupted tail of a false triple, i.e., $(h, r, \bar{t}) \equiv (h, r, e_i)$
$\mathcal{A}_6$ = subset of triples in $\mathcal{A}$ where $e_i$ does not participate at all $\qquad (6.20)$
The derivative of A with respect to ei is computed as:
$$\frac{\partial \mathcal{A}}{\partial e_i} = \frac{\partial \mathcal{A}_1}{\partial e_i} + \frac{\partial \mathcal{A}_2}{\partial e_i} + \frac{\partial \mathcal{A}_3}{\partial e_i} + \frac{\partial \mathcal{A}_4}{\partial e_i} + \frac{\partial \mathcal{A}_5}{\partial e_i} + \frac{\partial \mathcal{A}_6}{\partial e_i} \qquad (6.21)$$
We now consider the derivative of $\mathcal{A}$ for each component described in the above equation individually, in order to get the final expression for $e_i$. While computing the derivatives, we consider only those components of the approximate log-likelihood $\mathcal{A}$ that account for the entity $e_i$. We start with the part $\mathcal{A}_1$, where the entity contributes as the prior in Equation 6.19, i.e., $\mathbb{I}(e_i = e_i)$:
$$\frac{\partial \mathcal{A}_1}{\partial e_i} = -\lambda_e (e_i - \theta_i) \qquad (6.22)$$
Next, we consider the case when the entity represents the head in the true knowledge graph triples, i.e., $\mathbb{I}(e_i = h)$, and compute the derivative of $\mathcal{A}_2$ as follows:
$$\frac{\partial \mathcal{A}_2}{\partial e_i} = a \sum_{n=1}^{|T_1|} \Big( 3 \big(1 - P(1 \mid e_i, r_n, t_n)\big) (r_n \circ t_n) - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid e_i, r_c, t_n) (r_c \circ t_n) - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid e_i, r_n, t_c) (r_n \circ t_c) \Big) \qquad (6.23)$$
In the above equation, $T_1$ represents the set of all the true triples in the knowledge graph in which the entity $e_i$ plays the role of the head entity. Also, the operator $\circ$ in $r_n \circ t_n$ represents the element-wise product between embeddings. Likewise, we compute the gradient of the approximate complete log-likelihood $\mathcal{A}$ when the entity $e_i$ plays the role of the tail ($\mathbb{I}(e_i = t)$) by considering the component $\mathcal{A}_3$:
$$\frac{\partial \mathcal{A}_3}{\partial e_i} = a \sum_{m=1}^{|T_2|} \Big( 3 \big(1 - P(1 \mid h_m, r_m, e_i)\big) (h_m \circ r_m) - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid h_c, r_m, e_i) (h_c \circ r_m) - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid h_m, r_c, e_i) (h_m \circ r_c) \Big) \qquad (6.24)$$
In the above equation, $T_2$ characterizes the set of all the true triples in the knowledge graph in which $e_i$ appears as the tail. We now consider the case where $e_i$ contributes as the corrupted head $\mathbb{I}(e_i = \bar{h})$ in the false-triple set $T_3$ and compute the derivative of $\mathcal{A}_4$ as below:
$$\frac{\partial \mathcal{A}_4}{\partial e_i} = \frac{\partial}{\partial e_i} \frac{1}{C} \sum_{n=1}^{|\mathcal{T}|} \Big( \sum_{c=1}^{C} \log P(0 \mid \mathbb{I}(h_c = e_i), r_n, t_n) \Big) = -\frac{a}{C} \sum_{s=1}^{|T_3|} P(1 \mid e_i, r_s, t_s) (r_s \circ t_s) \qquad (6.25)$$
As the final step in the derivative, we consider the case where the entity participates in a false triple as the corrupted tail $\mathbb{I}(e_i = \bar{t})$ and calculate the derivative of $\mathcal{A}_5$ as below:
$$\frac{\partial \mathcal{A}_5}{\partial e_i} = \frac{\partial}{\partial e_i} \frac{1}{C} \sum_{n=1}^{|\mathcal{T}|} \Big( \sum_{c=1}^{C} \log P(0 \mid h_n, r_n, \mathbb{I}(t_c = e_i)) \Big) = -\frac{a}{C} \sum_{u=1}^{|T_4|} P(1 \mid h_u, r_u, e_i) (h_u \circ r_u) \qquad (6.26)$$
Finally, the derivative of $\mathcal{A}_6$ with respect to $e_i$ is zero. As the last step, we substitute the derivatives of $\mathcal{A}$ with respect to $e_i$ when it plays the role of $e$, $h$, $t$, $\bar{h}$ and $\bar{t}$ in a triple, as derived in Equations 6.22, 6.23, 6.24, 6.25 and 6.26 respectively, into Equation 6.21 in order to obtain the final expression of the derivative of the approximate complete log-likelihood with respect to $e_i$:
$$\begin{aligned} \frac{\partial \mathcal{A}}{\partial e_i} = \lambda_e \alpha \Bigg( (\theta_i - e_i) + \frac{a}{\lambda_e} \bigg( &\Big\{ \sum_{n=1}^{|T_1|} \Big( 3 \big(1 - P(1 \mid e_i, r_n, t_n)\big)(r_n \circ t_n) - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid e_i, r_c, t_n)(r_c \circ t_n) \\ &\qquad - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid e_i, r_n, t_c)(r_n \circ t_c) \Big) \Big\} \\ &+ \Big\{ \sum_{m=1}^{|T_2|} \Big( 3 \big(1 - P(1 \mid h_m, r_m, e_i)\big)(h_m \circ r_m) - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid h_c, r_m, e_i)(h_c \circ r_m) \\ &\qquad - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid h_m, r_c, e_i)(h_m \circ r_c) \Big) \Big\} \\ &- \Big\{ \frac{1}{C} \sum_{s=1}^{|T_3|} P(1 \mid e_i, r_s, t_s)(r_s \circ t_s) \Big\} - \Big\{ \frac{1}{C} \sum_{u=1}^{|T_4|} P(1 \mid h_u, r_u, e_i)(h_u \circ r_u) \Big\} \bigg) \Bigg) \end{aligned} \qquad (6.27)$$
As can be seen from the equation above, we cannot obtain a closed-form solution for an entity. Therefore, we optimize the above expression by utilizing stochastic gradient ascent. We now proceed to the next derivation, where we compute the update expression for the relation vector $r_p$.
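Before moving on, the analytic per-triple pieces of such gradients can be sanity-checked numerically; the sketch below verifies the true-triple head gradient $a(1 - P(1 \mid h, r, t))(r \circ t)$ from Equation 6.23 against finite differences (random vectors and an illustrative temperature a):

```python
import numpy as np

rng = np.random.default_rng(0)
K, a = 8, 1.0
h, r, t = rng.normal(size=K), rng.normal(size=K), rng.normal(size=K)

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
f = lambda h: np.log(sigma(a * np.sum(h * r * t)))   # log P(1 | h, r, t)

# Analytic gradient w.r.t. the head, the first summand of Eq 6.23 (without the factor 3):
analytic = a * (1.0 - sigma(a * np.sum(h * r * t))) * (r * t)

# Central finite-difference approximation of the same gradient.
eps, numeric = 1e-6, np.zeros(K)
for k in range(K):
    hp, hm = h.copy(), h.copy()
    hp[k] += eps
    hm[k] -= eps
    numeric[k] = (f(hp) - f(hm)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```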
Update expression for parameter r
The parameter learning of a given relation $r_p$ in the knowledge graph is analogous to that of an entity, as discussed in the previous section. A given relation $r_p$ might contribute to the approximate log-likelihood $\mathcal{A}$ in Equation 6.19 as: (i) the relation $r$ when it is part of a true triple in the knowledge graph (i.e., $\mathbb{I}(r_p = r)$); (ii) the corrupted relation $\bar{r}$ when it participates in a corrupted triple (i.e., $\mathbb{I}(r_p = \bar{r})$); (iii) the relation as is when it contributes as the prior ($\mathbb{I}(r_p = r_p)$). We reconsider the expression $\mathcal{A}$ in Equation 6.19 and decompose it further according to the role $r_p$ plays in it. This can be mathematically outlined as follows:
$$\mathcal{A} = \mathcal{A}_7 + \mathcal{A}_8 + \mathcal{A}_9 + \mathcal{A}_{10}$$
$$\mathcal{A}_7 = -\frac{\lambda_r}{2} \sum_{p=1}^{|\mathcal{R}|} r_p^\top r_p$$
$\mathcal{A}_8$ = subset $T_5$ of triples in the approximate log-likelihood $\mathcal{A}$ where $r_p$ participates as the true relation in a true triple, i.e., $(h, r, t) \equiv (h, r_p, t)$
$\mathcal{A}_9$ = subset $T_6$ of triples in the approximate log-likelihood $\mathcal{A}$ where $r_p$ participates as the corrupted relation in a false triple, i.e., $(h, \bar{r}, t) \equiv (h, r_p, t)$
$\mathcal{A}_{10}$ = subset of triples in $\mathcal{A}$ where $r_p$ does not participate at all $\qquad (6.28)$
The derivative of A with respect to rp is given by:
$$\frac{\partial \mathcal{A}}{\partial r_p} = \frac{\partial \mathcal{A}_7}{\partial r_p} + \frac{\partial \mathcal{A}_8}{\partial r_p} + \frac{\partial \mathcal{A}_9}{\partial r_p} + \frac{\partial \mathcal{A}_{10}}{\partial r_p} \qquad (6.29)$$
We consider each component of the derivative individually, beginning with the case when the relation contributes as the prior of the model, $\mathbb{I}(r_p = r_p)$:
$$\frac{\partial \mathcal{A}_7}{\partial r_p} = \frac{\partial}{\partial r_p} \Big( -\frac{\lambda_r}{2} \sum_{p=1}^{|\mathcal{R}|} r_p^\top r_p \Big) = -\lambda_r r_p \qquad (6.30)$$
Next, we examine the case where the relation contributes to $\mathcal{A}$ as part of the true triples (i.e., $\mathbb{I}(r_p = r)$):
$$\frac{\partial \mathcal{A}_8}{\partial r_p} = a \sum_{n=1}^{|T_5|} \Big( 3 \big(1 - P(1 \mid h_n, r_p, t_n)\big) (h_n \circ t_n) - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid h_c, r_p, t_n) (h_c \circ t_n) - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid h_n, r_p, t_c) (h_n \circ t_c) \Big) \qquad (6.31)$$
In the above expression, $T_5$ represents the set of true triples in which $r_p$ appears. Finally, we consider the case where the relation $r_p$ operates as the corrupted relation in the negative examples generated in the model (i.e., $\mathbb{I}(r_p = \bar{r})$):
$$\frac{\partial \mathcal{A}_9}{\partial r_p} = \frac{\partial}{\partial r_p} \frac{1}{C} \sum_{n=1}^{|\mathcal{T}|} \Big( \sum_{c=1}^{C} \log P(0 \mid h_n, \mathbb{I}(r_c = r_p), t_n) \Big) = -\frac{a}{C} \sum_{v=1}^{|T_6|} P(1 \mid h_v, r_p, t_v) (h_v \circ t_v) \qquad (6.32)$$
Finally, the derivative of $\mathcal{A}_{10}$ with respect to $r_p$ is zero. Next, we substitute each role of the relation, as $r_p$, $r$ and $\bar{r}$, derived in Equations 6.30, 6.31 and 6.32, into Equation 6.29 in order to obtain the final derivative of the approximate complete log-likelihood with respect to the relation embedding $r_p$, as below:
$$\begin{aligned} \frac{\partial \mathcal{A}}{\partial r_p} = \alpha \lambda_r \Bigg( -r_p + \frac{a}{\lambda_r} \Big\{ \sum_{n=1}^{|T_5|} \Big( &3 \big(1 - P(1 \mid h_n, r_p, t_n)\big)(h_n \circ t_n) - \frac{1}{C} \sum_{c=1}^{C} P(1 \mid h_c, r_p, t_n)(h_c \circ t_n) \\ &- \frac{1}{C} \sum_{c=1}^{C} P(1 \mid h_n, r_p, t_c)(h_n \circ t_c) \Big) - \frac{1}{C} \sum_{v=1}^{|T_6|} P(1 \mid h_v, r_p, t_v)(h_v \circ t_v) \Big\} \Bigg) \end{aligned} \qquad (6.33)$$
As can be observed, we cannot attain a closed-form solution for updating the relation $r_p$, hence we obtain the optimal value of the relation embedding by stochastic gradient ascent. Also, note that the above expression depends on only one data source: the knowledge graph triples. We now proceed to learning the topic model parameters $\{\theta, \beta\}$ in the next section, while keeping $\{e, r\}$ fixed.
Update expression for the parameter θ
In order to update the topic parameters, we consider only the part of the approximate complete log-likelihood expression $\mathcal{A}$ in Equation 6.19 that involves the topic parameters $\{\theta, \beta\}$, and regard the remaining portion as a constant $C(e, r)$. We denote the expression involving the $\{\theta, \beta\}$ parameters as $\mathcal{L}(\theta, \beta)$, as shown below:
$$\mathcal{L}(\theta, \beta) = \alpha \Big( -\frac{\lambda_e}{2} \sum_{i=1}^{|\mathcal{E}|} (e_i - \theta_i)^\top (e_i - \theta_i) \Big) + (1 - \alpha) \Big( \sum_{i=1}^{|\mathcal{E}|} \sum_{j=1}^{N_i} \log \sum_{k=1}^{K} \theta_{ik} \beta_{k, w_{ij}} \Big) \qquad (6.34)$$
In order to avoid the summation inside the log expression in the second part of $\mathcal{L}(\theta, \beta)$, we simplify it by noticing that the reason for the summation over k is that the topic of the j-th word in the i-th document is unknown, i.e., the variable $z_{ij}$ is hidden. We denote the observed parameters of $\mathcal{L}(\theta, \beta)$ as $X = \{\theta, \beta\}$ and the hidden parameter as $H = \{z\}$ in order to express the second part of the equation as:
$$\log \sum_{k=1}^{K} \theta_{ik} \beta_{k, w_{ij}} = \log \sum_{H} p(X, H) = \log p(X) \qquad (6.35)$$
Next, we examine the expression log p(X) and expand it further as:
$$\begin{aligned} \log p(X) &= \log \sum_{H} p(X, H) = \log \sum_{H} p(X, H) \cdot \frac{q(H)}{q(H)} = \log \Big( \mathbb{E}_q \Big[ \frac{p(X, H)}{q(H)} \Big] \Big) \\ &\geq \mathbb{E}_q \big[ \log p(X, H) \big] - \mathbb{E}_q \big[ \log q(H) \big] \quad \text{(from Jensen's inequality)} \\ &= \text{ELBO} \end{aligned} \qquad (6.36)$$
Instead of maximizing the marginal probability $\log p(X)$, we maximize the Evidence Lower Bound (ELBO) to find the parameters that give as tight a bound as possible on the marginal probability. In the above equation, $q(H)$ is a variational distribution over the hidden variable H, defined as follows:
q(H) = q(zij = k) = φijk (6.37)
We consider the ELBO in Equation 6.36 and the definition of φ in Equation 6.37, and rewrite the likelihood function in Equation 6.34 for a given document di as:
\mathcal{L}(\theta_i, \phi_i) = \alpha \left( -\frac{\lambda_e}{2} (\mathbf{e}_i - \theta_i)^{\mathsf{T}} (\mathbf{e}_i - \theta_i) \right) + (1 - \alpha) \left( \sum_{j=1}^{N_i} \left( \sum_{H} q(H) \log p(X, H) - \sum_{H} q(H) \log q(H) \right) \right) \tag{6.38}

= \alpha \left( -\frac{\lambda_e}{2} (\mathbf{e}_i - \theta_i)^{\mathsf{T}} (\mathbf{e}_i - \theta_i) \right) + (1 - \alpha) \left( \sum_{j=1}^{N_i} \sum_{k=1}^{K} \left( \phi_{ijk} \log \theta_{ik} \beta_{k, w_{ij}} - \phi_{ijk} \log \phi_{ijk} \right) \right) \tag{6.39}
Having circumvented the summation inside the log expression, we are now prepared to derive the update equations for the topic model parameters {θi, φijk, β}, beginning with the parameter θi:
\frac{\partial \mathcal{L}(\theta_i, \phi_i)}{\partial \theta_i} = \alpha \, \lambda_e (\mathbf{e}_i - \theta_i) + (1 - \alpha) \, \frac{\zeta_i}{\theta_i} \tag{6.40}
where ζi ∈ R^K is a vector whose k-th entry is defined as Σ_{j=1}^{N_i} φijk, and the division ζi/θi is elementwise. As can be observed, a closed-form solution for θi is not feasible, so we employ projected gradient descent, as in Wang and Blei (2011), to update θi.
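One projected gradient step for θi can be sketched as follows: take an ascent step on the gradient in Equation 6.40 and then project the result back onto the probability simplex. The sort-based simplex projection and the step size below are illustrative choices, not necessarily those of the actual implementation:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the simplex {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def theta_step(theta_i, e_i, zeta_i, alpha=0.5, lam_e=0.1, lr=0.01):
    """One ascent step on the gradient of Eq. 6.40, then projection onto the simplex."""
    grad = alpha * lam_e * (e_i - theta_i) + (1.0 - alpha) * zeta_i / theta_i
    return project_simplex(theta_i + lr * grad)

K = 5
theta_i = np.full(K, 1.0 / K)                      # start from the uniform topic mixture
e_i = np.random.default_rng(1).normal(size=K)      # entity embedding (toy values)
zeta_i = np.array([3.0, 1.0, 0.5, 0.25, 0.25])     # expected per-topic word counts
theta_new = theta_step(theta_i, e_i, zeta_i)
```

The projection guarantees that the updated θi remains a valid topic distribution after every step.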
Update expression for the parameter φ
We next explain the update equation of the variational parameter φijk, which represents the probability that the j-th word in the i-th document is assigned the k-th topic. Since each word in the document must be assigned one of the K topics, i.e. Σ_{k=1}^{K} φijk = 1, we introduce this equation as a Lagrange constraint in L(θi, φi) as follows:
\mathcal{L}(\theta_i, \phi_i) = (1 - \alpha) \left( \sum_{j=1}^{N_i} \sum_{k=1}^{K} \left( \phi_{ijk} \log \theta_{ik} \beta_{k, w_{ij}} - \phi_{ijk} \log \phi_{ijk} \right) \right) + \lambda_\phi \left( \sum_{k=1}^{K} \phi_{ijk} - 1 \right) + C(\theta, \mathbf{e}_i) \tag{6.41}
We compute the derivative of the above equation with respect to φijk and set the result to zero in
order to obtain the final update expression for φijk:
\phi_{ijk} = \frac{\theta_{ik} \, \beta_{k, w_{ij}}}{\sum_{k'=1}^{K} \theta_{ik'} \, \beta_{k', w_{ij}}} \tag{6.42}
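The update in Equation 6.42 can be computed for all the words of a document at once; a minimal NumPy sketch (the variable names are hypothetical):

```python
import numpy as np

def update_phi(theta_i, beta, word_ids):
    """phi[j, k] ∝ theta_i[k] * beta[k, w_ij], normalized over the topics k."""
    unnorm = theta_i[None, :] * beta[:, word_ids].T   # shape (N_i, K)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

K, V = 3, 6
rng = np.random.default_rng(2)
theta_i = rng.dirichlet(np.ones(K))          # document-topic distribution
beta = rng.dirichlet(np.ones(V), size=K)     # rows are topic-word distributions
phi = update_phi(theta_i, beta, word_ids=np.array([0, 4, 2, 2]))
```

Each row of the resulting matrix is the topic posterior of one word position and sums to 1 by construction.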
Update expression for the parameter β
After the learning algorithm has iterated over all the triples in the knowledge graph and updated the parameters {e, r, θ, φ}, it can update the parameter β over all the text data, i.e. the text documents of all the entities. The derivation of the update equation for β is similar to that in the standard LDA model, with the introduction of the constraint that, for a given topic k, the probability distribution over all the words in the vocabulary should sum to 1, resulting in the following optimization function for β:
\mathcal{L}(\beta) = (1 - \alpha) \left( \sum_{i=1}^{|\mathcal{E}|} \sum_{j=1}^{N_i} \sum_{k=1}^{K} \sum_{v=1}^{V} \mathbb{I}(w_{ij} = v) \, \phi_{ijk} \log \beta_{k,v} \right) + \sum_{k=1}^{K} \lambda_k \left( \sum_{v=1}^{V} \beta_{k,v} - 1 \right) + C(\theta, \mathbf{e}_i, \phi) \tag{6.43}
Solving the above equation by taking its derivative with respect to βk,v and each λk will yield the
following update expression for βk,v:
\beta_{k,v} = \frac{\sum_{i=1}^{|\mathcal{E}|} \sum_{j=1}^{N_i} \phi_{ijk} \, \mathbb{I}(w_{ij} = v)}{\sum_{v'=1}^{V} \sum_{i=1}^{|\mathcal{E}|} \sum_{j=1}^{N_i} \phi_{ijk} \, \mathbb{I}(w_{ij} = v')} \tag{6.44}
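Equation 6.44 is the usual LDA-style M-step: accumulate the expected counts φ over all (document, position) pairs and normalize each topic row. A minimal sketch, with hypothetical variable names:

```python
import numpy as np

def update_beta(docs, phis, K, V):
    """beta[k, v] ∝ Σ_i Σ_j phi_ijk · I(w_ij = v), normalized over the vocabulary v."""
    counts = np.zeros((K, V))
    for word_ids, phi in zip(docs, phis):   # phi has shape (N_i, K)
        np.add.at(counts.T, word_ids, phi)  # counts[k, w_ij] += phi[j, k]; duplicates accumulate
    return counts / counts.sum(axis=1, keepdims=True)

# toy corpus: two documents over a 3-word vocabulary, K = 2 topics
docs = [np.array([0, 1, 1]), np.array([2, 0])]
phis = [np.full((3, 2), 0.5), np.full((2, 2), 0.5)]
beta = update_beta(docs, phis, K=2, V=3)
```

`np.add.at` is used instead of plain fancy-indexed addition so that repeated word ids within a document accumulate correctly.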
6.3.3 TAKE Algorithm
Having introduced the update expressions for the proposed model parameters {e, r, θ, β}, we now present the proposed TAKE algorithm (Algorithm 3) in detail. The algorithm begins by initializing the KG parameters {e, r} from a uniform distribution (lines 1-3), or by an advanced initialization procedure such as setting the embeddings to pre-trained TransE vectors. The topic model parameters are initialized to the final values obtained after executing the LDA model on the text descriptions of the entities (line 4). The TAKE algorithm then updates the model parameters over multiple epochs until they converge to stable values (lines 5-32). The convergence criterion is that the change in approximate log-likelihood over two consecutive epochs should fall below a given threshold.
Within a given epoch, the algorithm maintains two temporary matrices, Egradient and Rgradient, that store the gradient updates of the entities and relations while iterating over all the triples of the knowledge graph. The algorithm retrieves the indexes of the head hn and tail tn (line 9), the corrupted head hc (line 14) and the corrupted tail tc (line 18) in the matrix E, computes their update expressions for their roles as I(ei = hn), I(ei = tn), I(ei = hc) and I(ei = tc) in the triple (hn, rn, tn) according to Equation 6.27, and updates Egradient at the corresponding indexes. The model uses the values
Algorithm 3 Topic Augmented Knowledge Embeddings (TAKE)
Input: training data T = {(hn, rn, tn)}, n = 1..|T|; set of entities E; set of relations R; text descriptions of entities D = {di}, i = 1..|E|; function F(e) that returns the index of an input entity or relation; λe, λr; C: number of negative examples
Output: entity embeddings E ∈ R^{|E|×K}, relation embeddings R ∈ R^{|R|×K}, topic vectors of descriptions Θ ∈ R^{|E|×K}, topic-word distribution β ∈ R^{K×V}

 1: for l ∈ {E ∪ R} do
 2:   l ← Uniform() or pre-trained embeddings
 3: end for
 4: Θ, β ← LDA
 5: while not converged do
 6:   Set Egradient = 0, Rgradient = 0, Φ ∈ R^{|E|×Ne×K} = 0
 7:   for each (hn, rn, tn) ∈ T do                ▷ update the KG parameters E and R
 8:     Generate C negative triples {(hc, rn, tn)} with the set of corrupt heads hC = {hc}, c = 1..C; C negative triples {(hn, rn, tc)} with the set of corrupt tails tC = {tc}, c = 1..C; and C negative triples {(hn, rc, tn)} with the set of corrupt relations rC = {rc}, c = 1..C
 9:     h ← F(hn); t ← F(tn); r ← F(rn)
10:     Egradient[h, :] ← Egradient[h, :] + (I(ei = h) part of Eqn. 6.27 using E and R)
11:     Egradient[t, :] ← Egradient[t, :] + (I(ei = t) part of Eqn. 6.27 using E and R)
12:     Rgradient[r, :] ← Rgradient[r, :] + (I(rp = r) part of Eqn. 6.33 using E and R)
13:     for each hc in hC do
14:       h ← F(hc)
15:       Egradient[h, :] ← Egradient[h, :] + (I(ei = h) part of Eqn. 6.27 using E and R)
16:     end for
17:     for each tc in tC do
18:       t ← F(tc)
19:       Egradient[t, :] ← Egradient[t, :] + (I(ei = t) part of Eqn. 6.27 using E and R)
20:     end for
21:     for each rc in rC do
22:       r ← F(rc)
23:       Rgradient[r, :] ← Rgradient[r, :] + (I(rp = r) part of Eqn. 6.33 using E and R)
24:     end for
25:   end for
26:   for i = 1, 2, ..., |E| do                    ▷ update the topic parameters Θ and Φ
27:     Update Φ[i, :, :] according to Equation 6.42
28:     Update θi by projected gradient descent on Equation 6.40
29:   end for
30:   E(t+1) ← E(t) + η·α·(λe(Θ − E(t)) + a·Egradient);  R(t+1) ← R(t) + η·α·(−λr·R(t) + Rgradient)
31:   Update β according to Equation 6.44
32: end while
of E and R from the previous iteration in order to compute the update expressions for the above
parameters. A similar procedure is followed to update Rgradient for the relations rn and rc. The algorithm
then updates topic parameters Φ, Θ and β according to Equation 6.42, 6.40 and 6.44 respectively
(line 26-31). Once the algorithm has iterated over all the triples in KB (line 7-25), it updates E
by performing stochastic gradient ascent which considers the prior part I(ei = ei) in Equation
6.27 in addition to the remaining expression already stored in Egradient; likewise, R is updated by
considering I(rp = rp) part in addition to the final value of Rgradient (line 30). This marks the end
of one epoch. After the algorithm has converged, it returns the model parameters E, R, Θ and β
which are utilized for inferring new links in the knowledge graph.
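The overall structure of one epoch — accumulating per-triple gradients into Egradient and Rgradient and then applying the batch update of line 30 — can be sketched as follows. The per-triple gradient expressions here are simple stand-in stubs, not the actual terms of Equations 6.27 and 6.33:

```python
import numpy as np

def take_epoch(E, R, Theta, triples, eta=0.02, alpha=0.5, lam_e=0.1, lam_r=0.1, a=1.0):
    """One TAKE epoch: accumulate per-triple gradients, then apply the batch update."""
    Egrad = np.zeros_like(E)
    Rgrad = np.zeros_like(R)
    for h, r, t in triples:
        # stand-ins for the I(ei = h/t) and I(rp = r) parts of Eqns. 6.27 / 6.33
        Egrad[h] += R[r] * E[t]
        Egrad[t] += R[r] * E[h]
        Rgrad[r] += E[h] * E[t]
    # line 30 of Algorithm 3: add the prior terms and take one gradient-ascent step
    E_new = E + eta * alpha * (lam_e * (Theta - E) + a * Egrad)
    R_new = R + eta * alpha * (-lam_r * R + Rgrad)
    return E_new, R_new

rng = np.random.default_rng(3)
E = rng.normal(size=(4, 6))
R = rng.normal(size=(2, 6))
Theta = rng.dirichlet(np.ones(6), size=4)
E1, R1 = take_epoch(E, R, Theta, triples=[(0, 1, 2), (3, 0, 1)])
```

Note how the prior terms λe(Θ − E) and −λr·R are added only once per epoch, after all triple-level contributions have been accumulated, mirroring line 30 of the algorithm.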
6.4 Experiments
Having discussed the proposed algorithm in the previous section, we now turn to its experimental evaluation, where we solve two benchmark tasks: knowledge graph completion and entity classification. Additionally, we perform a qualitative analysis to answer two more questions, about the interpretability of our model and its effect on sparsely occurring entities in the KG. We employ two benchmark datasets, FB15K and FB20K (Bordes et al., 2013), along with the entity descriptions proposed in Xie et al. (2016), to evaluate our proposed algorithm. The triples in the datasets are extracted from the Freebase knowledge graph (Bollacker et al., 2008), and the description of each entity is crawled from the /ns/common/topic/description relation defined for each entity in Freebase. While in the FB15K dataset we handle in-KB prediction, where all the entities occurring in the test set have already been encountered in the train and validation sets, the FB20K dataset is specifically designed for the zero-shot scenario, where at least one of the entities in every test triple is new and never seen in the train or validation set. The train and validation sets of FB20K are the same as those of FB15K. The detailed statistics of the datasets are provided in Table 6.1.
Table 6.1: Data sets used in our experiments on TAKE model (Xie et al., 2016)

DATASET   #Relations   #Entities   #Train    #Valid   #Test
FB15K     1341         14904       472860    48991    57803
FB20K     1341         19923       472860    48991    30490
6.4.1 Knowledge Graph Completion
The goal of knowledge graph completion is to predict the true triple (h, r, t) when one of the elements h, r or t is missing. A model is evaluated on this task by corrupting the head h (or tail t, or relation r) of a given test triple with every entity (or relation, in the case of r) present in the dataset. The score of each resulting triple is computed according to the scoring function of the proposed model; the triples are then sorted in decreasing order of score to obtain the rank of the true triple among the corrupted triples.
Evaluation Protocol
We report two standard metrics in our experiments: Mean Rank, the mean rank of the true test triples over the entire test data, and Hits@n, the proportion of test triples ranked among the top n scores. Each metric has two flavors: raw and filtered. For the filtered metrics, before computing the rank we remove all corrupted triples that already occur in the train, validation or test datasets, whereas in the raw setting we do not remove them. In our experiments, we report Mean Rank (Raw and Filtered) and Hits@10 (Raw and Filtered) for entities, and Hits@1 (Raw and Filtered) for relations. Lower ranks and higher Hits@n values are better.
Parameter Setting
We implemented our proposed model in C++ and performed a grid search over the following hyper-parameter settings: λr (λe) ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000}, temperature a ∈ {3500, 4000, 4500, 5000}, negative ratio C ∈ {1, 3, 5}, embedding dimension K ∈ {50, 80, 100, 150} and learning rate η ∈ {0.01, 0.02, 0.1, 0.2}, and selected the best combination of hyper-parameters by
Table 6.2: Mean Rank and Hits@10 (entity prediction) for models tested on FB15K dataset

FB15K         Mean Rank          Hits @ 10
              Raw    Filtered    Raw    Filtered
TransE        210    119         48.5   66.1
TransH        212    87          45.7   64.4
DKRL(BOW)     200    113         44.3   57.6
DKRL(ALL)     181    91          49.6   67.4
SSP (Std.)    154    77          57.1   78.6
SSP (Joint)   163    82          57.2   79.0
TAKE          195    72          44.0   60.8
evaluating them on the validation set. Further, we initialize the {θ, β} parameters by running standard LDA (Blei et al., 2003) on the text data, and the stopping criterion of our algorithm is that the change in likelihood between two consecutive epochs should fall below a certain threshold. The optimal settings of our model for knowledge graph completion on the FB15K dataset are: λr = 0.1, λe = 0.1, a = 4500, C = 1, α = 0.5, K = 100 and η = 0.2. We use the same settings for the FB20K dataset when performing the entity classification task.
Since the datasets are the same, we reprint the experimental results of several baselines from the literature. Specifically, we report results for TransE (Bordes et al., 2013) and TransH (Wang et al., 2014b), models that rely on one source - the knowledge graph triples - for learning the knowledge graph embeddings. We also report results for two previous models, DKRL (Xie et al., 2016) and SSP (Xiao et al., 2017), that utilize both the knowledge graph triples and the text descriptions to learn the entity embeddings. Each of these models has two variants: DKRL(BOW) exploits the CBOW encoder (Mikolov et al., 2013) for learning the word embeddings, whereas DKRL(ALL) is a weighted combination of DKRL with a CNN encoder and the TransE model. SSP(Std.) and SSP(Joint) are the two variants of the SSP model (Xiao et al., 2017): SSP(Std.) utilizes pre-trained text embeddings, while SSP(Joint) jointly learns the text and KB embeddings.
Table 6.3: Mean Rank and Hits@1 (relation prediction) for models tested on FB15K dataset

FB15K         Mean Rank          Hits @ 1
              Raw    Filtered    Raw    Filtered
TransE        2.91   2.53        69.5   90.2
TransH        8.25   7.91        60.3   72.5
DKRL(BOW)     2.85   2.51        65.3   82.7
DKRL(ALL)     2.41   2.03        69.8   90.8
SSP (Std.)    1.58   1.22        69.9   89.2
SSP (Joint)   1.87   1.47        70.0   90.9
TAKE          4.42   4.06        33.56  39.33
Results
The results of entity prediction and relation prediction for the FB15K dataset are presented in Tables 6.2 and 6.3, respectively. From the tables, we observe that:

(i) From Table 6.2, we conclude that our proposed model outperforms the existing "triples-only" models on rank by a large margin. Further, our model outperforms the state-of-the-art text-augmented KG models on filtered rank.

(ii) From Table 6.3, we observe that the performance of our model on relation prediction is substandard compared to the state-of-the-art models. This might be because the proposed model has two separate hyper-parameters, λe and λr, for entities and relations: it shows optimal performance for predicting relations and entities under different parameter settings. For instance, we were able to bring the Mean Filtered Rank for relations down to 3.2 on the FB15K dataset when we set the parameters to λr = 0.1, λe = 1, a = 4000, C = 2 and K = 100.
6.4.2 Entity Classification
Most of the entities in the Freebase knowledge graph (Bollacker et al., 2008) have multiple types. For instance, the entity The Queen's College in the Freebase KG has the following types: /base/oxford/college and universities, /organization/organization, /education/university,
etc. The goal of the entity classification task is to predict all the possible entity type labels that an entity may possess. Entity classification, therefore, is a multi-label classification task.
Evaluation Protocol
For our experiments, we use the FB15K entity classification dataset generated in the DKRL work (Xie et al., 2016), which has 13445 entities and 50 entity type labels, randomly split into training and test sets. For the FB20K dataset, the 13445 in-KB entities form the training set and the 5019 new out-of-KB entities form the test set. We obtain the out-of-KB entity embeddings for the FB20K dataset by optimizing Equation 6.39 while setting θi = ei, as proposed in Wang and Blei (2011). Further, for a fair comparison, we train a Logistic Regression classifier in a one-vs-rest setting for the multi-label classification task, as done in the DKRL work. The evaluation metric is Mean Average Precision (MAP), a common metric for evaluating multi-label classification in the literature (Neelakantan and Chang, 2015).
Table 6.4: The MAP Results for entity classification in FB15K and FB20K datasets

Model         FB15K   FB20K
TransE        87.8    -
BOW           86.3    57.5
DKRL(BOW)     89.3    52.0
DKRL(ALL)     90.1    61.9
SSP (Std.)    93.2    -
SSP (Joint)   94.4    67.4
TAKE          83.7    28.1
Results
The results of entity classification are presented in Table 6.4. From the results, we observe that our model performs reasonably well on the FB15K dataset but poorly on the FB20K dataset. We conjecture that the reason for the poor performance on out-of-KB entities in FB20K is as follows: our model allows each dimension of a KG entity embedding to take any range of values at training
Entity Name            #1 Topic                     #2 Topic
NASA                   13: earth, space             99: world, national
F. C. Copenhagen       1: club, football            82: europe, european
ESPN                   62: television, network      66: games, events
81st Academy Awards    30: awards, academy          47: united, states
University of Pisa     44: university, research     71: italian, italy
Happy Feet             90: film, released           29: film, animated
Amazon.com             11: company, business        47: united, states
Blood plasma           50: blood, disease           32: red, color
College rock           39: record, label            44: university, research
Water                  81: water, natural           15: indian, film

Figure 6.3: Interpretability of knowledge graph embeddings on the FB15K dataset. We randomly pick 10 entities from the dataset, represent each entity as a mixture of its top two topics, and further pick the two most probable words in each topic.
time, whereas the topic vector acquired at test time for each document satisfies ∀i, 0 ≤ θik ≤ 1 and Σ_k θik = 1. Because of this difference in the scale of each dimension, the topic vectors cannot presently act as substitutes for entity embeddings. In order to employ topics as surrogates for the out-of-KB entities, the model should be trained by explicitly imposing non-negativity constraints on the entity and relation embeddings of the KG in the objective function of the model (Ding et al., 2018).
6.4.3 Interpretability of the proposed model
To assert that the entity embeddings learnt by our model are interpretable, we perform a qualitative evaluation of the latent space learnt by our model. In this evaluation, we consider the 100-dimensional entity embeddings already learnt over the FB15K dataset in Section 6.4.1. Note that an entity embedding in our model is a combination of the LDA document-topics and the contribution due to the KG triples (Equation 6.27). Since LDA topics are interpretable (Chang et al., 2009), we rely on the LDA topics θ to interpret the meaning of each dimension of a given entity embedding. To view the entity topics, we randomly chose 10 entities from the FB15K dataset (Figure 6.3); for each entity ei, we retrieve the top two topics in the document-topic distribution as argmax_k θik,
Dimension   1         2          3           4            5
#1 Topic    famous    club       works       california   drama
#2 Topic    addition  football   published   state        series

Dimension   6         7           8         9        10
#1 Topic    team      show        islands   stage    played
#2 Topic    national  television  island    career   side

Figure 6.4: Top two topics learnt along each of the first 10 dimensions of a 100-dimensional FB15K entity embedding.
and for each topic k, we further retrieve the most probable words in the topic from the topic-word distribution β as argmax_v βk,v. For example, for the entity NASA, the most probable topic is topic number 13 (among the 100 topics learnt), whose two most probable words (obtained from β) are earth and space; the second most probable topic is topic 99, whose two most probable words are world and national. Although our model is compelling in recovering the right topics for most of the entities, it sometimes retrieves incorrect top topics because of the presence of polysemous words in the document descriptions. For example, in Figure 6.3, the entity Water refers to the movie Water, but the model retrieves the chemical compound water as the top topic.
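The retrieval used in Figure 6.3 amounts to two argsort-style lookups, one on θ and one on β; a minimal sketch with toy values:

```python
import numpy as np

def top_topics_and_words(theta_i, beta, n_topics=2, n_words=2):
    """For the n_topics most probable topics of an entity, return the topic index
    and that topic's n_words most probable vocabulary words."""
    topics = np.argsort(theta_i)[::-1][:n_topics]
    return [(int(k), np.argsort(beta[k])[::-1][:n_words].tolist()) for k in topics]

theta_i = np.array([0.10, 0.60, 0.30])           # document-topic distribution
beta = np.array([[0.50, 0.30, 0.20],             # rows: topic-word distributions
                 [0.10, 0.20, 0.70],
                 [0.45, 0.35, 0.20]])
result = top_topics_and_words(theta_i, beta)
```

With a real model, the returned word indices would be mapped back through the vocabulary to recover readable words such as "earth" and "space".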
By following a similar procedure, we can also interpret the top-n topics along each dimension of any given entity in the FB15K dataset. For example, we show the top two topics along the first 10 dimensions of FB15K entities in Figure 6.4 (omitting the remaining dimensions due to space constraints). This shows that our model is interpretable, as it can retrieve the topic being discussed along each dimension of an embedding.
6.4.4 Effect on sparsely occurring entities
Next, we aim to show that sparsely occurring entities in the KG benefit more from the text documents, while the embeddings of frequently occurring entities are dominated by the KG information. To demonstrate this, we consider a smaller dataset, namely the validation set of FB15K, to keep our results manageable, and train our proposed model on it with the following settings: λr = 1, λe = 1, K = 100, C = 1, epochs = 10 and η = 0.1. In order to observe the effect of the two data sources on
Figure 6.5: The effect of the proposed model on sparsely occurring entities' embeddings. The Y-axis plots the average offset value (e − θ)ᵀ(e − θ) of each embedding, while the X-axis plots the number of times an entity occurs in the KG.
entity embeddings, we compute the offset value (ei − θi)ᵀ(ei − θi) for each entity ei in the KG. This offset value corresponds to how much the KG component dominates an embedding, as can be seen from the entity update in Equation 6.27. We plot the offset value of each entity against the number of times the corresponding entity occurs in the KG in Figure 6.5. Specifically, each Y-axis value is the average offset of all the entities that have the same count in the KG. Further, the embedding values were learnt without performing L2-normalization of the embeddings at the end of each epoch, which accounts for the 1e11 scale on the Y-axis.
As can be seen from the plot, sparsely occurring entities have smaller offset values, indicating that their embeddings benefit more from the topic vectors learnt from text, whereas as an entity occurs more frequently in the KG, its offset value increases, showing that its final embedding deviates away from the topic vectors. This is because the second half of the embedding update equation, following a/λe in Equation 6.27, becomes more prominent, and frequently occurring entities' embeddings are endowed more with KG information.
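The analysis behind Figure 6.5 reduces to computing the offset (ei − θi)ᵀ(ei − θi) per entity and averaging it within each frequency bucket; a small sketch (variable names hypothetical):

```python
import numpy as np
from collections import defaultdict

def avg_offset_by_count(E, Theta, counts):
    """Average squared offset ||e_i - theta_i||^2, grouped by entity frequency in the KG."""
    offsets = np.sum((E - Theta) ** 2, axis=1)
    buckets = defaultdict(list)
    for off, c in zip(offsets, counts):
        buckets[int(c)].append(float(off))
    return {c: float(np.mean(v)) for c, v in buckets.items()}

E = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])  # entity embeddings (toy values)
Theta = np.zeros((3, 2))                             # corresponding topic vectors
result = avg_offset_by_count(E, Theta, counts=[1, 1, 5])
```

Plotting the returned dictionary (count on the X-axis, average offset on the Y-axis) reproduces the shape of the analysis in Figure 6.5.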
6.5 Conclusion
We proposed a novel model that jointly learns the entity and relation embeddings by exploiting both the knowledge graph triples and the text descriptions of entities in a generative setting. The goal of this model was twofold: to instill interpretability in the embeddings by relying on LDA topics, and to attain generalizability in the embeddings by handling out-of-KB entities. Our experimental results show that utilizing multi-modal data definitely helps in learning high-quality embeddings. Although we were able to achieve the goal of interpretability, we could not handle the out-of-KB entities effectively; we also suggested the changes required to attain the goal of generalizability in the proposed model.
CHAPTER 7
TEXT AUGMENTED ADVERSARIAL KNOWLEDGE GRAPH EMBEDDINGS
Much of the previous work on multi-modal learning of knowledge graph embeddings in the presence of entity text descriptions, discussed in the previous chapter, shares one common design philosophy: the two sources of data are considered complementary to each other. In this work, we distance ourselves from these approaches by considering an alternative view of multi-modal learning. Specifically, we consider the two data sources in an adversarial setting and propose a novel model for text-enhanced knowledge graph embeddings.
7.1 Introduction
Knowledge Graphs (KG) (Bollacker et al., 2008; Miller, 1995; Suchanek et al., 2007) incorporate
rich information in them which can be leveraged to solve important AI problems such as ques-
tion answering (Bordes et al., 2014), coreference resolution (Ng and Cardie, 2002), web search
(Szumlanski and Gomez, 2010) and recommender systems (Zhang et al., 2016). As discussed in
the previous chapter in detail, knowledge graph embedding models have been front runners in ex-
ploiting the rich information present in the KGs. While early work on these models mainly focused
on leveraging only the knowledge graph triples to learn the distributional representation of entities
and relations (Bordes et al., 2013; Lin et al., 2015; Yang et al., 2015; Trouillon et al., 2016), lately,
the focus has shifted to exploit additional sources of information such as images (Xie et al., 2017),
types (Chang et al., 2014; Ma et al., 2017; Xie et al., 2016) and the text description of entities (Xie
et al., 2016; Wang and Li, 2016; Xiao et al., 2017) in order to learn high-quality KG embeddings.
Among the models utilizing auxiliary sources of information, those exploiting the entities' text descriptions have been the most popular.
As already discussed in Chapter 6, Section 6.2.2, the major work on text-enhanced knowledge graph embeddings falls into three categories: (i) models in which the semantic text embeddings and the
KB embeddings lie in the same feature space and can further be combined in a novel way (Wang et al., 2014a; Xie et al., 2016; Zhong et al., 2015); (ii) models in which the semantic text embedding and the KB embedding lie in different distributional spaces and a novel variant is proposed to bring the text embedding onto the KB feature space (Shah et al., 2019; Wang and Li, 2016; Xiao et al., 2017); (iii) models employing a novel attention mechanism to focus on specific words in the text descriptions of entities (Shi and Weninger, 2018; Xu et al., 2017). Although the above models are effective, they share a common underlying assumption that the two knowledge sources are complementary to each other. Most of them have not exploited the power of adversarial models when working with multi-modal data in KGs, even though such models have been shown to be successful in multiple applications (Fedus et al., 2018; Ma et al., 2017; Zhu et al., 2017).
We consider an alternative setting in our proposed work where the text descriptions and the
knowledge graph triples are posed in an adversarial setting. We consider the model generating
the semantic text embeddings from entity descriptions as generator and the model working with
the knowledge graph triples as discriminator. In GAN terminology, these semantic embeddings
form the counterfeit currency produced by the generator in order to deceive the discriminator and
the goal of generator is to generate the text embeddings as similar to KB entity embeddings as
possible. On the other hand, the goal of the discriminator is to learn to distinguish between entity
embeddings generated from knowledge graph triples data (pdata) and the counterfeit embeddings
(pz), i.e. the text embeddings generated by the generator. Because of this competition, both players improve their methods, which in turn produces high-quality KG embeddings.
To the best of our knowledge, we are the first to consider the entity text descriptions and the knowledge graph triples in an adversarial setting and to propose a novel model that enhances knowledge graph embeddings with text descriptions. It must be mentioned that our proposed method is model-agnostic: we can employ any existing knowledge graph embedding model, such as TransE (Bordes et al., 2013) or DistMult (Yang et al., 2015), inside the discriminator.
7.2 Related Work
Much of the related literature covering text-enhanced knowledge graph embeddings has already been discussed in Sections 6.2.1 and 6.2.2. However, three past research directions have specifically considered knowledge graph embeddings in an adversarial setting. We discuss them in detail here.
Typically, negative examples in knowledge graph embedding models are generated by corrupting either the head or the tail of a given positive example, via either random sampling or a Bernoulli distribution (Wang et al., 2014b), which is not necessarily the optimal way of generating negative examples. Two related research directions (Cai and Wang, 2018; Wang et al., 2018) have employed GANs to generate negative examples more intelligently. In both, the goal of the discriminator is to train a standard KB embedding model, whereas the purpose of the generator is to produce hard negative examples such that the discriminator finds it difficult to distinguish between a given positive example and a negative example produced by the generator. The formulations of the two models are nearly identical; they were proposed by two independent research groups around the same time.
The work most closely related to ours is Qin et al. (2020), which employs a GAN to solve zero-shot learning in KB embeddings by exploiting the text description of a given relation. For a given triple (h, r, t), they exploit the text description of the relation r to learn its TF-IDF vector representation, which forms the noise distribution of the generator. The neighborhood information of both h and t is exploited to learn a joint embedding of the pair (h, t), which forms the true data distribution of the discriminator. This is followed by training a Wasserstein GAN (Arjovsky et al., 2017) over these two distributions. Our proposed model differs from theirs considerably: we do not utilize a GAN (or stacked neural networks) directly to train our model; rather, our model is adversarial because the two sub-modules designed inside it satisfy opposing constraints.
Inspired by the enormous success of GAN models (Zhu et al., 2017; Minervini et al., 2017; Chen et al., 2016) in other applications, we next propose a novel multi-modal knowledge graph embedding model trained in an adversarial setting.
7.3 Adversarial Approach to learning KB embedding model
The problem under consideration is the same as that in Chapter 6. The input is a knowledge graph K = {E, R, T, S}, where E, R and T = {(h, r, t)}_{n=1}^{|T|} are the sets of entities, relations and knowledge graph triples. In addition, the model can take advantage of a set of documents S = {di}_{i=1}^{|E|}, where each document di is the textual description of an entity ei ∈ E present in the knowledge graph K. The goal is to learn knowledge graph embeddings by utilizing the two sources of data: the triples T and the entity descriptions S.
We aim to train a novel knowledge graph embedding model that poses the two sources of data in an adversarial setting. We develop two models in our formulation: a generator G and a discriminator D. The goal of the generator is to generate a semantic text embedding for each entity ei ∈ E by utilizing the document di ∈ S. In particular, the generator tries to deceive the discriminator into believing that the text embeddings are entity embeddings obtained from the knowledge graph triples T. It accomplishes this by employing constrained optimization over the text descriptions, where the constraint forces each text embedding to lie close to the corresponding KG entity embedding in the low-dimensional feature space. On the other hand, the discriminator D aims at learning knowledge graph embeddings from the triples T while ensuring that it drives the KG entity embeddings away from the text embeddings generated by G. We conjecture that, because of the additional constraints the proposed model has to satisfy, it will result in high-quality embeddings. After training is over, the embeddings learnt by the discriminator are retained as the final KB embeddings.
Before we formulate the model, it must be mentioned that there are two sets of parameters in our model. The discriminator parameters ΘD = {E, R} consist of the KG entity embeddings E ∈ R^{|E|×k} and relation embeddings R ∈ R^{|R|×k}. For a given triple (h, r, t), an embedding can be obtained by its index in the corresponding matrix, i.e. E(h,:)ᵀ ≡ h, E(t,:)ᵀ ≡ t and R(r,:)ᵀ ≡ r, where h, r, t ∈ R^k
are the knowledge graph embeddings. The ΘD parameters are optimized inside the discriminator
model. The generator parameters ΘG = {W, M} are optimized inside the generator. Also, the parameter E is passed from the discriminator to the generator and is kept constant inside the generator, while the parameter M is passed from the generator to the discriminator and is kept constant inside the discriminator. We now present the technical details of the model, starting with the generator.
7.3.1 The Generator Design
The aim of the generator G is to learn a text embedding (or topic vector) for each entity present in the KG by tapping into the semantic information present in the entity description documents S, while at the same time ensuring that each text embedding lies close to the corresponding KG entity embedding. We employ Constrained Non-negative Matrix Factorization (CNMF) (Liu and Wu, 2010; Xiao et al., 2017) over the entity descriptions in order to achieve these two goals. We first represent the entity descriptions S as a matrix C ∈ R^{|V|×|E|}, each column ci ∈ R^{|V|} of which is a count vector for entity ei extracted from description di. Specifically, each entry cji in C is the number of times the j-th vocabulary word occurs in the entity description of ei. The goal of non-negative matrix factorization is to find two factors W ∈ R^{|V|×k} and Sᵀ ∈ R^{k×|E|} that, when multiplied together, approximate the original matrix C. That is,

C \approx W S^{\mathsf{T}} \tag{7.1}
where each row si ∈ R^k of the matrix S represents the semantic text embedding of entity ei. The factorization can be achieved by minimizing the Frobenius norm of the difference between the original matrix C and the product of the factors W and S:

O = \|C - W S^{\mathsf{T}}\| \tag{7.2}
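The factorization in Equation 7.2 can be computed with the standard multiplicative update rules of non-negative matrix factorization; a minimal sketch (the dimensions, iteration count and seed are illustrative, and this is the unconstrained factorization, without the entity-embedding constraint introduced below):

```python
import numpy as np

def nmf(C, k, iters=200, eps=1e-9, seed=0):
    """Minimize ||C - W S^T||_F with non-negative W (|V| x k) and S (|E| x k)."""
    rng = np.random.default_rng(seed)
    W = rng.random((C.shape[0], k)) + eps
    S = rng.random((C.shape[1], k)) + eps
    for _ in range(iters):
        W *= (C @ S) / (W @ S.T @ S + eps)    # multiplicative update for W
        S *= (C.T @ W) / (S @ W.T @ W + eps)  # multiplicative update for S
    return W, S

C = np.random.default_rng(1).random((8, 5))   # toy non-negative count matrix
W, S = nmf(C, k=3)
err = np.linalg.norm(C - W @ S.T)
```

The multiplicative form of the updates keeps both factors non-negative throughout training, which is what makes the rows of S usable as topic-like text embeddings.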
In order to incorporate the constraint that the text embeddings should lie close to the corresponding
KG entity embeddings, we consider the KG entity embedding as matrix E ∈ R|E|×k. The matrix
E is a discriminator parameter optimized in D and is kept constant inside the generator G's optimization. Further, as the text and the KG entity embeddings are derived from two
different sources of information, we introduce a projection matrix M ∈ R^{k×k} that projects the KG
entity embeddings from the triple feature space onto text feature space. Our goal is to bring the
resultant projected KG entity embeddings close to text embeddings by introducing the following
constraint in the model:
S = EM (7.3)
In the above equation, the matrix E represents the constraint (and hence is considered a constant by the generator) that we wish the text embeddings to follow.
Note that this model can be extended to the out-of-KB setting, where some entities, E_in, have the two data sources defined above, whereas the remaining entities, E_out, do not participate in any KG triple, so the model only has access to their text descriptions. In that case, the constraint in Equation 7.3 is modified to S = AM, where the matrix A ∈ R^{(|E_in|+|E_out|)×(k+|E_out|)} is defined as:
A = [ E_{|E_in|×k}            0
      0          I_{|E_out|×|E_out|} ]    (7.4)
Here, I_{|E_out|×|E_out|} is an identity matrix and M ∈ R^{(k+|E_out|)×k} is the projection matrix. In this case, the entities in E_in satisfy the usual constraint that a given text embedding should lie close to the corresponding KB entity embedding. However, for entities in E_out, the text embedding is incorporated into the matrix M itself, and these entities cannot be constrained to lie close to KB entity embeddings, as KB embeddings are not available for entities in E_out. For now, we only consider the case in Equation 7.3, where both sources of data are available for each entity, i.e., |E_out| = 0.
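Under the stated shapes, the block matrix A of Equation 7.4 can be assembled directly with `np.block`; the dimensions below are illustrative:

```python
import numpy as np

def build_A(E_in, n_out):
    """Constraint matrix A of Equation 7.4: the in-KB block carries the KG
    embeddings E_in (|E_in| x k); out-of-KB entities get an identity block."""
    n_in, k = E_in.shape
    return np.block([
        [E_in, np.zeros((n_in, n_out))],
        [np.zeros((n_out, k)), np.eye(n_out)],
    ])

E_in = np.arange(6.0).reshape(3, 2)   # 3 in-KB entities, k = 2
A = build_A(E_in, n_out=2)            # shape (3 + 2, 2 + 2) = (5, 4)
```

When `n_out = 0` the matrix A reduces to E and the constraint S = AM recovers Equation 7.3.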
By introducing the constraint posed in Equation 7.3 into the objective function proposed in Equation 7.2, the objective becomes:

O = ‖C − W(EM)^T‖ = ‖C − W M^T E^T‖    (7.5)
The non-negativity constraints in the modified objective function above are that each element w_ij of the matrix W and each element Σ_k e_ik m_kj of the product EM should be non-negative. Let α_ij and β_ij be the Lagrange multipliers for the constraints w_ij ≥ 0 and Σ_k e_ik m_kj ≥ 0 respectively, and define the matrices α ∈ R^{|V|×k} and β ∈ R^{|E|×k}. This results in the final optimization function of the generator G:

L_G(W, M) = ‖C − W M^T E^T‖² + Tr(αW^T) + Tr(β(EM)^T)    (7.6)
Parameter Learning in Generator
Following the optimization steps of the Constrained Non-Negative Matrix Factorization work (Liu and Wu, 2010), we expand the objective function in Equation 7.6 as:
L_G(W, M) = Tr((C − W M^T E^T)(C − W M^T E^T)^T + αW^T + β(EM)^T)
          = Tr(C C^T − 2 C E M W^T + W M^T E^T E M W^T + αW^T + β(EM)^T)
          = Tr(C C^T) − 2 Tr(C E M W^T) + Tr(W M^T E^T E M W^T) + Tr(αW^T) + Tr(β(EM)^T)    (7.7)
Tr(·) denotes the trace of a matrix, obtained as the sum of its diagonal entries (equivalently, the sum of its eigenvalues) (Lipschutz, 1968). Taking the derivatives of L_G(W, M) with respect to W and M and setting them to zero, we get the following:
∂L_G/∂W = −2 C E M + 2 W M^T E^T E M + α = 0    (7.8)

∂L_G/∂M = −2 E^T C^T W + 2 E^T E M W^T W + E^T β = 0    (7.9)
Multiplying Equation 7.9 by M^T on both sides, we get:

−2 M^T E^T C^T W + 2 M^T E^T E M W^T W + M^T E^T β = 0
−2 M^T E^T C^T W + 2 M^T E^T E M W^T W + (EM)^T β = 0    (7.10)
In the resulting Equations 7.8 and 7.10, we use the Kuhn-Tucker conditions α_ij w_ij = 0 and (Σ_k e_ik m_kj) β_ij = 0, which generate the following equations:

(C E M)_ij w_ij − (W M^T E^T E M)_ij w_ij = 0    (7.11)

(W^T C E M)_ij m_ij − (W^T W M^T E^T E M)_ij m_ij = 0    (7.12)
These equations result in the final update rules for w_ij and m_ij (Lee and Seung, 2001):

w_ij ← w_ij (C E M)_ij / (W M^T E^T E M)_ij    (7.13)

m_ij ← m_ij (W^T C E M)_ij / (W^T W M^T E^T E M)_ij    (7.14)
7.3.2 The Discriminator Design
The goal of the discriminator is to learn the relation and entity embeddings for the triples T present in the KG K, while ensuring that it learns to discriminate between the entity embeddings E learnt from the KG and the semantic embeddings S learnt from the text. Although our discriminator is model-agnostic and can utilize any existing scoring function, such as TransE (Bordes et al., 2013), TransH (Wang et al., 2014b), or TransR (Lin et al., 2015), we employ the DistMult scoring function (Yang et al., 2015) for a given triple (h, r, t):

φ(h, r, t) = Σ_{l=1}^{k} h_l r_l t_l    (7.15)
where h, r, t ∈ R^k are the KG embeddings of a given triple (h, r, t) in the KG. We further utilize the sigmoid function, similar to (Trouillon et al., 2016), to generate the probability of a triple (h, r, t) being true (or false):

P_D(Y | (h, r, t)) = σ(−Y φ(h, r, t))    (7.16)
where Y ∈ {1, −1} is the label of the true (or false) triple. The above probability represents the true (data) distribution probability (p_data), and the goal of the discriminator is to maximize it. At the same time, the discriminator aims to learn to distinguish the KB entity embeddings from the text embeddings. In order to achieve that, the discriminator acquires the text embeddings of the head and tail of a given triple as follows:

h_s = S^T_(h,:) = (E_(h,:) M)^T = M^T h  and  t_s = S^T_(t,:) = (E_(t,:) M)^T = M^T t    (7.17)
where h_s and t_s are the text embeddings for the head and the tail of triple (h, r, t) in the discriminator. They are obtained by applying the generator's constraint in Equation 7.3. M ∈ R^{k×k} is the projection matrix proposed in Equation 7.3, which is considered a constant in the discriminator model. The resulting scoring function of the discriminator for the text embeddings is as follows:

φ(h_s, r, t_s) = Σ_{l=1}^{k} h_{s,l} r_l t_{s,l}    (7.18)
The probability of a triple being true based on the text embeddings of the head and the tail is given by:

P_G(Y | (h_s, r, t_s)) = σ(−Y φ(h_s, r, t_s))    (7.19)
Note that the probability in Equation 7.19 represents the probability of the noise p_z(z) generated by the generator. The goal of the discriminator is to maximize the probability of the true data in Equation 7.16 while simultaneously minimizing the probability of the noise in Equation 7.19. In order to achieve this, it maximizes the following objective function:

L_D(E, R) = E_{Y∼P_D}[log P_D(Y | h, r, t)] + E_{Y∼P_G}[log(1 − P_G(Y | h_s, r, t_s))]    (7.20)
7.4 The Proposed Algorithm

After explaining our generator and discriminator in detail, we now present our proposed algorithm, Text Augmented Adversarial Knowledge graph Embedding (TAAKE) (Algorithm 4), in detail here. The input to the algorithm is two data sources: the knowledge graph triples T and the text
Algorithm 4 Text Augmented Adversarial Knowledge Embeddings (TAAKE)

Input: training triples T = {(h_n, r_n, t_n)}_{n=1}^{|T|}, set of entities E, set of relations R, text descriptions of entities S = {d_i}_{i=1}^{|E|}
Output: knowledge graph embeddings Θ = {E, R, W, M} trained in the adversarial setting

1:  while not converged do
2:    Sample a mini-batch of data T_batch from the triples T
3:    Set Ψ = ∅                              ▷ entities to be optimized by generator G
4:    for each triple (h, r, t) ∈ T_batch do ▷ discriminator D optimization
5:      Ψ = Ψ ∪ {h} ∪ {t}
6:      Generate C negative examples for the given triple
7:      Update the discriminator parameters Θ_D = {E, R} for the given triple according to
8:        Equation 7.20 for both positive and negative examples:
9:      θ_D = θ_D + η ∇_{θ_D} L_D(E, R)
10:   end for
11:   for s in Ψ do                          ▷ generator G optimization
12:     Update the generator parameters Θ_G = {W, M} for entity s according to the
13:       iterative update algorithm proposed in Equations 7.13 and 7.14
14:   end for
15: end while
descriptions of the entities S. The algorithm optimizes and returns four model parameters Θ = {E, R, W, M} at its completion. The algorithm iterates until the stopping criterion is met, which is a fixed number of iterations, while allowing early stopping when the rank of the true triples starts increasing over the validation set. Within each epoch (lines 1-15), the discriminator D first optimizes the triple parameters (lines 4-10). It samples a batch of triples T_batch (line 2) and, for each triple in the batch, generates C negative examples (line 6) by corrupting either the head h or the tail t of the given triple. The discriminator parameters Θ_D = {E, R} are updated for each triple by performing stochastic gradient ascent (lines 8-9) on the objective function in Equation 7.20. After the discriminator has processed one batch, control is passed to the generator G, which updates the text embedding parameters for all the entities that occurred in the given batch T_batch (lines 11-14). It updates the generator parameters Θ_G = {W, M} by the iterative update algorithm proposed in Equations 7.13 and 7.14. After the optimization is over, the model parameters E, R can be utilized for link prediction on test data.
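The control flow of Algorithm 4 can be sketched as the following skeleton, where `update_D` and `update_G` are stand-ins for the gradient-ascent step of Equation 7.20 and the multiplicative updates of Equations 7.13 and 7.14 (the function and argument names are ours, not part of the original formulation):

```python
import random

def taake_epoch(triples, entities, n_neg, update_D, update_G, batch_size=32):
    """One epoch of Algorithm 4: a discriminator pass over a sampled batch,
    then a generator pass over the entities seen in that batch."""
    batch = random.sample(triples, min(batch_size, len(triples)))
    psi = set()                                  # entities to be optimized by G
    for (h, r, t) in batch:                      # discriminator D optimization
        psi.update((h, t))
        negatives = []
        for _ in range(n_neg):                   # corrupt either head or tail
            if random.random() < 0.5:
                negatives.append((random.choice(entities), r, t))
            else:
                negatives.append((h, r, random.choice(entities)))
        update_D((h, r, t), negatives)           # SGA step on Equation 7.20
    for s in psi:                                # generator G optimization
        update_G(s)                              # Equations 7.13 and 7.14
    return psi
```

Repeating `taake_epoch` until the validation rank of the true triples stops improving reproduces the outer `while` loop of Algorithm 4.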
A rigorous evaluation of the proposed model is an immediate direction. Because of the adversarial constraints, high-quality embeddings {E, R} of a given KG will be returned by the discriminator, which can be utilized for knowledge graph completion and triple classification tasks (Bordes et al., 2013). It would also be interesting to observe the performance of the model when the text embeddings S are utilized as a substitute for the KG entity embeddings E. Because of the constraint that brings the two closer, we should see superior performance when substituting the KG entity embeddings with the text embeddings. Further, we hypothesize that, because the model utilizes two data sources to learn the KG embeddings, its performance would be superior to that of existing models such as TransE (Bordes et al., 2013) and TransH (Wang et al., 2014b), which rely on a single source of data, namely the KG triples.
CHAPTER 8
CONCLUSION
Our key goal in this dissertation was to develop methods that facilitate a tighter integration of neuro-symbolic systems that are scalable and endowed with symbolic reasoning capabilities. To this effect, we have proposed a set of novel neuro-symbolic architectures inside the framework of neural SRL that can reason about the complex interactions present in relational data. In addition, we have presented solutions to novel problems encountered in the knowledge graph embedding sub-field in order to make these models more generic for handling relational data. While these models take significant steps towards achieving true neuro-symbolic AI, there are some sub-problems that still need to be investigated. We elucidate a few of these problems in detail and propose suitable solutions to them.
8.1 Future Directions
8.1.1 Knowledge Graph Alignment
Though highly useful in solving AI tasks, an important downside of current knowledge graphs is that each of them has been developed by an independent organization, by crawling facts from different sources or by utilizing different algorithms. This results in knowledge graphs in different languages, formats, and structures. As a result, the knowledge embodied in these different graphs is heterogeneous and complementary (Zhu et al., 2017). This necessitates integrating them in order to form one unified knowledge graph that would be a richer source of knowledge for solving AI problems more effectively. As a first step towards integrating these knowledge graphs, one needs to address the following issues, which are collectively known as knowledge graph alignment: (i) entity alignment (entity resolution), which aims at finding entities in the different knowledge bases being integrated that, in fact, refer to the same real-world entity; (ii) triple-wise alignment, which focuses on finding triples in two knowledge graphs that refer to the same real-world fact. For instance, even though the triple (m.030cx, tv program genre, m.01z4y) in Freebase and the triple (Friends,
genre, Comedy) in DBpedia refer to the same fact (the sitcom Friends has the comedy genre), they are represented with different identifiers for entities and relations in the two knowledge graphs.
Motivated by their success on single-knowledge-graph problems, embeddings have more recently been employed to perform knowledge graph alignment across multiple knowledge graphs. One of the early works along this line is by Chen et al. (2016), which encodes the entities and relations of two knowledge graphs into two separate embedding spaces and proposes three methods of transitioning from an embedding to its counterpart in the other space. Following this, more advanced approaches for knowledge graph alignment have been proposed, which can be divided into three main categories:
• The first set of models overcome the problem of the low availability of aligned entities and aligned triples across multiple knowledge graphs. As the low availability of training data can hinder the performance of a model, these works increase the size of the training data either iteratively (Zhu et al., 2017), via a bootstrapping approach (Sun et al., 2018), or by a co-training technique (Chen et al., 2018).
• Another line of research is based on the idea that, in addition to the knowledge in standard relation triples, there is rich semantic knowledge present in knowledge graphs in the form of properties and text descriptions of entities, which can be harnessed to improve the performance of a model (Sun et al., 2017; Zhang et al., 2019; Zhu et al., 2019).
• The third line of research is focused on designing models that overcome the limitations of translation-based embedding models (Li et al., 2018); these works exploit standard Graph Convolutional Networks (Wang et al., 2018), their relational variants (Wu et al., 2019; Ye et al., 2019), and the Wasserstein GAN (Pei et al., 2018) in order to learn the embeddings of entities and relations in multiple knowledge graphs.
Motivation: We propose a novel knowledge base alignment technique based upon string edit distance that addresses the following limitations of the existing models (Kaur et al., 2020a):
• Even though past techniques have exploited the supplementary knowledge present in KBs in the form of text descriptions of entities and properties of entities as attributional embeddings, none of them has exploited the rich semantic knowledge present in the type descriptions of the entities. As shown in the past (Chang et al., 2014; Ma et al., 2017; Xie et al., 2016; Krompaß et al., 2015), incorporating type information into a single-KB model increases the predictive performance of the model. Likewise, we conjecture a performance improvement in the knowledge alignment task by utilizing the type information. Further, the use of type information can help the model deal with polysemy issues present in KBs.
• We consider multiple possible interactions between the triples of two knowledge graphs by performing all possible edit distances between two triples. This is different from the linear transformation model (Chen et al., 2016), which considers only one possible transformation between corresponding entities/relations in two triples. Multiple transformations allow multiple ways in which two similar triples can be brought closer in the embedding space.
• Finally, all past models have considered triple-wise alignment between binary relations, while our proposed model can find the similarity between relations of any arity. For instance, if our task is to perform threshold-based classification between two relations, say, distance(advisedby(william, lisa), coauthor(william, lisa, tom)) < θ, where θ is the threshold for positive classification, then our proposed model can find the edit distance between the two relations of different arity.
Knowledge alignment by string edit distance in embedding space: We consider a multi-lingual knowledge base K that consists of a set L of languages. Specifically, we consider ordered language pairs (L1, L2) ∈ L², where each language L1 = (E1, R1, T1) consists of a set of
Figure 8.1: A finite state transducer. The operation a : b represents that the finite state transducer reads input character a ∈ x and outputs character b ∈ y.
entities E1, relations R1, and triples T1 = {r1(h1, t1)}.¹ Similarly, L2 = (E2, R2, T2). We aim at finding the distance between triples (T1, T2) ∈ (L1, L2) such that the distance between aligned triples is always less than that between misaligned triples. This is because entities (and relations) that participate in similar triples, being semantically similar, will lie close to each other in the embedding space. Formally,
dist(r1(h1, t1), r2(h2, t2)) < dist(r1(h1, t1), rq(hq, tq))    (8.1)

where r1(h1, t1) ∈ T1, r2(h2, t2) ∈ T2, and rq(hq, tq) ∈ T′2. The corrupted sample set T′2 is defined as T′2 = {rq(h2, t2) | ∀rq ∈ R2} ∪ {r2(hq, t2) | ∀hq ∈ E2} ∪ {r2(h2, tq) | ∀tq ∈ E2}, where r2(h2, t2) ∈ T2 is a true triple existing in language L2.
String-edit distance: The distance function of our model is inspired by the edit distance computation between a pair of strings (x, y) by the memoryless stochastic transducer proposed by Ristad and Yianilos (1998). The idea is that a transducer (Figure 8.1) receives an input string x and
¹Constants used to represent entities and relations in the domain are written in lower-case (e.g., r1, h1); sets of entities and relations are capitalized (e.g., E1, R2).
performs a sequence of edit operations until it reaches the terminal state, at which point it outputs string y. The edit operations δ(z) performed by the transducer are defined as: δ(a, b), substitution of character a ∈ x by character b ∈ y; δ(a, ε), deletion of character a ∈ x; and δ(ε, b), insertion of character b ∈ y. One sequence of edit operations between (x, y), called an edit sequence, is defined as the product of all the edit operations along the sequence. The total edit distance between the pair of strings is defined as the sum over all the edit sequences ed_q:

dist(x, y) = Σ_{ed_q} Π_{δ(z) ∈ ed_q} δ(z)    (8.2)
The cost of each edit operation δ(z) is learnable and was optimized by the EM algorithm (Moon, 1996) in the original model (Ristad and Yianilos, 1998; Oncina and Sebban, 2006).
String-edit operation δ(z): Inspired by the learning of string-edit distance by Ristad and Yianilos (1998), our goal is to compute the distance between the two triples in Equation 8.1 by formulating them as a pair of strings. We consider each aligned triple pair (T1, T2) ∈ (L1, L2) such that T1 ∈ L1 is analogous to the input string x and T2 ∈ L2 is analogous to the output string y. Specifically, by considering a triple rj(ei, ek) as the string rjeiek, the edit distance computation between two strings can be performed by making the following assumptions:
• Our basic unit of edit operation is one entity e or one relation r. Further, each entity and each relation is represented by a low-dimensional embedding.
• Our basic edit operations are: (a) substitution of an entity or a relation in T1 ∈ L1 by another entity or relation in T2 ∈ L2, i.e., δ(e1, e2), δ(e1, r2), δ(r1, e2), δ(r1, r2) for every e1 ∈ E1, e2 ∈ E2, r1 ∈ R1, r2 ∈ R2; (b) deletion of an entity or relation present in T1 ∈ L1, i.e., δ(e1, ε), δ(r1, ε) for every e1 ∈ E1, r1 ∈ R1; (c) insertion of an entity or relation present in T2 ∈ L2, i.e., δ(ε, e2), δ(ε, r2) for every e2 ∈ E2, r2 ∈ R2. We aim to perform these edit operations in the embedding space.
Figure 8.2: Knowledge graph alignment by string-edit distance in embedding space.
As can be seen, some of the edit operations, such as δ(e, r) and δ(r, e), are semantically incorrect. To overcome this, we consider three embedding spaces: entity-space, relation-space, and string-space (Figure 8.2). This ensures that the original entities' (or relations') information is preserved while they participate in the string-edit distance computation. Secondly, this also guarantees that entities are semantically different from relations, as we locate them in separate vector spaces (Lin et al., 2015).
Specifically, we model all the entities in languages L1 and L2 to reside in a ke-dimensional embedding space, i.e., ∀e1 ∈ E1, e2 ∈ E2: e1 ∈ R^{ke}, e2 ∈ R^{ke}. Further, all the relations in L1 and L2 lie in a kr-dimensional embedding space, i.e., ∀r1 ∈ R1, r2 ∈ R2: r1 ∈ R^{kr}, r2 ∈ R^{kr}. In order to perform the edit operations between two triples (T1, T2) ∈ (L1, L2), their constituent entities and relations are first projected onto the ks-dimensional string-space. For example, the embeddings corresponding to the triples r1(h1, t1) ∈ T1 and r2(h2, t2) ∈ T2 in Equation 8.1 are projected onto the string-space as follows:

r^s_1 = r1 M_{r1},  r^s_2 = r2 M_{r2}    (8.3)

h^s_1 = h1 M^{r1}_{h1-type},  t^s_1 = t1 M^{r1}_{t1-type},  h^s_2 = h2 M^{r2}_{h2-type},  t^s_2 = t2 M^{r2}_{t2-type}    (8.4)

where r1, r2 ∈ R^{kr}; h1, h2, t1, t2 ∈ R^{ke}; M_{r1}, M_{r2} ∈ R^{kr×ks}; and M^{r1}_{h1-type}, M^{r1}_{t1-type}, M^{r2}_{h2-type}, M^{r2}_{t2-type} ∈ R^{ke×ks}. Also, we enforce the constraints that the embeddings and
the projection matrices lie inside the unit ball, i.e., ‖r^s‖2 ≤ 1, ‖h^s‖2 ≤ 1, ‖t^s‖2 ≤ 1, ‖r M_r‖2 ≤ 1, ‖e M^r_{e-type}‖2 ≤ 1. The matrices M_{r1} and M_{r2} are projection matrices that project the relations from the relation-space to the string-space. Similarly, M^{r1}_{h1-type} is a projection matrix that projects entities from the entity-space to the string-space. More specifically, the projection matrices M^{r1}_{h1-type} and M^{r1}_{t1-type} are type-matrices that encode the types of the entities h1 and t1 under the relation r1, respectively. The total number of type-matrices is equal to the total number of possible entity types in a knowledge base.
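A sketch of the projection step of Equations 8.3 and 8.4, with the unit-ball constraint enforced by renormalization, is shown below; the type matrix here is a randomly initialized placeholder:

```python
import numpy as np

def to_string_space(x, P):
    """Project an embedding into the k_s-dimensional string-space
    (Equations 8.3-8.4) and rescale so it stays inside the unit ball."""
    xs = x @ P
    norm = np.linalg.norm(xs)
    return xs / norm if norm > 1.0 else xs

rng = np.random.default_rng(2)
ke, kr, ks = 4, 3, 5
h1 = rng.normal(size=ke)                 # head entity embedding (entity-space)
r1 = rng.normal(size=kr)                 # relation embedding (relation-space)
M_r1 = rng.normal(size=(kr, ks))         # relation-space -> string-space
M_h1_type = rng.normal(size=(ke, ks))    # type matrix of h1 under r1 (placeholder)
hs1 = to_string_space(h1, M_h1_type)
rs1 = to_string_space(r1, M_r1)
```

After projection, both `hs1` and `rs1` live in the same ks-dimensional space, which is what makes the mixed edit operations below well defined.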
Once the entities and relations of the aligned pairs have been projected to the string-space,
they are considered semantically equal. Henceforth, they represent characters of strings upon
which we perform string-edit distance operations in the string-space. Consequently, aligned triples
(T1, T2) = (r1(h1, t1), r2(h2, t2)) provided as training data represent the transformed triples (T1, T2) = (r^s_1(h^s_1, t^s_1), r^s_2(h^s_2, t^s_2)) after projection. These transformed triples are modeled as the string pair (x, y) = (r^s_1 h^s_1 t^s_1, r^s_2 h^s_2 t^s_2) in string-space, where each character of a string has a corresponding embedding, obtained by the projection operation on the entities and relations residing in their original embedding spaces. As a next step, we consider the embeddings of the characters of string x as the set a = {r^s_1, h^s_1, t^s_1} and of string y as b = {r^s_2, h^s_2, t^s_2}, and define the edit operations (substitution, deletion, and insertion) as follows:
• the substitution operation is the difference between the embedding of character a in the input string x and character b in the output string y, i.e., δ(a, b) = (a − b), a, b ∈ R^{ks};

• the deletion operation δ(a, ε) is the difference between the embedding of character a in the input string x and a special null embedding ε: δ(a, ε) = (a − ε), a ∈ R^{ks};

• the insertion operation δ(ε, b) is the difference between the special null embedding ε and the embedding of character b in the output string y: δ(ε, b) = (ε − b), b ∈ R^{ks}.
The next step after computing the edit operations is determining the edit sequences between the string pair, which is explained as follows.
Edit-sequence and edit-distance computation: As discussed previously, an edit-sequence is a sequence of edit operations δ(z) performed between a pair of strings (x, y), starting at the input string x and reaching the output string y. We define one edit-sequence as the element-wise product of the vectors obtained as a result of the edit operations δ(z) between the string pair (x, y), followed by a squared L2-norm, in order to obtain a scalar value for one possible edit distance between (x, y). Formally,

ed_q(r1(h1, t1), r2(h2, t2)) = ‖⊙(δ(z1), δ(z2), . . . , δ(zk))‖²₂ = Σ_{i=1}^{ks} [δ(z1)^(i) δ(z2)^(i) · · · δ(zk)^(i)]²    (8.5)

where δ(z1), δ(z2), . . . , δ(zk) are the vectors obtained for each edit operation in the string-space, ⊙ is the element-wise product of the vectors, and δ(zk)^(i) is the i-th element of the vector δ(zk). As there can be multiple possible edit sequences between the triples (T1, T2), the final distance between the pair of relation triples is defined as the average over all the edit sequences:
dist(r1(h1, t1), r2(h2, t2)) = (1/N) Σ_{ed_q} ed_q(r1(h1, t1), r2(h2, t2))    (8.6)

where N is the number of edit sequences between the triples r1(h1, t1) and r2(h2, t2). To train the proposed model, we minimize a margin-based ranking criterion over the aligned training pairs (T1, T2) ∈ (L1, L2):
L_A = Σ_{(T1,T2)} [γ_A + dist(r1(h1, t1), r2(h2, t2)) − dist(r1(h1, t1), rq(hq, tq))]₊    (8.7)
where r1(h1, t1) ∈ T1 and r2(h2, t2) ∈ T2, [x]₊ = max{0, x}, and γ_A is a hyperparameter. The negative example rq(hq, tq) is obtained by corrupting the positive example r2(h2, t2) (cf. Equation 8.1).
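Putting Equations 8.5 through 8.7 together, a small NumPy sketch follows: `edit_seq_score` implements Equation 8.5, `triple_distance` averages over edit sequences as in Equation 8.6 (for brevity we enumerate only two illustrative sequences, pure character-wise substitution and delete-all-then-insert-all through the null embedding, rather than every possible sequence), and `margin_loss` is the hinge of Equation 8.7:

```python
import numpy as np

def edit_seq_score(deltas):
    """ed_q = || elementwise product of edit-operation vectors ||_2^2 (Eq. 8.5)."""
    prod = np.ones_like(deltas[0])
    for d in deltas:
        prod = prod * d
    return float(np.sum(prod ** 2))

def triple_distance(a, b, null):
    """Average over edit sequences (Eq. 8.6); here two illustrative sequences:
    character-wise substitution, and delete-all / insert-all via null."""
    substitution = [ai - bi for ai, bi in zip(a, b)]
    delete_insert = [ai - null for ai in a] + [null - bi for bi in b]
    scores = [edit_seq_score(substitution), edit_seq_score(delete_insert)]
    return sum(scores) / len(scores)

def margin_loss(pos_pairs, neg_pairs, null, gamma):
    """L_A = sum [gamma + dist(pos) - dist(neg)]_+  (Equation 8.7)."""
    return sum(max(0.0, gamma + triple_distance(*p, null) - triple_distance(*n, null))
               for p, n in zip(pos_pairs, neg_pairs))

ks = 4
null = np.zeros(ks)                              # the special null embedding
x = [np.full(ks, c) for c in (1.0, 2.0, 3.0)]    # string x = r1^s h1^s t1^s
y = [np.full(ks, c) for c in (1.0, 2.0, 4.0)]    # a corrupted tail
# identical triples: the substitution sequence contributes zero to the average
```

The hinge pushes the distance between an aligned pair below the distance to its corruption by at least the margin γ_A, mirroring Equation 8.1.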
The model can be empirically evaluated by utilizing the multi-lingual dataset WK3l designed by Chen et al. (2016). The WK3l dataset contains English (En), French (Fr), and German (De) knowledge graphs, where the ground truth of the aligned triples of two languages (e.g., En-Fr and En-De) is crawled from DBpedia's dbo:Person domain. The evaluation should be designed as a binary classification task where two multi-lingual triples are aligned if dist(r1(h1, t1), r2(h2, t2)) < θ, where θ is optimized over the validation set. By maintaining a positive-to-negative example ratio of 1 and using accuracy as the evaluation metric for the proposed model, we hope that the model would yield competitive results.
8.2 Closing Remarks
The models that we have already developed in this dissertation, in addition to the ones we aim to investigate, are our first steps in realizing the grand vision of neuro-symbolic integration. However, there are some directions beyond the scope of this thesis that could facilitate the development of more sophisticated neuro-symbolic models. One direction to explore would be to harness human expert knowledge to learn more robust systems. Human advice is especially crucial in domains where the data is noisy, insufficient, or incorrect. Because incorporating human knowledge in the context of neural networks (Towell and Shavlik, 1994), support vector machines (Fung et al., 2002; Kunapuli et al., 2010), probabilistic logic models (Odom et al., 2015; Odom and Natarajan, 2018, 2016b), and reinforcement learning (Kunapuli et al., 2013; Odom and Natarajan, 2016a) has already proven successful in the past, exploiting it in neuro-symbolic systems would prove useful in noisy and uncertain domains.
Knowledge graph embeddings can serve as a starting point for exploring human knowledge in neuro-symbolic systems. For instance, it was observed in Xiong et al. (2018) that most of the relations in knowledge graphs have very few instances. In such a scenario, a human expert's advice can provide the qualitative influences (Altendorf et al., 2012; Yang and Natarajan, 2013; Kokel et al., 2020), such as monotonicities and synergies, that frequently occurring relations' instances can have on scarcely occurring relations' instances in knowledge graph embedding models, thus improving link prediction for scarcely occurring relations.
We considered a binary input layer in the case of the LRBM-Boost model proposed in Chapter 4. Learning lifted Boltzmann machine models that support other distributions in the input layer, such as Poisson or Normal, to learn truly hybrid models can lead to several adaptations on real data. Further, deep learning models are made up of stacks of hidden layers, where one layer learns higher-order abstractions of the layer below it. Analogously, neuro-symbolic models can be proposed that invent newer predicates (Kok and Domingos, 2007) by utilizing the data of the layers beneath them. Designing such relational models in deep settings, say in deep Boltzmann machines (Salakhutdinov and Hinton, 2009), would truly realize the dream of achieving deep neuro-symbolic systems.
REFERENCES
Altendorf, E., A. C. Restificar, and T. G. Dietterich (2012). Learning from sparse data by exploiting monotonicity constraints. In UAI.

Arjovsky, M., S. Chintala, and L. Bottou (2017). Wasserstein generative adversarial networks. PMLR 70, 214–223.

Bach, S., M. Broecheler, B. Huang, and L. Getoor (2017). Hinge-loss Markov random fields and probabilistic soft logic. JMLR 18, 1–67.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1), 1–127.

Bengio, Y., P. Lamblin, D. Popovici, and H. Larochelle (2006). Greedy layer-wise training of deep networks. In NeurIPS.

Bergstra, J. and Y. Bengio (2012). Random search for hyper-parameter optimization. JMLR 13, 281–305.

Besold, T. R., A. S. d'Avila Garcez, S. Bader, H. Bowman, P. M. Domingos, P. Hitzler, K. Kuhnberger, L. C. Lamb, D. Lowd, P. M. V. Lima, L. de Penning, G. Pinkas, H. Poon, and G. Zaverucha (2017). Neural-symbolic learning and reasoning: A survey and interpretation. CoRR abs/1711.03902, 1–58.

Blei, D. and J. Lafferty (2009). Topic models. In Text Mining: Theory and Applications, pp. 71–89. Taylor and Francis.

Blei, D. M. and J. D. Lafferty (2005). Correlated topic models. In NeurIPS.

Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent Dirichlet allocation. JMLR 3, 993–1022.

Blockeel, H. and L. De Raedt (1998). Top-down induction of first-order logical decision trees. Artificial Intelligence 101, 285–297.

Blockeel, H. and W. Uwents (2004). Using neural networks for relational learning. In ICML-2004 Workshop on SRL, pp. 23–28.

Bollacker, K., C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008). Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD.

Bordes, A., X. Glorot, J. Weston, and Y. Bengio (2012). Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS.

Bordes, A., N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013). Translating embeddings for modeling multi-relational data. In NeurIPS.
Bordes, A., J. Weston, and N. Usunier (2014). Open question answering with weakly supervised embedding models. In ECML-PKDD.

Cai, L. and W. Y. Wang (2018). KBGAN: Adversarial learning for knowledge graph embeddings. In NAACL-HLT.

Carlson, A., J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, Jr., and T. M. Mitchell (2010). Toward an architecture for never-ending language learning. In AAAI, pp. 1306–1313.

Chang, J., S. Gerrish, C. Wang, J. L. Boyd-graber, and D. M. Blei (2009). Reading tea leaves: How humans interpret topic models. In NeurIPS.

Chang, K.-W., W.-t. Yih, B. Yang, and C. Meek (2014). Typed tensor decomposition of knowledge bases for relation extraction. In EMNLP.

Chen, M., Y. Tian, K. Chang, S. Skiena, and C. Zaniolo (2018). Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In IJCAI.

Chen, M., Y. Tian, M. Yang, and C. Zaniolo (2016). Multi-lingual knowledge graph embeddings for cross-lingual knowledge alignment. In IJCAI.

Chen, X., Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS.

Craven, M. W. and J. W. Shavlik (1995). Extracting tree-structured representations of trained networks. In NeurIPS, pp. 24–30.

Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector Machines: And Other Kernel-based Learning Methods. Cambridge University Press.

Das, M., Y. Wu, T. Khot, K. Kersting, and S. Natarajan (2016). Scaling lifted probabilistic inference and learning via graph databases. In SDM.

Das, R., A. Neelakantan, D. Belanger, and A. McCallum (2017). Chains of reasoning over entities, relations, and text using recurrent neural networks. In EACL.

Das, S., S. Natarajan, K. Roy, R. Parr, and K. Kersting (2020). Fitted Q-learning for relational domains. In KR.

Davis, J. and M. Goadrich (2006). The relationship between Precision-Recall and ROC curves. In ICML.

De Raedt, L., K. Kersting, S. Natarajan, and D. Poole (2016). Statistical Relational Artificial Intelligence: Logic, Probability, and Computation. Morgan and Claypool Publishers.
Deng, L. (2015). Connecting deep learning features to log-linear models. In Log-Linear Models, Extensions and Applications. MIT Press.

Desjardins, G., A. Courville, Y. Bengio, P. Vincent, and O. Dellaleau (2010). Parallel tempering for training of restricted Boltzmann machines. AISTATS 9, 145–152.

DiMaio, F. and J. Shavlik (2004). Learning an approximation to inductive logic programming clause evaluation. In ILP, pp. 80–97.

Ding, B., Q. Wang, B. Wang, and L. Guo (2018). Improving knowledge graph embedding using simple constraints. In ACL.

Domingos, P. and D. Lowd (2009). Markov Logic: An Interface Layer for AI. Morgan & Claypool Publishers.

Dong, X., E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang (2014). Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In ACM SIGKDD.

Duchi, J., E. Hazan, and Y. Singer (2011). Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12, 2121–2159.

Evans, R. et al. (2018). Can neural networks understand logical entailment? In ICLR.

Evans, R. and E. Grefenstette (2018). Learning explanatory rules from noisy data. JAIR 61(1), 1–64.

Fedus, W., I. Goodfellow, and A. M. Dai (2018). MaskGAN: Better text generation via filling in the ______. In ICLR.

Feldman, R. and J. Sanger (2006). Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.

Fischer, A. and C. Igel (2012). An introduction to restricted Boltzmann machines. In CIARP, pp. 14–36. Springer Berlin Heidelberg.

Franca, M. V. M., G. Zaverucha, and A. S. d'Avila Garcez (2014). Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning 94, 81–104.

Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics 29(5), 1189–1232.

Fung, G. M., O. L. Mangasarian, and J. W. Shavlik (2002). Knowledge-based support vector machine classifiers. In NeurIPS.
Garcez, A. S. d., D. M. Gabbay, and K. B. Broda (2002). Neural-Symbolic Learning Systems: Foundations and Applications. Springer-Verlag.
Getoor, L., N. Friedman, D. Koller, and A. Pfeffer (2001). Learning Probabilistic Relational Models, pp. 307–335. Springer.
Getoor, L. and B. Taskar (2007). Introduction to Statistical Relational Learning. MIT Press.
Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. MIT Press.
Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative adversarial nets. In NeurIPS.
Gopal, S. and Y. Yang (2014). Von Mises-Fisher clustering models. In ICML.
Gulrajani, I., F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017). Improved training of Wasserstein GANs. In NeurIPS.
Gutmann, B. and K. Kersting (2006). TildeCRF: Conditional Random Fields for logical sequences. In ECML, pp. 174–185.
He, K., X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In CVPR.
He, S., K. Liu, G. Ji, and J. Zhao (2015). Learning to represent knowledge graphs with Gaussian embedding. In CIKM.
Heckerman, D., D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie (2001). Dependency networks for inference, collaborative filtering, and data visualization. JMLR 1, 49–75.
Helma, C., R. D. King, S. Kramer, and A. Srinivasan (2001). The predictive toxicology challenge 2000–2001. Bioinformatics 17, 107–108.
Hinton, G., L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29(6), 82–97.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. NeuralComputation 14(8), 1771–1800.
Hinton, G. E. and S. Osindero (2006). A fast learning algorithm for deep belief nets. NeuralComputation 18, 1527–1554.
Hofmann, T. (1999). Probabilistic latent semantic analysis. In UAI.
Hu, H., J. Gu, Z. Zhang, J. Dai, and Y. Wei (2018). Relation networks for object detection. In CVPR, pp. 3588–3597.
Hu, Z., X. Ma, Z. Liu, E. H. Hovy, and E. P. Xing (2016). Harnessing deep neural networks with logic rules. In ACL.
Huang, Y., W. Wang, and L. Wang (2015). Conditional high-order Boltzmann machine: A supervised learning model for relation learning. In ICCV.
Jaeger, M. (1997). Relational Bayesian Networks. In UAI.
Jaeger, M. (2007). Parameter learning for relational Bayesian networks. In ICML.
Kameya, Y. and T. Sato (2001). Parameter learning of logic programs for symbolic-statistical modeling. JAIR 15, 391–454.
Kaur, N., G. Kunapuli, S. Joshi, K. Kersting, and S. Natarajan (2019). Neural networks for relational data. In ILP.
Kaur, N., G. Kunapuli, T. Khot, K. Kersting, W. Cohen, and S. Natarajan (2017). Relational restricted Boltzmann machines: A probabilistic logic learning approach. In ILP.
Kaur, N., G. Kunapuli, and S. Natarajan (2020a). Knowledge graph alignment using string edit distance. arXiv abs/2003.12145, 1–6.
Kaur, N., G. Kunapuli, and S. Natarajan (2020b). Non-parametric learning of lifted restricted Boltzmann machines. IJAR 120, 33–47.
Kazemi, S., D. Buchman, K. Kersting, S. Natarajan, and D. Poole (2014). Relational logistic regression. In KR.
Kazemi, S. M. and D. Poole (2018). RelNN: A deep neural model for relational learning. In AAAI, pp. 6367–6375.
Kersting, K. and L. D. Raedt (2007). Bayesian logic programming: Theory and tool. In An Introduction to Statistical Relational Learning.
Khot, T., S. Natarajan, K. Kersting, and J. Shavlik (2011). Learning Markov logic networks via functional gradient boosting. In ICDM.
Kok, S. and P. Domingos (2007). Statistical predicate invention. In ICML.
Kok, S. and P. Domingos (2009). Learning Markov logic network structure via hypergraph lifting. In ICML.
Kok, S. and P. Domingos (2010). Learning Markov logic networks using structural motifs. In ICML.
Kok, S., M. Sumner, M. Richardson, et al. (2010). The Alchemy system for statistical relational AI. Technical report, University of Washington.
Kokel, H., P. Odom, S. Yang, and S. Natarajan (2020). A unified framework for knowledge intensive gradient boosting: Leveraging human experts for noisy sparse domains. In AAAI.
Komendantskaya, E. (2007). First-order deduction in neural networks. In LATA.
Krizhevsky, A., I. Sutskever, and G. E. Hinton (2012). ImageNet classification with deep convolutional neural networks. In NeurIPS.
Krompaß, D., S. Baier, and V. Tresp (2015). Type-constrained representation learning in knowledge graphs. In ISWC.
Kunapuli, G., K. P. Bennett, A. Shabbeer, R. Maclin, and J. Shavlik (2010). Online knowledge-based support vector machines. In J. L. Balcazar, F. Bonchi, A. Gionis, and M. Sebag (Eds.), ECML-PKDD.
Kunapuli, G., P. Odom, J. W. Shavlik, and S. Natarajan (2013). Guiding autonomous agents to better behaviors through human advice. In ICDM.
Lacroix, T., N. Usunier, and G. Obozinski (2018). Canonical tensor decomposition for knowledge base completion. In ICML.
Lai, Y.-Y., J. Neville, and D. Goldwasser (2019). TransConv: Relationship embedding in social networks. In AAAI.
Landwehr, N., A. Passerini, L. De Raedt, and P. Frasconi (2010). Fast learning of relational kernels. Machine Learning 78, 305–342.
Lao, N. and W. Cohen (2010). Relational retrieval using a combination of path-constrained random walks. Machine Learning 81(1), 53–67.
Larochelle, H. and Y. Bengio (2008). Classification using discriminative restricted Boltzmann machines. In ICML, pp. 536–543.
Lavrac, N. and S. Dzeroski (1993). Inductive Logic Programming: Techniques and Applications. Prentice Hall.
Lecun, Y., S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006). A tutorial on energy-based learning. In Predicting Structured Data. MIT Press.
Lee, D. D. and H. S. Seung (2001). Algorithms for non-negative matrix factorization. In NeurIPS.
Lehmann, J., R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, and C. Bizer (2014). DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal 6(2), 167–195.
Li, K., J. Gao, S. Guo, N. Du, X. Li, and A. Zhang (2014). LRBM: A Restricted Boltzmann Machine based approach for representation learning on linked data. In ICDM, pp. 300–309.
Li, S., X. Li, R. Ye, M. Wang, H. Su, and Y. Ou (2018). Non-translational alignment for multi-relational networks. In IJCAI.
Lin, Y., Z. Liu, H. Luan, M. Sun, S. Rao, and S. Liu (2015). Modeling relation paths for representation learning of knowledge bases. In EMNLP.
Lin, Y., Z. Liu, M. Sun, Y. Liu, and X. Zhu (2015). Learning entity and relation embeddings for knowledge graph completion. In AAAI.
Lipschutz, S. (1968). Schaum's Outline of Theory and Problems of Linear Algebra. New York: McGraw-Hill.
Liu, H. and Z. Wu (2010). Non-negative matrix factorization with constraints. In AAAI.
Lodhi, H. (2013). Deep relational machines. In ICONIP, pp. 212–219.
Lowd, D. and J. Davis (2010). Learning Markov network structure with decision trees. In ICDM.
Ma, L., X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool (2017). Pose guided person image generation. In NeurIPS.
Ma, S., J. Ding, W. Jia, K. Wang, and M. Guo (2017). TransT: Type-based multiple embedding representations for knowledge graph completion. In ECML-PKDD.
Manhaeve, R., S. Dumancic, A. Kimmig, T. Demeester, and L. De Raedt (2018). DeepProbLog: Neural probabilistic logic programming. In NeurIPS.
Marra, G. and O. Kuzelka (2019). Neural Markov Logic Networks. arXiv abs/1905.13462, 1–19.
Mihalkova, L. and R. Mooney (2007). Bottom-up learning of Markov logic network structure. In ICML.
Mikolov, T., I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013). Distributed representations of words and phrases and their compositionality. In NeurIPS.
Mikolov, T., W.-t. Yih, and G. Zweig (2013). Linguistic regularities in continuous space word representations. In NAACL-HLT.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM 38(11), 39–41.
Minervini, P., T. Demeester, T. Rocktaschel, and S. Riedel (2017). Adversarial sets for regularising neural link predictors. In UAI.
Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533.
Moon, T. K. (1996). The expectation-maximization algorithm. IEEE Signal Processing Magazine 13(6), 47–60.
Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing 13, 245–286.
Muggleton, S. and L. D. Raedt (1994). Inductive Logic Programming: Theory and methods. Journal of Logic Programming 19, 629–679.
Natarajan, S., S. Joshi, P. Tadepalli, K. Kersting, and J. Shavlik (2011). Imitation learning in relational domains: A functional-gradient boosting approach. In IJCAI, pp. 1414–1420.
Natarajan, S., T. Khot, K. Kersting, B. Gutmann, and J. Shavlik (2012). Gradient-based boosting for statistical relational learning: The relational dependency network case. MLJ 86(1), 75–100.
Natarajan, S., T. Khot, K. Kersting, and J. Shavlik (2016). Boosted Statistical Relational Learners: From Benchmarks to Data-Driven Medicine. SpringerBriefs in CS. Springer.
Natarajan, S., A. Prabhakar, N. Ramanan, A. Bagilone, K. Siek, and K. Connelly (2017). Boosting for postpartum depression prediction. In CHASE.
Natarajan, S., P. Tadepalli, T. Dietterich, and A. Fern (2008). Learning first-order probabilistic models with combining rules. AMAI 54(1-3), 223–256.
Neelakantan, A. and M.-W. Chang (2015). Inferring missing entity type instances for knowledge base completion: New dataset and methods. In NAACL-HLT.
Neville, J., D. Jensen, L. Friedland, and M. Hay (2003). Learning relational probability trees. In KDD, pp. 625–630.
Ng, V. and C. Cardie (2002). Improving machine learning approaches to coreference resolution. In ACL.
Nickel, M., L. Rosasco, and T. Poggio (2016). Holographic embeddings of knowledge graphs. In AAAI.
Nickel, M., V. Tresp, and H.-P. Kriegel (2011). A three-way model for collective learning on multi-relational data. In ICML.
Niepert, M., M. Ahmed, and K. Kutzkov (2016). Learning convolutional neural networks for graphs. In ICML, pp. 2014–2023.
Odom, P., T. Khot, R. Porter, and S. Natarajan (2015). Knowledge-based probabilistic logic learning. In AAAI.
Odom, P. and S. Natarajan (2016a). Active advice seeking for inverse reinforcement learning. In AAMAS.
Odom, P. and S. Natarajan (2016b). Actively interacting with experts: A probabilistic logic approach. In ECML-PKDD.
Odom, P. and S. Natarajan (2018). Human-guided learning for probabilistic logic models. Frontiers in Robotics and AI 5, 1–56.
Oncina, J. and M. Sebban (2006). Learning stochastic edit distance: Application in handwritten character recognition. Pattern Recognition 39(9), 1575–1587.
Palm, R. B., U. Paquet, and O. Winther (2018). Recurrent relational networks for complex relational reasoning. In ICLR.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Pei, S., L. Yu, and X. Zhang (2018). Improving cross-lingual entity alignment via optimal transport. In IJCAI.
Perozzi, B., R. Al-Rfou', and S. Skiena (2014). DeepWalk: Online learning of social representations. In KDD.
Pham, T., T. Tran, D. Phung, and S. Venkatesh (2017). Column networks for collective classification. In AAAI, pp. 2485–2491.
Poole, D. (1993). Probabilistic Horn abduction and Bayesian networks. AIJ 64(1), 81–129.
Poon, H. and P. Domingos (2007). Joint inference in information extraction. In AAAI.
Qin, P., X. Wang, W. Chen, C. Zhang, W. Xu, and W. Y. Wang (2020). Generative adversarial zero-shot relational learning for knowledge graphs. In AAAI.
Quinlan, J. (1990). Learning logical definitions from relations. Machine Learning 5(3), 239–266.
Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc.
Raedt, L. D., S. Dumancic, R. Manhaeve, and G. Marra (2020). From statistical relational to neuro-symbolic artificial intelligence. arXiv abs/2003.08316, 1–6.
Raedt, L. D., A. Kimmig, and H. Toivonen (2007). ProbLog: A probabilistic Prolog and its application in link discovery. In IJCAI.
Ramanan, N., G. Kunapuli, T. Khot, B. Fatemi, S. M. Kazemi, D. Poole, K. Kersting, and S. Natarajan (2018). Structure learning for relational logistic regression: An ensemble approach. In KR, pp. 661–662.
Ramon, J. and L. D. Raedt (2000). Multi instance neural network. In ICML Workshop.
Richardson, M. and P. Domingos (2006). Markov Logic Networks. MLJ 62, 107–136.
Ristad, E. S. and P. N. Yianilos (1998). Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5), 522–532.
Rocktaschel, T. and S. Riedel (2017). End-to-end differentiable proving. In NeurIPS.
Rumelhart, D. E. and J. L. McClelland (1987). Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations, Chapter Information Processing in Dynamical Systems: Foundations of Harmony Theory, pp. 194–281. MIT Press.
Salakhutdinov, R. and G. Hinton (2009). Deep Boltzmann Machines. In AISTATS.
Salakhutdinov, R. and A. Mnih (2007). Probabilistic Matrix Factorization. In NeurIPS.
Salakhutdinov, R., A. Mnih, and G. Hinton (2007). Restricted Boltzmann machines for collaborative filtering. In ICML.
Salehi, F., R. Bamler, and S. Mandt (2018). Probabilistic knowledge graph embeddings. In Symposium on Advances in Approximate Bayesian Inference.
Santoro, A. et al. (2017). A simple neural network module for relational reasoning. In NeurIPS, pp. 4967–4976.
Scarselli, F., M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009). The graph neural network model. IEEE Transactions on Neural Networks 20(1), 61–80.
Schlichtkrull, M., T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018). Modeling relational data with graph convolutional networks. In ESWC, pp. 593–607.
Shah, H., J. Villmow, A. Ulges, U. Schwanecke, and F. Shafait (2019). An open-world extension to knowledge graph completion models. In AAAI.
Shi, B. and T. Weninger (2018). Open-world knowledge graph completion. In AAAI.
Silver, D., J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis (2017). Mastering the game of Go without human knowledge. Nature 550, 354–359.
Snoek, J., H. Larochelle, and R. P. Adams (2012). Practical Bayesian optimization of machine learning algorithms. In NeurIPS, pp. 2951–2959.
Socher, R., D. Chen, C. D. Manning, and A. Ng (2013). Reasoning with neural tensor networks for knowledge base completion. In NeurIPS.
Sourek, G., S. Manandhar, F. Zelezny, S. Schockaert, and O. Kuzelka (2016). Learning predictive categories using lifted relational neural networks. In ILP.
Suchanek, F. M., G. Kasneci, and G. Weikum (2007). Yago: A core of semantic knowledge. In WWW.
Sun, Z., W. Hu, and C. Li (2017). Cross-lingual entity alignment via joint attribute-preserving embedding. In ISWC.
Sun, Z., W. Hu, Q. Zhang, and Y. Qu (2018). Bootstrapping entity alignment with knowledge graph embedding. In IJCAI.
Sung, F., Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales (2018). Learning to compare: Relation network for few-shot learning. In CVPR, pp. 1199–1208.
Sutskever, I. and G. Hinton (2007). Learning multilevel distributed representations for high-dimensional sequences. In AISTATS.
Sutskever, I., O. Vinyals, and Q. V. Le (2014). Sequence to sequence learning with neural networks. In NeurIPS.
Szumlanski, S. and F. Gomez (2010). Automatically acquiring a semantic network of related concepts. In CIKM.
Taskar, B. (2002). Discriminative probabilistic models for relational data. In UAI.
Taylor, G. W., G. E. Hinton, and S. T. Roweis (2007). Modeling human motion using binary latent variables. In NeurIPS.
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML.
Towell, G. G. and J. W. Shavlik (1994). Knowledge-based artificial neural networks. AI 70(1–2), 119–165.
Towell, G. G., J. W. Shavlik, and M. O. Noordewier (1990). Refinement of approximate domain theories by knowledge-based neural networks. In AAAI.
Trouillon, T., J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard (2016). Complex embeddings for simple link prediction. In ICML.
Sourek, G., V. Aschenbrenner, F. Zelezny, S. Schockaert, and O. Kuzelka (2018). Lifted relational neural networks: Efficient learning of latent relational structures. JAIR 62(1), 69–100.
Sourek, G., M. Svatos, F. Zelezny, S. Schockaert, and O. Kuzelka (2017). Stacked structure learning for lifted relational neural networks. In ILP.
Wang, C. and D. M. Blei (2011). Collaborative topic modeling for recommending scientific articles. In SIGKDD.
Wang, H., X. Shi, and D. Yeung (2015). Relational stacked denoising autoencoder for tag recommendation. In AAAI.
Wang, P., S. Li, and R. Pan (2018). Incorporating GAN for negative sampling in knowledge representation learning. In AAAI.
Wang, Q., Z. Mao, B. Wang, and L. Guo (2017). Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29(12), 2724–2743.
Wang, W. and W. Cohen (2016). Learning first-order logic embeddings via matrix factorization. In IJCAI.
Wang, Z. and J. Li (2016). Text-enhanced representation learning for knowledge graph. In IJCAI.
Wang, Z., Q. Lv, X. Lan, and Y. Zhang (2018). Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP.
Wang, Z., J. Zhang, J. Feng, and Z. Chen (2014a). Knowledge graph and text jointly embedding. In EMNLP.
Wang, Z., J. Zhang, J. Feng, and Z. Chen (2014b). Knowledge graph embedding by translating on hyperplanes. In AAAI.
Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming 151(1), 3–34.
Wu, Y., X. Liu, Y. Feng, Z. Wang, R. Yan, and D. Zhao (2019). Relation-aware entity alignment for heterogeneous knowledge graphs. In IJCAI.
Xiao, H., Y. Chen, and X. Shi (2019). Knowledge graph embedding based on multi-view clustering framework. IEEE Transactions on Knowledge and Data Engineering 1(1), 1–1.
Xiao, H., M. Huang, and X. Zhu (2016). TransG: A generative model for knowledge graph embedding. In ACL.
Xiao, H., M. Huang, and X. Zhu (2017). SSP: Semantic space projection for knowledge graph embedding with text descriptions. In AAAI.
Xie, R., Z. Liu, J. Jia, H. Luan, and M. Sun (2016). Representation learning of knowledge graphs with entity descriptions. In AAAI.
Xie, R., Z. Liu, H. Luan, and M. Sun (2017). Image-embodied knowledge representation learning. In IJCAI.
Xie, R., Z. Liu, and M. Sun (2016). Representation learning of knowledge graphs with hierarchical types. In IJCAI.
Xiong, C., V. Zhong, and R. Socher (2017). Dynamic coattention networks for question answering. In ICLR.
Xiong, W., M. Yu, S. Chang, X. Guo, and W. Y. Wang (2018). One-shot relational learning for knowledge graphs. In EMNLP.
Xu, J., X. Qiu, K. Chen, and X. Huang (2017). Knowledge graph representation with jointly structural and textual encoding. In IJCAI.
Yang, B., W. Yih, X. He, J. Gao, and L. Deng (2015). Embedding entities and relations for learning and inference in knowledge bases. In ICLR.
Yang, S., T. Khot, K. Kersting, and S. Natarajan (2016). Learning continuous-time Bayesian networks in relational domains: A non-parametric approach. In AAAI, pp. 2265–2271.
Yang, S. and S. Natarajan (2013). Knowledge intensive learning: Combining qualitative constraints with causal independence for parameter learning in probabilistic models. In ECML-PKDD.
Yao, L., Y. Zhang, B. Wei, Z. Jin, R. Zhang, Y. Zhang, and Q. Chen (2017). Incorporating knowledge graph embeddings into topic modeling. In AAAI.
Ye, R., X. Li, Y. Fang, H. Zang, and M. Wang (2019). A vectorized relational graph convolutional network for multi-relational network alignment. In IJCAI.
Zeng, D., K. Liu, S. Lai, G. Zhou, and J. Zhao (2014). Relation classification via convolutional deep neural network. In COLING.
Zhang, F., N. J. Yuan, D. Lian, X. Xie, and W.-Y. Ma (2016). Collaborative knowledge base embedding for recommender systems. In KDD.
Zhang, Q., Z. Sun, W. Hu, M. Chen, L. Guo, and Y. Qu (2019). Multi-view knowledge graph embedding for entity alignment. In IJCAI.
Zhang, S., L. Yao, A. Sun, and Y. Tay (2019). Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys 52(1), 1–38.
Zhong, H., J. Zhang, Z. Wang, H. Wan, and Z. Chen (2015). Aligning knowledge and text embeddings by entity descriptions. In EMNLP.
Zhu, H., R. Xie, Z. Liu, and M. Sun (2017). Iterative entity alignment via joint knowledge embeddings. In IJCAI.
Zhu, J.-Y., T. Park, P. Isola, and A. A. Efros (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
Zhu, Q., X. Zhou, J. Wu, J. Tan, and L. Guo (2019). Neighborhood-aware attentional representation for multilingual knowledge graphs. In IJCAI.
BIOGRAPHICAL SKETCH
Navdeep Kaur is a PhD candidate in the Department of Computer Science at The University of
Texas at Dallas, advised by Professor Sriraam Natarajan. Her research interests include neuro-
symbolic computing, representation learning and statistical relational learning. She completed her
Master of Science (MS) degree in Computer Science from Indiana University, Bloomington. She
obtained her Bachelor of Technology (BTech) and Master of Technology (MTech) in Computer
Science & Engineering from Punjab Technical University (Punjab, India).
Before pursuing graduate studies, Navdeep worked in academia for several years at Punjab Technical University. During her career in academia, she taught data structures, algorithm design and analysis, and Java programming to undergraduate classes at the university.
CURRICULUM VITAE
Navdeep Kaur
Contact Information:
Department of Computer Science
The University of Texas at Dallas
800 W. Campbell Rd., ECSS 3.214
Richardson, TX 75080-3021, U.S.A.
Email: [email protected]
Educational History:
BTech, Computer Science & Engineering, Punjab Technical University, 2005
MTech, Computer Science & Engineering, Punjab Technical University, 2009
MS, Computer Science, Indiana University, Bloomington, 2018
PhD, Computer Science, The University of Texas at Dallas, 2020
Efficient Combination of Neural and Symbolic Learning for Relational Data
PhD Dissertation
Computer Science Department, The University of Texas at Dallas
Advisor: Dr. Sriraam Natarajan
Employment History:
Research Assistant, The University of Texas at Dallas, January 2020 – present
Teaching Assistant, The University of Texas at Dallas, August 2019 – December 2019
Research Assistant, The University of Texas at Dallas, August 2018 – July 2019
Research Assistant, Indiana University Bloomington, June 2015 – July 2018
Teaching Assistant, Indiana University Bloomington, August 2014 – May 2015
Assistant Professor, Punjab Technical University, India, July 2011 – July 2014
Lecturer, Punjab Technical University, India, September 2009 – October 2010
Research Intern, DTRL Lab, DRDO, Delhi, India, January 2009 – July 2009
Lecturer, Punjab Technical University, India, March 2007 – December 2008
Professional Services:
Reviewer: CODS-CoMAD 2020
Publications:
Journal Papers:
1. Navdeep Kaur, Gautam Kunapuli and Sriraam Natarajan, “Non-Parametric Learning of Lifted Restricted Boltzmann Machines”, in International Journal of Approximate Reasoning (Elsevier), 2020.
Conference Papers:
2. Navdeep Kaur, Gautam Kunapuli and Sriraam Natarajan, “Topic Augmented Knowledge Graph Embeddings”, under review.
3. Navdeep Kaur, Gautam Kunapuli, Saket Joshi, Kristian Kersting and Sriraam Natarajan, “Neural Networks for Relational Data”, in Inductive Logic Programming, 2019.
4. Navdeep Kaur, Gautam Kunapuli, Tushar Khot, Kristian Kersting, William Cohen and Sriraam Natarajan, “Relational Restricted Boltzmann Machines: A Probabilistic Logic Learning Approach”, in Inductive Logic Programming, 2017.
Workshop Papers:
5. Navdeep Kaur, Gautam Kunapuli and Sriraam Natarajan, “Boosting Relational Restricted Boltzmann Machines”, in WiML workshop @ NeurIPS, 2019.