
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. X, AUGUST 2021

Duality-Induced Regularizer for Semantic Matching Knowledge Graph Embeddings

Jie Wang, Senior Member, IEEE, Zhanqiu Zhang, Zhihao Shi, Jianyu Cai, Shuiwang Ji, Senior Member, IEEE, and Feng Wu, Fellow, IEEE

Abstract—Semantic matching models, which assume that entities with similar semantics have similar embeddings, have shown great power in knowledge graph embeddings (KGE). Many existing semantic matching models use inner products in embedding spaces to measure the plausibility of triples and quadruples in static and temporal knowledge graphs. However, vectors that have the same inner products with another vector can still be orthogonal to each other, which implies that entities with similar semantics may have dissimilar embeddings. This property of inner products significantly limits the performance of semantic matching models. To address this challenge, we propose a novel regularizer, namely DUality-induced RegulArizer (DURA), which effectively encourages entities with similar semantics to have similar embeddings. The major novelty of DURA is based on the observation that, for an existing semantic matching KGE model (primal), there is often another distance based KGE model (dual) closely associated with it, which can be used as an effective constraint on entity embeddings. Experiments demonstrate that DURA consistently and significantly improves the performance of state-of-the-art semantic matching models on both static and temporal knowledge graph benchmarks.

Index Terms—Knowledge Graph Embeddings, Knowledge Graph, Link Prediction, Temporal Knowledge Graphs, Regularization.


1 INTRODUCTION

Knowledge graphs store human knowledge in a structured way, with nodes being entities and edges being relations. In the past few years, knowledge graphs have been applied successfully in many areas, such as natural language processing [1], question answering [2], recommendation systems [3], and computer vision [4].

Some popular knowledge graphs, such as Freebase [5], Wikidata [6], and Yago3 [7], consist of triples (subject, predicate, object). However, in many scenarios, triples cannot represent knowledge well, because knowledge may evolve with time. For example, the fact (Donald Trump, president, USA) is only valid during the time interval 2017-2021. Therefore, temporal knowledge graphs that consist of quadruples (subject, predicate, object, timestamp) have also attracted increasing attention recently. Although commonly used knowledge graphs usually contain billions of triples or quadruples, they still suffer from incompleteness: many factual triples or quadruples are missing. Due to the large scale of knowledge graphs, it is impractical to find all valid triples or quadruples manually. Therefore, knowledge graph completion (KGC), which aims to automatically predict missing links between entities based on known links, has attracted much attention.

Knowledge graph embedding (KGE) is a powerful technique for KGC, which embeds entities, relations, and possible timestamps in KGs into low-dimensional continuous embedding spaces. KGE models define a score function for each triple/quadruple in embedding spaces. Valid triples/quadruples are expected to have higher scores than invalid ones. The success of many existing KGE models relies on the assumption that entities with similar semantics should have similar representations. For example, the triples (lions, is, mammals) and (tigers, is, mammals) will have similar scores if the entities lions and tigers have similar embeddings. In other words, given a known valid triple (lions, is, mammals), we can infer with high confidence that the triple (tigers, is, mammals) is also valid.

• J. Wang, Z. Zhang, Z. Shi, J. Cai, and F. Wu are with: a) CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China, Hefei 230027, China; b) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230091, China. E-mail: [email protected], [email protected], [email protected], [email protected], [email protected].

• S. Ji is with the Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843 USA. E-mail: [email protected].

Manuscript received August, 2021.

Semantic matching (SM) models are an important family of knowledge graph embedding approaches. They usually model static and temporal knowledge graphs as partially observed third-order and fourth-order binary tensors, respectively. Then, they formulate KGC as a tensor completion problem and try to solve it by tensor factorization. From the perspective of the scoring function, SM models use inner products between embeddings to define the plausibility of given triples and quadruples. Theoretically, these models are highly expressive and can handle complex relation patterns well, such as 1-N, N-1, and N-N relations [8], [9]. However, due to the property of inner products, entities with similar semantics may have dissimilar embeddings. For example, even if the embeddings of lions and tigers are orthogonal, the triples (lions, is, mammals) and (tigers, is, mammals) are likely to have the same scores when the two entity embeddings have proper norms (please refer to Section 4.1 for more detailed explanations). We can observe the same phenomenon in temporal semantic matching models (please refer to Section 5.3). The performance of SM models suffers from this phenomenon, and so these models fall short of state-of-the-art results.

To tackle the challenge, we propose a novel regularizer for semantic matching KGE models, namely DUality-induced RegulArizer (DURA), which effectively encourages entities with similar semantics to have similar embeddings. Specifically, by noticing that entities with similar semantics often connect to the same entity through the same relation, we propose to use a regularizer to constrain these entities' embeddings. Further, we propose DURA based on an observation called duality: for an existing semantic matching KGE model (primal), there is often another distance based KGE model closely associated with it (dual), which uses Minkowski distances to define score functions. The duality can be derived by expanding the squared score function of the associated distance based model, where the cross-term in the expansion is exactly an SM model and the squared terms give us a regularizer. In other words, adding DURA to an SM model is equivalent to training the SM model with its associated distance based model as a regularizer. Since minimizing the Minkowski distance between vectors pulls these vectors toward each other, DURA indeed encourages entities with similar semantics to have similar embeddings. Moreover, we show that when relation embeddings are diagonal matrices, the minimization of DURA has the form of the tensor nuclear 2-norm [10]. Experiments demonstrate that DURA is widely applicable to various static and temporal semantic matching KGE models and yields consistent and significant improvements on benchmark datasets.

arXiv:2203.12949v2 [cs.CL] 6 Apr 2022

An earlier version of this paper has been published at NeurIPS 2020 [11]. This journal manuscript significantly extends the initial version in several aspects. First, we propose a temporal version of DURA, named TDURA, which is designed for temporal knowledge graph embeddings. Second, we introduce a general semantic matching temporal KGE model, named TRESCAL, which is an extension of RESCAL to temporal KGs. Third, we rigorously analyze the case of diagonal relation embedding matrices and show that the minimization of TDURA also has the form of the tensor nuclear 2-norm [10]. Finally, we conduct experiments on temporal knowledge graphs and conduct more ablation studies to demonstrate the effectiveness of TDURA.

2 PRELIMINARIES

We review the background of knowledge graphs, knowledge graph completion, and knowledge graph embeddings in Sections 2.1, 2.2, and 2.3, respectively. Then, we introduce the notations in Section 2.4.

2.1 Knowledge Graphs

Static Knowledge Graphs. Given a set $\mathcal{E}$ of entities and a set $\mathcal{R}$ of relations, a knowledge graph $\mathcal{K} = \{(u_i, r_j, v_k)\} \subset \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ is a set of triplets, where $u_i$ and $r_j$ are the $i$-th entity and the $j$-th relation, respectively.

Temporal Knowledge Graphs. Given a set $\mathcal{E}$ of entities, a set $\mathcal{R}$ of relations, and a set $\mathcal{T}$ of timestamps, a temporal knowledge graph $\mathcal{K}_T = \{(u_i, r_j, v_k, t_l)\} \subset \mathcal{E} \times \mathcal{R} \times \mathcal{E} \times \mathcal{T}$ is a set of quadruples, where $u_i$, $r_j$, and $t_l$ are the $i$-th entity, the $j$-th relation, and the $l$-th timestamp, respectively.

2.2 Knowledge Graph Completion

Static Knowledge Graph Completion (KGC). The goal of KGC is to predict valid but unobserved triples based on the known triples in $\mathcal{K}$, which is often solved by knowledge graph embedding (KGE) models. KGE models associate each entity $u_i \in \mathcal{E}$ and relation $r_j \in \mathcal{R}$ with embeddings $\mathbf{u}_i$ and $\mathbf{r}_j$, which may be real or complex vectors, matrices, or tensors. Generally, they define a score function $s: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$ to associate a score $s(u_i, r_j, v_k)$ with each potential triplet $(u_i, r_j, v_k) \in \mathcal{E} \times \mathcal{R} \times \mathcal{E}$. The scores measure the plausibility of triples. For a query $(u_i, r_j, ?)$, KGE models first fill the blank with each entity in the knowledge graph and then score the resulting triples. Valid triples are expected to have higher scores than invalid triples.

Temporal Knowledge Graph Completion (TKGC). The goal of TKGC is to predict valid but unobserved quadruples based on the known quadruples in $\mathcal{K}_T$. Temporal knowledge graph embedding (TKGE) models also fall into two important categories: distance based models and semantic matching models. TKGE models associate each entity $u_i \in \mathcal{E}$, relation $r_j \in \mathcal{R}$, and timestamp $t_l \in \mathcal{T}$ with embeddings $\mathbf{u}_i$, $\mathbf{r}_j$, and $\mathbf{t}_l$, which may be real or complex vectors, matrices, or tensors. Generally, they define a score function $s: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \times \mathcal{T} \to \mathbb{R}$ to associate a score $s(u_i, r_j, v_k, t_l)$ with each potential quadruple $(u_i, r_j, v_k, t_l) \in \mathcal{E} \times \mathcal{R} \times \mathcal{E} \times \mathcal{T}$. The scores measure the plausibility of quadruples. For a query $(u_i, r_j, ?, t_l)$, TKGE models first fill the blank with each entity in the temporal knowledge graph and then score the resulting quadruples. Valid quadruples are expected to have higher scores than invalid quadruples.
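The query procedure described above for static KGC can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy embeddings, the helper names `score` and `rank_tails`, and the choice of a DistMult-style trilinear score are all assumptions made for the example.

```python
# Toy link-prediction query (u_i, r_j, ?): fill the blank with every
# entity, score each resulting triple, and rank the candidates.
# All embeddings below are made-up values for illustration only.

def score(u, r, v):
    """Assumed DistMult-style score: sum_d u[d] * r[d] * v[d]."""
    return sum(ud * rd * vd for ud, rd, vd in zip(u, r, v))

entity_emb = {
    "lions": [1.0, 0.5],
    "tigers": [0.9, 0.6],
    "mammals": [1.0, 1.0],
    "Berlin": [-1.0, 0.2],
}
relation_emb = {"is": [1.0, 1.0]}

def rank_tails(head, relation):
    """Return all candidate tails, most plausible first."""
    u, r = entity_emb[head], relation_emb[relation]
    scored = [(score(u, r, v), name) for name, v in entity_emb.items()]
    return [name for _, name in sorted(scored, reverse=True)]

ranking = rank_tails("lions", "is")
```

In practice, evaluation protocols usually filter out candidates that are already known to be valid before ranking; that step is omitted here for brevity.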

2.3 Knowledge Graph Embeddings

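Knowledge graph embeddings aim to embed entities, relations, and possible timestamps in static and temporal knowledge graphs into low-dimensional continuous vector spaces. They fall into two important categories: distance based models and semantic matching models.

Distance Based (DB) Models for Static KGs. DB models define the score function $s$ with Minkowski distances. That is, the score functions have the form $s(u_i, r_j, v_k) = -\|\Gamma(u_i, r_j, v_k)\|_p$, where $\Gamma$ is a model-specific function.

Semantic Matching (SM) Models for Static KGs. SM models regard a knowledge graph as a third-order binary tensor $\mathcal{X} \in \{0, 1\}^{|\mathcal{E}| \times |\mathcal{R}| \times |\mathcal{E}|}$. The $(i, j, k)$ entry $\mathcal{X}_{ijk} = 1$ if $(u_i, r_j, v_k)$ is valid, and $\mathcal{X}_{ijk} = 0$ otherwise. Suppose that $\mathcal{X}_j$ denotes the $j$-th frontal slice of $\mathcal{X}$, i.e., the adjacency matrix of the $j$-th relation. Generally, an SM KGE model factorizes $\mathcal{X}_j$ as $\mathcal{X}_j \approx \mathrm{Re}(\overline{\mathbf{U}} \mathbf{R}_j \mathbf{V}^\top)$, where the $i$-th ($k$-th) row of $\mathbf{U}$ ($\mathbf{V}$) is $\mathbf{u}_i$ ($\mathbf{v}_k$), $\mathbf{R}_j$ is a matrix representing relation $r_j$, and $\mathrm{Re}(\cdot)$ and $\overline{\,\cdot\,}$ are the real part and the conjugate of a complex matrix, respectively. That is, the score functions are defined as $s(u_i, r_j, v_k) = \mathrm{Re}(\overline{\mathbf{u}}_i \mathbf{R}_j \mathbf{v}_k^\top)$. Note that the real part and the conjugate of a real matrix are the matrix itself. The aim of SM models is to seek matrices $\mathbf{U}, \mathbf{R}_1, \ldots, \mathbf{R}_{|\mathcal{R}|}, \mathbf{V}$ such that $\mathrm{Re}(\overline{\mathbf{U}} \mathbf{R}_j \mathbf{V}^\top)$ approximates $\mathcal{X}_j$. Let $\hat{\mathcal{X}}_j = \mathrm{Re}(\overline{\mathbf{U}} \mathbf{R}_j \mathbf{V}^\top)$ and let $\hat{\mathcal{X}}$ be the tensor whose $j$-th frontal slice is $\hat{\mathcal{X}}_j$. Then the regularized formulation of a semantic matching model can be written as

$$\min_{\hat{\mathcal{X}}_1, \ldots, \hat{\mathcal{X}}_{|\mathcal{R}|}} \sum_{j=1}^{|\mathcal{R}|} L(\mathcal{X}_j, \hat{\mathcal{X}}_j) + \lambda g(\hat{\mathcal{X}}), \tag{1}$$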


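where $L(\mathcal{X}_j, \hat{\mathcal{X}}_j)$ measures the discrepancy between $\mathcal{X}_j$ and $\hat{\mathcal{X}}_j$, $g$ is the regularization function, and $\lambda > 0$ is a fixed parameter.

Distance Based (DB) Models for Temporal KGs. Temporal DB models define the score function $s$ with the Minkowski distance. That is, the score functions have the form $s(u_i, r_j, v_k, t_l) = -\|\Gamma(u_i, r_j, v_k, t_l)\|_p$, where $\Gamma$ is a model-specific function.

Semantic Matching (SM) Models for Temporal KGs. Temporal SM models regard a temporal knowledge graph as a fourth-order binary tensor $\mathcal{X} \in \{0, 1\}^{|\mathcal{E}| \times |\mathcal{R}| \times |\mathcal{E}| \times |\mathcal{T}|}$. The $(i, j, k, l)$ entry $\mathcal{X}_{ijkl} = 1$ if $(u_i, r_j, v_k, t_l)$ is valid, and $\mathcal{X}_{ijkl} = 0$ otherwise. Suppose that $\mathcal{X}_{jl}$ denotes the $(j, l)$ slice of $\mathcal{X}$, i.e., the adjacency matrix given the $j$-th relation and the $l$-th timestamp. Generally, an SM temporal KGE model factorizes $\mathcal{X}_{jl}$ as $\mathcal{X}_{jl} \approx \mathrm{Re}(\overline{\mathbf{U}} (\mathbf{R}_j \odot \mathbf{T}_l) \mathbf{V}^\top)$, where the $i$-th ($k$-th) row of $\mathbf{U}$ ($\mathbf{V}$) is $\mathbf{u}_i$ ($\mathbf{v}_k$), $\mathbf{R}_j$ is a matrix representing relation $r_j$, $\mathbf{T}_l$ is a matrix representing timestamp $t_l$, $\mathrm{Re}(\cdot)$ and $\overline{\,\cdot\,}$ are the real part and the conjugate of a complex matrix, and $\odot$ is the element-wise multiplication between two matrices. The aim of temporal SM models is to seek matrices $\mathbf{U}, \mathbf{R}_1, \ldots, \mathbf{R}_{|\mathcal{R}|}, \mathbf{V}$ and $\mathbf{T}_1, \ldots, \mathbf{T}_{|\mathcal{T}|}$ such that $\mathrm{Re}(\overline{\mathbf{U}} (\mathbf{R}_j \odot \mathbf{T}_l) \mathbf{V}^\top)$ approximates $\mathcal{X}_{jl}$. Let $\hat{\mathcal{X}}_{jl} = \mathrm{Re}(\overline{\mathbf{U}} (\mathbf{R}_j \odot \mathbf{T}_l) \mathbf{V}^\top)$ and let $\hat{\mathcal{X}}$ be the tensor whose $(j, l)$ slice is $\hat{\mathcal{X}}_{jl}$. Then the regularized formulation of a temporal semantic matching model can be written as

$$\min_{\hat{\mathcal{X}}_{jl}} \sum_{j=1}^{|\mathcal{R}|} \sum_{l=1}^{|\mathcal{T}|} L(\mathcal{X}_{jl}, \hat{\mathcal{X}}_{jl}) + \lambda g(\hat{\mathcal{X}}), \tag{2}$$

where $\lambda > 0$ is a fixed parameter, $L(\mathcal{X}_{jl}, \hat{\mathcal{X}}_{jl})$ measures the discrepancy between $\mathcal{X}_{jl}$ and $\hat{\mathcal{X}}_{jl}$, and $g$ is the regularization function.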

2.4 Other Notations

We use $u_i \in \mathcal{E}$ and $v_k \in \mathcal{E}$ to distinguish head and tail entities. Let $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|_F$ denote the $L_1$ norm, the $L_2$ norm, and the Frobenius norm of matrices or vectors. We use $\langle \cdot, \cdot \rangle$ to represent the inner product of two real or complex vectors.

3 RELATED WORK

This work is related to knowledge graph embedding models and to regularizers for semantic matching knowledge graph embeddings.

3.1 Knowledge Graph Embedding Models

Popular knowledge graph embeddings include distance-based, learning-based, and semantic matching models.

Distance based models describe relations as relational maps between head and tail entities. Then, they use the Minkowski distance to measure the plausibility of a given triplet or quadruple: lower distances correspond to higher plausibility. For example, TransE [12] and its variants [8], [9] represent relations as translations in vector spaces. They assume that a valid triplet $(u_i, r_j, v_k)$ satisfies $\mathbf{u}_{i,r_j} + \mathbf{r}_j \approx \mathbf{v}_{k,r_j}$, where $\mathbf{u}_{i,r_j}$ and $\mathbf{v}_{k,r_j}$ mean that entity embeddings may be relation-specific. Structured embedding (SE) [13] uses linear maps to represent relations. Its score function is defined as $s(u_i, r_j, v_k) = -\|\mathbf{R}_j^1 \mathbf{u}_i - \mathbf{R}_j^2 \mathbf{v}_k\|_1$. RotatE [14] defines each relation as a rotation in a complex vector space, and its score function is defined as $s(u_i, r_j, v_k) = -\|\mathbf{u}_i \circ \mathbf{r}_j - \mathbf{v}_k\|_1$, where $\mathbf{u}_i, \mathbf{r}_j, \mathbf{v}_k \in \mathbb{C}^k$ and $|[\mathbf{r}]_i| = 1$. ModE [15] assumes that $\mathbf{R}_j^1$ is diagonal and $\mathbf{R}_j^2$ is an identity matrix. It shares a similar score function $s(u_i, r_j, v_k) = -\|\mathbf{u}_i \circ \mathbf{r}_j - \mathbf{v}_k\|_1$ with RotatE, but with $\mathbf{u}_i, \mathbf{r}_j, \mathbf{v}_k \in \mathbb{R}^k$. Some works extend existing distance-based models to temporal KGs. For example, TTransE [16] extends TransE by learning timestamp embeddings. Inspired by TransH [8], HyTE [17] projects entity and relation embeddings onto timestamp-specific hyperplanes.
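The translation and rotation score functions quoted above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the TransE variant uses the $L_1$ norm and ignores relation-specific projections, and the RotatE relation is parameterized by phase angles so that every entry has unit modulus.

```python
import cmath

def l1_norm(vec):
    """Sum of absolute values (moduli for complex entries)."""
    return sum(abs(x) for x in vec)

def transe_score(u, r, v):
    """TransE-style score -||u + r - v||_1 (no relation-specific projection)."""
    return -l1_norm([a + b - c for a, b, c in zip(u, r, v)])

def rotate_score(u, phases, v):
    """RotatE-style score -||u o r - v||_1 with r[d] = exp(i * phases[d])."""
    r = [cmath.exp(1j * p) for p in phases]  # each entry has modulus 1
    return -l1_norm([a * b - c for a, b, c in zip(u, r, v)])

# A triple that satisfies the translation exactly has the maximal score 0.
perfect_translation = transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
worse_translation = transe_score([1.0, 0.0], [0.0, 1.0], [3.0, 1.0])

# Rotating 1 by 90 degrees gives i, so this tail matches up to rounding.
perfect_rotation = rotate_score([1 + 0j], [cmath.pi / 2], [1j])
worse_rotation = rotate_score([1 + 0j], [cmath.pi / 2], [-1j])
```

In both models a lower distance means a higher score, so the scores are non-positive and a perfect match scores zero.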

Learning-based models use neural networks, such as convolutional neural networks [18], capsule networks [19], and graph neural networks [20], to model the interaction among triplets or quadruples. Then, they use either inner products or distance functions to measure the plausibility of given triplets or quadruples.

Semantic matching models usually formulate the KGC task as a third-order binary tensor completion problem. RESCAL [21] factorizes the $j$-th frontal slice of $\mathcal{X}$ as $\mathcal{X}_j \approx \mathbf{A} \mathbf{R}_j \mathbf{A}^\top$. That is, embeddings of head and tail entities are from the same space. As the relation-specific matrices contain many parameters, RESCAL is prone to overfitting. DistMult [22] simplifies the matrix $\mathbf{R}_j$ in RESCAL to be diagonal, but this sacrifices expressiveness and the model can only handle symmetric relations. In order to model asymmetric relations, ComplEx [23] extends DistMult to complex embeddings. Both DistMult and ComplEx can be regarded as variants of canonical decomposition (CP) [24], in real and complex vector spaces, respectively. Some works also extend semantic matching models to temporal KGs. For example, ConT [25] extends TuckER [26] by additionally learning timestamp embeddings. ASALSAN [27] uses three-way DEDICOM [28] to express temporal relations, which shares similar ideas with RESCAL [21]. Besides, TComplEx and TNTComplEx [29] reduce memory complexity by extending ComplEx to the decomposition of a fourth-order tensor.
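The symmetry limitation of DistMult and the way complex embeddings remove it can be checked on a toy example. This is a hedged sketch: the function names and values are made up, and the ComplEx-style score conjugates the tail embedding, which is one common convention and an assumption here.

```python
def distmult_score(u, r, v):
    """Diagonal real relation: sum_d u[d] * r[d] * v[d]."""
    return sum(a * b * c for a, b, c in zip(u, r, v))

def complex_score(u, r, v):
    """ComplEx-style score Re(sum_d u[d] * r[d] * conj(v[d]))."""
    return sum(a * b * c.conjugate() for a, b, c in zip(u, r, v)).real

# DistMult: swapping head and tail never changes the score.
u_real, r_real, v_real = [1.0, 2.0], [0.5, -1.0], [3.0, 4.0]
dm_forward = distmult_score(u_real, r_real, v_real)
dm_backward = distmult_score(v_real, r_real, u_real)

# ComplEx: an imaginary relation entry makes the score direction-aware.
u_c, r_c, v_c = [1 + 0j], [1j], [1j]
cx_forward = complex_score(u_c, r_c, v_c)
cx_backward = complex_score(v_c, r_c, u_c)
```

With these values the DistMult scores coincide, while the two ComplEx scores have opposite signs, so the model can rank (u, r, v) above (v, r, u).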

Compared with distance-based models, semantic matching knowledge graph embeddings are more time-efficient, as the computation of linear transformations and inner products is easy to parallelize. Moreover, due to the property of inner products, semantic matching is good at modeling one-to-many, many-to-one, and many-to-many relations. For example, given the entity embedding $\mathbf{u}$ and relation embedding $\mathbf{r}$, we can find different entity embeddings $\mathbf{v}_i$ such that the triples $(u, r, v_i)$ have the same scores. In order to handle complex relations, distance-based models usually turn to relation-specific projections, which further increases the computational burden. Compared with learning-based models, semantic matching knowledge graph embeddings are more time-efficient thanks to their simpler formulations.

However, also due to the property of inner products, in semantic matching, entities with similar semantics may have dissimilar embeddings, which degrades knowledge graph completion performance. As introduced in Section 4.1, even if the embeddings of Laue and Schottky are orthogonal, the triples (Laue, StudiedIn, HU Berlin) and (Schottky, StudiedIn, HU Berlin) are likely to have the same scores when the two entity embeddings have proper norms. Moreover, distance-based models are better than semantic matching models at capturing some properties of knowledge graphs. For example, RotatE [14] can handle composition of relations; HAKE [15] and MuRP [30] can model semantic hierarchies.

It is worth noting that many general embedding learning methods are closely related to knowledge graph embeddings, all of which aim to optimize similarity between objects. For example, some graph embedding models [31], [32], [33] define similarity functions between nodes, and more generally, some feature embedding approaches [34], [35], [36] learn discriminative feature representations by optimizing similarity between samples. Different from these works, knowledge graph embeddings need to model complex relations between entities, leading to a more challenging problem.

3.2 Regularizers for Semantic Matching Knowledge Graph Embeddings

Semantic matching (SM) KGE models often overfit severely, which motivates various regularizers. In the original papers of SM models, the authors usually use the squared Frobenius norm ($L_2$ norm) regularizer [21], [22], [23]. This regularizer does not bring satisfactory improvements. Consequently, SM models do not achieve performance comparable to that of distance based models [14], [15]. More recently, the authors of [37] propose to use the tensor nuclear 3-norm [10] (N3) as a regularizer, which brings more significant improvements than the squared Frobenius norm regularizer. However, it is designed for CP-like models, such as CP and ComplEx, and is not suitable for more general models such as RESCAL. Moreover, some regularization methods aim to leverage external background knowledge. For example, to model equivalence and inversion axioms, the authors of [38] impose a set of model-dependent soft constraints on the predicate embeddings. The authors of [39] use non-negativity constraints on entity embeddings and approximate entailment constraints on relation embeddings to impose prior beliefs upon the structure of the embedding space.

4 DURA FOR STATIC KGE

We introduce a novel regularizer, DUality-induced RegulArizer (DURA), for semantic matching knowledge graph embeddings. We first introduce the motivations in Section 4.1 and then introduce a basic version of DURA in Section 4.2. Finally, we introduce the formulation of DURA for static KGE in Section 4.3.

4.1 Motivations

To perform accurate link prediction, we expect tail entities connected to a head entity through the same relation to have similar embeddings. First, we claim that tail entities connected to a head entity through the same relation should have similar semantics. Suppose that we know a head entity $u_i$ and a relation $r_j$, and our aim is to predict the tail entity. If $r_j$ is a one-to-many relation, i.e., there exist two entities $v_1$ and $v_2$ such that both $(u_i, r_j, v_1)$ and $(u_i, r_j, v_2)$ are valid, then we expect that $v_1$ and $v_2$ have similar semantics. For example, if two triples (Planck, isDoctoralAdvisorOf, Laue) and (Planck, isDoctoralAdvisorOf, Schottky) are valid, then Laue and Schottky should have similar semantics. Further, we expect that entities with similar semantics have similar embeddings. In this way, if we know that (Laue, StudiedIn, HU Berlin) is valid, then we can predict that (Schottky, StudiedIn, HU Berlin) is likely to be valid, which is consistent with the fact. See Figure 1a for an illustration of the prediction process.

Fig. 1: (a) A KGC problem. (b) Without regularization. (c) With DURA. An illustration of why basic DURA can improve the performance of SM models when the embedding dimension is 2. Suppose that triples $(u_i, r_j, v_k)$ ($k = 1, 2, \ldots, n$) are valid. (a) Figure 1a demonstrates that tail entities connected to a head entity through the same relation should have similar embeddings. (b) Figure 1b shows that SM models without regularization can get the same score even though the embeddings of $v_k$ are dissimilar. (c) Figure 1c shows that with DURA, embeddings of $v_k$ are encouraged to lie in a small region.

However, SM models fail to achieve the above goal. As shown in Figure 1b, suppose that we know $\mathbf{u}_i \overline{\mathbf{R}}_j$ and the embedding dimension is 2. Then, we get the same score $s(u_i, r_j, v_k)$ for $k = 1, 2, \ldots, n$, so long as the embeddings $\mathbf{v}_k$ lie on the same line perpendicular to $\mathbf{u}_i \overline{\mathbf{R}}_j$. Generally, the tail entities $v_1$ and $v_2$ have similar semantics. However, their embeddings $\mathbf{v}_1$ and $\mathbf{v}_2$ can even be orthogonal, which means that the two embeddings are dissimilar. Therefore, the performance of SM models for knowledge graph completion is usually unsatisfactory.
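The phenomenon is easy to reproduce numerically in the real two-dimensional case. The vectors below are made-up values chosen for illustration, with `uR` standing for the already transformed head embedding.

```python
def dot(a, b):
    """Real inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

uR = [1.0, 1.0]   # assumed value of the transformed head embedding
v1 = [2.0, 0.0]   # tail embedding of a first entity
v2 = [0.0, 2.0]   # tail embedding of a second entity

score1 = dot(uR, v1)
score2 = dot(uR, v2)
tails_inner_product = dot(v1, v2)
```

Both tails lie on the line $x + y = 2$, which is perpendicular to `uR`, so they receive the same score even though they are orthogonal to each other.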

4.2 Basic DURA

Consider the static knowledge graph completion problem $(u_i, r_j, ?)$. That is, we are given the head entity and the relation, and we aim to predict the tail entity. Suppose that $f_j(i, k)$ measures the plausibility of a given triplet $(u_i, r_j, v_k)$, i.e., $f_j(i, k) = s(u_i, r_j, v_k)$. Then the score function of an SM model is

$$f_j(i, k) = \mathrm{Re}(\overline{\mathbf{u}}_i \mathbf{R}_j \mathbf{v}_k^\top) = \mathrm{Re}(\langle \mathbf{u}_i \overline{\mathbf{R}}_j, \mathbf{v}_k \rangle). \tag{3}$$


It first maps the entity embedding $\mathbf{u}_i$ by a linear transformation $\overline{\mathbf{R}}_j$, and then uses the real part of an inner product to measure the similarity between $\mathbf{u}_i \overline{\mathbf{R}}_j$ and $\mathbf{v}_k$. Notice that another commonly used similarity measure, the Euclidean distance, can replace the dot product similarity in Equation (3). We can obtain an associated distance based model formulated as

$$f_j^E(i, k) = -\|\mathbf{u}_i \overline{\mathbf{R}}_j - \mathbf{v}_k\|_2^2.$$

Note that we use the squared distance, which ranks candidates in the same way as the unsquared form. That is to say, there exists a duality: for an existing semantic matching KGE model (primal), there is often another distance based KGE model (dual) closely associated with it.

Specifically, the relationship between the primal and the dual can be formulated as

$$\begin{aligned} f_j^E(i, k) &= -\|\mathbf{u}_i \overline{\mathbf{R}}_j - \mathbf{v}_k\|_2^2 \\ &= -\|\mathbf{u}_i \overline{\mathbf{R}}_j\|_2^2 - \|\mathbf{v}_k\|_2^2 + 2\,\mathrm{Re}(\langle \mathbf{u}_i \overline{\mathbf{R}}_j, \mathbf{v}_k \rangle) \\ &= 2 f_j(i, k) - \|\mathbf{u}_i \overline{\mathbf{R}}_j\|_2^2 - \|\mathbf{v}_k\|_2^2. \end{aligned}$$
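The expansion relating the primal and dual scores can be verified numerically. The sketch below uses made-up complex embeddings and hypothetical helper names; `re_inner` implements the real part of the standard complex inner product, which is the only property the identity relies on.

```python
def row_times_matrix(u, M):
    """Row vector times matrix: (u M)[c] = sum_d u[d] * M[d][c]."""
    return [sum(u[d] * M[d][c] for d in range(len(u)))
            for c in range(len(M[0]))]

def conj_matrix(M):
    """Entry-wise complex conjugate of a matrix."""
    return [[complex(x).conjugate() for x in row] for row in M]

def sq_norm(x):
    """Squared L2 norm, valid for real or complex entries."""
    return sum(abs(a) ** 2 for a in x)

def re_inner(a, b):
    """Re<a, b> = Re(sum_d a[d] * conj(b[d]))."""
    return sum(x * complex(y).conjugate() for x, y in zip(a, b)).real

u = [1 + 1j, 2 - 1j]                    # head embedding (made up)
R = [[0.5 + 0j, 1j], [-1j, 2 + 0j]]     # relation matrix (made up)
v = [0.5 - 1j, 2 + 0j]                  # tail embedding (made up)

a = row_times_matrix(u, conj_matrix(R))          # transformed head
primal = re_inner(a, v)                          # semantic matching score
dual = -sq_norm([x - y for x, y in zip(a, v)])   # distance based score
expanded = 2 * primal - sq_norm(a) - sq_norm(v)  # right-hand side
```

Up to floating-point rounding, `dual` and `expanded` agree, which is the identity that turns the two squared norms into a regularizer for the primal model.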

When we train a distance based model, the aim is to maximize $f_j^E(i, k)$ for all valid triples $(u_i, r_j, v_k)$. Suppose that $\mathcal{S}$ is the set that contains all valid triples. Then, we have that

$$\max_{(u_i, r_j, v_k) \in \mathcal{S}} f_j^E(i, k) = \min_{(u_i, r_j, v_k) \in \mathcal{S}} -f_j^E(i, k) = \min_{(u_i, r_j, v_k) \in \mathcal{S}} -2 f_j(i, k) + \|\mathbf{u}_i \overline{\mathbf{R}}_j\|_2^2 + \|\mathbf{v}_k\|_2^2. \tag{4}$$

Therefore, the duality induces a regularizer for semantic matching KGE models, i.e.,

$$\sum_{(u_i, r_j, v_k) \in \mathcal{S}} \|\mathbf{u}_i \overline{\mathbf{R}}_j\|_2^2 + \|\mathbf{v}_k\|_2^2, \tag{5}$$

which is called basic DURA.

By Equation (4), we know that basic DURA constrains the distance between $\mathbf{u}_i \overline{\mathbf{R}}_j$ and $\mathbf{v}_k$. If $\mathbf{u}_i$ and $\mathbf{R}_j$ are known, then $\mathbf{v}_k$ will lie in a small region (see Figure 1c). Thus, tail entities connected to a head entity through the same relation will have similar embeddings, which meets our expectation and is beneficial to the prediction of unknown triples.

4.3 DURA

Basic DURA encourages tail entities with similar semantics to have similar embeddings. However, it cannot handle the case in which head entities have similar semantics.

Suppose that (Laue, StudiedIn, HU Berlin) and (Schottky, StudiedIn, HU Berlin) are valid. Similar to the discussion in Section 4.1, we expect that Laue and Schottky have similar semantics and thus have similar embeddings. If we further know that (DU Berlin, hasCitizen, Laue) is valid, we can predict that (DU Berlin, hasCitizen, Schottky) is valid. However, basic DURA cannot handle this case. Let $\mathbf{u}_1$, $\mathbf{u}_2$, $\mathbf{v}_1$, and $\mathbf{R}_1$ be the embeddings of Laue, Schottky, HU Berlin, and StudiedIn, respectively. Then, $\mathrm{Re}(\overline{\mathbf{u}}_1 \mathbf{R}_1 \mathbf{v}_1^\top)$ and $\mathrm{Re}(\overline{\mathbf{u}}_2 \mathbf{R}_1 \mathbf{v}_1^\top)$ can be equal even if $\mathbf{u}_1$ and $\mathbf{u}_2$ are orthogonal, as long as $\mathbf{u}_1 \overline{\mathbf{R}}_1 = \mathbf{u}_2 \overline{\mathbf{R}}_1$.
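A small real-valued example makes the failure case concrete. The embeddings below are made up for illustration; the relation matrix is deliberately singular so that two orthogonal head embeddings are mapped to the same vector.

```python
def dot(a, b):
    """Real inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def row_times_matrix(u, M):
    """Row vector times matrix: (u M)[c] = sum_d u[d] * M[d][c]."""
    return [sum(u[d] * M[d][c] for d in range(len(u)))
            for c in range(len(M[0]))]

u1 = [1.0, 0.0]                 # head embedding of a first entity
u2 = [0.0, 1.0]                 # head embedding of a second entity
R1 = [[1.0, 1.0], [1.0, 1.0]]   # singular relation matrix (made up)
v1 = [0.5, 2.0]                 # shared tail embedding

u1R = row_times_matrix(u1, R1)
u2R = row_times_matrix(u2, R1)
score_first = dot(u1R, v1)
score_second = dot(u2R, v1)
heads_inner_product = dot(u1, u2)
```

Here `u1R` and `u2R` coincide, so both triples get the same score and the same basic DURA penalty, while the head embeddings themselves stay orthogonal.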

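To tackle the above issue, noticing that $\mathrm{Re}(\overline{\mathbf{u}}_i \mathbf{R}_j \mathbf{v}_k^\top) = \mathrm{Re}(\mathbf{v}_k \mathbf{R}_j^\top \overline{\mathbf{u}}_i^\top)$, we define another dual distance based KGE model

$$f_j^E(i, k) = -\|\mathbf{v}_k \mathbf{R}_j^\top - \mathbf{u}_i\|_2^2.$$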

Then, similar to the derivation in Equation (4), the duality induces a regularizer given by

$$\sum_{(u_i, r_j, v_k) \in \mathcal{S}} \|\mathbf{v}_k \mathbf{R}_j^\top\|_2^2 + \|\mathbf{u}_i\|_2^2. \tag{6}$$

When an SM model is combined with regularizer (6), head entities with similar semantics will have similar embeddings.

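Finally, combining the regularizers (5) and (6), DURA has the form

$$\sum_{(u_i, r_j, v_k) \in \mathcal{S}} \left[ \|\mathbf{u}_i \overline{\mathbf{R}}_j\|_2^2 + \|\mathbf{v}_k\|_2^2 + \|\mathbf{v}_k \mathbf{R}_j^\top\|_2^2 + \|\mathbf{u}_i\|_2^2 \right]. \tag{7}$$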
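Equation (7) translates directly into a loop over the training triples. The following is a minimal sketch under stated assumptions, not the authors' implementation: embeddings are real (so conjugation is a no-op), head and tail entities share one embedding table as in RESCAL, and all names and values are hypothetical.

```python
def row_times_matrix(u, M):
    """Row vector times matrix: (u M)[c] = sum_d u[d] * M[d][c]."""
    return [sum(u[d] * M[d][c] for d in range(len(u)))
            for c in range(len(M[0]))]

def transpose(M):
    """Matrix transpose for a list-of-rows matrix."""
    return [list(col) for col in zip(*M)]

def sq_norm(x):
    """Squared L2 norm of a real vector."""
    return sum(a * a for a in x)

def dura(triples, entity_emb, relation_emb):
    """Sum of the four squared norms of Equation (7) over all triples."""
    total = 0.0
    for head, rel, tail in triples:
        u, R, v = entity_emb[head], relation_emb[rel], entity_emb[tail]
        total += (sq_norm(row_times_matrix(u, R)) + sq_norm(v)
                  + sq_norm(row_times_matrix(v, transpose(R))) + sq_norm(u))
    return total

entity_emb = {"Laue": [1.0, 0.0], "HU Berlin": [0.0, 2.0]}
relation_emb = {"StudiedIn": [[1.0, 1.0], [0.0, 1.0]]}
triples = [("Laue", "StudiedIn", "HU Berlin")]

penalty = dura(triples, entity_emb, relation_emb)
```

For this single triple the four terms are 2, 4, 8, and 1, so the penalty is 15. During training such a penalty would be scaled by the regularization weight and added to the loss of the semantic matching model.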

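4.4 Theoretical Analysis for Diagonal Relation Matrices

Ignoring the sampling frequency and summing DURA over all possible entities and relations, we can write the regularizer as

$$|\mathcal{E}| \sum_{j=1}^{|\mathcal{R}|} \left( \|\mathbf{U} \overline{\mathbf{R}}_j\|_F^2 + \|\mathbf{V}\|_F^2 + \|\mathbf{V} \mathbf{R}_j^\top\|_F^2 + \|\mathbf{U}\|_F^2 \right), \tag{8}$$

where $|\mathcal{E}|$ and $|\mathcal{R}|$ are the numbers of entities and relations, respectively.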

In the rest of this section, we use the same definitions of $\hat{\mathcal{X}}_j$ and $\hat{\mathcal{X}}$ as in problem (1). When the relation embedding matrices $\mathbf{R}_j$ are diagonal in $\mathbb{R}$ or $\mathbb{C}$, as in CP or ComplEx, DURA gives an upper bound on the tensor nuclear 2-norm of $\hat{\mathcal{X}}$, which is an extension of trace norm regularizers in matrix completion. To simplify the notation, we take CP as an example, in which all involved embeddings are real. The conclusion extends analogously to the complex case.

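Definition 1 ([10]). The nuclear 2-norm of a 3D tensor $\mathcal{A} \in \mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2} \otimes \mathbb{R}^{n_3}$ is

$$\|\mathcal{A}\|_* = \min \left\{ \sum_{i=1}^{r} \|\mathbf{u}_{1,i}\|_2 \|\mathbf{u}_{2,i}\|_2 \|\mathbf{u}_{3,i}\|_2 : \mathcal{A} = \sum_{i=1}^{r} \mathbf{u}_{1,i} \otimes \mathbf{u}_{2,i} \otimes \mathbf{u}_{3,i},\ r \in \mathbb{N} \right\},$$

where $\mathbf{u}_{k,i} \in \mathbb{R}^{n_k}$ for $k = 1, \ldots, 3$, $i = 1, \ldots, r$, and $\otimes$ denotes the outer product.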

For notational convenience, we define a relation matrix $\mathbf{R} \in \mathbb{R}^{|\mathcal{R}| \times D}$, whose $j$-th row consists of the diagonal entries of $\mathbf{R}_j$. That is, $\mathbf{R}(j, d) = \mathbf{R}_j(d, d)$, where $\mathbf{A}(i, j)$ represents the entry in the $i$-th row and $j$-th column of the matrix $\mathbf{A}$.

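In the knowledge graph completion problem, the tensor nuclear 2-norm of $\hat{\mathcal{X}}$ is

$$\|\hat{\mathcal{X}}\|_* = \min \left\{ \sum_{d=1}^{D} \|\mathbf{u}_{:d}\|_2 \|\mathbf{r}_{:d}\|_2 \|\mathbf{v}_{:d}\|_2 : \hat{\mathcal{X}} = \sum_{d=1}^{D} \mathbf{u}_{:d} \otimes \mathbf{r}_{:d} \otimes \mathbf{v}_{:d} \right\},$$

where $D$ is the embedding dimension, and $\mathbf{u}_{:d}$, $\mathbf{r}_{:d}$, and $\mathbf{v}_{:d}$ are the $d$-th columns of $\mathbf{U}$, $\mathbf{R}$, and $\mathbf{V}$.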

For DURA, we have the following theorem, whose proof is provided in Appendix A.


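Theorem 1. Suppose that $\hat{\mathcal{X}}_j = \mathbf{U} \mathbf{R}_j \mathbf{V}^\top$ for $j = 1, 2, \ldots, |\mathcal{R}|$, where $\mathbf{U}$, $\mathbf{V}$, $\mathbf{R}_j$ are real matrices and $\mathbf{R}_j$ is diagonal. Then, the following equation holds:

$$\min_{\hat{\mathcal{X}}_j = \mathbf{U} \mathbf{R}_j \mathbf{V}^\top} \frac{1}{4\sqrt{|\mathcal{R}|}} \sum_{j=1}^{|\mathcal{R}|} \left( \|\mathbf{U} \mathbf{R}_j\|_F^2 + \|\mathbf{V}\|_F^2 + \|\mathbf{V} \mathbf{R}_j^\top\|_F^2 + \|\mathbf{U}\|_F^2 \right) = \|\hat{\mathcal{X}}\|_*.$$

The minimizers satisfy $\|\mathbf{u}_{:d}\|_2 \|\mathbf{r}_{:d}\|_2 = \sqrt{|\mathcal{R}|}\, \|\mathbf{v}_{:d}\|_2$ and $\|\mathbf{v}_{:d}\|_2 \|\mathbf{r}_{:d}\|_2 = \sqrt{|\mathcal{R}|}\, \|\mathbf{u}_{:d}\|_2$ for all $d \in \{1, 2, \ldots, D\}$, where $\mathbf{u}_{:d}$, $\mathbf{r}_{:d}$, and $\mathbf{v}_{:d}$ are the $d$-th columns of $\mathbf{U}$, $\mathbf{R}$, and $\mathbf{V}$, respectively.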

Therefore, the minimization of DURA has the form of the tensor nuclear 2-norm, which is a tensor analog of the matrix trace norm. Matrix trace norm regularization has shown great power in the matrix completion problem. In the experiments, we will show that DURA reproduces this success in the tensor completion problem.
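The balancing conditions in Theorem 1 can be illustrated numerically for one fixed factorization. The sketch below does not compute the nuclear norm itself; it only checks, for made-up real factors with nonzero columns, that the scaled DURA value is never below the sum of products of column norms, and that rescaling the columns to satisfy the stated conditions closes the gap. All function names are hypothetical.

```python
import math

def norm(x):
    """L2 norm of a real vector."""
    return math.sqrt(sum(a * a for a in x))

def cp_bound(U_cols, R_cols, V_cols):
    """sum_d ||u_:d|| * ||r_:d|| * ||v_:d|| for the given factors."""
    return sum(norm(u) * norm(r) * norm(v)
               for u, r, v in zip(U_cols, R_cols, V_cols))

def scaled_dura(U_cols, R_cols, V_cols):
    """Objective of Theorem 1 for one factorization with diagonal R_j."""
    n_rel = len(R_cols[0])
    total = 0.0
    for j in range(n_rel):
        for u, r, v in zip(U_cols, R_cols, V_cols):
            nu2, nv2, rj2 = norm(u) ** 2, norm(v) ** 2, r[j] ** 2
            # column d of ||U R_j||^2 + ||V||^2 + ||V R_j^T||^2 + ||U||^2
            total += nu2 * rj2 + nv2 + nv2 * rj2 + nu2
    return total / (4.0 * math.sqrt(n_rel))

def balance(U_cols, R_cols, V_cols):
    """Rescale each column triple without changing its outer product."""
    n_rel = len(R_cols[0])
    out_u, out_r, out_v = [], [], []
    for u, r, v in zip(U_cols, R_cols, V_cols):
        b = math.sqrt(n_rel) / norm(r)             # ||r'|| = sqrt(|R|)
        target = math.sqrt(norm(u) * norm(v) / b)  # common norm of u', v'
        a, c = target / norm(u), target / norm(v)  # a * b * c == 1
        out_u.append([a * x for x in u])
        out_r.append([b * x for x in r])
        out_v.append([c * x for x in v])
    return out_u, out_r, out_v

U_cols = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]   # D = 2 columns, 3 entities
R_cols = [[2.0, -1.0], [0.5, 4.0]]             # 2 relations
V_cols = [[0.0, 1.0, 1.0], [2.0, 0.0, -2.0]]

bound = cp_bound(U_cols, R_cols, V_cols)
before = scaled_dura(U_cols, R_cols, V_cols)
after = scaled_dura(*balance(U_cols, R_cols, V_cols))
```

The inequality `before >= bound` is the arithmetic-geometric mean inequality applied column by column, and equality after balancing corresponds to the minimizer conditions stated in the theorem.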

5 DURA FOR TEMPORAL KGE

We first introduce a general TKGE model, TRESCAL, which is an extension of RESCAL to temporal KGs, and introduce the corresponding DURA regularizer. Then, we introduce the enhanced version of DURA for TComplEx [29].

5.1 TRESCAL and DURA for TRESCAL

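Inspired by the classical static KGE model RESCAL, we propose its temporal extension TRESCAL as follows:

$$s(u_i, r_j, v_k, t_l) = \mathbf{u}_i (\mathbf{R}_j \odot \mathbf{T}_l) \mathbf{v}_k^\top, \tag{9}$$

where $\mathbf{R}_j$ and $\mathbf{T}_l$ are trainable real matrices. Note that the element-wise multiplication $\odot$ is commutative, i.e., $\mathbf{A} \odot \mathbf{B} = \mathbf{B} \odot \mathbf{A}$ for all matrices $\mathbf{A}$ and $\mathbf{B}$ of the same shape.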

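In Equation (9), the product of relation and time matrices $\mathbf{R}_j \odot \mathbf{T}_l$ can be seen as the representation of a time-dependent relation, leading to $|\mathcal{R}| \times |\mathcal{T}|$ equivalent relations in a KG. By a derivation similar to that in Section 4.3, we propose the DURA regularizer (DURA1) for TRESCAL as follows:

$$\sum_{(u_i, r_j, v_k, t_l) \in \mathcal{S}} \left[ \|\mathbf{u}_i (\mathbf{R}_j \odot \mathbf{T}_l)\|_2^2 + \|\mathbf{v}_k\|_2^2 \right. \tag{10}$$

$$\left. {} + \|\mathbf{v}_k (\mathbf{R}_j \odot \mathbf{T}_l)^\top\|_2^2 + \|\mathbf{u}_i\|_2^2 \right]. \tag{11}$$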
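The TRESCAL score of Equation (9) and the per-quadruple term of its DURA regularizer can be sketched as follows. The matrices and vectors are made-up real values and the function names are hypothetical.

```python
def hadamard(A, B):
    """Element-wise product of two same-shape matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def row_times_matrix(u, M):
    """Row vector times matrix: (u M)[c] = sum_d u[d] * M[d][c]."""
    return [sum(u[d] * M[d][c] for d in range(len(u)))
            for c in range(len(M[0]))]

def transpose(M):
    """Matrix transpose for a list-of-rows matrix."""
    return [list(col) for col in zip(*M)]

def sq_norm(x):
    """Squared L2 norm of a real vector."""
    return sum(a * a for a in x)

def trescal_score(u, R, T, v):
    """u (R o T) v^T, with o the element-wise product."""
    uM = row_times_matrix(u, hadamard(R, T))
    return sum(a * b for a, b in zip(uM, v))

def trescal_dura_term(u, R, T, v):
    """The four squared norms of Equations (10)-(11) for one quadruple."""
    M = hadamard(R, T)
    return (sq_norm(row_times_matrix(u, M)) + sq_norm(v)
            + sq_norm(row_times_matrix(v, transpose(M))) + sq_norm(u))

u = [1.0, 2.0]
R = [[1.0, 2.0], [3.0, 4.0]]
T = [[1.0, 0.0], [0.5, 1.0]]
v = [1.0, -1.0]

quad_score = trescal_score(u, R, T, v)
quad_penalty = trescal_dura_term(u, R, T, v)
```

Because `hadamard(R, T)` plays the role of a single time-dependent relation matrix, the penalty has exactly the same structure as static DURA applied to that matrix.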

5.2 DURA for TComplEx

By introducing representations of time-dependent relations, we can extend ComplEx [23] to temporal KGs. The resulting model, TComplEx [29], is formulated as

$$s(u_i, r_j, v_k, t_l) = \mathrm{Re}(\overline{\mathbf{u}}_i (\mathbf{R}_j \odot \mathbf{T}_l) \mathbf{v}_k^\top). \tag{12}$$

As mentioned by Lacroix et al. [29], timestamp embeddings in temporal semantic matching models can equivalently be used to modulate either the relations or the entities to obtain time-dependent representations. Therefore, we introduce DURA for temporal KGE from two views.

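First, as relation embeddings and timestamp embeddings are diagonal matrices, the score function becomes $s(u_i, r_j, v_k, t_l) = \mathrm{Re}((\overline{\mathbf{u}}_i \mathbf{T}_l \mathbf{R}_j) \mathbf{v}_k^\top) = \mathrm{Re}(\overline{\mathbf{u}}_i (\mathbf{T}_l \mathbf{R}_j \mathbf{v}_k^\top))$. This formulation models time-dependent relation representations. We introduce a form of DURA (DURA1) as

$$\sum_{(u_i, r_j, v_k, t_l) \in \mathcal{S}} \left[ \|\mathbf{u}_i (\overline{\mathbf{R}}_j \overline{\mathbf{T}}_l)\|_2^2 + \|\mathbf{v}_k\|_2^2 + \|\mathbf{v}_k (\mathbf{R}_j \mathbf{T}_l)^\top\|_2^2 + \|\mathbf{u}_i\|_2^2 \right]. \tag{13}$$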

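Note that when the relation and time matrices are diagonal, the products on the right-hand side of Equation (12) are associative and the diagonal factors commute. Thus, we can reformulate the score function as $s(u_i, r_j, v_k, t_l) = \mathrm{Re}((\overline{\mathbf{u}}_i \mathbf{T}_l) \mathbf{R}_j \mathbf{v}_k^\top) = \mathrm{Re}(\overline{\mathbf{u}}_i \mathbf{R}_j (\mathbf{T}_l \mathbf{v}_k^\top))$, where $\mathbf{u}_i \mathbf{T}_l$ and $\mathbf{v}_k \mathbf{T}_l$ are time-dependent entity representations. Therefore, we derive another formulation of DURA (DURA2):

$$\sum_{(u_i, r_j, v_k, t_l) \in \mathcal{S}} \left[ \|\mathbf{u}_i \overline{\mathbf{R}}_j\|_2^2 + \|\mathbf{v}_k \mathbf{T}_l\|_2^2 + \|\mathbf{u}_i \overline{\mathbf{T}}_l\|_2^2 + \|\mathbf{v}_k \mathbf{R}_j\|_2^2 \right]. \tag{14}$$

We empirically evaluate the aforementioned two versions of DURA in Section 6.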
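With diagonal relation and time matrices stored as vectors, the two temporal regularizers differ only in how the factors are grouped. The sketch below uses made-up real values and hypothetical function names; since squared norms are unaffected by complex conjugation, conjugates are omitted.

```python
def elementwise(a, b):
    """Entry-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]

def sq_norm(x):
    """Squared L2 norm, valid for real or complex entries."""
    return sum(abs(a) ** 2 for a in x)

def dura1_term(u, r, t, v):
    """Time-dependent relation view: group r and t together."""
    rt = elementwise(r, t)
    return (sq_norm(elementwise(u, rt)) + sq_norm(v)
            + sq_norm(elementwise(v, rt)) + sq_norm(u))

def dura2_term(u, r, t, v):
    """Time-dependent entity view: pair entities with r and with t."""
    return (sq_norm(elementwise(u, r)) + sq_norm(elementwise(v, t))
            + sq_norm(elementwise(u, t)) + sq_norm(elementwise(v, r)))

u = [1.0, 2.0]   # head embedding
r = [2.0, 1.0]   # diagonal of the relation matrix
t = [0.5, 3.0]   # diagonal of the timestamp matrix
v = [1.0, 1.0]   # tail embedding

penalty_dura1 = dura1_term(u, r, t, v)
penalty_dura2 = dura2_term(u, r, t, v)
```

For these values the DURA1 term is 37 + 2 + 10 + 5 = 54 and the DURA2 term is 8 + 9.25 + 36.25 + 5 = 58.5. The two penalties generally differ, because DURA1 constrains the raw entity embeddings while DURA2 constrains their time-modulated versions.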

5.3 Semantic Meaning of Temporal DURA

The semantic meaning of DURA1 is similar to that of DURA for static KGE. That is, we expect tail/head entities connected to a head/tail entity through similar time-dependent relation representations to have similar embeddings. For example, as shown in Figure 2a, if two quadruples (Planck, isDoctoralAdvisorOf, Laue, 1902-1903) and (Planck, isDoctoralAdvisorOf, Schottky, 1909-1912) are valid, then we expect that Laue and Schottky have similar semantics. Let $\mathbf{u}_i$, $\mathbf{R}_j$, $\mathbf{v}_k^{(1)}$, $\mathbf{v}_k^{(2)}$, $\mathbf{T}_l^{(1)}$, and $\mathbf{T}_l^{(2)}$ be the embeddings of Planck, isDoctoralAdvisorOf, Laue, Schottky, 1902-1903, and 1909-1912, respectively. In temporal KGE, close timestamps usually have similar representations [29] because a timestamp smoothness regularizer is incorporated. For convenience of analysis and without loss of generality, we assume that $\mathbf{T}_l^{(1)} = \mathbf{T}_l^{(2)} = \mathbf{T}_l$ and hence $\mathbf{u}_i (\overline{\mathbf{R}}_j \overline{\mathbf{T}}_l^{(1)}) = \mathbf{u}_i (\overline{\mathbf{R}}_j \overline{\mathbf{T}}_l^{(2)}) = \mathbf{u}_i (\overline{\mathbf{R}}_j \overline{\mathbf{T}}_l)$. As shown in Figure 2b, we get the same score $s(u_i, r_j, v_k, t_l)$ for any $\mathbf{v}_k$, so long as $\mathbf{v}_k$ lies on the same line perpendicular to $\mathbf{u}_i (\overline{\mathbf{R}}_j \overline{\mathbf{T}}_l)$. When we train SM models without DURA1, the embeddings $\mathbf{v}_k^{(1)}$ and $\mathbf{v}_k^{(2)}$ can even be orthogonal, which means that the two embeddings are dissimilar. In contrast, Figure 2c shows that DURA1 encourages $\mathbf{v}_k^{(1)}$ and $\mathbf{v}_k^{(2)}$ to lie in a small region, which means that the embeddings of Laue and Schottky are similar. Thus, when (Laue, StudiedIn, HU Berlin, 1902-1903) is valid, DURA1 encourages (Schottky, StudiedIn, HU Berlin, 1905-1912) to have a high score. The prediction is reasonable, as two persons advised by the same person in a similar period of time tend to study at the same university.

DURA2 encourages time-dependent entities with sim-ilar semantics to have similar embeddings, while similarembeddings of time-dependent entities do not imply sim-ilar embeddings of original entities. Similar to the abovediscussion, DURA2 encourages the time-dependent entitiesv(1)k Tl and v(2)

k Tl to locate in a small region. However, asshown in Figure 2e, when embeddings of time-dependententities vkTl are projections of vk onto the subspace spanned


[Figure 2 panels: (a) A temporal KGC problem, with entities Planck, Laue, Schottky, and HU Berlin. (b) Without DURA1. (c) With DURA1. (d) Without DURA2. (e) With DURA2.]

Fig. 2: An illustration of how temporal DURA influences the performance of SM models when the embedding dimension is 2. Suppose that the quadruples $(u_i, r_j, v_k, t_l)$ ($k = 1, 2, \ldots, n$) are valid. Figures 2b and 2d show that SM models without regularization can get the same score even though the embeddings of the original entities $\mathbf{v}_k$ and of the time-dependent entities $\mathbf{v}_k\mathbf{T}_l$, respectively, are dissimilar. Figure 2c shows that with DURA1, the embeddings $\mathbf{v}_k$ are encouraged to lie in a small region. Figure 2e shows that with DURA2, the embeddings of the time-dependent entities $\mathbf{v}_k\mathbf{T}_l$ are encouraged to lie in a small region, while the corresponding embeddings of the original entities may be dissimilar.

by $\mathbf{u}_i\mathbf{R}_j$, the embeddings of the entities $\mathbf{v}_k^{(1)}$ and $\mathbf{v}_k^{(2)}$ can be dissimilar, even if the corresponding time-dependent entities $\mathbf{v}_k\mathbf{T}_l$ are the same. Thus, when (Laue, StudiedIn, HU Berlin, 1905-1912) is valid, (Schottky, StudiedIn, HU Berlin, 1902-1903) is not guaranteed to have a high score. However, as shown in Section 5.4, both DURA1 and DURA2 correspond to the nuclear 2-norm of 4th-order tensors. Thus, both of them can improve the performance of semantic matching models, although we expect DURA1 to outperform DURA2. The experiments in Section 6.2.2 confirm this expectation.

5.4 Theoretical Analysis for Diagonal Relation and Time Matrices

We give an analysis for the case in which the relation and time matrices are diagonal (i.e., TComplEx).

First, we review the definition of the nuclear 2-norm of 4D tensors.

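Definition 2 ([10]). The nuclear 2-norm of a 4D tensor $\mathcal{A} \in \mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2} \otimes \mathbb{R}^{n_3} \otimes \mathbb{R}^{n_4}$ is

\[
\|\mathcal{A}\|_* = \min\Big\{ \sum_{i=1}^{r} \|\mathbf{u}_{1,i}\|_2 \|\mathbf{u}_{2,i}\|_2 \|\mathbf{u}_{3,i}\|_2 \|\mathbf{u}_{4,i}\|_2 \;:\; \mathcal{A} = \sum_{i=1}^{r} \mathbf{u}_{1,i} \otimes \mathbf{u}_{2,i} \otimes \mathbf{u}_{3,i} \otimes \mathbf{u}_{4,i},\; r \in \mathbb{N} \Big\},
\]

where $\mathbf{u}_{k,i} \in \mathbb{R}^{n_k}$ for $k = 1, \ldots, 4$, $i = 1, \ldots, r$, and $\otimes$ denotes the outer product.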

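In the temporal knowledge graph completion problem, the tensor nuclear 2-norm of $\mathcal{X}$ is

\[
\|\mathcal{X}\|_* = \min\Big\{ \sum_{d=1}^{D} \|\mathbf{u}_{:d}\|_2 \|\mathbf{r}_{:d}\|_2 \|\mathbf{v}_{:d}\|_2 \|\mathbf{t}_{:d}\|_2 \;:\; \mathcal{X} = \sum_{d=1}^{D} \mathbf{u}_{:d} \otimes \mathbf{r}_{:d} \otimes \mathbf{v}_{:d} \otimes \mathbf{t}_{:d} \Big\},
\]

where $\mathbf{u}_{:d}$, $\mathbf{r}_{:d}$, $\mathbf{v}_{:d}$, and $\mathbf{t}_{:d}$ are the $d$-th columns of the head entity matrix $\mathbf{U}$, the relation matrix $\mathbf{R}$, the tail entity matrix $\mathbf{V}$, and the time matrix $\mathbf{T}$, respectively.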
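To make the decomposition concrete, the sketch below builds a 4th-order tensor from factor matrices and evaluates the sum of column-norm products for that one decomposition. This value is only an upper bound on the nuclear 2-norm, because the norm is a minimum over all decompositions. The sizes and the helper name `cp_norm_product` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, n_time, D = 5, 3, 4, 2  # illustrative sizes

U = rng.normal(size=(n_ent, D))   # head entity factors, columns u_{:d}
R = rng.normal(size=(n_rel, D))   # relation factors, columns r_{:d}
V = rng.normal(size=(n_ent, D))   # tail entity factors, columns v_{:d}
T = rng.normal(size=(n_time, D))  # timestamp factors, columns t_{:d}

# X = sum_d u_{:d} (outer) r_{:d} (outer) v_{:d} (outer) t_{:d}
X = np.einsum("id,jd,kd,ld->ijkl", U, R, V, T)


def cp_norm_product(U, R, V, T):
    """Sum over d of the product of the four column L2 norms."""
    col = lambda M: np.linalg.norm(M, axis=0)
    return float(np.sum(col(U) * col(R) * col(V) * col(T)))


upper_bound = cp_norm_product(U, R, V, T)
```

By the triangle inequality this quantity is also at least the Frobenius norm of `X`, which gives a quick sanity check on the construction.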

Then, we have the following theorems for DURA, which correspond to the regularizers (13) and (14), respectively. The proofs are provided in Appendices B and C.

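Theorem 2. Suppose that $\mathcal{X}_{jl} = \mathbf{U}(\mathbf{R}_j\mathbf{T}_l)\mathbf{V}^\top$ for $j = 1, 2, \ldots, |\mathcal{R}|$ and $l = 1, 2, \ldots, |\mathcal{T}|$, where $\mathbf{U}, \mathbf{V}, \mathbf{R}_j, \mathbf{T}_l$ are real matrices and $\mathbf{R}_j$ and $\mathbf{T}_l$ are diagonal. Then, the following equation holds:

\[
\|\mathcal{X}\|_* = \min_{\mathcal{X}_{jl} = \mathbf{U}(\mathbf{R}_j\mathbf{T}_l)\mathbf{V}^\top} \frac{1}{4\sqrt{|\mathcal{R}||\mathcal{T}|}} \sum_{l=1}^{|\mathcal{T}|} \sum_{j=1}^{|\mathcal{R}|} \Big( \|\mathbf{U}\mathbf{R}_j\mathbf{T}_l\|_F^2 + \|\mathbf{V}\|_F^2 + \|\mathbf{V}(\mathbf{R}_j\mathbf{T}_l)^\top\|_F^2 + \|\mathbf{U}\|_F^2 \Big).
\]

The minimizers satisfy $\|\mathbf{u}_{:d}\|_2 \|\mathbf{r}_{:d}\|_2 \|\mathbf{t}_{:d}\|_2 = \sqrt{|\mathcal{R}||\mathcal{T}|}\, \|\mathbf{v}_{:d}\|_2$ and $\|\mathbf{v}_{:d}\|_2 \|\mathbf{r}_{:d}\|_2 \|\mathbf{t}_{:d}\|_2 = \sqrt{|\mathcal{R}||\mathcal{T}|}\, \|\mathbf{u}_{:d}\|_2$, $\forall\, d \in \{1, 2, \ldots, D\}$, where $\mathbf{u}_{:d}$, $\mathbf{r}_{:d}$, $\mathbf{v}_{:d}$, and $\mathbf{t}_{:d}$ are the $d$-th columns of $\mathbf{U}$, $\mathbf{R}$, $\mathbf{V}$, and $\mathbf{T}$, respectively.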

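Theorem 3. Suppose that $\mathcal{X}_{jl} = \mathbf{U}(\mathbf{R}_j\mathbf{T}_l)\mathbf{V}^\top$ for $j = 1, 2, \ldots, |\mathcal{R}|$ and $l = 1, 2, \ldots, |\mathcal{T}|$, where $\mathbf{U}, \mathbf{V}, \mathbf{R}_j, \mathbf{T}_l$ are real matrices and $\mathbf{R}_j$ and $\mathbf{T}_l$ are diagonal. Then, the following equation holds:

\[
\|\mathcal{X}\|_* = \min_{\mathcal{X}_{jl} = \mathbf{U}(\mathbf{R}_j\mathbf{T}_l)\mathbf{V}^\top} \frac{1}{4\sqrt{|\mathcal{R}||\mathcal{T}|}} \sum_{l=1}^{|\mathcal{T}|} \sum_{j=1}^{|\mathcal{R}|} \Big( \|\mathbf{U}\mathbf{R}_j\|_F^2 + \|\mathbf{V}\mathbf{T}_l^\top\|_F^2 + \|\mathbf{U}\mathbf{T}_l\|_F^2 + \|\mathbf{V}\mathbf{R}_j^\top\|_F^2 \Big).
\]

The minimizers satisfy $\sqrt{|\mathcal{T}|}\, \|\mathbf{u}_{:d}\|_2 \|\mathbf{r}_{:d}\|_2 = \sqrt{|\mathcal{R}|}\, \|\mathbf{v}_{:d}\|_2 \|\mathbf{t}_{:d}\|_2$ and $\sqrt{|\mathcal{R}|}\, \|\mathbf{u}_{:d}\|_2 \|\mathbf{t}_{:d}\|_2 = \sqrt{|\mathcal{T}|}\, \|\mathbf{v}_{:d}\|_2 \|\mathbf{r}_{:d}\|_2$, $\forall\, d \in \{1, 2, \ldots, D\}$, where $\mathbf{u}_{:d}$, $\mathbf{r}_{:d}$, $\mathbf{v}_{:d}$, and $\mathbf{t}_{:d}$ are the $d$-th columns of $\mathbf{U}$, $\mathbf{R}$, $\mathbf{V}$, and $\mathbf{T}$, respectively.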
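The key inequality behind Theorem 3 can be sketched with the AM-GM inequality. This is only an outline of why the bound and the stated equality conditions are plausible, not the proof given in the appendix. It assumes that $\mathbf{r}_{:d}$ stacks the $d$-th diagonal entries of $\mathbf{R}_1, \ldots, \mathbf{R}_{|\mathcal{R}|}$ and that $\mathbf{t}_{:d}$ does the same for the time matrices, consistent with the CP form above.

```latex
\begin{align*}
\sum_{l=1}^{|\mathcal{T}|}\sum_{j=1}^{|\mathcal{R}|}
  \big(\|\mathbf{U}\mathbf{R}_j\|_F^2 + \|\mathbf{V}\mathbf{T}_l^\top\|_F^2\big)
&= \sum_{d=1}^{D}\big(|\mathcal{T}|\,\|\mathbf{u}_{:d}\|_2^2\|\mathbf{r}_{:d}\|_2^2
   + |\mathcal{R}|\,\|\mathbf{v}_{:d}\|_2^2\|\mathbf{t}_{:d}\|_2^2\big) \\
&\ge 2\sqrt{|\mathcal{R}||\mathcal{T}|}\sum_{d=1}^{D}
   \|\mathbf{u}_{:d}\|_2\|\mathbf{r}_{:d}\|_2\|\mathbf{v}_{:d}\|_2\|\mathbf{t}_{:d}\|_2, \\
\sum_{l=1}^{|\mathcal{T}|}\sum_{j=1}^{|\mathcal{R}|}
  \big(\|\mathbf{U}\mathbf{T}_l\|_F^2 + \|\mathbf{V}\mathbf{R}_j^\top\|_F^2\big)
&= \sum_{d=1}^{D}\big(|\mathcal{R}|\,\|\mathbf{u}_{:d}\|_2^2\|\mathbf{t}_{:d}\|_2^2
   + |\mathcal{T}|\,\|\mathbf{v}_{:d}\|_2^2\|\mathbf{r}_{:d}\|_2^2\big) \\
&\ge 2\sqrt{|\mathcal{R}||\mathcal{T}|}\sum_{d=1}^{D}
   \|\mathbf{u}_{:d}\|_2\|\mathbf{r}_{:d}\|_2\|\mathbf{v}_{:d}\|_2\|\mathbf{t}_{:d}\|_2.
\end{align*}
```

Adding the two lines and dividing by $4\sqrt{|\mathcal{R}||\mathcal{T}|}$ bounds the regularizer from below by the sum of norm products that appears in the definition of the nuclear 2-norm. Equality in each AM-GM step holds exactly when $\sqrt{|\mathcal{T}|}\,\|\mathbf{u}_{:d}\|_2\|\mathbf{r}_{:d}\|_2 = \sqrt{|\mathcal{R}|}\,\|\mathbf{v}_{:d}\|_2\|\mathbf{t}_{:d}\|_2$ and $\sqrt{|\mathcal{R}|}\,\|\mathbf{u}_{:d}\|_2\|\mathbf{t}_{:d}\|_2 = \sqrt{|\mathcal{T}|}\,\|\mathbf{v}_{:d}\|_2\|\mathbf{r}_{:d}\|_2$, which matches the minimizer conditions in the theorem.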

Therefore, the minimization of DURA in the temporal knowledge graph case takes the form of the tensor nuclear 2-norm of 4th-order tensors. Theorem 3 also explains why DURA2 improves the performance of temporal semantic matching KGEs, even though it cannot encourage entities with similar semantics to have similar representations.


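TABLE 1: Hyper-parameters found by grid search. k is the embedding size, b is the batch size, λ is the regularization coefficient, and λ1 and λ2 are weights for different parts of the regularizer.

| Model | Dataset | k | b | λ | λ1 | λ2 |
|---|---|---|---|---|---|---|
| CP | WN18RR | 2000 | 100 | 1e-1 | 0.5 | 1.5 |
| CP | FB15k-237 | 2000 | 100 | 5e-2 | 0.5 | 1.5 |
| CP | YAGO3-10 | 1000 | 1000 | 5e-3 | 0.5 | 1.5 |
| ComplEx | WN18RR | 2000 | 100 | 1e-1 | 0.5 | 1.5 |
| ComplEx | FB15k-237 | 2000 | 100 | 5e-2 | 0.5 | 1.5 |
| ComplEx | YAGO3-10 | 1000 | 1000 | 5e-2 | 0.5 | 1.5 |
| RESCAL | WN18RR | 512 | 1024 | 1e-1 | 1.0 | 1.0 |
| RESCAL | FB15k-237 | 512 | 512 | 1e-1 | 2.0 | 1.5 |
| RESCAL | YAGO3-10 | 512 | 1024 | 5e-2 | 1.0 | 1.0 |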

6 EXPERIMENTS

The knowledge graph completion task is a standard benchmark task to evaluate KGE models. To demonstrate that DURA is an effective and widely applicable regularizer, we conduct extensive experiments on both static and temporal knowledge graph completion datasets.

6.1 Static Knowledge Graph Completion

We introduce the experimental settings for static KGC in Section 6.1.1 and demonstrate the effectiveness of DURA on three benchmark datasets in Section 6.1.2. Then, we compare DURA with other regularizers in Section 6.1.3.

6.1.1 Experimental Settings for Static KGC

We consider three public static knowledge graph datasets (WN18RR [40], FB15k-237 [18], and YAGO3-10 [7]) for the knowledge graph completion task. These datasets have been divided into training, validation, and test sets in previous works. The statistics of these datasets are shown in Appendix D.

We use the cross entropy loss function and the “reciprocal” setting, which creates a new triplet $(v_k, r_j^{-1}, u_i)$ for each triplet $(u_i, r_j, v_k)$ [37], [41]. We use Adagrad [42] as the optimizer and use grid search to find the best hyperparameters based on the performance on the validation datasets. As we regard link prediction as a multi-class classification problem and use the cross entropy loss, we can assign different weights to different classes (i.e., tail entities) based on their frequency of occurrence in the training dataset. Specifically, suppose that the loss of a given query $(u, r, ?)$ is $\ell((u, r, ?), v)$, where $v$ is the true tail entity. Then the weighted loss is $w(v)\,\ell((u, r, ?), v)$, where

\[
w(v) = w_0 \frac{\#v}{\max\{\#v_i : v_i \in \text{training set}\}} + (1 - w_0),
\]

$w_0$ is a fixed number, and $\#v$ denotes the frequency of occurrence of the entity $v$ in the training set. For all models on WN18RR and for RESCAL on YAGO3-10, we choose $w_0 = 0.1$; for all other cases we choose $w_0 = 0$.
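The class-weight formula above can be sketched in a few lines. This is an illustrative implementation under the assumption that the training tails are available as a flat list of entity identifiers; the function name `class_weights` is hypothetical.

```python
from collections import Counter


def class_weights(train_tails, w0):
    """Return w(v) = w0 * #v / max_i #v_i + (1 - w0) for every tail entity v."""
    counts = Counter(train_tails)          # #v: frequency of each tail entity
    max_count = max(counts.values())       # frequency of the most common entity
    return {v: w0 * c / max_count + (1.0 - w0) for v, c in counts.items()}


# Toy example: entity "a" occurs three times, entity "b" once.
weights = class_weights(["a", "a", "a", "b"], w0=0.1)
uniform = class_weights(["a", "a", "a", "b"], w0=0.0)
```

With $w_0 = 0$ every class receives weight 1, so the weighted loss reduces to the ordinary cross entropy loss.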

Following [12], we use entity ranking as the evaluation task. For each triplet $(u_i, r_j, v_k)$ in the test dataset, the model is asked to answer $(u_i, r_j, ?)$ and $(v_k, r_j^{-1}, ?)$. To do this, we fill the positions of missing entities with candidate entities to create a set of candidate triplets, and then rank the triplets in descending order by their scores. Following the “Filtered” setting in [12], we filter out all other triplets known to be true before ranking. We choose Mean Reciprocal Rank (MRR) and Hits at N (H@N) as the evaluation metrics. Higher MRR or H@N indicates better performance.
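The filtered ranking protocol and the two metrics can be sketched as follows. The sketch assumes one score per candidate entity and a set of candidate indices already known to be true answers to the query; the helper names `filtered_rank` and `mrr_and_hits` are illustrative, and ties are broken optimistically here, which may differ from the authors' evaluation code.

```python
import numpy as np


def filtered_rank(scores, true_idx, known_true):
    """Rank of the true entity after masking other known-true candidates."""
    scores = np.asarray(scores, dtype=float).copy()
    for i in known_true:
        if i != true_idx:
            scores[i] = -np.inf  # remove competing true answers from the ranking
    # Rank = 1 + number of candidates scored strictly higher than the target.
    return int(np.sum(scores > scores[true_idx])) + 1


def mrr_and_hits(ranks, n=10):
    """Mean Reciprocal Rank and Hits@n for a list of ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks)), float(np.mean(ranks <= n))


# Candidate 0 is another known answer, so it is filtered out for target 2.
rank = filtered_rank([0.9, 0.8, 0.7, 0.1], true_idx=2, known_true={0, 2})
mrr, hits_at_3 = mrr_and_hits([1, 2, 4], n=3)
```

In the toy query the unfiltered rank of the target would be 3, while the filtered rank is 2, since a competing correct answer no longer counts against the model.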

We find the best hyper-parameters by grid search. Specifically, we search learning rates in $\{0.1, 0.01\}$ and regularization coefficients in $\{0, 1 \times 10^{-3}, 5 \times 10^{-3}, 1 \times 10^{-2}, 5 \times 10^{-2}, 1 \times 10^{-1}, 5 \times 10^{-1}\}$. On WN18RR and FB15k-237, we search batch sizes in $\{100, 500, 1000\}$ and embedding sizes in $\{500, 1000, 2000\}$. On YAGO3-10, we search batch sizes in $\{256, 512, 1024\}$ and embedding sizes in $\{500, 1000\}$. Table 1 shows the best hyperparameters for DURA found by grid search.

6.1.2 Main Results

We demonstrate the performance of DURA on several semantic matching KGE models, including CP [24], RESCAL [21], and ComplEx [23]. Note that we reimplement CP, ComplEx, and RESCAL under the “reciprocal” setting [37], [41], and obtain better results than the performance reported in the original papers. Sun et al. [43] demonstrate that many existing learning-based models use inappropriate evaluation protocols and suffer from inflated performance. Therefore, for ConvE, ConvKB, and KB-GAT, we take their results from [43] for a fair comparison.

The baseline models include distance-based models (TransE [12], TransH [8], TransD [44], RotatE [14], MuRP [30], HAKE [15]), learning-based models (ConvE [18], ConvKB [45], KB-GAT [20], InteractE [46], and StAR [47]), and semantic matching models (CP [24], RESCAL [21], ComplEx [23]). TransE is the most representative translational distance model, but it fails to deal with 1-to-N, N-to-1, and N-to-N relations. TransH introduces relation-specific hyperplanes to overcome these disadvantages. TransD shares a similar idea with TransH, but introduces relation-specific spaces rather than hyperplanes. TransD also decomposes the projection matrix into a product of two vectors for further simplification. RotatE defines each relation as a rotation from the source entity to the target entity in the complex vector space, and it is therefore able to model and infer various relation patterns. MuRP embeds multi-relational graphs in the Poincaré ball model of hyperbolic space to capture multiple simultaneous hierarchies. HAKE maps entities into the polar coordinate system to model semantic hierarchies, which are common in real-world applications. ConvE introduces a multi-layer convolutional network model to learn more expressive features with fewer parameters. ConvKB represents triples as 3-column matrices and employs a convolutional neural network to capture global relationships and transitional characteristics between entities and relations. KB-GAT proposes an attention-based feature embedding that captures both entity and relation features to cover the complex and hidden information in the local neighborhood surrounding a triple. InteractE improves ConvE by increasing feature interactions. StAR follows the textual encoding paradigm and augments it with graph embedding techniques.


TABLE 2: Evaluation results on the WN18RR, FB15k-237, and YAGO3-10 datasets. We reimplement CP, ComplEx, and RESCAL using the “reciprocal” setting [37], [41], which leads to better results than those reported in the original papers. * indicates that the model uses external textual information.

| Model | WN18RR MRR | WN18RR H@1 | WN18RR H@10 | FB15k-237 MRR | FB15k-237 H@1 | FB15k-237 H@10 | YAGO3-10 MRR | YAGO3-10 H@1 | YAGO3-10 H@10 |
|---|---|---|---|---|---|---|---|---|---|
| TransE | .226 | - | .501 | .294 | - | .465 | - | - | - |
| TransH | .186 | - | .424 | .211 | - | .366 | - | - | - |
| TransD | .185 | - | .428 | .286 | - | .453 | - | - | - |
| RotatE | .476 | .428 | .571 | .338 | .241 | .533 | .495 | .402 | .670 |
| MuRP | .481 | .440 | .566 | .335 | .243 | .518 | - | - | - |
| HAKE | .497 | .452 | .582 | .346 | .250 | .542 | .546 | .462 | .694 |
| ConvE | .444 | - | .503 | .324 | - | .501 | - | - | - |
| ConvKB | .249 | - | .524 | .243 | - | .421 | - | - | - |
| KB-GAT | .412 | - | .554 | .157 | - | .331 | - | - | - |
| InteractE | .463 | .430 | .528 | .354 | .263 | .535 | .541 | .462 | .687 |
| StAR | .401 | .243 | .709 | .296 | .205 | .482 | - | - | - |
| StAR (Self-Adp)* | .551 | .459 | .732 | .365 | .266 | .562 | - | - | - |
| CP | .438 | .414 | .485 | .333 | .247 | .508 | .567 | .494 | .698 |
| RESCAL | .455 | .419 | .493 | .353 | .264 | .528 | .566 | .490 | .701 |
| ComplEx | .460 | .428 | .522 | .346 | .256 | .525 | .573 | .500 | .703 |
| CP-DURA | .478 | .441 | .552 | .367 | .272 | .555 | .579 | .506 | .709 |
| RESCAL-DURA | .498 | .455 | .577 | .368 | .276 | .550 | .579 | .505 | .712 |
| ComplEx-DURA | .491 | .449 | .571 | .371 | .276 | .560 | .584 | .511 | .713 |

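TABLE 3: Comparison between DURA, squared Frobenius norm (FRO), and nuclear 3-norm (N3) regularizers. Results marked with * are taken from [37]. CP-N3 and ComplEx-N3 are re-implemented, and their performance is better than the results reported in [37]. The best performance for each model is marked in bold.

| Model | WN18RR MRR | WN18RR H@1 | WN18RR H@10 | FB15k-237 MRR | FB15k-237 H@1 | FB15k-237 H@10 | YAGO3-10 MRR | YAGO3-10 H@1 | YAGO3-10 H@10 |
|---|---|---|---|---|---|---|---|---|---|
| CP-FRO* | .460 | - | .480 | .340 | - | .510 | .540 | - | .680 |
| CP-N3 | .470 | .430 | .544 | .354 | .261 | .544 | .577 | .505 | .705 |
| CP-DURA | .478 | .441 | .552 | .367 | .272 | .555 | .579 | .506 | .709 |
| ComplEx-FRO* | .470 | - | .540 | .350 | - | .530 | .570 | - | .710 |
| ComplEx-N3 | .489 | .443 | .580 | .366 | .271 | .558 | .577 | .502 | .711 |
| ComplEx-DURA | .491 | .449 | .571 | .371 | .276 | .560 | .584 | .511 | .713 |
| RESCAL-FRO | .397 | .363 | .452 | .323 | .235 | .501 | .474 | .392 | .628 |
| RESCAL-DURA | .498 | .455 | .577 | .368 | .276 | .550 | .579 | .505 | .712 |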

Table 2 shows the evaluation results. Overall, DURA significantly improves the performance of all considered SM models. The results demonstrate that, without regularization, semantic matching models perform worse on WN18RR than state-of-the-art distance-based models such as RotatE, MuRP, and HAKE. When combined with DURA, these semantic matching models achieve performance competitive with distance-based models. Moreover, DURA further improves the performance of semantic matching models on FB15k-237 and YAGO3-10. Semantic matching models with DURA outperform all learning-based models that do not use external textual information. Notably, semantic matching models with DURA even outperform StAR (Self-Adp) on FB15k-237, which uses external textual information to enhance the expressiveness of KGE. Generally, models with more parameters and datasets with smaller sizes imply a larger risk of overfitting. Among the three datasets, WN18RR is the smallest, with only 11 kinds of relations and around 80k training samples. Therefore, the improvements brought by DURA on WN18RR are expected to be larger than on the other datasets, which is consistent with the experiments. As stated in [48], RESCAL is a more expressive model, but it is prone to overfitting on small- and medium-sized datasets because it represents relations with many more parameters. For example, on WN18RR, RESCAL gets an H@10 score of 0.493, which is lower than that of ComplEx (0.522). The advantage of its expressiveness does not show up at all. With DURA, RESCAL improves by 0.084 (8.4 percentage points) on H@10 and attains 0.577, outperforming CP and ComplEx with DURA. On larger datasets such as YAGO3-10, overfitting also exists but is less significant. Nonetheless, DURA still leads to consistent improvements, showing its ability to prevent models from overfitting.

6.1.3 Comparison with Other Regularizers

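We compare DURA to the squared Frobenius norm regularizer and the tensor nuclear 3-norm (N3) regularizer [37]. The squared Frobenius norm regularizer is given by $g(\mathcal{X}) = \|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2 + \sum_{j=1}^{|\mathcal{R}|} \|\mathbf{R}_j\|_F^2$. The N3 regularizer is given by $g(\mathcal{X}) = \sum_{d=1}^{D} \big( \|\mathbf{u}_{:d}\|_3^3 + \|\mathbf{r}_{:d}\|_3^3 + \|\mathbf{v}_{:d}\|_3^3 \big)$, where $\|\cdot\|_3$ denotes the $L_3$ norm of vectors. The N3 regularizer is only suitable for models in which $\mathbf{R}_j$ is diagonal, such as CP or ComplEx.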
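For reference, the two baseline regularizers defined above can be written directly in NumPy. This is a minimal, unweighted sketch for real-valued factors; it does not reproduce the frequency-weighted variants of [37], and the function names are illustrative.

```python
import numpy as np


def fro_regularizer(U, V, R_list):
    """Squared Frobenius norm: ||U||_F^2 + ||V||_F^2 + sum_j ||R_j||_F^2."""
    return float(np.sum(U ** 2) + np.sum(V ** 2)
                 + sum(np.sum(Rj ** 2) for Rj in R_list))


def n3_regularizer(U, R, V):
    """N3: sum over columns d of ||u_:d||_3^3 + ||r_:d||_3^3 + ||v_:d||_3^3.

    The sum of cubed L3 norms over all columns equals the sum of |entry|^3.
    """
    return float(np.sum(np.abs(U) ** 3) + np.sum(np.abs(R) ** 3)
                 + np.sum(np.abs(V) ** 3))


U = np.array([[1.0, 2.0], [0.0, -1.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0]])
R_full = [np.array([[2.0, 0.0], [0.0, 1.0]])]   # one full relation matrix
R_cols = np.array([[2.0, 1.0]])                 # diagonals stacked row-wise

fro = fro_regularizer(U, V, R_full)
n3 = n3_regularizer(U, R_cols, V)
```

The N3 helper takes the relation factors as a matrix of stacked diagonals, which reflects why it applies only to models with diagonal relation matrices.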


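TABLE 4: Hyper-parameters found by grid search. λ is the regularization coefficient, λ1 and λ2 are weights for different parts of the regularizer for DURA2, and λ3 and λ4 are weights for different parts of the regularizer for DURA1.

| Model | Dataset | λ | λ1 | λ2 | λ3 | λ4 |
|---|---|---|---|---|---|---|
| TComplEx-DURA1 | ICEWS14 | 1e-2 | - | - | 1e-3 | 1e2 |
| TComplEx-DURA1 | ICEWS05-15 | 1e-3 | - | - | 3e-2 | 30 |
| TComplEx-DURA1 | YAGO15k | 1e-2 | - | - | 1e-2 | 1e2 |
| TComplEx-DURA2 | ICEWS14 | 1e-1 | 1e-1 | 3e-2 | - | - |
| TComplEx-DURA2 | ICEWS05-15 | 1e-1 | 1e-2 | 3e-2 | - | - |
| TComplEx-DURA2 | YAGO15k | 1e-2 | 3e-2 | 3e-2 | - | - |
| TRESCAL-DURA1 | ICEWS14 | 1e-2 | - | - | 1e1 | 1e1 |
| TRESCAL-DURA1 | ICEWS05-15 | 1e-2 | - | - | 1e1 | 1 |
| TRESCAL-DURA1 | YAGO15k | 1e-2 | - | - | 1e-1 | 1e1 |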

We implement both the squared Frobenius norm (FRO) and N3 regularizers in the weighted way as stated in [37]. Table 3 shows the performance of the three regularizers on three popular models: CP, ComplEx, and RESCAL. Note that when the considered model is RESCAL, we only compare DURA to the squared Frobenius norm regularization, as N3 does not apply to it.

For CP and ComplEx, DURA brings consistent improvements compared to FRO and N3 on all datasets. Specifically, on FB15k-237, compared to CP-N3, CP-DURA gets an improvement of 0.013 in terms of MRR. Even for the previous state-of-the-art SM model ComplEx, DURA brings further improvements over the N3 regularizer. With FRO, RESCAL performs worse than the vanilla model, which is consistent with the results in [49]. However, RESCAL-DURA brings significant improvements over RESCAL. All the results demonstrate that DURA is more widely applicable than N3 and more effective than the squared Frobenius norm regularizer.

6.2 Temporal Knowledge Graph Completion

We also conduct extensive experiments on temporal knowledge graph completion datasets. Specifically, we introduce the experimental settings for temporal KGC in Section 6.2.1 and show the effectiveness of DURA on three benchmark datasets in Section 6.2.2.

6.2.1 Experimental Settings for Temporal KGC

We consider three public temporal knowledge graph datasets (ICEWS14 [50], ICEWS05-15 [50], and YAGO15k [51]) for the knowledge graph completion task. These datasets have been divided into training, validation, and test sets in previous works. The statistics of these datasets are shown in Appendix D.

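Following [29], we use the following temporal regularizer for TComplEx to smooth timestamp embeddings:

\[
\Lambda_3(T) = \frac{1}{|\mathcal{T}| - 1} \sum_{l=1}^{|\mathcal{T}| - 1} \big\| \mathrm{diag}^{-1}(\mathbf{T}_{l+1} - \mathbf{T}_l) \big\|_3^3,
\]

where $\mathrm{diag}^{-1}(\cdot)$ gives the vector made of the diagonal elements of a matrix. For TRESCAL, we propose to use the temporal regularizer

\[
\Lambda_2(T) = \frac{1}{|\mathcal{T}| - 1} \sum_{l=1}^{|\mathcal{T}| - 1} \| \mathbf{T}_{l+1} - \mathbf{T}_l \|_F^2
\]

to smooth timestamp embeddings.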
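Both smoothness regularizers penalize differences between consecutive timestamp embeddings. The sketch below assumes the diagonal time matrices are stored as rows of a 2-D array and the full time matrices as slices of a 3-D array; these storage layouts and the function names are assumptions for illustration.

```python
import numpy as np


def lambda3(T_diag):
    """Cubed-L3 smoothness for diagonal time matrices.

    T_diag has shape (num_timestamps, D); row l holds the diagonal of T_l.
    """
    diffs = T_diag[1:] - T_diag[:-1]           # consecutive differences
    return float(np.sum(np.abs(diffs) ** 3) / (T_diag.shape[0] - 1))


def lambda2(T_full):
    """Squared-Frobenius smoothness for full time matrices.

    T_full has shape (num_timestamps, D, D).
    """
    diffs = T_full[1:] - T_full[:-1]
    return float(np.sum(np.abs(diffs) ** 2) / (T_full.shape[0] - 1))


T_diag = np.array([[0.0, 0.0], [1.0, 2.0], [1.0, 0.0]])
T_full = np.stack([np.zeros((2, 2)), np.eye(2)])

smooth3 = lambda3(T_diag)
smooth2 = lambda2(T_full)
```

Both functions return zero when all timestamps share the same embedding, which is the configuration the regularizers pull towards.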

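Moreover, we find it better to assign different weights to the parts involving relations. That is, the optimization problem has the form

\[
\min \sum_{(u_i, r_j, v_k, t_l) \in \mathcal{S}} \Big[ \ell_{ijkl}(\mathbf{U}, \mathbf{R}_1, \ldots, \mathbf{R}_J, \mathbf{V}, \mathbf{T}_1, \ldots, \mathbf{T}_L)
+ \lambda \Big( \lambda_1 \big( \|\mathbf{u}_i\mathbf{R}_j\|_2^2 + \|\mathbf{v}_k\mathbf{T}_l\|_2^2 \big)
+ \lambda_2 \big( \|\mathbf{u}_i\mathbf{T}_l\|_2^2 + \|\mathbf{v}_k\mathbf{R}_j\|_2^2 \big)
+ \lambda_3 \big( \|\mathbf{u}_i\|_2^2 + \|\mathbf{v}_k\|_2^2 \big)
+ \lambda_4 \big( \|\mathbf{u}_i(\mathbf{R}_j \odot \mathbf{T}_l)\|_2^2 + \|\mathbf{v}_k(\mathbf{R}_j \odot \mathbf{T}_l)\|_2^2 \big) \Big) \Big],
\]

where $\lambda, \lambda_1, \lambda_2, \lambda_3, \lambda_4 \ge 0$ are fixed hyperparameters. We denote the regularizer (14) by DURA2 (i.e., $\lambda_3 = \lambda_4 = 0$) and the regularizer (13) by DURA1 (i.e., $\lambda_1 = \lambda_2 = 0$).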
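The regularization part of the objective above can be sketched for a single quadruple in the diagonal case. The sketch assumes that the relation and time matrices are diagonal and are passed as 1-D arrays of their diagonal entries, so that every product reduces to an element-wise product; the function name and argument layout are hypothetical.

```python
import numpy as np


def temporal_dura_penalty(u, v, r, t, lam, lam1, lam2, lam3, lam4):
    """Weighted regularizer for one quadruple with diagonal R_j and T_l.

    u, v: entity embeddings; r, t: diagonals of R_j and T_l (all 1-D arrays).
    """
    sq = lambda x: float(np.sum(np.abs(x) ** 2))  # squared L2 norm
    # Terms weighted by lam1 and lam2 (the DURA2 part).
    dura2 = lam1 * (sq(u * r) + sq(v * t)) + lam2 * (sq(u * t) + sq(v * r))
    # Terms weighted by lam3 and lam4 (the DURA1 part).
    rt = r * t
    dura1 = lam3 * (sq(u) + sq(v)) + lam4 * (sq(u * rt) + sq(v * rt))
    return lam * (dura2 + dura1)


u = np.array([1.0, 2.0])
v = np.array([1.0, 0.0])
r = np.array([1.0, 1.0])
t = np.array([2.0, 0.0])

only_dura2 = temporal_dura_penalty(u, v, r, t, 0.5, 1.0, 2.0, 0.0, 0.0)
only_dura1 = temporal_dura_penalty(u, v, r, t, 1.0, 0.0, 0.0, 1.0, 0.5)
```

Setting the unused pair of weights to zero recovers each variant on its own, mirroring the dashes in the hyper-parameter table.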

We use grid search to find the best hyperparameters based on the performance on the validation datasets. We search λ ∈ {1, 3e-1, 1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4} for all experiments. We search λ1, λ2 ∈ {1, 3e-1, 1e-1, 3e-2, 1e-2, 3e-3, 1e-3} for DURA2, and λ3 ∈ {1e2, 3e1, 1e1, 3, 1, 3e-1, 1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4} and λ4 ∈ {1e3, 3e2, 1e2, 3e1, 1e1, 3, 1, 3e-1, 1e-1, 3e-2} for DURA1. Table 4 shows the best hyperparameters found by grid search.

6.2.2 Main Results on Temporal KGC

We compare DURA with the popular squared Frobenius norm and tensor nuclear 3-norm (N3) regularizers [29] on the state-of-the-art temporal semantic matching KGE model TComplEx and on our proposed TRESCAL method. We also include four recent distance-based models: TTransE [16], TA-TransE [52], RFTE-HAKE [53], and BoxTE [54]. TTransE [16] proposes a time-aware knowledge graph completion model that uses temporal order information among facts. TA-TransE [52] utilizes recurrent neural networks to learn time-aware representations. RFTE [53] is a framework to transplant static KGE models to temporal KGE models. It treats the sequence of graphs as a Markov chain. BoxTE [54] assumes that time is a filter that helps pick out the answers that are correct during certain periods. Therefore, it introduces boxes to represent a set of answer entities to a time-agnostic query.

Table 5 shows the effectiveness of DURA. Note that DURA2, as shown in formulation (14), is not applicable to TRESCAL, since matrix multiplication and element-wise multiplication are not commutative. Moreover, the N3 regularizer is not applicable to TRESCAL, since this model is not a CP tensor factorization. Table 5 shows that DURA effectively improves the performance of TComplEx and TRESCAL. With DURA, these semantic matching models outperform the state-of-the-art distance-based models in terms of MRR.


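TABLE 5: Evaluation results on ICEWS14, ICEWS05-15, and YAGO15k datasets.

| Model | ICEWS14 MRR | ICEWS14 H@1 | ICEWS14 H@10 | ICEWS05-15 MRR | ICEWS05-15 H@1 | ICEWS05-15 H@10 | YAGO15k MRR | YAGO15k H@1 | YAGO15k H@10 |
|---|---|---|---|---|---|---|---|---|---|
| TTransE | .26 | .07 | .60 | .27 | .08 | .62 | .32 | .23 | .51 |
| TA-TransE | .28 | .10 | .63 | .30 | .10 | .67 | .32 | .23 | .51 |
| RFTE-HAKE | .50 | .40 | .70 | .47 | .37 | .65 | - | - | - |
| BoxTE (k=2) | .62 | .53 | .77 | .66 | .58 | .82 | - | - | - |
| TComplEx | .60 | .51 | .75 | .65 | .57 | .79 | .35 | .27 | .52 |
| TComplEx-FRO | .60 | .53 | .74 | .65 | .57 | .80 | .34 | .27 | .49 |
| TComplEx-N3 | .61 | .53 | .77 | .66 | .59 | .80 | .36 | .28 | .54 |
| TComplEx-DURA1 | .64 | .56 | .79 | .67 | .59 | .81 | .38 | .30 | .55 |
| TComplEx-DURA2 | .62 | .54 | .77 | .67 | .59 | .81 | .35 | .27 | .52 |
| TRESCAL | .56 | .47 | .73 | .65 | .57 | .79 | .28 | .21 | .45 |
| TRESCAL-FRO | .55 | .48 | .69 | .60 | .52 | .75 | .32 | .25 | .46 |
| TRESCAL-DURA1 | .60 | .51 | .76 | .67 | .60 | .80 | .30 | .22 | .48 |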

[Figure 3 panels: (a) CP, with curves CP+DURA, CP+N3, and CP. (b) RESCAL, with curves RESCAL+DURA and RESCAL. x-axis: sparsity; y-axis: MRR.]

Fig. 3: The effect of entity embeddings’ λ-sparsity on MRR. The experiments are conducted on FB15k-237.

6.3 Sparsity Analysis

As knowledge graphs usually contain billions of entities, the storage of entity embeddings faces severe challenges. Intuitively, if embeddings are sparse, that is, most of the entries are zero, we can store them with less storage. Therefore, the sparsity of the generated entity embeddings becomes crucial for real-world applications. Generally, few entries of entity embeddings are exactly equal to 0 after training, which means that it is hard to obtain sparse entity embeddings directly. However, when we score triples using the trained model, the embedding entries with values close to 0 make minor contributions to the score of a triplet. If we set the embedding entries close to 0 to be exactly 0, we can transform the embeddings into sparse ones. Therefore, there is a trade-off between sparsity and performance degradation. We define the following λ-sparsity to indicate the proportion of entries that are close to zero:

\[
s_\lambda = \frac{\sum_{i=1}^{I} \sum_{d=1}^{D} \mathbb{1}_{\{|x| < \lambda\}}(E_{id})}{I \times D}, \tag{15}
\]

where $\mathbf{E} \in \mathbb{R}^{I \times D}$ is the entity embedding matrix, $E_{id}$ is the entry in the $i$-th row and $d$-th column of $\mathbf{E}$, $I$ is the number of entities, $D$ is the embedding dimension, and $\mathbb{1}_C(x)$ is the indicator function that takes the value 1 if $x \in C$ and 0 otherwise.
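Equation (15) and the thresholding step it motivates can be sketched as follows. The helper names `lambda_sparsity` and `sparsify` are illustrative, and the toy matrix stands in for a trained entity embedding matrix.

```python
import numpy as np


def lambda_sparsity(E, lam):
    """Fraction of entries of E whose absolute value is below lam (Eq. 15)."""
    return float(np.mean(np.abs(E) < lam))


def sparsify(E, lam):
    """Return a copy of E with all entries of magnitude below lam set to 0."""
    out = E.copy()
    out[np.abs(out) < lam] = 0.0
    return out


E = np.array([[0.01, -0.5], [0.2, -0.02]])
s = lambda_sparsity(E, lam=0.1)     # two of four entries are below 0.1
E_sparse = sparsify(E, lam=0.1)
```

Because the fraction of entries below the threshold never decreases as the threshold grows, a target sparsity level can be approached by increasing `lam` until enough entries fall below it.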

Figure 3 shows the effect of entity embeddings’ λ-sparsity on MRR. Following Equation (15), we select the entries whose absolute values are less than a threshold and set them to 0. Note that for any given $s_\lambda$, we can always find a proper threshold λ to approximate it, as the formula

TABLE 6: Ablation results on the combined and uncombined regularizers. “-I” and “-II” represent the regularizers defined in formulas (5) and (6), respectively.

| Model | WN18RR MRR | WN18RR H@10 | FB15k-237 MRR | FB15k-237 H@10 |
|---|---|---|---|---|
| CP | .438 | .485 | .333 | .508 |
| CP-DURA-I | .463 | .535 | .331 | .522 |
| CP-DURA-II | .439 | .492 | .332 | .507 |
| CP-DURA | .478 | .552 | .367 | .555 |
| RESCAL | .455 | .493 | .353 | .528 |
| RESCAL-DURA-I | .475 | .541 | .341 | .523 |
| RESCAL-DURA-II | .462 | .524 | .342 | .510 |
| RESCAL-DURA | .498 | .577 | .368 | .550 |

is increasing with respect to λ. The results in the figure show that DURA causes a much gentler performance decrease as the embedding sparsity increases. In Figure 3a, with DURA, CP maintains an MRR of 0.366 even when 60% of the entries are set to 0. More surprisingly, when the sparsity reaches 70%, CP-DURA still outperforms CP-N3 with zero sparsity. For RESCAL, when 80% of the entries are set to 0, RESCAL+DURA still has an MRR of 0.341, which significantly outperforms vanilla RESCAL, whose MRR decreases from 0.352 to 0.286. In short, with the DURA regularizer, the performance of CP and RESCAL remains comparable to that of state-of-the-art models, even when 70% of the entity embeddings’ entries are set to 0.

Following [55], we store the sparse embedding matrices using the compressed sparse row (CSR) or compressed sparse column (CSC) format. Experiments show that DURA reduces the storage costs for entity embeddings by about 65% when 70% of the entries are set to 0.

6.4 Ablation Study

We conduct ablation studies to show the effectiveness of the combination of (5) and (6). The results are shown in Table 6, where “-I” and “-II” represent the regularizers defined in formulas (5) and (6), respectively. On WN18RR, the uncombined regularizers DURA-I and DURA-II can also improve the performance of the semantic matching models, but the combination brings further improvements. On FB15k-237, DURA-I and DURA-II even lead to a decrease in terms of MRR. One possible reason is that FB15k-237


[Figure 4 panels: (a) Rank for CP. (b) Rank for ComplEx. (c) Rank for RESCAL. (d) Batch size for CP. (e) Batch size for ComplEx. (f) Batch size for RESCAL. x-axis: Log rank or Log (batch size); y-axis: MRR; curves: WN18RR, FB15k-237, YAGO3-10.]

Fig. 4: Sensitivity to hyper-parameters of DURA. The x-axis is the base-10 logarithm of each hyper-parameter and the y-axis is the mean reciprocal rank (MRR) on test data.

[Figure 5 panels: (a) 1-1 relations. (b) 1-N relations. (c) N-1 relations. (d) N-N relations. Bars: NA, DURA, DURA-I, DURA-II; y-axis: MRR.]

Fig. 5: Average MRR of RESCAL with different regularizers on different types of relations. The dataset is FB15k-237.

has a large number of 1-N and N-1 relations (refer to [8] for the definitions of relation types), and they have different preferences for the uncombined regularizers. Figure 5 shows the performance of RESCAL with different regularizers on different relation types. Compared with vanilla RESCAL, DURA-I only improves the MRR for 1-1 and 1-N relations, while DURA-II only improves the MRR for N-1 relations, which validates our analysis. DURA significantly outperforms both DURA-I and DURA-II on all relation types.

[Figure 6 panels: (a) CP. (b) CP-DURA. Legend: query 1 to query 10; both axes range from 0.0 to 1.0.]

Fig. 6: Visualization of tail entity embeddings using T-SNE. A point represents a tail entity. Points in the same color represent tail entities that have the same $(u_i, r_j)$ context.

6.5 Visualization

We visualize the entity embeddings using T-SNE [56] to show that DURA encourages entities with similar semantics to have similar embeddings. Without loss of generality, we use tail entities for visualization.

Suppose that $(u_i, r_j)$ is a query, where $u_i$ and $r_j$ are a head entity and a relation, respectively. An entity $v_k$ is an answer to a query $(u_i, r_j)$ if $(u_i, r_j, v_k)$ is valid. We randomly select 10 queries in FB15k-237, each of which has more than 50 answers. Then, we use T-SNE to visualize the answers’ embeddings generated by CP and CP-DURA. Figure 6 shows the visualization results. Each entity is represented by a 2D point, and points in the same color represent tail entities with the same $(u_i, r_j)$ context (i.e., query). Figure 6 shows that, with DURA, entities with the same $(u_i, r_j)$ contexts are indeed assigned more similar representations, which verifies the claims in Section 4.1.


[Figure 7 panels: (a) λ of DURA1 for TComplEx. (b) λ3 of DURA1 for TComplEx. (c) λ4 of DURA1 for TComplEx. (d) λ of DURA2 for TComplEx. (e) λ1 of DURA2 for TComplEx. (f) λ2 of DURA2 for TComplEx. (g) λ of DURA1 for TRESCAL. (h) λ3 of DURA1 for TRESCAL. (i) λ4 of DURA1 for TRESCAL. x-axis: Log of the hyper-parameter; y-axis: MRR; curves: ICEWS14, ICEWS05-15, YAGO15k.]

Fig. 7: Sensitivity to hyper-parameters of DURA1 and DURA2. The x-axis is the base-10 logarithm of each hyper-parameter and the y-axis is the mean reciprocal rank (MRR) on test data.

6.6 Sensitivity to Hyper-Parameters

For static KGE, the hyper-parameters mainly include embedding sizes k, batch sizes b, and regularization coefficients λ, λi (i = 1, 2); for temporal KGE, the hyper-parameters mainly include regularization coefficients λ, λi (i = 1, 2, 3, 4). We provide the performance curves on different datasets for these hyper-parameters. Note that when exploring the effect of a specific hyper-parameter, we fix the other hyper-parameters at their best values.

As shown in Figure 4, the performance is stable when the embedding and batch sizes vary. Figure 8 demonstrates that the performance of static KGE is relatively insensitive to λi. However, when the regularization coefficient λ is too large, the performance drops quickly. This is expected, since overly strong regularization weakens a model's ability to capture complex relations.

As shown in Figure 7, the performance of temporal KGE is relatively insensitive to λ and λi when the KGE model is TComplEx. However, the performance of TRESCAL drops quickly when these regularization coefficients are large. This suggests that the hyperparameters for TRESCAL should be chosen carefully.

7 CONCLUSION

We propose a widely applicable and effective regularizer, namely DURA, for semantic matching knowledge graph embedding models. The formulation of DURA is based on the observation that, for an existing semantic matching KGE model (primal), there is often another distance-based KGE model (dual) closely associated with it. Extensive experiments show that DURA yields consistent and significant improvements on both static and temporal knowledge graph embedding benchmark datasets.

ACKNOWLEDGMENT

This work was supported in part by National Science Foundations of China grants 61822604, U19B2026, 61836006, and 62021001, and the Fundamental Research Funds for the Central Universities grant WK3490000004.


[Figure 8 panels: (a) Rank for CP. (b) Rank for ComplEx. (c) Rank for RESCAL. (d) Batch size for CP. (e) Batch size for ComplEx. (f) Batch size for RESCAL. (g) λ for CP. (h) λ1 for CP. (i) λ2 for CP. (j) λ for ComplEx. (k) λ1 for ComplEx. (l) λ2 for ComplEx. (m) λ for RESCAL. (n) λ1 for RESCAL. (o) λ2 for RESCAL. x-axis: Log of the hyper-parameter; y-axis: MRR.]

Fig. 8: Sensitivity to hyper-parameters of DURA. The x-axis is the base-10 logarithm of each hyper-parameter and the y-axis is the mean reciprocal rank (MRR) on test data.


REFERENCES

[1] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, “ERNIE: Enhanced language representation with informative entities,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1441–1451.

[2] X. Huang, J. Zhang, D. Li, and P. Li, “Knowledge graph embedding based question answering,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 105–113.

[3] H. Wang, F. Zhang, J. Wang, M. Zhao, W. Li, X. Xie, and M. Guo, “Ripplenet: Propagating user preferences on the knowledge graph for recommender systems,” in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 417–426.

[4] K. Marino, R. Salakhutdinov, and A. K. Gupta, “The more you know: Using knowledge graphs for image classification,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 20–28, 2017.

[5] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: A collaboratively created graph database for structuring human knowledge,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008, pp. 1247–1250.

[6] D. Vrandecic and M. Krotzsch, “Wikidata: a free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014.

[7] F. Mahdisoltani, J. Biega, and F. M. Suchanek, “YAGO3: A knowledge base from multilingual wikipedias,” in Seventh Biennial Conference on Innovative Data Systems Research, 2015.

[8] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph embedding by translating on hyperplanes,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28, no. 1, 2014.

[9] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, “Learning entity and relation embeddings for knowledge graph completion,” in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 2181–2187.

[10] S. Friedland and L.-H. Lim, “Nuclear norm of higher-order tensors,” Mathematics of Computation, vol. 87, no. 311, pp. 1255–1281, 2018.

[11] Z. Zhang, J. Cai, and J. Wang, “Duality-induced regularizer for tensor factorization based knowledge graph completion,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 21604–21615.

[12] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Advances in Neural Information Processing Systems 26, 2013, pp. 2787–2795.

[13] A. Bordes, J. Weston, R. Collobert, and Y. Bengio, “Learning structured embeddings of knowledge bases,” in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011, pp. 301–306.

[14] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang, “Rotate: Knowledge graph embedding by relational rotation in complex space,” in International Conference on Learning Representations, 2019.

[15] Z. Zhang, J. Cai, Y. Zhang, and J. Wang, “Learning hierarchy-aware knowledge graph embeddings for link prediction,” in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

[16] T. Jiang, T. Liu, T. Ge, L. Sha, B. Chang, S. Li, and Z. Sui, “Towards time-aware knowledge graph completion,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Dec. 2016, pp. 1715–1724.

[17] S. S. Dasgupta, S. N. Ray, and P. Talukdar, “HyTE: Hyperplane-based temporally aware knowledge graph embedding,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Oct.-Nov. 2018, pp. 2001–2011.

[18] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, “Convolutional 2d knowledge graph embeddings,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018, pp. 1811–1818.

[19] D. Q. Nguyen, T. Vu, T. D. Nguyen, D. Q. Nguyen, and D. Phung, “A Capsule Network-based Embedding Model for Knowledge Graph Completion and Search Personalization,” in NAACL, 2019.

[20] D. Nathani, J. Chauhan, C. Sharma, and M. Kaul, “Learning attention-based embeddings for relation prediction in knowledge graphs,” arXiv preprint arXiv:1906.01195, 2019.

[21] M. Nickel, V. Tresp, and H.-P. Kriegel, “A three-way model for collective learning on multi-relational data,” in Proceedings of the 28th International Conference on Machine Learning, vol. 11, 2011, pp. 809–816.

[22] B. Yang, S. W.-t. Yih, X. He, J. Gao, and L. Deng, “Embedding entities and relations for learning and inference in knowledge bases,” in Proceedings of the International Conference on Learning Representations, 2015.

[23] T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard, “Complex embeddings for simple link prediction,” in Proceedings of The 33rd International Conference on Machine Learning, vol. 48, 2016, pp. 2071–2080.

[24] F. L. Hitchcock, “The expression of a tensor or a polyadic as a sum of products,” Journal of Mathematics and Physics, vol. 6, no. 1-4, pp. 164–189, 1927.

[25] Y. Ma, V. Tresp, and E. A. Daxberger, “Embedding models for episodic knowledge graphs,” Journal of Web Semantics, vol. 59, p. 100490, 2019.

[26] I. Balazevic, C. Allen, and T. Hospedales, “TuckER: Tensor factorization for knowledge graph completion,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 5185–5194.

[27] B. W. Bader, R. A. Harshman, and T. G. Kolda, “Temporal analysis of semantic graphs using asalsan,” in Seventh IEEE International Conference on Data Mining, 2007, pp. 33–42.

[28] R. A. Harshman, P. E. Green, Y. Wind, and M. E. Lundy, “A model for the analysis of asymmetric data in marketing research,” Marketing Science, vol. 1, no. 2, pp. 205–242, 1982.

[29] T. Lacroix, G. Obozinski, and N. Usunier, “Tensor decompositions for temporal knowledge base completion,” arXiv preprint arXiv:2004.04926, 2020.

[30] I. Balazevic, C. Allen, and T. Hospedales, “Multi-relational poincare graph embeddings,” Advances in Neural Information Processing Systems, vol. 32, pp. 4463–4473, 2019.

[31] A. Grover and J. Leskovec, “Node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.

[32] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.

[33] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 1067–1077.

[34] M. Ye, J. Shen, X. Zhang, P. C. Yuen, and S.-F. Chang, “Augmentation invariant and instance spreading feature for softmax embedding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 924–939, 2022.

[35] A. Van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.

[36] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th International Conference on Machine Learning, vol. 119, 13–18 Jul 2020, pp. 1597–1607.

[37] T. Lacroix, N. Usunier, and G. Obozinski, “Canonical tensor decomposition for knowledge base completion,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 2863–2872.

[38] P. Minervini, L. Costabello, E. Munoz, V. Novacek, and P.-Y. Vandenbussche, “Regularizing knowledge graph embeddings via equivalence and inversion axioms,” in Machine Learning and Knowledge Discovery in Databases, 2017, pp. 668–683.

[39] B. Ding, Q. Wang, B. Wang, and L. Guo, “Improving knowledge graph embedding using simple constraints,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul. 2018, pp. 110–121.

[40] K. Toutanova and D. Chen, “Observed versus latent features for knowledge base and text inference,” in 3rd Workshop on Continuous Vector Space Models and Their Compositionality, 2015.

[41] S. M. Kazemi and D. Poole, “Simple embedding for link prediction in knowledge graphs,” in Advances in Neural Information Processing Systems 31, 2018, pp. 4284–4295.

[42] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.

[43] Z. Sun, S. Vashishth, S. Sanyal, P. Talukdar, and Y. Yang, “A re-evaluation of knowledge graph completion methods,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jul. 2020, pp. 5516–5522.


[44] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao, “Knowledge graph embedding via dynamic mapping matrix,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 687–696.

[45] D. Q. Nguyen, T. D. Nguyen, D. Q. Nguyen, and D. Phung, “A novel embedding model for knowledge base completion based on convolutional neural network,” in NAACL, 2018.

[46] S. Vashishth, S. Sanyal, V. Nitin, N. Agrawal, and P. Talukdar, “Interacte: Improving convolution-based knowledge graph embeddings by increasing feature interactions,” in Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020, pp. 3009–3016.

[47] B. Wang, T. Shen, G. Long, T. Zhou, Y. Wang, and Y. Chang, “Structure-augmented text representation learning for efficient knowledge graph completion,” in Proceedings of the Web Conference, 2021, pp. 1737–1748.

[48] Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2724–2743, 2017.

[49] D. Ruffinelli, S. Broscheit, and R. Gemulla, “You can teach an old dog new tricks! on training knowledge graph embeddings,” in International Conference on Learning Representations, 2020.

[50] E. Boschee, J. Lautenschlager, S. O’Brien, S. Shellman, J. Starz, and M. Ward, “ICEWS Coded Event Data,” 2015.

[51] A. Garcia-Duran, S. Dumancic, and M. Niepert, “Learning sequence encoders for temporal knowledge graph completion,” arXiv preprint arXiv:1809.03202, 2018.

[52] A. Garcia-Duran, S. Dumancic, and M. Niepert, “Learning sequence encoders for temporal knowledge graph completion,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Oct.-Nov. 2018, pp. 4816–4821.

[53] Y. Xu, E. Haihong, M. Song, W. Song, X. Lv, W. Haotian, and Y. Jinrui, “Rtfe: a recursive temporal fact embedding framework for temporal knowledge graph completion,” arXiv preprint arXiv:2009.14653, 2020.

[54] J. Messner, R. Abboud, and I. I. Ceylan, “Temporal knowledge graph completion using box embeddings,” arXiv preprint arXiv:2109.08970, 2021.

[55] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” in International Conference on Learning Representations, 2015.

[56] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

Jie Wang received the B.Sc. degree in electronic information science and technology from University of Science and Technology of China, Hefei, China, in 2005, and the Ph.D. degree in computational science from the Florida State University, Tallahassee, FL, in 2011. He is currently a professor in the Department of Electronic Engineering and Information Science at University of Science and Technology of China, Hefei, China. His research interests include reinforcement learning, knowledge graph, large-scale optimization, deep learning, etc. He is a senior member of IEEE.

Zhanqiu Zhang received the B.Sc. degree in information and computing science from University of Science and Technology of China, Hefei, China, in 2018. He is currently a Ph.D. candidate in the Department of Electronic Engineering and Information Science at University of Science and Technology of China, Hefei, China. His research interests include graph representation learning and natural language processing.

Zhihao Shi received the B.Sc. degree from the Department of Electronic Engineering and Information Science, Hefei, China, in 2020. He is currently a graduate student in the Department of Electronic Engineering and Information Science at University of Science and Technology of China, Hefei, China. His research interests include graph representation learning and reinforcement learning.

Jianyu Cai received the B.Sc. degree in software engineering from Southeast University, Nanjing, China, in 2019. He is currently a graduate student in the Department of Electronic Engineering and Information Science at University of Science and Technology of China, Hefei, China. His research interests include knowledge graph and natural language processing.

Shuiwang Ji received the Ph.D. degree in computer science from Arizona State University, Tempe, Arizona, in 2010. Currently, he is an Associate Professor in the Department of Computer Science and Engineering, Texas A&M University, College Station, Texas. His research interests include machine learning, deep learning, data mining, and computational biology. He received the National Science Foundation CAREER Award in 2014. He is currently an Associate Editor for IEEE Transactions on Pattern Analysis and Machine Intelligence, ACM Transactions on Knowledge Discovery from Data, and ACM Computing Surveys. He regularly serves as an Area Chair or in equivalent roles for data mining and machine learning conferences, including AAAI, ICLR, ICML, IJCAI, KDD, and NeurIPS. He is a senior member of IEEE.

Feng Wu received the B.S. degree in electrical engineering from Xidian University in 1992, and the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology in 1996 and 1999, respectively. He is currently a Professor with the University of Science and Technology of China, where he is also the Dean of the School of Information Science and Technology. Before that, he was a Principal Researcher and the Research Manager with Microsoft Research Asia. His research interests include image and video compression, media communication, and media analysis and synthesis. He has authored or coauthored over 200 high quality articles (including several dozens of IEEE Transaction papers) and top conference papers on MOBICOM, SIGIR, CVPR, and ACM MM. He has 77 granted U.S. patents. His 15 techniques have been adopted into international video coding standards. As a coauthor, he received the Best Paper Award at 2009 IEEE Transactions on Circuits and Systems for Video Technology, PCM 2008, and SPIE VCIP 2007. He also received the Best Associate Editor Award from IEEE Circuits and Systems Society in 2012. He also serves as the TPC Chair for MMSP 2011, VCIP 2010, and PCM 2009, and the Special Sessions Chair for ICME 2010 and ISCAS 2013. He serves as an Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Multimedia, and several other international journals.


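APPENDIX A
PROOF OF THEOREM 1

Proof. We have that

$$
\begin{aligned}
\sum_{j=1}^{|\mathcal{R}|}\left(\|UR_j\|_F^2+\|V\|_F^2\right)
&=\sum_{j=1}^{|\mathcal{R}|}\left(\sum_{i=1}^{I}\|u_i\circ r_j\|_F^2+\sum_{d=1}^{D}\|v_{:d}\|_F^2\right)\\
&=\sum_{j=1}^{|\mathcal{R}|}\left(\sum_{d=1}^{D}\|v_{:d}\|_F^2+\sum_{i=1}^{I}\sum_{d=1}^{D}u_{id}^2r_{jd}^2\right)\\
&=\sum_{j=1}^{|\mathcal{R}|}\sum_{d=1}^{D}\|v_{:d}\|_2^2+\sum_{d=1}^{D}\|u_{:d}\|_2^2\|r_{:d}\|_2^2\\
&=\sum_{d=1}^{D}\left(\|u_{:d}\|_2^2\|r_{:d}\|_2^2+|\mathcal{R}|\,\|v_{:d}\|_2^2\right)\\
&\ge\sum_{d=1}^{D}2\sqrt{|\mathcal{R}|}\,\|u_{:d}\|_2\|r_{:d}\|_2\|v_{:d}\|_2\\
&=2\sqrt{|\mathcal{R}|}\sum_{d=1}^{D}\|u_{:d}\|_2\|r_{:d}\|_2\|v_{:d}\|_2.
\end{aligned}
$$

The equality holds if and only if $\|u_{:d}\|_2^2\|r_{:d}\|_2^2=|\mathcal{R}|\,\|v_{:d}\|_2^2$, i.e., $\|u_{:d}\|_2\|r_{:d}\|_2=\sqrt{|\mathcal{R}|}\,\|v_{:d}\|_2$.

For any CP decomposition $\mathcal{X}=\sum_{d=1}^{D}u_{:d}\otimes r_{:d}\otimes v_{:d}$, we can always let

$$
u'_{:d}=u_{:d},\qquad
r'_{:d}=\sqrt{\frac{\sqrt{|\mathcal{R}|}\,\|v_{:d}\|_2}{\|u_{:d}\|_2\|r_{:d}\|_2}}\;r_{:d},\qquad
v'_{:d}=\sqrt{\frac{\|u_{:d}\|_2\|r_{:d}\|_2}{\sqrt{|\mathcal{R}|}\,\|v_{:d}\|_2}}\;v_{:d},
$$

such that

$$
\|u'_{:d}\|_2\|r'_{:d}\|_2=\sqrt{|\mathcal{R}|}\,\|v'_{:d}\|_2,
$$

and meanwhile ensure that $\mathcal{X}=\sum_{d=1}^{D}u'_{:d}\otimes r'_{:d}\otimes v'_{:d}$.

Therefore, we know that

$$
\begin{aligned}
\frac{1}{\sqrt{|\mathcal{R}|}}\sum_{j=1}^{|\mathcal{R}|}\|X_j\|_*
&=\frac{1}{2\sqrt{|\mathcal{R}|}}\sum_{j=1}^{|\mathcal{R}|}\min_{X_j=UR_jV^\top}\left(\|UR_j\|_F^2+\|V\|_F^2\right)\\
&\le\frac{1}{2\sqrt{|\mathcal{R}|}}\min_{X_j=UR_jV^\top}\sum_{j=1}^{|\mathcal{R}|}\left(\|UR_j\|_F^2+\|V\|_F^2\right)\\
&=\min_{\mathcal{X}=\sum_{d=1}^{D}u_{:d}\otimes r_{:d}\otimes v_{:d}}\;\sum_{d=1}^{D}\|u_{:d}\|_2\|r_{:d}\|_2\|v_{:d}\|_2\\
&=\|\mathcal{X}\|_*.
\end{aligned}
$$

In the same manner, we know that

$$
\frac{1}{2\sqrt{|\mathcal{R}|}}\min_{X_j=UR_jV^\top}\sum_{j=1}^{|\mathcal{R}|}\left(\|VR_j^\top\|_F^2+\|U\|_F^2\right)=\|\mathcal{X}\|_*.
$$

The minimizers satisfy $\|v_{:d}\|_2\|r_{:d}\|_2=\sqrt{|\mathcal{R}|}\,\|u_{:d}\|_2$.

Therefore, the minimizers satisfy $\|u_{:d}\|_2\|r_{:d}\|_2=\sqrt{|\mathcal{R}|}\,\|v_{:d}\|_2$ and $\|v_{:d}\|_2\|r_{:d}\|_2=\sqrt{|\mathcal{R}|}\,\|u_{:d}\|_2$, $\forall\, d\in\{1,2,\ldots,D\}$.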
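The first chain of (in)equalities in this proof can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes diagonal relation matrices R_j (the CP setting), stores their diagonals as the rows of a hypothetical array `R`, and uses arbitrary small sizes. It evaluates the summed penalty, the AM-GM lower bound, and the rescaling (u', r', v') from the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, D = 5, 3, 4            # hypothetical sizes: entities, |R|, rank
U = rng.normal(size=(n_ent, D))      # columns are u_{:d}
V = rng.normal(size=(n_ent, D))      # columns are v_{:d}
R = rng.normal(size=(n_rel, D))      # row j is the diagonal of R_j; columns are r_{:d}


def col_norms(M):
    """Euclidean norm of every column of M."""
    return np.linalg.norm(M, axis=0)


def summed_penalty(U, R, V):
    """sum_j (||U R_j||_F^2 + ||V||_F^2) with R_j = diag(R[j])."""
    return sum(np.linalg.norm(U * R[j]) ** 2 + np.linalg.norm(V) ** 2
               for j in range(R.shape[0]))


def am_gm_bound(U, R, V):
    """2 sqrt(|R|) sum_d ||u_{:d}||_2 ||r_{:d}||_2 ||v_{:d}||_2."""
    return 2 * np.sqrt(R.shape[0]) * np.sum(col_norms(U) * col_norms(R) * col_norms(V))


# Rescaling from the proof: u' = u, r' = sqrt(c) r, v' = v / sqrt(c).
c = np.sqrt(n_rel) * col_norms(V) / (col_norms(U) * col_norms(R))
R_bal, V_bal = R * np.sqrt(c), V / np.sqrt(c)

# The rescaling leaves the decomposed tensor unchanged.
X = np.einsum("id,jd,kd->ijk", U, R, V)
X_bal = np.einsum("id,jd,kd->ijk", U, R_bal, V_bal)

print(summed_penalty(U, R, V) >= am_gm_bound(U, R, V))
print(np.isclose(summed_penalty(U, R_bal, V_bal), am_gm_bound(U, R_bal, V_bal)))
print(np.allclose(X, X_bal))
```

For random factors the bound is generally strict, while the rescaled factors meet it with equality and reproduce the same tensor, which is the step the proof relies on.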

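APPENDIX B
PROOF OF THEOREM 2

Proof. First, we have

$$
\sum_{l=1}^{|\mathcal{T}|}\sum_{j=1}^{|\mathcal{R}|}\left(\|UR_jT_l\|_F^2+\|V\|_F^2\right)
=\sum_{d=1}^{r}\left(\|u_{:d}\|_2^2\|r_{:d}\|_2^2\|t_{:d}\|_2^2+|\mathcal{R}||\mathcal{T}|\,\|v_{:d}\|_2^2\right),
$$

$$
\sum_{l=1}^{|\mathcal{T}|}\sum_{j=1}^{|\mathcal{R}|}\left(\|R_jT_lV^\top\|_F^2+\|U\|_F^2\right)
=\sum_{d=1}^{r}\left(\|v_{:d}\|_2^2\|r_{:d}\|_2^2\|t_{:d}\|_2^2+|\mathcal{R}||\mathcal{T}|\,\|u_{:d}\|_2^2\right).
$$

Suppose that $\|\mathcal{X}\|_*$ is attained at $D$ and $(u^*_{:d},r^*_{:d},v^*_{:d},t^*_{:d})$ for $d=1,\ldots,D$. Let

$$
\begin{aligned}
u'_{:d}&=\alpha_u u^*_{:d}=\left(\sqrt{\frac{\|v^*_{:d}\|_2\|r^*_{:d}\|_2\|t^*_{:d}\|_2}{\sqrt{|\mathcal{R}||\mathcal{T}|}\,\|u^*_{:d}\|_2}}\right)u^*_{:d},\\
v'_{:d}&=\alpha_v v^*_{:d}=\left(\sqrt{\frac{\|u^*_{:d}\|_2\|r^*_{:d}\|_2\|t^*_{:d}\|_2}{\sqrt{|\mathcal{R}||\mathcal{T}|}\,\|v^*_{:d}\|_2}}\right)v^*_{:d},\\
r'_{:d}&=\alpha_r r^*_{:d}=\sqrt{\frac{\sqrt{|\mathcal{R}||\mathcal{T}|}}{\|r^*_{:d}\|_2\|t^*_{:d}\|_2}}\;r^*_{:d},\\
t'_{:d}&=\alpha_t t^*_{:d}=\sqrt{\frac{\sqrt{|\mathcal{R}||\mathcal{T}|}}{\|r^*_{:d}\|_2\|t^*_{:d}\|_2}}\;t^*_{:d}.
\end{aligned}
$$

Notice that $\alpha_u\alpha_v\alpha_r\alpha_t=1$, so

$$
\mathcal{X}=\sum_{d=1}^{D}u'_{:d}\otimes r'_{:d}\otimes v'_{:d}\otimes t'_{:d},
\qquad
\|\mathcal{X}\|_*=\sum_{d=1}^{D}\|u'_{:d}\|_2\|r'_{:d}\|_2\|v'_{:d}\|_2\|t'_{:d}\|_2,
$$

i.e., $(u'_{:d},r'_{:d},v'_{:d},t'_{:d})$ for $d=1,\ldots,D$ also attains the nuclear p-norm $\|\mathcal{X}\|_*$.

It follows from the inequality of arithmetic and geometric means that

$$
\begin{aligned}
&\frac{1}{4}\sum_{d=1}^{r}\Big[\left(\|u_{:d}\|_2^2\|r_{:d}\|_2^2\|t_{:d}\|_2^2+|\mathcal{R}||\mathcal{T}|\,\|v_{:d}\|_2^2\right)
+\left(\|v_{:d}\|_2^2\|r_{:d}\|_2^2\|t_{:d}\|_2^2+|\mathcal{R}||\mathcal{T}|\,\|u_{:d}\|_2^2\right)\Big]\\
&\qquad\ge\sum_{d=1}^{r}\sqrt{|\mathcal{R}||\mathcal{T}|}\,\|u_{:d}\|_2\|r_{:d}\|_2\|v_{:d}\|_2\|t_{:d}\|_2\\
&\qquad\ge\sum_{d=1}^{D}\sqrt{|\mathcal{R}||\mathcal{T}|}\,\|u'_{:d}\|_2\|r'_{:d}\|_2\|v'_{:d}\|_2\|t'_{:d}\|_2\\
&\qquad=\sqrt{|\mathcal{R}||\mathcal{T}|}\,\|\mathcal{X}\|_*.
\end{aligned}
$$

The equality holds if $(u_{:d},r_{:d},v_{:d},t_{:d})=(u'_{:d},r'_{:d},v'_{:d},t'_{:d})$, i.e.,

$$
\begin{aligned}
&\frac{1}{4\sqrt{|\mathcal{R}||\mathcal{T}|}}\sum_{d=1}^{D}\Big[\left(\|u'_{:d}\|_2^2\|r'_{:d}\|_2^2\|t'_{:d}\|_2^2+|\mathcal{R}||\mathcal{T}|\,\|v'_{:d}\|_2^2\right)
+\left(\|v'_{:d}\|_2^2\|r'_{:d}\|_2^2\|t'_{:d}\|_2^2+|\mathcal{R}||\mathcal{T}|\,\|u'_{:d}\|_2^2\right)\Big]\\
&\qquad=\sum_{d=1}^{D}\|u'_{:d}\|_2\|r'_{:d}\|_2\|v'_{:d}\|_2\|t'_{:d}\|_2=\|\mathcal{X}\|_*.
\end{aligned}
$$

Therefore, the minimum of the left side is $\|\mathcal{X}\|_*$.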

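APPENDIX C
PROOF OF THEOREM 3

Lemma 1. The nuclear p-norm is also derived by

$$
\|A\|_{p*}=\min\left\{\frac{1}{D}\sum_{i=1}^{r}\sum_{k=1}^{D}\|u_{k,i}\|_p^D \;:\; A=\sum_{i=1}^{r}u_{1,i}\otimes\cdots\otimes u_{D,i},\ r\in\mathbb{N}\right\},
\tag{16}
$$

where $u_{k,i}\in\mathbb{R}^{n_k}$ for $k=1,\ldots,D$, $i=1,\ldots,r$, and $\otimes$ denotes the outer product.

Proof. Suppose that the nuclear p-norm $\|A\|_{p*}$ is attained at $R$ and $(u^*_{k,i})$ for $k=1,\ldots,D$, $i=1,\ldots,R$. Let

$$
u'_{k,i}=\alpha_{k,i}u^*_{k,i},
\qquad
\alpha_{k,i}=\frac{\sqrt[D]{\prod_{d=1}^{D}\|u^*_{d,i}\|_p}}{\|u^*_{k,i}\|_p}.
$$

Notice that $\prod_{k=1}^{D}\alpha_{k,i}=1$, so

$$
A=\sum_{i=1}^{R}u'_{1,i}\otimes\cdots\otimes u'_{D,i},
\qquad
\sum_{i=1}^{R}\prod_{k=1}^{D}\|u'_{k,i}\|_p=\|A\|_{p*},
$$

i.e., $(u'_{k,i})$ for $k=1,\ldots,D$, $i=1,\ldots,R$ also attains the nuclear p-norm $\|A\|_{p*}$.

It follows from the inequality of arithmetic and geometric means that

$$
\begin{aligned}
\frac{1}{D}\sum_{i=1}^{r}\sum_{k=1}^{D}\|u_{k,i}\|_p^D
&=\sum_{i=1}^{r}\left(\frac{1}{D}\sum_{k=1}^{D}\|u_{k,i}\|_p^D\right)\\
&\ge\sum_{i=1}^{r}\left(\prod_{k=1}^{D}\|u_{k,i}\|_p\right)\\
&\ge\sum_{i=1}^{R}\prod_{k=1}^{D}\|u'_{k,i}\|_p\\
&=\|A\|_{p*},
\end{aligned}
$$

where $A=\sum_{i=1}^{r}u_{1,i}\otimes\cdots\otimes u_{D,i}$. The left side attains the value $\|A\|_{p*}$ at $R$ and $(u'_{k,i})$ for $k=1,\ldots,D$, $i=1,\ldots,R$, because all $\|u'_{k,i}\|_p$ with the same $i$ are equal, i.e.,

$$
\frac{1}{D}\sum_{i=1}^{R}\sum_{k=1}^{D}\|u'_{k,i}\|_p^D=\sum_{i=1}^{R}\prod_{k=1}^{D}\|u'_{k,i}\|_p=\|A\|_{p*}.
$$

Therefore, the minimum of the left side is $\|A\|_{p*}$.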
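The balancing step in the proof of Lemma 1 can be illustrated for a single rank-one term. This is a minimal sketch under assumed values: the choice p = 3, the mode sizes, and all function names are hypothetical and only serve the illustration. It rescales the factors by α_k so that all factor p-norms become equal, then compares the mean of D-th powers with the product of norms.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3                                          # hypothetical choice of p
dims = (4, 5, 6)                               # hypothetical mode sizes n_k
factors = [rng.normal(size=n) for n in dims]   # u_{1,i}, ..., u_{D,i} for one i
D = len(factors)


def p_norms(fs):
    """p-norm of every factor vector."""
    return np.array([np.linalg.norm(f, ord=p) for f in fs])


def mean_of_powers(fs):
    """(1/D) sum_k ||u_k||_p^D, the summand on the left side of Equation (16)."""
    return np.mean(p_norms(fs) ** D)


def product_of_norms(fs):
    """prod_k ||u_k||_p, the summand in the definition of the nuclear p-norm."""
    return np.prod(p_norms(fs))


# alpha_k = (prod_d ||u_d||_p)^(1/D) / ||u_k||_p, as in the proof.
norms = p_norms(factors)
alpha = np.prod(norms) ** (1.0 / D) / norms
balanced = [a * f for a, f in zip(alpha, factors)]

print(mean_of_powers(factors) >= product_of_norms(factors))            # AM-GM
print(np.isclose(mean_of_powers(balanced), product_of_norms(balanced)))
print(np.isclose(np.prod(alpha), 1.0))  # the outer product is unchanged
```

Because the scaling factors multiply to one, the rank-one tensor itself is unchanged, while the AM-GM inequality becomes an equality for the balanced factors.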

TABLE 7: Statistics of KGC benchmark datasets.

             WN18RR    FB15k-237    YAGO3-10
#Entity      40,943    14,541       123,182
#Relation    11        237          37
#Train       86,835    272,115      1,079,040
#Valid       3,034     17,535       5,000
#Test        3,134     20,466       5,000

TABLE 8: Statistics of temporal KGC benchmark datasets.

             ICEWS14   ICEWS05-15   YAGO15k
#Entity      6,869     10,094       15,403
#Relation    230       251          51
#Timestamp   365       4,017        170
#Train       72,826    368,962      110,441
#Valid       8,941     46,275       13,815
#Test        8,963     46,092       13,800

APPENDIX D
DATASET STATISTICS

For static knowledge graph completion, we consider three public static knowledge graph datasets: WN18RR [40], FB15k-237 [18], and YAGO3-10 [7], which have been divided into training, validation, and test sets in previous works. The statistics of these datasets are shown in Table 7.

For temporal knowledge graph completion, we consider three public temporal knowledge graph datasets: ICEWS14 [50], ICEWS05-15 [50], and YAGO15k [51], which have been divided into training, validation, and test sets in previous works. ICEWS14 and ICEWS05-15 are extracted from the Integrated Conflict Early Warning System (ICEWS). YAGO15k is a modification of FB15k. The statistics of these datasets are shown in Table 8.

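APPENDIX E
EVALUATION METRICS

The mean reciprocal rank (MRR) is the average of the reciprocal ranks of the results for a sample of queries Q:

$$
\mathrm{MRR}=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}.
$$

Hits@N is the proportion of ranks that are no greater than N:

$$
\mathrm{Hits@}N=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\mathbb{1}_{x\le N}(\mathrm{rank}_i),
$$

where $\mathbb{1}_{x\le N}(\mathrm{rank}_i)=1$ if $\mathrm{rank}_i\le N$, and $\mathbb{1}_{x\le N}(\mathrm{rank}_i)=0$ otherwise.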
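Both metrics follow directly from the list of ranks. The sketch below is a plain-Python illustration of the two definitions; the function names and the example ranks are hypothetical and are not taken from the paper's evaluation code.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1 / rank_i over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at(ranks, n):
    """Fraction of queries whose rank is no greater than n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Hypothetical ranks of the correct entity for five queries.
ranks = [1, 3, 2, 10, 50]

print(mean_reciprocal_rank(ranks))   # (1 + 1/3 + 1/2 + 1/10 + 1/50) / 5
print(hits_at(ranks, 1), hits_at(ranks, 3), hits_at(ranks, 10))
```

Ranks start at 1, so both metrics lie in (0, 1] and [0, 1] respectively, and higher values are better.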

TABLE 9: The queries in t-SNE visualizations.

Index   Query
1       (political drama, /media_common/netflix_genre/titles, ?)
2       (Academy Award for Best Original Song, /award/award_category/winners./award/award_honor/ceremony, ?)
3       (Germany, /location/location/contains, ?)
4       (doctoral degree, /education/educational_degree/people_with_this_degree./education/education/major_field_of_study, ?)
5       (broccoli, /food/food/nutrients./food/nutrition_fact/nutrient, ?)
6       (shooting sport, /olympics/olympic_sport/athletes./olympics/olympic_athlete_affiliation/country, ?)
7       (synthpop, /music/genre/artists, ?)
8       (Italian American, /people/ethnicity/people, ?)
9       (organ, /music/performance_role/track_performances./music/track_contribution/role, ?)
10      (funk, /music/genre/artists, ?)

APPENDIX F
THE OPTIMAL VALUE OF p IN STATIC KGC

In distance-based models, the commonly used p is either 1 or 2. When p = 2, DURA takes the form given in Equation (7) in the main text. If p = 1, we cannot expand the squared score function of the associated DB models as in Equation (4). Thus, the induced regularizer takes the form

$$
\sum_{(u_i,r_j,v_k)\in\mathcal{S}}\|u_iR_j-v_k\|_1+\|v_kR_j^\top-u_i\|_1.
$$

The above regularizer with p = 1 (Reg p1) does not give an upper bound on the tensor nuclear-2 norm as in Theorem 1. Table 10 shows that DURA significantly outperforms Reg p1 on WN18RR and FB15k-237. Therefore, we choose p = 2.
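The p = 1 regularizer (Reg p1) from Appendix F can be written down directly from its formula. The sketch below is an illustration only, assuming RESCAL-style full relation matrices, a single shared entity embedding array `E`, and triples given as index tuples; these names and the toy data are hypothetical and do not come from the authors' implementation.

```python
import numpy as np


def reg_p1(E, Rm, triples):
    """Sum over triples of ||u R_j - v||_1 + ||v R_j^T - u||_1.

    E       : (n_entities, D) entity embeddings (assumed shared by heads and tails)
    Rm      : (n_relations, D, D) relation matrices
    triples : iterable of (head index, relation index, tail index)
    """
    total = 0.0
    for i, j, k in triples:
        u, R, v = E[i], Rm[j], E[k]
        total += np.abs(u @ R - v).sum() + np.abs(v @ R.T - u).sum()
    return total


# Toy data: two 2-dimensional entities and two relations.
E = np.array([[1.0, 0.0], [0.0, 1.0]])
Rm = np.array([[[0.0, 1.0], [1.0, 0.0]],    # relation 0 swaps the coordinates
               [[1.0, 0.0], [0.0, 1.0]]])   # relation 1 is the identity

print(reg_p1(E, Rm, [(0, 0, 1)]))   # the swap maps head onto tail, penalty 0.0
print(reg_p1(E, Rm, [(0, 1, 1)]))   # the identity leaves a gap, penalty 4.0
```

The penalty vanishes exactly when the relation maps the head onto the tail and its transpose maps the tail back onto the head, which mirrors the distance-based reading of the regularizer.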

TABLE 10: Comparison to Reg p1, where "R" denotes RESCAL and "C" denotes ComplEx.

             WN18RR               FB15k-237
             MRR    H@1    H@10   MRR    H@1    H@10
R-Reg p1     .281   .220   .394   .310   .228   .338
C-Reg p1     .409   .393   .439   .316   .229   .487
R-DURA       .498   .455   .577   .368   .276   .550
C-DURA       .491   .449   .571   .371   .276   .560

APPENDIX G
THE QUERIES IN T-SNE VISUALIZATION

In Table 9, we list the ten queries used in the t-SNE visualization (Section 6.5 in the main text). Note that a query is represented as (u, r, ?), where u denotes the head entity and r denotes the relation.
