Proceedings of The 12th International Conference on Natural Language Generation, pages 540–551, Tokyo, Japan, 28 Oct – 1 Nov, 2019. ©2019 Association for Computational Linguistics


What Goes Into A Word: Generating Image Descriptions With Top-Down Spatial Knowledge

Mehdi Ghanimifard, Simon Dobnik
Centre for Linguistic Theory and Studies in Probability (CLASP)

Department of Philosophy, Linguistics and Theory of Science (FLoV)
University of Gothenburg, Sweden

{mehdi.ghanimifard,simon.dobnik}@gu.se

Abstract

Generating grounded image descriptions requires associating linguistic units with their corresponding visual clues. A common method is to train a decoder language model with an attention mechanism over convolutional visual features. Attention weights align the stratified visual features, arranged by their location, with tokens, most commonly words, in the target description. However, words such as spatial relations (e.g. next to and under) do not refer directly to geometric arrangements of pixels but to complex geometric and conceptual representations. The aim of this paper is to evaluate what representations facilitate generating image descriptions with spatial relations and lead to better grounded language generation. In particular, we investigate the contribution of four different representational modalities in generating relational referring expressions: (i) (pre-trained) convolutional visual features, (ii) spatial attention over visual features, (iii) top-down geometric relational knowledge between objects, and (iv) world knowledge captured by contextual embeddings in language models.

1 Introduction

Spatial recognition and reasoning are essential bases for visual understanding. Automatically generating descriptions of scenes involves both recognising objects and their spatial configuration. This project follows up on recent attempts to improve language generation and understanding in terms of using spatial modules in the fusion of vision and language (Xu et al., 2015; Johnson et al., 2016; Lu et al., 2017; Hu et al., 2017; Anderson et al., 2018) (see also Section 6).

Generating spatial descriptions is an important part of the image description task which requires several types of knowledge obtained from different modalities: (i) invariant visual clues for object identification, (ii) geometric configuration of the scene representing relations between objects relative to the size of the environment, (iii) object-specific functional relations that capture interaction between them and are formed by our knowledge of the world, for example an umbrella is over a man is true if the referring umbrella serves its function, protecting the man from the rain (Coventry et al., 2001), and (iv) for projective relations (e.g. "to the left of" and "above") but not topological relations (e.g. "close" and "at"), the frame of reference which can be influenced by other modalities such as scene attention and dialogue interaction (Dobnik et al., 2015). Work in cognitive psychology (Logan, 1994, 1995) argues that while object identification may be pre-attentive, identification of spatial relations is not and is accomplished by top-down mechanisms of attention after the objects have been identified. It is also the case that we do not identify all possible relations between objects but only those that are attended by such top-down mechanisms considering different kinds of high-level knowledge.

Experiments on training neural recurrent language models in a bottom-up fashion from data1 demonstrated that spatial relations are frequently not learned to be grounded in visual inputs (Lu et al., 2017; Tanti et al., 2018a; Ghanimifard and Dobnik, 2018), which has been attributed to the design choices of these models that primarily focus on identification of objects (Kelleher and Dobnik, 2017). Therefore, targeted integration of different modalities is required to capture the properties from (i) to (iv). We can do this top-down (Anderson et al., 2018; Hu et al., 2017; Liu et al., 2017). However, it is not immediately obvious what kind of top-down spatial knowledge will benefit the bottom-up models most. Therefore, in this paper we investigate the integration of different kinds of top-down spatial knowledge beyond object localisation, represented as features, with the bottom-up neural language model.

1 Bottom-up learning acquires higher-level representations from examples of local features rather than using an external procedure to extract them. See also Section 6.

⟨"teddy bear", "partially under", "go cart"⟩

Figure 1: ⟨TARGET, RELATION, LANDMARK⟩ annotation of bounding boxes in Visual Genome 2318741 (RaSeLaSeD Il Pinguino (2008): CC BY-SA 2.0).

The paper is organised as follows. In Section 2, we discuss how spatial descriptions are constructed and what components are required to generate descriptions. In Section 3, the design of the neural networks is explained. In Section 4, we explain what dataset is used for this study, what pre-processing was applied to it and how the models are trained. Then the experiments and evaluation results are presented in Section 5. The related work in relation to our methods and findings is discussed in Section 6. The conclusion is given in Section 7.

2 Generating Spatial Descriptions

When describing a scene, there are several ways to construct spatial descriptions referring to objects and places and their relation with each other. A spatial description has three parts: a TARGET and a LANDMARK referring to objects or places, and a RELATION denoting the location of the target in relation to the landmark (Logan and Sadler, 1996).2

These are in the example in Figure 1 as follows: There is a teddy bear (TARGET) partially under (RELATION) a go cart (LANDMARK).

Therefore generating such a description requires (a) identification of objects and their locations: the target is what we want to describe and the landmark is what we will relate the target to; the salience of the landmark is important for the hearer. (b) Grounding of the relation in geometric space: the spatial relation is expressed relative to the landmark which grounds a 3-dimensional coordinate system; furthermore, for projective relations, the coordinate system is aligned with the orientation of the external viewpoint which determines the frame of reference (Maillat, 2003). (The viewpoint may also be the landmark object itself, in which case the coordinate system is oriented in the same way as the landmark.) (c) Grounding in function: a spatial relation may also be selected based on the functional properties between target and landmark objects, e.g. the difference between "the teapot is over the cup" and "the teapot is above the cup" (Coventry et al., 2001).

Generating spatial descriptions requires knowing the intended target object and how we want to convey its location to the listener. The bottom-up approach in image captioning is focused on learning the salience of objects and events to generate captions expressed in the dataset (e.g. Xu et al. (2015)). The combination of bottom-up and top-down approaches for generating descriptions uses modularisation in order to improve the generation of descriptions of different kinds (e.g. You et al. (2016)). However, as we have seen in the preceding discussion, the generation of spatial descriptions requires highly specific geometric knowledge. How is this knowledge approximated by the bottom-up models? To what degree can we integrate this knowledge with the top-down models? In this paper, we investigate these questions in a language generation task by comparing different variations of included top-down spatial knowledge. More specifically, for each image, we generate a description for every pair of objects that are localised in the image. We consider a variety of top-down spatial knowledge representations about objects as inputs to the model: (a) explicit object localisation and extraction of visual features; (b) explicit identification of the target and landmark by specifying their order in the feature vector; and (c) explicit geometric representation of objects in a 2D image. We investigate the contribution of each of these sets of features to the generation of image descriptions.

2 Sometimes these are also known as referent and relatum (Miller and Johnson-Laird, 1976), figure and ground (Talmy, 1983) or the located object and the reference object (Herskovits, 1986; Gapp, 1994; Dobnik, 2009).

3 Neural Network Design

Our method is to add step-by-step modules and configurations to the network providing the different kinds of top-down knowledge discussed in Section 2 and to investigate the performance of such configurations. There are several design choices with small effects on the performance but costly in terms of parameter size (Tanti et al., 2018b). Therefore, if there is no research question related to a choice, we take the simplest option as reported in previous work such as (Lu et al., 2017; Anderson et al., 2018). We use the following configurations:

1. Simple bottom-up encoder-decoder;
2. Bottom-up object localisation with attention;
3. Top-down object annotated localisation;
4. Top-down target and landmark assignment;
5. Two methods of top-down representation of geometric features (s-features).

These five configurations give us 10 variations of the model design, as shown in Table 1. A detailed definition of each module is given in Appendix A in the supplementary material.

Generative language model We use a simple forward recurrent neural model with cross-entropy loss in all model configurations.

Simple encoder-decoder An encoder-decoder architecture without spatial attention, shown in Figure 3a and similar to (Vinyals et al., 2015), is the simplest baseline for fusing vision and language. The input to the model is an image and the start symbol <s> of a description, and the output is produced by the language model decoder. The embeddings are randomly initialised and learned as a parameter set of the model. The visual vectors are produced by a pre-trained ResNet50 (He et al., 2016). A multi-layer perceptron module (Fv in Figure 2) is used to fine-tune the visual features.

Figure 2: Visual features (224 × 224 × 3 → 49 × 2048 → 49 × 100) are obtained from the pre-trained ResNet50, then translated to a low-dimensional vector with a dense layer Fv.
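Below is a minimal sketch (our reconstruction, not the released code) of the pipeline in Figure 2 using the Keras API: the pre-trained ResNet50 yields a 7 × 7 × 2048 feature map, which is reshaped into 49 region vectors and projected to 100 dimensions by a dense layer standing in for Fv. Freezing the CNN and using no activation on the projection are our assumptions.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

# Pre-trained convolutional backbone, kept frozen in this sketch.
cnn = ResNet50(include_top=False, weights="imagenet",
               input_shape=(224, 224, 3))
cnn.trainable = False

image = layers.Input(shape=(224, 224, 3))
feature_map = cnn(image)                           # (7, 7, 2048)
regions = layers.Reshape((49, 2048))(feature_map)  # 49 region vectors
f_v = layers.Dense(100)(regions)                   # F_v: project to (49, 100)

visual_encoder = Model(image, f_v)
```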

Bottom-up localisation With visual features representing all regions of the image as in Figure 2, the attention mechanism is used as a localisation module. We generalised the adaptive attention introduced in (Lu et al., 2017) to be able to fuse the modalities. As shown in Figure 3b, the interaction between the attention mechanism and the language model is more similar to (Anderson et al., 2018): two layers of stacked LSTM, the first stack (LSTMa) to produce the features for the attention model and the second stack (LSTMl) to produce contextualised linguistic features which are fused with the attended visual features. This design is easier to extend with additional top-down vectors.
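The following single-step sketch is our simplification of the two-stack decoder just described (it omits the visual sentinel of the full adaptive attention): LSTMa consumes the word embedding together with a global visual context and produces the query for soft attention over the region features, and LSTMl produces the contextualised linguistic state that is fused with the attended visual vector before predicting the next word. All dimensions and layer names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

V, D, H = 26530, 100, 512                    # vocabulary, feature dim, LSTM size

embed   = layers.Embedding(V, D)
lstm_a  = layers.LSTMCell(H)                 # attention LSTM (LSTM_a)
lstm_l  = layers.LSTMCell(H)                 # language LSTM (LSTM_l)
w_query = layers.Dense(D)
w_out   = layers.Dense(V)

def decode_step(word_id, regions, state_a, state_l):
    """One decoding step over `regions` of shape (batch, 49, D).
    States start as [tf.zeros((batch, H)), tf.zeros((batch, H))]."""
    x = embed(word_id)                                      # (batch, D)
    v_mean = tf.reduce_mean(regions, axis=1)                # global visual context
    h_a, state_a = lstm_a(tf.concat([x, v_mean], -1), state_a)
    scores = tf.einsum("bd,brd->br", w_query(h_a), regions) # attention logits
    alpha = tf.nn.softmax(scores)                           # (batch, 49)
    v_att = tf.einsum("br,brd->bd", alpha, regions)         # attended visual vector
    h_l, state_l = lstm_l(h_a, state_l)                     # contextualised linguistic state
    logits = w_out(tf.concat([h_l, v_att], -1))             # fuse and predict next word
    return logits, alpha, state_a, state_l
```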

Top-down localisation Unlike the bottom-up unsupervised localisation, the top-down method includes a provision of a list of regions of interest (ROI) from external procedures. For example, the region proposals can come from another bottom-up task as in (Anderson et al., 2018; Johnson et al., 2016) which use a Faster R-CNN (Ren et al., 2015) to extract possible regions of interest from the ConvNet regions in Figure 2. Here, as shown in Figure 4, we use the bounding box annotations of objects in images as the top-down localisation knowledge and then extract ResNet50 visual features from these regions. In the first stage the top-down visual representation only proposes visual vectors of the two objects in a random order without their spatial role as targets and landmarks in the descriptions. The model is shown in Figure 3d.

Top-down target-landmark assignment In the second iteration of the top-down localisation module we assign semantic roles to regions as targets and landmarks. This is directly related to localisation as spatial relations are asymmetric. We encode this top-down knowledge by fixing the order of the regions in the feature vector. The first object is the target and the second object is the landmark. Otherwise, the model is the same as in the previous iteration shown in Figure 3d.

Top-down geometric features The localisation procedure of objects discussed previously does not provide any geometric information about the relation between the two regions. However, top-down geometric features are required for grounding spatial relations where the location of the target object is expressed relative to the landmark. For example, a simple (but by no means sufficient) geometric relation between two bounding boxes can be represented by an arrow from the centre of one bounding box to the centre of the other and by ordering the information about bounding boxes in the feature vector as in the previous model to encode target-landmark asymmetry. The network architecture of the model with top-down geometric features expressing relations between the objects is shown in Figure 3e.


Model name     | Regions Of Interest  | TARGET-LANDMARK     | s-features   | Architecture
simple         | -                    | -                   | -            | Figure 3a
bu49           | Bottom-up (7×7 grid) | Bottom-up attention | -            | Figure 3b
bu49+mask      | Bottom-up (7×7 grid) | Bottom-up attention | Multi-hot 98 | Figure 3c
bu49+VisKE     | Bottom-up (7×7 grid) | Bottom-up attention | Dense 11     | Figure 3c
td             | Top-down (2 bbox)    | Bottom-up attention | -            | Figure 3d
td +mask       | Top-down (2 bbox)    | Bottom-up attention | Multi-hot 98 | Figure 3e
td +VisKE      | Top-down (2 bbox)    | Bottom-up attention | Dense 11     | Figure 3e
td order       | Top-down (2 bbox)    | Top-down assignment | -            | Figure 3d
td order+mask  | Top-down (2 bbox)    | Top-down assignment | Multi-hot 98 | Figure 3e
td order+VisKE | Top-down (2 bbox)    | Top-down assignment | Dense 11     | Figure 3e

Table 1: The 10 variations of the neural network model after incrementally adding modules and features.

Figure 3: Five architectures: (a) simple encoder-decoder (simple). (b) bottom-up localisation with adaptive attention on 49 regions (bu49). (c) bottom-up localisation with explicit spatial vectors of the bounding boxes (bu49+mask/bu49+VisKE). (d) top-down localisation with attention on two bounding boxes (td). (e) top-down localisation augmented with explicit spatial vectors of the bounding boxes (td +mask/td +VisKE).

Figure 4: Top-down localisation of objects with bounding boxes (224 × 224 × 3 → 2048 → 100) whose visual features are extracted and translated to lower dimensions with Fv.

We consider two different representations of the top-down geometric features, shown in Figure 5: a multi-hot mask over 49 locations, independently for target and landmark (Mask) (Figure 5a), and VisKE (Sadeghi et al., 2015) dense representations with 11 geometric features (Figure 5b), where dx, dy are changes in the coordinates of the centres, ov, ov1, ov2 the overlapping areas (total, relative to the first, and the second bounding box), h1, h2 heights, w1, w2 widths and a1, a2 areas. Note that Mask features provide geometric information about the size and the location of objects relative to the picture frame, and VisKE features provide more detailed geometric information that expresses the relation between the objects. The latter therefore more closely match the features that were identified in spatial cognitive models. A feed-forward network with two layers (Fs) is used to project geometric features into a vector with the same dimensionality as the Fv outputs so that different modalities are comparable in the weighted sum model of attention.
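As a sketch (our reconstruction from the feature list above, not the authors' code), the 11 dense geometric features can be computed from a target and a landmark bounding box given as normalised (x, y, w, h) tuples; dividing the overlap by the respective box areas for the relative overlaps is our assumption.

```python
def viske_features(target, landmark):
    """11 geometric features in the style of VisKE for two (x, y, w, h) boxes."""
    x1, y1, w1, h1 = target
    x2, y2, w2, h2 = landmark
    # displacement between the two box centres
    dx = (x2 + w2 / 2) - (x1 + w1 / 2)
    dy = (y2 + h2 / 2) - (y1 + h1 / 2)
    # overlap of the two boxes: total, relative to the first, and to the second
    ow = max(0.0, min(x1 + w1, x2 + w2) - max(x1, x2))
    oh = max(0.0, min(y1 + h1, y2 + h2) - max(y1, y2))
    ov = ow * oh
    a1, a2 = w1 * h1, w2 * h2
    ov1 = ov / a1 if a1 else 0.0
    ov2 = ov / a2 if a2 else 0.0
    return [dx, dy, ov, ov1, ov2, h1, h2, w1, w2, a1, a2]
```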

Figure 5: (a) Each bounding box is converted to a multi-hot mask over 49 regions (7 × 7 × 2 → 98 → 100). (b) The geometric relation between the two bounding boxes is represented with 11 features from (Sadeghi et al., 2015) (11 → 100).

4 Dataset and Training

We use the relationship dataset in Visual Genome (Krishna et al., 2017), which is a collection of referring expressions represented as triplets ⟨subject, predicate, object⟩ on 108K images. Unlike image captioning datasets such as MSCOCO (Chen et al., 2015) and Flickr30K (Plummer et al., 2015) where only 5 captions are given for each image, each image in this dataset is annotated with 50 phrases. The annotators were asked to annotate relations between two given bounding boxes of subject and object by freely writing the text for each of the three parts of the annotation. The bounding boxes were produced by another annotation procedure which detected objects in the images. In total, there are 2,316,104 annotations of 664,805 unique triplets, 35,744 unique labels of subjects and 21,299 unique labels of objects, most of which consist of multiple tokens. We omit all repetitions of triplets on each image, which leaves a total of 1,614,055 annotations.3

Spatial relations Based on the lists of spatial prepositions in (Landau, 1996) and (Herskovits, 1986), we have created a dictionary of spatial relations and their possible multi-word variants including their composite forms. This dictionary contains 7,122 entries of 235 relations (e.g. right to represent both on the right hand side of and to the right of). Of these, only 202 are found in the Visual Genome dataset, covering 79 spatial relations. 328,966 unique triplets in Visual Genome are based on exactly one of these terms, which covers 49.4% of all possible relationships.4
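A hypothetical fragment of such a variant dictionary illustrates the mapping from multi-word surface forms to a canonical relation label; only the right entries follow the example given above, the others are illustrative and the real dictionary is far larger.

```python
# Hypothetical fragment of the variant-to-relation dictionary described above.
RELATION_VARIANTS = {
    "on the right hand side of": "right",
    "to the right of": "right",
    "on the left hand side of": "left",
    "to the left of": "left",
}

def canonical_relation(phrase):
    """Map a surface form to its canonical spatial relation (None if unknown)."""
    return RELATION_VARIANTS.get(phrase.lower().strip())
```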

Bounding boxes Each bounding box is a tuple of 4 numbers (x, y, w, h). We normalise the numbers to the range of (0, 1) relative to the image size to create geometric feature vectors (Section 3). The image is split into a grid of 7×7 cells to which bounding boxes are mapped, one bounding box potentially covering more than one cell. With this bounding box granularity, there are exactly 308,330 possible bounding boxes. However, only 151,974 are observed in the relationships dataset. The spatial distribution of paired objects reflects how natural pictures are framed and how related objects are understood by annotators.

3 The repetitions include reflexive expressions (e.g. horse next to horse), annotations of several objects of the same type (e.g. cup on table), and repetitions due to several bounding box annotations of the same objects with different sizes.

4 Other triplets in Visual Genome also have spatial content. Some of them include modifiers such as partially under as in Figure 1 and some of them are descriptions of an event or an action such as sitting on and jumping over. Some annotated relationships are verbs such as flying with a less obvious spatial denotation. The spatial bias in the dataset was studied in (Collell et al., 2018). The most frequent spatial relation in the dataset is "on" (over 450K instances), the second is "in" (150K instances), then "with", variations of "behind", "near", "top", "next", "under", "front", and "by" (less than 10K instances each).
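The sketch below is our reconstruction of how a normalised bounding box can be mapped onto the 7×7 grid to build the 98-dimensional multi-hot mask s-features of Figure 5a; the exact rounding of partially covered cells is our assumption.

```python
import numpy as np

def bbox_to_mask(x, y, w, h, grid=7):
    """Multi-hot vector over the grid cells covered by a normalised (x, y, w, h) box."""
    mask = np.zeros((grid, grid), dtype=np.float32)
    c0, r0 = int(x * grid), int(y * grid)                         # top-left cell
    c1, r1 = int(np.ceil((x + w) * grid)), int(np.ceil((y + h) * grid))
    mask[r0:max(r1, r0 + 1), c0:max(c1, c0 + 1)] = 1.0            # cells the box covers
    return mask.reshape(-1)                                       # 49-dim multi-hot

def s_features_mask(target, landmark):
    """Concatenate the target and landmark masks into a 98-dimensional vector."""
    return np.concatenate([bbox_to_mask(*target), bbox_to_mask(*landmark)])
```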

Pre-processing We first removed duplicate triplets describing the same image. Then we converted each triplet into a word sequence by concatenating the strings and de-tokenising them with the white space separator. This produced a corpus with a vocabulary of 26,530 types, with a maximum sequence length of 16 tokens and on average 15 referring expressions per image. We use 95% of the descriptions for training and 5% for validation and testing (5,230 images with 80,231 triplets).

Training We use Keras (Chollet et al., 2015) with TensorFlow backend (Abadi et al., 2015) to implement and train all of the neural network architectures in Section 3. The models are trained with the Adam optimiser (Kingma and Ba, 2014) (α = 0.001, β1 = 0.9, β2 = 0.999) with a batch size of 128 for 15 epochs.
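A minimal sketch of this training configuration with the Keras API is given below; the single-LSTM model and the random data are placeholders standing in for the actual architectures of Section 3 and the Visual Genome sequences.

```python
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import Adam

# Placeholder next-word prediction model (not one of the architectures above).
tokens = layers.Input(shape=(16,))
h = layers.Embedding(26530, 100)(tokens)
h = layers.LSTM(512, return_sequences=True)(h)
model = Model(tokens, layers.Dense(26530, activation="softmax")(h))

# Optimiser settings as reported above; the data below is random stand-in material.
model.compile(optimizer=Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
              loss="sparse_categorical_crossentropy")
x = np.random.randint(1, 26530, size=(256, 16))
y = np.roll(x, -1, axis=1)
model.fit(x, y, batch_size=128, epochs=15)
```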

5 Evaluation

All implementations are available online5.

5.1 Qualitative Examples

Figure 6 shows generated descriptions for two examples of unseen pictures from the test dataset by five models. The generated word sequence is the one with the lowest loss using beam search with k = 5. The first example shows exactly how top-down localisation of objects is important, especially if the goal is to refer to specific objects in the scene. In the second example, the visual features inside the bounding box are confusing for all 5 models. More examples are in Figure 13 in the Appendix.
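The following generic beam-search sketch with k = 5 is our illustration, not the authors' implementation; `step_fn` is a hypothetical callable that returns log-probabilities over the vocabulary for a given prefix.

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, k=5, max_len=16):
    """Return the highest-scoring (lowest-loss) token sequence under step_fn."""
    beams = [([start_id], 0.0)]                       # (prefix, summed log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == end_id:                  # finished hypothesis, keep as is
                candidates.append((prefix, score))
                continue
            log_probs = step_fn(prefix)               # (vocab,) log-probabilities
            for w in np.argsort(log_probs)[-k:]:      # top-k continuations
                candidates.append((prefix + [int(w)], score + float(log_probs[w])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]
```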

5.2 Overall Model Performance

Hypothesis Top-down spatial knowledge improves the model performance. We consider three categories of top-down spatial knowledge: (i) top-down localisation of regions of interest; (ii) top-down assignment of semantic roles to regions; and (iii) two kinds of geometric feature vectors.

Method After training the models we evaluate them by calculating the average word-level cross-entropy loss on held-out instances in the test set.6

5 https://gu-clasp.github.io/generate_spatial_descriptions/

6 Equivalent to log-perplexity of the language model.
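A sketch of this evaluation measure, the average word-level cross-entropy (the log of the model perplexity) over held-out descriptions, is given below; `probs` is a hypothetical callable returning the model's next-word distribution for a prefix.

```python
import numpy as np

def avg_word_cross_entropy(descriptions, probs):
    """descriptions: list of token-id lists; probs(prefix) -> vocabulary distribution."""
    total, n = 0.0, 0
    for tokens in descriptions:
        for t in range(1, len(tokens)):
            p = probs(tokens[:t])[tokens[t]]   # probability of the gold next word
            total += -np.log(p)
            n += 1
    return total / n                           # perplexity = exp of this value
```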


⟨"bat", "over", "shoulder"⟩
simple: player
bu49: man wearing shirt
td: bat in hand
td order: bat in hand
td order+VisKE: bat in hand

⟨"hood", "above", "oven"⟩
simple: window
bu49: pot on stove
td: oven has door
td order: vent above sink
td order+VisKE: cabinet has door

Figure 6: From Visual Genome: 2412051 (Herholz (2005): CC BY-SA 2.0) and 2413282 (juanjogasp (2013): CC BY-NC-SA 2.0).

We also calculate the loss on descriptions containing specific spatial relations for a qualitative understanding of the effects of each type of top-down knowledge.

Results The overall loss of each model on the unseen descriptions of images is shown in Figure 7. The fully bottom-up model with no spatial attention (simple) has the highest loss. The loss of the model variations with bottom-up localisation (bu49) is higher than that of the models with top-down localisation. The models with the top-down assignment of TARGET-LANDMARK achieve the best results. The effect of top-down geometric features is not significant.

Figure 7: Cross-entropy loss of different model configurations on evaluation data.

Figure 8 shows the performance of the models on a selection of spatial relations.

Figure 8: Cross-entropy loss of different model configurations on 40 descriptions for each relation: near, inside, above and below.

Discussion The top-down localisation (td) certainly improves the performance of the language models compared to purely bottom-up representations. However, the additional top-down assignment of TARGET-LANDMARK (td order) and the additional geometric arrangement of bounding box features (mask and VisKE) have only a small positive effect on overall performance. The overall performance is not representative of how these configurations affect the grounding of spatial relations. More specifically, the imbalance of certain groups of relations (especially a generally lower proportion of geometrically biased relations such as "left" and "right" in this dataset and the presence of relations with minimal spatial content such as has and wearing) makes it harder to draw conclusions about the overall performance of the models. We further examine two groups of frequent spatial relations. Relations such as inside and near represent one group, and above and below represent the other. Some top-down knowledge (as represented by our features) is less informative for the first group but is informative for the second group. For example, near does not require the assignment of TARGET-LANDMARK roles, and we observe that td order does not perform better than td. On the other hand, inside is sensitive to TARGET-LANDMARK assignment. However, since the relation is also restricted by a choice of objects (only certain objects can be inside others), their TARGET-LANDMARK assignment may already be inferred without such top-down knowledge from a language model. For the second group, the top-down knowledge about the semantic role of objects is important. However, left and right are among the least frequent relations in the dataset, which is demonstrated by the fact that their descriptions have a higher loss than above and below. For these relations the loss of the simple model is much higher than for the other configurations. It can be seen that td performs better than bu and td order contributes over td, but geometric features have a lesser effect than the identification of semantic roles (td order).

5.3 Grounding in features

Hypotheses With the aim to evaluate what top-down information contributes to the grounding of words, we examine the following hypotheses:

H1 s-features contribute to predicting spatial relation words.

H2 Without top-down TARGET-LANDMARK role assignments to each region, attention is uniformly distributed over region choices at the beginning of a sequence generation.

Method In order to check the contribution of each feature from different modalities in the prediction of each word, we look at the adaptive attention on each feature at the point of predicting the word.7 Since feature vectors are not normalised against the number of features of each modality, we first multiply each attention measure with the magnitude of the feature vector and then normalise it to sum to 1 again:

$$\beta_{t,f_i} = \frac{\alpha_{t,f_i}\,\lVert f_i \rVert}{\sum_j \alpha_{t,f_j}\,\lVert f_j \rVert} \qquad (1)$$

where t refers to the time in the word sequence and f_i is the feature to which the attention α_{t,f_i} is applied. We report the average β_{t,f_i} over the instances in the validation dataset.
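A small sketch of Equation 1: the attention weights α for one time step are rescaled by the magnitudes of their feature vectors and renormalised to sum to one.

```python
import numpy as np

def normalised_attention(alpha, features):
    """alpha: (n_features,) attention weights; features: list of feature vectors."""
    weighted = np.array([a * np.linalg.norm(f) for a, f in zip(alpha, features)])
    return weighted / weighted.sum()            # beta_{t, f_i} for this time step
```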

Figure 9 shows β for two examples in three models. For each word, the bar chart is divided between four features (in Figure 3e): (1) the target v_obj1, (2) the landmark v_obj2, (3) the s-features for the bounding boxes, and (4) the contextualised embeddings h_l.

7 In this experiment, we do not check if the estimated likelihood for the correct word is the highest predicted score. The generated descriptions may still be acceptable with an alternative spatial relation. Furthermore, in the following analysis we report the attention over semantic roles and not individual words.

Figure 9: β is plotted in bar charts for each word. (a) td order+VisKE (b) td +VisKE (c) td. The values of β for each word that constitutes the description referring to each bounding box region are given in the images.

After measuring the normalised attention on each feature according to Equation 1, we report the average of attentions on each token at that time step of the word sequence. We also group the tokens based on their semantic role in the triplets and report the average β on these tokens for a given role.

Results The average of attentions over triplets of tokens is plotted in Figure 10. The behaviour of attentions on word sequences in the four models is given in Figure 11.

Figure 10: The overall average of β on tokens of each semantic role (target, relation, landmark) on all examples of the test dataset, for 6 variations of the top-down knowledge about regions of interest (ROI): location of objects and their order as target and landmark.


Figure 11: The average of β attentions of top-down models over sequences of words 1…11, (a) comparing td and td +VisKE and (b) comparing td order and td order+VisKE.

Discussion The comparison of the 6 models in Figure 10 shows that the geometric mask s-features do not contribute as well as the dense VisKE s-features. In the models without top-down semantic role assignment, only the model with +VisKE features has the expected attention on target and landmark, but there is no attention on the s-features. In the models with top-down semantic role assignment, the model with VisKE s-features has higher attention on s-features when predicting a relation word (H1). A similar situation is observable over word sequences in Figure 11. Without prior semantic role assignment the model is more confused about how to attend to the target or the landmark (H2). Finally, note that the geometric VisKE s-features help predict the TARGET-LANDMARK roles when these are not assigned top-down.

6 Related Work

Generating referring expressions Generating locative expressions is part of the general field of generating referring expressions (Dale and Reiter, 1995; Krahmer and van Deemter, 2011) with applications such as describing scenes (Viethen and Dale, 2008) and images (Mitchell et al., 2012). The research on describing visible objects (Mitchell et al., 2013) and human-robot dialogue (Kelleher and Kruijff, 2006) raised questions about grounding relations in hierarchical representations of context. The application of neural language models and of convolutional neural networks for encoding visual features is an open question in interactive GRE tasks.

Encoder-decoder models with attention Recently several methods focused on finding better neural architectures for generating image descriptions based on pre-trained convolutional neural networks have been introduced. Karpathy and Fei-Fei (2015) align descriptions with images. Vinyals et al. (2015) introduce an encoder-decoder framework. Xu et al. (2015) improve this approach with spatial attention. Lu et al. (2017) introduce adaptive attention that balances language and visual embeddings. The attention measure provides an explanation of how each modality contributes to language generation in encoder-decoder architectures. Based on the attended features, the performance of these models can be examined (Liu et al., 2017; Ghanimifard and Dobnik, 2018). In our paper, we develop a model similar to adaptive attention which exploits its expressive aspects as a degree of grounding in different features.

Outputs of external models as top-down features In another line of work, the output of bottom-up visual understanding is used as top-down features for language generation. For example, an object detection pipeline is combined explicitly with language generation. This procedure was previously used in template-based language generation (Elliott and Keller, 2013; Elliott and de Vries, 2015). There have been attempts to combine this process with neural language models with attention. For example, You et al. (2016) extract candidate semantic attributes from images (e.g. a list of objects in the scene), and then the attention mechanism is used to learn to attend to them when generating tokens of image descriptions. Instead of semantic attributes, Anderson et al. (2018) use a region proposal network from a pre-trained object detection model to extract the generated bounding box regions as possible locations of visual clues. Then, the attention model learns to attend to the visual features associated with these regions. The idea of using an object detection module is also used in Johnson et al. (2016) where Faster R-CNN (Ren et al., 2015) is used to find regions of interest. Instead of assigning one object class to each region, a full description is generated for each proposed region. In all of these models, an image understanding module extracts some proposed representations and this knowledge is then used as a top-down representation of the scene to generate an image description. In this paper, we investigate the extent to which different spatial information facilitates, as top-down knowledge, the generation of descriptions of scenes with neural language models.

Modular design Our paper examines strategies that can demonstrate language grounding within a neural architecture. Studies of neural architectures such as (Tanti et al., 2018b) provide analytical insight into differences between multimodal architectures for language generation. Modular design is mostly used in language parsing tasks such as (Hu et al., 2017) where object recognition, localisation and relation recognition are separate modules for grounding different parts of image descriptions in images in order to solve tasks such as visual question answering. In our paper, the modularity of the neural architecture is not focused on parsing text but is used to incrementally demonstrate the contribution of each introduced modality to language generation.

Multimodal embeddings There are related studies on learning multimodal embeddings (Kiros et al., 2014; Lazaridou et al., 2015) to represent vision and language in the same semantic space. The focus of our paper is to investigate how these different modalities complement each other in neural language generation. In our models, the semantic representations of spatial relations are considered as a separate modality extending both the language and visual embeddings. There are related studies on encoding spatial knowledge in feature space in order to predict spatial prepositions (Ramisa et al., 2015) or on prepositional embeddings which can predict regions in space (Collell and Moens, 2018). In our paper, we investigate the degree to which each embedding contributes to language generation within the neural language model.

7 Conclusions

We explored the effects of encoding top-down spatial knowledge in a bottom-up trained generative neural language model for the image description task. The findings of the experiments in this paper are as follows:

(1) Overall, the integration of top-down knowledge has a positive effect on grounded neural language models for this task. (2) When combining bottom-up language grounding with top-down knowledge represented as different features, different types of top-down knowledge make different contributions to grounded language models. The general picture is further complicated by the fact that different spatial relations have different biases towards different knowledge. (3) The performance gain from the geometric features extracted from bounding boxes (s-features) is smaller than initially expected, with two possible explanations related to the nature of the corpora of image descriptions: (i) the corpus contains images of typical scenes where the relation of objects with each other is predictable from the description and therefore is captured in the language model; (ii) as annotators are focused on describing "what is in the image" rather than "where things are spatially in relation to each other", descriptions of geometric spatial relations which refer to the locational information are rare in the corpus. (4) The majority of attention is placed on the language model, which demonstrates that this provides significant information when generating spatial descriptions. While this may be a confounding factor if the visual features are ignored, the language model also encodes useful information about spatial relations, as discussed in (Kulkarni et al., 2011; Dobnik et al., 2018).

The results open several questions about grounded language models. Firstly, the degree to which the system is using each modality can be affected by dataset biases, and this should be taken into account in forthcoming work. Given this bias, learning a single common language model for descriptions of spatial scenes is insufficient, as different kinds of knowledge may come into focus in different interactional scenarios. This further supports the idea that top-down integration of knowledge is required, where we hope that the models will learn to attend to the appropriate features. Secondly, our investigation leaves open the question whether the representations, both visual and geometric, that we use are good representations for learning spatial relations. Further work will include a focused investigation of what kind of geometric relations they encode.

Acknowledgements

The research reported in this paper was supported by a grant from the Swedish Research Council (VR project 2014-39) for the establishment of the Centre for Linguistic Theory and Studies in Probability (CLASP) at the University of Gothenburg.


References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. CVPR, 3(5):6.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

François Chollet et al. 2015. Keras. https://keras.io.

Michael Coghlan. 2011. Tony cook. Visual Genome image id 2413371.

Guillem Collell and Marie-Francine Moens. 2018. Learning representations specialized in spatial knowledge: Leveraging language and vision. Transactions of the Association of Computational Linguistics, 6:133–144.

Guillem Collell, Luc Van Gool, and Marie-Francine Moens. 2018. Acquiring common sense spatial knowledge through implicit spatial templates. In Thirty-Second AAAI Conference on Artificial Intelligence.

Kenny R Coventry, Merce Prat-Sala, and Lynn Richards. 2001. The interplay between geometry and function in the comprehension of over, under, above, and below. Journal of Memory and Language, 44(3):376–398.

Robert Dale and Ehud Reiter. 1995. Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19(2):233–263.

Simon Dobnik. 2009. Teaching mobile robots to use spatial words. Ph.D. thesis, University of Oxford: Faculty of Linguistics, Philology and Phonetics and The Queen's College, Oxford, United Kingdom.

Simon Dobnik, Mehdi Ghanimifard, and John D. Kelleher. 2018. Exploring the functional and geometric bias of spatial relations using neural language models. In Proceedings of the First International Workshop on Spatial Language Understanding (SpLU 2018) at NAACL-HLT 2018, pages 1–11, New Orleans, Louisiana, USA. Association for Computational Linguistics.

Simon Dobnik, Christine Howes, and John D. Kelleher. 2015. Changing perspective: Local alignment of reference frames in dialogue. In Proceedings of goDIAL – Semdial 2015: The 19th Workshop on the Semantics and Pragmatics of Dialogue, pages 24–32, Gothenburg, Sweden.

Desmond Elliott and Frank Keller. 2013. Image description using visual dependency representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1292–1302.

Desmond Elliott and Arjen de Vries. 2015. Describing images using inferred visual dependency representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 42–52.

Klaus-Peter Gapp. 1994. Basic meanings of spatial relations: computation and evaluation in 3D space. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI'94), volume 2, pages 1393–1398, Menlo Park, CA, USA. American Association for Artificial Intelligence, AAAI Press/MIT Press.

Mehdi Ghanimifard and Simon Dobnik. 2018. Knowing when to look for what and where: Evaluating generation of spatial descriptions with adaptive attention. In Proceedings of the 1st Workshop on Shortcomings in Vision and Language (SiVL'18), ECCV, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Dave Herholz. 2005. Wide stance. Visual Genome image id 2412051.

Annette Herskovits. 1986. Language and spatial cognition: an interdisciplinary study of the prepositions in English. Cambridge University Press, Cambridge.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. 2017. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1115–1124.

Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574.

juanjogasp. 2013. Baltic trip. Visual Genome image id 2413282.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

John D. Kelleher and Simon Dobnik. 2017. What is not where: the challenge of integrating spatial representations into deep learning architectures. In Proceedings of the Conference on Logic and Machine Learning in Natural Language (LaML 2017), Gothenburg, 12–13 June, volume 1 of CLASP Papers in Computational Linguistics, pages 41–52, Gothenburg, Sweden. Department of Philosophy, Linguistics and Theory of Science (FLOV), University of Gothenburg, CLASP, Centre for Language and Studies in Probability.

John D. Kelleher and Geert-Jan M. Kruijff. 2006. Incremental generation of spatial referring expressions in situated dialog. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 1041–1048, Sydney, Australia. Association for Computational Linguistics.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel. 2014. Multimodal neural language models. In International Conference on Machine Learning, pages 595–603.

Emiel Krahmer and Kees van Deemter. 2011. Computational generation of referring expressions: A survey. Computational Linguistics, 38(1):173–218.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2011. Baby talk: Understanding and generating image descriptions. In Proceedings of the 24th CVPR. Citeseer.

Barbara Landau. 1996. Multiple geometric representations of objects in languages and language learners. Language and Space, pages 317–363.

Angeliki Lazaridou, Marco Baroni, et al. 2015. Combining language and vision with a multimodal skip-gram model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 153–163.

Chenxi Liu, Junhua Mao, Fei Sha, and Alan Yuille. 2017. Attention correctness in neural image captioning. In Thirty-First AAAI Conference on Artificial Intelligence.

Gordon D Logan. 1994. Spatial attention and the apprehension of spatial relations. Journal of Experimental Psychology: Human Perception and Performance, 20(5):1015.

Gordon D Logan. 1995. Linguistic and conceptual control of visual spatial attention. Cognitive Psychology, 28(2):103–174.

Gordon D. Logan and Daniel D. Sadler. 1996. A computational analysis of the apprehension of spatial relations. In Paul Bloom, Mary A. Peterson, Lynn Nadel, and Merrill F. Garrett, editors, Language and Space, pages 493–530. MIT Press, Cambridge, MA.

Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 6.

Didier Maillat. 2003. The semantics and pragmatics of directionals: a case study in English and French. Ph.D. thesis, University of Oxford: Committee for Comparative Philology and General Linguistics, Oxford, United Kingdom.

George A. Miller and Philip N. Johnson-Laird. 1976. Language and Perception. Cambridge University Press, Cambridge.

Margaret Mitchell, Kees van Deemter, and Ehud Reiter. 2013. Generating expressions that refer to visible objects. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1174–1184, Atlanta, Georgia. Association for Computational Linguistics.

Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756. Association for Computational Linguistics.

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649.

Arnau Ramisa, JK Wang, Ying Lu, Emmanuel Dellandrea, Francesc Moreno-Noguer, and Robert Gaizauskas. 2015. Combining geometric, textual and visual features for predicting prepositions in image descriptions. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 214–220. Association for Computational Linguistics.

RaSeLaSeD Il Pinguino. 2008. Killer bear. Visual Genome image id 2318741.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Fereshteh Sadeghi, Santosh K Kumar Divvala, and Ali Farhadi. 2015. VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1456–1464.

Jared Schmidt. 2010. Desk 2010-08-08. Visual Genome image id 2413204.

Leonard Talmy. 1983. How language structures space. In Herbert L. Pick Jr. and Linda P. Acredolo, editors, Spatial Orientation: Theory, Research, and Application, pages 225–282. Plenum Press, New York.

Marc Tanti, Albert Gatt, and Kenneth P Camilleri. 2018a. Quantifying the amount of visual information used by neural caption generators. In Proceedings of the European Conference on Computer Vision (ECCV).

Marc Tanti, Albert Gatt, and Kenneth P Camilleri. 2018b. Where to put the image in an image caption generator. Natural Language Engineering, 24(3):467–489.

Jette Viethen and Robert Dale. 2008. The use of spatial relations in referring expression generation. In Proceedings of the Fifth International Natural Language Generation Conference, pages 59–67. Association for Computational Linguistics.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Brian Yap. 2008. New england highway. Visual Genome image id 2417890.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4651–4659.