
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics, pages 258–277, July 14-15, 2022. ©2022 Association for Computational Linguistics

Pretraining on Interactions for Learning Grounded Affordance Representations

Jack Merullo
Dylan Ebert
Carsten Eickhoff
Ellie Pavlick

Abstract

Lexical semantics and cognitive science point to affordances (i.e., the actions that objects support) as critical for understanding and representing nouns and verbs. However, study of these semantic features has not yet been integrated with the “foundation” models that currently dominate language representation research. We hypothesize that predictive modeling of object state over time will result in representations that encode object affordance information “for free”. We train a neural network to predict objects’ trajectories in a simulated interaction and show that our network’s latent representations differentiate between both observed and unobserved affordances. We find that models trained using 3D simulations from our SPATIAL dataset outperform conventional 2D computer vision models trained on a similar task, and, on initial inspection, that differences between concepts correspond to expected features (e.g., roll entails rotation). Our results suggest a way in which modern deep learning approaches to grounded language learning can be integrated with traditional formal semantic notions of lexical representations.

1 Introduction

Much of natural language semantics concerns events and their participants, i.e., verbs and the nouns with which they compose. Evidence from cognitive science (Borghi and Riggio, 2009; Mazzuca et al., 2021) and neuroscience (Sakreida et al., 2013) suggests that grounding such words in perception is an essential part of linguistic processing, in particular suggesting that humans represent nouns in terms of their affordances (Gibson, 1977), i.e., the interactions which they support. Affordance-based representations have been argued to form the basis of formal accounts of compositional syntax and semantics (Steedman, 2002). As such, prior work in formal semantics has sought to build grounded lexical semantic representations in terms of objects and their interactions in 3D space. For example, Pustejovsky and Krishnaswamy (2014) and Siskind (2001) represent verbs like roll as a set of entailed positional and rotational changes specified in formal logic, and Pustejovsky and Krishnaswamy (2018) argue that nouns imply (latent) events, e.g., that cups generally hold things, which should be encoded as TELIC values within the noun's formal structure.

Figure 1: We investigate whether observing interactions with an object in a 3D environment encodes information about its affordances, and whether this generalizes in the zero-shot setting to unseen object types.

Such work provides a compelling story of grounded semantics, but has not yet been connected to the types of large-scale neural network models that currently dominate NLP. Thus, in this work, we ask whether such semantic representations emerge naturally as a consequence of self-supervised predictive modeling. Our motivation stems from the success of predictive language modeling at encoding syntactic structure. That is, if neural language models trained to predict text sequences learn to encode desirable grammatical structures (Kim and Smolensky, 2021; Tenney et al., 2018), perhaps similar models trained to predict event sequences will learn to encode desirable semantic structures.

Figure 2: Objects used for train (light gray) and test (dark gray). Colored dots indicate which affordances each object has.

To test this intuition, we investigate whether a transformer (Vaswani et al., 2017) trained to predict the future state of an object given perceptual information about its appearance and interactions will latently encode affordance information of the type thought to be central to lexical semantic representations. In sum:

• We present a first proof-of-concept neural model that learns to encode the concept of affordance without any explicit supervision.

• We demonstrate empirically that 3D spatial representations (simulations) substantially outperform 2D pixel representations in learning the desired semantic features.

• We release the SPATIAL dataset of 9.5K simulated object interactions and accompanying videos, and an additional 200K simulations without videos to support further research in this area.1

Overall, our findings suggest a process by which grounded lexical representations, of the type discussed by Pustejovsky and Krishnaswamy (2014) and Siskind (2001), could potentially arise organically. That is, grounded interactions and observations, without explicit language supervision, can give rise to the types of conceptual representations to which nouns and verbs are assumed to ground. We interpret this as corroborative of traditional feature-based lexical semantic analyses and as a promising mechanism of which modern “foundation” model (Bommasani et al., 2021) approaches to language and concept learning can take advantage.

1 https://github.com/jmerullo/affordances

2 Experimental Setup

2.1 Objects and Affordances in SPATIAL

To collect a set of affordances to use in our study, we begin with lists of affordances and associated objects that have been compiled by previous work on affordance learning: Aroca-Ouellette et al. (2021) provide a small list of concrete actions for evaluating physical reasoning in large language models; Myers et al. (2015) provide a small list for training computer vision models to recognize which parts of objects afford certain actions; Chao et al. (2015) use crowdworkers2 to judge noun-verb pairs and include over 900 verbs that are both abstract and concrete in nature. We then filter this list down to only a subset of concrete actions that include objects which exist in the Unity asset store, since we use Unity simulated environments to build our training and evaluation data (§2.4). This results in a list of six affordances (roll, slide, stack, contain, wrap-grasp, bounce), which are used to assign binary labels to each of 39 objects from 11 object categories (Figure 2; see also Appendix A).
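To make the labeling step concrete, the sketch below shows how graded judgments could be thresholded into the binary labels we probe for. Only the score ≥ 4 rule for the Chao et al. (2015) judgments comes from the paper (see footnote 2); the object scores shown are hypothetical.

```python
# Hypothetical sketch of deriving binary affordance labels.
# Only the ">= 4 counts as positive" rule for Chao et al. (2015)
# crowd scores is taken from the paper; the score values are made up.
AFFORDANCES = ["roll", "slide", "stack", "contain", "wrap-grasp", "bounce"]

# Example crowd scores for one object (hypothetical values).
crowd_scores = {"roll": 4.6, "slide": 2.1, "stack": 1.5,
                "contain": 4.2, "wrap-grasp": 4.8, "bounce": 1.0}

def binary_labels(scores, threshold=4.0):
    """Map graded judgments to the binary labels used for probing."""
    return {aff: int(scores.get(aff, 0) >= threshold) for aff in AFFORDANCES}

print(binary_labels(crowd_scores))
# {'roll': 1, 'slide': 0, 'stack': 0, 'contain': 1, 'wrap-grasp': 1, 'bounce': 0}
```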

2.2 Representation Learning

We hypothesize that predictive modeling of object state will result in implicit representations of affordance and event concepts, similarly to how predictive language modeling results in implicit representations of syntactic and semantic types. Thus, for representation learning, we use a sequence-modeling task defined in the visual and/or spatial world.

2 In the case of Chao et al. (2015), we use a score ≥ 4 as a positive label, as they do in their paper.


Specifically, given a sequence of frames depicting an object's trajectory, our models' objective is to predict the next several timesteps of the object's trajectory as closely as possible. We consider several variants of this objective, primarily differing in how they represent the frames (e.g., as 2D visual vs. 3D spatial features). These models are described in detail in Section 3.

2.3 Evaluation Task

We are interested in evaluating which variants of the above representation learning objective result in readily-accessible representations of affordance and event concepts. To do this, we train probing classifiers (Belinkov and Glass, 2019) on top of the latent representations that result from the representation learning phase. That is, we freeze the weights of our pretrained models and feed the intermediate representation for a given input from the encoder into a single linear layer trained to classify whether the observed object has the affordance. We train a separate classifier probe for each affordance.
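For concreteness, the following is a minimal sketch of this probing setup in PyTorch. The paper specifies only a single linear layer trained on frozen encoder outputs, one probe per affordance; the representation dimension, optimizer, and training schedule below are illustrative assumptions.

```python
# Minimal sketch of a linear affordance probe on frozen representations.
# Assumptions: Adam optimizer, epoch count, and learning rate are illustrative.
import torch
import torch.nn as nn

class AffordanceProbe(nn.Module):
    def __init__(self, rep_dim: int):
        super().__init__()
        self.linear = nn.Linear(rep_dim, 2)  # binary: has / lacks the affordance

    def forward(self, frozen_rep: torch.Tensor) -> torch.Tensor:
        return self.linear(frozen_rep)

def train_probe(frozen_reps: torch.Tensor, labels: torch.Tensor,
                epochs: int = 50, lr: float = 1e-3) -> AffordanceProbe:
    """frozen_reps: (N, rep_dim) encoder outputs; labels: (N,) 0/1 integers."""
    probe = AffordanceProbe(frozen_reps.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(frozen_reps), labels)
        loss.backward()
        opt.step()
    return probe

# Usage: one probe per affordance, e.g.
# reps = torch.randn(128, 100); labels = torch.randint(0, 2, (128,))
# roll_probe = train_probe(reps, labels)
```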

We construct train and test splits by holding out a fraction of the objects from each category. In some cases, the held-out objects are very similar to what has been seen in training (e.g., slightly different dimensions of boxes), and in other cases the objects are visually very distinct (e.g., a wine bottle vs. a gas tank as instances of objects which afford both roll and contain). Figure 2 shows our objects, affordances, and train-test splits.

2.4 SPATIAL Environment and Data Collection

The SPATIAL dataset consists of simulations of interactions with a variety of 3D objects in the Unity game engine3. Our data is collected in a flat, empty room using the Unity physics engine on the above-described 39 objects. For each sequence, an object is instantiated at rest on the ground. A random impulse force, either a ‘push’ flat along the ground or a ‘throw’ into the air, is exerted on the object. We only exert a single impulse on an object per sequence. The sequence ends when the object stops moving or after 4 seconds elapse.

We record the coordinates of the object in 3D space at a rate of 60 frames per second. Specifically, each sequence is defined by the coordinates describing the object's 3D position in space P = {p1, ..., pt} for t timesteps.

3 https://unity.com/

Since we care about capturing the manner in which the object travels and rotates through space, pi contains 9 distinct 3D points around the object: the 8 corners of an imaginary bounding box and the center point of that bounding box (see Appendix A for a visual aid). Simultaneously, we collect videos of each interaction from a camera looking down at a 60 degree angle towards the object, which we use to train our 2D vision-based model. Each image in the videos is collected at a resolution of 384×216 pixels. We filter out videos where the object leaves the frame. Overall, this process results in 2,376 training sequences and 9,283 evaluation sequences. Due to computational constraints, we decided to focus on collecting as many evaluation examples as possible to make comparison to spatial models easier and more accurate. It may be the case that adding more data creates stronger representations, but even with this smaller training set, we see high test-time performance on the visual dynamics task. All our data are publicly available at https://github.com/jmerullo/affordances.
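As a sketch of the per-timestep input format described above, the function below flattens 9 points (8 bounding-box corners plus the center) into a 27-dimensional vector. The axis-aligned-box construction and corner ordering here are illustrative assumptions; SPATIAL records the points directly from Unity, where the box follows the object's orientation.

```python
# Sketch of building a 27-dimensional per-frame feature vector from a
# bounding box (assumption: axis-aligned box given a center and half-extents).
import numpy as np
from itertools import product

def frame_features(center: np.ndarray, half_extents: np.ndarray) -> np.ndarray:
    """center: (3,), half_extents: (3,) box half-sizes -> (27,) feature vector."""
    corners = np.array([center + np.array(signs) * half_extents
                        for signs in product((-1.0, 1.0), repeat=3)])  # (8, 3)
    points = np.vstack([corners, center[None, :]])                     # (9, 3)
    return points.reshape(-1)                                          # (27,)

# Example: a unit cube resting at the origin.
print(frame_features(np.zeros(3), np.full(3, 0.5)).shape)  # (27,)
```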

2.5 Assumptions and Limitations

This work serves as an initial investigation of our hypothesis about representation learning for affordances (§2.2). We use simple simulations which involve only a single object. Thus, we expect that our setup makes some affordances (roll, slide, bounce) more readily available than others (contain, stack, w-grasp). For example, our models likely will observe objects rolling during pretraining, but will never observe objects being stacked on top of one another. However, during evaluation, we will assess how well the model's internal representations encode both types of affordances. This is intended. Our hypothesis is that, to a large extent, these affordances are a function of the relationship between the shape4 of the objects and the physics of how those objects behave in our simulation. For example, we expect that long, thin, grasp-able objects will display different trajectories than will wide, round objects that cannot be grasped.

4 In fact, Gibson (1977)'s original theory of affordances defined them to be purely perceptual, without even depending on internal processing and representation. We do not endorse this view in general; we are enthusiastic about future work which involves richer internal processing (e.g., interaction and planning) during pretraining. See Sahin et al. (2007) for a review of the various definitions and interpretations of the term that have been used in different fields, and Mota (2021) for an argument that affordances are not solely perceptual. That said, this basic-perception approach is a helpful starting point for understanding the relationship between pretraining and affordance representations.


Figure 3: Model architectures (2D Visual, 3D Blind, 3D Visual-spatial). The model receives either images, 3D coordinates, or both to make predictions. For the 3D models, the transformer encoder encodes the input sequence (and multiview images, if applicable) and the decoder predicts the rest of the sequence. For the 2D model (RPIN), a convolutional network extracts object-centric features (hi) and interaction reasoning is performed over each to predict the next time steps.

Thus, we expect that a model trained to predict object trajectories can encode differences that map onto affordances such as grasp or contain, even without observing those actions per se. Given initial promising results (§4), we are excited about future work which extends the simulation to include richer multi-object and agent-object interactions, which likely would enable learning of more complex semantic concepts.

3 Models

We consider two primary variants of the representation learning task described in Section 2.2, which differ in how they represent the world state: using 2D visual data (§3.1) vs. using 3D visual-spatial data (§3.2). To provide additional insights into performance differences, we also consider two ablations of the 3D model (§3.2.2), one that removes visual information and one that further removes pretraining altogether. These models are summarized in Figure 3.

3.1 2D Visual Model

We first consider a standard computer-vision (CV) approach for our defined representation learning objective. For this, we use a Region Proposal Interaction Network (RPIN), proposed in Qi et al. (2021). We choose to use RPIN because it was designed to solve a task very similar to ours (object tracking over time) and has access to object representations via bounding boxes provided as supervision during training. Using a model with access to explicit object representations ensures that we are not unfairly handicapping the CV approach (by requiring it to learn the concept of objects from scratch) but rather are analyzing the relative benefits of a 2D CV approach vs. a 3D spatial data approach for latently encoding semantic event and affordance concepts.

We train the model with similar settings to those Qi et al. (2021) used to train on the Simulated Billiards dataset, but with some small differences. For example, we subsample our frames to be coarser-grained to encourage learning of longer-range dependencies. Exact details and explanations of other parameter differences can be found in Appendix B.

To probe object representations for affordance properties, we take the average of the hidden representations, i.e., the model's representations just prior to predicting explicit bounding box coordinates on the screen.

3.2 3D Visual-spatial Model

3.2.1 Full Model

Recent work has argued that models based directly on 3D game engine data are more cognitively plausible for modeling verb semantics (Ebert and Pavlick, 2020). In this spirit, we consider a model that learns to encode the object's visual appearance jointly with predicting the object's behavior in 3D space. Specifically, our model is trained with both an object loss and a trajectory loss, as follows.

To model the 3D trajectory, the model encodes a sequence P containing positions {p1, p2, ..., pt}. As described in Section 2, each position pi contains 9 distinct points corresponding to the object center and the 8 corners of the rectangular bounding box encapsulating the object. We use a single linear layer to project the 27-dimensional (nine 3D points) input coordinate vectors to the embedding dimension of a transformer (Vaswani et al., 2017). The transformer is then fed the first t − k timesteps, where k ≥ 1. We treat k as a hyperparameter, and find that a value of k = 8 or k = 16 tends to work best. Our model is trained to minimize the Mean Squared Error (MSE) computed against the true future location of the object, summed over all of the predicted points.
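The following is a minimal PyTorch sketch of this trajectory model. The layer sizes follow Appendix A.1 (one encoder and one decoder layer, one head, embedding dimension 100, feed-forward dimension 200); how the decoder is queried, and the omission of positional encodings, are simplifying assumptions rather than details taken from the paper.

```python
# Minimal sketch of the 3D trajectory transformer (assumptions noted in comments).
import torch
import torch.nn as nn

class TrajectoryTransformer(nn.Module):
    def __init__(self, d_model=100, k_future=8, n_points=9):
        super().__init__()
        in_dim = n_points * 3                      # 9 points x (x, y, z) = 27
        self.in_proj = nn.Linear(in_dim, d_model)  # project coordinates to d_model
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=1, num_encoder_layers=1,
            num_decoder_layers=1, dim_feedforward=200, batch_first=True)
        # Assumption: the decoder attends from k learned "future step" queries;
        # positional encodings are omitted for brevity.
        self.queries = nn.Parameter(torch.randn(k_future, d_model))
        self.out_proj = nn.Linear(d_model, in_dim)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        """context: (batch, t-k, 27) observed frames -> (batch, k, 27) predictions."""
        memory_in = self.in_proj(context)
        tgt = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        hidden = self.transformer(memory_in, tgt)
        return self.out_proj(hidden)

# Training objective: MSE against the true future positions.
model = TrajectoryTransformer()
ctx, future = torch.randn(4, 32, 27), torch.randn(4, 8, 27)
loss = nn.functional.mse_loss(model(ctx), future)
loss.backward()
```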

To model the object appearance, we give the model access to a static view of the object at rest. We use ResNet-34 (He et al., 2016) to encode the object's multiview, i.e., images of the object's six faces, one from each side of the object, denoted as I, and pass these as additional inputs to the model, separated by a SEP token. The transformer encoder encodes the sequence P and I together, and the transformer decoder predicts the object's next several positions in space. To encourage the model to connect the sequence and image representations, we randomly (50% of the time) replace the object in I with an object with different affordances and add an auxiliary loss in which the model must classify whether the object was perturbed. We add a linear binary classification layer on top of a CLS token to perform this task, and add the cross-entropy loss of this objective to our MSE loss for the trajectory objective.
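A compact sketch of the visual side of this objective is shown below. The paper specifies ResNet-34 multiview features, a SEP-separated joint sequence, a CLS-based "was the object swapped?" head, and a summed MSE + cross-entropy loss; the exact token layout, projection, pooling, and torchvision version (weights=None requires torchvision >= 0.13) are illustrative assumptions.

```python
# Sketch of the multiview encoder and the combined training loss (assumptions
# flagged above; not the authors' exact implementation).
import torch
import torch.nn as nn
import torchvision

class MultiviewEncoder(nn.Module):
    def __init__(self, d_model=100):
        super().__init__()
        resnet = torchvision.models.resnet34(weights=None)
        resnet.fc = nn.Identity()                    # expose 512-d features
        self.backbone = resnet
        self.proj = nn.Linear(512, d_model)          # match the transformer width

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        """views: (batch, 6, 3, H, W) six object faces -> (batch, 6, d_model) tokens."""
        b, v = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1))   # (batch*6, 512)
        return self.proj(feats).view(b, v, -1)

def combined_loss(pred_traj, true_traj, swap_logits, swap_labels):
    """Trajectory MSE plus cross-entropy on the 50%-of-the-time object-swap task."""
    return (nn.functional.mse_loss(pred_traj, true_traj)
            + nn.functional.cross_entropy(swap_logits, swap_labels))
```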

The hidden representation we use for probing experiments is the average-pooled transformer encoder output of the multiview tokens only.

3.2.2 Ablation Models

To better understand which aspects of the above model matter most, we also train and evaluate two ablated variants.

Without Visual Information (3D Blind). Our 3D Blind model is like the above, but contains no multiview tokens or associated loss. That is, the model is trained only on the 3D positional data, using an MSE loss to predict the future location of the object. For probing, we average the transformer encoder outputs across all timesteps and feed the single averaged embedding into the probing classifier. This model provides insight into how well the physical behavior alone, with no visual inputs, encodes key features for determining affordances, such as shape and material.

Without Pretraining (No-Training). Gibson believed that understanding affordances only required raw perception, without need for mental processing. Given how saliently actions like rolling and sliding are encoded in 3D coordinates (Figure 6), it is reasonable to ask how much benefit our pretraining objective provides for encoding affordance information. To test this, we evaluate a model that is identical to the 3D Blind model, but contains only randomly initialized encoder weights (i.e., which are never set via pretraining). If the pretraining task encodes affordance structure the way we hypothesize, the randomly initialized model should perform much worse than the trained 3D Blind variant. We refer to this model simply as the No-Training model.

4 Results

Figure 4 shows our primary results. Overall, the 3D Visual-spatial model substantially outperforms the 2D Vision-only model across all affordances, often by a large margin (4–11 percentage points). We also see, perhaps unexpectedly, that the 3D pretrained representations encode information about affordances even when the associated actions are not explicitly observed. For example, the model differentiates objects that can stack and objects that can contain other objects from those that cannot, even though the model has not directly observed objects being stacked or serving as containers during training. This result points to the richness of the physical information that is required to perform the pretraining task of next-state prediction.5

Looking more closely at the ablated variants of the 3D model, we see that most of the gains are from the 3D input representation itself. That is, the 3D No-Training model, which does not include visual information and does not even include pretraining, outperforms the CV baseline in all cases, and often substantially. Pretraining on top of the 3D inputs often (but not always) yields performance gains. Pretraining with visual information does not provide a clear benefit over pretraining on the spatial data alone: visual information leads to performance gains on three affordances (slide, roll, and stack) and losses on the other three (contain, w-grasp, and bounce).

5 We note that, unintuitively, stack and contain probes generally outperform slide probes. One reason may be that our data are labeled by object rather than by individual interaction. For example, although an object typically slides, it's not hard to imagine scenarios where a cardboard box might roll over. This is not the case for affordances like stack and contain. In the rolling cardboard box example, the sharp edges of the box and the distinct way it rolls are still indicative of the object being stackable.

Figure 4: Results for predicting affordances of objects given frozen hidden states of 2D and 3D sequence prediction models. Test sets are balanced so that a random guess achieves 50%. 3D models (even ablated variants) outperform the 2D computer vision models across the board.

Figure 5: t-SNE projections of model representations of sliding (red) vs. rolling (blue) objects: (a) 3D Blind model, (b) 3D Visual-spatial model, (c) 3D No-Training model, (d) 2D computer vision model (RPIN).

Figure 6: Visualization showing how 3D coordinate data clearly distinguishes a rolling object from a sliding one, making it easier for a model to learn the difference between the two.

5 Qualitative Analysis

In order to better understand the nature of the affordance-learning problem, we run a series of qualitative analyses on the trained models. We focus our analysis on the pair of affordances roll vs. slide. These verbs have received significant attention in prior literature (Pustejovsky and Krishnaswamy, 2014; Levin, 1993) since they exemplify the types of manner distinctions that we would like lexical semantic representations to encode.

We first compare the 2D video vs. 3D simulation variants of our pretraining objective. Figure 5 shows a t-SNE projection of the sequence representations from all four models, labeled based on whether the object affords rolling or sliding. We find that object representations from the 3D Blind model cluster strongly according to the distinction between these two concepts. The trend is notably not apparent in the No-Training model. Figure 6 demonstrates why spatial data pretraining may encourage this split. In the example shown, we take two thrown objects from our dataset, one round and one not round, and track the height of the center point of the object and one of the corners of the object's bounding box. When they hit the ground, the center point stays relatively constant as it moves across the floor in both cases, but in the rolling action the corner point moves up and down as it rotates around the center point. Since this is so distinguishable given the input representation, the model is better able to differentiate these concepts.
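The signal illustrated in Figure 6 can be summarized with a simple diagnostic (a sketch, not part of the model): for a rolling object, a bounding-box corner oscillates around the center height, while for a sliding object it stays nearly constant.

```python
# Sketch of the corner-vs-center height signal from Figure 6 (assumption:
# the center is stored as the last of the 9 points, and y is the vertical axis).
import numpy as np

def corner_oscillation(points: np.ndarray) -> float:
    """points: (t, 9, 3) trajectory of 8 corners plus the center.
    Returns the std of one corner's height relative to the center height."""
    center_y = points[:, -1, 1]
    corner_y = points[:, 0, 1]
    return float(np.std(corner_y - center_y))

# A large value suggests rotation (rolling); a small value suggests sliding.
```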

It may be that the next-state prediction task facilitates learning the slide vs. roll distinction in the 3D Blind setting. However, the same pattern is not present in the 3D Visual-spatial model (which also predicts next state). One possibility is that the presence of visual information competes with the 3D information, and as a result the joint space does not encode this distinction as well as the 3D space alone. Designing more sophisticated models that incorporate visual and spatial information and preserve the desirable features of both is an interesting area for future work.

5.1 Counterfactual Perturbations

An important aspect of lexical semantics is determining the entailments of a word, e.g., what about an observation allows it to be described truthfully as roll? Thus, in asking whether affordances are learned from next-state-prediction pretraining, it is important to understand not just whether the model can differentiate the concepts, but whether it differentiates them for the “right” reasons.

We investigate this using counterfactual perturbations of the inputs as a way of doing feature attribution, similar in spirit to prior work in NLP (Huang et al., 2020) and CV (Goyal et al., 2019). Specifically, we create a controlled dataset in which, for each of 10 interactions, we generate 20 minimal pairs which differ from their originals by a single parameter of the underlying physics simulation. The parameters we perturb are {mass, force velocity, starting x position, starting z position, shape, angular rotation}. For example, given an instance of a lamp rolling across the floor, we would generate one minimally-paired example in which we only change the mass of the lamp, and another the same as the original except that it does not exhibit any angular rotation, and so forth for each of the parameters of interest. More implementation details are given in Appendix C.

We use our pretrained slide probe to classify the representations from each sequence as either rolling or sliding, and compare the effect of each perturbation on the model's belief about the affordance label. Figure 7 shows the resulting belief changes for several of the perturbed parameters. We see that changing the angular rotation of an otherwise identical sequence has the greatest effect on whether an instance is deemed to afford rolling. This is an encouraging result, as it aligns well with standard lexical semantic analyses: i.e., generally, roll is assumed to entail rotation in the direction of the translation.
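Measuring this belief change amounts to comparing the probe's output probability on a minimal pair. A sketch is below; the encoder and probe interfaces mirror the probing sketch in Section 2.3, and the use of a softmax over the two probe logits is an assumption about how the probability is read out.

```python
# Sketch of the counterfactual belief-change measurement for one minimal pair
# (assumes a single sequence per call, i.e., a batch of size 1).
import torch

@torch.no_grad()
def belief_change(encoder, probe, original_seq, perturbed_seq, positive_class=1):
    """Return p(affordance | perturbed) - p(affordance | original)."""
    def prob(seq):
        rep = encoder(seq)                     # frozen pretrained encoder
        return torch.softmax(probe(rep), dim=-1)[..., positive_class]
    return (prob(perturbed_seq) - prob(original_seq)).item()
```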

Figure 7: Change in predicted probability of the encoding of a round object affording slide after generating the interaction again with one feature changed (see Appendix C for a visualization).

However, our analysis also reveals that the models rely on some spurious features which, ideally, would not be part of the lexical semantic representation. For example, the 3D Blind model is affected by the travel distance of the object. If we increased the mass or decreased the force applied to a rolling object, such that it only moved a small distance or rotated a small number of times, the model was less inclined to label the instance as rolling, though this was usually not by enough to have the undesirable effect of flipping the prediction. Intuitively, this makes sense given the model's training data: rolling objects tend to travel a greater distance than sliding objects. An interesting direction for future work is to investigate how changes in pretraining or data distribution influence which features are encoded as “entailments”, i.e., key distinguishing features of a concept's representation.

6 Related Work

6.1 Lexical Semantics and Cognitive Science

In formal semantics, there has been significant support for the idea that motor trajectories and affordances should form the basis of lexical semantic representations (Pustejovsky and Krishnaswamy, 2014; Siskind, 2001; Steedman, 2002). Such work builds on the idea in cognitive science that simulation lies at the heart of cognitive and linguistic processing (Feldman, 2008; Bergen et al., 2007; Bergen, 2012). For example, Borghi and Riggio (2009) argue that language comprehension involves mental simulation resulting in a "motor prototype" which encodes stable affordances and affects processing speed for identifying objects. Cosentino et al. (2017) point to such simulation as a factor in determining surprisal of affordances depending on linguistic context. Similar arguments have been made based on evidence from fMRI data (Sakreida et al., 2013) as well as processing in patients with brain lesions (Taylor et al., 2017). It is worth noting that there is debate on the general nature of affordances in humans. For instance, Mota (2021) argues that affordances are not solely perceptual. We view our work as being compatible with this more general view of affordances, in which direct perception plays a role, but not the only role, in concept formation.

6.2 Affordances in Language Technology

The idea of affordances has been incorporated into work on robotics (Sahin et al., 2007; Zech et al., 2017). Kalkan et al. (2014) and Ugur et al. (2009) build a model of affordances based on (object, action, effect) tuples, but focus only on start and end state, and do not encode anything about manner. Relatedly, Nguyen et al. (2020) connect images of objects to language queries describing their uses.

Affordances are also well studied for text understanding tasks. McGregor and Jezek (2019) discuss the importance of affordances in disambiguating the meaning of sentences such as "we finished the wine". Other neural net based approaches for affordance learning have relied on curated datasets with explicit affordance labels for each object (Chao et al., 2015; Do et al., 2018). Sometimes, affordance datasets leverage multimodal settings such as images (Myers et al., 2015), or 3D models and environments (Suglia et al., 2021; Mandikal and Grauman, 2021; Nagarajan and Grauman, 2020), but require annotations for every object. In contrast, our model learns affordances in an unsupervised manner, and unlike Fulda et al. (2017), Loureiro and Jorge (2018), McGregor and Lim (2018), and Persiani and Hellström (2019), which extract affordance structure automatically from word embeddings alone, our model learns from interacting with objects in a 3D space, grounding its representations to cause-and-effect pairs of physical forces and object motion.

6.3 Physical Commonsense Reasoning

There has been success in building deep learning networks that reason about object physics by learning to predict their trajectories. These can be broken up into either predicting points in 3D space given object locations (like our approach; e.g., Mrowca et al. (2018), Byravan and Fox (2017), Battaglia et al. (2016), Fragkiadaki et al. (2016), Ye et al. (2018), Rempe et al. (2020)) or inferring future bounding box locations of objects in videos (Weng et al., 2006; Do et al., 2018; Qi et al., 2021; Ding et al., 2021). Both approaches have been successful in encoding complex visual and physical features of objects. We focus on training with 3D simulations, but also test a visual dynamics model (Qi et al., 2021) to compare the affordance information that is encoded from spatial vs. visual data.

More broadly, we contribute to a line of work on building non-linguistic representations of lexical concepts (Bisk et al., 2019). Explicit attempts at grounding to the physical world ground language in 2D images or videos (i.e., pixels) (Hahn et al., 2019; Groth et al., 2018), despite the fact that recent work suggests that text and video pretraining offers no boost to lexical semantic understanding (Yun et al., 2021). Such efforts motivate the creation of large datasets such as Krishna et al. (2016), Yatskar et al. (2016), and Gupta and Malik (2015), which require in-depth human-provided annotations that provide a limited list of semantic roles of objects.

Our approach is most directly related to prior work that learns in interactive, 3D settings (Thomason et al., 2016; Ebert and Pavlick, 2020). Especially related are Nagarajan and Grauman (2020) and Zellers et al. (2021). However, their models do not directly ground to the physical phenomena (e.g., entailed positional changes). Instead, they use a symbolic vocabulary of object state changes, whereas our model learns from unlabeled interactions.

7 Conclusion

We propose an unsupervised pretraining method for learning representations of object affordances from observations of interactions in a 3D environment. We show that 3D trajectory data is a strong signal for grounding such concepts and performs better than a standard computer vision approach for learning the desired concepts. Moreover, we show through counterfactual analyses that the learned representations can encode the desired entailments, e.g., that roll entails axial rotation.

Our work contributes to an existing line of work that seeks to develop lexical semantic representations of nouns and verbs that are grounded in physical simulations. We advance this agenda by offering a way in which modern “foundation model” approaches to visual and linguistic processing can in fact be corroborative of traditional feature-based approaches to formal lexical semantics. Our results suggest a promising direction for future work, in which pretraining objectives can be augmented to include richer notions of embodiment (e.g., planning, agent-agent interaction) and consequently encode richer lexical semantic structure (e.g., presuppositions, transitivity).

Acknowledgments

This research is supported in part by DARPA via the GAILA program (HR00111990064). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government. Thanks to Chen Sun, Roman Feiman, members of the Language Understanding and Representation (LUNAR) Lab and AI Lab at Brown, and the reviewers for their help and feedback on this work.

References

Stéphane Aroca-Ouellette, Cory Paik, Alessandro Roncone, and Katharina Kann. 2021. PROST: Physical reasoning about objects through space and time. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4597–4608, Online. Association for Computational Linguistics.

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. 2016. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Benjamin K Bergen. 2012. Louder than words: The new science of how the mind makes meaning. Basic Books.

Benjamin K Bergen, Shane Lindsay, Teenie Matlock, and Srini Narayanan. 2007. Spatial and linguistic aspects of visual imagery in sentence comprehension. Cognitive Science, 31(5):733–764.

Yonatan Bisk, Jan Buys, Karl Pichotta, and Yejin Choi. 2019. Benchmarking hierarchical script knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4077–4085, Minneapolis, Minnesota. Association for Computational Linguistics.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, and et al. 2021. On the opportunities and risks of foundation models. CoRR, abs/2108.07258.

Anna M Borghi and Lucia Riggio. 2009. Sentence comprehension and simulation of object temporary, canonical and stable affordances. Brain Research, 1253:117–128.


Arunkumar Byravan and Dieter Fox. 2017. SE3-nets: Learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 173–180.

Yu-Wei Chao, Zhan Wang, Rada Mihalcea, and Jia Deng. 2015. Mining semantic affordances of visual object categories. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4259–4267.

Erica Cosentino, Giosuè Baggio, Jarmo Kontinen, and Markus Werning. 2017. The time-course of sentence meaning composition. N400 effects of the interaction between context-induced and lexically stored affordances. Frontiers in Psychology, 8.

David Ding, Felix Hill, Adam Santoro, Malcolm Reynolds, and Matt Botvinick. 2021. Attention over learned object embeddings enables complex visual reasoning. In Advances in Neural Information Processing Systems, volume 34, pages 9112–9124. Curran Associates, Inc.

Thanh-Toan Do, Anh Nguyen, and Ian Reid. 2018. AffordanceNet: An end-to-end deep learning approach for object affordance detection. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5882–5889.

Dylan Ebert and Ellie Pavlick. 2020. A visuospatial dataset for naturalistic verb learning. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 143–153.

Jerome Feldman. 2008. From molecule to metaphor: A neural theory of language. MIT Press.

Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. 2016. Learning visual predictive models of physics for playing billiards. arXiv:1511.07404 [cs].

Nancy Fulda, Daniel Ricks, Ben Murdoch, and David Wingate. 2017. What can you do with a rock? Affordance extraction via word embeddings. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1039–1045.

James J Gibson. 1977. The theory of affordances. Hilldale, USA, 1(2):67–82.

Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Counterfactual visual explanations. In International Conference on Machine Learning, pages 2376–2384. PMLR.

Oliver Groth, Fabian B. Fuchs, Ingmar Posner, and Andrea Vedaldi. 2018. ShapeStacks: Learning vision-based physical intuition for generalised object stacking. In The European Conference on Computer Vision (ECCV).

Saurabh Gupta and Jitendra Malik. 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474.

Meera Hahn, Andrew Silva, and James M. Rehg. 2019. Action2Vec: A crossmodal embedding approach to action learning. arXiv:1901.00484 [cs].

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. 2020. Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83, Online. Association for Computational Linguistics.

Sinan Kalkan, Nilgün Dag, Onur Yürüten, Anna M Borghi, and Erol Sahin. 2014. Verb concepts from affordances. Interaction Studies, 15(1):1–37.

Najoung Kim and Paul Smolensky. 2021. Testing for grammatical category abstraction in neural language models. Proceedings of the Society for Computation in Linguistics, 4(1):467–470.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting language and vision using crowdsourced dense image annotations.

Beth Levin. 1993. English verb classes and alternations: A preliminary investigation. University of Chicago Press.

Daniel Loureiro and Alípio Jorge. 2018. Affordance extraction and inference based on semantic role labeling. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 91–96, Brussels, Belgium. Association for Computational Linguistics.

Priyanka Mandikal and Kristen Grauman. 2021. Learning dexterous grasping with object-centric visual affordances. In ICRA.

Claudia Mazzuca, Chiara Fini, Arthur Henri Michalland, Ilenia Falcinelli, Federico Da Rold, Luca Tummolini, and Anna M. Borghi. 2021. From affordances to abstract words: The flexibility of sensorimotor grounding. Brain Sciences, 11(10).

Stephen McGregor and Elisabetta Jezek. 2019. A distributional model of affordances in semantic type coercion. In Proceedings of the 13th International Conference on Computational Semantics - Short Papers, pages 1–7, Gothenburg, Sweden. Association for Computational Linguistics.

Stephen McGregor and KyungTae Lim. 2018. Affordances in grounded language learning. In Proceedings of the Eighth Workshop on Cognitive Aspects of Computational Language Learning and Processing, pages 41–46, Melbourne. Association for Computational Linguistics.

Sergio Mota. 2021. Dispensing with the theory (and philosophy) of affordances. Theory & Psychology, 31(4):533–551.

Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li Fei-Fei, Joshua B. Tenenbaum, and Daniel Yamins. 2018. Flexible neural representation for physics prediction. In NeurIPS.

Austin Myers, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Affordance detection of tool parts from geometric features. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1374–1381.

Tushar Nagarajan and Kristen Grauman. 2020. Learning affordance landscapes for interaction exploration in 3D environments. In NeurIPS.

Thao Nguyen, Nakul Gopalan, Roma Patel, Matthew Corsaro, Ellie Pavlick, and Stefanie Tellex. 2020. Robot object retrieval with contextual natural language queries. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA.

Michele Persiani and Thomas Hellström. 2019. Unsupervised inference of object affordance from text corpora. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 115–120, Turku, Finland. Linköping University Electronic Press.

James Pustejovsky and Nikhil Krishnaswamy. 2014. Generating simulations of motion events from verbal descriptions. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014), pages 99–109, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.

James Pustejovsky and Nikhil Krishnaswamy. 2018. Every object tells a story. In Proceedings of the Workshop Events and Stories in the News 2018, pages 1–6, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Haozhi Qi, Xiaolong Wang, Deepak Pathak, Yi Ma, and Jitendra Malik. 2021. Learning long-term visual dynamics with region proposal interaction networks. In ICLR.

Davis Rempe, Srinath Sridhar, He Wang, and Leonidas J. Guibas. 2020. Predicting the physical dynamics of unseen 3D objects. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV).

Katrin Sakreida, Claudia Scorolli, Mareike M Menz, Stefan Heim, Anna M Borghi, and Ferdinand Binkofski. 2013. Are abstract action words embodied? An fMRI investigation at the interface between language and motor cognition. Frontiers in Human Neuroscience, 7:125.

Jeffrey Mark Siskind. 2001. Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research, 15:31–90.

Mark Steedman. 2002. Plans, affordances, and combinatory grammar. Linguistics and Philosophy, 25(5):723–753.

Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, and Gaurav S. Sukhatme. 2021. Embodied BERT: A transformer model for embodied, language-guided visual task completion. CoRR, abs/2108.04927.

Lawrence J. Taylor, Carys Evans, Joanna Greer, Carl Senior, Kenny R. Coventry, and Magdalena Ietswaart. 2017. Dissociation between semantic representations for motion and action verbs: Evidence from patients with left hemisphere lesions. Frontiers in Human Neuroscience, 11.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. 2018. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Jesse Thomason, Jivko Sinapov, Maxwell Svetlik, Peter Stone, and Raymond J. Mooney. 2016. Learning multi-modal grounded linguistic semantics by playing "I Spy". In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16), pages 3477–3483, New York City.

Emre Ugur, Erol Sahin, and Erhan Oztop. 2009. Predicting future object states using learned affordances. In 2009 24th International Symposium on Computer and Information Sciences, pages 415–419. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Shiuh-Ku Weng, Chung-Ming Kuo, and Shu-Kang Tu. 2006. Video object tracking using adaptive Kalman filter. Journal of Visual Communication and Image Representation, 17(6):1190–1208.

Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5534–5542.

Tian Ye, Xiaolong Wang, James Davidson, and Abhinav Gupta. 2018. Interpretable intuitive physics model. In ECCV.

Tian Yun, Chen Sun, and Ellie Pavlick. 2021. Does vision-and-language pretraining improve lexical grounding? In Findings of EMNLP.


Philipp Zech, Simon Haller, Safoura Rezapour Lakani, Barry Ridge, Emre Ugur, and Justus Piater. 2017. Computational models of affordance in robotics: A taxonomy and systematic classification. Adaptive Behavior, 25(5):235–271.

Rowan Zellers, Ari Holtzman, Matthew E. Peters, R. Mottaghi, Aniruddha Kembhavi, Ali Farhadi, and Yejin Choi. 2021. PIGLeT: Language grounding through neuro-symbolic interaction in a 3D world. In ACL/IJCNLP.

Erol Sahin, Maya Çakmak, Mehmet R. Dogar, Emre Ugur, and Göktürk Üçoluk. 2007. To afford or not to afford: A new formalization of affordances toward affordance-based robot control. Adaptive Behavior, 15(4):447–472.

269

A Appendix A

Figure 8: Example of an interaction from SPATIAL. The model predicts the position of the soccer ball at future timesteps. To do that, it must encode some knowledge that soccer balls bounce and roll. As input, our model takes 9 3D points: the eight corners of the box surrounding the ball, plus the center point.

A.1 Spatial Model Training Details

For both of the 3D spatial data models, we train an encoder-decoder transformer with one encoder layer and one decoder layer, each with one attention head. We found that changing the number of attention heads did not affect performance noticeably in either direction. We use a batch size of 64 and a transformer embedding dimension of 100. We use a feed-forward dimension of 200. We initialize with a learning rate of 1e-4. The models were trained to predict the next k = 8–16 frames, and we did not see a large benefit in training to predict longer sequences. We trained the models for 400 epochs, although we noticed the ablated 3D Blind model tended to converge at or before 100 epochs across our experiments.

The beginning of the sequence, which was up to four seconds minus the k prediction frames, was fed into the transformer encoder, which encoded representations of dimension e. We averaged these output embeddings as input in our probing experiments. The e-dimensional embeddings were fed into the decoder network, which then predicts the next k frames. We believe that training with longer sequences would be more beneficial for training a decoder-only model, which we would like to explore in future work. In preliminary experiments, we tested whether masking a proportion of frames in the encoder would be beneficial for the representation learning task. We saw a slight decrease in performance, and so did not perform a thorough analysis on the effect of masking.
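The settings from this appendix can be collected into a single configuration, as in the sketch below. The values are those stated above; anything beyond them (e.g., the optimizer choice) is an assumption.

```python
# Training configuration for the 3D spatial models as described in this
# appendix. The optimizer line is an assumption; only the initial learning
# rate is stated in the text.
config = dict(
    num_encoder_layers=1, num_decoder_layers=1, nhead=1,
    d_model=100, dim_feedforward=200,
    batch_size=64, learning_rate=1e-4,
    predict_frames=(8, 16),   # k, the number of future frames to predict
    epochs=400,               # the 3D Blind ablation typically converged by ~100
)

# Assumed optimizer (not specified in the paper):
# optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```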

A.2 t-SNE Configuration

We report a t-SNE of representations derived from our 3D Blind model and the 2D Visual model. The parameters for creating each t-SNE were similar but varied in a few ways. Common hyperparameters: learning rate 200, 1000 iterations, gradient-norm stopping threshold 1e-7. 3D Blind t-SNE: perplexity 30, initialized randomly. 2D Visual t-SNE: perplexity 5, initialized with PCA. We found that random initialization was inconsistent in that it would sometimes cause small clouds of dense points to appear as their own clusters.
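For reference, the sketch below shows these settings using scikit-learn; the paper does not name the implementation used, and the iteration-count argument is called n_iter or max_iter depending on the scikit-learn version, so it is left at its default (1000) here.

```python
# Sketch of the two t-SNE configurations described above (assumption: a
# scikit-learn implementation; the representations here are random placeholders).
import numpy as np
from sklearn.manifold import TSNE

reps = np.random.randn(200, 100)   # placeholder for frozen model representations

blind_tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
                  init="random", min_grad_norm=1e-7)
visual_tsne = TSNE(n_components=2, perplexity=5, learning_rate=200,
                   init="pca", min_grad_norm=1e-7)

blind_2d = blind_tsne.fit_transform(reps)
print(blind_2d.shape)  # (200, 2)
```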

B Appendix B

B.1 RPIN Training Details

We use a learning rate of 1e-3 with a batch size of 50 and train for a maximum of 20M iterations with 40,000 warmup iterations. Training data is augmented with random horizontal flips. Unlike Qi et al. (2021), we don't use vertical flips because our videos contain objects falling due to gravity. One important difference is that at training time the model predicts 10 frames in the future, and at test time predicts 20 (as opposed to 20 and 40, respectively, in Simulated Billiards). Within one video, our interactions seem more complex than one sequence in the Simulated Billiards dataset, so we introduced this difference to create more training examples.

C Appendix C

C.1 Counterfactual Perturbations Setup

We start with a base set of 10 sequences: 5 with a sliding object (cardboardBox_03) and 5 with a rolling object (BuckyBall). We then create 20 minimal-edit perturbations of each base sequence to create a final set of 200 sequences. We perturb the following features one at a time: {mass, force velocity, starting x position, starting z position, shape, angular rotation}. For most features, we generate 4 perturbations. For example, the x and z positions are altered by {-2m, -1m, +1m, +2m}, where 'm' is the Unity meter. All objects start with 1.14 units of mass, and, similarly to the starting-position variable, the mass is altered to 1.14 + (i × 0.1), where i is in the set {-2, -1, 1, 2}. For the shape parameter, we only change the 3D model used to generate the base sequence.


For the sliding videos, we use plate02 and book_0001c. For the rolling videos, we use BombBall and a modified Soccer Ball. Note that we use the Soccer Ball model that is in the train set, but modify its mass (1.14) and size so that it is technically an unseen object. We chose to do this because we wanted to use a more plain spherical object, which was not an option for the remaining test objects. Angular rotation either perturbs the sequence by freezing the rotation along all axes (in the case of objects that normally roll) or by replacing the physics collider with a sphere, causing the object to roll, in the case of objects that tend to slide instead of roll. Figure 11 shows additional perturbations and a sliding-object example of the counterfactual analysis.
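The numeric perturbation values described above can be generated as in the sketch below; the mass and position offsets come from this appendix, while how they are fed back into the Unity simulation is outside the scope of the sketch.

```python
# Sketch of generating the perturbed mass and starting-position values.
BASE_MASS = 1.14          # Unity mass units
OFFSETS = (-2, -1, 1, 2)

mass_values = [round(BASE_MASS + i * 0.1, 2) for i in OFFSETS]   # perturbed masses
x_offsets_m = [float(i) for i in OFFSETS]                        # meters, x position
z_offsets_m = [float(i) for i in OFFSETS]                        # meters, z position

print(mass_values)   # [0.94, 1.04, 1.24, 1.34]
```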


Table 1: All objects in the dataset and their associated affordances. Columns: Object, Test set?, Slide, Roll, Stack, Contain, W-Grasp, Bounce. Objects: BombBall, EyeBall, SpikeBall, Vase_Amphora, Vase_Hydria, Vase_VoluteKrater, book_0001a, book_0001b, book_0001c, bowl01, cardboardBox_01, cardboardBox_02, cardboardBox_03, Cola Can, Pen black, Gas Bottle, Soccer Ball, can small, can, meat can box, spam can, AtomBall, Bottle2, plate02, plate02_flat, Bottle1, WheelBall, wine bottle 04, coin, BuckyBall, SplitMetalBall, bowl02, bowl03, mug02, mug03, Old_USSR_Lamp_01, lamp, Ladle, Apple.


Binary classification accuracy of affordance probes (random = 50%). Each model column shows Acc. (%) / F1.

Affordance | N examples | 3D Blind | 3D Visual-Spatial | 2D Visual | No-Training
Slide      | 1715       | 59.2 / 62 | 63.7 / 67 | 59.6 / 62 | 61.2 / 64
Roll       | 1258       | 66.5 / 66 | 69.0 / 70 | 56.0 / 58 | 66.3 / 66
Stack      | 1307       | 64.7 / 65 | 65.7 / 63 | 58.0 / 58 | 63.4 / 63
Contain    | 1510       | 66.5 / 68 | 65.2 / 67 | 58.2 / 63 | 59.3 / 64
W-Grasp    | 1652       | 67.2 / 68 | 66.3 / 68 | 62.9 / 61 | 63.4 / 64
Bounce     | 276        | 82.6 / 83 | 79.7 / 79 | 73.9 / 76 | 76.0 / 75

Table 2: Results from probing experiments on RPIN compared to the Unity models trained on the same amount of data. Because data was limited, we partition the data so that there is an even number of positive and negative examples in the test set for each affordance. Interaction-based pretraining outperforms visual dynamics in all categories.

Affordance | Number of objects | Percentage of objects (%)
Slide      | 22 | 56.41
Roll       | 23 | 58.97
Stack      | 17 | 43.59
Contain    | 8  | 20.51
Wrap-grasp | 13 | 33.33
Bounce     | 7  | 17.95

Table 3: Each affordance we are interested in learning, with the number and percentage of the 39 objects that have a positive label for that affordance.

RPIN model validation loss in the t ∈ [T_train, 2 × T_train] setting.

Model | Loss (MSE)
SimB (Qi et al., 2021) | 25.77
SimB (our results)     | 15.53
Unity Videos           | 20.98

Table 4: RPIN validation loss (MSE) when predicting in the t ∈ [T_train, 2 × T_train] setting, on Simulated Billiards (SimB) and on videos of our Unity interactions.


Figure 9: All objects that were used in training and testing, grouped by category (Balls, Boxes, Bowls, Dish-like, Cans & Mugs, Vases, Bottles, Books, Ladle, Lamps, Pen) and split into train and test. Some objects in the test set are visually similar to their training analogues, but differ in size and mass.


Figure 10: Results from training the RPIN visual dynamics model on videos of our Unity dataset interactions. Red circles show the predicted subsequent center points of the object's bounding boxes, given the start of the interaction.


Figure 11: Two examples from the counterfactual analysis that show robustness to changing spurious features. The table in the top example displays the changes in probability of predicting the object as sliding. Conversely, the table in the bottom example shows the change in probability of predicting the object as rolling. Arrows in the left table indicate where the perturbation does affect the label of the action (either by making the object able or unable to roll). In both cases, the probe correctly flips its prediction on the encoding. The sequence prediction model appears to be sensitive to certain features such as distance traveled. For example, changing the object from the "bucky ball" to the "bomb ball" decreases the model's confidence that the object is rolling (though the probe still correctly assigns a majority of the probability to roll). However, in this perturbation, the bomb ball gets stuck on its 'cap' (Figure 9) and only completes one rotation.


Figure 12: Left: a sequence generated with normal physics. Right: rotation locked, with all other physical properties of the interaction the same. Freezing the rotation such that the object slides causes the model to encode the action as a slide rather than a roll.
