arXiv:2109.12212v2 [cs.CL] 29 Sep 2021

30
An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog Xingyao Wang University of Michigan [email protected] David Jurgens University of Michigan [email protected] Abstract Online conversations include more than just text. Increasingly, image-based responses such as memes and animated gifs serve as culturally recognized and often humorous re- sponses in conversation. However, while NLP has broadened to multimodal models, conver- sational dialog systems have largely focused only on generating text replies. Here, we intro- duce a new dataset of 1.56M text-gif conversa- tion turns and introduce a new multimodal con- versational model PEPE THE KING PRAWN for selecting gif-based replies. We demon- strate that our model produces relevant and high-quality gif responses and, in a large ran- domized control trial of multiple models re- plying to real users, we show that our model replies with gifs that are significantly better re- ceived by the community. 1 Introduction Conversations are central to many online social platforms. While most conversations are text- based, computer mediated dialog also affords al- ternative forms of communication, such as emoji or stickers like bitmoji, that allow users to ex- press themselves (Tang and Hew, 2019; Konrad et al., 2020). Increasingly, these visual forms of communication have become common in social media (Bourlai and Herring, 2014; Highfield and Leaver, 2016), with a notable use of the reaction gif (Bakhshi et al., 2016; Miltner and Highfield, 2017). These gifs are short video sequences that depict a particular scene and sometimes contain text that acts as a meta-commentary (Eppink, 2014). As a result, conversations become multimodal where in- dividuals reply to one another using combinations of text and gifs (Figure 1). While conversational AI systems have been developed in a purely text- based setting, such systems do not capture the full multimodal behavior seen online. Here, we study multimodal conversation by introducing new dialog models for selecting gif replies in conversation. PizzaMagic: Ahhhhh!!! The EMNLP deadline is in 24 hours!! x CasualModel: Figure 1: Gif responses in conversation like the one shown above are embodied dialog that use visual im- agery to convey reactions and emotions. This paper de- velops a system to select the appropriate gif response to messages. (PDF best viewed with Adobe Acrobat) Conversation analysis is central to NLP and mul- tiple approaches have analyzed this dialog struc- ture (Jurafsky et al., 1998; Pareti and Lando, 2018; Cohn et al., 2019) and developed conversational agents to engage with people (e.g., Fang et al., 2018; Xu et al., 2020; Hong et al., 2020). Recent work has focused on generating open domain social chatbots that engage in sustained conversations in a natural way (Ram et al., 2018). Because many of these systems are designed to support voice- based dialog, they overlook non-textual forms of interaction used in social media conversations. In parallel, multimodal NLP systems have been devel- oped for image data, often focusing on image-to- text tasks such as image captioning (Melas-Kyriazi et al., 2018; Sharma et al., 2018) and visual ques- tion answering (Antol et al., 2015; Huang et al., 2019; Khademi, 2020). More recent work has fo- cused on the reverse text-to-image dimension, such as generating an image from a description (Niu et al., 2020; Ramesh et al., 2021). Our work unites these two strands of research by integrating image- based communication into conversational agents. Our paper offers three main contributions. First, arXiv:2109.12212v2 [cs.CL] 29 Sep 2021

Transcript of arXiv:2109.12212v2 [cs.CL] 29 Sep 2021

An animated picture says at least a thousand words: Selecting Gif-basedReplies in Multimodal Dialog

Xingyao WangUniversity of Michigan

[email protected]

David JurgensUniversity of [email protected]

AbstractOnline conversations include more than justtext. Increasingly, image-based responsessuch as memes and animated gifs serve asculturally recognized and often humorous re-sponses in conversation. However, while NLPhas broadened to multimodal models, conver-sational dialog systems have largely focusedonly on generating text replies. Here, we intro-duce a new dataset of 1.56M text-gif conversa-tion turns and introduce a new multimodal con-versational model PEPE THE KING PRAWNfor selecting gif-based replies. We demon-strate that our model produces relevant andhigh-quality gif responses and, in a large ran-domized control trial of multiple models re-plying to real users, we show that our modelreplies with gifs that are significantly better re-ceived by the community.

1 Introduction

Conversations are central to many online socialplatforms. While most conversations are text-based, computer mediated dialog also affords al-ternative forms of communication, such as emojior stickers like bitmoji, that allow users to ex-press themselves (Tang and Hew, 2019; Konradet al., 2020). Increasingly, these visual forms ofcommunication have become common in socialmedia (Bourlai and Herring, 2014; Highfield andLeaver, 2016), with a notable use of the reaction gif(Bakhshi et al., 2016; Miltner and Highfield, 2017).These gifs are short video sequences that depict aparticular scene and sometimes contain text thatacts as a meta-commentary (Eppink, 2014). As aresult, conversations become multimodal where in-dividuals reply to one another using combinationsof text and gifs (Figure 1). While conversationalAI systems have been developed in a purely text-based setting, such systems do not capture the fullmultimodal behavior seen online. Here, we studymultimodal conversation by introducing new dialogmodels for selecting gif replies in conversation.

PizzaMagic: Ahhhhh!!! The EMNLP deadlineis in 24 hours!!

x CasualModel:

Figure 1: Gif responses in conversation like the oneshown above are embodied dialog that use visual im-agery to convey reactions and emotions. This paper de-velops a system to select the appropriate gif responseto messages. (PDF best viewed with Adobe Acrobat)

Conversation analysis is central to NLP and mul-tiple approaches have analyzed this dialog struc-ture (Jurafsky et al., 1998; Pareti and Lando, 2018;Cohn et al., 2019) and developed conversationalagents to engage with people (e.g., Fang et al.,2018; Xu et al., 2020; Hong et al., 2020). Recentwork has focused on generating open domain socialchatbots that engage in sustained conversations ina natural way (Ram et al., 2018). Because manyof these systems are designed to support voice-based dialog, they overlook non-textual forms ofinteraction used in social media conversations. Inparallel, multimodal NLP systems have been devel-oped for image data, often focusing on image-to-text tasks such as image captioning (Melas-Kyriaziet al., 2018; Sharma et al., 2018) and visual ques-tion answering (Antol et al., 2015; Huang et al.,2019; Khademi, 2020). More recent work has fo-cused on the reverse text-to-image dimension, suchas generating an image from a description (Niuet al., 2020; Ramesh et al., 2021). Our work unitesthese two strands of research by integrating image-based communication into conversational agents.

Our paper offers three main contributions. First,

arX

iv:2

109.

1221

2v2

[cs

.CL

] 2

9 Se

p 20

21

we propose the new task of selecting gif responsesin multimodal conversation analysis and introducea new dataset of 1,562,701 real-world conversa-tion turns with gif replies. Second, we introducea new model PEPE THE KING PRAWN that fusesimage and text-based features to select a relevantgif response. In in-house experiments, we showthat our model substantially outperforms strongbaseline models at selecting the exact gif used inreal data and, in a manual test of the quality of thebest responses, achieves an nDCG of 0.8145 on theannotated test set. Third, in a real-world test, wedeploy our model as a part of a large-scale random-ized controlled trial and show that the gif repliesproduced by our model are more highly voted bythe community. Data, code, and models are avail-able at https://github.com/xingyaoww/gif-reply.

2 GIF Communications

Gifs have been widely adopted in communicationas a natural form of embodied speech where the vi-sual imagery conveys emotions or a reaction as a re-sponse (Bakhshi et al., 2016; Tolins and Samermit,2016). These gifs commonly come from widely-known cultural products, such as movies or televi-sion shows, which provides common knowledgefor how they could be interpreted (Eppink, 2014;Miltner and Highfield, 2017). However, a single gifmay have multiple interpretations, depending onthe context, cultural knowledge of its content, andthe viewer (Jiang et al., 2017). As a result, a singlegif can serve multiple functions in communication(Tolins and Samermit, 2016).

Gifs have grown in their use through increas-ing affordances by platforms like Tumblr, Reddit,Imgur, and Twitter that allow gifs to be nativelydisplayed like text in conversation threads (Jianget al., 2018). Further, gif-based keyboards havebeen introduced that allow users to search for gifsthat have been tagged with keywords or other meta-data (Griggio et al., 2019). Yet, these technologiesrequire that gif data be prepared with sufficient tagsto be searchable or to have sufficient data to usecollaborative filtering techniques for recommenda-tions (Jiang et al., 2018, p.9). As a result, there is aclear gap in identifying appropriate response gifsdirectly from the text, which this work fills.

3 Data

Despite the widespread use of gifs, no standarddataset exists for text and gif replies. Further, al-

though platforms like Twitter support gif replies,these gifs are not canonicalized to identify which re-sponses correspond to the same gif. Therefore, weconstruct a new dataset for this task by collectingresponses, matching their images, and augmentingthis data with metadata about the gif, where possi-ble. A visual description of the whole procedurecan be found in Appendix Figure 7.

3.1 Gif Response Data

Gifs have many uses (Miltner and Highfield, 2017)and so we use a two-step approach to collect datathat focus specifically on those likely to be usedin conversation. First, gif responses are collectedfrom Twitter by identifying all replies to English-language tweets containing animated_gif asembedded media. Tweets were collected from a∼10% sample of Twitter from March 13th, 2019to Jan 24th, 2020, totaling 42,096,566 tweets witha gif that we were able to retrieve. Twitter does notcanonicalize its gifs so two separate gif files mayactually have the same imagery. Further, these filesmay not be identical due to small differences suchas color variations or aspect ratios. To identify usesof the reference gifs, we use Average Hash fromthe imagehash library to create low-dimensionalrepresentations of each gif where hash distancecorresponds to perceptual distance. Since gifs areanimated and may contain varying scenes, we com-pute the hash for the first, middle, and final frames,concatenating these into a single hash. Two gifsare considered the same if (i) they have identicalhashes or (ii) their hamming distance is < 10 andgifs with that hash have been used more than 500times in Twitter. This latter condition was selectedafter manual evaluation of thresholds to trade-offbetween increasing the size of the training data andreducing potential noise caused by matching error.A visual example of this process can be found inAppendix Figure 8.

Not all gif responses in the Twitter data are con-versational or appropriate for wider re-use. There-fore, we filter these responses to only those gifswhose imagery matches gifs hosted by the Giphywebsite, which is the backend for many gif-basedkeyboards. Giphy contains a wide collection of gifsthat are curated to remove content inappropriate forgeneral use (e.g., violent or sexual imagery). Gifson the platform are categorized (e.g., “reaction” or“celebrities”) and we identify 28 categories con-taining 972 keywords likely to contain gifs used

Figure 2: The frequency distribution of gifs in our dataroughly follows a log-normal distribution, with a fewgifs used often, while a long tail of gifs are used rela-tively infrequently.

in conversation. A total of 2,095,993 gifs linkedto those keywords were ultimately retrieved andstored as image hashes. Additional details of cate-gories and keywords are in Appendix B.

After the matching image hashes to filter replies,we identify 115,586 unique gifs, referred to as ref-erence gifs, and 1,562,701 tweet replies using oneof these gifs, which forms our official dataset. Fig-ure 2 shows these gifs’ frequency in the data; muchlike words, a handful of gifs receive widespreaduse, while a long tail of gifs are rarely used.

3.2 Gif Metadata

We augment our gif data with information abouttheir content. Some gifs have text that transcribeswhat a person is saying in the gif’s scene or is ameta-commentary on the content. This text is ex-tracted using paddleOCR (Du et al., 2020). Sincesome gifs are long enough to contain multiple utter-ances, we run OCR on four frames sampled fromeach quartile of the gif’s length. Roughly 50%(58,020) of gifs contain at least one extracted wordfrom the selected frames, with an mean of 5.5 ex-tracted words per gif across the dataset.

Second, some gif repositories like Giphy allowusers to tag gifs with information on their contentor theme, e.g., “face palm” or “movie.” We collecttags for the 115K reference gifs used in Twitter, ob-taining 39,651 unique tags. These user-generatedtags were moderately noisy due to orthographicvariations like spelling, capitalization, and spacing.Therefore, we merge tags by (i) lower-casing thetext and (ii) performing a manual merge for similarword forms (e.g., “excited” and “exciting”). Tominimize noise, we retain only tags that have been

used with at least five gifs and where those gifshave been used at least 1000 times in total; thisprocess removes many low-frequency tags that areeither overly-specific or idiosyncratic in their use.

Finally, we performed a manual inspection of allremaining tags to remove tags that are too general(e.g., “emotion”) and retain only noun, adjective,and verb tags (words or multi-word expressions)that describe specific emotions or actions. A totalof 241 unique tags were retained (Appendix C).6.0% of gifs have at least one tag associated withthem (mean 1.9 tags). However, these tagged gifsaccount for 38.7% of the replies in our dataset,suggesting tags are only available for more-populargifs. Our dataset represents roughly an order ofmagnitude more data and more tags than the closestrelated dataset of Chen et al. (2017) that contained23K gifs with 17 manually-curated emotions.

4 Gif Reply Models

We introduce a series of models for producing a gifresponse in conversation. Each model will select agif from the 115K gifs in our dataset as a responseto a text-based message. This task is related to butdistinct from work on image-text matching (Leeet al., 2018), which aims to find an image describ-ing a piece of text, or text-to-image (e.g., Wen et al.,2015; Xu et al., 2018), which generates an imagefrom a text description. Here, we aim to select gifsthat reflect natural continuations or reactions to amessage in a dialog, akin to how gifs are used insocial media. For all models, additional details onthe training procedures and hyperparameters areprovided in Appendix A. The three models thatfollow use varying degrees of information aboutthe gifs and text to select a response.

4.1 Tag-based Predictions

The first model uses tags as a shared representationfor characterizing gifs and text. Analogous to howobject tags are used as anchor points for image-textmatching (Li et al., 2020) and pivot languages areused in machine translation (Cheng et al., 2017),we use tags to bridge information between the textin a tweet and the visual content of a gif. Here,each gif becomes associated with a set of tags de-scribing its conversational functions and for eachtext, we predict the set of tags for gifs responses toit—in essence, predicting what types of responsesare most appropriate. We describe both of theseprocesses next and how gifs are ultimately selected.

Estimating Gif Tags Only 6.0% of the gifs in ourdata have associated tags. Therefore we train aneural model to predict tags using known tags astraining data. To capture any changes in emotionor imagery across the gif, we make separate pre-dictions for four frames sampled across the gif(the same used in §3.2). Each frame is passedthrough an EfficientNet-based (Tan and Le, 2019)GIF encoder, shown in Figure 3, to extract a low-dimensional feature vector from each frame. Theseframe embeddings are fused using the attentionmechanism from a transformer encoder layer. Theoutput of the transformer feeds into a fully con-nected layer, which is trained as a multi-label clas-sifier using binary cross-entropy to predict whichtags should be present.Predicting Response Tags for Text For each mes-sage, we predict the k-hot distribution of tagsfor a gif response by training a BERTweet model(Nguyen et al., 2020), which has been pre-trainedon a large corpus of Twitter data (shown as “TweetEncoder" in Figure 3). The model with an addi-tional fully connected layer is trained as a multi-label classifier using binary cross-entropy, usingthe tags for the gifs used in reply (if known).Tag-based Gif Selection At inference time, givena message, we use the text-to-tag model to predicta k-hot distribution over tags. Then, we select thegif whose estimated tag distribution is closest inEuclidean distance.

4.2 CLIP variant

The second model uses an end-to-end training ap-proach based on the architecture of OpenAI CLIP(Radford et al., 2021). The architecture featurestwo encoders, one for text and one for images. Dur-ing training, the encoders are updated using con-trastive loss that maximizes the cosine similarity ofpaired image-text representations and minimizesthe cosine similarity of random pairs of images andtexts. We replicate the CLIP architecture and train-ing procedure, using BERTweet to encode text andEfficientNet (Tan and Le, 2019) to encode a com-posite image of four frames from the gif (comparedwith BERT and ResNet in their implementation).While originally designed to select an image for atext description, our model is trained to select a gifreply for a text message—a more challenging taskthan the image retrieval task used in the originalCLIP setup, as the message may not contain wordsdescribing elements of the gif. At inference time,

given a tweet, we use the trained tweet encoder toextract its representation and compute its cosinesimilarity with each encoded representation for ourgifs. The gif with the highest cosine similarity isreturned as the best response.

4.3 PEPE THE KING PRAWN

Our final model, KING PRAWN1 (referred to as“PEPE”.) selects gif responses by using a richerset of multimodal features to create a gif represen-tation. Rather than encode the gif solely from itsimage content, we use a multimodal encoder thatcaptures (i) any text it might have, (ii) the types ofobjects present in the gif, and (iii) object regions asvisual features. We encode these gif aspects usingan OSCAR transformer (Li et al., 2020) to createa unified representation, shown in Figure 3 (bot-tom). Object names and regions of interest featurevectors are extracted using a pre-trained bottom-upattention model (Anderson et al., 2018).

As input to the OSCAR encoder, the captions toeach of the gif’s four frames are concatenated to-gether with an “[INTER_FRAME_SEP]" separatortoken. We filter object areas detected by the bottom-up attention model (Anderson et al., 2018) and wekeep all objects with probability >0.5. We thenconcatenate object names together with the sameinter-frame separator between names of differentframes. Together, the caption text, object names,and image-region features are fed into the OSCAR

transformer encoder to generate a GIF feature vec-tor; the transformer is initialized with the defaultOSCAR weights. We use BERTweet to encode text.The entire PEPE model is trained end-to-end usingcontrastive loss, similar to the CLIP model.

5 Evaluation

We initially evaluate the methods in two ways.First, we use traditional classification-based eval-uation, testing whether the models can reproducethe observed gif replies. However, some messagescould have multiple valid gif responses. Therefore,as a second test, we evaluate the model in a retrievalsetting, measuring whether its most-probable re-sponses are good quality for a message.Experimental Setup Models are trained andtested on a dataset containing 1,562,701 Tweet-

1KING PRAWN refers to “selecKting INteresting Gifs forPersonal RespAWNses.” In this crazy muppet-name-land-grab world we live in, our only regret is that we couldn’tget “Pepino Rodrigo Serrano Gonzales” to fit as a bacronym,which we leave to future work.

Oscar GIF Encoder

Tweet Encoder

TweetWe 're gettingmarried today !

BERTweet-baseTransformer

feature vector fordownstream task

1 x 512

EfficientNet GIF Encoder

TransformerEncoder

Layer

feature vector fordownstream task

1 x 512

EfficientNetEfficientNetEfficientNetEfficientNet-b0(shared weight)

4 selected frames4 x 224 x 224 (after reshape)

OCR

4 selected frames4 x 224 x 224 (after reshape)

OscarMultimodal

Transformer

extracted object namesface woman [INTER_FRAME_SEP] face woman

extracted captionAww , thank you

feature vector fordownstream task

1 x 512

extracted object feature vectors

Bottom-up Attention

Figure 3: The different encoder modules used to construct the models in §4.

GIF pairs associated with 115,586 unique gifs,where 605,063 tweet-gif pairs are associatedwith at least one tag. Using the finalized 241unique tags as classes for multi-label classifica-tion, we split the dataset by stratify on tags us-ing the iterative train-test split method provided byscikit-multilearn library (Sechidis et al.,2011; Szymanski and Kajdanowicz, 2017) to cre-ate a 80:10:10 train, dev, and test split whichis finalized to train the models described in §4.Following BERTweet (Nguyen et al., 2020), wepreprocess tweets in our dataset using NLTKTweetTokenizer for tokenization, emojipackage to translate emotion icons, and convertedmentions and links to special “@USER" and“HTTPURL" tokens.

Annotated Data To test whether each model’s pre-dictions are valid responses, we annotate the tenmost-probable gif predictions for a subset of thetweets in our test data. Many tweets in our test setrequire substantial context to understand due to hav-ing few tokens, linking to URLs that provide extraknowledge, mentioning other users in directed com-munication. These factors suggest social contextor general knowledge aids in the recipient’s under-standing of the gif’s intentions. While the modelcan still benefit from training on such examples,judging the appropriateness of response is difficultwithout access to the social context. Therefore, toreduce interpretation ambiguity, we annotate onlytweets without URLs or user mentions and havingat least 10 tokens. This process selects tweets with

sufficient content to judge appropriateness indepen-dent of the larger social context.

Two annotators (the authors) were shown a listof potential gif responses for a tweet and asked tojudge whether this is an appropriate gif response(a binary rating). Gifs were selected from the tenmost-probable replies for each system and collec-tively shown in random order to prevent knowingwhich system generated each reply. A total of 2,500gif-tweet pairings were annotated. Annotators at-tained a Krippendorf’s α of 0.462; while moderateagreement, this value is expected given known dif-ferences in how people interpret and value gif re-sponses based on their familiarity with its content,message interpretation, and life-experience (Jianget al., 2018). We follow the evaluation setup fromother retrieval-based dialog systems (e.g. Yu et al.,2021; Kumar and Callan, 2020) and use normal-ized Discounted Cumulative Gain (nDCG), whichmeasures whether more appropriate gif responsesare ranked higher. A gif’s appropriateness score isthe sum of annotators’ ratings.

Results The PEPE model was able to identify rele-vant and good-quality gif responses, as shown byits performances on the test data (Table 1) and an-notated data (Table 2). Performance on the test setis expected to be low, given the challenge of identi-fying the exact gif used for a tweet when multiplepossible gifs are likely to be equally valid. How-ever, the PEPE model is still able to identify theexact gif (out of 115K) in its top 10 predictionsfor 3% of the data, substantially outperforming all

Model Top-1 Top-5 Top-10Tag-based 0.000000 0.000092 0.000119Random 0.000020 0.000059 0.000158CLIP variant 0.000488 0.001669 0.002783Distribution sampling 0.000996 0.005098 0.009780PEPE 0.005375 0.018723 0.030918

Table 1: Models’ precision-at-k on selecting the exactgif used as a response for a tweet in the test set; this per-formance is an underestimate of each model, as manymodel-predicted gifs may be appropriate.

Model nDCGRandom 0.3273Tag-based 0.4526Distribution sampling 0.4969CLIP variant 0.5934PEPE 0.8145

Table 2: Models’ nDCG scores at proposing appropri-ate gif replies, measured from annotations on the top10 most probable gif replies of each model.

other models.Performance on the annotated data (Table 2) pro-

vides a more realistic assessment of whether mod-els can generate high-quality replies, as it measureswhether the models’ replies themselves were good.The PEPE model attains substantially higher per-formance (p<0.01) than other models. While theCLIP variant model performs well, the content-agnostic Distribution sampling baseline performsnearly as well. This baseline’s high performancespeaks to the multiple interpretations of gifs andthe ease at which readers can make connections be-tween a gif and message. Indeed, even the random-gif model has a non-zero nDCG, highlighting theability for an arbitrary gif to still be consideredappropriate. We speculate that popular gifs maybe popular because of this ease of multiple inter-pretations. Table 4 shows the top predictions formodels and baselines for two example messages, il-lustrating the variety of relevant gifs; the PEPE andrandom baseline replies for the second message ex-emplify the type of gifs that can be widely appliedto many messages, often to humorous effects.Ablation study PEPE fuses multiple types of input,which may uniquely contribute to model’s ability toselect gif replies. To understand how these inputseach contribute, we performed an ablation study onthe annotated test set by removing one input fromOscar GIF Encoder shown in Figure 3 (i.e., a gif’scaption, object names, or objects’ visual features)

Model nDCGPEPE 0.8145PEPE without object names 0.7665PEPE without caption 0.7559PEPE without object features 0.7533

Table 3: Results for ablated versions of PEPE wherespecific input is removed (cf. Table 2) show that all in-put forms contribute to the ability to select replies.

and evaluating the model’s resulting gifs on thesame test instances.

The ablated model performances, shown in Ta-ble 3, reveal that each input is useful for selectinggifs.2 Object features capture visual informationabout what specifically is present in the gif (beyondthe discrete names of what is present, e.g., “person”or “building”) and show that multimodality is im-portant for high performance—predicting repliesjust from a gif’s caption and categorized contentare insufficient. Similarly, the caption of a gif (ifpresent) is important, as the text can help makeexplicit the intended interpretation of a gif.

6 Field Experiment

To test the generalizability of our models and qual-ity of their responses, we conduct a large-scale ran-domized controlled trial (RCT) that has the modelsrespond to real users and measure their perceptionof reply quality.3

6.1 Experimental Setup

Gifs were posted to the Imgur platform, which is ahighly active social media community that supportsboth image and text-based interactions. On Imgur,users may create posts, which contain one or moreimages with optional commentary, or comment onposts or replies. Similar to pre-2018 Twitter, com-ments are limited to 140 characters. Imgur conver-sations are threaded and frequently contain bothimage and text comments. Like Reddit, users mayupvote and downvote content, providing a score ofhow well it was received by the community; we use

2The performance decrease for removing object names isstatistically significant (p<0.01, bootstrapped). The decreasesfor removing captions and objects’ visual features are sig-nificant from the name-removal model (p<0.01) but the twomodels are statistically equivalent (p>0.19).

3This experiment was ruled as Not Regulated by the Uni-versity of Michigan IRB (HUM00197631). However, IRBapproval is not sufficient to prevent harm (Bernstein et al.,2021) and significant precautions were taken to minimizepotential risk (See §9) .

Parent Tweets Tag-based CLIP variant PEPE Dist. Samp. RandomThat wonderful feelingyou get when you ar-rive to a business din-ner that you’re suppos-edly paying for...and re-alize you’ve forgottenyour credit card

I’m convinced some ofy’all don’t get laid

Table 4: Model-selected replies to messages (paraphrased for privacy). Click an image to view the gif on Giphy.

this score in our experiments to evaluate quality.Our experiment focuses on generating Gif-based

replies to top-level text comments (comments madedirectly to the post). This setup mirrors the conver-sational data our models were trained on. Imgursupports several ways of filtering its stream of posts.To ensure that our replies have sufficient visibility,we select posts that have already receive 10 com-ments and appear in the “most viral” sorting. Fromthese posts, we reply to the top-rated text comment.The RCT runs from 8 AM to 8 PM (local time),making at most 10 replies per hour.

Not all topics or comments are suitable for auto-mated responses and great care was taken to pre-vent potential harm to the community. Throughmultiple rounds of testing which replies would beresponded to, we curated a list of keywords thatcould lead to potential controversial replies, suchas terms about religion or race (full list in Ap-pendix D). Any comment containing a token orlemma matching a word on this list is excludedand not replied to. As a further safeguard, exper-imenters monitored all replies to remove any thatwere deemed inappropriate. See the Ethics Section(§9) for a longer discussion of safeguards.

The field experiment consists of five arms, cor-responding to the three trained models and the twobaseline models. During each trial, one model is se-lected and generates a response; the trained modelreplies with the most probable gif.4

Not all models are equally likely to perform welland so to make the most use of our trial budget,

4Due to a bug, early experimental trials for the CLIP andPEPE models used the tenth most-probable gif; however, usingthe ratings in the annotated data, a t-test of the difference inquality for most- and tenth-most probable gifs showed nostatistically-significant difference in quality for both models(p>0.1). Therefore, we include this data in our results.

we use Thompson sampling (Russo et al., 2018)to randomly select which arm of the trial to use.Thompson sampling builds a probability model forthe estimated reward of each arm (here, the score areply receives) and samples from the model suchthat higher-rewarding arms are sampled more fre-quently. As a result, this method can provide tighterestimates for the reward of the most useful arms.Scores in Imgur have a skewed distribution, withfew comments receiving very high scores and mostreceiving near the default score (1). Therefore, weuse Poisson Thompson sampling. Some commentsmay be downvoted to receive scores below zero, sofor simplicity, we truncate these scores to 0.

We initialize the reward estimates for our ex-periment by selecting one of the five models ina round-robin manner to reply to an Imgur com-ment for 3 days. These initial scores act as priorsfor Thompson sampling to update Poisson distri-butions for each model. In the trial, we choose amodel by sampling from the up distributions usingall previous days’ scores as the prior. The exper-iment ran from April 15th, 2021 to August 30th,2021, and models generated a total of 8,369 replies.

To evaluate the results of the RCT, we constructa Negative Binomial regression on the dependentvariable of the score received for a model’s reply,truncating negative scores to zero. The Negativebinomial was chosen instead of Poisson due toover-dispersion in the score variable. The modelsare treated as a categorical variable, using the ran-dom model as a reference. Since the score willdepend, in part, on the attention received by theparent post and comment (higher-rated commentsare displayed first), we include linear effects forthe post and parent comment. Finally, we includefive text-related variables to control for the con-

0.2 0.0 0.2Coefficient for GIF Reply Score

CLIP variant

Dist. Samp.

Pepe

Tag-based

***

***

Figure 4: Negative Binomial regression coefficients foreach model on predicting a gif reply’s score, using therandom-gif model as the reference category; bars showstandard error and *** denotes significance at 0.01.

tent of the parent comment: the topic distribution(Appendix Table 9) from a 10-topic model (drop-ping one topic due to collinearity), the sentimentand subjectivity of the message estimated usingTextBlob library, the length of the comment, andwhether the comment contained a question.

6.2 Results

The field experiment demonstrates that the PEPE

model is able to generate significantly higher-scoring responses. Figure 4 shows the NegativeBinomial regression coefficients for the three mod-els and empirical distribution baseline, with therandom gif model as a reference; full regressionresults are shown in Appendix Table 6. The PEPE

model substantially outperforms all other models(p<0.01) in this real-world setting. Surprisingly,despite performing second-best in our annotatedevaluations, the CLIP model performs worst, withits replies receiving fewer upvotes than the twobaselines that randomly select gifs. We investigatepotential explanations for these performances next.

The Random and Distributional-sampling base-line models perform surprisingly well relative tomodels that take the text and gif content into ac-count, with only the PEPE model outperformingthem. The performance of the random baselinesmatches prior work showing people are still able todraw some connection between their interpretationand the reply (Madden, 2018, p.29). Further, weobserved that, when the model’s reply truly seemedrandom, some users replied say they upvoted solelybecause they enjoyed the gif.

As a follow-up experiment, we tested whethermodels could be getting higher (or lower) scores byrepeatedly picking the same gifs that are skewedtowards a positive or negative reaction. Figure 5shows the score distribution for the top ten most fre-

60 40 20 0 20GIF Reply Score

2gG2xiMTtFwsgfnjxvV295sWEJjvwXU

BAPSj0xM1cFe8iJsvRxNTAcup6DVfLP

3oEjHLcg4QMU5umb9maKrTvuOv4hlKM

3oKIPllDN24q8AwtwcjIu44mYwUItSHTW3tjjTrWAzlFGfvVY34PSJ

l396L17pwHWOIJrTG

GIPH

Y GI

F ID

(a) Tag-based

30 20 10 0 10 20GIF Reply Score

lfesfEtobCSbsHzC8dm9d3Xif3ShZ42CxlWPf9k1tV7HyORcngKF8v

loitbnzQ1JQ8Iizx8wbfrlODgSLqXxS

4HmjGg306HiLHWlm2f7J26CGAahos6d5S1A68hZ9FMolyKc0X8BSr7

iqkHA3DmB8GjORY030OOzcnk3PzLDHqWs6Tb

GIPH

Y GI

F ID

(b) CLIP variant

40 20 0 20GIF Reply Score

tnYri4n2Frnig5wWf7GR2nhgamhRnEuA

5gw0VWGbgNm8wiXTrbbYMQBCMM

65ODCwM00NVmEyLsX326AHLBZUC1n53ozi83o8doT9BL7dgtolp7O

Fq6Bdki3coEWQ3oEjHAUOqG3lSS0f1C

KzyMcEfDh4Jiw

GIPH

Y GI

F ID

(c) PEPE

Figure 5: Score distributions for most-frequently usedgifs show few are universally skewed positive. Boxesshow quartile ranges; gifs are in Appendix Table 7.

quently used gifs (visual examples in Appendix Ta-ble 7) for each of the three trained models andreveals surprisingly divergent behavior for how thecommunity reacts. Each model had a differentset of most-used gifs, indicating the models didnot converge to a universal set of common replies.Indeed, a gif’s frequency-of-use and mean replyscore were uncorrelated in all three models (r ≈-0.01, p>0.73 for all models). The most-used gifsfor each model had average scores that were pos-itive, but the distributions for each gif show thatsome uses were occasionally downvoted. This highvariance in scores indicates that a gif’s intrinsicqualities are not solely responsible for the receivedscore and, instead, appropriate use in context isplays a significant part in community reception.

We examined whether models relied on the sameset of gifs. Figure 6 shows the distribution of gif

0.0 0.5 1.0 1.5 2.0 2.5 3.0log10(rank of unique gif)

0.0

0.5

1.0

1.5

2.0

2.5

log1

0(fre

quen

cy o

f uni

que

gif) Tag-based

CLIP variantPepeDist. Samp.random

Figure 6: Gif use frequency by each model, shownas frequency-vs-rank log-scaled with first-order line fit(jitter added for separation).

uses by each model, indicating that the tag-basedmodel relied frequently on a small set of gifs. How-ever, the PEPE and CLIP variant models were sub-stantially more varied, indicating they draw fromthe long-tail of possible gifs.

Do any of our models spark more subsequentconversation? We fit a separate Negative Binomialregression on the total number of comments madeto our reply, using the same IVs as the score regres-sion and include the reply’s score itself as anotherIV. This model (Appendix Table 8) shows that boththe distributional-sampling baseline and PEPE mod-els produced replies that led to fewer subsequentcomments (p<0.01)—despite the PEPE model hav-ing the most-upvoted replies. However, the scoreof the gif reply was positively associated (p<0.01)indicating that more appropriate replies do receivemore subsequent conversation. We speculate thatthe random models may have led to more conver-sation due to users replying to express confusionabout why the particular gif was used. This resultpoints to a need to understand what text and visualfactors in gifs influence the volume of subsequentdialog and an opportunity to optimize gif modelsfor both quality and number of conversation turns.

7 Related Work

This work draws upon two strands of research fromdialog systems and multimodal NLP. Conversa-tional dialog systems have traditionally been builtupon large-scale dialog corpora from social me-dia platforms (Bessho et al., 2012) such as Twitter.Our approaches are fundamentally information re-trieval based systems that mirror the approach bytext-based conversational systems that retrieve ex-

isting messages from a large social media corpus aspotential replies and rank these to select a response.Our work mirrors models that use neural networksfor ranking (Yan et al., 2016; Inaba and Takahashi,2016; Penha and Hauff, 2021, e.g.,); however, wenote that many recent knowledge-grounded andopen domain models use encoder-decoder meth-ods to improve versatility and applicability (e.g.,Ghazvininejad et al., 2018; Gao et al., 2019; Zhouet al., 2020). Generative approaches are likely in-appropriate for gif-based conversation as gifs aremore akin to mimetic artifacts that build on culturalknowledge (Eppink, 2014), making synthesizing anew gif from scratch likely less effective.

All three models used here rely on joint embed-ding spaces for gif and text. Multiple works inNLP have been proposed to align these representa-tions (Kiros et al., 2014; Wang et al., 2016), oftenfor particular applications such as visual questionanswering (Antol et al., 2015). Recent work hasfocused on embeddings these media with a singleencoder that takes both text and images as input(e.g., Wang et al., 2019; Chen et al., 2020), in con-trast to our model that uses separate image and textencoders (Figure 3); these multimodal encodersare prohibitively computationally expensive to usein our setting during inference time, as the modelwould need to be run on each gif (and message) torank replies, compared with our model that onlyneeds to encode text. However, performance andefficiency improvements in aligning image and textrepresentations would likely benefit our task.

8 Conclusion

People like using gifs in online conversations—gifsare a fun and playful way to communicate. How-ever, modern NLP conversational agents operateonly by text. Here, we introduce a new datasetof 1.56M conversation turns using gifs, includingcaptions and metadata, and develop a new conversa-tional model PEPE THE KING PRAWN that selectsappropriate gif responses for messages throughcomparing encoded gif and text representations. Intwo evaluations, we show that PEPE is able to gen-erate highly-relevant gif responses and in a large-scale RCT, we show that the gif replies from thePEPE model received significantly higher scoresfrom the general public. Our work demonstratesthe opportunity for using NLP methods to success-fully engage in multimodal conversations.

9 Ethics

The interactive nature of the RCT necessitateda close consideration of ethical issues (Thieltgeset al., 2016). Prior to beginning the RCT, the studyteam obtained IRB approval to interact with users.While necessary in the legal sense, IRB approvalis not sufficient to justify the ethical grounds ofthe study. The primary risks of the study are if theautomated models respond with an inappropriategif or respond to a message that is not suitable forautomated response (e.g., discussing the death of aloved one or making an offensive statement). Theserisks were mitigated in multiple ways throughoutthe dataset construction and field experiment.

First, the selection criteria for which commentswe reply to was designed to only reply to contentthat was already deemed appropriate by the com-munity. By selecting only posts that had receivedsufficient upvotes to be called “viral” and werealready receiving comments, we mitigate the riskof engaging in topics or conversations that are in-appropriate according to the norms of the Imgurcommunity, as these posts would be removed bymoderators or would have received sufficient down-votes to stay in obscurity.

Second, by focusing on the top-voted commentto these posts, we again reply to content that hasalready been deemed high-quality by the comment.This comment-level criteria substantially lowersthe risk of our models commenting on inappropri-ate comments (e.g., a comment insulting anotheruser), as these comments are readily downvoted bythe community prior to our intervention.

Third, we employed extensive filtering to avoidreplying to any comment containing a potentiallysensitive topic, e.g., a discussion of race or trauma(keywords are listed in Appendix D). The initial setof keywords was developed through examining po-tentially sensitive topics and then iteratively addedto by simulating which messages our RCT wouldreply to and examining whether it would be appro-priate. During the field RCT, experimenters contin-uously monitored the comments to ensure no harmwas being done. Ultimately, only three commentswere removed during the initial two days, whichwas due to a bug in the lemmatization and thesecomments should have been filtered out by our ear-lier criteria; these comments were removed quicklyand we did not observe any notable response fromthe community.

Fourth, one risk is replying with an inappropri-

ate gif, which is mitigated by the use of Giphy toseed our initial gifs. As this platform is curatedand does not host objectively offensive gifs (e.g.,overly-violent content), our initial gif set is rela-tively free of objectionable gifs. Because our modellearns directly from gifs’ frequency of use, unlessobjectively offensive gifs are widely used, they areunlikely to be deployed from our RCT; we specu-late that few objectively offensive gifs are widelyused and, in practice, we have not identified anyduring the study period or when examining hun-dreds of random gifs in our data (or used in theRCT).

Finally, one risk is that by learning gif responsesfrom observed data, our models may reinforcecultural stereotypes that are encoded in the gifsthemselves (Erinn, 2019), e.g., the association ofAfrican American individuals with strong emotions.While our gif data is relatively clean of overtly of-fensive gifs, we acknowledge that our model likelydoes inadvertently perpetuate some of these latentbiases in the data. However, the success of ourmodel suggests a future mitigation strategy for plat-forms suggesting gifs: as biases become known,our approach can be used to suggest less-biasedgifs as potential responses to mitigate future harm.

Acknowledgments

We thank the reviewers, area chairs, and senior areachairs for their thoughtful comments and feedback.We also thank the Blablablab for helpful feedbackand letting us deploy PEPE to the group’s Slack andputting up with the ridiculous gif replies and Imgurfor being a wonderful community. This material isbased upon work supported by the National ScienceFoundation under Grant No. 2007251.

ReferencesPeter Anderson, Xiaodong He, Chris Buehler, Damien

Teney, Mark Johnson, Stephen Gould, and LeiZhang. 2018. Bottom-up and top-down attention forimage captioning and visual question answering. In2018 IEEE Conference on Computer Vision and Pat-tern Recognition, CVPR 2018, Salt Lake City, UT,USA, June 18-22, 2018, pages 6077–6086. IEEEComputer Society.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar-garet Mitchell, Dhruv Batra, C. Lawrence Zitnick,and Devi Parikh. 2015. VQA: visual question an-swering. In 2015 IEEE International Conference onComputer Vision, ICCV 2015, Santiago, Chile, De-cember 7-13, 2015, pages 2425–2433. IEEE Com-puter Society.

Saeideh Bakhshi, David A. Shamma, Lyndon Kennedy,Yale Song, Paloma de Juan, and Joseph Jofish Kaye.2016. Fast, cheap, and good: Why animated gifs en-gage us. In Proceedings of the 2016 CHI Conferenceon Human Factors in Computing Systems, San Jose,CA, USA, May 7-12, 2016, pages 575–586. ACM.

Michael S Bernstein, Margaret Levi, David Magnus,Betsy Rajala, Debra Satz, and Charla Waeiss. 2021.Esr: Ethics and society review of artificial intelli-gence research. ArXiv preprint, abs/2106.11521.

Fumihiro Bessho, Tatsuya Harada, and Yasuo Ku-niyoshi. 2012. Dialog system using real-time crowd-sourcing and Twitter large-scale corpus. In Proceed-ings of the 13th Annual Meeting of the Special Inter-est Group on Discourse and Dialogue, pages 227–231, Seoul, South Korea. Association for Computa-tional Linguistics.

Elli Bourlai and Susan C Herring. 2014. Multimodalcommunication on tumblr: “i have so many feels!”.In Proceedings of the 2014 ACM conference on Webscience, pages 171–175.

Weixuan Chen, Ognjen Oggi Rudovic, and Rosalind WPicard. 2017. Gifgif+: Collecting emotional ani-mated gifs with clustered multi-task learning. In2017 Seventh International Conference on AffectiveComputing and Intelligent Interaction (ACII), pages510–517. IEEE.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed ElKholy, Faisal Ahmed, Zhe Gan, Yu Cheng, andJingjing Liu. 2020. UNITER: Universal Image-TExt representation learning. In ECCV.

Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, andWei Xu. 2017. Joint training for pivot-based neuralmachine translation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial In-telligence, IJCAI 2017, Melbourne, Australia, Au-gust 19-25, 2017, pages 3974–3980. ijcai.org.

Michelle Cohn, Chun-Yen Chen, and Zhou Yu. 2019.A large-scale user study of an Alexa Prize chatbot:Effect of TTS dynamism on perceived quality of so-cial dialog. In Proceedings of the 20th Annual SIG-dial Meeting on Discourse and Dialogue, pages 293–306, Stockholm, Sweden. Association for Computa-tional Linguistics.

Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin,Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, YehuaYang, Qingqing Dang, et al. 2020. PP-OCR: A Prac-tical Ultra lightweight OCR system. ArXiv preprint,abs/2009.09941.

Jason Eppink. 2014. A brief history of the gif (so far).Journal of visual culture, 13(3):298–306.

Wong Erinn. 2019. Digital blackface: How 21st cen-tury internet language reinforces racism.

Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark,Ari Holtzman, Yejin Choi, Noah A. Smith, and MariOstendorf. 2018. Sounding board: A user-centricand content-driven social chatbot. In Proceedings ofthe 2018 Conference of the North American Chap-ter of the Association for Computational Linguis-tics: Demonstrations, pages 96–100, New Orleans,Louisiana. Association for Computational Linguis-tics.

Jianfeng Gao, Michel Galley, and Lihong Li. 2019.Neural Approaches to Conversational AI: Ques-tion Answering, Task-oriented Dialogues and SocialChatbots. Now Foundations and Trends.

Marjan Ghazvininejad, Chris Brockett, Ming-WeiChang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, andMichel Galley. 2018. A knowledge-grounded neuralconversation model. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence,(AAAI-18), the 30th innovative Applications of Arti-ficial Intelligence (IAAI-18), and the 8th AAAI Sym-posium on Educational Advances in Artificial Intel-ligence (EAAI-18), New Orleans, Louisiana, USA,February 2-7, 2018, pages 5110–5117. AAAI Press.

Carla F Griggio, Joanna Mcgrenere, and Wendy EMackay. 2019. Customizations and expressionbreakdowns in ecosystems of communication apps.Proceedings of the ACM on Human-Computer Inter-action, 3(CSCW):1–26.

Tim Highfield and Tama Leaver. 2016. Instagrammat-ics and digital methods: Studying visual social me-dia, from selfies and gifs to memes and emoji. Com-munication research and practice, 2(1):47–62.

Chung Hoon Hong, Yuan Liang, Sagnik Sinha Roy,Arushi Jain, Vihang Agarwal, Ryan Draves, ZhizhuoZhou, William Chen, Yujian Liu, Martha Miracky,et al. 2020. Audrey: A personalized open-domainconversational bot. In Alexa Prize Proceedings.

Pingping Huang, Jianhui Huang, Yuqing Guo, MinQiao, and Yong Zhu. 2019. Multi-grained attentionwith object-level grounding for visual question an-swering. In Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguis-tics, pages 3595–3600, Florence, Italy. Associationfor Computational Linguistics.

Michimasa Inaba and Kenichi Takahashi. 2016. Neuralutterance ranking model for conversational dialoguesystems. In Proceedings of the 17th Annual Meetingof the Special Interest Group on Discourse and Dia-logue, pages 393–403, Los Angeles. Association forComputational Linguistics.

Jialun “Aaron" Jiang, Jed R Brubaker, and CaseyFiesler. 2017. Understanding diverse interpretationsof animated GIFs. In Proceedings of the 2017 CHIConference Extended Abstracts on Human Factorsin Computing Systems, pages 1726–1732.

Jialun “Aaron" Jiang, Casey Fiesler, and Jed RBrubaker. 2018. “The Perfect One” UnderstandingCommunication Practices and Challenges with An-imated GIFs. Proceedings of the ACM on human-computer interaction, 2(CSCW):1–20.

Daniel Jurafsky, Elizabeth Shriberg, Barbara Fox, andTraci Curl. 1998. Lexical, prosodic, and syntacticcues for dialog acts. In Discourse Relations and Dis-course Markers.

Mahmoud Khademi. 2020. Multimodal neural graphmemory networks for visual question answering. InProceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics, pages 7177–7188, Online. Association for Computational Lin-guistics.

Ryan Kiros, Ruslan Salakhutdinov, and Richard SZemel. 2014. Unifying visual-semantic embeddingswith multimodal neural language models. ArXivpreprint, abs/1411.2539.

Artie Konrad, Susan C Herring, and David Choi. 2020.Sticker and emoji use in facebook messenger: impli-cations for graphicon change. Journal of Computer-Mediated Communication, 25(3):217–235.

Vaibhav Kumar and Jamie Callan. 2020. Making in-formation seeking easier: An improved pipeline forconversational search. In Findings of the Associa-tion for Computational Linguistics: EMNLP 2020,pages 3971–3980, Online. Association for Compu-tational Linguistics.

Kuang-Huei Lee, X. Chen, G. Hua, H. Hu, and Xi-aodong He. 2018. Stacked cross attention for image-text matching. ArXiv preprint, abs/1803.08024.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xi-aowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu,Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-languagetasks. In European Conference on Computer Vision,pages 121–137. Springer.

John Savery Madden. 2018. The PhenomenologicalExploration of Animated GIF Use in Computer-Mediated Communication. Ph.D. thesis, Universityof Oklahoma.

Luke Melas-Kyriazi, Alexander Rush, and George Han.2018. Training for diversity in image paragraph cap-tioning. In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processing,pages 757–761, Brussels, Belgium. Association forComputational Linguistics.

Kate M Miltner and Tim Highfield. 2017. Nevergonna GIF you up: Analyzing the cultural signifi-cance of the animated GIF. Social Media+ Society,3(3):2056305117725223.

Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen.2020. BERTweet: A pre-trained language modelfor English tweets. In Proceedings of the 2020

Conference on Empirical Methods in Natural Lan-guage Processing: System Demonstrations, pages 9–14, Online. Association for Computational Linguis-tics.

Tianrui Niu, Fangxiang Feng, Lingxuan Li, and XiaojieWang. 2020. Image synthesis from locally relatedtexts. In Proceedings of the 2020 International Con-ference on Multimedia Retrieval, pages 145–153.

Silvia Pareti and Tatiana Lando. 2018. Dialog intentstructure: A hierarchical schema of linked dialogacts. In Proceedings of the Eleventh InternationalConference on Language Resources and Evaluation(LREC 2018), Miyazaki, Japan. European LanguageResources Association (ELRA).

Gustavo Penha and Claudia Hauff. 2021. On the cal-ibration and uncertainty of neural learning to rankmodels for conversational search. In Proceedingsof the 16th Conference of the European Chapterof the Association for Computational Linguistics:Main Volume, pages 160–170, Online. Associationfor Computational Linguistics.

Alec Radford, Jong Wook Kim, Chris Hallacy, AdityaRamesh, Gabriel Goh, Sandhini Agarwal, GirishSastry, Amanda Askell, Pamela Mishkin, Jack Clark,et al. 2021. Learning transferable visual modelsfrom natural language supervision. ArXiv preprint,abs/2103.00020.

Ashwin Ram, Rohit Prasad, Chandra Khatri, AnuVenkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn,Behnam Hedayatnia, Ming Cheng, Ashish Nagar,et al. 2018. Conversational ai: The science behindthe alexa prize. ArXiv preprint, abs/1801.03604.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, ScottGray, Mark Chen, Rewon Child, Vedant Misra,Pamela Mishkin, Gretchen Kruegerand SandhiniAgarwal, and Ilya Sutskever. 2021. DALL-E: Cre-ating images from text. https://openai.com/blog/dall-e/.

Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni,Ian Osband, and Zheng Wen. 2018. A tutorial onthompson sampling. Foundations and Trends® inMachine Learning, 11(1):1–96.

Konstantinos Sechidis, Grigorios Tsoumakas, and Ioan-nis Vlahavas. 2011. On the stratification of multi-label data. Machine Learning and Knowledge Dis-covery in Databases, pages 145–158.

Piyush Sharma, Nan Ding, Sebastian Goodman, andRadu Soricut. 2018. Conceptual captions: Acleaned, hypernymed, image alt-text dataset for au-tomatic image captioning. In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers), pages2556–2565, Melbourne, Australia. Association forComputational Linguistics.

Piotr Szymanski and Tomasz Kajdanowicz. 2017. Anetwork perspective on stratification of multi-labeldata. In First International Workshop on Learn-ing with Imbalanced Domains: Theory and Appli-cations, pages 22–35. PMLR.

Mingxing Tan and Quoc V. Le. 2019. Efficientnet:Rethinking model scaling for convolutional neuralnetworks. In Proceedings of the 36th InternationalConference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, vol-ume 97 of Proceedings of Machine Learning Re-search, pages 6105–6114. PMLR.

Ying Tang and Khe Foon Hew. 2019. Emoticon, emoji,and sticker use in computer-mediated communica-tion: A review of theories and research findings. In-ternational Journal of Communication, 13:27.

Andree Thieltges, Florian Schmidt, and SimonHegelich. 2016. The devil’s triangle: Ethical con-siderations on developing bot detection methods. In2016 AAAI Spring Symposium Series.

Jackson Tolins and Patrawat Samermit. 2016. Gifsas embodied enactments in text-mediated conversa-tion. Research on Language and Social Interaction,49(2):75–91.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016.Learning deep structure-preserving image-text em-beddings. In 2016 IEEE Conference on ComputerVision and Pattern Recognition, CVPR 2016, Las Ve-gas, NV, USA, June 27-30, 2016, pages 5005–5013.IEEE Computer Society.

Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng,Junjie Yan, Xiaogang Wang, and Jing Shao. 2019.CAMP: cross-modal adaptive message passing fortext-image retrieval. In 2019 IEEE/CVF Interna-tional Conference on Computer Vision, ICCV 2019,Seoul, Korea (South), October 27 - November 2,2019, pages 5763–5772. IEEE.

Miaomiao Wen, Nancy Baym, Omer Tamuz, JaimeTeevan, Susan T Dumais, and Adam Kalai. 2015.Omg ur funny! computer-aided humor with an ap-plication to chat. In ICCC, pages 86–93.

Jun Xu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanx-iang Che, and Ting Liu. 2020. Conversational graphgrounded policy learning for open-domain conversa-tion generation. In Proceedings of the 58th AnnualMeeting of the Association for Computational Lin-guistics, pages 1835–1845, Online. Association forComputational Linguistics.

Tao Xu, Pengchuan Zhang, Qiuyuan Huang, HanZhang, Zhe Gan, Xiaolei Huang, and Xiaodong He.2018. Attngan: Fine-grained text to image genera-tion with attentional generative adversarial networks.In 2018 IEEE Conference on Computer Vision andPattern Recognition, CVPR 2018, Salt Lake City, UT,USA, June 18-22, 2018, pages 1316–1324. IEEEComputer Society.

Rui Yan, Yiping Song, and Hua Wu. 2016. Learningto respond with deep neural networks for retrieval-based human-computer conversation system. In Pro-ceedings of the 39th International ACM SIGIR con-ference on Research and Development in Informa-tion Retrieval, SIGIR 2016, Pisa, Italy, July 17-21,2016, pages 55–64. ACM.

Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, andZhiyuan Liu. 2021. Few-shot conversational denseretrieval. ArXiv preprint, abs/2105.04166.

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum.2020. The design and implementation of XiaoIce,an empathetic social chatbot. Computational Lin-guistics, 46(1):53–93.

Category Subcategory

Cartoons & Comics aqua teen hunger forceCelebrities richard pryorReactions angryEmotions happyAnime bleachArt & Design psychedelicNature sunriseTransportation bicycle

Table 5: Examples of GIF categories on GIPHY

A Additional Details on Model Training

Following, we provide additional details on howeach of the three models was trained.

A.1 Tag-based Model

EfficientNet-based Tag Classifier Gifs are re-shaped to 224 by 224 pixel while keeping the as-pect ratio by padding and normalized to a mean of0.5 and standard deviation of 0.5 for each channelbefore feeding into the EfficientNet-based model.We selected unique GIFs from the finalized datasetthat has at least one associated tag and using theiterative train test split on k-hot tag representationto select 5% of those GIFs for validation. The Effi-cientNet tag classifier was trained for 100 epochson a batch size of 32, using AdamW optimizerwith learning rate 1e-5 and weight decay 1e-3. Thebest validation performance was achieved at the40th epoch with macro-f1 of 0.30 in predicting 241multi-label classes. Early experiment shows thattransformer encoder layer (macro-f1 of 0.30) outperforms linear layer (macro-f1 of 0.19) in fusingmulti-frame gif features on the development set,therefore transformer encoder layer is used to fusefeatures of different frames in our implementation.Tweet-to-tag classifier Using the finalized datasetmentioned in §3, we use tweet as input, and thek-hot tag representation of that tweet instance asground truth label to train the multi-label classifieralong with the tweet encoder for 241 classes. Ad-ditionally, we filter out tweets from the finalizeddataset that do not have corresponding twitter tagsbefore training. The model with the best valida-tion performance is selected to perform subsequentevaluation and field experiments. The tweet en-coder was trained for 100 epochs with a batch sizeof 32. The learning rate was set to 1e-5 with 1e-3weight decay using AdamW optimizer. The best

validation macro-f1 was 0.07 achieved at the 70thepoch.

A.2 CLIP variantThe evaluation performance for model selectionis measured by nDCG. For every tweet-gif pair inthe validation set, we measure the top 30 predictedGIFs from the model using the tweet as input. Therelevance of an occurring ground truth gif in thetop 30 predictions given a tweet is set 1 for thenDCG calculation.

CLIP variant is trained on the same finalizeddataset using contrastive loss. It was trained for16 epochs with a batch size of 16 using AdamWoptimizer of learning rate 1e-5 and weight decay1e-3. Best validation performance is achieved atepoch 6 with an nDCG value of 0.015.

We replace the Transformer encoder layer witha linear Layer on Efficient GIF Encoder from Fig-ure 3, and use this as our GIF Encoder for theCLIP variant. Image inputs to the GIF encoder arenormalized following the official CLIP implemen-tation.

A.3 PEPE

The PEPE model follows most configurations fromthe CLIP variant model, but replace the EffcientNetGIF encoder with an Oscar GIF encoder basedon Oscar pre-trained multi-modal transformer (Liet al., 2020).

Extra metadata are extracted from GIFs in the fi-nalized dataset for further training. Captions withinthe GIF are extracted using PaddleOCR (Du et al.,2020), and only extracted text with probabilitygreater than 0.9 are kept as caption metadata.

Object tags and their corresponding features areextracted with bottom-up attention (Anderson et al.,2018) using py-bottom-up-attentionpackage. Object instances are filtered to onlykeep instances that have a score higher than0.5, then object tags and their correspondingfeatures are extracted from these instances. Finalobject features of dimension 2054 are obtainedby concatenating feature output with dimension2048 from Faster-RCNN with scaled box positioncoordinates of the object following (Li et al.,2020).

The PEPE model is trained on the finalizeddataset with extracted caption and object metadata.It was trained for 16 epochs with a batch size of8 using AdamW optimizer of learning rate 1e-6and weight decay 1e-3. Preprocessing for GIFs is

Match AnimatedGIF with Hash

Finalized Dataset1,562,701 pairs (605,063 pairs associated with selected tags)

......115,586 unique GIF IDs

Parent Tweet: We 're getting married today !GIPHY GIF ID: g9582DNuQppxCSelected Tags: #cheers, #drink, #congratulations

Parent Tweet: Having a super great day .GIPHY GIF ID: 3ornjLd54I3eQYmfpCSelected Tags: #high five

GIPHY GIF Collections

......2,095,993 unique GIF IDs

GIPHY GIF ID: 3o6fIQSs4BcsEbDE7STags: #real housewives #bravo tv #slice #rhonjAverageHash: fbf9f35141008c8bfbf9f35151008c89fbf9f35151008c89

GIPHY GIF ID: 3ornjLd54I3eQYmfpCTags: #high five #late night with seth meyersAverageHash: 00080c2ce4f0f46410383828c262726300040424b4f4e061

Tweet-GIF Pairs42,096,566 pairs

...... 39,401,680 unique GIF files

Parent TweetHaving a super great day .

Child animated GIF (reply)File: tweet_video/EEqo71PWkAAqGab.mp4AverageHash: 00080c2ce4f0f46410383828c262626300040424b4f4e061

GIPHY TagsSelection

Matched

Not Matched

Figure 7: A diagram of the pipeline used to collect, canonicalize, and filter gif-reply data from Twitter.

Image Average Hash: 686cfcfef2d282021f181b3d1c18f07c083e2e1f9fd08c8e

Tweet Animated GIF (D0mIkeXX0AIQt3N.mp4)

Hamming distance = 51.0Not Matched

Hamming distance = 9Matched

Image Average Hash: f0f4fcfcf8e060001018383818183018001a1b1b1a18180c

GIPHY GIF (ID: 2zR6G1VD9a1ws)

Image Average Hash: 6864f6ded2d282001f180b3d1c18f07c083e2f1fdfd08c8e

GIPHY GIF (ID: Y9pd1baXUIJTW)

Figure 8: Matching Animated GIFs from Twitter with GIPHY gifs using Image Average Hash

the same as the Tag-based model. Max sequencelength is set to 256 tokens for the Oscar transformer.Best evaluation performance is achieved at epoch12 with an nDCG score of 0.007.

B GIF categories on GIPHY

Category Subcategory

Reactions whatReactions hair flipReactions boredReactions frownReactions slow clapReactions mic dropReactions goodbyeReactions mehReactions scaredReactions do not wantReactions confusedReactions drunkReactions wowReactions madReactions awesomeReactions please

Reactions thumbs downReactions frustratedReactions oh snapReactions disgustedReactions rejectedReactions embarrassedReactions hugReactions yoloReactions interestedReactions thank youReactions sarcasticReactions shockedReactions cool story broReactions middle fingerReactions you got thisReactions whateverReactions omgReactions deal with itReactions sighReactions oopsReactions angryReactions finger gunsReactions good luck

Dependent variable:

Gif reply score

post score −0.0002∗∗∗ (0.00003)comment score 0.001∗∗∗ (0.0001)CLIP variant model −0.161∗∗∗ (0.058)Distribution-sampling model 0.057 (0.056)PEPE model 0.223∗∗∗ (0.051)Tag-based model −0.017 (0.055)number of days after reply 0.003∗∗∗ (0.0005)comment text polarity −0.039 (0.058)comment text subjectivity −0.033 (0.052)topic 0 (Politics related) 0.078 (0.155)topic 1 (Family & Pets related) 0.300∗∗ (0.148)topic 2 (Employment related) −0.119 (0.184)topic 3 (Social media related) 0.140 (0.165)topic 4 (Transportation related) −0.172 (0.188)topic 5 (Food related) 0.133 (0.194)topic 6 (COVID related) −0.082 (0.200)topic 7 (Entertainment related) −0.057 (0.161)topic 8 (People related) 0.272 (0.198)comment is a question 0.068 (0.049)length of parent comment −0.003 (0.002)intercept 0.231∗∗ (0.115)

Observations 8,369Log Likelihood −14,899.820θ 0.548∗∗∗ (0.013)Akaike Inf. Crit. 29,841.640

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 6: Negative Binomial regression on score of the gif reply. The random-gif baseline is set as the referencecategory for models.

Reactions abandon threadReactions excitedReactions suspiciousReactions winReactions applauseReactions popcornReactions sleepyReactions nodReactions awwwReactions disappointedReactions ughReactions laughingReactions oh no you didntReactions smhReactions agreeReactions seriousReactions party hardReactions shut upReactions okReactions helpReactions smileReactions incredulousReactions yawnReactions idkReactions sexyReactions fist bumpReactions dancingReactions nomReactions ewwReactions helloReactions not badReactions successReactions burnReactions proudReactions i give upReactions heartsReactions pleasedReactions fmlReactions sorryReactions arousedReactions happy danceReactions good jobReactions wtfReactions seriouslyReactions wantReactions rageReactions table flipReactions loveReactions amusedReactions flirt

Reactions judging youTransportation truckTransportation spaceshipTransportation vanTransportation submarineTransportation motorcycleTransportation bmwTransportation helicopterTransportation chevroletTransportation volkswagenTransportation boatTransportation busTransportation porscheTransportation tankTransportation audiTransportation toyotaTransportation airplaneTransportation hovercraftTransportation nissanTransportation bicycleTransportation trainTransportation rocketTransportation yachtTransportation ferrariTransportation hondaTransportation sailboatTransportation carTransportation teslaHolidays mardi grasHolidays oktoberfestHolidays kwanzaaHolidays fathers dayHolidays fourth of julyHolidays mothers dayHolidays yom kippurHolidays st patricks dayHolidays memorial dayHolidays cinco de mayoHolidays labor dayHolidays rosh hashanahHolidays new yearsHolidays passoverScience global warmingScience astronomyScience physicsScience laserScience starsScience robotScience atomsScience meteor

Science bubblesScience medicineScience nebulaScience technologyScience mathematicsScience chemistryScience biologyScience planetsScience magnetsScience moleculesScience asteroidsScience spaceScience bill nyeScience engineeringScience diyScience nuclearScience computersFashion & Beauty chanelFashion & Beauty alexander mcqueenFashion & Beauty modelFashion & Beauty victorias secretFashion & Beauty pradaFashion & Beauty karlie klossFashion & Beauty jessica stamFashion & Beauty emily ratajkowskiFashion & Beauty miranda kerrFashion & Beauty kate uptonFashion & Beauty louis vuittonFashion & Beauty makeupFashion & Beauty kate mossFashion & Beauty cara delevingneFashion & Beauty runwayFashion & Beauty jourdan dunnFashion & Beauty julia nobisFashion & Beauty jewelryFashion & Beauty beautyFashion & Beauty chanel imanFashion & Beauty christian diorFashion & Beauty marc jacobsFashion & Beauty shoesFashion & Beauty dressFashion & Beauty gucciGreetings get wellGreetings byeGreetings im outGreetings sympathyGreetings thank youGreetings new babyGreetings im sorryGreetings congratulations

Greetings happy anniversaryGreetings heyGreetings welcomeGreetings cheersGreetings best friendsTV workaholicsTV successionTV blackishTV shark tankTV big brotherTV vanderpump rulesTV afvTV twin peaksTV its always sunny in

philadelphiaTV real housewives of new

york cityTV seinfeldTV american horror storyTV modern familyTV poldarkTV stranger thingsTV law and order svuTV big mouthTV greys anatomyTV bachelor in paradiseTV i love lucyTV the voiceTV boy meets worldTV the bacheloretteTV new girlTV south parkTV saturday night liveTV saved by the bellTV real housewives of new jer-

seyFood & Drink pancakesFood & Drink sandwichFood & Drink happy hourFood & Drink sushiFood & Drink steakFood & Drink pastaFood & Drink french toastFood & Drink mimosaFood & Drink teaFood & Drink whiskeyFood & Drink pickleFood & Drink cakeFood & Drink egg rollFood & Drink broccoli

Food & Drink vodkaFood & Drink breadFood & Drink cookieFood & Drink tacoFood & Drink cheeseFood & Drink brunchFood & Drink french friesFood & Drink appleFood & Drink orange fruitFood & Drink browniesFood & Drink wineFood & Drink hamFood & Drink saladFood & Drink pieFood & Drink sodaFood & Drink beerFood & Drink burritoFood & Drink bananaGaming donkey kongGaming max payneGaming gears of warGaming streets of rageGaming starfoxGaming metroidGaming segaGaming prince of persiaGaming spritesGaming final fantasyGaming wolfenstein 3dGaming call of dutyGaming earthboundGaming tetrisGaming video game physicsGaming nintendoGaming pacmanGaming game boyGaming tomb raiderGaming super marioGaming sonic the hedgehogGaming the last of usGaming half lifeGaming dead spaceGaming nesGaming super nintendoGaming animal crossingGaming n64Gaming atariGaming the simsGaming bioshockGaming portal

Gaming destiny the gameGaming 8bitGaming galagaGaming kirbyGaming mortal kombatGaming starcraftGaming duck huntGaming skyrimGaming grand theft autoGaming modsGaming metal gear solidGaming world of warcraftGaming super smash brosInterests new york cityInterests vampireInterests balletInterests summerInterests buttInterests winterInterests tumblrInterests roller coasterInterests robotInterests iphoneInterests workInterests theme parkInterests zombieInterests partyInterests babyInterests lgbtInterests internetInterests boyInterests alienInterests girlInterests vacationInterests boobsInterests ghostInterests autumnInterests springInterests clownCelebrities jean claude van dammeCelebrities paul scheerCelebrities denzel washingtonCelebrities bryan cranstonCelebrities chris prattCelebrities johnny deppCelebrities stephen colbertCelebrities emma watsonCelebrities macaulay culkinCelebrities heath ledgerCelebrities jim gaffigan

Celebrities mr. tCelebrities danny mcbrideCelebrities michael fassbenderCelebrities seth rogenCelebrities elijah woodCelebrities jon hammCelebrities tom hanksCelebrities kate uptonCelebrities arnold schwarzeneggerCelebrities tom hiddlestonCelebrities al pacinoCelebrities sean conneryCelebrities javier bardemCelebrities ken jeongCelebrities will smithCelebrities maya rudolphCelebrities jack mcbrayerCelebrities leonardo dicaprioCelebrities clint eastwoodCelebrities robert downey jrCelebrities michael ian blackCelebrities adrien brodyCelebrities tom hardyCelebrities joseph gordon levittCelebrities mark ruffaloCelebrities adam baldwinCelebrities rebel wilsonCelebrities jim carreyCelebrities melissa mccarthyCelebrities ashley bensonCelebrities rob huebelCelebrities julianne mooreCelebrities hayden panettiereCelebrities anna kendrickCelebrities will forteCelebrities ryan goslingCelebrities andrew garfieldCelebrities nick offermanCelebrities weird al yankovicCelebrities will arnettCelebrities bruce leeCelebrities christian baleCelebrities paul danoCelebrities eddie murphyCelebrities sam rockwellCelebrities mike tysonCelebrities jude lawCelebrities rooney maraCelebrities adam sandlerCelebrities chris hemsworth

Celebrities kristen wiigCelebrities james francoCelebrities adam scottCelebrities seth greenCelebrities jeremy rennerCelebrities morgan freemanCelebrities bradley cooperCelebrities dave chappelleCelebrities rachel mccadamsCelebrities nicolas cageCelebrities megan foxCelebrities robert redfordCelebrities elizabeth banksCelebrities liam neesonCelebrities willem dafoeCelebrities jonah hillCelebrities michael ceraCelebrities charlie sheenCelebrities emma robertsCelebrities jon stewartCelebrities patton oswaltCelebrities samuel l jacksonCelebrities alison brieCelebrities matt lucasCelebrities ellen pageCelebrities amanda bynesCelebrities jake gyllenhaalCelebrities rob loweCelebrities steve carellCelebrities conan obrienCelebrities cillian murphyCelebrities mindy kalingCelebrities ben stillerCelebrities john travoltaCelebrities gary oldmanCelebrities amy poehlerCelebrities ian somerhalderCelebrities richard pryorCelebrities bruce willisCelebrities daniel day lewisCelebrities chuck norrisCelebrities ed helmsCelebrities don cheadleCelebrities michael caineCelebrities george carlinCelebrities alia shawkatCelebrities emma stoneCelebrities adam devineCelebrities larry davidCelebrities taylor kitsch

Celebrities matthew perryCelebrities dave francoCelebrities harrison fordCelebrities olivia munnCelebrities emily bluntCelebrities mila kunisCelebrities ru paulCelebrities jason batemanCelebrities anne hathawayCelebrities tracy morganCelebrities natalie portmanCelebrities brad pittCelebrities tom cruiseCelebrities sylvester stalloneCelebrities tina feyCelebrities dolph lundgrenCelebrities tony haleCelebrities donald gloverCelebrities paul ruddCelebrities angelina jolieCelebrities scarlett johanssonCelebrities david crossCelebrities alec baldwinCelebrities david duchovnyCelebrities will ferrellCelebrities chris rockCelebrities adam brodyCelebrities jennifer lawrenceCelebrities aubrey plazaCelebrities jackie chanCelebrities alexa chungCelebrities ricky gervaisCelebrities jessica walterActions cookingActions fightingActions smilingActions laughingActions dreamingActions cryingActions spinningActions tossing drinkActions sleepingActions eatingActions sneezingActions singingActions poutActions slappingActions finger gunsActions runningActions swimming

Actions fallingActions smokingActions flirtingActions dancingActions breaking upActions drinkingActions faintingEmotions shockedEmotions boredEmotions unimpressedEmotions sickEmotions stressedEmotions nervousEmotions sadEmotions relaxedEmotions sassyEmotions tiredEmotions reactionEmotions hungryEmotions scaredEmotions angryEmotions drunkEmotions lonelyEmotions painEmotions excitedEmotions happyEmotions surprisedEmotions inspiredEmotions suspiciousEmotions frustratedEmotions loveEmotions embarrassedEmotions disappointedSports hockeySports rugbySports nhlSports rock climbingSports divingSports formula oneSports rowingSports skydivingSports mmaSports lacrosseSports ufcSports volleyballSports softballSports mlbSports martial artsSports horse racingSports skiing

Sports swimmingSports roller skatingSports footballSports tennisSports nbaSports boxingSports parkourSports nascarSports golfArt & Design artArt & Design typographyArt & Design illustrationArt & Design transparentArt & Design glitchArt & Design pixelArt & Design morphArt & Design black and whiteArt & Design geometryArt & Design collageArt & Design architectureArt & Design psychedelicArt & Design 3dArt & Design mash upArt & Design photographyArt & Design loopArt & Design cinemagraphArt & Design sculptureArt & Design timelapseArt & Design designArt & Design animationMemes sips teaMemes steal yo girlMemes arthurMemes crying dawsonMemes confusedMemes deal with itMemes like a bossMemes hair flipMemes forever aloneMemes look at all the fucks i giveMemes cucaMemes judge judyMemes feelsMemes failMemes dank memesAdjectives vintageAdjectives sexyAdjectives brightAdjectives darkAdjectives hot

Adjectives slow motionAdjectives cuteAdjectives coldAdjectives funnyAdjectives weirdAdjectives trippyAdjectives black and whiteAdjectives prettyAdjectives scaryAdjectives creepyAdjectives hdAnimals lizardAnimals meerkatAnimals otterAnimals cowAnimals caterpillarAnimals koalaAnimals corgiAnimals penguinAnimals duckAnimals elephantAnimals raccoonAnimals hippoAnimals kangarooAnimals chickenAnimals monkeyAnimals ferretAnimals sealAnimals owlAnimals jellyfishAnimals bulldogAnimals crabAnimals butterflyAnimals giraffeAnimals pandaAnimals pigAnimals red pandaAnimals grumpy catAnimals sheepAnimals turtleAnimals wolfAnimals lionAnimals birdAnimals hamsterAnimals polar bearAnimals goatAnimals whaleAnimals mouseAnimals camelAnimals chihuahua

Animals skunkAnimals squirrelAnimals frogAnimals horseAnimals pugAnimals tigerAnimals unicornAnimals bearAnimals poodleMovies the fifth elementMovies the breakfast clubMovies addams familyMovies breakfast at tiffanysMovies cry babyMovies donnie darkoMovies waynes worldMovies say anythingMovies the godfatherMovies blue velvetMovies the princess brideMovies cluelessMovies ghostbustersMovies spidermanMovies sixteen candlesMovies ace venturaMovies the blues brothersMovies fight clubMovies indiana jonesMovies the notebookMovies get outMovies the matrixMovies star warsMovies night of the living deadMovies the shiningMovies 500 days of summerMovies bladerunnerMovies elfMovies the big lebowskiMovies some like it hotMovies american psychoMovies easy riderMovies reservoir dogsMovies texas chainsaw massacreMovies the avengersMovies beetlejuiceMovies labyrinthMovies scarfaceMovies spring breakersMovies rockyMovies pretty in pink

Movies the dark knightMovies citizen kaneMovies edward scissorhandsMovies kill billMovies casablancaMovies pulp fictionMovies terminatorMovies zoolanderMovies bridesmaidsMovies dodgeballMovies heathersMovies lost boysMovies the gooniesMovies hocus pocusMovies the hangoverIdentity native americanIdentity muslimIdentity love is loveIdentity bisexualIdentity asianIdentity times upIdentity queerIdentity non binaryIdentity gayIdentity lesbianNews & Politics republicanNews & Politics cory bookerNews & Politics economyNews & Politics irsNews & Politics democratNews & Politics supreme courtNews & Politics bernie sandersNews & Politics bill clintonNews & Politics kamala harrisNews & Politics julian castroNews & Politics white houseNews & Politics senateNews & Politics joe bidenNews & Politics presidentNews & Politics tax dayNews & Politics elizabeth warrenNews & Politics pete buttigiegNews & Politics protestNews & Politics climate changeNews & Politics nancy pelosiNews & Politics congressNews & Politics rbgNews & Politics taxesCartoons & Comics snow whiteCartoons & Comics peter pan

Cartoons & Comics dougCartoons & Comics mulanCartoons & Comics harvey birdmanCartoons & Comics the criticCartoons & Comics hotel transylvaniaCartoons & Comics gi joeCartoons & Comics wile e coyoteCartoons & Comics popeyeCartoons & Comics regular showCartoons & Comics aeon fluxCartoons & Comics the little mermaidCartoons & Comics fosters home for imagi-

nary friendsCartoons & Comics animaniacsCartoons & Comics gumbyCartoons & Comics adult swimCartoons & Comics the jetsonsCartoons & Comics muppet babiesCartoons & Comics beavis and buttheadCartoons & Comics archie comicsCartoons & Comics mickey mouseCartoons & Comics captain planetCartoons & Comics peanutsCartoons & Comics ren and stimpyCartoons & Comics underdogCartoons & Comics george of the jungleCartoons & Comics gravity fallsCartoons & Comics grinch who stole christmasCartoons & Comics mr magooCartoons & Comics top catCartoons & Comics dexters laboratoryCartoons & Comics tangledCartoons & Comics betty boopCartoons & Comics king of the hillCartoons & Comics pink pantherCartoons & Comics tailspinCartoons & Comics tweety birdCartoons & Comics disneyCartoons & Comics sleeping beautyCartoons & Comics aladdinCartoons & Comics toy storyCartoons & Comics alvin and the chipmunksCartoons & Comics teen titansCartoons & Comics tom and jerryCartoons & Comics minnie mouseCartoons & Comics my little ponyCartoons & Comics the incrediblesCartoons & Comics pinocchioCartoons & Comics rockos modern lifeCartoons & Comics jem and the holograms

Cartoons & Comics the flintstonesCartoons & Comics garfieldCartoons & Comics looney tunesCartoons & Comics calvin and hobbesCartoons & Comics batmanCartoons & Comics rugratsCartoons & Comics home moviesCartoons & Comics scooby dooCartoons & Comics speed racerCartoons & Comics the venture brosCartoons & Comics daffy duckCartoons & Comics wall eCartoons & Comics carsCartoons & Comics 101 dalmatiansCartoons & Comics beauty and the beastCartoons & Comics porky pigCartoons & Comics schoolhouse rockCartoons & Comics rocky and bullwinkleCartoons & Comics sealab 2021Cartoons & Comics hey arnoldCartoons & Comics josie and the pussycatsCartoons & Comics arthurCartoons & Comics aqua teen hunger forceCartoons & Comics magical game timeCartoons & Comics space ghostCartoons & Comics cartoon networkCartoons & Comics family guyCartoons & Comics the lion kingCartoons & Comics winnie the poohCartoons & Comics phinas and ferbCartoons & Comics homestuckCartoons & Comics dariaCartoons & Comics fat albertCartoons & Comics the oatmealCartoons & Comics yogi bearCartoons & Comics fantasiaCartoons & Comics bambiCartoons & Comics samurai jackCartoons & Comics the powerpuff girlsCartoons & Comics cyanide and happinessCartoons & Comics teenage mutant ninja tur-

tlesCartoons & Comics pocahontasCartoons & Comics voltronCartoons & Comics south parkCartoons & Comics finding nemoCartoons & Comics metalocaypseCartoons & Comics dreamworksCartoons & Comics alice in wonderlandCartoons & Comics johnny bravo

Decades 80sDecades vintageDecades 30sDecades 60sDecades 50sDecades 70sDecades 40sDecades 90sDecades 20sWeird 80sWeird vintageWeird ghostWeird zombieWeird morphWeird psychedelicWeird vampireWeird alienWeird 90sWeird robotWeird clownStickers cat stickersStickers excited stickersStickers love stickersStickers animatedtext stickersStickers emoji stickersStickers weird stickersStickers high five stickersStickers birthday stickersStickers party stickersStickers cheeseburger stickersStickers happy stickersStickers dinosaur stickersNature sunNature wavesNature windNature riverNature mistNature desertNature moonNature waterfallNature starsNature tsunamiNature coralNature glacierNature weatherNature beachNature sunriseNature cometNature oceanNature ice

Nature crystalsNature forestNature sunsetNature fireNature lavaNature reefNature tornadoNature northern lightsNature landscapeNature prairieNature nightNature plantsNature caveNature treesNature constellationsNature cloudsNature hurricaneNature sandNature mushroomsNature snowNature geyserNature lakeNature mountainsNature smokeNature rainbowMusic action bronsonMusic adeleMusic frank oceanMusic kendrick lamarMusic the beatlesMusic mc hammerMusic zayn malikMusic nicki minajMusic backstreet boysMusic lizzoMusic clMusic snoop doggMusic madonnaMusic usherMusic vampire weekendMusic the rolling stonesMusic g dragonMusic jennifer lopezMusic janet jacksonMusic destinys childMusic lady gagaMusic jay zMusic elvis presleyMusic bruno marsMusic cardi b

Music tlcMusic david bowieMusic coldplayMusic kpopMusic missy elliottMusic solangeMusic whitney houstonMusic carrie underwoodMusic shakiraMusic britney spearsMusic lil nas xMusic mariah careyMusic selena gomezAnime samurai champlooAnime fullmetal alchemistAnime bleachAnime spaceship battleship yam-

atoAnime mangaAnime hetaliaAnime princess mononokeAnime my neighbor totoroAnime cowboy bebopAnime kawaiiAnime kibaAnime berserkAnime evangelionAnime black lagoonAnime inuyashaAnime ninja scrollAnime sakuraAnime hayao miyazakiAnime cardcaptor sakuraAnime rock leeAnime code geassAnime kakashi hatakeAnime hinata hyugaAnime death noteAnime gundam

C List of selected tags from GIPHY

adorable, agreed, amazing, amused, angry, an-noyed, anxiety, anxious, applause, approval, ap-prove, aw, awesome, awkward, bad, beautiful, bestwishes, blank stare, blink, blush, bored, bow, bravo,but why, buy, bye, captivated, celebrate, cheeky,cheering, cheers, clap, come on, comic, compli-ment, compliments, concerned, confused, congrat-ulations, cool, crazy, creeping, cringe, crushing,cry, curtsy, cute, damn, dance, dancing, deadpanstare, debate, depressed, dickhead, disagree, dis-

appointed, disapprove, disbelief, disgust, dislike,diss, divertente, dont care, doubt, doubtful, drink,drinking, drunk, dubious, dying, eating, eating pop-corn, embarassed, engrossed, ennui, excited, facepalm, faint, fingers crossed, flirt, flushed, freakingout, frustrated, fuck, fun, funny, gagging, get well,glare, good luck, gossip, grateful, gratitude, great,great job, grin, hahahah, happy, happy dance, headshake, hide, high five, hilarious, honestly, hope,horror, hugs, hugs love, hysterical, ill, impressed,incredulous, insult, interested, interesting, judge,judging you, just, keep going, kiss, laugh, leav-ing, lets go, lies, like, looking, looking around,love you, lovely, luv u, luv you, mad, mind blown,mock, motivational, moved, muah, much appreci-ated, nah, nasty, need, nervous, nice one, no, nod,not amused, not funny, not interested, oh shit, over-whelmed, panic, partying, perfect, pissed, please,pleased, pointing, praise, pray, pregnant, proud,pumped, questioning, raises hand, realization, re-lief, respect, reunited, roast, roll eyes, sad, sadness,salute, sarcastic, savage, scared, scary, scream-ing, secret, seriously, sexy, shame, shock, shook,shrug, shut up, shy, sigh, sips tea, sitting, sleepy,sloth, smart, smile, smug, sobbing, sorpren, sorry,spit, stoked, stressed, stunned, success, sudden re-alization, surprise, suspicious, sweating, swoon,swooning, take notes, tantrum, tears, thank, think,thirsty, thumbs down, thumbs up, tired, too funny,touched, unamused, unbelievable, uncomfortable,unhappy, unimpressed, unsure, upset, vomit, wait-ing, wave, weary, weird, whatever, will, wince,wink, wrestling, yawn, yell, yes, yum

D List of filtering keywords on Imgurexperiment

depression, depressing, mental, health, death, dead,alcohol, alcoholism, weed, drugs, addiction, covid,beer, stoned, black, white, arabic, hispanic, latino,latina, latinx, police, cop, racism, racists, race, sex-ism, sexist, sexy, armed, overthrow, government,republican, democrats, maga, liberal, liberals, con-servative, conservatives, offender, victim, disabil-ity, disabled, jerking, PD, gun, shots, fired, cops,officer, officers, killing, murder, murdered, kill,kills, killed, murders, shoot, taser, bystander, trig-ger, handgun, pansexual, sexuality, homosexual,gay, lesbian, corona, virus, coronavirus, vaccine,vaccinated, viruses, vaccination, die, fascist, fas-cists, antifa, sharia, islam, islamic, christian, jewish,muslim, blasphemy, blasphemic, death, conviction,

church, priest, pastor, religious, religion, sharia,shia, sunni, judge, bible, qaran, torah, hindu, hin-dus, christians, jew, jews, muslims, islamist, ex-ecute, murder, captive, captives, malpractice, in-surance, insured, threat, threatening, war, troops,violence, fighting, conflict, medicine, prescription,drug, dying, hospice, life, doctor, hospital, nurse,pedophiles, pedophile, bitch, republicans, demo-crat, coup, tax, recession, pedo, criminal, criminals,politician, politicians, health, healthcare, america,american, voter, voting, votes, vote, voters, citizen,immigrants, immigrant, citizens, candian, canada,eu, european, trump, red, blue, cancer, slavery,slaves, slave, disease, sickness, sorry, nazi, nazis,death, pro-death, pro-life, profile, abortion, aborted,aborting, victims, jail, whore, slut, rape, raped, rap-ing, behead, beheadings, beheaded, torture, tor-tured, torturing, taliban, afghanistan, soldier, sol-diers, kabul

Tag based CLIP variant PEPE

2gG2xiMTtFwsg lfesfEtobCSbsHzC8d tnYri4n2Frnig

fnjxvV295sWEJjvwXU m9d3Xif3ShZ42CxlWP5wWf7GR2nhgamhRnEuA

BAPSj0xM1cFe8f9k1tV7HyORcngKF8v

5gw0VWGbgNm8w

iJsvRxNTAcup6DVfLP loitbnzQ1JQ8Iizx8w iXTrbbYMQBCMM

3oEjHLcg4QMU5umb9m

bfrlODgSLqXxS65ODCwM00NVmEyLsX3

aKrTvuOv4hlKM4HmjGg306HiLHWlm2f 26AHLBZUC1n53ozi8

3oKIPllDN24q8Awtwc7J26CGAahos6d5S1A6 3o8doT9BL7dgtolp7O

jIu44mYwUItSHTW3tj 8hZ9FMolyKc0X8BSr7Fq6Bdki3coEWQ

jTrWAzlFGfvVY34PSJ iqkHA3DmB8GjORY030 3oEjHAUOqG3lSS0f1C

l396L17pwHWOIJrTG OOzcnk3PzLDHqWs6TbKzyMcEfDh4Jiw

Table 7: Examples of top 10 most frequently used gifs across all models in the RCT. Click an image to view thegif on Giphy. Images are ordered from most-used (top) to tenth-most (bottom).

Dependent variable:

Cumulative number of replies received

gif reply score 0.096∗∗∗ (0.010)post score −0.0004∗∗∗ (0.0001)comment score 0.0002 (0.0002)CLIP variant model −0.196 (0.152)distribution-sampling model −0.664∗∗∗ (0.160)PEPE model −0.450∗∗∗ (0.138)Tag-based model −0.195 (0.146)number of days after reply −0.001 (0.001)comment text polarity 0.048 (0.164)comment text subjectivity −0.055 (0.147)topic 0 (Politics related) −0.275 (0.430)topic 1 (Family & Pets related) −0.264 (0.412)topic 2 (Employment related) −1.182∗∗ (0.549)topic 3 (Social media related) 1.381∗∗∗ (0.421)topic 4 (Transportation related) −0.021 (0.514)topic 5 (Food related) −0.896 (0.567)topic 6 (COVID related) −0.459 (0.564)topic 7 (Entertainment related) −0.529 (0.452)topic 8 (People related) −1.776∗∗∗ (0.647)comment is a question 0.114 (0.133)length of parent comment 0.0003 (0.007)intercept −1.877∗∗∗ (0.313)

Observations 8,369Log Likelihood −2,466.965θ 0.143∗∗∗ (0.013)Akaike Inf. Crit. 4,977.930

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 8: Negative Binomial regression on cumulative number of replies received. The random-gif baseline is setas the reference category for model comparison.

Topic Dirichlet parameter Keywords

0 0.1172 people fuck trump shit make thing country n’t vote fucking1 0.20164 good time love kid make cat dog day year guy2 0.09554 pay work money people make job year buy time company3 0.11245 post make read people good time thing imgur video work4 0.06541 car live year drive day place time road city back5 0.05672 eat make food good water drink taste cheese pizza coffee6 0.06662 people covid die vaccine life make work problem mask n’t7 0.0888 movie play game good watch show love great time song8 0.02752 wear mask red shirt woman hair white man hat black9 0.14292 back make put hand time guy car head thing big

Table 9: Topic modeling keywords for Imgur Comments