arXiv:2101.10917v1 [cs.CL] 26 Jan 2021


I Beg to Differ: A study of constructive disagreement in online conversations

Christine De Kock & Andreas Vlachos
University of Cambridge

Department of Computer Science and Technology
{cd700,av308}@cam.ac.uk

Abstract

Disagreements are pervasive in human communication. In this paper we investigate what makes disagreement constructive. To this end, we construct WikiDisputes, a corpus of 7 425 Wikipedia Talk page conversations that contain content disputes, and define the task of predicting whether disagreements will be escalated to mediation by a moderator. We evaluate feature-based models with linguistic markers from previous work, and demonstrate that their performance is improved by using features that capture changes in linguistic markers throughout the conversations, as opposed to averaged values. We develop a variety of neural models and show that taking into account the structure of the conversation improves predictive accuracy, exceeding that of feature-based models. We assess our best neural model in terms of both predictive accuracy and uncertainty by evaluating its behaviour when it is only exposed to the beginning of the conversation, finding that model accuracy improves and uncertainty reduces as models are exposed to more information.

1 Introduction

Disagreements online are a familiar occurrence for any internet user. While they are often perceived as a negative phenomenon, disagreements can be useful; as is illustrated in Figure 1, disagreements can lead to an improved understanding of a topic by introducing and evaluating different perspectives. Research on online disagreements has focused on mostly negative aspects such as trolling (Cheng et al., 2017), hate speech (Waseem and Hovy, 2016), harassment (Yin et al., 2009) and personal attacks (Wulczyn et al., 2017). Recent works by Zhang et al. (2018) and Chang and Danescu-Niculescu-Mizil (2019) study conversations on Wikipedia talk pages that start out as civil but derail into personal attacks, using linguistic markers and deep learning approaches, respectively.

Figure 1: An example of a dispute on Wikipedia, showing the dispute tag on the article, talk page posts, and summaries of edits occurring during the discussion.

An alternative approach is to study good faith disagreement through debates on online platforms such as ChangeMyView (Tan et al., 2016), debate.org (Durmus and Cardie, 2019) and Kialo (Boschi et al., 2019), or formal Oxford-style debates (Zhang et al., 2016a). While this has benefits, such as readily available annotation of stances and indication of winning and losing sides, it does not mirror the way people naturally converse.


For example, in formal debates, temporal and structural constraints are imposed. Moreover, in Oxford-style debates, participants are not motivated to come to a consensus but rather to persuade an audience of a predefined stance (Zhang et al., 2016a). Finally, the task of detecting disagreement in conversations has been studied by a number of authors, e.g. Wang and Cardie (2014) and Rosenthal and McKeown (2015), without attempting to analyze it further.

We are interested instead in constructive disagreement in an uncoerced setting. To this end, we present WikiDisputes¹; a corpus of 7 425 disagreements (totalling 99 907 utterances) mined from Wikipedia Talk pages. To construct it, we map dispute tags in the platform’s edit history (shown in Figure 1) to conversations in WikiConv (Hua et al., 2018) to locate conversations that relate to content disputes. Observing that conversations are often conducted in the Talk pages and the edit summaries simultaneously, as illustrated in Figure 1, we augment the WikiConv conversations with edit summaries that occur concurrently. To investigate the factors that make a disagreement constructive, we define the task of predicting whether a dispute is eventually considered resolved by its participants, or it was escalated by them to mediation by a moderator, and therefore considered unconstructive.

We evaluate both feature-based and neural models for this task. Our findings indicate that balancing hedging and certainty plays an important role in constructive disagreements. We find that incorporating edit summaries (which have been ignored in previous work on Wikipedia discussions) improves model performance on predicting escalation. We further find that including information about the conversation structure aids model performance. We observe this in two ways. Firstly, we include gradient features, which capture changes in linguistic markers throughout the conversations as opposed to averaged values. Secondly, we experiment with adding sequential and hierarchical conversation structure to our neural models, finding that a Hierarchical Attention Network (Yang et al., 2016) provides the best performance on our task.

We further evaluate this model in terms of its predictive accuracy when exposed to the beginnings of the conversation as opposed to completed conversations, as well as its uncertainty (more details in Section 6.4). Our results indicate that model performance is reduced from a PR-AUC of 0.29 to 0.19 when exposed to only the first half of the conversation. However, this reduced model still outperforms toxicity and sentiment models predicting on full conversations. Model uncertainty decreases roughly linearly as the model is exposed to more information. We conduct a qualitative analysis of the points in the conversation where the model changes its prediction on constructiveness, finding that some markers from our feature-based models also seem to have been picked up by the neural model.

¹ github.com/christinedekock11/wikidisputes

2 Dispute resolution on Wikipedia

Wikipedia relies heavily on collaboration and discussion by editors to maintain content standards, using the platform’s Talk pages (Wikipedia, 2020c). Wikipedia contributors add “dispute” templates to articles (Wikipedia, 2020a) referring to a content accuracy dispute or to a violation of the platform’s neutral point of view (NPOV) policy (Wikipedia, 2020b). Adding such a template to the article creates a dispute tag on the article as shown in Figure 1. Variations of these tags allow contributors to flag specific sections or phrases which relate to the issue in question, or to add details of the dispute.

All edits to Wikipedia, including the aforementioned dispute tags, are retained in the site’s edit history. Revision metadata contains a description of the edit (hereafter referred to as the edit summary). As observed in Figure 1, when such edits occur while a conversation is underway, editors use the edit summaries to clarify their intentions in the context of the conversation.

The Wikipedia dispute resolution policy stipulates that if contributors are unable to reach a consensus through Talk page discussion, they can request mediation by a community volunteer. A prerequisite for this escalation step is proof of recent, extensive discussion on a Talk page².

3 WikiDisputes

Our dataset consists of three facets: disagreements on Talk pages, edit summaries, and escalation labels. Wikipedia informs users that all contributions on article or Talk pages are published publicly, and provides channels for users to permanently remove personal information that they have shared accidentally or otherwise (Wikipedia, 2020d), thus we are able to release this data to the community.

² https://en.wikipedia.org/wiki/Wikipedia:Dispute_resolution_noticeboard

Figure 2: An illustration of the relationship between the revision history and Talk page conversations, using later utterances from the dispute in Figure 1. Two consecutive edits and their summaries are shown side by side.

In the remainder of this section, we detail the process for obtaining the conversations, the edit summaries and the labels.

3.1 Finding disagreements

We use the Wikipedia revision history dump³ from 1 July 2020 to find content disputes. We locate the addition of dispute templates in the article history using regular expressions that account for variations in the free text dispute templates. Using the revisions where dispute tags are added, we can also determine which user logged the dispute, the timestamp, and the article and section in dispute.
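
The exact regular expressions are not given in the paper; the sketch below is only an illustration of the approach, with the template names and pattern assumed from the {{Disputed}} and {{POV}} templates referenced in Section 2.

# Illustrative sketch only: the pattern and template names below are
# assumptions, not the authors' actual expressions.
import re

# Matches e.g. "{{Disputed}}", "{{POV|date=July 2020}}" or
# "{{Disputed section|talk=...}}", with optional free-text parameters.
DISPUTE_TEMPLATE = re.compile(
    r"\{\{\s*(disputed|pov)(?:[ _-]?section)?\s*(\|[^}]*)?\}\}",
    flags=re.IGNORECASE,
)

def added_dispute_tags(old_wikitext: str, new_wikitext: str):
    """Return dispute templates present in a revision but not in its parent."""
    old_tags = {m.group(0) for m in DISPUTE_TEMPLATE.finditer(old_wikitext)}
    return [m.group(0) for m in DISPUTE_TEMPLATE.finditer(new_wikitext)
            if m.group(0) not in old_tags]

print(added_dispute_tags("", "Some text. {{POV|date=July 2020}}"))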

To find the related WikiConv conversation for each dispute tag, we select all conversations on the relevant article’s Talk page in which the user who logged the dispute was involved. Of these candidates, the conversation closest in time to the timestamp of the dispute tag addition is selected. We filter the conversations extracted above using a number of heuristics in the pursuit of high quality samples of constructive disagreements. To ensure conversations of adequate length, we set a minimum conversation length of five utterances and 250 tokens. Based on an inspection of 100 conversations, we remove conversations of more than 50 utterances, as such conversations are often flame wars. We further include only conversations in which more than one user participated. Using this methodology, we obtain 7 425 content disputes.

³ https://dumps.wikimedia.org/enwiki/
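
A minimal sketch of the filtering heuristics above, assuming a conversation is represented as a list of utterance records with "user" and "text" fields (the actual WikiConv schema may differ):

# Sketch of the filtering heuristics; field names are assumptions.
def keep_conversation(utterances):
    """utterances: list of dicts with at least 'user' and 'text' keys."""
    n_utts = len(utterances)
    n_tokens = sum(len(u["text"].split()) for u in utterances)
    n_users = len({u["user"] for u in utterances})
    return (5 <= n_utts <= 50          # adequate length, but not a flame war
            and n_tokens >= 250        # minimum total length in tokens
            and n_users > 1)           # more than one participant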

Our manual inspection further indicates that the usage of the dispute tag is not entirely consistent across the platform; editors sometimes use the tag to report that they dispute some aspect of the text before discussing it, rather than to flag a currently occurring dispute in the Talk page. However, the former frequently results in a disagreement. We found that 88 out of 100 inspected examples were correctly identified disagreements, which we believe is an acceptably low error rate.

3.2 Collecting edit summaries

We show in Figure 2 two further posts from the climate dispute (continued from Figure 1), along with two edits that occurred during the course of the conversation. We note here that including the edit summaries as part of the conversation is important for understanding the conversation; without this, the reference to “infuding” in this example would be incomprehensible. We thus include all edit summaries written by users involved in the conversation, which are timestamped between the first and last utterance in the conversation.
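
The inclusion rule for edit summaries can be sketched as follows, again under assumed field names ("user", "timestamp", "summary") rather than the authors' actual data layout:

# Sketch of the edit-summary inclusion rule: keep summaries written by
# conversation participants and timestamped within the conversation's span.
def relevant_edit_summaries(utterances, edits):
    participants = {u["user"] for u in utterances}
    start = min(u["timestamp"] for u in utterances)
    end = max(u["timestamp"] for u in utterances)
    return [e for e in edits
            if e["user"] in participants and start <= e["timestamp"] <= end]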

3.3 Escalation tags

For the task of inferring whether a dispute was escalated to mediation (and therefore Talk page discussion was unsuccessful), we scrape mediated cases from the Dispute Resolution Noticeboard archives and align them with WikiConv conversations.

Of the 2 520 archived mediation processes, 149 were closed as failures and 237 as successes, with the remaining 2 134 receiving “General closures”. The last label is frequently assigned due to insufficient Talk page discussion. We only include accepted mediations in our dataset and thus ignore “General closures”, resulting in 386 mediations.

We record the location of the disagreement to which the mediation relates, the usernames of participants and the timestamp. We use a similar alignment procedure to that described in Section 3.1 to find the relevant WikiConv conversations, except that we include the usernames of all participants listed in the mediation. During alignment, a simple match is recorded when there is exactly one conversation on the article Talk page involving all listed users (118 cases out of 386).
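
A sketch of this “simple match” rule, assuming each candidate conversation is paired with the set of usernames that participate in it:

# Sketch of the "simple match" rule: a mediation is aligned automatically
# only if exactly one candidate conversation involves all listed users.
def simple_match(mediation_users, candidate_conversations):
    """candidate_conversations: list of (conversation_id, set_of_users)."""
    hits = [cid for cid, users in candidate_conversations
            if set(mediation_users) <= users]
    return hits[0] if len(hits) == 1 else None   # otherwise inspect manually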

There are 111 cases in which no full matches occur; this sometimes happens when a participant changes their username or is participating anonymously. We manually inspect potential matches in this case. A further 67 cases have multiple matches; this happens when a participant changes their username, or is participating anonymously, or the Talk page has been archived; in this case we try to find the conversation most related to the mediation summary.

We use only the utterances which are timestamped from before the conversation was escalated and apply the same filters outlined in Section 3.1. Using this methodology, we are able to find 201 disagreements that are escalated to mediation. The complementary class label (not escalated) is assigned to the disagreements extracted using the dispute tags, for which no mediation was found.

Our classification task uses escalation as a proxy label for cases where Talk page discussion was not constructive. This relies on the assumption that non-escalated disagreements are more constructive than escalated disagreements, even though disagreements that were not escalated are not guaranteed to be successfully resolved; participants may simply not be aware of the escalation procedure, or not be willing to go through the extra effort of participating in mediation. To validate this, we annotated 35 samples from the dataset as to whether they represented constructive disagreements, and our annotations agreed with the escalation labels in 25 of these. Combined with the low estimates of antisocial behaviour on the site (Wulczyn et al., 2017), we draw the inference that the non-escalated class contains more constructive disagreements than the escalated ones. Throughout the remainder of this paper, we use these terms interchangeably.

4 Modelling constructive disagreement

4.1 Feature-based models

We develop our feature-based models using feature sets suggested in previous research on tasks related to conversational analysis. These include:

• Politeness: The politeness strategies from Zhang et al. (2018) as implemented in Convokit (Chang et al., 2020), which capture greetings, apologies, directness, and saying “please”, etc.

• Collaboration: Conversation markers in collaborative discussions (Niculae and Danescu-Niculescu-Mizil, 2016), which capture the introduction and adoption of ideas, uncertainty and confidence terms, pronoun usage, and linguistic style accommodation.

• Toxicity: Toxicity and severe toxicity scores as estimated by the Perspective API (Wulczyn et al., 2017) and included in WikiConv (Hua et al., 2018).

• Sentiment: Positive and negative sentiment word counts, as per the lexicon of Liu et al. (2005) and implemented in Convokit (Chang et al., 2020).

For each feature, we calculate the average value throughout the conversation, as well as the gradient of a straight line fit of the feature value throughout the conversation. Our intuition for including this variation is that the outcome of a disagreement is likely to be influenced by a change in the language usage through the course of a conversation, which is not reflected by the mean. For instance, a disagreement might start out very politely and end impolitely, or the other way around, and would have the same mean. Logistic regression is used to infer linear relationships between the linguistic features and disagreement outcomes.
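
The two feature variants can be computed as below; this is a sketch that assumes each marker has already been scored per utterance, with the gradient taken as the slope of a least-squares straight-line fit:

# Sketch of the mean and gradient feature variants for one linguistic marker.
import numpy as np

def mean_and_gradient(per_utterance_values):
    y = np.asarray(per_utterance_values, dtype=float)
    x = np.arange(len(y))
    slope = np.polyfit(x, y, deg=1)[0] if len(y) > 1 else 0.0
    return y.mean(), slope

# e.g. politeness scores per utterance: starts polite, ends impolite.
# The reverse ordering has the same mean but the opposite gradient sign.
print(mean_and_gradient([0.9, 0.7, 0.4, 0.1]))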

4.2 Neural models

Dialogue structure has been difficult to capture in feature-based models due to sparsity. For this reason, we implement a number of neural models with increasing capacities for modelling conversation structure to assess its importance.

Averaged embeddings Our simplest variant averages the GloVe embeddings (Pennington et al., 2014) of all words in a conversation, and uses a fully connected layer for classification. It ignores both utterance hierarchy and word ordering and can be seen as a bag-of-word-embeddings approach.
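
A minimal sketch of this baseline, assuming GloVe vectors have been loaded into a token-to-vector dictionary (loading is elided) and using tf.keras for the classification layer:

# Sketch of the bag-of-word-embeddings baseline: one averaged vector per
# conversation, classified by a single fully connected layer.
import numpy as np
import tensorflow as tf

def average_embedding(tokens, glove, dim=100):
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

avg_model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(100,))
])
avg_model.compile(optimizer="adam", loss="binary_crossentropy")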

LSTM Instead of averaging the GloVe embeddings, we use a bidirectional LSTM-based model (Hochreiter and Schmidhuber, 1997) to process the sequence of words in the conversation; ignoring utterance hierarchy but preserving word order.

HAN We use a Hierarchical Attention Network (HAN) (Yang et al., 2016) to model both word order and utterance hierarchy. HANs have recently been used to predict emotion in conversations (Ma et al., 2020) and are useful for capturing the structure of texts by building up utterance embeddings from word embeddings, and then constructing a conversation vector from these utterance embeddings. Given a sequence of words in an utterance, the words are first mapped to vectors through an embedding matrix (GloVe embeddings, in our case). A bidirectional LSTM layer (Hochreiter and Schmidhuber, 1997) is used to process the sequence of embeddings and calculate an embedding for each word in the context of the utterance. An attention mechanism (Bahdanau et al., 2015) is applied to these embeddings to find an utterance embedding. A similar procedure is followed to build up a conversation vector based on the utterance embeddings. This vector is then used to perform the classification. To incorporate edit summaries in a conversation, we prepend each edit summary with a special token (<EDIT>) to indicate its origin.
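
The sketch below illustrates this architecture in tf.keras under assumed sequence lengths and vocabulary size; the attention pooling follows Yang et al. (2016), and plain binary cross-entropy stands in for the Focal Loss used in the paper (Section 5).

# A sketch of a Hierarchical Attention Network; padding lengths, vocabulary
# size and embedding initialisation are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_UTTS, MAX_WORDS, VOCAB, EMB_DIM = 16, 64, 20000, 100   # assumed sizes

class AttentionPool(layers.Layer):
    """Additive attention pooling over the time axis (Yang et al., 2016)."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.proj = layers.Dense(dim, activation="tanh")   # u_t = tanh(W h_t + b)
        self.score = layers.Dense(1, use_bias=False)       # similarity with a context vector
    def call(self, x):
        weights = tf.nn.softmax(self.score(self.proj(x)), axis=1)   # (batch, T, 1)
        return tf.reduce_sum(weights * x, axis=1)                   # (batch, dim)

# Word-level encoder: embeddings -> biLSTM -> attention pooling into an utterance vector.
word_in = layers.Input(shape=(MAX_WORDS,), dtype="int32")
emb = layers.Embedding(VOCAB, EMB_DIM)(word_in)   # initialised from GloVe in practice
h_words = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(emb)
utterance_encoder = Model(word_in, AttentionPool()(h_words))

# Conversation-level encoder: apply the utterance encoder to every utterance
# (edit summaries are treated as utterances prefixed with an "<EDIT>" token),
# then run a biLSTM + attention over the utterance vectors.
conv_in = layers.Input(shape=(MAX_UTTS, MAX_WORDS), dtype="int32")
utt_vecs = layers.TimeDistributed(utterance_encoder)(conv_in)
h_utts = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(utt_vecs)
conv_vec = AttentionPool()(h_utts)
output = layers.Dense(1, activation="sigmoid")(layers.Dropout(0.3)(conv_vec))

han = Model(conv_in, output)
# The paper trains with Focal Loss and Adam (learning rate 0.001); binary
# cross-entropy is used here to keep the sketch free of extra dependencies.
han.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="binary_crossentropy")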

5 Experimental setup

An initial data analysis indicated that escalated disagreements have a median length of 16 utterances, compared to 11 for their non-escalated counterparts. While this may be an informative feature for classification, we are primarily interested in the effects of language on the conversation outcome, and therefore choose to control for this effect.

                   Not escalated   Escalated
# Samples          1994            216
# Participants     2               3
# Utterances       16              16
Tokens per utt.    61              53

Table 1: Dataset statistics for the task of predicting escalation. Where applicable, median values are used.

We use matching, a technique developed for causal inference in observational studies (Rubin, 2007), also employed by Zhang et al. (2018) in the context of conversational modelling. We pair each escalated disagreement with a non-escalated disagreement of the same length in utterances, bearing in mind that the dataset is heavily imbalanced, with 7 425 non-escalated disagreements compared to 201 escalated disagreements. To retain an imbalance in the classes while conducting matching, we match every escalated disagreement with up to ten non-escalated disagreements (the actual number depending on availability) by randomly sampling without replacement from the non-escalated disagreements of the same length. Characteristics of the dataset after performing matching are shown in Table 1, with both classes having a median of 16 utterances now. Escalated disagreements have one more participant than non-escalated disagreements on average, and utterances in the escalated disagreements class are slightly shorter.
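
A sketch of this matching step, assuming conversations expose their utterance count via len(); whether a non-escalated conversation may be matched to more than one escalated conversation is not specified, so this version uses each one at most once:

# Sketch of length-based matching: up to k non-escalated disagreements per
# escalated disagreement, sampled without replacement within each length.
import random
from collections import defaultdict

def match_by_length(escalated, non_escalated, k=10, seed=0):
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for conv in non_escalated:
        by_length[len(conv)].append(conv)
    for pool in by_length.values():
        rng.shuffle(pool)
    matched = []
    for conv in escalated:
        pool = by_length[len(conv)]
        matched.extend(pool.pop() for _ in range(min(k, len(pool))))
    return matched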

We split the dataset for training as indicated in Table 2. Due to the class imbalance we use the area under the precision-recall curve (Davis and Goadrich, 2006) as metric for this task; however, for the sake of interpretability we also present the break-even F1 scores. We use a distribution-aware random class predictor as a random baseline.

Hyperparameters The logistic regression models are implemented in Scikit-Learn. We use a grid search to determine the best regularisation mode (L1 or L2) per model, evaluating C-values in [0.1, 1, 10, 100]. The neural models are implemented in Keras, using Focal Loss (Lin et al., 2017) with Adam optimisation (Kingma and Ba, 2014) (learning rate = 0.001) and Dropout (p = 0.3). For the HAN, we use bidirectional LSTM layers with 128 nodes in both the utterance and conversation encoders. For the LSTM model, we use only one such layer.
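
The logistic regression search described above can be set up as follows; X_train and y_train are assumed feature matrices and labels, and average precision is used as the scikit-learn analogue of PR-AUC:

# Sketch of the regularisation grid search for the feature-based models.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"penalty": ["l1", "l2"], "C": [0.1, 1, 10, 100]}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),  # liblinear supports both penalties
    param_grid,
    scoring="average_precision",   # area under the precision-recall curve
)
# search.fit(X_train, y_train)     # X_train, y_train: assumed features and labels
# print(search.best_params_, search.best_score_)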

Dataset      Escalated   Not escalated
Train        125         1411
Test         46          284
Validation   30          299

Table 2: Test set splits for predicting escalation.

Model                             PR-AUC   F1

Baselines
  Random                          0.121    0.128
  Bag-of-words                    0.213    0.239

Feature-based models
  Toxicity                        0.140    0.125
  Sentiment                       0.150    0.055
  Politeness                      0.232    0.241
    + gradients                   0.275    0.243
  Collaboration                   0.261    0.320
    + gradients                   0.269    0.302
  Politeness and collaboration    0.255    0.256
    + gradients                   0.281    0.289

Neural models
  Averaged embeddings             0.243    0.256
  LSTM                            0.263    0.194
  HAN                             0.373    0.304
    + edit summaries              0.400    0.333

Table 3: Results of the escalation prediction task.

6 Results

Results are shown in Table 3. We note that generally, the neural network models perform better than the feature-based models. We discuss the PR-AUC scores of each model category below, noting that the scores from the F1 metric generally follow the same trends, but that PR-AUC is more robust in the face of imbalanced data. Furthermore, we investigate whether it is possible to predict the outcome from the beginning, how model uncertainty changes, or if there is often some inflection point in a conversation that changes the outcome.

6.1 Feature-based models

A number of our feature-based models outperform the bag-of-words baseline, which in turn performs better than the random baseline. The toxicity and sentiment features do not perform better than the bag-of-words baseline, which indicates that these features by themselves do not capture constructiveness in disagreements. Toxicity scores were found by Zhang et al. (2016a) to be predictive of a conversation derailing, but in our case seemingly lead to the inference of spurious correlations. An explanation might be that comments which seem toxic out of context are in fact spirited discussion due to the degree of involvement of the participants; for instance, we have observed that contributors in some cases challenge others’ credentials, which may seem rude in casual conversation, but can be useful in resolving a dispute.

Feature                           Type            Coeff.
2nd person pronouns, x̄           Both            +3.64
# hedging terms, x̄               Both            -2.48
Greetings, x̄                     Politeness      +2.30
1st person pronouns, x̄           Both            +1.91
Greetings, ∇                      Politeness      -1.81
Deference, x̄                     Politeness      -1.44
3rd person pronouns, ∇            Collaboration   -1.38
# ideas adopted + certainty, x̄   Collaboration   -1.23
Use of “by the way”, x̄           Politeness      -1.04
Certainty, ∇                      Collaboration   -0.92

Table 4: Ten features with the largest coefficients of the best feature-based model, which uses politeness and collaboration featuresets. Feature variations are the mean (x̄) and gradient (∇). Positive weights are associated with the unconstructive class.


The best featuresets proposed in previous work are collaboration (Niculae and Danescu-Niculescu-Mizil, 2016), followed closely by politeness (Zhang et al., 2016a). Adding the gradient features we proposed improves performance for both, indicating that not only the presence of a marker is important, but also how its usage throughout a conversation changes. The best feature-based model is a combination of politeness and collaboration features, with a PR-AUC of 0.281 (P < 0.05, using a randomized permutation test).

The ten features with the largest coefficients from the combined model are shown in Table 4 in descending order of magnitude, along with their directions. Positive coefficients are associated with the escalated class, which indicates that a disagreement was not constructive.

Politeness markers such as deference and the use of “by the way” are found indicative of constructiveness, corroborating the findings of Zhang et al. (2016b). Greetings are associated with unconstructive disagreements on average, but an increase in greetings towards the end of a conversation is associated with constructiveness. An explanation of this might be that greetings can seem overly formal or indicative of tension early on, but if used later in a conversation, they indicate that new participants have entered the conversation or that more time is taken between replies. Indeed, longer gaps between replies (a feature in the collaboration featureset) are also associated with constructiveness.

The use of first and second person pronouns (“I” and “you”) is associated with unconstructive disagreements, with the latter corroborating the findings of Zhang et al. (2018). This is also consistent with psychotherapy research on disagreements in relationships (e.g. Gottman and Krokoff (1989)), which emphasises avoiding ‘you’-messages which might be perceived as blameful.

Hedging is associated with constructive disagreements, which corroborates the findings of Zhang et al. (2016a). An intuitive understanding of this result is that using hedging terms shows that the speaker is more open to adjusting their opinion or compromising. The certainty gradient and the adoption of new ideas with certainty are also associated with constructiveness. Niculae and Danescu-Niculescu-Mizil (2016) found no significant correlation between either certainty or hedging and successful collaboration. We attribute the observed differences in feature associations to the fact that WikiDisputes are sourced from an uncoerced setting, where users feel more invested in the outcome of a disagreement and may need to balance the use of certainty and hedging to negotiate compromises. The setting of Niculae and Danescu-Niculescu-Mizil (2016) obligates volunteers to participate in a photo geolocation game, where it is unlikely that interlocutors are as invested in the task as collaborators working on a widely read encyclopaedia.

Words associated with either class, obtained through the bag-of-words model, are shown in Appendix A. An interesting observation from this list is that citing Wikipedia policy (which is indicated with WP) is associated with escalating conversations. This practice is sometimes referred to as “Wiki-Lawyering” within the community and can signal an unwillingness to compromise.

6.2 Neural network models

The results from our neural models illustrate that incorporating structure improves predictive accuracy on our task.

The model that averages over word embeddings performs the worst; a moderate increase in performance (2%) is gained from adding sequential word processing by way of an LSTM model. Adding utterance boundary information with HAN results in a much larger improvement (11%) over the LSTM. This indicates that, while it is helpful to observe the ordering of words in a conversation, a more critical component in the case of disagreements is how words are arranged in utterances.

The highest scoring model, which achieves a PR-AUC of 0.40, results from including edit summaries in the conversations (P < 0.05, using a randomized permutation test).

Figure 3: PR-AUC scores and uncertainty values of a HAN model when exposed to partial conversations (x-axis: percentage of conversation seen; left y-axis: PR-AUC; right y-axis: uncertainty).

This provides support for our observation that understanding some conversations requires knowledge of the edits that occurred during the conversation. However, neglecting to insert the “EDIT” token (as explained in Section 4.2) results in the PR-AUC decreasing to 0.33, which indicates that utterances sourced from edit summaries require special processing and are not completely homogeneous in the conversation.

Evaluating the predictions of this model, we observe that false positives often co-occur with heated debate. This is not unexpected, given that the positive class contains interlocutors who feel strongly enough about their case to request mediation. The false negatives, on the other hand, contain a number of cases where it seemed as though a disagreement had been resolved, but then the argument progresses as further edits are made.

6.3 Early estimation of outcome

We are interested in how model predictions change throughout the conversation; whether it is possible to predict the outcome from the beginning, or if there is often some inflection point in a conversation that changes the outcome. Predicting the outcome of a conversation based on only a few utterances means that the model has less information for its inference; however, if models can perform well under such conditions, early intervention by a mediator could be recommended.

To evaluate model performance in such conditions, we split each disagreement into 10 buckets, chronologically (using only conversations of more than 10 utterances). We evaluate models trained on full conversations on these subsets to observe the change in model performance.
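
A sketch of this prefix evaluation, assuming conversations are lists of chronologically ordered utterances:

# Sketch of the early-estimation setup: cut each conversation into 10
# chronological buckets and score the model on each prefix.
def prefixes(utterances, n_buckets=10):
    """Yield (fraction_seen, prefix) pairs for an ordered conversation."""
    n = len(utterances)
    for b in range(1, n_buckets + 1):
        cutoff = round(n * b / n_buckets)
        yield b / n_buckets, utterances[:cutoff]

# e.g. evaluate PR-AUC separately on the prefixes seen at 10%, 20%, ..., 100%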

Figure 4: Model predictions and uncertainty throughout the course of a conversation in our dataset. The classification threshold shown is tuned on the validation set.

These results are shown in Figure 3 in blue. Model performance increases from 0.198 at the midway point (0.5 bucket) to 0.292 when the full conversation has been observed, indicating that signals later in conversations are often important for predicting outcome. However, the neural model’s performance when predicting on half the conversation only is still better than the toxicity and sentiment models on the complete conversations.

6.4 Uncertainty estimation

Given that this early estimation exposes the model to less information, we are interested in whether this impacts model uncertainty. We employ Monte Carlo dropout (Gal and Ghahramani, 2016) to this end. This approach calls for applying dropout before every layer of weights during both training and prediction, allowing us to sample from the distribution of predicted values for each input. Each input is evaluated multiple times (in our case, N = 30) and the mean and standard deviation of these predictions represent the prediction and its uncertainty.
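
A sketch of Monte Carlo dropout at prediction time for a tf.keras model (such as the HAN sketched in Section 4.2): dropout is kept active by calling the model with training=True, and N = 30 stochastic forward passes are aggregated.

# Sketch of MC dropout inference: mean as the prediction, std as uncertainty.
import numpy as np

def mc_dropout_predict(model, inputs, n_samples=30):
    samples = np.stack([model(inputs, training=True).numpy()
                        for _ in range(n_samples)], axis=0)
    return samples.mean(axis=0), samples.std(axis=0)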

Our results are shown in Figure 3 (in red). We note that uncertainty initially increases and then decreases monotonically. The initial increase in uncertainty could be due to a contrary position being introduced in the second utterance, in response to the introductory comment (as occurs in Figure 4). From there, the dispute is either resolved or eventually escalated, and uncertainty decreases almost linearly as the model is exposed to new data.

6.5 Identifying inflection points

Having ascertained that model accuracy improves and uncertainty decreases as a model is exposed to more information, we are interested in factors that cause the predicted class to change, and how the model predictions relate to the feature-based model coefficients. Although methods for interpreting neural network predictions exist (Ribeiro et al., 2016, 2018), these are not easily extendable to process inter-utterance dependencies as observed with the HAN. We instead analyse a conversation from our dataset to observe inflection points and what may have caused them. We show HAN model predictions and uncertainty values in Figure 4.

The HAN model initially predicts escalation, with a relatively low uncertainty value. The conversation remains confrontational for the first seven utterances with a high escalation score. There are two “I” pronouns in the second utterance and two “you” pronouns in the third utterance, which were also associated with unconstructiveness in the feature-based models. We note the use of politeness cues (“please” and “thank you” in the fourth and sixth utterances in this example), which reduces the score for escalation. The uncertainty and prediction values increase when one user proposes a compromise in utterance 9, and then decrease again in utterance 12 when the other user accepts the compromise. This feature was not explicitly modelled for in our feature-based models.

7 Conclusion

In a discussion of online disagreements by Graham (2008) (which Wikipedia references in its guidelines for resolving disputes), the author remarks that an increase in disagreements through widespread online interaction has the potential to create a surge in anger among internet users. On the other hand, as illustrated in our data, disagreements can sometimes be constructive. The success of Wikipedia shows the benefits of integrating multiple perspectives. As Hahn (2020) states, “Arguing things through is at the center of our attempts to come to accurate beliefs about the world [and] decide on a best course of action.” For this reason, it is important to understand why disagreement can sometimes be constructive, and in other cases leads to conversational failure.

In this work, we investigated constructive disagreements from an NLP perspective. We have proposed a dataset of disagreements online and defined the task of predicting escalation as a proxy label for cases where disagreements were unconstructive. We analysed features that are associated with either class, and drew parallels with existing work. Using neural network models, we investigated the effect of modeling conversation structure and found that adding utterance hierarchy led to an increase in performance. Finally, we validated our neural models by evaluating their performance and uncertainty when exposed to partial conversations. Our insights on constructive disagreements are not limited to Wikipedia and would be transferable to disagreements on other platforms. Additionally, our finding that edit summaries are an informative part of Talk page conversations should be useful for researchers who work on Wikipedia Talk pages more generally.

Acknowledgements

Christine De Kock is supported by scholarships from Huawei and the Oppenheimer Memorial Trust. Andreas Vlachos is supported by the EPSRC grant no. EP/T023414/1: Opening Up Minds.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Gioia Boschi, Anthony P Young, Sagar Joglekar, Chiara Cammarota, and Nishanth Sastry. 2019. Having the last word: Understanding how to sample discussions online. arXiv preprint arXiv:1906.04148.

Jonathan P. Chang, Caleb Chiam, Liye Fu, Andrew Wang, Justine Zhang, and Cristian Danescu-Niculescu-Mizil. 2020. ConvoKit: A toolkit for the analysis of conversations. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 57–60, 1st virtual meeting. Association for Computational Linguistics.

Jonathan P. Chang and Cristian Danescu-Niculescu-Mizil. 2019. Trouble on the horizon: Forecasting the derailment of online conversations as they develop. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4743–4754, Hong Kong, China. Association for Computational Linguistics.

Justin Cheng, Michael Bernstein, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2017. Anyone can become a troll: Causes of trolling behavior in online discussions. In Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, pages 1217–1230.

Jesse Davis and Mark Goadrich. 2006. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd international conference on Machine learning, pages 233–240.

Esin Durmus and Claire Cardie. 2019. A corpus for modeling user and language effects in argumentation on online debating. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 602–607, Florence, Italy. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International conference on machine learning, pages 1050–1059.

John M Gottman and Lowell J Krokoff. 1989. Marital interaction and satisfaction: A longitudinal view. Journal of consulting and clinical psychology, 57(1):47.

Paul Graham. 2008. How to disagree. http://www.paulgraham.com/disagree.html.

Ulrike Hahn. 2020. Argument quality in real world argumentation. Trends in Cognitive Sciences.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Yiqing Hua, Cristian Danescu-Niculescu-Mizil, Dario Taraborelli, Nithum Thain, Jeffery Sorensen, and Lucas Dixon. 2018. WikiConv: A corpus of the complete conversational history of a large online collaborative community. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2818–2823, Brussels, Belgium. Association for Computational Linguistics.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988.

Bing Liu, Minqing Hu, and Junsheng Cheng. 2005. Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of the 14th international conference on World Wide Web, pages 342–351.

Hui Ma, Jian Wang, Lingfei Qian, and Hongfei Lin. 2020. HAN-ReGRU: hierarchical attention network with residual gated recurrent unit for emotion recognition in conversation. Neural Computing and Applications, pages 1–19.

Vlad Niculae and Cristian Danescu-Niculescu-Mizil. 2016. Conversational markers of constructive discussions. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–578, San Diego, California. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In AAAI, volume 18, pages 1527–1535.

Sara Rosenthal and Kathy McKeown. 2015. I couldn’t agree more: The role of conversational structure in agreement and disagreement detection in online discussions. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 168–177, Prague, Czech Republic. Association for Computational Linguistics.

Donald B Rubin. 2007. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statistics in medicine, 26(1):20–36.

Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th international conference on world wide web, pages 613–624. International World Wide Web Conferences Steering Committee.

Lu Wang and Claire Cardie. 2014. A piece of my mind: A sentiment analysis approach for online dispute detection. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 693–699, Baltimore, Maryland. Association for Computational Linguistics.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California. Association for Computational Linguistics.

Wikipedia. 2020a. Dispute tag template. https://en.wikipedia.org/wiki/Template:Disputed.

Wikipedia. 2020b. NPOV tag template. https://en.wikipedia.org/wiki/Template:POV.

Wikipedia. 2020c. Wikipedia dispute resolution. https://en.wikipedia.org/wiki/Wikipedia:Dispute_resolution.

Wikipedia. 2020d. Wikipedia personal security. https://en.wikipedia.org/wiki/Wikipedia:Personal_security_practices.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, pages 1391–1399.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Dawei Yin, Zhenzhen Xue, Liangjie Hong, Brian D Davison, April Kontostathis, and Lynne Edwards. 2009. Detection of harassment on Web 2.0. Proceedings of the Content Analysis in the WEB, 2:1–7.

Justine Zhang, Jonathan P Chang, Cristian Danescu-Niculescu-Mizil, Lucas Dixon, Yiqing Hua, Nithum Thain, and Dario Taraborelli. 2018. Conversations gone awry: Detecting early signs of conversational failure. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 1.

Justine Zhang, Ravi Kumar, Sujith Ravi, and Cristian Danescu-Niculescu-Mizil. 2016a. Conversational flow in Oxford-style debates. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 136–141, San Diego, California. Association for Computational Linguistics.

Justine Zhang, Ravi Kumar, Sujith Ravi, and Cristian Danescu-Niculescu-Mizil. 2016b. Conversational flow in Oxford-style debates. In Proceedings of NAACL-HLT, pages 136–141.

A Words list

Words associated with the positive and negative class are shown in Table 5. Coefficients are determined by training a bag-of-words model with logistic regression. The positive class is escalation.

Words         Coefficients
npov          -8.421
information    6.138
did            5.583
wp             4.969
right         -4.819
agree         -4.124
discussion    -4.106
issue          3.864
say            3.771
pov           -3.754
want           3.610
article       -3.398
work           3.362
articles       2.755
edit           2.747
claim         -2.232
case           2.180
people        -2.140
wikipedia      2.132
use           -2.013

Table 5: Words associated with the positive and negative class, using a bag-of-words model to predict escalation.