Faculty of Economics and Business
Amsterdam School of Economics
Requirements thesis MSc in Econometrics.
1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):
(a) Front page (requirements see below)
(b) Statement of originality (compulsory, separate page)
(c) Introduction
(d) Theoretical background
(e) Model
(f) Data
(g) Empirical Analysis
(h) Conclusions
(i) References (compulsory)
If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and the year of publication for the first reference and use the first name and et al. and the year of publication for the other references. Provide page numbers.
2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.
3. The front page should contain:
(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).
(b) The title of the thesis
(c) Your name and student number
(d) Date of submission of the final version
(e) MSc in Econometrics
(f) Your track of the MSc in Econometrics
Sarcasm Detection in Reddit Comments
by
Stepan Svoboda
11762616
Masters in Econometrics
Track: Big Data Business Analytics
Supervisor: L. S. Stephan MPhil
Second reader: prof. dr. C.G.H. Diks
August 12, 2018
Statement of Originality
This document is written by student Stepan Svoboda who declares to take full responsibility for the contents of this document.
I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.
The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.
UNIVERSITY OF AMSTERDAM
Abstract
Faculty of Economics and Business
Amsterdam School of Economics
Masters in Econometrics
by Stepan Svoboda
This thesis creates a new sarcasm detection model which leverages existing research in the sarcasm detection field and applies it to a novel data set from an online commenting platform. It takes several different approaches to extracting information from the comments and their context (author's history and parent comment) and identifies the most promising ones. The main contribution is the identification of the most promising features on this large and novel data set, which should make the findings robust to noise in the data. The achieved accuracy was 69.5%, with the model being slightly better at detecting non-sarcastic than sarcastic comments. The best features, given the hardware limitations, were found to be the PoS-based and lexical-based ones.
Acknowledgements
I would like to thank my supervisor, Sanna Stephan, for all the help and feedback she
provided me during the writing of my thesis. I’m also grateful to my friends Jan, Samuel
and Radim for all their support and help during my studies and work on the thesis.
Contents
Statement of Originality 1
Abstract 2
Acknowledgements 3
List of Figures 6
List of Tables 7
Abbreviations 8
Glossary 9
1 Introduction 1
2 Sentiment Analysis and Sarcasm Detection 4
2.1 Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Polarity classification . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Beyond polarity classification . . . . . . . . . . . . . . . . . . . . . 9
2.2 Sarcasm Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Survey of sarcasm detection . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Research papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Data description 15
3.1 Sarcasm detection data set . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Authors’ history . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Methodology 19
4.1 Pre-processing & Feature engineering . . . . . . . . . . . . . . . . . . . . . 19
4.1.1 Bag-of-Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.2 Sentiment-based features . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.3 Lexical-based features . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.4 PoS-based features . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.5 Word similarity-based features . . . . . . . . . . . . . . . . . . . . 24
4.1.6 User embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.1 Feed-forward Neural network . . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Error Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Empirical part 38
5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Conclusion 48
A Truncated SVD 51
B Neural net optimization 53
B.1 Neural net architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
C Used libraries 56
Bibliography 57
List of Figures
4.1 A basic neural net (Bishop, 2006) . . . . . . . . . . . . . . . . . . . . . . . 27
List of Tables
5.1 Confusion matrix of the main model . . . . . . . . . . . . . . . . . . . . . 41
5.2 Summary of the performance of individual models . . . . . . . . . . . . . 42
5.3 Confusion matrix of the model using as features: BoW and context . . . . 42
5.4 Confusion matrix of the model using as features: BoW, context and PoS-based features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Confusion matrix of the model using as features: BoW, context and similarity-based features . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.6 Confusion matrix of the model using as features: BoW, context and user similarity-based features . . . . . . . . . . . . . . . . . . . . . . . 43
5.7 Confusion matrix of the model using as features: BoW, context and sentiment-based features . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.8 Confusion matrix of the model using as features: BoW, context and lexical-based features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
C.1 Libraries used in this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Abbreviations
BoW Bag-of-Words
CM Confusion Matrix
ML Machine Learning
NLP Natural Language Processing
PCA Principal Component Analysis
PoS Part-of-Speech
TF-IDF Term Frequency-Inverse Document Frequency
SGD Stochastic Gradient Descent
SVD Singular Value Decomposition
w.r.t. with respect to
Glossary
k-nearest neighbor Technique which places all observed data points in n-dimensional space, where n is the number of features, and finds the k closest points based on a prespecified distance measure, e.g. Euclidean distance. 11
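As an illustrative sketch (toy data and NumPy assumed, not part of the thesis), the nearest-point lookup amounts to computing one distance per observation and sorting:

```python
import numpy as np

# Toy 2-D feature matrix: each row is one observed data point.
points = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [5.0, 5.0]])
query = np.array([0.1, 0.1])

# Euclidean distance from the query point to every observation.
distances = np.linalg.norm(points - query, axis=1)

# Indices of the k = 2 closest points, nearest first.
k = 2
nearest = np.argsort(distances)[:k]
print(nearest.tolist())  # -> [2, 0]
```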
Amazon’s Mechanical Turk Amazon’s Mechanical Turk is an internet marketplace for tasks which computers are currently unable to do. Human users can choose and execute these tasks for money, thereby providing the input for the subsequent ML process. A typical ML example is acquiring a set of predefined labels for a data set, which can later be used to perform a supervised learning task. 11, 14
bigram An augmentation of the preprocessing of the BoW. Instead of a single word, a collocation of two words occupies the columns. The example sentence ’I went shopping’ has three unigrams (’I’, ’went’ & ’shopping’) and two bigrams (’I went’, ’went shopping’). 12, 13, 45, 49
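A minimal sketch of this extraction (illustrative helper, not code from the thesis):

```python
# Toy n-gram extraction matching the example sentence above.
def ngrams(tokens, n):
    """Return the n-word collocations in order of appearance."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'I went shopping'.split()
print(ngrams(tokens, 1))  # -> ['I', 'went', 'shopping']
print(ngrams(tokens, 2))  # -> ['I went', 'went shopping']
```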
BoW This technique is applied directly to all the preprocessed documents (comments) and creates a matrix with all the words represented in the columns and all the documents represented in the rows. Each field of this matrix then represents the number of occurrences of a specific word in the specific document. 9, 10, 12–14, 20, 21, 24, 38, 42, 45, 49
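The matrix construction can be sketched on a toy corpus as follows (illustrative code, not from the thesis; in practice a library routine would be used):

```python
from collections import Counter

# Two toy preprocessed "documents" (comments).
docs = ['the cat sat', 'the cat ate the fish']

# Columns: all words of the corpus; rows: the documents;
# each field: how often that word occurs in that document.
vocab = sorted({word for doc in docs for word in doc.split()})
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)   # -> ['ate', 'cat', 'fish', 'sat', 'the']
print(matrix)  # -> [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```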
deep learning Type of ML based on Neural Nets aimed at learning data representa-
tions such as shapes and objects in images. ’Deep’ in the name refers to the large
number of layers. 10, 26
determiner ’A modifying word that determines the kind of reference a noun or noun group has, e.g. a, the, every’ (Oxford Dictionary). Examples are the articles in the sentence ”The girl is a student”, where both of them are determiners which can specify the definiteness, i.e. do we know the girl?, and the number of the appropriate nouns, e.g. an indefinite article is not compatible with a plural. 11
feature ML equivalent of an explanatory variable. Non-numerical features can be recoded to numerical form. 2, 3, 9–14, 17, 19, 22, 23, 42, 44
feature set A number of distinctive (yet related) features used together as one set. 2,
10, 11, 13, 19, 27, 41–44, 46
hyperparameter In ML, a hyperparameter is a parameter whose value is not learned but is set before the learning begins. 10, 30, 39, 40, 53
lexical clue A specific type of expression in writing, akin to voice inflection and ges-
turing in speech, e.g. emoticons, capital letters and more than one punctuation
mark in a row. 11
online learning This method allows a model to be trained on data that come in sequentially. It is commonly used when the data set is too large to handle at once, so the computation is divided into many parts to make it feasible. It is called online learning since it can handle real-time data. 26, 38
regularized regression OLS augmented by an additional term in the loss function. With the l2 penalty (Ridge regression) the loss function takes the following form: $\sum_{i=1}^{n}(y_i - \sum_{j=1}^{k} x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$. Alternative versions of regularization exist, such as the l1 penalty (LASSO regression), which takes the form $\sum_{i=1}^{n}(y_i - \sum_{j=1}^{k} x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^{k} |\beta_j|$. 13, 14
test set Data set on which the generalization properties of the validated model are
evaluated. The generalization property is the ability of the model to perform well
on previously unseen data. 11, 39
unigram A single word in a column of the BoW matrix. 9, 12, 13, 45, 49
validation set Data set which provides evaluation of the generalization properties of
the model and allows hyperparameters to be tuned. 39
verb morphology Specific form of a verb, e.g. the passive voice and the infinitive. 11
word embeddings A process in which a word or a phrase is mapped to a vector of
real numbers. It is a general term that can be associated with different methods.
One of them is described briefly in Section 4.1.5. 14, 18
Chapter 1
Introduction
The analysis of unstructured data is a complex and intriguing problem. Unstructured data usually consist mainly of text, but other information, such as dates and numeric values, can be present. Unlike structured data, they lack a rigid and uniform structure, i.e. rows being the instances of the phenomenon and columns being the properties of the individual instances. Structured data are sometimes also called rectangular due to this structure, and they are commonly saved in the .csv format since that format was created to store this kind of data. An example of structured data might be a list of countries (instances) with their GDP, area and population represented in the columns. The final table is then a rectangle, which is where the name ‘rectangular data‘ comes from.
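As a small illustration of this rectangular shape (country names and figures below are made up for the example, and only the standard library is assumed):

```python
import csv
import io

# Hypothetical rectangular data: rows = countries (instances),
# columns = properties (GDP, area, population); values are invented.
raw = io.StringIO(
    "country,gdp,area,population\n"
    "A,100,50000,1000000\n"
    "B,200,80000,2000000\n"
)
rows = list(csv.DictReader(raw))

# Every row has the same set of columns: the table is a rectangle.
print(len(rows), len(rows[0]))  # -> 2 4
print(rows[0]['country'])       # -> A
```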
The lack of structure in unstructured data makes analysis difficult since such data are usually highly irregular. An example of unstructured data might be the online communication of employees, e.g. emails or an internal instant messaging platform. Many specialized techniques are required to retrieve the desired information.
The sheer volume of these data is gigantic. It was estimated in 1998 that somewhere between 80-90% of all business data are actually in unstructured form (Grimes), e.g. anything from an invoice or internal memo to a log of a machine’s performance. This is only a rule of thumb and no quantitative research has been done, but it is still a somewhat accepted ratio and is often cited. This is an incredible amount of information that could be leveraged to improve virtually every aspect of business operations. This ratio is also supported by the Computer World magazine’s claim that more than 70-80% of data are in unstructured form (Chakraborty & Pagolu).
One type of unstructured data analysis is sentiment analysis, which is essentially an automated way to extract opinions from the text at hand (Oxford Dictionary). The most common example of an extracted opinion is the polarity of a statement, i.e. positive/negative. A more detailed description follows in the next chapter. This type of analysis suffers from many drawbacks, which are enumerated and discussed in this thesis.
This thesis focuses on one problem specifically: the presence of sarcasm in the available textual information. Sarcasm can easily mislead a sentiment analysis model since, taken literally, sarcastic statements have a meaning different from the one intended. An example is the statement ’It’s great to be at work when there’s such nice weather!’. This statement can at first seem an honest, and positive, remark since it does not include any typical signs of a negative comment. But with some background knowledge the hidden meaning, sarcasm, can easily be seen.
This thesis tries to build on and extend the current sarcasm detection literature and, by combining the best existing features, create a working sarcasm detection model. In the process the distinct features are also compared against each other and the most beneficial ones are identified. This comparison of the features is one of the main extensions of the existing literature achieved by this thesis. The goal was pursued in two stages: in the first, data pre-processing and feature engineering took place, and in the second, the resulting features were used as input for a neural net, i.e. in this case a classification algorithm. In the first step several pre-processing approaches were used to obtain various types of features and in the second step their actual benefit was evaluated. The benefit brought by the specific feature sets is also compared to their computational burden and the overall usefulness is evaluated.
The difference between models in ML and in standard econometrics should be highlighted since this thesis focuses exclusively on the former. In ML we cannot really derive the true underlying process and model it as is common in econometrics, and thus we do not know the exact rules that govern the model’s decision process. This lack of interpretability is a trade-off for the overall better predictive power of these ‘black-box‘ models.
The thesis is organized as follows. In Chapter 2 sentiment analysis is described as a whole, with special care given to sarcasm detection and the related literature. In Chapter 3 the used data are described, and in Chapter 4 both pre-processing and feature engineering are discussed at length alongside the description of the neural net. Afterwards the results are presented in Chapter 5 and the thesis ends with some concluding remarks in Chapter 6.
Chapter 2
Sentiment Analysis and Sarcasm
Detection
Sentiment analysis is defined by the Oxford Dictionary as ’the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine the writer’s attitude towards a particular topic or product’. This discipline, based mainly on Natural Language Processing (NLP) and Text Mining, is widely used in many fields, from marketing and spam detection (Peng & Zhong, 2014) to financial markets forecasting (Xing et al., 2018). The specific applications and how sentiment analysis is used in these fields are discussed in the following sections. This chapter is divided into two sections: first sentiment analysis is introduced in general and afterwards sarcasm detection is discussed in detail.
Sarcasm detection is indeed an issue that typically prevents a straightforward sentiment analysis. The problem sarcastic or ironic (here used interchangeably) comments pose is that their literal polarity often differs from the one they actually express. An example might be the sentence ’Great job!’, which can be meant both sincerely and sarcastically. In both cases the sentiment seems positive at first sight, but the true meaning differs and that can throw the analysis off.
The other issues are not discussed in depth here and are only briefly mentioned. Different types of opinions, i.e. regular vs comparative and explicit vs implicit opinions, are an issue since detecting comparisons or implicit opinions is complicated. Subjectivity detection is also an issue since many applications of sentiment analysis want to discern between statements without emotional undertone and those with some emotional charge. The last problem sentiment analysis faces is the point of view, which can be either the author’s or the reader’s.
These problems are quite intuitive but some examples are given here nonetheless (Liu, 2012). A regular opinion is stating that Coke is good, while a comparative one is that Coke is better than Pepsi. An explicit opinion is saying the iPhone has bad battery life, while an implicit one is saying the iPhone has shorter battery life than a Samsung. A subjective opinion is one expressing not a fact but a personal preference, i.e. ’I like Apple products’. Lastly the point of view matters as well. For example, the news of a stock hike is good for people who own the stock but bad for people who shorted it.
The ways in which these problems can mislead the analysis are various. A comparative opinion requires the researcher to determine whether Coke is better than Pepsi or the other way around; the relationship between the two compared things must be established. Implicit opinions suffer from the same problem: the structure of the sentence must be used to automatically determine which of the two things is compared to which. If this is not done the sentence can be understood incorrectly. Subjective and objective opinions depend a lot on the specific use case, but differentiating between emotion-less and emotional statements can be important in removing potential noise which brings very little to the overall model. The point of view issue is similar to the problems implicit and comparative opinions pose. The entity which expressed the opinion must be determined for us to know whether there is some hidden agenda, e.g. an analyst claiming a stock has bad fundamentals might have motivation to damage the stock since his firm could hold a short position in it.
2.1 Sentiment Analysis
Sentiment analysis (sometimes also known as opinion mining) aims to determine the position of the speaker/writer towards a specific issue. This can be done with respect to the overall polarity of the analyzed document/paragraph/sentence/... or the underlying emotional response to the analyzed event. There are many slightly different names for slightly different tasks within this broad category of sentiment analysis or opinion mining, such as opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion mining and review mining (Liu, 2012). Opinion extraction and sentiment mining are the same and focus only on identifying the existing sentiment/opinion in the given document. Subjectivity analysis is concerned with discerning between subjective and objective statements. The possible usage differs per the needs of the specific researcher, but a common usage is removing the objective sentences before determining polarity to improve the model’s ability to differentiate between positive and negative statements. Affect analysis and emotion mining are the same since they both aim to detect the expressed emotional state in the text. Review mining describes all types of sentiment analysis but applied only to one specific type of text, i.e. reviews.
In sentiment analysis both the position of the writer and the specific issue addressed are of interest, since different types of sentiment analysis deal with different problems arising from the different specifications (Liu, 2012). The position can be expressed in two ways: either we can describe the stance of a person towards the issue, i.e. positive or negative, or we can try to describe the emotional response elicited, i.e. sadness, anger, happiness etc. Beyond the emotional response, the subjectivity and aspect-based measures of the statement/document can be looked at. Aspect-based measures attempt to find the sentiment expressed toward specific entities. A document can be negative toward one entity and at the same time positive toward another, e.g. a review can be critical of a museum and give an example of a good museum.
Subjectivity analysis aims to find subjective parts of the document in order to differentiate between different types of expressed opinion. Another option is to downgrade the importance of, or remove, the objective statements to achieve greater accuracy in simple polarity analysis. Aspect-based sentiment analysis aims to identify the sentiment expressed towards a specific entity mentioned in the text, which requires determining not only the sentiment but also the entities in the text (Liu, 2012). By looking at the opinions at finer granularity the analysis can be more precise and deliver more information.
Subjectivity analysis and the aspect-based approach are not discussed here in depth since they are not of main interest in this thesis; only applications of the polarity (positive/negative) of opinion or advanced sentiment classification are mentioned later on.
2.1.1 Polarity classification
A few examples of possible applications of standard sentiment analysis, i.e. polarity classification, are given in this section. The examples mainly use data from social networks and the tasks employing sentiment analysis are various. They range from marketing, e.g. identifying brand sentiment, counting the number of positive and negative reactions, and using polarity as an input for recommendation systems, e.g. on Twitter, to financial forecasting. A practical example is Starbucks: they use sentiment analysis to identify complaints on Twitter and answer all of these negative comments.
The first application discussed is detecting spam or fake reviews. Here the term spam reviews is used following the terminology of Peng & Zhong (2014). Users giving false positive or negative reviews to either boost or ruin the score of a product/store are not an uncommon issue. Spotting a spam review and removing it can greatly improve the trustworthiness of a rating site and improve the consumer experience, as Peng & Zhong (2014) discuss. They managed to improve the detection of spam reviews by employing a three-step algorithm. The first step was computing the sentiment of the reviews. This was followed by a second step where a set of discriminative rules was applied to detect unexpected patterns in the text. The third and final step was the creation of a time series based on the reviews and sentiment scores to detect any sudden anomalies.
Another possible application is using sentiment analysis as an input for a recommendation system to improve recommendation precision. Yang et al. (2013) use a recommendation system based on two main approaches: one uses location-based data and the other uses sentiment analysis data to make recommendations. They use social networks with location data, and by using this information in tandem with the information from sentiment analysis, i.e. the sentiment of posts on these social media, they manage to create and successfully apply a novel recommendation engine. Another example worth mentioning is the paper of Gurini et al. (2013), where they explore the possibility of better friend recommendation systems. They took the standard content-based similarity measure, cosine similarity, and augmented it by plugging in the Sentiment-Volume-Objective (SVO) function. SVO is based on the expressed sentiment, the volume and the subjectivity of the reactions toward a concept by a specific user. The contribution of sentiment analysis here is the augmentation of the measure used to recommend specific users. Thanks to this augmentation the recommendation engine takes into account more information and knows how similar the attitudes of two persons are. By using not only the shared interests but also the sentiment analysis of the users’ content they managed to create an original model.
In a newer paper Zimbra et al. (2016) decided to use neural nets and Twitter data
to assign a brand a sentiment class in a three- or five-class sentiment system. Here the
sentiment analysis is used to find out the public perception of a certain brand, which can
be very useful for all marketing practitioners. They manually labeled tweets mentioning
Starbucks and used those in their analysis.
Poecze et al. (2018) studied the effectiveness of social media posts based on the metrics provided by the platform and sentiment analysis of the reactions. The effectiveness was measured by the number of different reactions, e.g. ’shares’ and ’likes’. They found that sentiment analysis proved an invaluable complement in evaluating the effectiveness of certain posts, since it was able to look past the standard metrics and determine why some forms of communication were less popular with the consumer base. The underlying positive/negative reaction of the base is better measured by sentiment analysis than by the number of views and reactions alone.
Ortigosa et al. (2014) concentrate solely on Facebook and determining the polarity of the
users’ opinions and the changes in their mood based on the changes of their comments’
polarity. By combining a lexical-based approach and a machine learning-based one they
managed to create a tool that first determines the polarity of a submission and then by
computing the user’s standard sentiment polarity detects significant emotional changes.
Lastly, the survey of Xing et al. (2018) focuses on financial forecasting. Natural language based financial forecasting (NLFF) is a strand of research focused on enhancing the quality of financial forecasting by including new explanatory variables based on NLP, with sentiment analysis having a prominent place in this area. Among the analyzed documents are corporate disclosures such as quarterly/annual reports, professional periodicals such as the Financial Times, aggregated news from sources like Yahoo Finance, message board activities and social media posts. The texts from these sources can then be used for different types of sentiment analysis like subjectivity and polarity analysis. This can be useful for many parties, an obvious example being any trader or hedge fund.
Any information signalling possible price movement of a stock is extremely valuable and
can be used in the overall valuation.
2.1.2 Beyond polarity classification
Mohammad (2016) discusses sentiment analysis quite holistically, from the standard polarity and valence of the shown sentiment to emotions. The problems of this strand of research are discussed alongside some of the underlying theory from psychology. One of the main issues is finding labeled data, since in emotion analysis we do not have just a binary problem, e.g. positive/negative or sarcastic/not sarcastic. The existing work is summarized in this paper, which makes it a good starting point for anyone interested in this specific area of sentiment analysis.
One of the solutions to the problem of missing labels in data sets can be the emotion
lexicon presented in Staiano & Guerini (2014). It is an earlier work but it can serve as a
good starting point for other emotion lexicons. The authors created a lexicon containing
37 thousand terms and a corresponding emotion score for each of the terms. This can be
an invaluable source for anyone wanting to apply a sentiment analysis that goes beyond
simple polarity.
2.2 Sarcasm Detection
Sarcasm detection has become a topic in sentiment analysis due to the problems it
causes. A statement imparting negative/positive sentiment while seemingly having the
opposite polarity can quite quickly lower the accuracy of sentiment analysis models (Liu,
2012). This gave rise to a specialized field of research – sarcasm detection. Most of the
papers focused on this problem are quite new (less than 10 years old) and the progress
from first attempts using only word lexicons and lexical cues to the current literature
using context, word embeddings and other neural net-based approaches is significant.
This section is divided into two parts, in the first a survey paper is presented as a possible
introduction into this field of research. In the following section actual research papers
are discussed and the evolution of this area of research presented.
2.2.1 Survey of sarcasm detection
The most notable paper that concentrates on this topic is from Joshi et al. (2017). The setup, issues, different kinds of data sets, approaches, their performances and the trends present in the sarcasm detection literature are all discussed there. The problems discussed in that paper are mentioned here but not analyzed in depth. In the following section some papers dealing with sarcasm detection are discussed and, when relevant, the issues are highlighted.
The linguistic background is not discussed here and can be found with proper references
in the Joshi et al. (2017) paper alongside the overall problem definition.
The different types of available data are divided based on text length – short, long, transcripts/dialogue and miscellaneous or mixed, i.e. those that do not fit into the previous categories. The different approaches to the problem of sarcasm detection (illustrated by the choice of papers in the next section) can be divided into rule-based and feature set-based. A rule-based system is typically a set of {IF : THEN} statements, such as the rule IF ’red’ THEN ’stop’ for a car. These rules can be much more complicated and possibly nested as well. The feature set-based approach also comprises the usage of different algorithms, i.e. standard ML and deep learning ones. The trends are also described in the next section, since the choice of papers is organized along the time axis and nicely imparts the evolution of the field.
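A toy sketch of such a rule-based system for sarcasm (the specific clues below are illustrative assumptions, not rules taken from the literature):

```python
# Toy rule-based classifier in the {IF : THEN} spirit described above.
# Each rule pairs a condition on the text with a label; rules fire in order.
rules = [
    (lambda text: '#sarcasm' in text.lower(), 'sarcastic'),
    (lambda text: '!!!' in text, 'sarcastic'),  # punctuation-run clue
]

def classify(text, default='not sarcastic'):
    # IF a condition holds THEN return its label, ELSE fall through.
    for condition, label in rules:
        if condition(text):
            return label
    return default

print(classify('Great job!!!'))         # -> sarcastic
print(classify('Thanks for the help'))  # -> not sarcastic
```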
Then the issues with the process of sarcasm classification are discussed. The prominent issue is data annotation. Several options are discussed in the next chapter, but the main distinction is between self-labeled data and data annotated by a third party. The quality of these labels is also in question, since humans have trouble discerning between sarcastic and non-sarcastic utterances and no labels from a third party can be perfect. With self-labeling the data quality problem lies in selecting users that use the appropriate hashtag or a similar method to announce the use of sarcasm. This is coupled with cases where the hashtag is used either by accident or in a different meaning than intended, e.g. talking about the usage of #sarcasm. Another issue is the inherent skewness in the data, since most comments and tweets are not sarcastic and some measures need to be taken for the data set to be balanced.
Some local and temporal specifics of lexical clues are important as well. These clues are
not constant over time or across cultures and countries. The usage of emojis and some
abbreviations, e.g. LOL, is a relatively recent development and is not spread evenly
over the globe. Background knowledge is required for constructing this type of feature
set.
2.2.2 Research papers
In the following paragraphs the progress in the field is illustrated and the changes in the overall
approach to classifying sarcasm are shown. Some of the changes are linked to the growth
of computational power, e.g. the possibility of using neural network-based methods.
One of the early papers focused on identifying sarcasm is Carvalho et al. (2009),
which attempted to find irony in user-generated content by looking for lexical clues,
specifically detecting irony in sentences containing normally positive words by looking for
oral and gestural clues. Examples of these clues are, among others, emoticons, laughter
expressions (haha, LOL, ...) and the number of punctuation marks in a row. They also
found that some more complex linguistic information, such as diminutives, determiners
and verb morphology, was quite ineffective in detecting irony. Since this research was done mainly on Portuguese
text, some of the chosen features were specific to Portuguese and are not transferable to
other languages. The evaluation of their work was done by hand: some comments were
chosen at random and manually scored, with this score then being compared to the
predicted value. Their contribution is showing that overall sentiment analysis can be
easily improved by using even simple lexical rules to identify some sarcastic comments.
Davidov et al. (2010), just like Carvalho et al. (2009), use manually labeled data. They use
Amazon's Mechanical Turk to label a large collection of tweets and Amazon reviews.
This gives rise to the main data set, which is then divided into a training and a test set.
This is the common and correct approach, and later on in this section it is assumed
as standard; only deviations from this approach are highlighted. Each instance is
rated by three independent evaluators, which guarantees a high level of reliability. Their
algorithm is based on expressing each observation in the training set as a vector based on the
extracted features and then using the k-nearest neighbor technique to match each testing
observation to its most similar vectors in the training set based on Euclidean distance.
Majority voting was then used to classify the testing observation. Their features included
a separation of the sentences into different patterns identifying so-called high-frequency
words and content words. They have pre-existing patterns and measure how similar each
review/tweet is to these patterns, with each observation receiving a score between 0 and 1 for
each pattern. These patterns are then combined with some punctuation-based features.
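The matching-and-voting step of this scheme can be sketched as follows, with made-up two-dimensional feature vectors standing in for the original pattern scores:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training
    vectors under Euclidean distance, as in the scheme described above."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every training vector
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = y_train[nearest]
    return int(np.round(votes.mean()))                # majority label for binary {0, 1}

# Toy pattern-score vectors and labels (1 = sarcastic), invented for illustration.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y = np.array([1, 1, 0, 0])
print(knn_predict(X, y, np.array([0.85, 0.15])))  # 1
print(knn_predict(X, y, np.array([0.15, 0.85])))  # 0
```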
Newer papers include Riloff et al. (2013), which focuses exclusively on Twitter data. Their
data are once again labeled by human annotators and are chosen with special emphasis
on tweets that are sarcastic on their own and not as a reply. By doing this
the context becomes insignificant and one of the common problems in sarcasm detection
is avoided. They used several feature sets: various existing sentiment lexicons, support
vector machines on a BoW (created with unigrams and unigrams & bigrams) and their
own sentiment lexicon created by bootstrapping the key phrases. They have a very
specific case of sarcasm in mind, one where a positive sentiment is followed by a negative
sentiment, with this incongruity creating the irony, e.g. 'I love being ignored'. They
use this assumption to learn positive and negative phrases (over many iterations),
which then serve as their lexicon to identify the positive and negative sentiment
present. If both are found in tandem, this is taken as evidence of sarcasm being
present. The overall model is more complex, but this description is meant to give only
a very rough overview of the idea.
Wallace et al. (2014) is the first paper mentioned here that argues for the necessity of
using context for detecting sarcasm. It claims that since humans usually need context
to detect sarcasm, machines should as well. They reach this conclusion on a Reddit
data set (Reddit is a social media site described in Chapter 3) which is scored by humans, where during the
scoring it is recorded whether the annotator asked for additional context. They used a
basic logistic regression with one explanatory dummy variable (asking for context) to
show that in the case of an ironic comment the annotators are more likely to ask for context.
Their claim was also substantiated by a second logistic regression: they
have shown that their baseline model makes mistakes more often in the cases where
the annotators asked for context, which again points to the intuitive conclusion that both
machines and humans need context for sarcasm detection.
Bamman & Smith (2015) are also trying to detect sarcasm by leveraging the contextual
information. Another important feature of this paper is the data they are using. They
use tweets and they started using self-labeled data, i.e. #sarcasm, #irony, etc., while
taking measures to exclude retweets and other submissions that are not appropriate for
their study. They used many standard approaches to create features, such as unigrams
and bigrams (BoW), part-of-speech tags, capitalization, punctuation, tweet sentiment
and per-word sentiment within one tweet. They also extracted features regarding
the authors, such as their profile information and historical sentiment, and audience features,
such as communication between author and addressee and their shared interests. They
found that the combination of the information about the author and that contained in the
tweet itself was nearly on par with the inclusion of all features at hand. They used a basic
regularized regression and managed to reach an accuracy of 85%.
A similar avenue was studied by Khattri et al. (2015), who, just like Bamman &
Smith (2015), used the author's history, but focused on sentiment contrast rather than many
different text-based features. They implement this idea by creating two models and
then synthesizing them: the first finds the prevailing historical sentiment
towards a specific entity, and the second looks at what sentiment is shown towards
the entity in the current tweet. This model using context is then combined with standard
sentiment incongruity within one tweet. They develop specific rules for combining
these models and report a very good overall performance, based on the
standard classification measures described in Section 4.3. They also acknowledge some
shortcomings of their approach, such as the assumption that the author's history contains the
true sentiment, and some accounts being deactivated, renamed or otherwise inaccessible;
this accounted for approximately 10% of all authors.
Another paper worth mentioning is Wallace et al. (2015), a follow-up to
Wallace et al. (2014), where it was shown that machines need context for
detecting sarcasm just like humans. In the newer paper a proper model implementing
the context as a feature is presented. The data set used was presented in the older
paper and here it is only reused. As their features they decided to leverage the detected
sentiment, the specific subreddit (a specific part of Reddit, defined
at the beginning of Chapter 3) and a so-called 'bag-of-NNP'. This bag is constructed in
the same way as a standard BoW, but only noun phrases (NNP), not unigrams, are taken
into account from each comment. By creating an interaction feature set between
the NNP features and subreddits they acquire their fourth feature type. To properly
leverage these features they use a specific regularization combining both an l1 and l2
penalty (explained in the section on regularized regression), each levied on a specific subset of features.
Ghosh et al. (2015) focused on a somewhat different task: discerning between a literal
and a sarcastic usage of a word. They used Twitter data and employed Amazon's
Mechanical Turk to crowdsource labels on specific phrases. The tweets themselves
were self-labeled, i.e. #sarcasm, #sarcastic, ..., and the phrases in the tweets that
were meant sarcastically were labeled by the Turkers, i.e. the people who perform tasks
at Amazon's Mechanical Turk. The Turkers also came up with an alternative way to
convey the same message without sarcasm. This gave rise to a lexicon of phrases with
similar meaning to the original sarcastic utterances. These phrases were then transformed
by word embeddings and used to classify the sarcastic tweets.
Joshi et al. (2016) also worked with word embeddings, but in a different fashion than
Ghosh et al. (2015). Instead of using the embeddings themselves as features, they use
them to find the similarity between words within a comment; they
do not use contextual information in this paper. They use only the features discussed previously
(lexical, BoW-based, sentiment incongruity, punctuation), plus the similarity between
different words computed from their word embeddings. Several different
pre-trained instances of word embeddings were used and the overall result was
that these features bring significant additional value.
Amir et al. (2016) followed in the direction of Joshi et al. (2016) and used word embeddings
as well. Their innovation was training user embeddings, which capture context and the
author's history. Their overall model is a neural net with convolutional layers which
processes both the user embeddings and content features. They split these features into four
groups – tweet features, author features, audience features and response features. They
used an existing data set from Bamman & Smith (2015), and the features used
were also the same with the exception of the user embeddings. The proposed
model seems to outperform the original model from Bamman & Smith (2015).
Chapter 3
Data description
In this chapter the data used are presented. Two different data sets were used: one is a
freely available data set created for sarcasm detection tasks and the other is a data set
of authors' histories downloaded from Reddit for this thesis.
Reddit is the self-proclaimed front page of the internet. It is a social media news
aggregation website where each user can post a comment, link or video in any Reddit
forum (called a subreddit), and others comment and react to it in a typical hierarchical
manner. Subreddits range from topics like politics and religion to the NBA (the highest
US basketball league) and the NFL (the highest US american football league), all the way to
country-specific subreddits such as the one for the Netherlands.
Both data sets are based on Reddit comments, all of which were in English. Almost all
of Reddit is in English and the prepared data set was curated to include only English
comments.
3.1 Sarcasm detection data set
The main data set used in this thesis is the one prepared by Khodak et al. (2017).
The data are already split into a training and a testing part by the authors, and on top of
that there are two versions – a balanced and an unbalanced one1. In this thesis the balanced
version is used, due to the aim of the thesis and hardware limitations. The balanced
version contains more than a million comments, which already presents a substantial
computational burden.
1A balanced data set is one where all classes are equally represented, while an unbalanced one has a higher prevalence of some classes than others.
The corpus consists of several different variables besides the label: the comment itself, the
parent comment, the subreddit where the comment was published, the author, the publication
date and the number of likes and dislikes. Naturally, the comment itself and the context
(parent comment and subreddit) hold the most information and are expected to bring a
lot of explanatory power.
The data set is presented and described at length in the original article; here
only the most important aspects are highlighted.
The most important detail that must be mentioned is the way the labels are acquired.
The sarcastic labels are added by the Reddit commenters themselves by adding /s at the
end of their comments. The labels are put in during the actual writing of the comment;
they are not added at a later date.
Several filters were used to exclude noisy comments. These include filtering comments
which are just URLs, and special handling of some comments due to properties of Reddit
conversations, such as replies to sarcastic comments, since these are usually very noisy
according to Khodak et al. (2017).
The raw data set contains 533 million comments, of which 1.3 million are sarcastic, spanning
January 2009 to April 2017. This set is restricted to authors who know the
Reddit sarcasm notation, /s, by excluding people who have not used the notation in the
month prior to their comment. These are excluded to minimise the noise from
authors who do not know the notation. This sarcasm notation used on Reddit is a form
of self-labeling, which means only the author has any influence on the final label of the
comment. Khodak et al. (2017) devote a large part of the paper to showing that the noise is
minimized and the data are thus appropriate for analysis. They study the proportion of
sarcastic and non-sarcastic comments, the rate of false positives and negatives and
the overall quality of the corpus for NLP tasks.
For a comment to be eligible for the data set it must pass several quality barriers, i.e.
• the author is familiar with the notation, meaning they used the sarcastic label in
the month prior to the comment,
• the comment is not a descendant, direct or indirect, of a comment labeled as
sarcastic, due to the noise in the labeling following a sarcastic comment,
• since the sarcasm prevalence in the overall data set is 0.25%, only very few
comments can be marked as sarcastic while actually being a reply to a sarcastic comment
which was not labeled as one.
The noise from sarcastic comments which were answers to non-labeled sarcastic
comments exists, but on the overall scale it is minimal. The impact on the overall
performance is negligible, but it is important to mention its existence. We are dealing
with real-world data, and obtaining a proper label for every instance is simply impossible;
this is as good as it gets.
The authors also performed a manual check on a small subset of the data and arrived at
the conclusion that the false positive rate is about 1% while the false negative rate is
about 2%. This is an issue in the unbalanced setting, as the sarcasm prevalence is about
0.25%, but in the balanced setting of the data set this noise is quite limited. The
manual checks are described at length in the original paper, and Khodak et al. (2017)
state the filters are working reasonably well, as shown by the high percentage of overlap with
the manual check. The manual label was created by a majority voting scheme between
human annotators.
Khodak et al. (2017) also explain why Reddit is a better data source for sarcasm
detection than most other commonly used sources. It holds the same advantage as
Twitter in having self-labeled data, while being written in non-abbreviated English, unlike
Twitter. Thanks to Reddit being organized into subreddits with a clear comment structure
and context, it is much easier to recover the necessary context features. According to
the authors these are the main advantages which make the data set more realistic and
better suited for this type of research.
The balanced data set used in this thesis has slightly more than one million comments.
3.1.1 Authors’ history
The idea behind the author's history is to capture a typical behavior pattern of a user:
we want a benchmark that is relevant in explaining the user's behavior.
The number of authors present in the above-mentioned data set is quite large, approximately
a quarter of a million. A large majority have only one comment and only a small
number of people have many comments. The largest number of comments per author is
854. Thus the only way to retrieve an author's history is to download it directly from
Reddit.
In order to create a reasonable representation of a user's behavior on Reddit, the 100 newest
comments per user were downloaded. These then serve as a benchmark against which the
comment of interest is compared to the "standard" way the user expresses themselves. This kind
of information again targets the incongruity within a person's expressions and is
based on the original idea of Amir et al. (2016).
The specific process is described in detail in Section 4.1.6. These comments were used
to create an equivalent of word embeddings for users: not a word but a user is mapped
into the high-dimensional vector space.
Chapter 4
Methodology
The methodology description is divided into several parts: first the process of feature
selection and text processing is explained and motivated, then neural networks
are described, and lastly the approach to model evaluation is
explained and the chosen measures are justified.
4.1 Pre-processing & Feature engineering
Several text processing techniques were used to extract the signal from the available
comments. Context is very important for sarcasm detection by humans, and the same
applies to machines, as was already shown in the literature (Wallace et al., 2014). Both
the original comment and its context (parent comment and author's history) are used
to provide additional information for our modeling. The exact extent to which this
information is used is described for each distinct feature set in the following sections.
The techniques used are described in some detail in the following sections. Alongside
the described features a specific form of context is used – a simple variable specifying
the prevalence of sarcasm in the subreddit. The share of sarcastic comments is calculated
by dividing the number of sarcastic comments by the overall number of comments.
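As a minimal sketch, assuming the labeled comments are available as (subreddit, label) pairs:

```python
from collections import defaultdict

def subreddit_sarcasm_share(comments):
    """comments: iterable of (subreddit, label) pairs, label 1 = sarcastic.
    Returns the share of sarcastic comments per subreddit."""
    total = defaultdict(int)
    sarcastic = defaultdict(int)
    for sub, label in comments:
        total[sub] += 1
        sarcastic[sub] += label
    return {sub: sarcastic[sub] / total[sub] for sub in total}

data = [("politics", 1), ("politics", 0), ("nba", 0), ("nba", 0)]
shares = subreddit_sarcasm_share(data)
print(shares)  # {'politics': 0.5, 'nba': 0.0}
```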
4.1.1 Bag-of-Words
This technique is applied to the text itself: first all unnecessary words are removed from
all documents at hand, and then the remaining words are used to fill a matrix, i.e.
the BoW. In this matrix the rows represent the documents and the columns represent
the words. Each field in the matrix thus marks how many times a word was used in a
document.
This is a standard NLP method that is widely used due to its strength and predictive
power.
The pre-processing of the text, i.e. removing all unnecessary words, consists of several
steps to achieve a clean text that can be easily transformed into the BoW. This
was achieved by an algorithm implemented using two Python libraries,
spaCy (Honnibal & Montani, 2017) and NLTK (Bird et al., 2009), together with Python's built-in
functions. The steps taken to clean the text were:
• all words are converted into lower case letters,
• all words are stemmed – stemming refers to leaving in place only the stem of the
word, e.g. fish, fishing, fisher and fished all have the same stem,
• only words are left in place – numbers, punctuation and special characters are
removed,
• stop words are removed, i.e. words that are too general and carry no specific
meaning such as the, a, and, is etc.,
• one- and two-letter words are also removed.
The algorithm described above was implemented from scratch, meaning only core func-
tions, e.g. transformations to lower case letters and stemming, were used in the process.
The rest of the algorithm was developed specifically for this thesis.
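The steps above can be sketched as follows; note that toy_stem is a crude stand-in for the real NLTK stemmer used in the thesis, and the stop-word list is truncated for illustration:

```python
import re

# Illustrative stand-ins: the thesis uses NLTK's stemmer and full stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "of", "to", "in"}

def toy_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text: str) -> list:
    tokens = re.findall(r"[a-z]+", text.lower())  # lower-case; keep words only
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    return [s for s in stems if len(s) > 2]       # drop 1- and 2-letter stems

print(clean("The fishers are fishing in 2 boats!!!"))  # ['fish', 'fish', 'boat']
```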
This leaves a significantly smaller vocabulary in the documents, and the preprocessed
text is then used to create the BoW. While the first steps seem reasonable even at first
sight, the last step, removing the one- and two-letter words, requires some additional
justification. The reason is that many short words can in the end be stemmed to the
same letter(s), e.g. the word 'saw' becomes 's'. With the large document collection at
hand this can occur often and create substantial noise, which would significantly lower
the overall accuracy of our sarcasm detection model. This reasoning applies to all the steps above,
not just the last one.
The last step is the normalization of this matrix, known as Term Frequency–Inverse Document
Frequency (TF-IDF). The goal of this normalization is to down-weight words that
are present in many documents, since that points to the word being a common occurrence
and having less predictive power than words present in only a smaller subset of the
documents. This step should improve the overall performance of our sarcasm detection
model.
The TF-IDF measure is defined as

tf-idf(t, d) = tf(t, d) · idf(t, d) (4.1)

and depends on the document d and the term/word t. The first factor, tf(t, d),
is intuitive and represents the number of occurrences of the word in the document, while
the second is a bit more complicated. The inverse document frequency is defined as
idf(t, d) = log((1 + n_d) / (1 + df(d, t))), where n_d is the overall number of documents and df(d, t) is
the number of documents in which the term in question appears. The reason for this
transformation is simple: the normalization helps bring out the more meaningful terms,
those with more influence. The output is a simple BoW with each field containing not the
number of occurrences of a word in a document but the normalized equivalent described
in Equation 4.1.
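The computation in Equation 4.1 can be checked on a toy corpus (tokenized documents invented for illustration):

```python
import math

def tf_idf(term, doc, docs):
    """tf-idf(t, d) = tf(t, d) * idf(t, d) with the smoothed idf from Equation 4.1."""
    tf = doc.count(term)                           # occurrences of the term in the document
    n_d = len(docs)                                # overall number of documents
    df = sum(1 for d in docs if term in d)         # documents containing the term
    return tf * math.log((1 + n_d) / (1 + df))

docs = [["sarcasm", "is", "hard"],
        ["sarcasm", "detection"],
        ["totally", "hard"]]
# 'sarcasm' appears in 2 of 3 documents -> lower weight;
# 'detection' appears in 1 of 3 -> higher weight.
print(tf_idf("sarcasm", docs[1], docs))    # log(4/3) ≈ 0.288
print(tf_idf("detection", docs[1], docs))  # log(4/2) ≈ 0.693
```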
The final step is applying Truncated Singular Value Decomposition (SVD), a concept
related to PCA, to lower the dimensionality of the matrix. Truncated SVD is now briefly
described.
Truncated SVD produces a low-rank approximation of the matrix in question,
X ≈ X_k = U_k Σ_k V_k^T, with U_k Σ_k being the transformed training set with
k components. It is often used for reasons of computational complexity, since it does not
calculate the full and exact decomposition but only its low-dimensional approximation. The
low-dimensional approximation has the same number of rows and only k columns. A
more in-depth explanation can be found in the literature, e.g. Bishop (2006), or in Appendix
A. The final output is then a matrix with a significantly smaller number of columns and
the same number of rows, with each row still representing a specific document.
This transformation is applied to both the comment of interest and the parent comment.
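A minimal sketch of the transformation with NumPy (the thesis presumably relies on a library implementation; this version computes U_k Σ_k from the full SVD, which gives the same quantity but is slower than a true truncated algorithm):

```python
import numpy as np

def truncated_svd(X, k):
    """Return U_k Σ_k, the k-dimensional representation of the rows of X,
    from the rank-k approximation X ≈ U_k Σ_k V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]  # shape: (n_documents, k)

# A toy document-term matrix: 4 documents, 3 terms.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 2.0]])
X_k = truncated_svd(X, k=2)
print(X_k.shape)  # (4, 2): same rows, only k columns
```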
4.1.2 Sentiment-based features
The sentiment-based features are extracted from both the parent comment and the
comment of interest. The comment's sentiment and subjectivity are determined, alongside
the most positive and most negative word and the standard deviation of the polarity of all
the words in the comment. Sentiment is a continuous measure on a scale from −1
to 1, while subjectivity is a continuous measure on a scale from 0 to 1. Negative polarity is
associated with negative numbers, while positive polarity occupies the positive part of
the axis; objective statements are described by lower subjectivity values while subjective
statements have higher values. Polarity expresses how positive or negative a specific word
is, e.g. 'good' is not as positive as 'great'; this scale makes it possible to order the
words by their overall positivity or negativity.
The reasoning behind this choice is based largely on Joshi et al. (2017)
and their discussion of sarcasm and sentiment. If the comment contains an
incongruity of sentiment, i.e. one sentence is positive and a second negative, it might be
an indication of an unspoken message, i.e. sarcasm. The same applies to the previous
comment: if the parent comment is in stark contrast with the comment in question,
we might be able to use it to detect sarcasm. The idea behind including the comment's
subjectivity is also obvious: it is difficult to express sarcasm without using at least some
subjective terms that can carry emotional meaning.
The sentiment retrieval in this thesis was done using the Python library NLTK (Bird
et al., 2009) and its pre-trained sentiment models. Each word has a specific sentiment
and objectivity assigned, and the pre-trained model is essentially a look-up table from
which the desired value is retrieved. This is a state-of-the-art NLP library widely used
by many researchers.
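The per-word look-up described above can be sketched with a toy polarity lexicon standing in for the pre-trained NLTK one:

```python
import statistics

# Toy polarity lexicon in [-1, 1]; the thesis uses NLTK's pre-trained lexicon.
POLARITY = {"good": 0.5, "great": 0.8, "love": 0.7, "ignored": -0.6, "bad": -0.6}

def sentiment_features(tokens):
    """Most positive and most negative word polarity, plus the standard
    deviation of polarities over the comment, as described above."""
    scores = [POLARITY.get(t, 0.0) for t in tokens]  # unknown words count as neutral
    return {
        "max_polarity": max(scores),
        "min_polarity": min(scores),
        "polarity_std": statistics.pstdev(scores) if len(scores) > 1 else 0.0,
    }

feats = sentiment_features(["i", "love", "being", "ignored"])
print(feats["max_polarity"], feats["min_polarity"])  # 0.7 -0.6
```

The co-occurrence of a high maximum and a low minimum polarity within one comment is exactly the sentiment incongruity the thesis targets.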
4.1.3 Lexical-based features
Another set of features that might be indicative of sarcasm are features based on the
textual expression of a user. There are some common ways to express sarcasm in the
online world in place of voice inflection, such as the usage of punctuation, capital letters and
emojis. All of these options are studied in this thesis.
The number of capital letters divided by the comment length is used to express how prevalent
their usage is throughout the comment; a high value should be indicative of a comment more
likely to be sarcastic. The same is done with words in all capitals: again, the higher the
value, the higher the likelihood of the comment being sarcastic, based on intuition and
the way people express themselves. This reasoning is applied to punctuation as
well: the number of punctuation marks, the number of exclamation and question marks
and the number of ellipsis cases (several dots in a row) are all retrieved and normalized
by the comment's length. Lastly, the number of emojis, again normalized by the comment's
length, is calculated.
These six lexical measures are calculated for both the parent and the original comment.
Python's built-in regular expression library, re, was used.
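A sketch of these length-normalised measures; the emoji pattern here is a simplified assumption covering only ASCII emoticons, not the full set handled in the thesis:

```python
import re

def lexical_features(comment: str) -> dict:
    """Length-normalised counts of the lexical clues described above."""
    n = max(len(comment), 1)  # guard against empty comments
    return {
        "capital_ratio": sum(c.isupper() for c in comment) / n,
        "all_caps_words": len(re.findall(r"\b[A-Z]{2,}\b", comment)) / n,
        "punctuation": len(re.findall(r"[!?.,;:]", comment)) / n,
        "excl_quest": len(re.findall(r"[!?]", comment)) / n,
        "ellipsis": len(re.findall(r"\.{2,}", comment)) / n,
        "emoji": len(re.findall(r"[:;]-?[)(DP]", comment)) / n,  # ASCII emoticons only
    }

feats = lexical_features("SURE... that went GREAT!!! :)")
print(feats["ellipsis"] > 0, feats["emoji"] > 0)  # True True
```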
4.1.4 PoS-based features
Another approach to processing the text is using the word types present in the comments.
PoS tagging gives every word within the comment the appropriate tag, e.g. proper noun,
pronoun, adjective, adverb. There are 34 English word types in the NLTK
library (Bird et al., 2009), which was used for this specific task and was already mentioned
in the sentiment-based feature engineering section. By extracting all word types from
the comment we might be able to better detect sarcasm, since some word types are more
likely to carry the sarcastic meaning than others; it is not easy to express sarcasm
without adjectives and/or adverbs. By exploiting this knowledge we might be able to
uncover some additional signal in the data. This is a novel approach not tried in the
previous literature, where PoS-based features were used only scarcely, and usually as
auxiliary measures rather than as main explanatory variables.
We implemented this approach by creating a standard BoW with the columns representing
all the word types. We then used the entire matrix as an input, to see whether
the comment's structure plays a role or not.
4.1.5 Word similarity-based features
Determining word similarity is an idea dating back decades. Many approaches and
techniques have been applied to this problem, but since the seminal paper by Mikolov et al.
(2013) and the subsequent research of Mikolov and his team, their approach to word
embeddings has become the standard. Their algorithm is called word2vec; by
creating a version of the BoW, the so-called continuous bag-of-words, and training a shallow
neural net over it, it managed to capture the underlying semantic
meaning much better than all previous attempts. The continuous BoW is based on the
idea of context: each row and column represent a word, and a field is filled by the number
of times the word from the column appeared in the context of the word from the row.
Context here means the immediate surroundings of the word from the row; the window's
precise size is not predefined and can change based on the problem at hand. In the end
each word is represented by a vector in a high-dimensional space. Each vector is the
output of the shallow neural net, and the dimensionality is determined by the output
parameters of the net, while the input is the continuous BoW. A more
detailed explanation can be found in Mikolov et al. (2013). Similarity
in this kind of setting is determined by calculating the cosine distance between two
vectors.
The implemented approach used the Python library spaCy (Honnibal & Montani, 2017)
and took advantage of its pretrained word vectors. With a pretrained model it
is enough to simply extract the underlying vector of each word and then compare the
vectors of interest when necessary.
Joshi et al. (2016) were among the first to look at this approach to sarcasm detection
and found that, just like with sentiment, a certain semantic incongruity in the text can be
indicative of the presence of sarcasm. Thus we compare all verbs and nouns within a comment
against each other and find the highest and lowest similarity within these two groups.
This tells us whether the comment is heterogeneous or homogeneous meaning-wise.
Studying the similarities and dissimilarities within a comment can lead to better detection
of sarcasm. If a comment describes a person as being 'as good at something as a fish
at flying', it is not meant sincerely, and the differences in semantic
meaning can help detect the sarcastic comment: since 'fish' and 'flying' are dissimilar,
the cosine distance between them would be large.
Lastly, the comment of interest and the parent comment are compared and their similarity
is calculated using cosine distance as well. This is done internally by the spaCy library
by averaging the vectors within each comment and then calculating the cosine similarity
of the two resulting vectors.
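The within-comment similarity computation can be sketched with toy vectors standing in for the pretrained spaCy embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors; real spaCy vectors have hundreds of dimensions.
vectors = {
    "fish":   np.array([1.0, 0.0, 0.2]),
    "flying": np.array([0.0, 1.0, 0.1]),
    "swim":   np.array([0.9, 0.1, 0.3]),
}

def min_max_similarity(words):
    """Lowest and highest pairwise similarity among the given words."""
    sims = [cosine_similarity(vectors[a], vectors[b])
            for i, a in enumerate(words) for b in words[i + 1:]]
    return min(sims), max(sims)

lo, hi = min_max_similarity(["fish", "flying", "swim"])
print(lo < 0.2, hi > 0.9)  # True True: 'fish'/'flying' clash, 'fish'/'swim' agree
```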
4.1.6 User embeddings
The last step of the feature engineering process is the creation of a vector representing
a specific user. The user history downloaded from Reddit is used in tandem with the
pretrained word vectors from the spaCy library mentioned in the previous section. This idea
is based on the paper by Amir et al. (2016): it takes the past utterances of a user,
i.e. their past comments, and averages the vectors of the words they used. Afterwards
this vector representing the user is compared to the comment being categorized, and
their similarity is calculated, once again using the cosine distance.
This feature is meant to identify potential deviations from the user's typical way of
expression. If the user is writing differently than is common for them, it might
be a sign of insincerity. This is simply another attempt to identify the potential incongruity
present in the text in the case of sarcasm.
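A sketch of the user-embedding comparison, again with toy vectors in place of the pretrained spaCy embeddings:

```python
import numpy as np

def embed(tokens, vectors):
    """Average the word vectors of a token list (a comment or a whole history)."""
    return np.mean([vectors[t] for t in tokens if t in vectors], axis=0)

def user_comment_similarity(history_comments, comment, vectors):
    """Cosine similarity between a user's averaged history and one comment."""
    user_vec = embed([t for c in history_comments for t in c], vectors)
    com_vec = embed(comment, vectors)
    return float(np.dot(user_vec, com_vec) /
                 (np.linalg.norm(user_vec) * np.linalg.norm(com_vec)))

# Toy vectors standing in for the pretrained embeddings.
vecs = {"nice": np.array([1.0, 0.2]), "game": np.array([0.8, 0.4]),
        "awful": np.array([-0.9, 0.3])}
history = [["nice", "game"], ["nice"]]
# A comment unlike the user's history yields a low (here negative) similarity.
print(user_comment_similarity(history, ["awful"], vecs) < 0)  # True
```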
4.2 Neural network
Neural networks are widely used models based on a collection of connected units (called
neurons) which loosely imitate the way the human brain works. The first relevant paper is
McCulloch & Pitts (1943), nowadays considered the first stepping stone
in the creation of neural nets.
A good description of neural networks is from Kriesel (2007): ”An artificial neural
network is a network of simple elements called artificial neurons, which receive input,
change their internal state (activation) according to that input, and produce output
depending on the input and activation. The network forms by connecting the output
of certain neurons to the input of other neurons forming a directed, weighted graph.
The weights as well as the functions that compute the activation can be modified by
a process called learning which is governed by a learning rule.” This is a very brief
and condensed explanation, although not the most intuitive one. A very intuitive explanation
for econometricians, of a neural net used for binary classification, can be found in the
paper by Mullainathan & Spiess (2017) – ”... for one standard implementation [of a
neural net] in binary prediction, the underlying function class is that of nested logistic
regressions: The final prediction is a logistic transformation of a linear combination
of variables (“neurons”) that are themselves such logistic transformations, creating a
layered hierarchy of logit regressions. The complexity of this function class is controlled
by the number of layers, the number of neurons per layer, and their connectivity (that
is, how many variables from one level enter each logistic regression on the next).”
The deep learning approach is often applied to specific types of data, i.e. text, audio
and images, where different types of neurons are included alongside those already described.
The main idea is to mimic some of the functions of the human brain and to give the net the
ability of abstraction, such as recognizing patterns in images or in text. Some examples
are automatic translation, image classification and image captioning. This approach has
not been applied in this thesis and is not discussed any further.
The choice of the neural net as a classifier was largely based on two reasons. The first
is the overall strength of the algorithm: in current applications neural nets are
the go-to algorithm and commonly achieve the best performance if properly
optimized. The second is the ability of the classifier to use online learning.
4.2.1 Feed-forward Neural network
Several types of neural nets exist which are meant specifically to tackle image or speech
recognition, text analysis or even standard classification issues such as the one presented
in this thesis. The basic neural network used in this thesis is the one described in the
previous section, essentially a series of logistic regressions. It does not have to be a
series of logistic regressions per se; it can be any series of binary classifiers where each
neuron acts as a classifier. In econometric terminology, in the context of a series
of logistic regressions, the link function does not need to be the logistic function. The more
complex neural nets with different architectures are not discussed here at length since
they are used to solve different types of problems.

Figure 4.1: A basic neural net (Bishop, 2006)
The specific issues and details regarding the default parameters of the neural net,
such as the number of layers and neurons, are discussed in Chapter 5 as they are an
empirical rather than a theoretical issue.
For simplicity, we illustrate the procedure with a simplified neural net that contains
only 3 layers: inputs, outputs and an intermediate, so-called ”hidden layer”. The
described architecture is shown in Figure 4.1. As can be seen in Appendix B,
much more complicated nets were used for the actual empirical analysis at hand. The
mathematical description that follows is also based on this net to make the explanation
as simple and as informative as possible. The following example is based largely on the
well-known Bishop (2006).
The input, x, for our neural network are the feature sets described in the previous section.
These are thus the results of the application of one or several of the aforementioned
pre-processing techniques.
The output of each final neuron of the neural net can be denoted as
y(x, w) = f\left( \sum_{j=1}^{M} w_j \phi_j(x) \right)    (4.2)
where φj(x) depends on parameters and is adjusted alongside the coefficients wj . The
neural network uses the Equation 4.2 as its underlying idea. Each neuron’s output is
a nonlinear transformation of a linear combination of its inputs where the weights, or
adaptive parameters, are the coefficients of the linear combination.
As per Figure 4.1, M represents the number of hidden neurons and D represents
the number of input variables. Figure 4.1 is then used as the basis for the overall
idea of the neural network. Firstly we take M linear combinations of the input variables
x_1, ..., x_D in the form
a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)},    (4.3)
where j = 1, ..., M corresponds to the neurons in the hidden layer of the neural net. The
superscript (1) refers to the input layer. The terms w_{ji}^{(1)} are referred to as weights and
w_{j0}^{(1)} as biases; a bias is known in econometrics as an intercept. The weights are
randomly initialized; the exact approach is described in Chapter 5 and is
taken from the paper by He et al. (2015). The quantity a_j is known as an activation and
each activation is then transformed by a nonlinear activation function h(·) which leads
to the final neuron output in the shape
z_j = h(a_j).    (4.4)
The idea behind this approach is that the weighted inputs must together pass a certain
threshold to be influential; this naturally depends on the specific activation function.
All the neurons then jointly influence the final output. The output neuron with the
‘strongest signal‘ is taken as the prediction for the given observation.
The neurons in the network after the first input layer are known as hidden units and
the quantities z_j correspond to their output. The next layer (the output layer in
Figure 4.1) then has output
a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)},    (4.5)
where k = 1, ..., K and K represents the number of outputs; in the case of a binary
outcome, K = 2. The output could also be just one neuron; this choice is not universal
and preferences in the literature vary. The decision is then taken based on which of the
two output neurons, the one representing the sarcastic or the non-sarcastic class, has the
higher final activation value. This equation describes the second layer of the network (the
output layer) and the resulting activation is again transformed with a nonlinear activation
function, which gives us the set of network outputs y_k. The choice of the final activation
function naturally depends on the data set at hand and its structure: for a regression it is
the identity function and for binary classification it can be the standard logistic sigmoid
function y_k = σ(a_k) = h(a_k) where
\sigma(a) = \frac{1}{1 + \exp(-a)}.    (4.6)
If all the enumerated steps are combined then the final output of the neural net in our
example can be written down as
y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)    (4.7)
where all the weight and bias parameters are grouped together in the vector w. This
means the neural net is a nonlinear function from input variables {xi} to output variables
{yk} determined by the weights in the vector w. The process described up to this point
is also known as forward propagation. The choice of the activation functions is done
empirically and is left for Chapter 5 which is focused on the empirical part.
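The forward pass of Equation 4.7 can be sketched in a few lines of NumPy. The dimensions, random weights and tanh hidden activation below are illustrative only, not the configuration used in this thesis:

```python
# Forward propagation through one hidden layer, mirroring Eqs. (4.3)-(4.7).
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 4, 3, 2                # inputs, hidden units, outputs (toy sizes)

W1 = rng.normal(size=(M, D));  b1 = np.zeros(M)   # first-layer weights/biases
W2 = rng.normal(size=(K, M));  b2 = np.zeros(K)   # second-layer weights/biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))               # Eq. (4.6)

def forward(x):
    a = W1 @ x + b1              # hidden activations, Eq. (4.3)
    z = np.tanh(a)               # hidden outputs, Eq. (4.4) with h = tanh
    a_out = W2 @ z + b2          # output activations, Eq. (4.5)
    return sigmoid(a_out)        # network outputs y_k, Eq. (4.7)

y = forward(rng.normal(size=D))  # one output per class, each in (0, 1)
```

The two outputs correspond to the K = 2 output neurons; the class with the stronger signal would be taken as the prediction.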
The overall training of the neural net happens in two stages, forward propagation
and backpropagation. Forward propagation was already described and gives us the
exact value of our error function1, defined in Equations 4.8 and 4.9 below.
Given the target t_n, the error function is non-convex due to the highly nonlinear
nature of the neural net and usually looks like
E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2    (4.8)
in the regression setting when t_n is the value of the target, i.e. the dependent variable
in econometrics, or as
E(w) = -\sum_{n=1}^{N} \{ t_n \log y_n + (1 - t_n) \log(1 - y_n) \}    (4.9)

in the binary classification case when t_n ∈ {0, 1}.
The relationship between forward propagation and backpropagation is quite straightforward.
Forward propagation returns the value of our error function for the current weights
and biases. Backpropagation then adjusts these weights to achieve a lower overall error.
The specific way this is done is now described.
The non-convexity means there are multiple local extrema, and finding the global
extremum is difficult and sometimes not necessary; finding a local extremum that is
‘close enough‘ is often satisfactory. The problem is that we never know what ‘close
enough‘ means if we do not know the value of the global minimum. In practice this is
handled by iterating over many hyperparameters and searching for the values which lead
to the best performance of the neural net on the validation set while controlling for
overfitting.
We start with an initial vector w and we look for one that minimizes E(w). The obvious
first approach is to find an analytical solution, but sadly in the case of our non-convex
error function this does not exist and we must look for the optimal vector w with a
numerical approach. When the vector w is changed to w + δw, the error changes as
δE ≈ δw^T ∇E(w), where ∇E(w) represents the gradient of the error function.

1The error function is the ML equivalent of the loss function in econometrics.
The optimization of such functions is a common problem and is done in several steps.
Usually a starting point w^{(0)} is chosen and then in each step the weight vector is
updated using
w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}    (4.10)
where τ represents the iteration step. Each algorithm approaches this problem differently
and the problem is not described here at length. The algorithms usually differ in how
they update the weights at each step, i.e. whether only the gradient is added or whether
some other term based on the gradient is added as well. Some details can be found in
Bishop (2006). Every library implementing neural nets includes these algorithms,
including the PyTorch library (Paszke et al., 2017) used here.
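The iterative update of Equation 4.10 can be illustrated with plain gradient descent on a simple convex surface. The quadratic error function below is purely illustrative; the thesis's actual error surface is non-convex:

```python
# Gradient descent: w^(tau+1) = w^(tau) - eta * grad E(w^(tau)), Eq. (4.10)
# with the step Delta w chosen as the negative gradient times a learning rate.
import numpy as np

def E(w):
    """A toy quadratic error surface with its minimum at w = (3, 3)."""
    return float(np.sum((w - 3.0) ** 2))

def grad_E(w):
    """Gradient of the toy error surface."""
    return 2.0 * (w - 3.0)

w = np.zeros(2)          # starting point w^(0)
eta = 0.1                # learning rate
for _ in range(200):     # repeated updates shrink the error step by step
    w = w - eta * grad_E(w)
```

On a non-convex surface the same update rule only guarantees convergence to a local extremum, which is why the hyperparameter search described later is needed.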
4.2.2 Error Backpropagation
Backpropagation is a method of finding the optimal parameters of the neural net. Before
we can describe the process of backpropagation in detail we must first discuss the overall
approach to training the neural net.
Error backpropagation is an efficient technique for evaluating the gradient of the error
function E(w). The goal is to evaluate the derivative of the loss function and then move
in the direction of steepest descent, as is common in nonlinear optimization. The derived
backpropagation formulas are quite general. Bishop (2006) showed them on a typical
maximum likelihood error function for a set of i.i.d. data, which is denoted as
E(w) = \sum_{n=1}^{N} E_n(w).    (4.11)
The formulas are shown only for the evaluation of ∇E_n(w), which is enough since
sequential optimization or batch evaluation can be used. For the explanation of
backpropagation let us use a simple linear model with output y_k and input x_i in the
form of
y_k = \sum_{i} w_{ki} x_i.    (4.12)
This gives rise to the regression error function for the particular input pattern n,
E_n = \frac{1}{2} \sum_{k} (y_{nk} - t_{nk})^2,    (4.13)
where y_{nk} = y_k(x_n, w). The binary classification error function takes the form of
E_n = -\sum_{k} \{ t_{nk} \log y_{nk} + (1 - t_{nk}) \log(1 - y_{nk}) \}.    (4.14)
We also see that the gradient of this error function w.r.t. w_{ji} is, in the case of regression,
\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) x_{ni}    (4.15)
and in the case of binary classification the result is almost identical,

\frac{\partial E_n}{\partial w_{ji}} = (t_{nj} - y_{nj}) x_{ni}.    (4.16)
One can interpret the gradient as a ’local’ computation of the product of the ’error signal’
y_{nj} - t_{nj} associated with the output end of the link w_{ji} and the variable x_{ni}
associated with the input end of the link. This can be easily generalized to the setting of
a multilayer network. Each unit (neuron) calculates in general a weighted sum of its
inputs as
a_j = \sum_{i} w_{ji} z_i    (4.17)
where z_i is the activation of a neuron (in the case of the first layer, an input) that
sends a connection to unit j, and w_{ji} is the appropriate weight. The activation from
Equation 4.17 is then transformed by a nonlinear activation function h(·), which gives
us the following activation of unit j

z_j = h(a_j).    (4.18)
This can be extended to evaluate the derivative of E_n w.r.t. w_{ji}. The output of the
various units in the following equations naturally depends on the particular input pattern
n, but this dependence is omitted to keep the notation uncluttered. To get the derivative
of E_n w.r.t. w_{ji} we must apply the chain rule and we get
\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}.    (4.19)
For cases when the network is somewhat more complex, a useful notation is introduced:
\delta_j \equiv \frac{\partial E_n}{\partial a_j}    (4.20)
where the δ’s are commonly known as errors. Now if we use Equation 4.17 we can say
\frac{\partial a_j}{\partial w_{ji}} = z_i.    (4.21)
Then we can substitute, as is shown in Bishop (2006), Equations 4.20 and 4.21 into
Equation 4.19 and arrive at the resulting equation

\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.    (4.22)
From this equation we can see that the required derivative is obtained by multiplying
the error δ of the unit at the output end of the weight by the activation z at its input
end. We therefore only need to calculate the δ’s for all the hidden and output units in
the network, which allows for efficient backpropagation.
For the output units δ takes the following form
δk = yk − tk (4.23)
and for hidden units we can write down δ as
\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_{k} \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}    (4.24)
where the sum runs over all units k to which unit j sends connections. By combining all
the previous equations we can arrive at the final form
\delta_j = h'(a_j) \sum_{k} w_{kj} \delta_k    (4.25)
which allows us to get the value of δ for any hidden unit by simply propagating δ’s
backwards from the units which precede the unit in question in the network.
The overall process was summed up by Bishop (2006) in four simple steps:
1. Forward propagate the input vector xn through the network to obtain activations
of all hidden and output units using Equations 4.17 & 4.18.
2. Evaluate δk for all the output units using Equation 4.23.
3. Backpropagate δ’s using Equation 4.25 to get δ_j for each hidden unit in the network.
4. Evaluate the needed derivatives using Equation 4.22.
This 4-step algorithm is repeated until our stopping criterion kicks in.
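The four steps can be sketched for a single hidden layer with h = tanh and the sum-of-squares error of Equation 4.13. The dimensions and weights below are toy values, not the thesis configuration:

```python
# One pass of the 4-step backpropagation algorithm for a 1-hidden-layer net.
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 3, 4, 2
W1 = rng.normal(size=(M, D))             # input-to-hidden weights
W2 = rng.normal(size=(K, M))             # hidden-to-output weights
x = rng.normal(size=D)                   # one input pattern x_n
t = np.array([1.0, 0.0])                 # its target t_n

# Step 1: forward propagate (Eqs. 4.17 & 4.18), here h = tanh
a_hidden = W1 @ x
z = np.tanh(a_hidden)
y = W2 @ z                               # linear output units

# Step 2: evaluate the output errors (Eq. 4.23)
delta_out = y - t

# Step 3: backpropagate the deltas (Eq. 4.25); tanh'(a) = 1 - tanh(a)^2
delta_hidden = (1.0 - z ** 2) * (W2.T @ delta_out)

# Step 4: the required derivatives (Eq. 4.22): delta times input activation
grad_W2 = np.outer(delta_out, z)
grad_W1 = np.outer(delta_hidden, x)
```

The gradients obtained this way agree with numerical differentiation of E_n, which is a standard sanity check for a backpropagation implementation.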
Naturally this explanation does not cover everything, but it is a sufficient introduction
to neural networks for this thesis to be comprehensible without any prior knowledge.
4.3 Evaluation
The evaluation methods presented here are straightforward and well known, being
standard and widely used metrics in the field of classification: accuracy, the confusion
matrix and the F1 score. Firstly we define these terms and then their choice is discussed.
The chosen metrics are also a common sight in the sarcasm detection literature, which is
one of the reasons for their selection.
Accuracy (De Bievre, 2012) is the simple ratio of correctly classified instances to all
instances. This can be written down as

accuracy = (true positives + true negatives) / (all instances)    (4.26)

where true positives are correctly classified sarcastic comments and true negatives are
correctly classified non-sarcastic comments.
In the case of binary classification the confusion matrix is a simple 2x2 table with 4
fields: true positives, false positives, false negatives and true negatives. The true
positives and negatives have already been explained. False positives are in our case
non-sarcastic comments which were predicted to be sarcastic, while false negatives are
sarcastic comments which were tagged as not sarcastic. This gives us more information
than plain accuracy, since we can see where the model makes most of its mistakes and
whether it has the same prediction strength for both classes. This can be very
helpful especially when the use case is not clear from the beginning and it
is not known how harmful false positives and false negatives are. We get a better
understanding of the model and know how to augment it in an appropriate way.
An example of a situation where false negatives are more harmful might be a company
which is a subcontractor for a large automobile company. It has a system in
place which automatically detects flawed pieces of hardware. Their contract states that
they pay hefty fines for all flawed pieces which are delivered; this means that if a flaw
is not found, i.e. the piece of hardware was flagged as problem-free (a negative
result), it hurts the company more than simply discarding a perfectly fine piece
of hardware. With this kind of knowledge the model can be fine-tuned to have zero
false negatives and a relatively high number of false positives, i.e. pointlessly discarded
items. Another example might be a system for sarcasm detection: if a firm tracks
the reviews of its products it might falsely believe customers are quite satisfied with its
product even though some of the comments are sarcastic and the product is well below
average in the overall rating.
Lastly, the F1 score is the harmonic mean of precision and recall, i.e.

F1 = 2 * (precision * recall) / (precision + recall).    (4.27)

Precision is defined as

precision = true positives / (true positives + false positives)    (4.28)

and recall is defined as

recall = true positives / (true positives + false negatives).    (4.29)
A good F1 score usually means that the model manages to discover most of the instances
of interest, i.e. sarcastic comments, while also being quite precise, meaning it does
not discover these cases simply by labelling all comments as sarcastic, which would
deliver perfect recall.
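Equations 4.26–4.29 can be computed directly from the four confusion-matrix counts. As an illustration, plugging in the counts of the main model's confusion matrix reported in Chapter 5 reproduces its published scores:

```python
# Accuracy, precision, recall and F1 from confusion-matrix counts,
# following Equations (4.26)-(4.29).
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)      # Eq. (4.26)
    precision = tp / (tp + fp)                      # Eq. (4.28)
    recall = tp / (tp + fn)                         # Eq. (4.29)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4.27)
    return accuracy, precision, recall, f1

# Counts of the main model's confusion matrix (Chapter 5).
acc, prec, rec, f1 = metrics(tp=50760, fp=21231, fn=25034, tn=54615)
```

This yields an accuracy of roughly 0.695 and an F1 score of roughly 0.687, matching the figures reported for the main model.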
In this thesis all these measures are reported so that we get a better idea of the underlying
model dynamics. The best way to improve the model in the future is to see where it
does well and where it fails to perform at a desirable level.
Chapter 5
Empirical part
In this chapter the empirical part is discussed. This includes the setup, i.e. a quick
description of the tools used to obtain the results and a discussion of the neural net
tuning, and the actual results achieved. The results concentrate on the added value of
each feature set over the very basic BoW, which is taken to be the baseline prediction
set.
5.1 Setup
Hardware limitations were mentioned several times during the thesis. This is caused by
the fact that a laptop with an i3 1.8GHz CPU and 8GB of RAM (and no graphics card)
was used to train the neural net and do the feature engineering. This led to many
issues, since the limited RAM rules out several methods, for example any learning
algorithm without an online learning option, such as random forest, or a dimensionality
reduction technique such as PCA/Truncated SVD. The advantage of online learning lies
in the fact that while methods like random forest or PCA need to work with the whole
data set at once (they must take in all the data and then output the final results),
models with an online learning option can take the data in sequentially: if the data set
has one million rows they can process it in ten batches of a hundred thousand
observations, which are quite easy to process even on slower machines, and after each
batch the model parameters are updated.
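The mechanics of online learning can be sketched with a logistic regression updated batch by batch, never holding more than one batch in memory. The data, dimensions and learning rate below are synthetic placeholders; the neural net used in this thesis is trained analogously:

```python
# Online (mini-batch) learning: weights are updated after each batch, so the
# whole data set never needs to fit in memory at once.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                               # model weights, updated per batch
true_w = np.array([2.0, -1.0, 0.5])           # hypothetical true weights

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for _ in range(200):                          # a stream of small batches
    X = rng.normal(size=(100, 3))             # one batch of 100 observations
    y = (X @ true_w + 0.1 * rng.normal(size=100) > 0).astype(float)
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # cross-entropy gradient
    w -= 0.5 * grad                           # update after each batch
```

A batch-only method such as PCA or random forest would instead need all 20,000 rows at once, which is exactly what the limited RAM ruled out here.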
Another thing which should be discussed before the results are presented is how the
neural net was tuned. Firstly a high-level overview is given, followed by a description
of the final net. The chosen approach is quite typical: the data were divided into a
train set, a validation set and a test set. This division affects only the observations;
all features are always kept. The neural net was trained on the train set, which accounts
for 70% of the data. The validation set and test set each consist of 15% of the data
and serve their usual roles. The validation set is used for finding optimal hyperparameter
values by iterating over different parameters of the neural net: learning rate,
regularization, different architectures and different optimizers. The specific details
of this process are discussed in Appendix B and briefly mentioned in the following
paragraphs as well. This hyperparameter tuning is necessary for two reasons: firstly,
different architectures of the net create different loss functions and different achievable
performance, and secondly, since there is no easily found global minimum, we must use
numerical procedures to efficiently find some satisfactory minimum of the loss function.
Tuning the appropriate hyperparameters (learning rate, weight decay, etc.) is necessary
to achieve this goal. The final results are then obtained by applying the best model to
the test set. This final result is compared with the retrained net’s performance on
subsets of all the features, and these are reported alongside the overall performance.
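The 70/15/15 split described above can be sketched as a random permutation of row indices. The row count below is hypothetical:

```python
# Random 70/15/15 train/validation/test split over observation indices.
import numpy as np

rng = np.random.default_rng(42)
n = 1000                                  # hypothetical number of observations
idx = rng.permutation(n)                  # shuffle once, then slice

n_train = int(0.70 * n)
n_val = int(0.15 * n)
train_idx = idx[:n_train]                 # 70% for training
val_idx = idx[n_train:n_train + n_val]    # 15% for hyperparameter tuning
test_idx = idx[n_train + n_val:]          # 15% held out for the final score
```

Only the observations are split this way; every feature column is kept in all three sets, as stated above.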
Firstly the structure of the net is discussed and then the hyperparameters related to
the learning process are mentioned. All of the chosen hyperparameters of the neural
net were determined using grid search. This is the standard approach in ML:
several different combinations of the hyperparameters are tried and the one with the
best performance is chosen. There is no theoretical foundation for these choices; the
best combination is found by computing a large number of models and choosing the
one which fits the data best. The division of the data into the train set, validation set
and test set is there to prevent overfitting. The problem of overfitting is the unknowing
extraction of some of the noise present in the data under the assumption that it is part
of the signal. This division allows testing of the model on unseen data, which tells us
whether the model generalizes, i.e. whether it can perform at the same level on unseen
data with different noise or whether the model is over-taught on the one specific data
set and has incorporated the noise into the model. A brief description of the used neural
net follows; the grid search and tested architectures are listed in Appendix B.
The neural net used in the end has 3 hidden layers: the first hidden layer has 256
neurons, the second 128 and the third 64 neurons. The dropout applied, i.e. the
percentage chance that each neuron in the layer is dropped during training, is 0.25 for
the first layer and 0.2 for all other layers. Dropout is introduced to prevent overfitting
of the net: by dropping random neurons within the layers the model cannot rely solely
on a few connections, since those are sometimes removed, and must learn to use other
signals in the data as well.
The activation function, defined in Equation 4.4, used in the learning was the Parametric
Rectified Linear Unit (PReLU) (He et al., 2015), which can be written down as

f(x) = \begin{cases} x & \text{if } x \geq 0 \\ ax & \text{otherwise.} \end{cases}    (5.1)
The parameter a is then trained alongside the weights of the neural net. The activation
function in the output layer is the standard sigmoid, which is commonly used in the case
of binary classification. The weight initialization should be mentioned as well: a robust
method from the paper by He et al. (2015) is used. This initialization takes the form of
a random draw for each weight from the normal distribution N(0, std^2) where
std = \sqrt{2 / (1 + a^2)}, a being the parameter from Equation 5.1. Lastly, the standard
cross-entropy loss function was used, which takes the form of

L(w) = -\frac{1}{N} \sum_{n=1}^{N} \left[ t_n \log y_n + (1 - t_n) \log(1 - y_n) \right].    (5.2)
The hyperparameters related to learning are mainly those of the Adam optimizer
(Kingma & Ba, 2014), which is essentially an improved version of standard SGD.
Its default learning rate of 0.001 is used, which, when coupled with a multi-step learning
rate scheduler, turned out to deliver the best performance. The multi-step learning rate
scheduler is simply an algorithm that decides after how many epochs (an epoch is one
run through the entire training data set) the learning rate should decrease by a factor
of 0.1, so the numerical optimization becomes more precise in later epochs when the
algorithm is already close to the ”good” solutions. The last used hyperparameter was
weight decay, which is used as a way of regularizing the network; the best performing
value was 1e-3/3.
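Under the stated choices, a PyTorch sketch of the final net might look as follows. The input dimension, batch size and scheduler milestones are placeholders (the thesis does not state them here); the layer sizes, dropout rates, PReLU, sigmoid output, Adam settings and weight decay follow the description above:

```python
# Sketch of the described architecture: 256/128/64 hidden units, PReLU,
# dropout 0.25 then 0.2, sigmoid output, Adam with a multi-step lr schedule.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(150, 256), nn.PReLU(), nn.Dropout(0.25),   # first hidden layer
    nn.Linear(256, 128), nn.PReLU(), nn.Dropout(0.2),    # second hidden layer
    nn.Linear(128, 64),  nn.PReLU(), nn.Dropout(0.2),    # third hidden layer
    nn.Linear(64, 1),    nn.Sigmoid(),                   # sigmoid output unit
)

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,                    # Adam's default rate
                             weight_decay=1e-3 / 3)      # best weight decay
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 20], gamma=0.1)           # lr * 0.1 at epochs
loss_fn = nn.BCELoss()                                   # cross-entropy, Eq. (5.2)

x = torch.randn(8, 150)                                  # a dummy input batch
y = model(x)                                             # probabilities in [0, 1]
```

Note that `nn.PReLU` carries the trainable parameter a from Equation 5.1, and PyTorch's He-style initialization uses the same gain \sqrt{2/(1+a^2)} mentioned above.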
The grid search tried almost 30 different combinations, with each run taking between 2
and 3 hours of computation time depending on the overall number of parameters.
Different activation functions were tried (PReLU, ReLU, Leaky ReLU), various levels
of dropout in all layers (from 0.5 to 0.2), and different numbers of hidden layers (1, 2, 3
and 4) with different numbers of neurons in them (all were some power of two). This was
combined with different optimizers than Adam (Adagrad and SGD), different learning
rates (1e-3 and its multiples, i.e. times two, one half and one third) and different
weight decays (1e-3, 1e-3/2, 1e-3/3).
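The grid search itself is a simple enumeration of candidate combinations. In the sketch below the candidate values echo some of those listed above, while `score()` is a hypothetical stand-in for training the net and measuring validation accuracy:

```python
# Grid search: try every combination of candidate hyperparameters and keep
# the one with the best validation score.
from itertools import product

learning_rates = [1e-3, 2e-3, 5e-4]
weight_decays = [1e-3, 1e-3 / 2, 1e-3 / 3]
dropouts = [0.5, 0.4, 0.3, 0.2]

def score(lr, wd, p):
    # Hypothetical validation accuracy; in reality this would train the
    # neural net with these settings and evaluate it on the validation set.
    return 0.65 + 0.01 / (1 + abs(lr - 1e-3)) - 0.001 * p

best = max(product(learning_rates, weight_decays, dropouts),
           key=lambda cfg: score(*cfg))
```

With real training in place of `score()`, each of the roughly 30 combinations tried here costs hours of computation, which is why the grid was kept coarse.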
The neural net was trained and optimized using the PyTorch deep learning framework
(Paszke et al., 2017).
5.2 Results
The results of the overall model are presented first and are followed by the results of the
specific feature subsets. The results of the partial models are shown and the performance
differences are discussed.
The overall model managed to achieve a decent 69.5% accuracy on the test set. The
benchmark of human performance (average accuracy achieved by three humans) given
with the original data in the paper from Khodak et al. (2017) is 81% accuracy (no
other scores were given), which means that our model performs at a reasonable level
given all the hardware limitations. The F1 score was 0.687 and the confusion matrix
is indicative of a model that is fairly balanced in terms of sensitivity (the true positive
rate, i.e. recall) and specificity (the true negative rate). It is slightly skewed towards
specificity but the difference is not major.
                               Actual value
                          Sarcasm    Not Sarcasm
Predicted   Sarcasm        50760        21231
value       Not Sarcasm    25034        54615

Table 5.1: Confusion matrix of the main model
To see the impact of individual feature sets, the model was applied to some specific
feature subsets as well; the studied combinations are as follows:
1. All features,
2. BoW and context,
3. BoW, context and PoS-based features,
4. BoW, context and similarity-based features,
5. BoW, context and user similarity-based features,
6. BoW, context and sentiment-based features,
7. BoW, context and lexical-based features.
Model   Accuracy   F1-score
1       69.5%      0.687
2       65.7%      0.639
3       67.5%      0.664
4       66.6%      0.658
5       65.9%      0.641
6       66.05%     0.644
7       67%        0.662

Table 5.2: Summary of the performance of individual models
These 6 results based on incomplete feature sets are now presented, discussed
and compared to the overall model.
The first partial model, model 2, is the one using BoW and context alone. As a
quick reminder, context is the prevalence of sarcasm in the specific subreddit. The final
accuracy of this model on the test set was 65.7% and the F1 score was 0.639. From the
confusion matrix we can see that there are fewer false positives, so our model is better at
recognizing non-sarcastic cases than sarcastic ones, but the difference is not large. This
is in line with what we expected based on the main model: worse results overall and the
same weaknesses and strengths.
                               Actual value
                          Sarcasm    Not Sarcasm
Predicted   Sarcasm        46188        22381
value       Not Sarcasm    29606        53465

Table 5.3: Confusion matrix of the model using as features: BoW and context
The second partial model, model 3, uses the PoS-based features alongside the BoW
and context. This set of additional features proved to be the most valuable and brought
the most explanatory power. The overall accuracy was 67.5% with an F1 score of
0.664. The confusion matrix is again very similar to the ones we have seen before. The
model is slightly more specific than sensitive, which can also be seen in the accuracy
being higher than the F1 score. This is also true for all previous cases.
                               Actual value
                          Sarcasm    Not Sarcasm
Predicted   Sarcasm        48797        22289
value       Not Sarcasm    26997        53557

Table 5.4: Confusion matrix of the model using as features: BoW, context and PoS-based features
The next feature set, model number 4, is the similarity-based feature set, which
achieved an accuracy of 66.6% with an F1 score of 0.658. This seems to be the third
most beneficial feature set after the PoS-based and lexical-based features. Considering
the high dimensionality of the underlying vectors, it would be very interesting to see how
other approaches to utilizing the word vectors perform.
                               Actual value
                          Sarcasm    Not Sarcasm
Predicted   Sarcasm        48758        23650
value       Not Sarcasm    27036        52196

Table 5.5: Confusion matrix of the model using as features: BoW, context and similarity-based features
The next feature set, model number 5, is the user similarity-based feature set. It barely
brings any additional explanatory power, with accuracy at 65.9% and F1 score at 0.641.
The confusion matrix is very similar to the baseline model and only marginally better.
This is surprising since these features seemed very promising. It would be a good
idea to process these features in a different way, although that might be very
computationally intensive and was not possible in the case of this thesis.
                               Actual value
                          Sarcasm    Not Sarcasm
Predicted   Sarcasm        46258        22219
value       Not Sarcasm    29536        53627

Table 5.6: Confusion matrix of the model using as features: BoW, context and user similarity-based features
The next case, model number 6, uses the sentiment-based features as additional
explanatory variables, which leads to an accuracy of 66.05% and an F1 score of 0.644.
This is an improvement on the baseline model but worse than the PoS-based set of
features. The confusion matrix follows the same pattern as all the previous ones.
                               Actual value
                          Sarcasm    Not Sarcasm
Predicted   Sarcasm        46666        22349
value       Not Sarcasm    29128        53497

Table 5.7: Confusion matrix of the model using as features: BoW, context and sentiment-based features
The lexical-based feature set, model number 7, is studied next and its results are quite
promising. It performs better than the sentiment-based set of features but somewhat
worse than the PoS-based set. It obtained an accuracy of 67% and an F1 score
of 0.662. Once again the confusion matrix shows the model is better at discovering
non-sarcastic remarks than sarcastic ones.
                               Actual value
                          Sarcasm    Not Sarcasm
Predicted   Sarcasm        49129        23385
value       Not Sarcasm    26665        52461

Table 5.8: Confusion matrix of the model using as features: BoW, context and lexical-based features
5.2.1 Discussion
Overall the results are very interesting and bring many surprising findings. The
PoS-based feature set proved to be by far the most telling, which is a very interesting
finding that did not seem likely based on the existing literature. The approaches
utilizing PoS-based features in the existing literature were different and varied, but
these features were never the key factor, the reason for a paper’s publication. Perhaps
this is due to the nature of the data set, since most sarcasm detection studies
have used Twitter. Comments from Reddit are more ’English-like’, i.e. they more
closely resemble standard English and are less abbreviated, which might be a reason why
this feature set turned out to be more important than the literature suggests.
Another interesting observation is the overall strength of the standard BoW. This was
obviously expected, since it is an incredibly strong predictive feature used throughout the
NLP literature, but it is nonetheless worth mentioning. A comparison of the BoW formed
from unigrams, as was done in this thesis, with a BoW made from bigrams/trigrams
would also be intriguing; unfortunately it was not done here, since handling the larger
BoW (of bigrams/trigrams) was not feasible on the available machine. Another option
to improve the overall model would be a less restrictive dimension reduction (the BoW
was shrunk to 150 columns), which would preserve more variance and might tell us more.
This was not done since the used machine could not handle the larger final matrix.
The strength of the lexical-based features is also worth mentioning. This set of features,
a very basic and the oldest approach to detecting sarcasm, was the second most useful
in tandem with the standard BoW. It can be presumed that the way the comment is
written is the most important factor in the final sarcasm detection, since both
lexical-based and PoS-based features describe the structure of the comment.
Lexical-based features focus on the specific signs of sarcasm, e.g. ellipsis, while
PoS-based features focus on the overall structure, i.e. what types of words are used
and what patterns of word usage are more indicative of sarcasm or its lack. An example
is the usage of adjectives/adverbs: it is very difficult to express sarcasm without
using them.
The third most important set of features is the one based on word similarity. This is
more or less in line with expectations, since features obtained from the powerful
word2vec algorithm tend to be strong. Here they might have been overshadowed by the
other features due to hardware limitations. These vectors tend to be very high-
dimensional (the pre-trained model used here projects words into a 384-dimensional
vector space), which makes the computations much more demanding. Using these vectors
differently might lead to better results. One option that comes to mind is feeding the
whole vector, e.g. the comment vector, into the model and observing the result. That
would likely be very influential, but in the case of this thesis it was not feasible due to
computational complexity.
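The "whole comment vector" idea can be sketched in a few lines: average the word vectors of a comment and keep the full result as the feature vector instead of reducing it. The embeddings below are random toy vectors standing in for the pre-trained 384-dimensional model used in the thesis:

```python
import numpy as np

# Hypothetical toy embeddings; the thesis used a pre-trained model that
# maps words into a 384-dimensional space.
rng = np.random.default_rng(0)
words = ["oh", "great", "another", "monday"]
embeddings = {w: rng.standard_normal(384) for w in words}

def comment_vector(tokens, embeddings, dim=384):
    """Average the word vectors of in-vocabulary tokens. Feeding this
    full vector to the classifier is the alternative discussed above."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = comment_vector(["oh", "great", "unknown"], embeddings)
print(v.shape)  # (384,)
```

With thousands of comments this yields a dense matrix with 384 columns per comment, which is exactly the computational burden that forced the dimension reduction in this thesis.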
Very similar reasoning applies to the user similarity-based feature set. This set turned
out to be the least influential, which was very surprising. The limitation here might
have been the shrinkage of the user vectors to too few dimensions, which may have
caused a large loss of information. This could have been prevented by preserving more
of the information, e.g. by using the full dimension of the vector space, as suggested in
the previous paragraph. In this thesis that was not feasible, but it could easily be
remedied with access to a much more powerful machine.
The last feature set to discuss is the sentiment-based one, which turned out to be quite
a weak predictor. This was not expected, but once it is compared to the stronger feature
sets, an explanation offers itself. Both the lexical-based and PoS-based features describe
the structure of the comment and the type of expressions commonly associated with
sarcasm, e.g. ellipsis or heavy usage of adjectives and adverbs, while the similarity-based
measures describe the semantic incongruity within a comment or within its context.
Both can often be strong predictors of several types of sarcasm. The sentiment-based
feature set, in contrast, can more often be associated with simple disagreement than
with a sarcastic utterance, which might explain its lower overall predictive power. Polar
incongruity may simply relate to a less frequently occurring type of sarcasm, which
would explain the weaker performance very well.
There is room for improvement, and it has already been mentioned: the hardware
limitations blocked the full potential of several feature sets. Naturally, using wider data
could also benefit the overall results, since neural nets tend to perform well at extracting
signal from large data sets. That might be one reason for the strong performance of the
PoS-based and lexical-based feature sets, which were complete and not limited by the
lack of computational power. Repeating this analysis with a better setup and extending
it in this way is the most obvious direction for future research. There are other
interesting avenues as well, such as studying all combinations of the feature sets. That
is quite enticing, but the time it takes to train a neural net on a CPU with a large data
set is prohibitive. A last area of possible improvement connected to hardware limitations
is the tuning of the neural net itself. A grid search was used to find the best-performing
model, but this search is not exhaustive and a better configuration might exist. If a
GPU and enough time are available, then better search techniques can be used and
better results could be achieved. Again, this is conditional on having a GPU, since using
only a CPU (as in this thesis) is much slower and makes the optimization extremely
lengthy.
Another possible extension, unrelated to hardware limitations, would be applying neural
nets to the raw text. It is entirely possible that a deeper network could preprocess the
text by itself better than the current approaches can. This would be an extremely
interesting comparison and an exciting area for future research. This approach,
recurrent neural nets (Hochreiter & Schmidhuber, 1997), is not discussed or explained
here since it is quite a complicated topic.
The last point discussed is the generalization of the presented model to other data sets.
The gap between the performance on this data set and on others should not be large,
since the large volume of data should ensure quite robust results. The number of
observations is larger by a factor of 10 than in the largest comparable studies, and a
difference of this magnitude is indicative of more robust results. Combined with the
features being robust choices that are not sensitive to small deviations, the results
should be reliable.
Chapter 6
Conclusion
The findings and the way they were obtained are now summarized, alongside a short
discussion of them. First the thesis is summarized very briefly, second the goal of the
thesis and the results achieved are discussed, third some discussion of the limitations
and possible future work follows, and lastly a few concluding remarks about the thesis
and the topic itself are given.
In this thesis the aim was to build on the existing sarcasm detection literature by
summarizing it and synthesizing the preferred and most promising approaches to
extracting information from the text at hand. Several possible approaches were
identified in the literature and applied here. The goal was then two-fold: finding an
overall good model for this kind of task and deciding which feature sets are the most
important.
The goal set at the beginning of this thesis was to create a sarcasm classification model
trained on the novel data set from Khodak et al. (2017), compare the feature sets used,
and see which are the most and least important. This was achieved, and a lengthy
discussion of the detailed results can be found in the previous chapter. The results were
overall promising. The accuracy reached was 69.5%, slightly skewed towards specificity,
i.e. the model discovers non-sarcastic comments more successfully than sarcastic ones.
The results regarding the impact of the different feature sets were intriguing, though
somewhat distorted by the lack of computational power. We arrived at the conclusion
that both the PoS-based and lexical-based feature sets are relatively cheap to obtain (in
terms of computational power) and deliver substantial improvements, while the
similarity-based measures (based on the word2vec algorithm) are weaker, which might
be a result of the computational constraints and the inability to fully leverage the
high-dimensional vectors. Quite interestingly, the sentiment-based features simply
seemed to perform badly in comparison with all the other feature sets. This cannot be
attributed to shortcomings in feature engineering due to computational cost, since these
features were not transformed into lower dimensions. All of these feature sets were used
in tandem with a standard BoW formed from unigrams; comparing its performance
with a BoW built from bigrams or trigrams could be beneficial and lead to further
insights.
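The skew towards specificity mentioned above can be read directly off a confusion matrix. A small sketch with invented counts (chosen for illustration, not the thesis's actual confusion matrix):

```python
# Invented confusion-matrix counts; non-sarcastic is the negative class.
tn, fp, fn, tp = 400, 100, 205, 295

accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)   # recall on non-sarcastic comments
sensitivity = tp / (tp + fn)   # recall on sarcastic comments

print(round(accuracy, 3), round(specificity, 2), round(sensitivity, 2))
# prints: 0.695 0.8 0.59
```

Specificity well above sensitivity, as here, is exactly the pattern described: non-sarcastic comments are recognized more reliably than sarcastic ones.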
From the short description of the results above, some of the limitations and potential
improvements are quite obvious. The computational constraints blocked some possible
improvements of the model, mainly a different treatment of the similarity-based and
user similarity-based features due to their high dimensionality. The same holds for the
possible extensions of the BoW, one consisting of either bigrams or trigrams; access to a
machine capable of handling these different BoWs would be another possible avenue for
exploration. All of these limitations are hardware limitations, and the corresponding
future research recommendations essentially boil down to using a more powerful
machine. There are naturally other possible extensions, such as combining more feature
sets and looking for interactions among them. This might be coupled with applying
statistical tests to the predictions to see whether one model outperforms another; an
example of such a test is the Diebold-Mariano test (Diebold & Mariano, 1995). The
issue of the large data set would have to be addressed, since statistical tests on large
data sets tend to produce very low p-values, but statistically comparing prediction
accuracy would be an interesting research task. Lastly, applying recurrent neural nets
to raw text, as mentioned at the end of the previous chapter, could be quite exciting.
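A minimal sketch of the Diebold-Mariano statistic for two forecast-error series, with squared-error loss. This simplified version omits the autocovariance correction of Diebold & Mariano (1995) for multi-step forecasts; the error series are simulated:

```python
import numpy as np

def dm_statistic(err1, err2):
    """Simplified Diebold-Mariano statistic under squared-error loss:
    mean loss differential scaled by its estimated standard error.
    Negative values favour model 1 (lower loss)."""
    d = np.asarray(err1) ** 2 - np.asarray(err2) ** 2
    return d.mean() / np.sqrt(d.var(ddof=1) / len(d))

rng = np.random.default_rng(0)
e1 = rng.standard_normal(500)
e2 = rng.standard_normal(500) * 1.3   # model 2 deliberately noisier

print(dm_statistic(e1, e2) < 0)  # True: model 1 has lower expected loss
```

Under the null of equal predictive accuracy the statistic is asymptotically standard normal, which is where the large-sample low p-value issue mentioned above comes from.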
In this thesis the attempt was to apply, and possibly improve, many typical approaches
to sarcasm detection and combine them on a novel data set to assess the predictive
strength of the resulting model and of the different feature sets. This was achieved with
a varying degree of success due to the computational limitations. Some feature sets
managed to properly leverage the information within the data, e.g. the PoS-based and
lexical-based feature sets, while others were not as successful. These were mainly the
word2vec-based feature sets, which most likely underperformed due to the loss of too
much information in the transformations of the high-dimensional vectors. This can be
remedied by access to more computational power and by experimenting with other
forms of representing these vectors.
Appendix A
Truncated SVD
Truncated SVD is a special version of the standard singular value decomposition (SVD).
The standard SVD is explained and described first, and then its truncated version is
presented.
The main idea behind the SVD is the factorization of a matrix. It can be thought of as a
generalization of the eigendecomposition of a positive semi-definite matrix to an arbitrary
m× n matrix. The SVD can be written as

M = UΣV∗, (A.1)

where M is the m×n matrix to be factorized, U is an m×m matrix, Σ is a diagonal m×n
matrix with non-negative real numbers on the diagonal, and V* is an n× n matrix. The
main difference between the SVD and the eigendecomposition is that the SVD can be
applied to an arbitrary m× n matrix, whereas the eigendecomposition is limited to
certain square matrices.
The truncated SVD is then a simple modification of the SVD. Keeping only the k largest
singular values shrinks U to m× k, Σ to k × k and V* to k × n, creating a low-rank
approximation of the original matrix:

M ≈Mk = UkΣkV∗k. (A.2)
The reason for this approximation is purely economical, in terms of memory and
computation time. If a matrix is too large to handle properly, it can be shrunk by the
truncated SVD while preserving most of the information found in the data. The value
of k is chosen either based on how much variance the approximation explains or based
on the savings in computation time and the overall feasibility of the calculations.
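The truncation in (A.2) can be reproduced directly with NumPy; a short sketch (the matrix dimensions and k are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((200, 1000))   # e.g. a wide BoW-like matrix

# Full SVD: M = U @ diag(s) @ Vt, with s the singular values in
# descending order (equation A.1).
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Truncated SVD: keep only the k largest singular values (equation A.2).
k = 50
Mk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The spectral-norm error of the best rank-k approximation equals the
# (k+1)-th singular value (Eckart-Young theorem).
err = np.linalg.norm(M - Mk, 2)
print(np.isclose(err, s[k]))  # True
```

In practice the thesis pipeline would use scikit-learn's `TruncatedSVD`, which computes only the leading k components and so avoids the full factorization.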
Appendix B
Neural net optimization
The search for the optimal architecture of the neural net and the setting of all the
hyperparameters is briefly discussed in the main body of the thesis; more details are
presented here. The search for the optimal architecture is reminiscent of robustness
checks and validation in econometrics. There are large differences, as ML and
econometrics are very different in spirit, but the analogy can serve as a good intuitive
explanation.
The tested hyperparameters can be divided into two groups. The first group is the
actual setup of the neural net, which includes, among other things, the number of
hidden layers, the number of neurons in each layer and the weight initialization. The
second group consists of the hyperparameters of the learning algorithm, such as the
learning rate. All nets were trained for 80 epochs with a multi-step scheduler kicking in
at epoch 60.
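The training schedule above maps directly onto PyTorch's optimizer and scheduler objects. A minimal sketch, assuming a placeholder model and a decay factor of 0.1 at the milestone (the decay factor is an assumption, not stated in the thesis):

```python
import torch

# Placeholder model; the actual architectures are listed in Section B.1.
model = torch.nn.Sequential(torch.nn.Linear(10, 2))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Multi-step scheduler "kicking in at epoch 60": the learning rate is
# multiplied by gamma once the epoch counter reaches the milestone.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60], gamma=0.1)

for epoch in range(80):
    # ... one training pass over the data would go here ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # reduced to 1e-4 after epoch 60
```

With 80 epochs and a single milestone at 60, the net trains the last 20 epochs at the reduced rate, which is the usual way to let the weights settle near a minimum.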
The approach was divided into two stages: first the best architecture with default
learning parameters was found, and then the best learning parameters were identified.
This choice was based on the idea that we first identify the most promising architecture
and only then search for the ideal optimization parameters, which should give us the
best overall solution.
B.1 Neural net architecture
Overall, 14 different neural net architectures were tried; they are described below. The
best accuracy was 69.5%, as reported in the main body of the thesis. Several learning
parameters were tried with each net, but only the best result per net is reported.
The initial training run of all of these nets used the Adam optimizer with the default
learning rate of 1e−3 and no weight decay. The output layer of every net had 2 neurons
and used a sigmoid activation function. Different learning rates were tried with each
net as well (1e−32) but those results were significantly worse. Weight decay was tried
after the two initial runs on a few of the most promising nets.
All 14 nets are now described one by one.
1. One hidden layer: 256 neurons & PReLU as activation function & dropout of 0.2
– 66.7%
2. Two hidden layers: first with 256 neurons & PReLU as activation function &
dropout of 0.25, second with 64 neurons & PReLU as activation function & dropout
of 0.25 – 67.5%
3. Two hidden layers: first with 256 neurons & PReLU as activation function &
dropout of 0.25, second with 128 neurons & PReLU as activation function &
dropout of 0.2 – 67.6%
4. Two hidden layers: first with 256 neurons & PReLU as activation function &
dropout of 0.3, second with 128 neurons & PReLU as activation function & dropout
of 0.25 – 67.3%
5. Two hidden layers: first with 256 neurons & PReLU as activation function &
dropout of 0.25 followed by normalization of outputs, second with 128 neurons &
PReLU as activation function & dropout of 0.2 – 67%
6. Three hidden layers: first with 256 neurons & PReLU as activation function &
dropout of 0.25, second with 128 neurons & PReLU as activation function &
dropout of 0.2, third with 64 neurons & PReLU as activation function & dropout
of 0.2 – 69.5% (including weight decay)
7. Three hidden layers: first with 256 neurons & Leaky ReLU as activation function
& dropout of 0.25 followed by normalization of outputs, second with 128 neurons
& Leaky ReLU as activation function & dropout of 0.2, third with 64 neurons &
Leaky ReLU as activation function & dropout of 0.2 – 68.4% (with weight decay)
8. Three hidden layers: first with 256 neurons & PReLU as activation function &
dropout of 0.3, second with 64 neurons & PReLU as activation function & dropout
of 0.25, third with 16 neurons & PReLU as activation function & dropout of 0.2 –
68.7% (with weight decay)
9. Three hidden layers: first with 256 neurons & ReLU as activation function &
dropout of 0.3, second with 128 neurons & ReLU as activation function & dropout
of 0.25, third with 64 neurons & ReLU as activation function & dropout of 0.2 –
68.7% (with weight decay)
10. Four hidden layers: first with 256 neurons & PReLU as activation function &
dropout of 0.3, second with 128 neurons & PReLU as activation function & dropout
of 0.25, third with 64 neurons & PReLU as activation function & dropout of 0.2,
fourth with 16 neurons & PReLU as activation function & dropout of 0.2 – 67.8%
(with weight decay)
11. Four hidden layers: first with 256 neurons & ReLU as activation function &
dropout of 0.3, second with 128 neurons & ReLU as activation function & dropout
of 0.25, third with 64 neurons & ReLU as activation function & dropout of 0.2,
fourth with 16 neurons & ReLU as activation function & dropout of 0.2 – 67.9%
12. Five hidden layers: first with 256 neurons & PReLU as activation function &
dropout of 0.3, second with 128 neurons & PReLU as activation function & dropout
of 0.25, third with 64 neurons & PReLU as activation function & dropout of 0.2,
fourth with 32 neurons & PReLU as activation function & dropout of 0.2 – 67.1%
13. Five hidden layers: first with 256 neurons & Leaky ReLU as activation function &
dropout of 0.3 followed by normalization, second with 128 neurons & Leaky ReLU
as activation function & dropout of 0.25, third with 64 neurons & Leaky ReLU as
activation function & dropout of 0.2, fourth with 32 neurons & Leaky ReLU as
activation function & dropout of 0.2 – 67.2%
14. Five hidden layers: first with 256 neurons & ReLU as activation function & dropout
of 0.3, second with 128 neurons & ReLU as activation function & dropout of 0.25,
third with 64 neurons & ReLU as activation function & dropout of 0.2, fourth with
32 neurons & ReLU as activation function & dropout of 0.2, fifth with 16 neurons
& ReLU as activation function & dropout of 0.2 – 66.9%
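The best-performing configuration (architecture 6) can be sketched in PyTorch as follows. The layer widths, PReLU activations, dropout rates and sigmoid output follow the description above; the input width is an assumption, since it depends on the final feature matrix:

```python
import torch
import torch.nn as nn

def make_net(n_inputs):
    """Sketch of architecture 6: three hidden layers with PReLU and
    dropout, a 2-neuron sigmoid output layer. n_inputs (the width of
    the final feature matrix) is assumed here, not taken from the thesis."""
    return nn.Sequential(
        nn.Linear(n_inputs, 256), nn.PReLU(), nn.Dropout(0.25),
        nn.Linear(256, 128), nn.PReLU(), nn.Dropout(0.2),
        nn.Linear(128, 64), nn.PReLU(), nn.Dropout(0.2),
        nn.Linear(64, 2), nn.Sigmoid(),
    )

net = make_net(200)
x = torch.randn(4, 200)   # a dummy batch of 4 feature vectors
y = net(x)
print(y.shape)            # torch.Size([4, 2])
```

Training this net with the Adam optimizer, the multi-step schedule described above and weight decay is what produced the reported 69.5% accuracy.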
Appendix C
Used libraries
The language used was Python 3.6; all the libraries used are listed below. Basic
components of the Python language are not included.
Library        Purpose
pandas         data manipulation
numpy          data manipulation
NLTK           NLP
spaCy          NLP
scikit-learn   feature extraction
PRAW           comment download from Reddit
PyTorch        neural net
Table C.1: Libraries used in this thesis
Bibliography
Amir, Silvio, Wallace, Byron C, Lyu, Hao, Carvalho, Paula, & Silva, Mario J. 2016.
Modelling context with user embeddings for sarcasm detection in social media. arXiv
preprint arXiv:1607.00976.
Bamman, David, & Smith, Noah A. 2015. Contextualized Sarcasm Detection on Twitter.
Pages 574–577 of: ICWSM.
Bird, Steven, Klein, Ewan, & Loper, Edward. 2009. Natural Language Processing with
Python. 1st edn. O’Reilly Media, Inc.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning (Information
Science and Statistics). Berlin, Heidelberg: Springer-Verlag.
Carvalho, Paula, Sarmento, Luís, Silva, Mário J, & De Oliveira, Eugénio. 2009. Clues
for detecting irony in user-generated contents: oh...!! it's so easy ;-). Pages 53–56 of:
Proceedings of the 1st international CIKM workshop on Topic-sentiment analysis for
mass opinion. ACM.
Chakraborty, Goutam, & Pagolu, Murali Krishna. Analysis of Unstructured Data: Ap-
plications of Text Analytics and Sentiment Mining.
Davidov, Dmitry, Tsur, Oren, & Rappoport, Ari. 2010. Semi-supervised recognition
of sarcastic sentences in twitter and amazon. Pages 107–116 of: Proceedings of the
fourteenth conference on computational natural language learning. Association for
Computational Linguistics.
De Bievre, Paul. 2012. The 2012 International Vocabulary of Metrology: “VIM”. Ac-
creditation and Quality Assurance, 17(2), 231–232.
Diebold, Francis X., & Mariano, Roberto S. 1995. Comparing Predictive Accuracy.
Journal of Business & Economic Statistics, 13(3), 253–263.
Ghosh, Debanjan, Guo, Weiwei, & Muresan, Smaranda. 2015. Sarcastic or not: Word
embeddings to predict the literal or sarcastic meaning of words. Pages 1003–1012
of: Proceedings of the 2015 Conference on Empirical Methods in Natural Language
Processing.
Grimes, Seth. A Brief History of Text Analytics. http://www.b-eye-network.com/view/6311.
Accessed: 2018-07-03.
Gurini, Davide Feltoni, Gasparetti, Fabio, Micarelli, Alessandro, & Sansonetti,
Giuseppe. 2013. A Sentiment-Based Approach to Twitter User Recommendation.
RSWeb@ RecSys, 1066.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, & Sun, Jian. 2015. Delving deep into rec-
tifiers: Surpassing human-level performance on imagenet classification. Pages 1026–
1034 of: Proceedings of the IEEE international conference on computer vision.
Hochreiter, Sepp, & Schmidhuber, Jürgen. 1997. Long short-term memory. Neural
computation, 9(8), 1735–1780.
Honnibal, Matthew, & Montani, Ines. 2017. spaCy 2: Natural language understanding
with Bloom embeddings, convolutional neural networks and incremental parsing. To
appear.
Joshi, Aditya, Tripathi, Vaibhav, Patel, Kevin, Bhattacharyya, Pushpak, & Carman,
Mark. 2016. Are Word Embedding-based Features Useful for Sarcasm Detection?
arXiv preprint arXiv:1610.00883.
Joshi, Aditya, Bhattacharyya, Pushpak, & Carman, Mark J. 2017. Automatic sarcasm
detection: A survey. ACM Computing Surveys (CSUR), 50(5), 73.
Khattri, Anupam, Joshi, Aditya, Bhattacharyya, Pushpak, & Carman, Mark. 2015.
Your sentiment precedes you: Using an author’s historical tweets to predict sarcasm.
Pages 25–30 of: Proceedings of the 6th Workshop on Computational Approaches to
Subjectivity, Sentiment and Social Media Analysis.
Khodak, Mikhail, Saunshi, Nikunj, & Vodrahalli, Kiran. 2017. A large self-annotated
corpus for sarcasm. arXiv preprint arXiv:1704.05579.
Kingma, Diederik P, & Ba, Jimmy. 2014. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980.
Kriesel, David. 2007. A Brief Introduction to Neural Networks.
Liu, Bing. 2012. Sentiment analysis and opinion mining. Synthesis lectures on human
language technologies, 5(1), 1–167.
McCulloch, Warren S., & Pitts, Walter. 1943. A logical calculus of the ideas immanent
in nervous activity. The bulletin of mathematical biophysics, 5(4), 115–133.
Mikolov, Tomas, Yih, Wen-tau, & Zweig, Geoffrey. 2013. Linguistic regularities in contin-
uous space word representations. Pages 746–751 of: Proceedings of the 2013 Confer-
ence of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies.
Mohammad, Saif M. 2016. Sentiment analysis: Detecting valence, emotions, and other
affectual states from text. Pages 201–237 of: Emotion measurement. Elsevier.
Mullainathan, Sendhil, & Spiess, Jann. 2017. Machine learning: an applied econometric
approach. Journal of Economic Perspectives, 31(2), 87–106.
Ortigosa, Alvaro, Martín, José M, & Carro, Rosa M. 2014. Sentiment analysis in Face-
book and its application to e-learning. Computers in Human Behavior, 31, 527–541.
Paszke, Adam, Gross, Sam, Chintala, Soumith, Chanan, Gregory, Yang, Edward, De-
Vito, Zachary, Lin, Zeming, Desmaison, Alban, Antiga, Luca, & Lerer, Adam. 2017.
Automatic differentiation in PyTorch. In: NIPS-W.
Peng, Qingxi, & Zhong, Ming. 2014. Detecting Spam Review through Sentiment Anal-
ysis. JSW, 9(8), 2065–2072.
Poecze, Flora, Ebster, Claus, & Strauss, Christine. 2018. Social media metrics and sen-
timent analysis to evaluate the effectiveness of social media posts. Procedia Computer
Science, 130, 660–666.
Riloff, Ellen, Qadir, Ashequl, Surve, Prafulla, De Silva, Lalindra, Gilbert, Nathan, &
Huang, Ruihong. 2013. Sarcasm as contrast between a positive sentiment and nega-
tive situation. Pages 704–714 of: Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing.
Staiano, Jacopo, & Guerini, Marco. 2014. Depechemood: a lexicon for emotion analysis
from crowd-annotated news. arXiv preprint arXiv:1405.1605.
Stevenson, Angus. Oxford Dictionary of English.
Wallace, Byron C, Kertz, Laura, Charniak, Eugene, et al. 2014. Humans require context
to infer ironic intent (so computers probably do, too). Pages 512–516 of: Proceedings
of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume
2: Short Papers), vol. 2.
Wallace, Byron C, Charniak, Eugene, et al. 2015. Sparse, contextually informed models
for irony detection: Exploiting user communities, entities and sentiment. Pages
1035–1044 of: Proceedings of the 53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), vol. 1.
Xing, Frank Z, Cambria, Erik, & Welsch, Roy E. 2018. Natural language based financial
forecasting: a survey. Artificial Intelligence Review, 1–25.
Yang, Dingqi, Zhang, Daqing, Yu, Zhiyong, & Wang, Zhu. 2013. A sentiment-enhanced
personalized location recommendation system. Pages 119–128 of: Proceedings of the
24th ACM Conference on Hypertext and Social Media. ACM.
Zimbra, David, Ghiassi, Manoochehr, & Lee, Sean. 2016. Brand-related twitter sen-
timent analysis using feature engineering and the dynamic architecture for artificial
neural networks. Pages 1930–1938 of: System Sciences (HICSS), 2016 49th Hawaii
International Conference on. IEEE.