An Approach to Extract Sentimental Messages from Twitter and its Application to Categorization of...

7
An Approach to Extract Sentimental Messages from Twitter and its Application to Categorization of Human Group - Trial Visualization of Relation between Sentimental Tweeting and Behaviour of School Days - Masakazu Kyokane * Yoshiro Imai ** Graduate School of Engineering, Kagawa University 2217-20 Hayashi-cho, Takamatsu, 761-0396, Japan * [email protected] , ** [email protected] ABSTRACT Recent years have found that social media analysis can bring several kinds of effective results to us in the domains of industries, education, social science, economics, and so on. Such approaches, sometimes called sentiment analysis and/or opinion mining, are potentially applicable from trend analysis of in- dustrial products or many services to visualization of human thinking, behaviour and/or emotion. This study focuses on Twitter used by students of uni- versities, analyzes tweeting data(messages) by stu- dents from Twitter, and calculates sentimental val- ues from the according data. It also investigates ex- istence of some relations between calculated senti- mental values and practical thought of students. At the same time, it tries to discuss whether the above procedure and analysis can visualize convention- ally hidden relationship between contents of tweet- ing messages and characteristic behaviour of stu- dents in categorized universities. One of the aims of this research is to help young persons to choose their suitable universities and to provide some use- ful example for extraction of sentimental values from categorized groups. KEYWORDS Sentiment Analysis, Data Mining, Twitter, Visual- ization of Hidden Relation. 1 INTRODUCTION Data mining is one of the useful information processing techniques to extract characteristics from a huge mount of data from several phe- nomena. Some special-purpose filtering can extract useful pattern and information from the given data in a relatively short period. With suitable tuning of the filtering parameters, we will be able to obtain necessary characteristics through data mining processes. As you know, nowadays, we can watch very much various kinds of messages from Social Media, such as Facebook, Twitter, Blogs and so on. We can investigate several meanings embedded in the messages from Social Me- dia through data mining approach. Informa- tion processing technique can bring us inter- esting meanings of messages in Social Media by means of a suitable translation scheme from words in the above messages into occurrence probability/distributions This study has been challenging to realize some kind of data mining as an example of visualization of sentimental words extracted from messages of Twitter and relation be- tween sentimental value and the relevant hu- man group. We have defined translation scheme from words in message into sentimen- tal values. And we apply this scheme into data mining, calculation of sentimental values for messages extracted from Twitter, and demon- stration of relation between sentimental values and behaviour of the categorized human group. The paper will describe an approach to extract sentimental messages from Twitter and its ap- plication to categorization of Human Group. And it will mention trial visualization of rela- tion between sentimental tweeting and the rel- evant behaviour of their school days. It intro- duces related works about analysis of message from Social media, calculation of sentimen- tal values and visualization of relation between message and behaviour in the next section. It illustrates our system configuration and struc- ture of information processing for data min- ing in the third section. It explains our results of analytical data with our defined translation scheme from message to sentimental values in the fourth section. It reports how to visualize ISBN: 978-1-941968-16-1 ©2015 SDIWC 68 Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015

Transcript of An Approach to Extract Sentimental Messages from Twitter and its Application to Categorization of...

An Approach to Extract Sentimental Messages from Twitter and itsApplication to Categorization of Human Group

- Trial Visualization of Relation between Sentimental Tweeting and Behaviour of School Days -

Masakazu Kyokane∗ Yoshiro Imai∗∗

Graduate School of Engineering, Kagawa University2217-20 Hayashi-cho, Takamatsu, 761-0396, Japan

[email protected] , ∗∗ [email protected]

ABSTRACT

Recent years have found that social media analysiscan bring several kinds of effective results to us inthe domains of industries, education, social science,economics, and so on. Such approaches, sometimescalled sentiment analysis and/or opinion mining,are potentially applicable from trend analysis of in-dustrial products or many services to visualizationof human thinking, behaviour and/or emotion. Thisstudy focuses on Twitter used by students of uni-versities, analyzes tweeting data(messages) by stu-dents from Twitter, and calculates sentimental val-ues from the according data. It also investigates ex-istence of some relations between calculated senti-mental values and practical thought of students. Atthe same time, it tries to discuss whether the aboveprocedure and analysis can visualize convention-ally hidden relationship between contents of tweet-ing messages and characteristic behaviour of stu-dents in categorized universities. One of the aimsof this research is to help young persons to choosetheir suitable universities and to provide some use-ful example for extraction of sentimental valuesfrom categorized groups.

KEYWORDS

Sentiment Analysis, Data Mining, Twitter, Visual-ization of Hidden Relation.

1 INTRODUCTION

Data mining is one of the useful informationprocessing techniques to extract characteristicsfrom a huge mount of data from several phe-nomena. Some special-purpose filtering canextract useful pattern and information from thegiven data in a relatively short period. Withsuitable tuning of the filtering parameters, wewill be able to obtain necessary characteristicsthrough data mining processes.

As you know, nowadays, we can watch verymuch various kinds of messages from SocialMedia, such as Facebook, Twitter, Blogs andso on. We can investigate several meaningsembedded in the messages from Social Me-dia through data mining approach. Informa-tion processing technique can bring us inter-esting meanings of messages in Social Mediaby means of a suitable translation scheme fromwords in the above messages into occurrenceprobability/distributionsThis study has been challenging to realizesome kind of data mining as an example ofvisualization of sentimental words extractedfrom messages of Twitter and relation be-tween sentimental value and the relevant hu-man group. We have defined translationscheme from words in message into sentimen-tal values. And we apply this scheme into datamining, calculation of sentimental values formessages extracted from Twitter, and demon-stration of relation between sentimental valuesand behaviour of the categorized human group.The paper will describe an approach to extractsentimental messages from Twitter and its ap-plication to categorization of Human Group.And it will mention trial visualization of rela-tion between sentimental tweeting and the rel-evant behaviour of their school days. It intro-duces related works about analysis of messagefrom Social media, calculation of sentimen-tal values and visualization of relation betweenmessage and behaviour in the next section. Itillustrates our system configuration and struc-ture of information processing for data min-ing in the third section. It explains our resultsof analytical data with our defined translationscheme from message to sentimental values inthe fourth section. It reports how to visualize

ISBN: 978-1-941968-16-1 ©2015 SDIWC 68

Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015

relation between sentimental Tweeting and be-haviour of school days, and discusses the ex-tracted relation and its visualized “fact” of theirbehaviour in the fifth section. And finally itsummarizes our conclusion in the last section.

2 RELATED WORK

This section introduces some suitable relatedworks to design and implement our data min-ing approach for message from Social Media.Yang Yu and Xiao Wang [1] of RochesterInstitute of Technology USA, collected real-time tweets from U.S. soccer fans during five2014 FIFA World Cup games using Twittersearch API and used sentiment analysis to ex-amine U.S. soccer fans’ emotional responsesin their tweets, particularly, the emotionalchanges after goals. They found that duringthe matches that the U.S. team played, fear andanger were the most common negative emo-tions and in general, increased when the op-ponent team scored and decreased when theU.S. team scored. Anticipation and joy werealso generally consistent with the goal resultsand the associated circumstances during thegames. Their project revealed that sports fansuse Twitter for emotional purposes and that thebig data approach to analyze sports fans’ senti-ment showed results generally consistent withthe predictions of the disposition theory whenthe fanship was clear and showed good predic-tive validity.Apoorv Agarwal and his team [2] of ColumbiaUniversity USA, presented results for senti-ment analysis on Twitter. They used previ-ously proposed state-of-the-art unigram modelas their baseline and reported an overall gain ofover 4% for two classification tasks: a binary,positive versus negative and a 3-way positiveversus negative versus neutral. They presenteda comprehensive set of experiments for boththese tasks on manually annotated data that is arandom sample of stream of tweets. They ten-tatively concluded that sentiment analysis forTwitter data was not that different from senti-ment analysis for other genres.Mike Thelwall et. al [3] of University ofWolverhampton UK, described “An analysis of

Twitter may give insights into why particularevents resonate with the population.” Their ar-ticle reported a study of a month of EnglishTwitter posts, assessing whether popular eventsare typically associated with increases in sen-timent strength. Using the top 30 events, de-termined by a measure of relative increase in(general) term usage, the results gave strongevidence that popular events were normally as-sociated with increases in negative sentimentstrength and some evidence that peaks of in-terest in events had stronger positive sentimentthan the time before the peak. It seemed thatmany positive events were capable of generat-ing increased negative sentiment in reaction tothem.Kumamoto and Tanaka [4][5] in Japan, fo-cused on the impressions people got fromnews articles, and propose a method for de-termining impressions of these news articles.Their proposed method consisted of two mainparts; One part involved building an ‘impres-sion dictionary’ that described the relation-ships among words and impressions. Anotherof the method involved determining impres-sions of input news articles using such an im-pression dictionary. The impressions of a newsarticle were represented as scale values in user-specified impression scales, like ‘sad - glad’and ‘angry - pleased’. Each scale value wasa real number between 0 and 1, and was calcu-lated from the words (common nouns, actionnouns, verbs, adjectives, and katakana charac-ters) extracted from an input news article usingthe above impression dictionary.

3 SYSTEM CONFIGURATION

This session illustrates a scheme of data pro-cessing flow of our system and important ideahow to select sample accounts as well as ac-quire tweeting messages from Twitter.

3.1 Data Processing Flow of System

Our system is configured with three ma-jor parts, namely acquisition of data fromWeb (Social Media), information processingscheme (data mining for acquired message

ISBN: 978-1-941968-16-1 ©2015 SDIWC 69

Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015

from Web), and generation of documents byour system. Figure 1 shows schematic blockdiagram for our system.

Figure 1. Schematic Diagram for System Configuration

The leftmost part is to connect Web-based So-cial Media(Twitter as our case) and to acquirea mount of data to be analyzed. “twpro API”symbolizes selection facility to decide specialgroup among the whole world. “Twitter API”plays a role of periodic data acquisition forthe selected group. In our case, target data istweeting message from Twitter.

The middle part is to present a series of manip-ulations for information processing from datainto generated results which include some tem-porary files. From Top to bottom, it will beperformed sequentially sometimes by automat-ically and otherwise by manually. This part isthe main body of our system and can accom-plish some kind of sentiment analysis and gen-erate our interesting results.

The rightmost part is to explain temporarilyand/or permanently generated document filesfrom the middle part. Some of them are out-put destination for one manipulation as well asinput source for another one. All of them areto be analyzed in our study and possibly createuseful evidence for trial visualization of certainrelationship.

3.2 Sample Data Decision

At first, we have decided the necessary ac-counts obtained from Twitter, which is the tar-get of Social media for our sentiment analysis.We utilized an API in order to obtain account-ing information and selected target accountsthrough which we obtain Tweeting messagefrom Twitter. Figure 2 shows an example of re-trieval results by using “twpro API” in Figure1 (very sorry because of Japanese expression).

Figure 2. Retrieval Results by means of “twpro API”(in Japanese)

Our study is partly to focus on Natural Lan-guage Processing, so we cannot avoid Japaneseexpression sometimes.Social media, especially Twitter, will be play-ing a typical role in Young generation cultures.We have selected the students of Japanesenational universities as our target to acquiretweeting messages from Twitter. In such acase, we can consider that users of the relevantaccounts must be approximately limited from18 years old to 22 years old. The merit of ourlimitation is to assume that the according usersused to tweet almost similar meaning for thesame word and/or short sentence.We can efficiently define a common dictionaryto translate from word into sentimental valuesfor target message acquired from Twitter basedon the previously selected account users. Thisis why we have introduced selection of accountfor users of Twitter by “twpro API”. We can,therefore, accomplish effective sentiment anal-ysis for tweeting messages of selected accountof Twitter.

ISBN: 978-1-941968-16-1 ©2015 SDIWC 70

Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015

4 SENTIMENT ANALYSIS

This section demonstrates that our system canperform some kind of sentiment analysis.Thefirst half of the section describes designingdedicated translation mechanism for our ap-proach from extracted message to suitablewords for sentiment analysis. The second oneprovides calculating sentimental values for se-lected tweeting message by means of utiliza-tion of translation mechanism.

4.1 Translation Table

After acquisition of the target data to be an-alyzed from Twitter, suitable manipulation isnecessary to do filtering words from sentencesin acquired message in order to perform thefirst step of sentiment analysis. Such filteringmanipulation can be carried out in the follow-ing ways;

1. getting one line(= taking a sentence froma message)

2. retrieving predefined character-string(=search parameter) in that line with patternmatching mechanism

3. accumulating times of detection forsearched patterns from line to line

4. making a correlation between the total re-sults during accumulation and the targetmessage(= one tweeting)

5. distinguishing each message into catego-rized group according to amount of totalresults

It is very important to predefine search patternsbased on their sentimental meaning and to cat-egorize tweeting messages according to occur-rences of search pattens and their sentimentalmeanings. Table 1 shows category of senti-mental meanings and their examples of searchpatterns. WN specifies set of words whichdo clearly Negative Sentiment, conventionally,and WP also specifies set of words which ex-press clearly Positive Sentiment, conversely.(Sorry, Table 1 includes Japanese expressions.

Table 1. Category and Search Patterns

category sentiment search patternsanger 殺意,叱る,怒り,かっかsadnes 寂しい,愁い,断腸の思い

WN shame 恥ずかしい,面目ないdisgust 塞ぐ,失望,重苦しい,消沈

fear 心細い,不安,不気味joy にこにこ,快感,喜び,輝く

WP good 敬愛,見とれる,好感,惚れるcomfort 晴れ晴れ,さっぱり,のんびり

excitement 自棄,焦れったい,絶句,蒼白surprise 息をのむ,茫然,うっとり

But it is necessary for Natural Language Pro-cessing. )We must define SN and SP according to fre-quency of appearance in a tweet message aboutWN and WP . We assume that wN ∈ WN andwP ∈ WP . SN is a set of tweeting messageswhich include wN and/or wP if and only if thenumber of wN is greater than the number ofwP , and SP is another set of tweeting messageswhich include wN and/or wP if and only if thenumber of wN is less than the number of wP .n(SP ) is number of SP , while n(SN) is num-ber of SN . Figure 3 shows Venn Diagram forRelation between SN and SP .

Figure 3. Venn Diagram for Relation between SN andSP

Now, we can calculate conditional probabili-ties: PrN(w) and PrP (w) for a give word, re-spectively, as follows;

ISBN: 978-1-941968-16-1 ©2015 SDIWC 71

Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015

PrN(w) =n(SN , w)

n(SN)(1)

PrP (w) =n(SP , w)

n(SP )(2)

n(SN , w) is the number for “frequency of ap-pearance” about the word: w in the above set:SN , and n(SP , w) is also the number for “fre-quency of appearance” about the word: w inthe above set: SP . With conditional prob-abilties (1)(2), we specify the relevant senti-mental value: sv(w) as the expression (3) us-ing special additional weights: weightN andweightP , where weightN = log10 n(SN) andweightP = log10 n(SP ). These weights aresimply defined to grow as n(SN) and n(SP )increase.

sv(w) =

2∗PrP (w)∗weightPPrN(w)∗weightN+PrP (w)∗weightP

−1 (3)

So we can define the sentimental values forgiven words. In the next subsection, we willshow an example of sentimental value dictio-nary in Table 2.

4.2 Sentimental Value Calculation

Calculation of sentimental values for extractedtweeting message is one of the most importantmanipulations in our system to perform somekind of sentiment analysis approach. Based onarguments in the previous section, we have de-fined dictionary for sentimental values to makea correlation between selected words and theirsentimental values. Table 2 shows an exampleof sentimental value dictionaries. It presentsJapanese words at the leftmost, their sentimen-tal values at the middle, and their meanings atthe rightmost.With references of Table 2, calculation of sen-timental value for one sentence will be carriedout, for example, in the following ways(shownin Figure 4);

• At the lower case, each Japanese word(“Ryoushin”, “shinro”, “koto”, and

“kenka”) can be converted into senti-mental value (-0.008, -0.198, -0.022, and-0.15). And a total result of sentimentalvalues for the relevant tweeting messageis -0.428.

• At the upper case, each Japanese word(“Watashi”, “shinnyu”, “tenisu”, and“shita”) can be converted into sentimen-tal value (-0.07, 0.276, 0.352, and -0.05).And a total result of sentimental values forthe relevant tweeting message is 0.508.

Table 2. Sentimental Value Dictionary

word value meanryoushin -0.008 parentsshinro -0.198 coursekoto -0.022 about

kenka -0.15 quarrelshita -0.05 did

watashi -0.07 Ishinyu 0.276 best friendtenisu 0.352 tennis

Ryoushin to shinro no koto de kenka shita.

-0.008 -0.198 -0.022 -0.15 -0.05

-0.428

Watashi ha shinyu to tenisu wo shita.

-0.07 0.276 0.352 -0.05

0.508

Figure 4. Calculation Example of Sentimental Valuebased on Dictionary shown in Table 2

Just like the ways shown in Figure 4, our sys-tem can calculate sentimental values for eachtweeting message with Table 2. And more-over we can categorize users of accounts basedon the criterion into some groups. After that,we calculate the average sentimental values for

ISBN: 978-1-941968-16-1 ©2015 SDIWC 72

Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015

each group for the sake of visualization of therelation for sentimental values and character-istics of human behaviour in the same catego-rized group.

5 DISCUSSION

This section presents discussion about resultsfor sentiment analysis and trial visualizationof relation between sentimental value and be-haviour of categorized human group.

5.1 Analyzed Results

This subsection focuses on tweeting messagefrom young persons, who are the students ofJapanese National University in the approxi-mately west side of Japan.

Table 3. Categorized Sentimental Values based on Uni-versity

University Name Categorized Sentimental ValuesKyoto 0.109Saga 0.218

Hiroshima 0.229Fukui 0.268

Nagoya 0.284Toyama 0.306

Tokushima 0.311Mie 0.313

Kanazawa 0.321Kyushu 0.325

Gifu 0.328Shimane 0.338

Kobe 0.352Ryukyu 0.357

Okayama 0.362Kagawa 0.364Osaka 0.374Ehime 0.398

Kumamoto 0.418Kochi 0.428

Nagasaki 0.431Kagoshima 0.439Miyazaki 0.447

Yamaguchi 0.462Tottori 0.495Oita 0.672

Table 3 shows the results of Categorized Sen-timental Values based on University using nor-malized summation of sentimental values (de-scribed in (4) ), each of which are delivered

from expression (3) and calculated to averageof their values of the relevant students catego-rized by university. SV (u) is categorized sen-timental value, defined for each university: uand calculated by expression (4). S(u)i is aver-age of sentimental values per one tweet on oneday: i by the students of university: u. Inter-val of summation is specified from “SamplingStart Date: s” to “Sampling End Date: e”. Andn is total number of tweets per one day.

SV (u) =

∑ei=s S(u)i

n(4)

So we can provide an example of base for“Rank Correlation” of University by means ofsentimental values from Twitter. Table 3 alsoshows the above example using our expression(4) in the mode of ascending order.With such a order of university based oncategorized sentimental values, we canprobably some capability to visualize nor-mally unrecognized relationship betweentrends/phenomena/behaviour and the relevantuniversity through the intermediation ofsentimental values.

5.2 Trial Visualization of Relation

We can obtain the following data book ofJapanese Universities, named “Daigaku noJitsuryoku[6]”(in Japanese, meaning is capa-bility and status of university).

Figure 5. Daigaku no Jitsuryoku

So we have tried to challenge some perfor-mance to visualize normally unrecognizedrelationship about Japanese universities,

ISBN: 978-1-941968-16-1 ©2015 SDIWC 73

Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015

using this data book. And we are veryinteresting in one of special informationabout ranking of university, namely order fordropout ratio of students at each university.(http://kyoiku.yomiuri.co.jp/torikumi/jitsuryoku/yomitoku/contents/1-4.php)

Figure 6. Scatter Graph for Relation between Cate-gorized Sentimental Value based on University and itsDropout Ratio

Figure 6 shows a scatter graph for relation be-tween Categorized Sentimental Value based onUniversity and its Dropout Ratio, using ourabove approach. From Figure 6, we can prob-ably recognize some uncertain (we cannot de-scribe suitably) relation between dropout ratioand sentimental value. It seems that youngpeople, students of university, who presenttheir tweeting message with higher sentimentvalues, belong to university which has higherscores of dropout ratio. So this result can prob-ably demonstrate some relation between sen-timental value and behaviour of categorizedgroup of users.Of course, we have been investigating somefeasible reason for such a relation, but we can-not find globally useful and reasonable inter-pretation until now. So we donot definitelyreport that this is a meaningful example for“visualization of hidden relation” by means ofsentiment analysis using message from Twit-ter. It must be one of our future problems tofind the feasible reason for the above relation.

6 CONCLUSION

This paper described our approach to extractsentimental messages from Twitter, to calcu-

late sentimental values and to provide rank cor-relation based on sentimental values. And itshows its application to trial visualization of re-lation between sentimental value and very spe-cial behaviour of categorized human group.Our aim of this research is to visualize somenormally unrecognized relation, namely hid-den relation in other words, using approachby big data analysis. This study can bring usamount of information, useful experience andsuitable techniques for sentiment analysis forvisualization. Simultaneously, it has provideda very important problems for us to resolve inthe future.

REFERENCES

[1] Yang Yu, Xiao Wang,“World Cup 2014 in the Twit-ter World: A big data analysis of sentiments in U.S.sports fans’ tweets,” Computers in Human Behav-ior, Vol.48, pp. 392 - 400, 2015.

[2] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, OwenRambow, Rebecca Passonneau, “Sentiment Analy-sis of Twitter Data,” Proceedings of the Workshopon Language in Social Media (LSM 2011), pp. 30 -38, Portland, Oregon, 23 June 2011.

[3] Mike Thelwall, Kevan Buckley, and Georgios Pal-toglou, “Sentiment in Twitter events,” Journal ofthe American Society for Information Science andTechnology Volume 62, Issue 2, pp. 406 - 418,February 2011.

[4] Tadahiko Kumamoto, Yukiko Kawai, KatsumiTanaka, “Design and Evaluation of a Method forMining Impressions from Text(in Japanese),” TheIEICE Transaction D on Information and Sys-tems(Institute of Electronics, Information and Com-munication Engineers), Vol.J94-D(3), pp. 540-548,March 2011.

[5] Tadahiko Kumamoto, Katsumi Tanaka, “Pro-posal of Impression Mining from News Articles”,Knowledge-Based Intelligent Information and En-gineering Systems Lecture Notes in Computer Sci-ence Volume 3681, pp 901-910, 2005

[6] Yomiuri-Shinbun-sya eds. (published byChuokouron-sya). “Daigaku no Jitsuryoku”(136pages 2015 ISBN978-4-12-004656-8)http://www.chuko.co.jp/tanko/2014/09/004656.htmlhttp://kyoiku.yomiuri.co.jp/torikumi/jitsuryoku/yomitoku/contents/1-4.php

ISBN: 978-1-941968-16-1 ©2015 SDIWC 74

Proceedings of The Fourth International Conference on Informatics & Applications, Takamatsu, Japan, 2015