One Day in Twitter: Topic Detection Via Joint Complexity


Gerard Burnside Dimitris Milioris Philippe Jacquet

Bell Labs, Alcatel-Lucent, Centre de Villarceaux, 91620 Nozay, France
{gerard.burnside, dimitrios.milioris, philippe.jacquet}@alcatel-lucent.com

Abstract

In this paper we introduce a novel method to perform topic detection in Twitter based on the recent technique of Joint Complexity. Instead of relying on words, as most existing methods based on bag-of-words or n-gram techniques do, Joint Complexity relies on String Complexity, which is defined as the cardinality of the set of all distinct factors (substrings) of a given string. Each short sequence of text is decomposed in linear time into a memory-efficient structure called a Suffix Tree, and by overlapping two trees, in linear or sublinear average time, we obtain the Joint Complexity, defined as the cardinality of the factors that are common to both trees. The method has been extensively tested for Markov sources of any order over a finite alphabet and gave good approximations for text generation and language discrimination. One key take-away of this approach is that it is language-agnostic, since we can detect similarities between two texts in any loosely character-based language. Therefore, there is no need to build any specific dictionary or stemming method. The proposed method can also be used to capture a change of topic within a conversation, as well as the style of a specific writer in a text. In this paper we exploit a dataset collected with the Twitter streaming API over one full day, and we extract a significant number of topics for every timeslot.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.

In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

1 Introduction

In recent years, online social media services have seen a huge expansion and, as a result, the value of the information within these networks has increased dramatically. Interactions and communication between users help predict the evolution of information in online social networks, and the ability to study these networks can provide relevant information in real time.

The study of social networks raises several research challenges:

• searching in blogs, tweets and other social media is still an open problem, since posts are individually small in size but there is a tremendous quantity of them, delivered in real time, with little and sometimes transient contextual information; real-time search has to balance between quality, authority, relevance and timeliness of the content,

• information about the correlation between groups of users can help predict media consumption, network resources and traffic, which can be used to improve quality of service and experience,

• analyzing the relationships between members of a group or community within a social network can reveal the influential teams that companies may target in their marketing plans,

• spam and advertisement detection is another important challenge, since with the growth of social networks we also have a continuously growing amount of irrelevant information over the network.

In order to address these challenges, we need to extract the relevant information from online social media in real time.

In this paper we use the theory of Joint Complexity to detect bursty trends among short pieces of information. Our method is simple, context-free, with no grammar and no language assumptions, and it does not use semantics.

The Suffix Tree contains the algorithmic signature of the text, and we use it to perform fast and massive trend sensing in social communities, without human intervention.

In a prior work, we proposed a method based on Joint Complexity along with pattern matching on URLs extracted from tweets. That method was compared with a version of the Document-Pivot method described in [APM+13], modified to perform classification of tweets, i.e. assigning tweets to categories such as Sports, Politics, Economics, etc. The results showed good performance compared to Document-Pivot and motivated us to use a simplified version of it, based on Joint Complexity only, for the SNOW Data Challenge 2014.

The paper is organized as follows: Section 2 introduces the Joint Complexity method, while Section 3 briefly discusses the SNOW Data Challenge dataset. Section 4 describes the implementation of the proposed method for topic detection in Twitter, while Section 5 evaluates its performance on real data from that dataset. Finally, Section 6 summarizes our main results and provides directions for future work.

2 Joint Complexity

In the last decades, several attempts have been made to mathematically capture the concept of complexity of a sequence, i.e., the number of distinct factors contained in a sequence. In other words, if X is a sequence and I(X) its set of factors, then |I(X)| is the complexity of the sequence. For example, if X = "apple" then I(X) = {a, p, l, e, ap, pp, pl, le, app, ppl, ple, appl, pple, apple, v} and |I(X)| = 15 (v denotes the empty string). The complexity of a string is also called the I-complexity [BH11]. The notion is connected with deep mathematical properties, including the rather elusive concept of randomness in a string [IYZ02, LV93, Nie99].

In general, the information contained in a string may be revealed by comparing it with a reference string, since its entropy or I-complexity alone does not give much insight. In a previous work [Jac07] we introduced the concept of Joint Complexity, or J-complexity, of two strings. The J-complexity is the number of common distinct factors of two sequences. In other words, the J-complexity of sequences X and Y is J(X,Y) = |I(X) ∩ I(Y)|.

The J-complexity is an efficient way to estimate the degree of similarity of two sequences. Two texts written in similar languages have more factors in common than texts written in very different languages. Furthermore, texts in the same language but on different topics have a smaller Joint Complexity than texts on the same topic.

The analysis of a sequence into subcomponents is done with Suffix Trees, which come with simple, fast and low-complexity methods to store and recall them from memory, especially for short sequences. Suffix Trees are frequently used in DNA sequence analysis. The construction of the Suffix Trees for the words apple and maple, as well as the comparison between them, is shown in Fig. 1.

Figure 1: Suffix Trees of the words apple and maple. JC(apple, maple) = 9.
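As an illustration (not part of the original method), a brute-force computation of both quantities reproduces these numbers; it enumerates all O(m²) substrings explicitly, unlike the suffix-tree approach used in the rest of the paper, and the class and method names are ours:

  import java.util.HashSet;
  import java.util.Set;

  public class NaiveComplexity {
      // All distinct factors of s, including the empty string (case i == j).
      static Set<String> factors(String s) {
          Set<String> f = new HashSet<>();
          for (int i = 0; i <= s.length(); i++)
              for (int j = i; j <= s.length(); j++)
                  f.add(s.substring(i, j));
          return f;
      }

      public static void main(String[] args) {
          Set<String> a = factors("apple");
          Set<String> m = factors("maple");
          System.out.println("|I(apple)| = " + a.size());        // 15
          a.retainAll(m);                                         // keep only common factors
          System.out.println("JC(apple, maple) = " + a.size());  // 9
      }
  }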

In our previous work [Jac07] it is proved that the J-complexity of two texts of size m built from two different binary memoryless sources grows like

\[
\gamma \frac{m^{\kappa}}{\sqrt{\alpha \log m}}, \qquad (1)
\]

for some κ < 1 and γ, α > 0 which depend on the parameters of the two sources. When the sources are identical, the J-complexity grows as O(m), hence κ = 1. When the texts themselves are identical (i.e., X = Y), the J-complexity coincides with the I-complexity and grows as m²/2 [JLS04]. Indeed, the presence of a common factor of length O(m) would inflate the J-complexity by a term O(m²).

We should point out that experiments show that the complexity estimated as above for memoryless sources converges very slowly. Furthermore, memoryless sources are not appropriate for modeling text generation. In a prior work, we extended the J-complexity estimate to Markov sources of any order over a finite alphabet. Markov models are more realistic and give a better approximation of text generation, e.g. of DNA sequences, than memoryless sources [JMS13, MJ14].

Joint string complexity has a variety of applications, such as detecting the degree of similarity of two sequences, for example detecting "copy-paste" in texts or documents, or identifying the author and era of a text. We want to show here that it can also be used for the analysis of social networks (e.g. tweets, which are limited to 140 characters) and for classification. Therefore it may be a pertinent tool for automated monitoring of social networks. Real-time search in blogs, tweets and other social media has to balance between quality and relevance of the content, which, due to the very short size of the texts and their huge number, is still an unsolved problem.

In a prior work [JMS13, MJ14] we derived a second-order asymptotics for the J-complexity of the following form:

\[
\gamma \frac{m^{\kappa}}{\sqrt{\alpha \log m + \beta}}, \qquad (2)
\]

for some β > 0. This new estimate converges more quickly and usually already works for texts of order m ≈ 10². In fact, for some Markov sources our analysis indicates that the J-complexity oscillates with m. This is captured by the introduction of a periodic function Q(log m) in the leading term of our asymptotics. This additional term further improves the convergence for small values of m.

In view of these facts, we can use the J-complexity to discriminate between two identical/non-identical Markov sources [Ziv88]. We introduce the discriminant function

\[
d(X,Y) = 1 - \frac{\log J(X,Y)}{\log m}, \qquad (3)
\]

for two sequences X and Y of length m. This discriminant allows us to determine whether X and Y are generated by the same Markov source or not, by verifying whether

\[
d(X,Y) = O(1/\log m) \to 0 \quad \text{or} \quad d(X,Y) = 1 - \kappa + O(\log\log m/\log m) > 0,
\]

respectively, when the length of X and Y equals m.
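As a minimal sketch (our own class and variable names), the discriminant of Eq. (3) can be evaluated directly from a Joint Complexity value and the common text length; the numeric values in the example are purely illustrative of the two asymptotic regimes:

  public class Discriminant {
      // Eq. (3): d(X,Y) = 1 - log J(X,Y) / log m, where m is the common length of X and Y.
      // The estimate is asymptotic, so it is noisy for very short texts.
      static double d(long jointComplexity, int m) {
          return 1.0 - Math.log(jointComplexity) / Math.log(m);
      }

      public static void main(String[] args) {
          int m = 1000;
          // Same source: J grows roughly linearly in m, so d is close to 0.
          System.out.printf("same source:      d = %.3f%n", d(m, m));
          // Different sources: J grows like m^kappa with kappa < 1, so d stays away from 0.
          System.out.printf("different source: d = %.3f%n", d((long) Math.pow(m, 0.7), m));
      }
  }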

In [JMS13, MJ14] we show experimental evidence of the use of our discriminant function on real and simulated texts of different Markov orders. We compare the J-complexity of a simulated English text with texts of the same length generated in French, Polish, Greek and Finnish by a third-order Markov model, against our theoretical results. Even for texts shorter than a thousand characters, one can easily discriminate between these languages. Therefore, Joint Complexity is an efficient method to capture the degree of similarity of short texts.

3 SNOW Data Challenge Dataset

In accordance with the Social Sensor setup for the SNOW Data Challenge, we collected tweets for 24 hours, between Tuesday Feb. 25, 18:00 and Wednesday Feb. 26, 18:00 (GMT). The crawl yielded over 1,041,062 tweets between the Unix timestamps 1393351200000 and 1393437600000 and was performed through the Twitter Streaming API by following 556,295 users and also tracking four specific keywords: Syria, terror, Ukraine, bitcoin. The dataset used for the experiments consists of 96 timeslots, where each timeslot contains the tweets of a 15-minute interval, starting at 18:00 on Tuesday, Feb. 25th, 2014. The challenge then consisted in providing a minimum of one and a maximum of ten different topics per timeslot, along with a headline, a set of keywords and the URL of a relevant image for each detected topic. The test dataset activity and the statistics of the crawl are described more extensively in [PCA].
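As a quick sanity check on these figures (our own arithmetic, not in the original text):

\[
1393437600000 - 1393351200000 = 86{,}400{,}000~\text{ms} = 24~\text{h} = 96 \times 15~\text{min}.
\]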

4 Topic Detection Method

Until now, the main methods used for text classification have been based on keyword detection and Machine Learning techniques. Using keywords in tweets has several drawbacks, because of wrong spelling or distorted usage of the words (it also requires lists of stop-words to be built for every language) or because of implicit references to previous texts or messages. Machine Learning techniques are generally heavy and complex and therefore may not be good candidates for real-time text processing, especially in the case of Twitter, where we have natural language and thousands of tweets per second to process. Furthermore, Machine Learning processes have to be manually initiated by tuning parameters, which is one of the main drawbacks for this kind of application, where we want minimal, if any, human intervention. Some other methods use information extracted by visiting the URLs contained in the text, which makes them a heavy procedure, since one may have limited or no access to that information, e.g. because of access rights or data size/rate.

In our method we analyze the messages (tweets) based on Suffix Trees [THP04] and we use the Joint Complexity as a metric to quantify the similarity between these messages.

According to the dataset described in Section 3 and in [PCA], we have N = 96 timeslots, indexed by n = 1 ... N. For every tweet t_i^n in the n-th timeslot, where i = 1 ... M and M is the total number of tweets in that timeslot, we build a Suffix Tree ST(t_i^n) as described in Section 2. Building a Suffix Tree is an operation that costs O(m log m) operations and takes O(m) space in memory, where m is the length of the tweet.

Then we compute the Joint Complexity metric JC(t_i^n, t_j^n) of the tweet t_i^n with every other tweet t_j^n of the n-th timeslot, where j = 1 ... M and j ≠ i (by convention we set JC(t_i^n, t_i^n) = 0). The Joint Complexity between two tweets is the number of their common factors, as defined in language theory, and can be computed efficiently in O(m) operations (sublinear on average) by Suffix Tree superposition. For the N timeslots we store the results of the computation in matrices T_1, T_2, ..., T_N of dimension M × M.

Figure 2: Representation of the first row of the n-th timeslot as a weighted graph.

We represent timeslots by fully-connected weighted graphs. Each tweet is a node in the graph, and the two-dimensional array T_n holds the weight of each edge, as shown in Fig. 2. Then we calculate a score for each node in the graph by summing the weights of all edges connected to that node. The node with the highest score corresponds to the most representative and central tweet of the timeslot.

\[
T_n =
\begin{pmatrix}
0 & JC_{t_1^n,t_2^n} & JC_{t_1^n,t_3^n} & \cdots & JC_{t_1^n,t_M^n} \\
JC_{t_2^n,t_1^n} & 0 & JC_{t_2^n,t_3^n} & \cdots & JC_{t_2^n,t_M^n} \\
JC_{t_3^n,t_1^n} & JC_{t_3^n,t_2^n} & 0 & \cdots & JC_{t_3^n,t_M^n} \\
\vdots & \vdots & & \ddots & \vdots \\
JC_{t_M^n,t_1^n} & JC_{t_M^n,t_2^n} & JC_{t_M^n,t_3^n} & \cdots & 0
\end{pmatrix}
\]

Most of the timeslots contain around M = 5,000 tweets, so the matrices T_1, T_2, ..., T_N have roughly 25,000,000 entries each. Since they are symmetric, only half of these entries need to be computed and stored, i.e. the upper (or lower) triangle of each of T_1, T_2, ..., T_N, as shown below, which saves space and makes the structure more efficient.

Algorithm 1 Method to retrieve row i of an upper triangular matrix

  // data is the internal array of TIntArrayList objects:
  // data[k] stores JC(t_k, t_j) for j = k+1 ... M-1
  int[] row = new int[data.length + 1];
  // read column i of all rows above i:
  for k = 0 to i - 1 do
      row[k] ← data[k].get(i - k - 1);
  end for
  row[i] ← 0;  // by convention, not computed
  if i < data.length then
      // read the stored part of row i until the end:
      for j = 0 to data.length - i - 1 do
          row[i + j + 1] ← data[i].get(j);
      end for
  else
      // do nothing: the last tweet does not have a stored row
  end if
  return row;

\[
T_n^{\mathrm{up\,triang}} =
\begin{pmatrix}
JC_{t_1^n,t_2^n} & JC_{t_1^n,t_3^n} & \cdots & JC_{t_1^n,t_M^n} \\
 & JC_{t_2^n,t_3^n} & \cdots & JC_{t_2^n,t_M^n} \\
 & & \ddots & \vdots \\
 & & & JC_{t_{M-1}^n,t_M^n}
\end{pmatrix}
\]

The actual implementation of the upper triangular data structure is based on the efficient TIntArrayList data structure provided by the Trove4j library (it uses arrays of int instead of Java objects), encapsulated in a class which provides methods to access individual elements by their row and column indices, or a whole row by providing its index. Algorithm 1 illustrates the getRow() method of this class.
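A minimal sketch of such a triangular store is given below; for brevity it uses plain int arrays instead of Trove4j's TIntArrayList, and the class and method names are ours. Row i holds JC(t_i, t_j) for j > i only, which matches the layout read by Algorithm 1:

  public class TriangularJCMatrix {
      private final int[][] data;   // data[i][j - i - 1] = JC(t_i, t_j) for j > i

      TriangularJCMatrix(int m) {   // m = number of tweets in the timeslot
          data = new int[m - 1][];
          for (int i = 0; i < m - 1; i++) data[i] = new int[m - 1 - i];
      }

      void set(int i, int j, int jc) {          // i != j assumed
          if (i < j) data[i][j - i - 1] = jc;
          else       data[j][i - j - 1] = jc;
      }

      int get(int i, int j) {                   // JC(t_i, t_i) = 0 by convention
          if (i == j) return 0;
          return (i < j) ? data[i][j - i - 1] : data[j][i - j - 1];
      }
  }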

The whole cross Joint Complexity computation was run in a multithreaded way on a 24-processor machine: k = 23 threads are started and each thread works on a disjoint set of rows. Note that, because the rows of the triangular matrix do not all have the same number of elements, the rows should not be divided evenly among the threads, otherwise the first threads would have much more work to do than the last ones (remember that the input is an upper triangular matrix); instead, since the total number of non-zero elements of the M × M matrix is M(M - 1)/2, each thread should work on approximately M(M - 1)/(2k) elements, so the implementation kept assigning rows to a thread until that number of elements was reached.
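The row-to-thread assignment described above can be sketched as follows (class, method and variable names are ours); each thread receives consecutive rows until it has accumulated roughly M(M - 1)/(2k) matrix entries, and the last thread takes whatever is left:

  public class RowBalancer {
      // Split rows 0..M-2 of an upper-triangular M x M JC matrix into k chunks of
      // roughly M(M-1)/(2k) entries each; row i contributes M-1-i entries.
      static int[][] balanceRows(int m, int k) {
          long target = (long) m * (m - 1) / (2L * k);
          int[][] chunks = new int[k][];
          int row = 0;
          for (int t = 0; t < k; t++) {
              int start = row;
              long load = 0;
              while (row < m - 1 && (t == k - 1 || load < target)) {
                  load += m - 1 - row;
                  row++;
              }
              chunks[t] = new int[]{start, row};   // half-open row range [start, row)
          }
          return chunks;
      }

      public static void main(String[] args) {
          for (int[] c : balanceRows(5000, 23))
              System.out.println("rows [" + c[0] + ", " + c[1] + ")");
      }
  }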

This implementation allowed the program to compute one 15-minute timeslot in 95 seconds on average.

These computations were only run once, as soon as the data had been properly divided into 15-minute timeslots, and the results were saved in files which were subsequently used for the rest of the implementation.

Once we have the Joint Complexity scores, we look for the R most representative and central tweets of every timeslot. First we compute the sum of the Joint Complexity of the i-th tweet with every other tweet j in the n-th timeslot,

\[
S_i^n = \sum_{j=1,\; j \neq i}^{M} JC_{t_i^n, t_j^n},
\]

and finally we obtain the vector S^n = [S_1^n, S_2^n, ..., S_M^n] for every timeslot.
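In code, the centrality score of each tweet is simply a row sum of the (symmetric) JC matrix; below is a minimal sketch with a small dense matrix for clarity (the actual implementation reads the rows from the triangular store described above, and the names are ours):

  public class CentralityScores {
      // S_i = sum over j != i of JC(t_i, t_j), for a full symmetric M x M matrix jc.
      static long[] scores(int[][] jc) {
          long[] s = new long[jc.length];
          for (int i = 0; i < jc.length; i++)
              for (int j = 0; j < jc.length; j++)
                  if (j != i) s[i] += jc[i][j];
          return s;
      }

      public static void main(String[] args) {
          int[][] jc = { {0, 9, 2}, {9, 0, 3}, {2, 3, 0} };   // toy 3-tweet timeslot
          long[] s = scores(jc);
          for (int i = 0; i < s.length; i++)
              System.out.println("tweet " + i + ": S = " + s[i]);
          // The most central tweet of the timeslot is the argmax of S.
      }
  }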

We sort the elements of each vector S^n in descending order and select the R most representative and central tweets in the following way: the best-ranked tweet is chosen unconditionally; the second one is picked only if its JC score with the first one is below a chosen threshold Thr_low, otherwise it is added to the list of related tweets of the first tweet; similarly, the third one is picked only if its JC score with the first two is below Thr_low, and so on. This ensures that the topics are dissimilar enough, and at the same time it classifies the best-ranked tweets into topics.
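A sketch of this greedy selection is shown below (class, method and variable names are ours); ranked contains the tweet indices sorted by decreasing score, jc is the dense JC matrix, and thrLow plays the role of Thr_low:

  import java.util.ArrayList;
  import java.util.List;

  public class TopicSelection {
      // Each topic is a list of tweet indices whose first element is the topic's head tweet;
      // a candidate starts a new topic only if its JC with every existing head is below thrLow,
      // otherwise it becomes a related tweet of the first topic it is too similar to.
      static List<List<Integer>> selectTopics(int[] ranked, int[][] jc, int thrLow, int maxTopics) {
          List<List<Integer>> topics = new ArrayList<>();
          for (int t : ranked) {
              boolean attached = false;
              for (List<Integer> topic : topics) {
                  if (jc[t][topic.get(0)] >= thrLow) {   // too similar to this topic's head
                      topic.add(t);
                      attached = true;
                      break;
                  }
              }
              if (!attached && topics.size() < maxTopics) {
                  List<Integer> topic = new ArrayList<>();
                  topic.add(t);                           // new topic head
                  topics.add(topic);
              }
          }
          return topics;
      }
  }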

Then, after removing punctuation, special characters, etc., from each selected tweet, we construct the headline of each topic, and we run through the list of related tweets to keep only tweets that are different enough from the selected one (we do not want duplicates); we do so by keeping only the tweets whose JC score with the selected tweet and all previous related tweets is above a chosen threshold Thr_max. We first chose the empirical values 400 and 600 for Thr_low and Thr_max respectively; then, a few hours before submission time, we noticed that many topics had only one related tweet (all the others were retweets), so we decided to lower Thr_low to 240, but we did not have time to recompute the results on the whole dataset, so only a handful of timeslots benefited from this better Thr_low = 240 value. The final plan is to come up with a formula letting the system determine these thresholds automatically, depending on the number of characters of each tweet.

While running through the list of related tweets, we compute the bag of words used to construct the list of keywords, and we also check the original JSON data to find a URL pointing to a valid image related to the topic.

We chose to report the top 8 topics of each timeslot.

4.1 Keywords

From the bag of words constructed from the list of related tweets, we remove articles (stop-words), punctuation, special characters, etc.; this yields a list of words, which we then order by decreasing frequency of occurrence. Finally we report the k most frequent words as the list of keywords K = [K_{1...k}^1, K_{1...k}^2, ..., K_{1...k}^N], for the N timeslots.
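A sketch of the keyword extraction for one topic is given below (our own names; the stop-word list here is a tiny illustrative sample, not the one used in the actual run):

  import java.util.*;
  import java.util.stream.Collectors;

  public class Keywords {
      // Illustrative stop-word sample only.
      static final Set<String> STOP =
          new HashSet<>(Arrays.asList("the", "a", "an", "of", "to", "in", "is", "and", "rt"));

      // Returns the k most frequent non-stop-words over the related tweets of one topic.
      static List<String> topKeywords(List<String> relatedTweets, int k) {
          Map<String, Integer> freq = new HashMap<>();
          for (String tweet : relatedTweets)
              for (String w : tweet.toLowerCase().split("[^\\p{L}\\p{Nd}]+"))   // drop punctuation
                  if (!w.isEmpty() && !STOP.contains(w))
                      freq.merge(w, 1, Integer::sum);
          return freq.entrySet().stream()
                     .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                     .limit(k)
                     .map(Map.Entry::getKey)
                     .collect(Collectors.toList());
      }
  }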

4.2 Media URLs

The body of a tweet (in the .json file format) contains URL information for links to media files such as pictures or videos; when available, this information is stored in the following subsection of the JSON object: entities → media → media_url.
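A sketch of that lookup using the org.json library (our choice of parser and our method name; any JSON library would do):

  import org.json.JSONArray;
  import org.json.JSONObject;

  public class MediaUrlExtractor {
      // Returns the first entities -> media -> media_url of a tweet's JSON body, or null if absent.
      static String mediaUrl(String tweetJson) {
          JSONObject tweet = new JSONObject(tweetJson);
          JSONObject entities = tweet.optJSONObject("entities");
          if (entities == null || !entities.has("media")) return null;
          JSONArray media = entities.getJSONArray("media");
          if (media.length() == 0) return null;
          return media.getJSONObject(0).optString("media_url", null);
      }
  }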

While reporting the most representative and central tweets, we scan the original JSON in order to retrieve such a URL, from the most representative tweet or any of its related tweets, pointing to a valid photo or picture in .jpg, .png or .gif format. We then report these pictures along with the headlines and the sets of keywords, as shown in Algorithm 2.

Almost half of the headlines (47%) produced by our method had an image retrieved from the original tweet. When no such URL is provided within the collected JSON objects, we planned to visit the URLs of linked websites and retrieve images from there, but we did not have sufficient time to implement this in a way that enforced that the retrieved image was indeed relevant to the topic. It is also worth pointing out that the goal of the challenge was to detect newsworthy items before they hit mainstream news websites, so it was decided that parsing images from such websites was not interesting in that context.

5 Evaluation

Twitter is one of the most popular social networks and micro-blogging services in the world, with more than 540 million users connected by 24 billion links [GL12]. It allows information to spread between people, groups and advertisers, and since the relation between its users is unidirectional, information propagation within the network is similar to the way information propagates in real life.

Apart from the specific implementation for the SNOW Data Challenge, the main benefits of our method are that we can both classify the messages and identify growing trends in real time, without having to manually maintain lists of keywords for every language. We can track the information and its timeline within a social network and find groups of users that agree or share the same interests, i.e., perform trend sensing.

The official evaluation results of our method in the Data Challenge are included in [PCA]; therefore, here we only present a quick comparison of our method with a baseline that simply counts the number of retweets within one chosen timeslot.

Algorithm 2 Topic detection based on Joint Complexity

  // N = # timeslots, M = # tweets in the n-th timeslot
  for n = 1 to N do
      for i = 1 to M do
          t ← t_json.getText();
          t_ST ← suffixTreeConstruction(t);
          JCScores ← JCMetric();
      end for
      // Find the most representative & central tweets
      S^n ← sum(JCScores);
      // Get headlines for the central tweets
      R^n ← descendingOrder(S^n);
      // Get set of keywords
      K^n ← keywords(R^n);
      // Get URLs of pictures from the .json file
      P^n ← mediaURL(R^n);
      // Print the results in appropriate format
      Print(R^n);
      Print(K^n);
      Print(P^n);
  end for

The chosen timeslot, Feb. 25th from 8:15 pm until 8:30 pm, comprises just under 16,000 tweets, about 50% more than the per-timeslot average over the whole 24-hour period.

The ranking produced by our method for the timeslot 25-02-14 20:15 is listed below:

(JC1) BarackObama: LIVE: President Obama is

announcing a new plan to boost job growth

for middle-class Americans.

(JC2) WhiteHouse: "I’m here to announce that

we’re building Iron Man | Not really. Maybe.

It’s classified." President Obama ActOnJobs

(JC3) Madonna: Fight for the right to be free!

Fight Fascism and discrimination everywhere!

Free Venezuela the Ukraine...

(JC4) Karinabarbosa: civicua: Free Venezuela !

Ukraine is with you! euromaidan Photo by

Liubov Yeremicheva

(JC5) TheEllenShow: I’m very excited Pharrell’s

performing his big hit "Happy" at the Oscars.

Spolier alert: I’ll be hiding in his hat.

(JC6) BBCSport: Olympiakos take the lead vs

Manchester United, a hideous deflection off

Alejandro Dominguez beats De Gea.

(JC7) WhatTheBit: It’s going to be pretty damn

near impossible to top this as photo of the week.

(JC8) Al Qaeda branch in Syria issues ultimatum

to splinter group: The head of an al

Qaeda-inspired militia fighting...

A small program was written to compare the number of retweets observed on the first occurrence of a retweet with the number observed on the last occurrence of the same retweet, and to subtract the former from the latter. All tweets were then sorted in descending order of the number of retweets thus obtained, and the result is shown below:

(RT1) RT @OllieHolt22: Ashley Young berating

someone for going down too easily - that’s funny

(RT2) RT @WhiteHouse: "I’m here to announce that

we're building Iron Man | Not really. Maybe. It's

classified." @President Obama #ActOnJobs

(RT3) RT @BarackObama: LIVE: President Obama is

announcing a new plan to boost job growth for

middle-class Americans. http://t.co/d6x1Bous1S

(RT4) RT @TheEllenShow: I’m very excited @Pharrell’s

performing his big hit "Happy" at the #Oscars.

Spolier alert: I’ll be hiding in his hat.

(RT5) RT @Madonna: Fight for the right to be

free! Fight Fascism and discrimination everywhere!

Free Venezuela the Ukraine...

(RT6) RT @civicua: Free #Venezuela ! #Ukraine is

with you! #euromaidan Photo by Liubov Yeremicheva

(RT7) RT @russellcrowe: Holy Father @Pontifex , it

would be my deepest pleasure to bring the

@DarrenAronofsky film to you to screen. [...]

(RT8) RT @DjKingAssassin: $myry recovering

on #bitcoin market update on #mtgox

The following differences can be observed between the two rankings:

• The best-ranked retweet does not appear in the list produced by our method. In fact, a plausible explanation for RT1 and RT8 not being picked up by our method is a weak centrality due to the shorter size of their text, as briefly explained in Section 6. For RT7, however, the size of the text is not a factor, and the weak centrality is most probably explained by the unlikely co-occurrence of terms within the sentence. Although this is highly subjective, it is worth noting that those tweets do not appear to be very newsworthy.

• Five out of eight items produced by our method (JC1 to JC5) are listed in the retweet ranking; this is rather expected, as there is a mechanical link between our method and the number of occurrences of the same text.

• The eighth item produced by our method (JC8) seems very newsworthy but appears only past the 100th position in the retweet-count ranking. Again, although this is subjective, it appears to be an interesting result.

Finally, although the dataset used for this challenge did not allow us to show it properly, the authors strongly believe that one key advantage of using Joint Complexity is that it can deal with languages other than English [JMS13, MJ14] without requiring any additional human intervention.

6 Conclusion and Future Work

In this short paper we presented an implementation of a topic detection method applied to a dataset of tweets emitted during a 24-hour period. The implementation relies heavily on the concept of Joint String Complexity, which has the benefit of being language-agnostic and of not requiring humans to maintain lists of keywords. The results obtained seemed satisfactory, and the final evaluation in the context of the SNOW Data Challenge 2014 can be found in [PCA].

As the timeframe available to participate in the challenge was quite short, a (probably non-exhaustive) number of items have been identified as possible future work; they are listed below:

• As mentioned in Section 4, although the current empirical values chosen as thresholds to determine whether two texts are similar and whether two texts are near-duplicates seemed quite satisfactory on the dataset our method was tested against, it would be more useful to compute values that adjust to the size of the texts.

• The current implementation systematically picks the top 8 topics from each timeslot, whereas a better approach might be to identify a threshold under which topics are considered not significant enough, thus allowing some not very active timeslots to contain fewer than 8 topics while other, very active, timeslots may contain more than 8.

• Dividing the timeslots every quarter of an hour is arbitrary, as some topics may be cut in half if they started after the middle of the previous timeslot and may therefore not acquire enough significance in the current timeslot. Before aggregating the tweets into 15-minute timeslots, our implementation first divided the data into 5-minute timeslots, and we considered building 20-minute timeslots (each including the 5 minutes preceding the official 15-minute slot) to remedy the above issue, but checking whether this provides significant improvements is left as future work.

• The current implementation, which sorts the topics by finding the most central tweets, is a bit unfair to shorter tweets, because longer tweets mechanically have a better chance of obtaining a higher centrality score (Joint Complexity counts the number of common factors); future work should therefore mitigate this. One idea may be to modify the Joint Complexity metric so that it behaves like a distance, and to perform clustering based on this distance.

Acknowledgment

This work is supported by the SocialSensor and REVEAL FP7 projects under contract numbers 287975 and 610928, respectively. Dimitris Milioris would like to thank INRIA, Ecole Polytechnique ParisTech and LINCS.

References

[APM+13] L. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, and A. Jaimes. Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 15(6):1268–1282, October 2013.

[BH11] V. Becher and P. A. Heiber. A better complexity of finite sequences. In 8th Int. Conf. on Computability and Complexity in Analysis and 6th Int. Conf. on Computability, Complexity, and Randomness, page 7, 2011.

[GL12] M. Gabielkov and A. Legout. The complete picture of the Twitter social graph. In ACM CoNEXT Student Workshop, 2012.

[IYZ02] L. Ilie, S. Yu, and K. Zhang. Repetition complexity of words. pages 320–329, 2002.

[Jac07] P. Jacquet. Common words between two random strings. In IEEE International Symposium on Information Theory, Nice, France, 2007.

[JLS04] S. Janson, S. Lonardi, and W. Szpankowski. On average sequence complexity. pages 213–227, 2004.

[JMS13] P. Jacquet, D. Milioris, and W. Szpankowski. Classification of Markov sources through joint string complexity: Theory and experiments. 2013.

[LV93] M. Li and P. Vitanyi. Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, Berlin, August 1993.

[MJ14] D. Milioris and P. Jacquet. Joint sequence complexity analysis: Application to social networks information flow. Bell Labs Technical Journal, 18(4), 2014. Issue on Data Analytics.

[Nie99] H. Niederreiter. Some computable complexity measures for binary sequences. In C. Ding, T. Helleseth, and H. Niederreiter, editors, Sequences and their Applications, Discrete Mathematics and Theoretical Computer Science, pages 67–78. Springer London, 1999.

[PCA] S. Papadopoulos, D. Corney, and L. Aiello. SNOW 2014 data challenge: Assessing the performance of news topic detection methods in social media. In Proceedings of the SNOW 2014 Data Challenge, 2014.

[THP04] S. Tata, R. Hankins, and J. Patel. Practical suffix tree construction. In Proceedings of the 30th VLDB Conference, 2004.

[Ziv88] J. Ziv. On classification with empirically observed statistics and universal data compression. IEEE Transactions on Information Theory, 34:278–286, 1988.