Can blog communication dynamics be correlated with stock market activity

23
November 5, 2008 @ACM HyperText and HyperMedia 2008, Pittsburgh, PA 1 Can Blog Communication Dynamics be correlated with Stock Market Activity? Munmun De Choudhury Hari Sundaram Ajita John Dorée Duncan Seligmann November 5, 2008 Arts, Media & Engineering Arizona State University, Tempe Collaborative Applications Research Avaya Labs, New Jersey

Transcript of Can blog communication dynamics be correlated with stock market activity

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 1

Can Blog Communication Dynamics be

correlated with Stock Market Activity?

Munmun De Choudhury

Hari Sundaram

Ajita John

Dorée Duncan Seligmann

November 5, 2008

Arts, Media & Engineering

Arizona State University, Tempe

Collaborative Applications Research

Avaya Labs, New Jersey

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 2

Introduction

� Why is it important?

• Provides insights into understanding communication patterns of people.

• The communication dynamics yield correlations with certain external events, justifying their predictive power.

• Useful to corporations interested in identifying the ‘moods’ of people in response to company related events.

A model to: (a) study and analyze communication dynamics in the blogosphere, and (b) use

these dynamics to determine interesting correlations with stock market movement.

November 5, 2008

Figure: Normalized predicted stock movements of four

companies . Red represents negative movement, and blue

positive movement.

Graph for Blog communication Stock Movement data

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 3

Our Approach

� Characterize the communication dynamics in www.engadget.com through several contextual features for a particular company.

� Use these features and stock market movement in an SVM regressor to predict the movement.

� Our technique yields satisfactory results with errors of 22 % and 13 % in predicting the magnitude and direction of movement respectively.

B1 B2 SVR Actual B1 B2 SVR Actual B1 B2 SVR Actual

APPLE GOOGLE MICROSOFT NOKIA

B1 B2 SVR Actual

Positive stock movement

Negative stock movement

Time

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 4

Related Work

� Work on determining correlations between activity in Internet message boards (through frequency counts of relevant messages) and stock volatility and trading volume by Antweiler et al.

� Gruhl’s work on determining if blog data exhibit any recognizable pattern prior to spikes in the ranking of the sales of books on Amazon.com.

� Limitation:• Modeling information roles and communication dynamics in the prior work

have been done in a context-independent manner.

November 5, 2008

Antweiler et al.

Gruhl et al.

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 5

Introduction / Related work

Problem StatementModeling Information (Activity) Roles

Modeling Communication Dynamics

Determining Correlation

Experimental Results

Conclusions

Outline

November 5, 2008 5

• What is online communication

activity?

• Data Model

Posts

Comments

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 6

What is online communication activity?

� Blogs serve as rich sources of active discussion on a wide variety of events.

� In a technically blog community like Engadget, editors post news stories, and only authorized readers can comment, thus creating a high quality information source.

� Communication activity in a blog refers to the communication participation of agroup of users who perform specific actions with respect to certain events.

• In this paper, the events are denoted simply by the posts related to a company name.

� We are interested in observing how these activity dynamics could be predictors of the events they relate to.

Time

Event is supposed to occur Blog communication activity Event occurs

iPhone release

Posts

Comments

Stock movement for ‘Apple’

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 7

Data Model

� The data model can be described by a 3-tuple {V, Σ, Ψ} having the following entities with respect to a query:• V: A set of individuals – bloggers and commenters, V = {i | i = 1 … N},

• Σ : A set of posts Σ; each post p is associated with an author (blogger) i, a time stamp of posting ts and a tag cloud Φp, giving the following tuple, {(i , ts, Φp)} for each p,

• V: A set of comments Ψ on each post p in Σ; each comment has an author (commenter) j and time stamp tm tuple {(j,tm)}.

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 8

Introduction / Related work

Problem Statement

Modeling Information

(Activity) Roles

Modeling Communication Dynamics

Determining Correlation

Experimental Results

Conclusions

Outline

November 5, 2008 8

• Roles due to responsivity

• Roles due to measure of

participation

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 9

What are Information roles due to activity?

� People regularly involved in communication develop certain structures which characterizes their communication behavior. This we call an information role due to activity.

� The information roles of people characterize the nature of communication dynamics in the community.

� We define roles along two dimensions:• Roles due to response behavior

o Early Responders

o Late Trailers

• Roles due to measure of communication activity

o Loyals

o Outliers

People respond quickly A highly responsive

community; showed by short

distance between nodes

People write several

comments frequentlyA highly active community;

showed by thickness of edges

Significance of analysis of information roles of people

Response time

Number of comments

Nu

mb

er

of

pe

op

leN

um

be

r o

f p

eo

ple

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 10

Activity Role due to Responsivity

� Response time of a comment is the time elapsed between the publishing of the original blog and the publishing of the comment.

� The normalized response time of a comment depends on

• the time at which it was published and

• the rank of the comment

� The response time rc(x) of post px is defined as,

� We classify a person P as early responder Eor late trailer T as,

Figure : Importance of rank of a comment.

Response Distribution of a Blog Post

0

0.2

0.4

0.6

0.8

1

1.2

1 4 7 10 13 16 19 22 25 28 31 34 37 40

Comment IDs

Res

pon

se T

ime

( )

( )1 2( ) 1 1

( )

m sc

c

e s

t tr x

t t n x

κθ θ −

= − + − −

Rank of comment cκ

Number of comments on px

nc(x)

Normalizing weightsθ1, θ

2

Time of last comment on px

te

Time of comment ctm

Time of posting of px

ts

If , then

otherwise

r P E

P T

ρ≥ ∈

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 11

Activity Role due to measure of participation

� An activity distribution is constructed per topic where the measure of participation of an individual is the number of comments she wrote over a period of time.

� People are classified into loyals L and outliers O based on a suitably chosen threshold θ over their measure of participation across several weeks in the topic activity distribution.

� Let a person P post C comments over time [0, t].

Activity Distribution

0

20

40

60

80

100

120

osh

ean

Ch

arl

es

2ti

mer

AD

MIN

V

Cla

rio

n

CW

Bri

an

Bh

aal

Bo

zo

Bain

s

An

toin

e

Bre

nd

an

# C

om

men

ts

Figure : Activity distribution.

If , then

otherwise

C P L

P O

θ≥ ∈

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 12

Introduction / Related work

Problem Statement

Modeling Information (Activity) Roles

Modeling Communication

Dynamics

Determining Correlation

Experimental Results

Conclusions

Outline

November 5, 2008 12

• Contextual features of

communication activity

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 13

Modeling Communication Dynamics

� For the set of all features, we assume that that the stock movement yt on a certain weekday depends on the communication activity over a period of time (we consider past week).

• Number of posts

• Number of comments

• Length, response time of comments

o Mean and standard deviations of the length and response times of comments:

ki: total number of posts on day (t – i)

lt-ic(x): mean length of comments on post px at day (t – i)

• Strength of comments

o In Engadget, every comment is marked for its significance by other users: highest ranked,

highly ranked, neutral, low ranked and lowest ranked.

• Size of early responders / late trailers set

• Size of loyals / outliers set

1

1( )

ikl c

t i t i

xi

l xk

µ − −

=

= ∑2

1

1( )

ikl c c

t i t i t i

xi

l x lk

σ−

− − −

=

= −

,

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 14

Introduction / Related work

Problem Statement

Modeling Information (Activity) Roles

Modeling Communication Dynamics

Determining CorrelationExperimental Results

Conclusions

Outline

November 5, 2008 14

• Dataset

• Determining stock movement

of a company

• SVR Prediction of stock

movement

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 15

Dataset

1,94826010718Nokia

4,46943219422Microsoft

2,73134714219Google

5,09257327624Apple

#Comments#Commenters#Posts#EditorsTopics

� The stock movement of a company c at a day t to be the change in stock price from the closing value of the past day, normalized by the return of the past day.

, where φt is the stock price of the company at day t.

� Note: It is important to take into account the effect of the overall stock market sentiment as well.

• For example, a negative movement of the stock prices of a particular company may be attributed due to negative movements in the overall stock market index (e.g. NASDAQ, ISE, S&P500 etc).

� We use NASDAQ index (http://www.finance.google.com) for the four companies.

( )1

1

t tc

t

t

yϕ ϕ

ϕ

−=

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 16

SVR Prediction of Stock Movement

� Features vectors are determined from the crawled dataset http://www.engadget.com.

� We use these features and stock market movement of the company over N weeks for training an SVM regressor.

� The trained parameters are used to predict the movement at the (N+1)th week.

where X are the input features at each day. The convex optimization problem is given as,

where ε is the allowed precision.

, with , c

ty w x b w b= + ∈ ∈X ℝ

21minimize

2

,subject to

,

i i

i i

w

y w x b

w x b y

ε

ε

− − ≤

+ − ≤

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 17

Introduction / Related work

Problem Statement

Modeling Information (Activity) Roles

Modeling Communication Dynamics

Determining Correlation

Experimental Results

Conclusions

Outline

November 5, 2008 17

• Baseline Methods

• Prediction Results

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 18

Baseline Methods

� Comment Frequency (B1): This method uses the frequency of comments per day. We use a linear regressor to learn the correlation coefficients incrementally based on stock movements and number of comments:

,

where ŷct is the stock movement of company c at day t and fc

t-i is the frequency of comments i days before t.

� Linear Relationship among features (B2): In this method, we assume a linear relationship between the contextual features anduse a linear regressor to learn the correlation coefficients incrementally.

• The regression fit is given as, Y = α ∙ X, where α is the vector of α1, α2, …, αn, the correlation coefficients and Y and X are the vectors of movements and communication features.

^

1 1 2 2 7 7

c

c c c

t t t t t tty f f fα α α− − − − − −= + + +…

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 19

Figure : Visualization of stock movements with time on vertical scale. B1

and B2

are the two baseline techniques. The SVR

prediction is found to follow the movement trend very closely with an error of 13.41 %.

Open Text

Corporation extends

alliance with

Microsoft (Aug 20)

Apple Beatles settles

trademark lawsuit

(Feb 5)

B1 B2 SVR Actual B1 B2 SVR Actual B1 B2 SVR Actual

APPLE GOOGLE MICROSOFT NOKIA

B1 B2 SVR Actual

Ipod classic, ipod

touch (Sep 5)

US STOCKS-

Tech drag trips

Dow, Nasdaq

extends drop (Nov

29)

iPhone release

(Jun 29)

Google outbids

Microsoft for Dell

bundling deal

(May 25)

Google getting

into e-books (Jan

22)

Nokia E61i, E65

(Jun 25)

Microsoft Xbox 360

Arcade release (Oct

23)

Microsoft loses EU

anti-trust appeal (Sep

17)

Nokia NGage gaming

service (Aug 29)

Nokia recalls 46M

batteries (Aug 14)

Microsoft buying

Yahoo rumor (May

4)

Microsoft To Acquire

Parlano (Aug 30)

The FTC to conduct

antitrust review of

Google and

DoubleClick deal

(May 29)

Experimental Results

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 20

Introduction / Related work

Problem Statement

Modeling Information (Activity) Roles

Modeling Communication Dynamics

Determining Correlation

Experimental Results

Conclusions

Outline

November 5, 2008 20

• Summary

• Future Work

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 21

� A simple model

• to study and analyze communication dynamics in blogosphere and

• use those dynamics to determine interesting correlations with stock market movement.

� Characterize the communication dynamics in a blog through several contextual features for a particular company.

� Excellent results on stock movement prediction of four companies using a widely read technology blog dataset www.engadget.com

November 5, 2008 21

Summary

Blog communication + stock movement

correlation

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 22

Conclusions

� Consequences

• Blog communication exhibits remarkable predictive power.

� Future Work

• Characterize communication activity at multiple scales.

• Identifying an implicit macro property that underlies communication dynamics on the blogosphere – accountable by a vocal minority or majority.

Individuals Groups Community

Characterizing communication at multiple scales

November 5, 2008@ACM HyperText and HyperMedia 2008, Pittsburgh, PA 23November 5, 2008 23

Thanks!

November 5, 2008 23