Trading volume at Avanza - DiVA-Portal

69
INOM EXAMENSARBETE TEKNIK, GRUNDNIVÅ, 15 HP , STOCKHOLM SVERIGE 2019 Trading volume at Avanza GRETA KNUTSSON KAMYAR ESPAHBODI KTH SKOLAN FÖR TEKNIKVETENSKAP

Transcript of Trading volume at Avanza - DiVA-Portal

INOM EXAMENSARBETE TEKNIK,GRUNDNIVÅ, 15 HP

, STOCKHOLM SVERIGE 2019

Trading volume at Avanza

GRETA KNUTSSON

KAMYAR ESPAHBODI

KTHSKOLAN FÖR TEKNIKVETENSKAP

Trading volume at Avanza GRETA KNUTSSON KAMYAR ESPAHBODIROYAL

Degree Projects in Applied Mathematics and Industrial Economics (15 hp) Degree Programme in Industrial Engineering and Management (300 hp) KTH Royal Institute of Technology year 2019 Supervisor at Avanza, Robert Ingemarsson Supervisors at KTH: Camilla Landén, Julia Liljegren Examiner at KTH: Jörgen Säve-Söderbergh

TRITA-SCI-GRU 2019:153 MAT-K 2019:09

Royal Institute of Technology School of Engineering Sciences KTH SCI SE-100 44 Stockholm, Sweden URL: www.kth.se/sci

Abstract

Producing a model explaining the trading volume can be attractive for companies

who’s main revenue resides on it. Previous studies have shown that factors such

as stock returns, volatility and uncertainty affects the trading volume.

The purpose of this work is to clarify the consensus that prevails and determine

the factors that impact Avanza’s customers trading volume. Factors such as daily

stock returns and economic, political and financial uncertainty are analyzed through

amultiple linear regression analysis with a daily time period between 2000-2019.

The work is thus designed within the framework of mathematical statistics and

industrial economics.

To be able to draw a conclusion, further investigation is required in the form of

a time series analysis in combination with a deeper understanding of the applied

area and the mathematical methods that have been used.

Keywords

Bachelor Thesis, Regression Analysis, Trading Volume, Economic Politic Uncer-

tainty, Stock Price, Avanza

i

Abstract

Att ta fram en modell som förklarar handelsvolymen kan vara eftertraktat hos

företag vars huvudintäkter beror av den. Tidigare forskning visar att faktorer

som prisförändringar på aktiemarknaden, volatilitet och osäkerhet påverkar han-

delsvolymen.

Syftet med arbetet är klargöra den konsensus som råder och fastställa de faktorer

som har störst påverkan gällande handelsvolymen för Avanza’s kunders. Fak-

torer som dagliga förändringar inom börsmarknaden och ekonomisk, politisk och

finansiell osäkerhet har genom en multipel linjär regressionsanalys analyserats

med en daglig tidsperiod mellan 2000-2019. Arbetet är således utformat inom

ramen för matematisk statistik och industriell ekonomi.

För att kunnadra en slutsats krävs vidare undersökning i formav en tidsserieanalys

och en djupare förståelse av det tillämpade området och metoderna som har an-

vänds.

Nyckelord

Kandidatexamensarbete, Regressionsanalys, Handelsvolym, Politisk Osäkerhet,

Börsindex, Avanza

ii

Acknowledgements

Wewould like to thankour supervisor at theRoyal Institute of Technology, Camilla

Johansson Landén from the Department of Mathematics, and Robert Ingemars-

son and Rasmus Åkerblom fromAvanza Holding Bank AB, whom assisted us with

relevant data and supported us throughout this thesis.

iii

Authors

Kamyar Espahbodi <[email protected]> and Greta Knutsson <[email protected]>Industrial Engineering and ManagementKTH Royal Institute of TechnologyApplied Mathematics TMAI

Place for Project

Stockholm, Sweden

Supervisors

Camilla Johansson LandénLindstedtsvägen 25, 114 28, StockholmKTH Royal Institute of Technology

Julia LiljegrenLindstedtsvägen 30, 114 28, StockholmKTH Royal Institute of Technology

Robert IngemarssonRegeringsgatan 103, 111 39, StockholmAvanza Bank Holding AB

Contents

1 Introduction 11.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Economic Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.4 Goal and Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.6 Stakeholder Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.7 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Regression Analysis 92.1 Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Ordinary Least Squares (OLS) . . . . . . . . . . . . . . . . . . . . . 10

2.3 OLS Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 Residual analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.5 Leverage and Influence . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.6 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.7 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Data 273.1 Trading Volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2 Equity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3 Volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4 Economic Policy Uncertainty (EPU) . . . . . . . . . . . . . . . . . . 29

4 Result 314.1 Residual Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 Leverage and influential points . . . . . . . . . . . . . . . . . . . . . 32

4.3 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.4 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.5 Final Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5 Discussion 465.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

v

5.2 Model Adequacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.3 Previous studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.5 Avanza . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6 Conclusion 51

References 52

vi

1 Introduction

A traditional stockbroker is a person who executes buy and sell orders on the be-

half of their client. Nowadays, the influence of internet has enabled the stockbro-

ker to carry out orders at a lower commission rate than before, hence the birth

of the discount broker. The discount broker’s business is as such online based,

which implies low overhead costs and thus reduced commission rates and fees

for the client. The chosen strategy for battling the market competition being high

volume and low cost, the broker’s main income, in a broad sense, mainly depends

on the number of client transactions. Therefore, it lies in the broker’s interest

to fully understand the behaviour of the driving factors behind their customers

trading volume, since it directly correlates with their aggregated commission rev-

enue.

1.1 Background

Avanza BankHolding AB, formally founded 1999, is Sweden’s largest online stock

brokerwith over 800,000 customers.1 Its business idea is to, through low charges,

offer a broad variety of saving products and educational and supporting services

within the area of personal saving in Sweden. They also offer market competi-

tive mortgage loans and pension solutions 2 and have been ranked as having the

happiest customers nine times in a row for each year within the industry.3

As any other online stock broker, Avanza takes a share of each customers trade, a

commission, which constitutes their primary source of income. Hence, their ag-

gregated revenue is directly dependent on their trading volume, i.e. the number

of shares transacted every day. By understanding the underlying mechanisms af-

fecting their customers trading volume, it is possible for Avanza optimize their

resources through forecasting or developing new products based on the special

behaviour, streamlining their organization.

1Avanza©,History2Avanza Bank Holding AB, Årsredovisning 20183Svenskt Kvalitetsindex, Personlig service utmanar digitala tjänster

1

1.2 Problem

However, Avanza’s customers trading behaviour does not solely depend on inter-

nal factors, such as new user interface improvements or new functionality fea-

tures, in fact, their financial result is affected by market cyclical effects such as

stock market developments, volatility and federal funds rate. Avanza states, in

their annual report, that the previous year 2018 was characterized by political un-

certainty and that SIX return index (an index representing all of the stocks on the

Swedish market) decreased by 8% and that the number of transactions rose by

15% and the revenue by 8%, compared to the previous year [2017].4

It therefore lies inAvanza’s interest to knowhowmuchpower they have in control-

ling their own revenue stream, i.e. is there still room for internal innovations and

new product features that could potentially further raise the commission revenue,

or does it, to some extent, only depend on exogenous factors? In other words, how

much can they affect their revenue stream, and to what extent?

1.3 Economic Theory

Although no previous studies within the area was found, some key insights were

gained by studying previous literature. The trading activity of a population is a

complex phenomenon comprised of several factors both physiological, such as in-

vestor overconfidence, and systematic, such as calendar effects. The following sub

sections further investigates some of these factors.

1.3.1 Trading Volume

It is hard to predict future trading volumes, i.e. the number of shares being trans-

acted at a certain time point, because of its complexity and dependency on human

behaviour. In general, the trading volume is driven by at least two forces; changes

in heterogeneity of beliefs and the disposition effect.5 Summarized, these empir-

4Avanza Bank Holding AB, Årsredovisning 2018, pp. 49-505Shefrin, A Behavioral Approach to Asset Pricing

2

ically supported theories states that the investors overconfidence about their val-

uation and trading skills can explain the high observed trading volume.6

Since these drivers are relatively subjective and hard to measure, from a quanti-

tative approach, the psychological and individual factors of human being, such as

overconfidence, personal beliefs, irrationality, are disregarded in this study.

1.3.2 Factors

Focusing on more numerically obtainable factors, several stand out.

Several studies have shown that there exists a positive correlation between the

price changes of stocks and trading volume in financial markets.7, 8, 9 A study ex-

amining potential factors affecting trading volume in European markets found

the asymmetric price-volume relation in over 70 percent of the analyzed stocks.10

Based on these studies, one could assume that the price is of great importance

when analyzing trading volume. Another study that analyzed the daily trading vol-

ume volume of the Swedish stockmarket found that themarketplace of the stock,

contrary to belief, was irrelevant for explaining the volume, while the shareholder

structure, free float and the number of outstanding shares in a company were

relevant.11

Perfect information availability is a key aspect of a perfect competition, therefore

it is reasonable to assume that it affects the trading volume as well, since, for

instance, positive information might persuade investors to buy a stock and vice

versa. Economic Policy Uncertainty (EPU) is an index developed by researchers

to generally estimate the uncertainty about fiscal or monetary policies, tax regu-

lations, regime changes and uncertainty over electoral outcomes, in other words

—everything that might raise the uncertainty regarding economic and policy out-

comes. One study found that the EPU-index impacts the markets and that it has

6Statman, Vorkink, and Thorley, “Investor Overconfidence and Trading Volume”7Ying, “Stock Market Prices and Volumes of Sales”8Cornell, “The Relationship between Volume and Price Variability in Future Markets”9Clark, “A Subordinated Stochastic Process Model with Finite Variance for Speculative Prices”10Batrinca, Hesse, and Treleaven, “Examining drivers of trading volume in European markets”11Sevelin, “Swedish Stock market: Explaining trade volumes in single stocks”

3

a positive correlation with stock price volatility and reduced investment.12

The transaction cost is another factor to consider. One study found that that trad-

ing volume is measurably responsive to changes in transaction costs.13

Clearly, there are several factors that could help to explain the trading volume —

even calender days affect it.14

1.4 Goal and Purpose

In conclusion, the change in trading volume is complex and depends on several

factors. The challenge is to find an explanation of the behaviour of Avanza’s clients

trading activity in order to answer the question whether the effects on the trading

activity are exogenous or endogenous, i.e. can Avanza affect their customers trad-

ing volume — or does it mainly depend onmacroeconomic, financial and political

factors?

In summary, this thesis aims to answer the following:

1. Does uncertainty, in a financial, political and macroeconomic sense, along

with market indicators such as stock prices and volatility indices, impact

Avanza’s customers trading activity?

(a) Which factors should be studied and why?

2. What are the possible relationships between the factors and what do they

look like?

(a) Can a model, given certain a significance level, explaining the fluctua-

tions of the commission revenue be formulated?

3. How should possible relationships be implemented?

(a) What can Avanza do with this information to increase their revenue?

12Baker, Bloom, and Davis, “Measuring Economic Policy Uncertainty”13Epps, “TheDemand for Brokers’ Services: TheRelation between Security Trading Volume and

Transaction Cost”14Sakalauskas and Krikščiūnienė, “The Impact of Daily Trade Volume on the day-of-theweek

Effect in Emerging Stock Markets”

4

If the stated questions above are answered and assuming that there exists possi-

ble relationships, Avanza will have more information about their customers com-

pared to other competitors in the same business. Therefore, the information can

be seen as a tool for competitiveness.

The competitiveness for Avanza’s business can, for example, be analyzed trough

Michael Porter’s Five Forces-model, where an increase in a company’s competi-

tiveness possibly also implicates an increasing of the revenue. The five forces in

Porter’s model are:

1. Competition in the industry

2. Power of customers

3. Power of suppliers

4. Threats of substitutes

5. Potential of new entrants in the industry

By analyzing each and one of the forces, the purpose of the project can further be

clarified. The first aspect of rivalry, the competition in the industry, depends on

how diversified the company’s offer is, compared to other companies in the same

business. The power of customer and suppliers represents customers ability to

decrease prices on the output and the suppliers ability to bargain for increased

prices. Threats of substitutes can be a rivalry if there exists substitutes on the

market that can replace the service and goods provided by the company. Last

but not least, the potential of new entrants entering the market also affects the

competitiveness for the analyzed company.15

Given that the research questions are answered, Avanza will have more informa-

tion on what the trading volume depends on. Therefore, they can differentiate

their offer which possibly will lead to a stronger position relative to the competi-

tors within the industry. The power of customers can decrease as the knowledge

about Avanza’s customers trading behaviour increases, since the negotiation abil-

ity is usually stronger for the part with the most information.

The power of suppliers, threats of substitutes and potential new entrants in the

15Chappelow, Porter’s 5 Forces

5

industrywill probably remain the same, since the informationmainly concerns the

customers behaviour, which mainly affects the customers and the rivalry within

the industry.

1.5 Methodology

By examining relevant data, such as stock price, volatility, the EPU-index etc., the

purpose is to answer whether these chosen parameters are relevant in explaining

the trading volume. Future research can then apply these parameters without

having to investigate if they are relevant. Further, the result could help to more

thoroughly define the company’s ability in manipulating its revenue stream if it

mostly depends on exogenous factors.

The suggested statistical tool is multiple linear regression modeling. Extensively

used within finance, regression is a statistical method to establish of the relation-

ship between variables. In multiple regression analysis, a relationship between

the response variable y, i.e. the variable of interest, and a set of predictor vari-

ables x = [x1, x2, ..., xp] is examined.

By fitting the model with the response variable and the plausible relevant factors

affecting the response variable a multiple linear model explaining the response,

to some degree, can be obtained.

The literature used in the section of presenting the regression analysis is mainly

Introduction toLinearRegressionAnalysisbyMontgomery, Peck, andVining.16

1.6 Stakeholder Analysis

To investigate the stakeholders for this project, a stakeholder analysis is made

where both internal stakeholders within the company as well as external stake-

holders are included. The stakeholders can both be affected and affect the project,

depending on which stage the project is in and what result comes with it.

First of all, one internal stakeholder is the management of Avanza, since it lies in

16Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis

6

their interest to investigate whether the trading volume depends on exogenous or

endogenous behaviour. The management influences the implementation of the

result, and therefore need to be informed whether any conclusions can be made.

The result of this degree project can therefore be used as a tool for themanagement

to analyze and evaluate the organization.

Second of all, the supervisors at Avanza can be seen as internal stakeholders. On

one hand, the supervisors provides the project with data and knowledge which

shows the importance and influence that the supervisors have on the project, while

on the other hand, the project can supply the supervisors with information re-

garding the trading volume. This information is interesting for the supervisors at

Avanza since the result can be further investigated and presented to the manage-

ment of the company.

The external stakeholders does not affect the project because of the lack of aware-

ness regarding this project. One possibly external stakeholder is the management

in other companies. Even though the analyzed data is given by Avanza, there

might be a general pattern representing investors all over the world and therefore

it could also be useful in the management in other companies. Another possibly

external stakeholder is Avanza’s customers. The customers can be affected of the

project depending on what the result indicates and whether any implementations

are made, but the customers will not affect the project.

1.7 Scope

Some aspects, that possibly could affect the outcome if they were examined, have

been excluded in this report.

First, this study is geographically limited to Avanza’s customer that mainly reside

in Sweden. Therefore, the conclusions can not be applied to trading activity be-

haviour in general. Second, the seasonal variation of the financial market have

not been taken into consideration in this report. For example, it has been shown

that there is exists a seasonal variation which affects Avanza’s trading volume (see

figure 1.1). 17

17Johanna Kull and Nicklas Andersson, Tips inför sommarbörsen

7

Figure 1.1: Plot of OMX against months for the time period 2000-2017

Furthermore, other factors that possibly could affect the trading volume and that

have been left out are the development of rents, taxes and inflation. For example,

taxes and rents affect the disposable income for the households18 and therefore

also the amount consumers can trade.

The inflation can affect the trading volume. If there is an increasing inflation dur-

ing a specified period of time, themoneywill be worth less which leads to the same

conclusion as for the rents and taxes, that the consumers will have less money to

trade and therefore trade less than usual.

Moreover, the time frame that this thesis is delimited to is from the 3rd of January

to the 29th of April 2019. This specified time period could affect the output of the

project, but since Avanza was founded in year 1999,19 this aspect does possibly not

have a great impact on the output.

18Carlgren,Hushållens inkomster19Avanza©,History

8

2 Regression Analysis

Regression analysis is a statistical technique for investigating and modeling the

relationship between variables.20 The word regression means ”a return to a pre-

vious and less advanced or worse state, condition, or way of behaving”, and may

be the most widely used statistical technique.21

The variables of interest are called response and predictor variables. The response

variable is often denoted as y and represents the observed outcome, given the pre-

dictor variable(s) x. When investigating a response variable that may be related

to several regressors, a multiple regression model is appropriate.

2.1 Multiple Regression

The multiple regression model is given by the equation

y = Xβ + ϵ (1)

where the response y consists of a vector of the response variables for each ob-

servation. The error ϵ consists of a vector of the errors for each observation. The

estimation coefficients, the ”betas”β, consists of a vector with beta coefficients for

each predictor, where β0 represents the intercept and p the number of predictor

variables. X is a matrix consisting of the predictor variables.

X =

1 x11 x12 . . . x1p

1 x21 x22 . . . x2p

......

.... . .

...

1 xn1 xn2 . . . xnp

, β =

β0

β1

...

βp

, y =

y1

y2...

yn

, ϵ =

ϵ1

ϵ2...

ϵn

(2)

By estimating the betas, given the data, the fitted model produces a line along the

data points through the equation

20Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis pp. 121Definition of “regression” from the Cambridge Advanced Learner’s Dictionary & Thesaurus

©Cambridge University Press

9

y = Xβ (3)

where y represent the fitted values and β the estimated beta coefficients.

2.2 Ordinary Least Squares (OLS)

The aim of the estimation of the beta coefficients is to make it as accurate as pos-

sible. The OLS-method is a popular and widely used method for obtaining the

estimates β. OLS minimizes the sum of the squares of the differences between

the observations yi and the projected straight line and can be used to produce the

estimators of the multiple regression model in Equation (1). The least-squares

function is given by

S(β) =n∑

i=1

ϵ2i = ϵTϵ = (y−Xβ)T (y−Xβ) (4)

Developing the right hand side of the equation through multiplication and tak-

ing the derivative with respect to β and setting it equal to zero to find the mini-

mum, the least-squares normal equations stated inmatrix notation can be written

as

XTXβ = XTy (5)

Or equivalently, by solving the normal equations, as

β = (XTX)−1XTy (6)

Equation (6) is called the least squares estimation equation.

The Gauss-Markov Theorem states that the OLS estimator β is the best linear

unbiased estimator (BLUE). In other words, β has the smallest variance in the

class of all unbiased estimators that could be produced as linear combinations of

10

the data.22

2.3 OLS Assumptions

When fulfilled, the OLS assumptions allow the OLS method the create the best

possible estimates, best in the sense of having smallest variances. According to the

Gauss-Markov theorem, OLS produces the best estimators when the assumptions

presented down below hold. Further, the coefficients estimates β converge to the

actual population parameters when the sample size increases to infinity and given

that the assumptions are fulfilled.

2.3.1 Strict Exogenity

The randomerrors ϵ are assumed to havemean zero,meaning that the value of one

error does not depend on the value of any other regressor. This consequence fol-

lows directly from the strict exogenity assumption and can be expressed as

E[ϵ|X] = 0 (7)

Equation (7) implies that the independent predictor x is not dependent on the

dependent variable y, i.e. an exogenous variable is able to influence the system

without being influenced by it.

2.3.2 Homoscedasticity

Besides having a zero mean, the errors ϵ are assumed to have the same unknown

variance σ2 in each observation, i.e. the variance is constant for each observation.

The opposite is named heteroscedasticity and implies that the variance changes

for different observations, which reduces the precision of the OLS generated esti-

mators.22Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis. pp. 587-588

11

Investigation of the residual plots is useful for controlling the OLS assumptions

and often recommended for other reasons as well, such as detecting outliers.23

Homoscedasticity can be investigated by plotting the residuals against the fitted

values yi. By plotting the externally studentized residuals ti against the corre-

sponding fitted value, the behaviour of the variance can be observed.

2.3.3 No Autocorrelation

The errors ϵ are assumed to be uncorrelated between the observations.

E[ϵiϵj|X] = 0 (8)

The opposite holds if, for instance, the error term of one observation is positive

and that it systematically increases the probability that the following error is posi-

tive (positive correlation). Autocorrelation is common in the context of time series

data, i.e. measurements over a certain amount of time.

If autocorrelation exists the OLS estimator is still unbiased but not BLUE and the

usual OLS standard errors and test statistics are no longer valid.

2.3.4 Normality

Thenormality assumptions assumes that the errors are normally distributed given

the regressors, i.e.

ϵ|X ∼ N (µ, σ2In) (9)

The OLSmethod does not require that the normality assumption is fulfilled, how-

ever fulfillment of the assumption enables the usage of hypothesis testing, gener-

ating reliable confidence and prediction intervals.

By constructing a normal probability plot of the residuals the normality assump-

tions can be checked. The cumulative normal distribution is constructed such that23Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis. pp. 136

12

it will plot as a straight line against the externally studentized residuals, ranked

in increasing order. Since the errors are assumed to be normally distributed, the

desired observation should show the points as close as possible to the straight

line.

2.3.5 Multicollinearity

When multicollinearity occurs it causes lower precision of the OLS estimators.

Perfect correlation occurs when one of the variables changes by a fixed propor-

tion, the other also changes by the same fixed proportion. This indicates that the

two variables are linearly dependent. Since the OLS method cannot distinguish

the perfectly correlated variables a high enough correlation causes problem, i.e.

multicollinearity. One could imaginemulticollinearity as a two dimensional plane

residing on the data points. If the data points are not orthogonal the plane will be

unstable or ”wiggly” and therefore vary greatly.

Since the matrix XTX consists of the dimensions p × p (p being the number of

predictor variables) and the jth column of theXmatrix is given byXj, the matrix

can be represented asX = [X1,X2, ...,Xp].24 Multicollinearity is said to exist when

the vectors X1,X2, ...,Xp are linearly dependent, i.e. if there is a set of constants,

t1, t2, ..., tp not all zero, such that

p∑j=1

tjXj = 0 (10)

Variance Inflation Factors (VIF) One way of detecting multicollinearity is

to investigate the linear dependency among the regression variables, since the re-

gressors are the columns of the Xmatrix. By investigating the correlation matrix

C = (XTX)−1 , multicollinearity can be detected. The diagonal elements of the

correlation matrix are called variance inflation factors, since they provide an in-

dex measuring the inflation of the variance of an estimated regression coefficient

caused by collinearity.

24Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis. pp. 286

13

Large VIF values associated with the regression coefficients indicate that they are

poorly estimated. What is meant by large values, 5 and 10 are often used based on

practical experience.25 The VIF’s can be calculated by the following formula.

VIFj = Cjj = (1−R2j )

−1 (11)

Condition Number Another way of detecting multicollinearity is by measur-

ing the spread of the eigenvalues of XTX, i.e. λ1, λ2, ..., λp. The condition number

K is defined as the fraction of the largest and lowest eigenvalue.

K =λmaxλmin

(12)

K > 100 indicates a problem with multicollinearity and K > 1000 indicates a se-

vere problem with multicollinearity. This is because the eigenvalues of a p × p

matrix A, that are given by the equation |A− λI| = 0, represent the characteristic

roots of the matrix. Hence, small values of the roots indicate near-linear depen-

dencies in the data and vice versa.

2.4 Residual analysis

Plotting residuals is a powerful technique for detecting outliers and estimating the

overall behaviour of the model. Autocorrelation can be detected through residual

plots where the plots displays residual versus time.

Residuals are defined as

ei = yi−yi, i = 1, 2, ..., n (13)

where yi is the observed value and yi is the corresponding fitted value. The resid-

ual can therefore be seen as the deviation between the data and the fit. Moreover,

the residual is also a measure of the variability in the response yi. If any assump-

25Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis pp. 296

14

tions about the residuals are wrongfully made, it will be shown in the residual

analysis.

The approximate average variance of the residuals is given by∑ni=1(ei − e)2

n− p=

∑ni=1 e

2i

n− p=

SSresn− p

= MSres (14)

The residuals have n− p degrees of freedom, which makes the non-independence

of the residuals negligible as long as the number n of residuals is not small relative

to the number of coefficients β.

2.4.1 Studentized Residuals

Studentized Residuals is a way of improving the residual scaling by dividing ei

by the exact standard deviation of the ith residual. The residual can be written

as

e = (I−H)y = (I−H)ϵ

where the hat matrix can be written as

H = X(XTX)−1XT

using that y = Xβ + ϵ. Thus the covariance matrix of the residuals is

Var(e) = Var(I−H)ϵ = σ2(I−H)

Since thematrix (I−H) generally is not diagonal, the residuals have different vari-

ances and are also correlated. The variance of the ith residual is thus Var(ei) =

σ2(1−hii), where hii is the ith diagonal element in the matrixH. Now, by estimat-

ing σ2 byMSRes, the studentized residuals are given by

ri =ei√

MSRes(1− hii)(15)

15

For i, where i = 1, 2, ..., n, and√

MSRes(1− hii) is the standard deviation, i.e.√Var(ei).

Since hii is the measure of the location of the ith point in the x space, Var(ei) be-

comes smaller for the points xi that lies relatively far away from the center of the x

space. This in turn means that the studentized residuals becomes larger for these

points. Thus, analyzing studentized residuals makes it easier to detect influential

points which is where the violations of the basic assumptions are more likely to

occur.

2.4.2 PRESS Residual

PRESS is generally regarded as a measure of how well a regression model will

perform in predicting new data. The definition of the PRESS residual is

e(i) =ei

1− hii

, i = 1, 2...n (16)

Equation (16) shows that the PRESS-residual is the ordinary residual weighted

with the diagonal element from the hat matrix. Taking the denominator into con-

sideration, a large value on hii implicates a small denominator which makes the

deleted residual large. PRESS-residuals with small values are generally desired

for a model.

2.4.3 R-Student

Anothermethod to detect an outlier is theR-student. Compared to the studentized

residual, R-student uses an external scaling instead of the internal which is used

in the estimate of the mean square residual.

S2i =

(n− p)MSRes − e2i /(1− hii)

n− p− 1(17)

If an observation is influential, the estimation of the variance S2 will differ sig-

nificant fromMSRes. This will make the denominator close to zero and therefore

16

increase the residual. Now the R-student residual is given by

ti =ei√

S2i (1− hii)

(18)

2.4.4 Partial Regression and Partial Residual Plots

An effective way to check the adequacy of the fit of the regressionmodel is tomake

a residual plot. A partial regression plot is used in order to study the marginal re-

lationship between the regressor given the other variables in the model. The plot

shows if the assumption about the specified relationship between the response

and the regressor variables has been made correctly and can also be useful for

providing information about a variable that is not currently in the model. Com-

paring each of the regressor variables with the response, the partial regressor plot

shows the marginal relation for each of the regressors.

2.5 Leverage and Influence

Occasionally, certain data points that negatively influences themodel appear. These

points are defined as influential points. The leverage point, point A in the left plot

in Figure 2.1, depicts an usual x coordinate, but a y coordinate that seems to be

located on the regression line. TheA-point in the right plot displays an influential

point with both an unusual x and y coordinate. As the name tells, the influential

point influences the regression line to move away from the rest of data, i.e. from

its more natural trajectory.

Figure 2.1: Left: leverage point. Right: influential observation

17

A small subset of data can negatively influence the model to such a degree that

the estimates of the beta coefficients mostly rely on the subset data instead of the

majority mass of the data. Since the regression model should be a construct of

all of the data, these situations must be avoided. By finding and analyzing these

points their impact can be determined and understood relative to the end regres-

sion model, and even removed if they are ”bad”.

2.5.1 Cook’s Distance

The influence of one point can be measured with respect to its location in the x

space and its response. By removing one point from the sample Cook’s distance

can measure its influence and this can be used as a deletion diagnostic.

Di =(y(i) − y)T (y(i) − y)

pMSres(19)

where y(i) is obtainedwhen removing the ith observation from the data set. Points

with largeDi values greatly influences the beta estimates, whereDi > 1 are consid-

ered being large. Therefore, it is recommended to eliminate data points exceeding

the cutoff-ratio. It is important to remember that the cutoff-ratio only is a recom-

mendation and also based on the sample size n. However, for larger samples sizes,

the specified cutoff-ratio makes more sense.

2.5.2 DFFITS

Similar to Cook’s distance, DFFITS, difference in fit or difference in fitted y’s with

and without each data point, also measures the influence of one point and is given

by

DFFITSi =yi − y(i)√S2(i)hii

(20)

where yi is the fitted value including the point i and y(i) is the fitted value excluding

the point i. The factor S2(i) is the R-student estimate of the mean square residual

18

shown in Equation (17) and hii represents the diagonal element i from the hat

matrix.

The suggested cutoff-ratio is given by |DFFITSi| >√p/n. The DFFITS is a

measure of the number standard deviations the fitted value changes if the i:th

observation is removed from the model. It essentially a useful measure when de-

tecting both leverage and prediction error.

2.6 Variable Selection

Oftentimes, there are several regressor variables to choose from, making it diffi-

cult to pinpoint the likely few important ones. Variable selection is the procedure

of finding an appropriate subset of regressor for the model. Further, variable se-

lection is the most common procedure for solving the problems of multicollinear-

ity.26

2.6.1 Null Hypothesis

As mentioned in Section 2.3.4 the normality assumptions open up the possibil-

ity of using hypothesis testing and generating reliable confidence and prediction

intervals. The standard test uses the following hypotheses

H0 : β1 = β10 H1 : β1 = β10 (21)

where β10 is set to 0, and the goal is to conclude whether the regression coeffi-

cient β1 is non-zero. Since the errors are normally distributed with expected value

zero, the observations are also normally distributed with expected valueXβ, since

E[y] = E[Xβ + ϵ] = E[Xβ] + E[ϵ] = Xβ. If H0 : β1 = β10 = 0 cannot be rejected

the linear relationship between the predictor variable x and the response y either

does not exist, is not important for explaining the response or does not have a

linear relationship with the response.

26Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis. pp. 328

19

2.6.2 t-test

The t-statistic is the ratio of the departure of the estimated value of a parameter

from its hypothesized value to its standard error. By assuming that Y is a random

variable following a normal distributionwithmean µ and variance σ2 we have Y ∼N (µ, σ2) → Z = Y−µ

σ∼ N (0, 1), implying that Z2 (Z being the standard normal

variable with 1 degree of freedom) follows a X 2 distribution, i.e. Z2 ∼ X 21 . Thus,

the square of a standard normal variable is a X 2 random variable with one degree

of freedom. If the null hypothesis given in Equation (21) is true, the t statistic t0

follows a tn−2 distribution.

t0 =β1 − β10√MSres/Sxx

(22)

The degree of freedom for t0 is the same as MSres. Using Equation (22) the ob-

served value t0 can be compared to the upper α/2 percentage point of the tn−2

distribution (tα/2,n−2). The t statistic for the null hypothesis is given by

t0 =β1√

MSres/Sxx

=β1

se(β1)(23)

since β10 = 0 from the Equation (21). If |t0| > tα/2,n−2, then the null hypothesis is

rejected.

2.6.3 F-test

The F-statistics shows the relationship between two independent variables where

both of the variables have the χ2 distribution with different degrees of freedom.

For example, if we introduce two variables X and Y where X ∼ χ2v and Y ∼ χ2

n.

Thus, the ratio is now given by

X/v

Y /n∼ Fv,n (24)

which is the F distribution with v and n degrees of freedom. This shows that the

20

ratio of two independent random variables that follow a χ2 distribution, follows

an F distribution.

Therefore, the F test can be used to test the null hypothesis. In this case, the null

hypothesis is that all of the regression coefficients are equal to zero, implying, if

the hypothesis is not rejected, that the model has no predictive capability. If the

null hypothesis is true, the following holds

MSR

MSres∼ F1,n−2 (25)

since MSR = SSR/1 has one degree of freedom and MSres = SSres/(n − 2) has

(n− 2) degrees of freedom.

F0 =MSR

MSres(26)

The null hypothesis is rejected if F0 > Fα,1,n−2, where α is the given level of signif-

icance.

2.6.4 All Possible Regressions

As stated earlier, it is desirable to choose a subset of the candidate regressors.

By fitting various combinations of the regressors and comparing them based on

certain criteria, models with a certain setup of parameters can be selected.

All possible regressions is one method of achieving this. By fitting the model with

the regressors one by one and comparing them with each other and choosing the

best ones, based on some criteria, the best regressionmodel can be selected.

Assuming that the intercept β0 is included in each model, there is a total of 2p

models to evaluate. One obvious drawback with the method is the exponentially

growing number of models based on p, for an instance 210 equals to 1024models,

however modern day computers and software have made the procedure relatively

simple to use.

21

2.6.5 Backward Elimination

Backward Elimination is a step-wise regression method used in situations where

all possible regressions can be burdensome computationally. This method begins

with the model including all p candidate regressors. The corresponding t-test or

F-test is computed for each of the regressor. Comparing these t-tests or F-tests to

a pre-selected value, called tleave for example, the regressor corresponding to the

smallest value of the t-test or F-test is removed from the model.

This procedure is repeated k times until the model contains a number of p − k

regressors, where each of the corresponding test statistic value is greater than the

cutoff value.

2.6.6 Cross Validation

In situations when there is no new fresh data to validate the model, tools as cross

validation can be used. The data set is split into two parts, one is the estimation

part, and the other is the prediction part. A new regression model is build, based

on the estimation data, and then validate the model with the prediction data. If

the original data is collected within a time frame, one can use time as the basis

when splitting the data set into these two components. However, the basis which

can be chosen arbitrarily has a probability of not stressing the model enough, and

therefore it would be preferable to use several estimation and prediction-sets to

improve the validation technique.

2.6.7 R-squared and Adjusted R-squared

To measure the overall adequacy of the model, one can use the coefficient of de-

termination called ”R squared”. The proportion of the variance in the dependant

variable that is predictable from the independent variable(s) is given by R2 . In

other words, R2 measures how close the observed data are to the fitted regression

line.

The residuals sum of squares (SSres), the discrepancy between the data and the es-

timatedmodel, divided by the total sum of squares (SStot), the squared differences

22

of each observation from the overall mean, minus 1 gives us R-squared.

R2 =Explained VariationTotal Variation

= 1− SSresSStot

∈ [0, 1] (27)

However, theR2-measure has some limitations, one of thembeing that it increases

when more predictors are added to the model. The modified version of the mea-

surement, the adjusted R squared R2adj, only increases when the added predictors

improves the model.

R2adj = 1− SSres/(n− p)

SStot(n− 1)(28)

Thus, highR-values are desirablewhenmeasuring the overallmodel adequacy.

2.6.8 Mallow’s Cp Statistic

Mallow’s Cp addresses the issue of overfitting, i.e. when a model contains more

parameters than can be justified by the data. The statistic is

Cp =SSresMSres

− n+ 2p (29)

where n denotes the number of observations and p the number of parameters. By

plotting Cp against p, the models showing lowest Cp-values and which also are

closest to the line Cp = p are the best. Figure 2.2 displays a graph of the statistic

and indicates that model C, with four parameters, although above the line Cp = p

compared to model A, is better than A, since it is below A, thus representing a

model with lower total error.

23

Figure 2.2: A Cp plot (fromMontgomery, Peck, and Vining (Introduction to Linear RegressionAnalysis) pp. 336)

Mallow’s Cp can be used as one of the criteria in the all possible regression proce-

dure for checking the model adequacy.

2.6.9 Akaike Information Criterion (AIC) and Bayesian Information Crite-rion (BIC)

Originating from the AIC, BIC is simply an extension of AIC. Acting as criteria,

for instance in the all possible regression procedure, these criteria help to choose

the best predictors by penalizing the addition of regressors as the sample size in-

creases.

Since some information almost alwayswill be lost during statisticalmodelingwhen

using themodel to represent the process, themodelswith the least loss of informa-

tion are preferred. AICmeasures the entropy of themodel, i.e. the expected infor-

mation of the model, were the lowest AIC values are most desirable and BIC mea-

sures the posterior probability of a model being true, meaning similar to AIC that

lower BIC-values are preferred. The most significant difference between these

criteria is their size of the penalty —BIC penalizes model complexity more heav-

ily.

Simply put, AIC and BIC consist of a ”measure of fit” and a ”complexity penalty

part”. The likelihood functionL, the plausibility of a value for the parameter given

some data, represents the goodness of fit while p, the number of parameters, rep-

24

resents the complexity.

AIC = −2 ln (L) + 2p = n ln(SSres

n

)+ 2p (30)

BIC = −2 ln (L) + p lnn = n ln(SSres

n

)+ p lnn (31)

Summarized, these criteria ask whether more regressors should be added to the

model, since SSres cannot increase when the number of regressors is increased,

there exists a trade-off between the goodness of the fit of the model and the com-

plexity of the model.

2.7 Transformations

When starting the regression analysis, the usual assumption about linearity is of-

ten made. With this assumption, the following needs to be satisfied:

1. The model errors have mean zero, constant variance and uncorrelated.

2. The model errors follows a normal distribution.

3. The form of the model is correct.

If these assumptions are incorrect, a transformation of the data needs to be done in

order to find a relationship between the regressor and the regressor variables.

2.7.1 Box-Cox

One suchmethod is called the Box-Coxmethod. Thismethod usesmaximum like-

lihood to find the power of λ to use in the transformation yλ. Because of the dis-

continuity when λ approaches zero, the equations are given by:

y(λ) =

yλ−1λzλ−1 , λ = 0;

zlny, λ = 0;(32)

25

Where z = ln−1[1/n∑n

i=1ln yi] is the geometric mean value of the observations

yi.

The Jacobian of the transformation and the divisor zλ−1 are related when convert-

ing y into y(λ) which ensures the residual sum of squares with different values of

the power λ to be comparable.

26

3 Data

By choosing the most popular stock and volatility indices, alongside well estab-

lished EPU indices, the idea was to include a mix of local and international mea-

sures on price and volatility changes and financial, political and economic un-

certainty. The equity and volatility indices were accessed through the Thomson

Reuters Eikon software. The data was transferred in an xlsx format into the

programming language R. The relevant information, such as Exchange Date and

Close, was extracted and aggregated.

Since indices put simply are, for example, aggregated stock prices, the relevant

information is whether or not an index has increased, since the nominal value

does not say much. Therefore, the daily returns of the indices are calculated by

dividing their respective closing prices pn for day n, with their previous price, i.e.

pn−1. Subtraction by one presents it percentage.

Daily Return =pnpn−1

− 1 (33)

Thus, an aggregated data frame consisting of thematching dates and daily returns

of the equity and volatility indices and the EPU indices was created.

However, since some of the rows of the date frame consisted of empty values,

because of indices notmatching opening exchange dates, these rows were omitted

from the data frame, i.e. rows that consisted of at least one N/A-element were

deleted. Therefore, the final data strongly depends on there being available and

consistent daily data for each index.

3.1 Trading Volume

For listed shares the trading volume can be defined as the total number of traded

shares over the day, i.e. the total number of all shares that changed hands, and in

our case —the total number of all of Avanza’s clients shares that changed hands.

Since each trade, generally, results in a commission fee, aggregating the commis-

sion revenue for each day produces an index representing the trading volume per

27

day, i.e. the revenue generated from the commission fees are in theory perfectly

correlated with Avanza’s customers activity.

3.2 Equity

Some of the stock indices used are described in this section while all of the vari-

ables, including the volatility and EPU indices, can be found in Table 3.1.

The Dow Jones Industrial Average DJI represents the stocks of 30 large and well-

knownU.S. companies covering all industries with the exception of transportation

and utilities. The stocks are selected by editors of The Wall Street Journal. The

index is price weighted.

The S&P 500 Index SPX represent over 500 large U.S. companies and captures

approximately 80% of available market capitalization.

Alongside SPX and DJI, the NASDAQ Composite Index IXIC is one of the three

most followed indices in the U.S. stock market. The index contains data of infor-

mation technology companies and includes over 3.300 common equities listed on

the exchange.

The S&P/TSXComposite Index GSPTSE represents the Canadian benchmark index

through about 250 companies and roughly 70 percentmarket capitalization on the

Toronto Stock Exchange.

The OMX Stockholm 30 Index OMXS30 is the Stockholm Stock Exchange’s lead-

ing share index. This index consists of the 30 most actively traded stocks on the

Stockholm Stock Exchange.

The OMXSSCPI Index consists of all Small Cap companies listed on NASDAQ OMX

Stockholm Exchanges. The group of Small Cap companies includes companies

whose shares have a market value of less than 150 million euro.

28

3.3 Volatility

The CBOE Volatility Index VIX represents the options on the Chicago Board Op-

tions Exchange (CBOE), i.e. options betting against S&P 500, that can be used to

hedge against volatility spikes. In other words, one could say the index measures

public concern —hence the nickname of the index: ”the fear index”.

The CBOE DJIA Volatility Index VXD, similar to VIX, is based on the prices of

options, but unlike VIX, options betting against DJI. Hence, it measures the in-

vestors near time expectancy (30-days) of the stock volatility.

TheEuro STOXXVolatility Index V2TX is based on the real time option prices in or-

der to reflect the market expectations of both short-term and long-term volatility.

Themeasurement is constructed by taking the square root of the implied variance

across all options of a given time to expiration.

3.4 Economic Policy Uncertainty (EPU)

EPU is an index of the frequency of relevant selected words in several news outlets

and there are several EPU indices to choose from country wise. Assuming that

investors are interested in following their investments in the respective country,

some indices aremore relevant than others. For example, as of now, out of the ten

most owned stocks by customers between the ages of 18 and30, seven are Swedish,

two are American and one is Canadian. Because of the geographical limitation to

the Swedish market, the American and the British indices were chosen since both

USA and UK influences the Swedish news.

The American EPU index consists of data from several different newspapers, for

example USA Today which is a national paper, but also smaller local ones. The

AmericanEPU index is constructed from three primary sources; ”economic/economy”,

”uncertain/uncertainty” and ”legislation/deficit/regulation/congress/federal re-

serve/white house”. If any of these three groups of words exist in the paper, the

index will rise. There are both monthly and daily indices of the EPU, where the

index ordered on a daily basis is chosen.27

27Nick Bloom, Scott R. Baker and Steven J. Davis, US Daily News Index

29

The British EPU index is used for investigating British policy-related economic

uncertainty. Both the American and the British index counts the number of ar-

ticles containing any of a pre-defined number of words, which for the British in-

dex is represented by ”policy”, ”tax”, ”spending”, ”regulation”, ”Bank of England”,

”budget” and ”deficit”.28

The Swedish index is constructed in the same way as for the American and British

indices,29 but because of the lack of data regarding this index, it had to be excluded

from further study. The data of the Swedish EPU index could only be generated

on a monthly basis, which is the reason why it was excluded since it significantly

reduced the number of observations in our study.

Variable Description

Commission Revenue Avanza’s aggregated daily commission revenue

DJI Dow Jones Industrial Average, equity index

SPX Standard & Poor’s 500, equity index

OMX Swedish stock market, equity index

IXIC Nasdaq Composite, equity index

NDX Nasdaq 100, equity index

SIX All listed Swedish companies, equity index

GSPTSE Canadian equivalent to the S&P 500, equity index

VIX CBOE Volatility Index, volatility index

VXN 30-day market expectations of the NDX, volatility index

V2TX Euro STOXX, volatility index

VXD CBOE DJIA, volatility index

GSPTXLV Composite High Beta, volatility index

EPU U.S. American economic policy uncertainty index

EPU U.K. British economic policy uncertainty index

FRED Federal Reserved Economic Data, volatility index

OMXSSCPI Stockholm Small Capital Price Index, volatility index

OMXSPI Stockholm Price Index, volatility index

TOPX Tokyo Stock Exchange Price Index, volatility index

STOXX Price weighted, equity index

Table 3.1: Description of the regressor variables

28Nick Bloom, Scott R. Baker and Steven J. Davis, UK Daily News Index29Nick Bloom and Davis, Sweden Monthly EPU Index

30

4 Result

First, the full model with p = 19 parameters and n = 1594 observations is fitted,

producing the first unprocessed linear regression model on the form in Equation

(3). Then, a thorough residual analysis is performed were potential outliers are

identified and analyzed. Third, the need of a transformation is investigated. If a

transformation is performed, a new residual analysis and a potential transforma-

tion are once investigated and performed. If the need for a transformation does

not exist, a variable selection is conducted and the final models are selected.

4.1 Residual Analysis

After fitting the full model a thorough residual analysis again is performed. Plot-

ting the residuals, information about potential outliers and their behaviour is given.

The standardized residuals in Figure 4.1 show that themajority of residuals reside

in the acceptable area, i.e. betweenminus and plus two, while some observations,

such as observation 1323, 952 and 1039, do not. These observations outside the

red lines are potential outliers and should be further analyzed.

0 500 1000 1500

−2

02

4

Studentized Residuals

Observations

Res

idua

ls

1234567891011121314

15

16171819

20

2122

2324

25262728

293031323334

3536

37

3839404142434445464748

49505152535455565758596061626364

656667686970717273747576777879

80818283848586878889909192

9394

9596

97

98

99

100101

102103104105106107108109110111112113114

115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146

147148149150151152153154155156157158159160161162163

164

165

166

167168169170171172173174175176177

178179180181182183184185186187

188189190191192193194195196197

198199200201

202203204205206

207208209210211212

213214215216217218219220221222223224225226

227228229230231232233234235236237238239240241

242

243

244245246247248249250251252253254255256257258

259260261262263264

265266267268269270271272273274275276277278279280281282283284285286287

288289290291292293294295296297298299300301302

303304305306307308309310311312313314

315

316

317

318

319320321

322323324325326327328

329

330331332333334335336337338339340341

342

343344345

346347348349

350351352353354355356

357

358359360

361362363364365366367368369370371372373374375376

377

378379380381382

383

384385386387388

389390

391392393394395396397398399400401402403404405406407408409

410411412

413414415416417418419420421422423424

425426427

428429

430431432433434435436437438439440441442

443

444445446447448449450451452453454455456

457458459

460

461462463464465466467468469470471472473474475476477478479480

481

482483484485

486487488489490491492493494495496497498499500501502

503

504505506507508509510511512513514515516517518519520521522523

524

525526527

528529530531532533534535

536537538539540541542543544545

546547548

549

550551552553

554

555556557558559560561562

563564

565566567

568

569570

571572573574575576577578579580581582583584585586587588589590591592593594595

596

597598599600601602603604605606607

608609610

611

612

613614615616

617

618619620621

622623

624625626

627628629630631632633634635636637638639

640

641642

643644645646647

648649650

651652653

654

655656657658659

660661

662663

664665

666

667668669670671672673674

675

676677678679680681

682683

684

685

686

687688689

690691

692

693694

695696

697

698699700701702703704705706707708709

710711712713714715

716

717

718719720721722723724725

726

727

728

729730

731732733734

735736

737738739740741742

743744745746747748

749750751752

753

754

755756757758759760761

762

763

764765

766767

768769

770

771

772773

774775776777778779

780781782783

784785786787788789790791

792793794

795796

797798799800801802803

804

805

806

807

808809810

811

812

813814815816

817

818819

820821822823

824825826

827828829830831

832

833834835836

837

838839

840

841842

843

844

845

846847848

849

850

851852

853

854

855856

857858859

860861862863864

865866

867

868

869870871

872873

874875876877878879880

881

882883884885886887888

889890

891

892

893894

895

896897

898899900901902903904905906

907

908909910

911

912

913914915916917918

919920921922923924

925926927928929930931932933934

935936937

938

939940

941

942943

944945

946947948

949950

951

952

953954

955

956957958959960961962963964965966967968969970971972973

974975976977

978979980981982983984985986987988989990991992993994

99599699799899910001001

1002100310041005100610071008

100910101011

1012101310141015

1016

1017

10181019

1020

102110221023102410251026102710281029103010311032103310341035

1036

1037

1038

1039

1040

1041

1042104310441045104610471048

104910501051105210531054

105510561057

105810591060

1061

106210631064

106510661067

10681069

10701071

1072

1073

10741075107610771078107910801081

10821083108410851086108710881089

1090

109110921093109410951096

109710981099110011011102110311041105

11061107

1108110911101111111211131114111511161117111811191120

1121

1122

112311241125112611271128

112911301131113211331134113511361137

1138

1139114011411142

11431144114511461147114811491150115111521153115411551156

1157

11581159

1160

11611162

1163116411651166

116711681169117011711172117311741175117611771178

117911801181

118211831184

1185118611871188

118911901191119211931194

1195

11961197119811991200120112021203

1204120512061207

1208

1209121012111212

1213

12141215121612171218121912201221122212231224122512261227

1228122912301231123212331234

123512361237123812391240124112421243124412451246

12471248124912501251125212531254

1255125612571258

12591260

12611262

1263

1264126512661267126812691270

1271

127212731274127512761277

1278

12791280

1281

128212831284128512861287

1288

1289129012911292

12931294

129512961297

1298

1299

13001301

1302

1303

1304

1305

13061307

1308130913101311

1312

13131314

1315131613171318

131913201321

1322

1323

1324

13251326

1327

132813291330

13311332133313341335133613371338

1339

134013411342134313441345134613471348134913501351

13521353

1354

13551356

1357

135813591360136113621363

1364

1365

136613671368

1369

1370

13711372

13731374137513761377

137813791380

138113821383138413851386138713881389139013911392139313941395139613971398139914001401140214031404140514061407140814091410141114121413

1414

14151416141714181419

14201421

1422142314241425142614271428142914301431143214331434

1435143614371438

143914401441

1442144314441445144614471448

1449145014511452

1453145414551456

145714581459146014611462146314641465

1466

1467

146814691470147114721473147414751476

1477

1478

1479

1480148114821483

1484

1485

1486

1487

148814891490

1491

1492

1493

1494

14951496

149714981499150015011502150315041505

1506

150715081509151015111512

1513

151415151516

151715181519152015211522

1523

1524

1525

1526

152715281529153015311532153315341535

15361537

1538

153915401541

1542154315441545

1546

15471548154915501551155215531554155515561557

15581559

1560

15611562156315641565

1566

15671568156915701571157215731574

1575157615771578157915801581

158215831584

158515861587

158815891590159115921593

1594

Figure 4.1: Studentized residuals plotted against the observations

31

Plotting the PRESS-residuals, the observations that are associated with large hii

will usually have large PRESS residuals, and generally be high influence points.

Figure 4.2 indicate that the observations with large PRESS-residuals are the same

as the ones with large residuals in Figure 4.1.

The observations large PRESS-residual values indicate that they are likely to be

observations were the model fits reasonably well, but does not provide good pre-

dictions of fresh data. Also, the sum of the PRESS-residuals, displayed in the title

as close to 6.53e+26, is relatively large, which is not desired.

0 500 1000 1500

0e+

004e

+12

8e+

12

PRESS = 6.53530058415464e+26

Observations

Val

ues

12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364

656667686970717273747576777879808182838485868788899091929394

9596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460

461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616

617618619620621622623624625626627628629630631632633634635636637638639640

641642

643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752

753754755756757758759760761

762

763

764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795796797798799800801802803804

805806807808809810

811

812813814815816

817

818819820821822823

824825826827828829830831

832833834835836837

838839840841842

843

844

845

846847848849

850

851852853

854

855856857858859860861862863864865866867868869870871872873874875876877878879880881882883884885886887888889890891892893894895896897898899900901902903904905906907908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943944945946947948949950951

952

953954955956957958959960961962963964965966967968969970971972973974975976977978979980981982983984985986987988989990991992993994

995996997998999100010011002100310041005100610071008100910101011101210131014101510161017101810191020102110221023102410251026102710281029103010311032103310341035103610371038

1039

1040

10411042104310441045104610471048104910501051105210531054105510561057

10581059106010611062106310641065106610671068106910701071

1072

107310741075107610771078107910801081108210831084108510861087108810891090

109110921093109410951096109710981099110011011102110311041105110611071108110911101111111211131114111511161117111811191120

1121

11221123112411251126112711281129113011311132113311341135113611371138113911401141114211431144114511461147114811491150115111521153115411551156115711581159116011611162116311641165116611671168116911701171117211731174117511761177117811791180118111821183118411851186118711881189119011911192119311941195119611971198119912001201120212031204120512061207120812091210121112121213121412151216121712181219122012211222122312241225122612271228122912301231123212331234

123512361237123812391240124112421243124412451246124712481249125012511252125312541255125612571258

12591260

12611262

1263

12641265

1266

126712681269127012711272127312741275127612771278127912801281128212831284128512861287

12881289129012911292

1293

1294129512961297

1298

129913001301

1302

1303

1304

130513061307

1308130913101311

1312131313141315131613171318131913201321

1322

1323

1324

1325132613271328132913301331133213331334133513361337133813391340134113421343134413451346134713481349135013511352135313541355135613571358135913601361136213631364136513661367136813691370137113721373137413751376137713781379138013811382138313841385138613871388138913901391139213931394139513961397139813991400140114021403140414051406140714081409141014111412141314141415141614171418141914201421

14221423142414251426142714281429143014311432143314341435143614371438143914401441144214431444144514461447144814491450145114521453145414551456145714581459146014611462146314641465146614671468146914701471147214731474147514761477

1478

14791480148114821483

1484

148514861487148814891490

1491

1492

1493149414951496149714981499150015011502150315041505

1506

150715081509151015111512151315141515151615171518151915201521152215231524152515261527152815291530153115321533153415351536153715381539154015411542154315441545154615471548154915501551155215531554155515561557

15581559

156015611562156315641565

1566

156715681569157015711572157315741575157615771578157915801581158215831584

1585158615871588158915901591159215931594

Figure 4.2: PRESS-residuals plotted against the observations

4.2 Leverage and influential points

To extend the analysis of potential outliers and influential observations, bothCook’s

distance and DFFITS are investigated.

Cook’s distance shows three observations with large Di-values, indicating that

they influence the beta estimations (see Figure 4.3). However, the observations

Di-values do not exceed the recommended cutoff-value of Di > 1. Therefore,

Cook’s recommends us to not delete them.

32

0 500 1000 1500

0.00

0.10

0.20

0.30

Obs. number

Coo

k's

dist

ance

lm(Courtage ~ . − Date)

Cook's distance

1322

1323

952

Figure 4.3: Cook’s distance plotted against the observations

Compared to Cook’s distance, DFFITS displays several observations that are well

above the recommended threshold, the most extremes ones being 952 and 1196

(see Figure 4.4). DFFITS indicates that these observations should be deleted given

the√

p/n threshold.

0 500 1000 1500

−1.

00.

00.

51.

01.

52.

0

DFFITS

Index

dffit

s

123456789

1011121314

15

16171819

20

2122

23

2425262728

293031

3233

34

35

36

37

3839404142434445464748

49

5051525354555657585960

6162636465666768697071727374757677787980818283

8485868788899091929394

95969798

99

100

101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130

131132133134135136137138139140141142143144145146147148149150

151152153154155156157158159160161162163164

165

166167168169170171172173174175176177

178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220

221

222223224225226227228229230231232233234235236237238239240

241242

243

244

245246247248249250251252

253

254

255256257258259260261262263264

265

266267268269270

271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313

314315316

317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357

358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427

428

429430431432433434435436437438439440441442443

444445446447448449450451452453454455456

457

458459460

461

462463464465

466

467468469470471472473474

475476477478479480481

482

483484485486

487488489490491492493494495496497498499500501502503

504505506507508509510511512513514515516517518519520521522523

524

525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554

555556557558559560561562

563564

565

566567568

569

570

571

572573574575576577578579580581582

583584585586587588589590591592593594595

596

597598599600601602603604605606

607608609

610

611

612

613614615616

617

618

619620621622623624625626

627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675

676

677678679680681682683684685686687688689690691

692

693

694695696697

698

699700701702703704705706707708709710711712713714715716

717718719720721722723724725

726

727

728

729

730731732733734735736737738739740741742743744745746747748749750751

752

753754755756757758759760761

762

763764

765766767

768

769770771772773774775776777778779

780

781782783784785786787788789790791792793794795796797798799800801802803804805806

807

808809810811812813814815816817818819820821822823

824

825826827828829830831832833834835836837

838839840

841842843

844

845

846847848849850851852853854

855856857858

859860861862863864

865

866867868869870871872873874875876877878879880881882883884885886887888889890

891

892893894895

896897898899900901902903904905906

907

908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943944945946947948949950951

952

953954955956

95795895996096196296396496596696796896997097197297397497597697797897998098198298398498598698798898999099199299399499599699799899910001001

1002100310041005100610071008100910101011101210131014101510161017101810191020102110221023102410251026102710281029103010311032103310341035

1036

10371038

1039

10401041

104210431044104510461047104810491050105110521053105410551056105710581059106010611062

1063

106410651066106710681069107010711072

1073

1074

10751076107710781079108010811082108310841085108610871088108910901091109210931094109510961097109810991100110111021103110411051106110711081109111011111112111311141115111611171118

11191120

1121

11221123112411251126112711281129113011311132113311341135113611371138

1139

114011411142114311441145114611471148114911501151115211531154115511561157115811591160

116111621163

11641165116611671168116911701171117211731174117511761177117811791180118111821183118411851186118711881189119011911192119311941195

1196

119711981199120012011202120312041205120612071208

1209121012111212121312141215121612171218121912201221122212231224122512261227122812291230123112321233123412351236123712381239124012411242124312441245124612471248124912501251125212531254125512561257125812591260126112621263126412651266

126712681269127012711272127312741275127612771278

1279

1280128112821283128412851286128712881289129012911292

1293

12941295129612971298

1299130013011302

1303

1304

13051306130713081309131013111312131313141315131613171318131913201321

1322

1323

13241325

1326

1327

1328

13291330133113321333133413351336133713381339

134013411342134313441345134613471348134913501351

13521353

1354

135513561357

1358

13591360136113621363136413651366136713681369

1370

1371137213731374137513761377

137813791380

138113821383138413851386138713881389139013911392

1393139413951396139713981399140014011402140314041405140614071408140914101411141214131414141514161417141814191420142114221423142414251426142714281429143014311432143314341435143614371438143914401441144214431444144514461447144814491450145114521453145414551456145714581459146014611462146314641465146614671468146914701471147214731474147514761477

1478

147914801481148214831484

1485

1486

1487

1488148914901491

1492

14931494

1495

14961497149814991500150115021503150415051506

1507150815091510151115121513

1514

151515161517151815191520152115221523

1524

15251526

15271528152915301531153215331534153515361537

1538

1539154015411542154315441545154615471548154915501551155215531554155515561557155815591560156115621563156415651566

156715681569157015711572157315741575157615771578157915801581

1582

158315841585158615871588158915901591159215931594

Figure 4.4: DFFITS plotted against the observations

33

Since the deletion diagnostics Cook’s and DFFITS do not give the same recom-

mendation, another type of diagnostic is performed. The outlier and leverage di-

agnostics are displayed in Figure 4.5 and showobservations categorized as outliers

and (or) as having leverage. Clearly, the non-outlier observations that only pro-

vide leverage are desirable for the estimation procedure, while the outlier obser-

vations that contribute with leverage are not, and should be removed. However,

the outlier and leverage observations cannot be removed without an explanation,

i.e. are they bad data or just unusual perfectly explainable events?

9

15

20

23

24

283449

63

6580

84

95

99

100

144

151

165

166

198215

220

221

230

237

240

244

245

253

254

261

265

269270

296

297

322

349

358

364393

428

442

457

461

466

467

468

482

486503

511512

524565

569

571

573

596

611

617

618

633

676

692

693698

716

717

726

728

729

739752761

762

763764

773

780

781

801

807

824

843

844

845

851

852

854858

863

865

866867

868

869871

891

892

896907

920

951

952

9561000

1036

10371039

10401041

1063

1073

1074

10771118

11191120

1121

1138

1139

1143

114811491156

1161

1163

1169

11961208

12601266

1279

1280

1281

1282

1302

1304

1321

1322

1323

13241325

1328

13491353

1354

1355

1358

1369

1370

137713971477

1478

1479

1487

148814891490

1513

1515

1524

1527

1529

153815421582

1584

Threshold: 0.025

−5

0

5

10

0.0 0.2 0.4 0.6

Leverage

RS

tude

nt

Observation

normal

leverage

outlier

outlier & leverage

Outlier and Leverage Diagnostics for Courtage

Figure 4.5: All the observations classified as normal, leverage, outlier or outlier & leverage ob-servations

Themost significant outliers are gathered in Table 4.1 and analyzedmore in detail.

Connecting the observations to their respective dates and commission revenues,

reasonable explanations quickly add up.

34

Observation # Date ∆ Comission revenue Plausible explanation

952 2016-06-27 110.60% Brexit

1039 2016-11-09 56.05% Trump

1322 2018-02-05 30.21% Unexpected inflation

1323 2018-02-06 39.56% Unexpected inflation

Table 4.1: The most severe outliers and the date of their occurrences

Starting with outlier 952 and the events on its date 2016-06-27, Avanza’s cus-

tomers broke a record in the number of stock transactions during one day, with

Brexit acting as a catalyst.30 Secondly, regarding observation 1039 and on its

date 2016-11-09, Trump was elected as the president of the U.S. —resulting in

increased uncertainty and an overall decreased willingness to invest, increasing

Avanza’s commission revenue with over 50 percent.31 Finally, regarding the two

last observations, on both their dates 2018-02-05 and 2018-02-06, the largest

downfall on the stock exchange after BREXIT, took place, apparently caused by

a large downfall after unexpected high salaries numbers in the U.S. was presented

—indicating that the inflation might rise sooner than expected.32

In summary, these outliers are clearly unusual, however they are fully plausible

observations and should therefore not be removed from the data set since they

are not bad observations such as faulty measurements or incorrect recording of

data.

4.3 Transformations

After fitting the full model and conducting a thorough residual analysis, the need

for a transformation is investigated. By plotting the log-likelihood as a function of

λwith a 95% interval, Figure 4.6a show that λ is outside of the interval, indicating

a need for a transformation.30SvD, Rekordhandel på börsen efter brexit31Privata Affärer, Experterna spår kraftig börsnedgång32SvD, Största börstappet sedan Brexitomröstningen

35

−2 −1 0 1 2

−49

00−

4700

−45

00−

4300

λ

log−

Like

lihoo

d

95%

(a) Before transformation

−2 −1 0 1 2

−33

00−

3200

−31

00

λ

log−

Like

lihoo

d

95%

(b) After transformation

The value chosen for λ is 0.5 because of its appropriateness. After conducting the

same previous residuals analysis as before, the same outliers as in Table 4.1 are

identified by the residuals, Cook’s distance, DFFITS and the outlier and leverage

diagnostics. Further, after performing the box-cox method again, no indication

of a transformation is found, since λ = 1 is found in the 95% interval (see figure

4.6b).

4.4 Variable Selection

After identifying the outliers and performing eventual transformations, the task of

producing a finalmodel still exists. Through variable selection some finalmodel(s)

are produced.

4.4.1 All Possible Regressions

Proceeding with the all possible regressions analysis, criterion such R-squared,

adjusted R-squared, Mallow’s Cp and Schwartz’s BIC are chosen. The criterion

are displayed in Figure 4.6 were the black boxes on the x-axis display whether or

not the specific regressor variable is included for that specific model on the y-axis.

For example, observing the Mallow’s Cp plot, the model with the value Cp = 8

consists of the variables DJI, SPX, IXIC, VIX, V2TX and OMXSCCPI (excluding the

36

intercept), while the model with the value Cp = 17 only consists of the variables

DJI, SPX and V2TX.r2

(Int

erce

pt)

DJI

SP

XO

MX

IXIC

ND

XS

IXS

TOX

XG

SP

TS

EV

IXV

XN

V2T

XV

XD

GS

PT

XLV

EP

U_U

SE

PU

_UK

FR

ED

OM

XS

SC

PI

OM

XS

PI

TOP

X

0.013

0.017

0.022

0.026

0.029

0.031

0.033

0.034

0.035

0.036

R−squared

adjr2

(Int

erce

pt)

DJI

SP

XO

MX

IXIC

ND

XS

IXS

TOX

XG

SP

TS

EV

IXV

XN

V2T

XV

XD

GS

PT

XLV

EP

U_U

SE

PU

_UK

FR

ED

OM

XS

SC

PI

OM

XS

PI

TOP

X

0.012

0.015

0.02

0.023

0.026

0.027

0.028

0.029

0.03

0.03

Adjusted R−squared

Cp

(Int

erce

pt)

DJI

SP

XO

MX

IXIC

ND

XS

IXS

TOX

XG

SP

TS

EV

IXV

XN

V2T

XV

XD

GS

PT

XLV

EP

U_U

SE

PU

_UK

FR

ED

OM

XS

SC

PI

OM

XS

PI

TOP

X

27

23

17

12

9.1

8

7.2

6.9

6.6

6.4

Mallows Cp

bic

(Int

erce

pt)

DJI

SP

XO

MX

IXIC

ND

XS

IXS

TOX

XG

SP

TS

EV

IXV

XN

V2T

XV

XD

GS

PT

XLV

EP

U_U

SE

PU

_UK

FR

ED

OM

XS

SC

PI

OM

XS

PI

TOP

X

23

16

11

6.1

1.8

−2.4

−4.4

−4.7

−5.3

−6

Schwartzs Bayesian Information Criterion

Figure 4.6: All possible regressions, from top left to bottom right: R2, R2adj , CP & BIC

The first impression is that several of the regressor variables are left out inmany of

the models. For example OMX and both the EPU indices. On the other hand, some

variables seems to be dominant throughout the criteria such as DJI, SPX, IXIC and

V2TX.

Judging by the plots, the all possible regression procedure recommends a model

with 10 variables for maximizing R2 and R2adj values. For minimizing Mallow’s Cp

it recommends a model with 9 variables, and for minimizing BIC, a model with

37

only 1 variable.

4.4.2 Backward Elimination

Combining the all possible regressionsmethodof choosing variableswith the back-

ward eliminationmethod, amodel can be chosen. Remember that backward elim-

ination omits variables based on their t and F statistics. Backward elimination

creates a model with 9 variables, the same as the model minimizing the Mallow’s

Cp.

Intercept DJI SPX IXIC SIX VXN V2TX GSPTXLV OMXSPI TOPX

1286.4 11071.1 -13964.0 4664.9 -12151.6 246.9 174.3 -2097.5 11906.1 -724.2

Table 4.2: 9 variable model presented by backward elimination

However, by comparing the backward eliminationmodel (see Table 4.2)with, per-

haps, a model with lesser variables, the backward eliminationmodel seems some-

what over exaggerated. This is because the all possible regressions method show

that almost the same R2 and R2adj values can be achieved by omitting some vari-

ables. For example, by omitting two variables (OMXSPI and SIX) from the backward

elimination model with nine parameters, the R2adj only decreases with ∆R2

adj ≈−0, 00021. This is a relatively cheap trade-off for a less complex model.

4.4.3 Cross Validation

Having learned that removing somevariables from thebackward eliminationmodel

does not significantly damage the models overall performance, another useful

method for choosing variables is used.

The cross validation with k = 10 folds minimizes the root mean square error

(RMSE) andmeanabsolute error (MAE).ObservingFigure 4.7, themethod chooses

amodel with five variables. Compared to themodels containingmore or less vari-

ables, the model with five variables clearly differentiate itself from the other mod-

els. This model is desirable because of its low complexity and errors.

38

5 10 153.21

6497

e+12

3.21

6500

e+12

Index

mea

n.cv

.err

ors

Figure 4.7: 10-fold cross validation

The five variables are and their estimated coefficients are

(Intercept) DJI SPX IXIC V2TX OMXSSCPI

1285.332 11285.454 -15370.214 3715.533 200.584 -1871.988

Table 4.3: Final chosen model, proposed by 10-folded cross validation

In conclusion, the model chosen by ten-fold cross validation only differs∆R2adj ≈

−0, 00262 from the model chosen by all possible regressions R2 and R2adj, while

consisting of four less variables and lower mean error (compare index 5 and 9 in

Figure 4.7).

4.4.4 OLS-assumptions

Proceeding with the final model presented in Table 4.3, the OLS-assumptions are

checked.

As stated before, OLS assumes that the random errors have a mean of zero. Cal-

culating the mean of the fitted models residuals indicate a mean value near zero

(meanValue = -3.424928e-14 ≈ 0), indicating strict exogenity.

39

Homoscedasticity is controlled by plotting the residuals of the fittedmodel against

its fitted values (see Figure 4.8). The red line in the figure estimates the spread

of the observations. The majority of the observations seem to lie on a horizontal

band, which is positive, however, the red line is clearly bent upwards, both in its

tail and head, indicating heteroscedasticity. The outliers with leverage (see Table

2.3) are probably the cause of this.

1100 1200 1300 1400 1500

−50

00

500

Fitted values

Res

idua

ls

lm(sqrt(Courtage) ~ DJI + SPX + IXIC + V2TX + OMXSSCPI)

Residuals vs Fitted

9521039 1323

Figure 4.8: Plot of residuals versus fitted values

Further, OLS assumes that no auto-correlation should exist when estimating the

beta coefficients. The detection of auto-correlation can be displayed by plotting

the residuals against the time. Observing Figure 4.9 some kind of correlation

seems to exist, whether it is negative or positive correlation is hard to decide. But

since the residuals do not alternate signs rapidly in comparison to residuals with

similar sign appearing in clusters, they are probably positively correlated. There

also seems to exist an overall trend were the residuals increase with time. The

conclusion is nevertheless same, auto-correlation is detected and the assumption

is violated.

40

2012 2014 2016 2018

−60

0−

200

020

040

060

080

0

Observation order

Res

idua

ls

Figure 4.9: Auto-correlation plot

The assumption about a normal distribution is checked by observing Figure 4.10,

the grey area in the background with a wider ”tail” and ”head” is the acceptable

range for the sample to fulfill the normality assumption. Clearly, both the errors

tail and head are outside of the grey area, resembling a heavy-tailed distribution.

Thus, it is concluded that the normality assumptions are not fulfilled.

41

−1000

−500

0

500

1000

−2 0 2Theoretical

Sam

ple

Figure 4.10: Normality QQ-plot for checking normality distribution

Multicollinearity can be detected through the calculation of the VIF:s and the

eigenvalues. The VIF:s (see Table 4.4) exceed the recommended threshold val-

ues 5 and 10, indicating a case of multicollinearity for the choice of variables. The

eigenvalues, as well as the condition number, are also calculated. The condition

number well exceeds the recommended threshold 100 (K = 788.619) and is close

to threshold indicating severe multicollinearity, i.e. K = 1000. The condition

number confirms the VIF:s analysis of a case of severe multicollinearity.

DJI SPX IXIC V2TX OMXSSCPI

VIF 18.985775 41.715249 11.744280 1.490258 1.371106

Table 4.4: Variance inflation factors

TheOLS-assumptions, except formulticollinearity, are summarized inTable 4.5.

42

Value p-value Decision

Global Stat 109.184 0.000e+00 Assumptions NOT satisfied!

Skewness 1.434 2.312e-01 Assumptions acceptable.

Kurtosis 18.765 1.479e-05 Assumptions NOT satisfied!

Link Function 31.019 2.555e-08 Assumptions NOT satisfied!

Heteroscedasticity 57.966 2.665e-14 Assumptions NOT satisfied!

Table 4.5: OLS assumptions, level of significance = 0.05

Global Stat shows the relationship between the predictors and response and in-

dicates a non-linear relationship between at least one of the predictors versus the

response, since the p-value rejects the null hypothesis, i.e. p<0.05. The Skewness

is accepted, indicating that the distribution is neither positively nor negatively

skewed and that there is no need for transformation. Kurtosis is not satisfied,

indicating that the assumption of normality does not hold given the level of signif-

icance. A rejected Link Function indicates that an alternative form of the general-

ized linearmodel, such as logistic or binomial, should beused. Last, Heteroscedasticity

is the opposite of homoscedasticity, indicating that the errors do not have a con-

stant variance for each observation.

43

4.5 Final Model

In the end the best model, given the circumstances, is a model consisting of 5

variables with equity and volatility indices and no EPU indices on the form

√y = β0 + β1x1 + β2x2 + ...+ β5x5 (34)

More specifically the model consists of the coefficients in Table 4.6 leading to the

model on form (34) to be written as

√Daily aggregated commission revenue = 1286.4 + 11112 ∗ DJI

−14356 ∗ SPX + 4899.9 ∗ IXIC − 189.31 ∗ V2TX − 2093.6 ∗ OMXSSCPI(35)

The partial residual plots are displayed in Figure 4.11 and the estimates of the

coefficients and their limits in a 95 percent interval, alongside their p-values, are

presented in the Table 4.6

44

−0.005 0.000 0.005

−60

0−

200

200

600

DJI | others

sqrt

(Cou

rtag

e) |

oth

ers

952 1039 1323

1546

−0.004 −0.002 0.000 0.002 0.004

−60

0−

200

200

600

SPX | others

sqrt

(Cou

rtag

e) |

oth

ers

9521039

166

1490

−0.010 −0.005 0.000 0.005 0.010

−60

0−

200

200

600

IXIC | others

sqrt

(Cou

rtag

e) |

oth

ers

952 1039

1169

1490

−0.2 0.0 0.2 0.4 0.6

−60

0−

200

200

600

V2TX | otherssq

rt(C

ourt

age)

| o

ther

s

952 1039

1323

1325

−0.02 0.00 0.02 0.04

−60

0−

200

200

600

OMXSSCPI | others

sqrt

(Cou

rtag

e) |

oth

ers

9521039

617

952

Added−Variable Plots

Figure 4.11: Partial residuals plots

Coefficient Estimate Lower limit (5.0%) Upper limit (95%) p-value

Intercept 1285.903 1275.029 1296.777 2e-16

DJI 11806.870 6177.023 17436.72 4.11e-05

SPX -16824.578 -25022.71 -8626.443 5.98e-05

IXIC 4634.423 996.5248 8272.321 0.0126

V2TX 221.609 41.98532 401.2327 0.0157

OMXSSCPI -1246.238 -2779.493 287.0169 0.1112

Table 4.6: Coefficient confidence interval

45

5 Discussion

5.1 Methodology

As displayed in Table 4.6, the two most significant variables in the final model

are DJI and SPX. Since both are most certainly the largest and most popular stock

indices used within the financial market, it seems reasonable that they would cap-

ture some of the explanation, such as Great Britain leaving the E.U. or Trump

being elected as president.

Apart from the three equity indices, two volatility indiceswere included in the final

model. Although included, V2TX and OMXSCCPIwere not as significant as the equity

indices and further, more established volatility indices, such as the VIX, was en-

tirely excluded from the model (see Figure 4.6). This seems odd, since in general,

periods of increased volatility imply higher uncertainty and therefore a tendency

to invest less, while short periods of volatility spikes induces higher trading ac-

tivity. The latter, we believe, is somewhat shown in Avanza’s customers trading

behaviour. Based on the outlier analysis in Section 4.2, Avanza’s customers trad-

ing activity spiked during these unexpected events, as we would have predicted.

Again however, none of the volatility nor uncertainty indices were significant in

explaining these.

One possible explanation for the insignificance of the volatility and uncertainty in-

dices is lag. For example, extracting the four dates 2018-02-[02-05-06-07] and

their respective returns and commission revenue values, some observations can

be made (see Table 5.1). The U.S. EPU index more than doubles from the first to

the secondobservation, whileVIXand the commission revenue also increase. This

seems reasonable —a relevant event causes increased uncertainty and volatility,

which is noticed by Avanza’s customers, hence increasing their transaction activ-

ity and Avanza’s aggregated commission revenue. However, the following day the

U.S. EPU index falls substantially, while the transaction activity and VIX contin-

ues to increase, i.e. now the U.S. EPU index and Avanza’s commissions revenue

are negatively correlated as opposed to the day before.

One theory is that right at the moment when an unexpected event occurs both the

46

uncertainty indices and the investors trading activity correlate, but not after. The

day after the public learned about the unexpected inflation they maybe were not

so uncertain anymore, decreasing the EPU index, while the investors still traded,

causing a lag.

Observation # Date ∆ Comission revenue VIX EPU U.S.

1321 2018-02-02 8.41% 18.6% 22.9%

1322 2018-02-05 30.21% 32.2% 138%

1323 2018-02-06 39.56% 10.7% -36,5%

1324 2018-02-07 -43.25% -15.2% -37,0%

Table 5.1: The most severe outliers and the time of their occurrences

By our definition, unexpected means that the majority was not able to foresee it.

Hence, the surprised investors will try to trade as much as possible, as soon as

possible, to minimize their losses, while other investors might try to maximize

their returns given the time frame—increasing the overall trading activity during,

but also shortly after, the initial event has taken place.

On the other hand, the EPU index can also be flawed. Since it consists of key

words aggregated from major news outputs spread across a certain geographic

area, it depends on what gets written, and how much is written about it. Usually

the news will bombard with output weeks after special occurrences, potentially

increasing the uncertainty index although the uncertainty is gone.

In summary, there might exist a time delay between the unexpected event and the

investors activity. Further, the investors activity might continue for some time

after the event has become known to the public, which the uncertainty indices do

not reflect, while the commission revenue does. And vice versa, newspapersmight

write as much about events one week later as the day it happened, thus displaying

misleading correlation. But this still does not explain why the volatility index, that

is not built by the same method as the EPU indices, is not significant. Maybe the

adequacy of our model can act as a source of explanation.

47

5.2 Model Adequacy

The final model’s adjusted R-squared value is 0.01926, i.e. this model explains

approximately 1, 926% of the observed variance. The question of whether this is

a good or bad value is irrelevant, since it depends on how the variables are mea-

sured, whether any transformations have been done etc. and whether the model

can actually be trusted.

Avanza’s daily commission revenue is a time series, i.e. it is a series of data points

indexed in time order. When the response variable is a time series, much of the

response variable’s predictive power is often a result of its own history via lags,

differences and/or seasonal adjustments, which clearly seems to be the case here

as well (see Figure 4.9). Therefore, it is unwise to evaluate its predictive power

when we do not know what lies behind it.

Further, many of the OLS-assumptions are violated (see Figure 4.5). These as-

sumptions are vital for obtaining accurate estimations of theOLSbeta coefficients.

The presence of multicollinearity, a non-normal distribution, heteroscedasticity

and kurtosis all contribute to less accurate andmisleading beta estimations.

In summary, the model’s poor adequacy could be a potential source of explana-

tion when it comes to explaining certain factors, that in theory are significant for

explaining the trading volume, were ruled out as insignificant.

5.3 Previous studies

In general, there are numerous factors affecting individual trading behaviour. As

stated in the introduction, this has been shown by studies were factors such as

price changes, volatility and even uncertainty play a significant role. By that, it

seems as if the factors we have used in this study may be miss used, rather than

irrelevant, with regards to the investigation of their relationship with the response

variable.

One important distinction is the type ofmethod that we have used and the popula-

tion that was studied. First, regressionmodeling is a powerful and frequently used

statistical tool, especially within finance subjects, but almost no studies regarding

48

our subject using regression analysis were found. One reason might again be that

a time series analysis should be used instead of conducting a regression analysis,

mainly because of the lag and auto correlation. Secondly, usually when referring

to price and volume, the meaning of the concepts are the stock and its price and

volume. This report refers to several stock indices and Avanza’s aggregated com-

mission revenue. These two relationships are of course different, which is why the

established price-volume correlation between a stocks price and volume should

not be easily accepted here.

In conclusion, the studies found do neither strengthen nor weaken the results,

actually not much research about the fluctuating revenue streams of online stock

brokers and the cause of them has been found at all. However, the literature that

is used is still valuable for giving some insight into the mechanisms of investors

trading behaviour.

5.4 Future Work

There are several suggestionswhere to start (regarding) futurework. For example,

by using generalized linear models (GLM) one could surpass the assumptions of

no auto correlation and a normal distribution, since GLM does not assumes that

the errors are not auto correlated and can choose witch distribution to account for

in its modeling.

Moreover, a time series analysis would be recommended in order to analyze the

data with a more scientific and statistic approach to improve the reliability of the

outcome, since the data is taken at different time periods. The result given in

Figure 4.9 showed that the there is a possible trend in the residuals, which must

be accounted for.

5.5 Avanza

Because of the poormodel adequacy, the results cannot help to explain the overall

trading behaviour of their customers. The analyze of the competitiveness through

Porter’s five forces can therefore not be used in further implementation forAvanza.

49

Although some of their customer’s behaviour is known, such as taking the chance

to trade directly after unexpected events as when the Swedish bank Swedbank

was caught laundering money which led to an 89% increase of Avanza customers

buying Swedbank’s stock,33 not much fruitful comes out of it. Since these events

usually are unexpected, nobody can predict them.

Relevant information for Avanza might be information explaining both long and

short term behaviour of their customers. Perhaps, more in detail on how the cus-

tomers are affected by the seasons of the year, or how trading differs on Mondays

and Fridays etc. This could lead to key insights such as, for example, learning that

customers are more prone to invest in stocks during springs than autumns, or

that customers tend to save more of their salaries directly after receiving it, rather

than long after they have received. Information such as this could be used as pil-

lars when developing new products, or taken into account when streamlining, or

increasing themoney their customers save andperhaps used for forecasting future

earnings.

33Lans, Kraftig ökning av Swedbankaktieägare

50

6 Conclusion

In this study three main research questions were asked.

1. Does uncertainty, in a financial, political and macroeconomic sense, along

with market indicators such as stock prices and volatility indices, impact

Avanza’s customers trading activity?

2. What are the possible relationships and what do they look like?

3. How should the possible relationships be implemented?

Question two is answered in the final model were the possible relationships are

represented through a multiple linear regression model, and question three is an-

swered in Section 5.5.

The answer to question one is: ”we cannot say”. The hypothesis was that influen-

tial exogenous market factors such as political, economical and financial uncer-

tainty, stock price returns and volatility could help to explain Avanza’s customers

trading activity. However, scientifically the final model does not fulfill the basic

requirements for it to be reliable, hence the significance of its variables with re-

spect to the response variable, cannot be strengthened nor weakened.

51

References

[1] Avanza©. (). History, [Online]. Available: http://investors.avanza.se/

en/history. (accessed: May 28, 2019).

[2] Avanza Bank Holding AB, Årsredovisning 2018, Annual Report, 2018.

[3] Svenskt Kvalitetsindex, “Personlig service utmanar digitala tjänster,” Re-

search Report, Oct. 2018.

[4] Shefrin,H.,Abehavioral approach to asset pricing, 2nd ed. 2008, ch. 29.3,

pp. 499–500, ISBN: 978-0-12-374356-5. [Online]. Available: https : / /

www.sciencedirect.com/book/9780123743565/a-behavioral-approach-

to-asset-pricing.

[5] Statman, M., Vorkink, K., and Thorley, S., “Investor overconfidence and

trading volume,” The Review of Financial Studies, vol. Vol. 19, no. No.1,

pp. 1531–1565, Issue 4 2006. [Online]. Available: https://academic.oup.

com/rfs/article/19/4/1531/1580508.

[6] Ying, C. C., “Stock market prices and volumes of sales,” Econometrica, vol.

Vol. 34, no. No. 3, 1966. [Online]. Available: https://www.jstor.org/

stable/1909776?seq=1#metadata_info_tab_contents.

[7] Cornell, B., “The relationship between volume andprice variability in future

markets,” The Journal of Future Markets, vol. Vol. 41, no. No. 1, pp. 135–

155, 1981.

[8] Clark, P. K., “A subordinated stochastic process model with finite variance

for speculative prices,”Econometrica, vol. Vol. 1, pp. 303–316, Issue 3 1981.

[9] Batrinca, B., Hesse, C.W., and Treleaven, P. C., “Examining drivers of trad-

ing volume in europeanmarkets,” International Journal of Finance & Eco-

nomics, vol. Vol. 23, pp. 134–154, Issue 2 Apr. 2018.

[10] Sevelin, J., “Swedish stockmarket: Explaining trade volumes in single stocks,”

Bachelor’s Thesis, KTH Royal Institute of Technology, Jun. 2017. [Online].

Available: http://www.diva- portal.se/smash/get/diva2:1120590/

FULLTEXT01.pdf.

52

[11] Baker, S. R., Bloom, N., and Davis, S. J., “Measuring economic policy un-

certainty,” The Quarterly Journal of Economics, vol. Vol. 131, pp. 1593–

163, 4 Nov. 2016. [Online]. Available: http://www.policyuncertainty.

com/media/EPU_BBD_Mar2016.pdf.

[12] Epps, T. W., “The demand for brokers’ services: The relation between secu-

rity trading volume and transaction cost,” The Bell Journal of Economics,

vol. Vol. 7, no. No.1, pp. 163–194, Issue 4 1976. [Online]. Available: https:

/ / www . jstor . org / stable / 3003195 ? seq = 1 # metadata _ info _ tab _

contents.

[13] Sakalauskas, V. andKrikščiūnienė,D., “The impact of daily trade volume on

the day-of-theweek effect in emerging stock markets,” Information Tech-

nology and Control, vol. Vol. 36, no. No.1 A, 4 2007. [Online]. Available:

https://pdfs.semanticscholar.org/d32c/66eaadc7b6a111ba75cac01c386561b0c6ae.

pdf.

[14] Chappelow, J. (Apr. 2019). Porter’s 5 forces, [Online]. Available: https:

/ / www . investopedia . com / terms / p / porter . asp. (accessed: May 28,

2019).

[15] Montgomery, D. C., Peck, E. A., and Vining, G. G., Introduction to linear

regression analysis, 5th ed. 2012, ISBN: 978-0-470-54281-1.

[16] Johanna Kull and Nicklas Andersson, Ed. (Jun. 2018). Tips inför sommar-

börsen, [Online]. Available: https://blogg.avanza.se/avanza/avanzapodden-

tips-infor-sommarborsen/. (accessed: May 28, 2019).

[17] Carlgren, F. (Feb. 2019).Hushållens inkomster, [Online]. Available: https:

//www.ekonomifakta.se/fakta/ekonomi/hushallens-ekonomi/hushallens-

inkomster/. (accessed: May 28, 2019).

[18] Definition of “regression” from the cambridge advanced learner’s dictio-

nary & thesaurus ©cambridge university press.

[19] Nick Bloom, Scott R. Baker and Steven J. Davis, Ed. (2012). Us daily news

index, [Online]. Available: http : / / www . policyuncertainty . com / us _

daily.html. (accessed: May 28, 2019).

53

[20] Nick Bloom, Scott R. Baker and Steven J. Davis, Ed. (2012). Uk daily news

index, [Online]. Available: http : / / www . policyuncertainty . com / uk _

monthly.html. (accessed: May 28, 2019).

[21] Nick Bloom, S. R. B. and Davis, S. J. (). Sweden monthly epu index, [On-

line]. Available: http://www.policyuncertainty.com/sweden_monthly.

html. (accessed: May 28, 2019).

[22] SvD. (Jun. 2016). Rekordhandel på börsen efter brexit, [Online]. Available:

https://www.svd.se/rekordhandel- pa- borsen- efter- brexit. (ac-

cessed: May 28, 2019).

[23] PrivataAffärer, Ed. (Nov. 2016). Experterna spår kraftig börsnedgång, [On-

line]. Available: https://www.privataaffarer.se/articles/2016/11/

09/bors-risk-off-kraftiga-ras-spas-vid-trump-vinst-strateger/.

(accessed: May 28, 2019).

[24] SvD, Ed. (Feb. 2018). Största börstappet sedan brexitomröstningen, [On-

line]. Available: https://www.svd.se/borsen-dyker. (accessed: May 28,

2019).

[25] Lans, K. (Mar. 2019). Kraftig ökning av swedbankaktieägare, Placera, [On-

line]. Available: https://www.avanza.se/placera/redaktionellt/2019/

03/28/antalet-swedbankaktieagare-fordubblat.html. (accessed: May

28, 2019).

54

Appendices

List of Figures

1.1 Plot of OMX against months for the time period 2000-2017 . . . . . . . . . . 8

2.1 Left: leverage point. Right: influential observation . . . . . . . . . . . . . . . 17

2.2 A Cp plot (from Montgomery, Peck, and Vining (Introduction to Linear Regres-

sion Analysis) pp. 336) . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.1 Studentized residuals plotted against the observations . . . . . . . . . . . . . 31

4.2 PRESS-residuals plotted against the observations . . . . . . . . . . . . . . . 32

4.3 Cook’s distance plotted against the observations . . . . . . . . . . . . . . . . 33

4.4 DFFITS plotted against the observations . . . . . . . . . . . . . . . . . . . 33

4.5 All the observations classified as normal, leverage, outlier or outlier & leverage

observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.6 All possible regressions, from top left to bottom right: R2, R2adj , CP & BIC . . . . 37

4.7 10-fold cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.8 Plot of residuals versus fitted values . . . . . . . . . . . . . . . . . . . . . . 40

4.9 Auto-correlation plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.10 Normality QQ-plot for checking normality distribution . . . . . . . . . . . . . 42

4.11 Partial residuals plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

List of Tables

3.1 Description of the regressor variables . . . . . . . . . . . . . . . . . . . . . 30

4.1 The most severe outliers and the date of their occurrences . . . . . . . . . . . 35

4.2 9 variable model presented by backward elimination . . . . . . . . . . . . . . 38

4.3 Final chosen model, proposed by 10-folded cross validation . . . . . . . . . . 39

4.4 Variance inflation factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.5 OLS assumptions, level of significance = 0.05 . . . . . . . . . . . . . . . . . 43

4.6 Coefficient confidence interval . . . . . . . . . . . . . . . . . . . . . . . . 45

5.1 The most severe outliers and the time of their occurrences . . . . . . . . . . . 47

55

TRITA -SCI-GRU 2019:153

www.kth.se