Dr. Ka-fu Wong
ECON1003 Analysis of Economic Data
Chapter Thirteen: Linear Regression and Correlation

Transcript of Dr. Ka-fu Wong's slides

Page 1: Dr. Ka-fu Wong


Dr. Ka-fu Wong

ECON1003 Analysis of Economic Data

Page 2: Dr. Ka-fu Wong


GOALS

1. Draw a scatter diagram.
2. Understand and interpret the terms dependent variable and independent variable.
3. Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate.
4. Conduct a test of hypothesis to determine if the population coefficient of correlation is different from zero.
5. Calculate the least squares regression line and interpret the slope and intercept values.
6. Construct and interpret a confidence interval and prediction interval for the dependent variable.
7. Set up and interpret an ANOVA table.

Chapter Thirteen: Linear Regression and Correlation

Page 3: Dr. Ka-fu Wong


Correlation Analysis

Correlation Analysis is a group of statistical techniques used to measure the strength of the association between two variables.

A Scatter Diagram is a chart that portrays the relationship between the two variables.

The Dependent Variable is the variable being predicted or estimated.

The Independent Variable provides the basis for estimation. It is the predictor variable.

Page 4: Dr. Ka-fu Wong


Types of Relationships

Direct vs. Inverse
  Direct - X and Y increase together.
  Inverse - X and Y have opposite directions.

Linear vs. Curvilinear
  Linear - a straight line best describes the relationship between X and Y.
  Curvilinear - a curved line best describes the relationship between X and Y.

Page 5: Dr. Ka-fu Wong


Direct vs. Inverse Relationship

[Two illustrative scatter plots: Sales against Advertising with a positive slope (direct relationship), and Pollution Emissions against Anti-Pollution Expenditures with a negative slope (inverse relationship).]

Page 6: Dr. Ka-fu Wong


Example

Suppose a university administrator wishes to determine whether any relationship exists between a student’s score on an entrance examination and that student’s cumulative GPA. A sample of eight students is taken. The results are shown below

Student Exam Score GPA

A 74 2.6

B 69 2.2

C 85 3.4

D 63 2.3

E 82 3.1

F 60 2.1

G 79 3.2

H 91 3.8
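A minimal Python sketch of how one might draw the scatter diagram for these eight students (matplotlib assumed available; this code is an illustration, not part of the original slides):

    import matplotlib.pyplot as plt

    exam = [74, 69, 85, 63, 82, 60, 79, 91]   # entrance exam scores from the table above
    gpa  = [2.6, 2.2, 3.4, 2.3, 3.1, 2.1, 3.2, 3.8]

    plt.scatter(exam, gpa)                    # one point per student
    plt.xlabel("Exam Score")
    plt.ylabel("Cumulative GPA")
    plt.title("GPA vs. Exam Score")
    plt.show()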

Page 7: Dr. Ka-fu Wong


Scatter Diagram: GPA vs. Exam Score

[Scatter diagram: Cumulative GPA (2.00 to 4.00) plotted against Exam Score (50 to 95).]

Page 8: Dr. Ka-fu Wong


Possible relationships between X and Y in Scatter Diagrams

[Six example scatter plots of Y against X: (a) direct linear, (b) inverse linear, (c) direct curvilinear, (d) inverse curvilinear, (e) inverse linear with more scattering, (f) no relationship.]

Page 9: Dr. Ka-fu Wong


The Coefficient of Correlation, r

The Coefficient of Correlation (r) is a measure of the strength of the linear relationship between two variables.
It requires interval or ratio-scaled data.
It can range from -1.00 to 1.00.
Values of -1.00 or 1.00 indicate perfect and strong correlation.
Values close to 0.0 indicate weak correlation.
Negative values indicate an inverse relationship and positive values indicate a direct relationship.

Page 10: Dr. Ka-fu Wong


Formula for r

We calculate the coefficient of correlation from the following equivalent formulas:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}
  = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\big[n\sum_{i=1}^{n} x_i^2 - \big(\sum_{i=1}^{n} x_i\big)^2\big]\big[n\sum_{i=1}^{n} y_i^2 - \big(\sum_{i=1}^{n} y_i\big)^2\big]}}

Page 11: Dr. Ka-fu Wong


Perfect Negative Correlation (r = -1)

[Scatter plot of Y against X: the points lie exactly on a downward-sloping straight line.]

Page 12: Dr. Ka-fu Wong


Perfect Positive Correlation (r = +1)

[Scatter plot of Y against X: the points lie exactly on an upward-sloping straight line.]

Page 13: Dr. Ka-fu Wong


Zero Correlation (r = 0)

[Scatter plot of Y against X: the points show no relationship between the variables.]

Page 14: Dr. Ka-fu Wong


Strong Positive Correlation (0<r<1)

[Scatter plot of Y against X: the points cluster closely around an upward-sloping line.]

Page 15: Dr. Ka-fu Wong


Coefficient of Determination

The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).
It is the square of the coefficient of correlation.
It ranges from 0 to 1.
It does not give any information on the direction of the relationship between the variables.

Special cases:
No correlation: r = 0, r² = 0.
Perfect negative correlation: r = -1, r² = 1.
Perfect positive correlation: r = +1, r² = 1.

Page 16: Dr. Ka-fu Wong


EXAMPLE 1

Dan Ireland, the student body president at Toledo State University, is concerned about the cost to students of textbooks. He believes there is a relationship between the number of pages in the text and the selling price of the book. To provide insight into the problem he selects a sample of eight textbooks currently on sale in the bookstore. Draw a scatter diagram. Compute the correlation coefficient.

Book                 Pages  Price ($)

Intro to History 500 84

Basic Algebra 700 75

Intro to Psyc 800 99

Intro to Sociology 600 72

Bus. Mgt. 400 69

Intro to Biology 500 81

Fund. of Jazz 600 63

Princ. Of Nursing 800 93

Page 17: Dr. Ka-fu Wong


Example 1 continued

Scatter Diagram of Number of Pages and Selling Price of Text
[Scatter plot: Price ($), from 60 to 100, plotted against Pages, from 400 to 800.]

Page 18: Dr. Ka-fu Wong


Example 1 continued

Book                X (Pages)  Y (Price $)      XY        X^2      Y^2
Intro to History        500        84        42,000    250,000    7,056
Basic Algebra           700        75        52,500    490,000    5,625
Intro to Psyc           800        99        79,200    640,000    9,801
Intro to Sociology      600        72        43,200    360,000    5,184
Bus. Mgt.               400        69        27,600    160,000    4,761
Intro to Biology        500        81        40,500    250,000    6,561
Fund. of Jazz           600        63        37,800    360,000    3,969
Princ. of Nursing       800        93        74,400    640,000    8,649
Total                 4,900       636       397,200  3,150,000   51,606

Page 19: Dr. Ka-fu Wong


Example 1 continued

r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\big[n\sum x_i^2 - (\sum x_i)^2\big]\big[n\sum y_i^2 - (\sum y_i)^2\big]}}
  = \frac{8(397{,}200) - (4{,}900)(636)}{\sqrt{\big[8(3{,}150{,}000) - (4{,}900)^2\big]\big[8(51{,}606) - (636)^2\big]}}
  = 0.614

The correlation between the number of pages and the selling price of the book is 0.614. This indicates a moderate association between the variables.
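A small Python sketch, using only the Example 1 data, that reproduces r = 0.614 with the computational formula above (the variable names are mine, not the slides'):

    from math import sqrt

    pages = [500, 700, 800, 600, 400, 500, 600, 800]
    price = [84, 75, 99, 72, 69, 81, 63, 93]
    n = len(pages)

    sum_x  = sum(pages)
    sum_y  = sum(price)
    sum_xy = sum(x * y for x, y in zip(pages, price))
    sum_x2 = sum(x * x for x in pages)
    sum_y2 = sum(y * y for y in price)

    r = (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    print(round(r, 3))   # 0.614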

Page 20: Dr. Ka-fu Wong


EXAMPLE 1 continued

Is there a linear relation between number of pages and price of books?

Test the hypothesis that there is no correlation in the population. Use a .02 significance level.

Under the null hypothesis that there is no correlation in the population, the statistic

t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}

follows the Student t-distribution with (n - 2) degrees of freedom.

Page 21: Dr. Ka-fu Wong


EXAMPLE 1 continued

Step 1: H0: The correlation in the population is zero. H1: The correlation in the population is not zero.

Step 2: H0 is rejected if t>3.143 or if t<-3.143. There are 6 degrees of freedom, found by n – 2 = 8 – 2 = 6.

Step 3: To find the value of the test statistic we use:

t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} = \frac{0.614}{\sqrt{(1 - 0.614^2)/(8 - 2)}} = 1.905

Step 4: H0 is not rejected. We cannot reject the hypothesis that there is no correlation in the population. The amount of association could be due to chance.
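An illustrative Python check of Steps 2 through 4, assuming SciPy is available for the critical value (not part of the original slides):

    from math import sqrt
    from scipy import stats

    r, n = 0.614, 8
    t = r / sqrt((1 - r ** 2) / (n - 2))           # test statistic, about 1.905
    t_crit = stats.t.ppf(1 - 0.02 / 2, df=n - 2)   # two-tailed 2% level, about 3.143
    print(t, t_crit, abs(t) > t_crit)              # False: H0 is not rejected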

Page 22: Dr. Ka-fu Wong


Regression Analysis

In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).

The relationship between the variables is linear.

Both variables must be at least interval scale.

Page 23: Dr. Ka-fu Wong


Simple Linear Regression Model

Relationship Between Variables Is a Linear Function

Y_i = \beta_0 + \beta_1 X_i + \epsilon_i

where Y_i is the dependent (response) variable, X_i is the independent (explanatory) variable, \beta_0 is the Y intercept, \beta_1 is the slope (rise/run), and \epsilon_i is a random error.

\beta_0 and \beta_1 are unknown and are therefore estimated from the data.

Page 24: Dr. Ka-fu Wong


Finance Application: Market Model

One of the most important applications of linear regression is the market model.

It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market (Rm):

R = \beta_0 + \beta_1 R_m + \epsilon

where R is the rate of return on a particular stock and R_m is the rate of return on some major stock index.

The beta coefficient (\beta_1) measures how sensitive the stock's rate of return is to changes in the level of the overall market.
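A hedged Python sketch of estimating beta by regressing stock returns on market returns; the return series below are invented purely for illustration:

    from scipy import stats

    r_market = [0.010, -0.020, 0.030, 0.015, -0.010, 0.020]   # made-up market returns
    r_stock  = [0.015, -0.030, 0.040, 0.020, -0.018, 0.025]   # made-up stock returns

    fit = stats.linregress(r_market, r_stock)
    print(fit.slope)       # estimated beta: the stock's sensitivity to the market
    print(fit.intercept)   # estimated intercept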

Page 25: Dr. Ka-fu Wong


Assumptions Underlying Linear Regression

For each value of X, there is a group of Y values, and these Y values are normally distributed.

The means of these normal distributions of Y values all lie on the straight line of regression.

The standard deviations of these normal distributions are equal.

The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values.

Page 26: Dr. Ka-fu Wong


Choosing the line that fits best

The estimates are determined by drawing a sample from the population of interest, calculating sample statistics, and producing a straight line that cuts through the data.

The question is: Which straight line fits best?

Page 27: Dr. Ka-fu Wong

Choosing the line that fits best

Let us compare two lines fitted to the four points (1,2), (2,4), (3,1.5), and (4,3.2). The second line is horizontal (y = 2.5).

Sum of squared differences for the first line: (2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = 7.89
Sum of squared differences for the second line: (2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data. That is, the line with the least sum of squares (of differences) fits the data best.

The best line is the one that minimizes the sum of squared vertical differences between the points and the line.
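A small Python check of the two sums, assuming (as the squared differences above suggest) that the first line predicts 1, 2, 3, and 4 at x = 1, 2, 3, 4 and the second is the horizontal line y = 2.5:

    points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

    ssd_line1 = sum((y - x) ** 2 for x, y in points)     # first line, predicts y = x: 7.89
    ssd_line2 = sum((y - 2.5) ** 2 for x, y in points)   # horizontal line y = 2.5: 3.99
    print(ssd_line1, ssd_line2)                          # the smaller sum is the better fit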

Page 28: Dr. Ka-fu Wong


Choosing the line that fits best: the Ordinary Least Squares (OLS) Principle

Straight lines can be described generally by Y = b_0 + b_1 X.

Finding the best line with the smallest sum of squared differences is the same as solving

\min_{b_0, b_1} S(b_0, b_1) \equiv \sum_{i=1}^{n} \big[y_i - (b_0 + b_1 x_i)\big]^2

Let b_0^* and b_1^* be the solution of the above problem. Y^* = b_0^* + b_1^* X is known as the "average predicted value" (or simply "predicted value") of Y for any X.

Page 29: Dr. Ka-fu Wong


Coefficient estimates from the ordinary least squares (OLS) principle

Solving the minimization problem implies the first order conditions:

S(b_0, b_1) \equiv \sum_{i=1}^{n} \big[y_i - (b_0 + b_1 x_i)\big]^2

\frac{\partial S(b_0, b_1)}{\partial b_0} = \sum_{i=1}^{n} -2\big[y_i - (b_0 + b_1 x_i)\big] = 0

\frac{\partial S(b_0, b_1)}{\partial b_1} = \sum_{i=1}^{n} -2\big[y_i - (b_0 + b_1 x_i)\big] x_i = 0

Page 30: Dr. Ka-fu Wong


Coefficient estimates from the ordinary least squares (OLS) principle

Solving the first order conditions implies

b_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \big(\sum_{i=1}^{n} x_i\big)^2}
    = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}

b_0 = \frac{\sum_{i=1}^{n} y_i}{n} - b_1 \frac{\sum_{i=1}^{n} x_i}{n} = \bar{y} - b_1\bar{x}

Page 31: Dr. Ka-fu Wong


EXAMPLE 2 continued from Example 1

Develop a regression equation for the information given in EXAMPLE 1. The information there can be used to estimate the selling price based on the number of pages.

b_1 = \frac{8(397{,}200) - (4{,}900)(636)}{8(3{,}150{,}000) - (4{,}900)^2} = 0.05143

b_0 = \frac{636}{8} - 0.05143\left(\frac{4{,}900}{8}\right) = 48.0

Page 32: Dr. Ka-fu Wong


Example 2 continued from Example 1

The regression equation is:

Y* = 48.0 + .05143X

The equation crosses the Y-axis at $48. A book with no pages would cost $48.

The slope of the line is .05143. Each additional page costs about $0.05 or five cents.

The sign of the b value and the sign of r will always be the same.

Page 33: Dr. Ka-fu Wong


Example 2 continued from Example 1

We can use the regression equation to estimate values of Y.

The estimated selling price of an 800 page book is $89.14, found by

Y* = 48.0 + .05143X = 48.0 + .05143(800) = 89.14
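A short Python sketch tying Examples 1 and 2 together; it recomputes b0, b1, and the predicted price of an 800-page book from the sums used above (variable names are my own, not the slides'):

    pages = [500, 700, 800, 600, 400, 500, 600, 800]
    price = [84, 75, 99, 72, 69, 81, 63, 93]
    n = len(pages)

    sum_x, sum_y = sum(pages), sum(price)
    sum_xy = sum(x * y for x, y in zip(pages, price))
    sum_x2 = sum(x * x for x in pages)

    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # about 0.05143
    b0 = sum_y / n - b1 * sum_x / n                                 # about 48.0
    print(b0, b1, b0 + b1 * 800)                                    # prediction near 89.14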

Page 34: Dr. Ka-fu Wong


Standard Error of Estimate (denoted se or Sy.x)

It measures the reliability of the estimating equation.
It is a measure of dispersion.
It measures the variability, or scatter, of the observed values around the regression line.

s_e = s_{y \cdot x} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - y_i^*)^2}{n - 2}}
    = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - b_0 \sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i y_i}{n - 2}}

Page 35: Dr. Ka-fu Wong


Scatter Around the Regression Line
[Two illustrative plots: tight scatter around the line gives a more accurate estimator of the X, Y relationship; wide scatter gives a less accurate estimator.]

Page 36: Dr. Ka-fu Wong


Interpreting the Standard Error of the Estimate

s_e measures the dispersion of the points around the regression line.
If s_e = 0, the equation is a "perfect" estimator.
s_e is used to compute confidence intervals of the estimated value.

Assumptions:
Observed Y values are normally distributed around each estimated value of Y*.
Constant variance.

Page 37: Dr. Ka-fu Wong


Variation of Errors Around the Regression Line
[Illustration: at each value of X (e.g., X1, X2), the y values are normally distributed around the regression line, and for each x value the "spread" or variance around the regression line is the same.]

Page 38: Dr. Ka-fu Wong


Scatter around the Regression Line
[Plot of the dependent variable (Y) against the independent variable (X), showing the regression line Y = b0 + b1 X together with bands at plus or minus 1 s_e and 2 s_e: about 68% of the points lie within 1 s_e of the line and about 95.5% lie within 2 s_e.]

Page 39: Dr. Ka-fu Wong


Example 3 continued from Example 1 and 2.

Find the standard error of estimate for the problem involving the number of pages in a book and the selling price.

s_e = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - b_0 \sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i y_i}{n - 2}}
    = \sqrt{\frac{51{,}606 - 48(636) - 0.05143(397{,}200)}{8 - 2}}
    = 10.408
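A Python sketch of the same s_e calculation, computed directly from the residuals (illustrative only; not part of the original slides):

    from math import sqrt

    pages = [500, 700, 800, 600, 400, 500, 600, 800]
    price = [84, 75, 99, 72, 69, 81, 63, 93]
    b0, b1 = 48.0, 0.05143
    n = len(pages)

    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(pages, price))   # about 650.6
    se = sqrt(sse / (n - 2))
    print(round(se, 2))   # about 10.41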

Page 40: Dr. Ka-fu Wong


Equations for the Interval Estimates

Confidence interval for the mean of y:     y^* \pm t\, s_e \sqrt{h}

Prediction interval for an individual y:   y^* \pm t\, s_e \sqrt{1 + h}

where   h = \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

Page 41: Dr. Ka-fu Wong


Confidence Interval Estimate for Mean Response

[Illustration: confidence interval bands around the fitted line y* = b0 + b1 x_i over the range of X.]

The following factors influence the width of the interval: the standard error, the sample size, and the X value.

Page 42: Dr. Ka-fu Wong


Confidence Interval continued from Example 1, 2 and 3.

For books of 800 pages, what is the 95% confidence interval for the mean price? This calls for a confidence interval on the average price of books of 800 pages.

y^* \pm t\, s_e \sqrt{h} = y^* \pm t\, s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
  = 89.14 \pm 2.447(10.408)\sqrt{\frac{1}{8} + \frac{(800 - 612.5)^2}{3{,}150{,}000 - (4{,}900)^2/8}}
  = 89.14 \pm 15.31

Page 43: Dr. Ka-fu Wong


Prediction Interval continued from Example 1, 2 and 3.

For a book of 800 pages, what is the 95% prediction interval for its price? This calls for a prediction interval on the price of an individual book of 800 pages.

y^* \pm t\, s_e \sqrt{1 + h} = y^* \pm t\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
  = 89.14 \pm 2.447(10.408)\sqrt{1 + \frac{1}{8} + \frac{(800 - 612.5)^2}{3{,}150{,}000 - (4{,}900)^2/8}}
  = 89.14 \pm 29.72
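A Python sketch, assuming SciPy for the t value, that reproduces both the plus-or-minus 15.31 confidence half-width and the 29.72 prediction half-width at x = 800 (illustrative, not from the slides):

    from math import sqrt
    from scipy import stats

    pages = [500, 700, 800, 600, 400, 500, 600, 800]
    n, se, b0, b1 = len(pages), 10.408, 48.0, 0.05143
    x_bar = sum(pages) / n
    ss_x = sum((x - x_bar) ** 2 for x in pages)     # sum of (x_i - x_bar)^2

    x0 = 800
    y_hat = b0 + b1 * x0                            # about 89.14
    h = 1 / n + (x0 - x_bar) ** 2 / ss_x
    t = stats.t.ppf(0.975, df=n - 2)                # about 2.447

    print(t * se * sqrt(h))                         # half-width of the CI, about 15.31
    print(t * se * sqrt(1 + h))                     # half-width of the PI, about 29.72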

Page 44: Dr. Ka-fu Wong


Interpretation of Coefficients

1. Slope (b1): the estimated Y changes by b1 for each 1-unit increase in X.
2. Y-intercept (b0): the estimated value of Y when X = 0.

Page 45: Dr. Ka-fu Wong


Test of Slope Coefficient (b1)

1. Tests whether there is a linear relationship between X and Y.
2. Involves the population slope \beta_1.
3. Hypotheses: H0: \beta_1 = 0 (no linear relationship); H1: \beta_1 \neq 0 (linear relationship).
4. The theoretical basis is the sampling distribution of slopes.

Page 46: Dr. Ka-fu Wong


Sampling Distribution of the Least Squares Coefficient Estimator

If the standard least squares assumptions hold, then b1 is an unbiased estimator of \beta_1 and has population variance

\sigma_{b_1}^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{(n-1)s_x^2}

and an unbiased sample variance estimator

s_{b_1}^2 = \frac{s_e^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_e^2}{(n-1)s_x^2}

Bias = E(b_1) - \beta_1. "Unbiased" means E(b_1) - \beta_1 = 0.

Page 47: Dr. Ka-fu Wong


Basis for Inference About the Population Regression Slope

Let \beta_1 be a population regression slope and b1 its least squares estimate based on n pairs of sample observations. Then, if the standard regression assumptions hold and it can also be assumed that the errors \epsilon_i are normally distributed, the random variable

t = \frac{b_1 - \beta_1}{s_{b_1}}

is distributed as Student's t with (n - 2) degrees of freedom. In addition, the central limit theorem enables us to conclude that this result is approximately valid for a wide range of non-normal distributions and large sample sizes, n.

Page 48: Dr. Ka-fu Wong


Tests of the Population Regression Slope

If the regression errors \epsilon_i are normally distributed and the standard least squares assumptions hold (or if the distribution of b1 is approximately normal), the following tests have significance level \alpha:

1. To test either null hypothesis

H0: \beta_1 = \beta_1^*   or   H0: \beta_1 \le \beta_1^*

against the alternative

H1: \beta_1 > \beta_1^*

the decision rule is to reject H0 if

t = \frac{b_1 - \beta_1^*}{s_{b_1}} \ge t_{(n-2),\,\alpha}

Page 49: Dr. Ka-fu Wong


Tests of the Population Regression Slope

2. To test either null hypothesis

H0: \beta_1 = \beta_1^*   or   H0: \beta_1 \ge \beta_1^*

against the alternative

H1: \beta_1 < \beta_1^*

the decision rule is to reject H0 if

t = \frac{b_1 - \beta_1^*}{s_{b_1}} \le -t_{(n-2),\,\alpha}

Page 50: Dr. Ka-fu Wong


Tests of the Population Regression Slope

3. To test the null hypothesis

H0: \beta_1 = \beta_1^*

against the two-sided alternative

H1: \beta_1 \neq \beta_1^*

the decision rule is to reject H0 if

t = \frac{b_1 - \beta_1^*}{s_{b_1}} \ge t_{(n-2),\,\alpha/2}   or   t = \frac{b_1 - \beta_1^*}{s_{b_1}} \le -t_{(n-2),\,\alpha/2}

Equivalently, reject H0 if

\left|\frac{b_1 - \beta_1^*}{s_{b_1}}\right| \ge t_{(n-2),\,\alpha/2}

Page 51: Dr. Ka-fu Wong


Confidence Intervals for the Population Regression Slope \beta_1

If the regression errors \epsilon_i are normally distributed and the standard regression assumptions hold, a 100(1 - \alpha)% confidence interval for the population regression slope \beta_1 is given by

b_1 - t_{(n-2),\,\alpha/2}\, s_{b_1} < \beta_1 < b_1 + t_{(n-2),\,\alpha/2}\, s_{b_1}

Page 52: Dr. Ka-fu Wong


Some cautions about the interpretation of significance tests

Rejecting H0: \beta_1 = 0 and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y.

Establishing causation requires association, an accurate time sequence, and the elimination of other explanations for the correlation.

Correlation does not imply causation.

Page 53: Dr. Ka-fu Wong


Some cautions about the interpretation of significance tests

Just because we are able to reject H0: \beta_1 = 0 and demonstrate statistical significance does not enable us to conclude that there is a linear relationship between x and y.

A linear relationship is only a very small subset of the possible relationships among variables.

A test of a linear versus a nonlinear relationship requires further analysis.

Page 54: Dr. Ka-fu Wong


Evaluating the Model

Variation measures: the coefficient of determination and the standard error of estimate.
Test the coefficients for significance.

y_i^* = b_0 + b_1 x_i

Page 55: Dr. Ka-fu Wong


Variation Measures

[Illustration: for the fitted line y_i^* = b_0 + b_1 x_i, the deviation of each observed Y_i from \bar{Y} splits into an explained part and an unexplained part.]

Total sum of squares:       SST = \sum (Y_i - \bar{Y})^2
Unexplained sum of squares: SSE = \sum (Y_i - Y_i^*)^2
Explained sum of squares:   SSR = \sum (Y_i^* - \bar{Y})^2

Page 56: Dr. Ka-fu Wong


Measures of Variation in Regression

Total Sum of Squares (SST): measures the variation of the observed Y_i around the mean \bar{Y}.
Explained Variation (SSR): the variation due to the relationship between X and Y.
Unexplained Variation (SSE): the variation due to other factors.

SST = SSR + SSE

Page 57: Dr. Ka-fu Wong


R^2 (= r^2, the coefficient of determination) measures the proportion of the variation in y that is explained by the variation in x.

Variation in y (SST) = SSR + SSE

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}(y_i - y_i^*)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{SSR}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

R^2 takes on any value between zero and one.
R^2 = 1: perfect match between the line and the data points.
R^2 = 0: there is no linear relationship between x and y.
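A Python sketch of this decomposition for the textbook example; the values agree with the ANOVA table shown on a later slide (names are my own):

    pages = [500, 700, 800, 600, 400, 500, 600, 800]
    price = [84, 75, 99, 72, 69, 81, 63, 93]
    b0, b1 = 48.0, 0.05143
    y_bar = sum(price) / len(price)

    sst = sum((y - y_bar) ** 2 for y in price)                          # about 1044.0
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(pages, price))   # about 650.6
    ssr = sst - sse                                                     # about 393.4
    print(sst, ssr, sse, ssr / sst)                                     # R^2 about 0.377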

Page 58: Dr. Ka-fu Wong


Summarizing the Example’s results (Example 1, 2 and 3)

The estimated selling price for a book with 800 pages is $89.14.

The standard error of estimate is $10.41.

The 95 percent confidence interval for all books with 800 pages is $89.14 ± $15.31. This means the limits are between $73.83 and $104.45.

The 95 percent prediction interval for a particular book with 800 pages is $89.14 ± $29.72. This means the limits are between $59.42 and $118.86.

These results appear in the following output.

Page 59: Dr. Ka-fu Wong


Example 3 continued

Regression Analysis: Price versus Pages

The regression equation is
Price = 48.0 + 0.0514 Pages

Predictor      Coef     SE Coef      T      P
Constant      48.00       16.94   2.83  0.030
Pages       0.05143     0.02700   1.90  0.105

S = 10.41   R-Sq = 37.7%   R-Sq(adj) = 27.3%

Analysis of Variance
Source           DF      SS     MS     F      P
Regression        1   393.4  393.4  3.63  0.105
Residual Error    6   650.6  108.4
Total             7  1044.0
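For comparison, a hedged sketch of how one might reproduce the key numbers with scipy.stats.linregress (my illustration; the slide's output comes from a different package):

    from scipy import stats

    pages = [500, 700, 800, 600, 400, 500, 600, 800]
    price = [84, 75, 99, 72, 69, 81, 63, 93]

    fit = stats.linregress(pages, price)
    print(fit.intercept, fit.slope)   # about 48.0 and 0.0514
    print(fit.rvalue ** 2)            # R-Sq, about 0.377
    print(fit.pvalue)                 # about 0.105, matching the T and F tests above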

Page 60: Dr. Ka-fu Wong


Testing for Linearity

Key argument:

If the value of y does not change linearly with the value of x, then using the mean value of y is the best predictor for the actual value of y. This implies the predictor y^* = \bar{y} is preferable.

If the value of y does change linearly with the value of x, then using the regression model gives a better prediction for the value of y than using the mean of y. This implies the predictor y^* = b_0 + b_1 x is preferable.

Page 61: Dr. Ka-fu Wong


Three Tests for Linearity

Testing the coefficient of correlation:
H0: \rho = 0 (there is no linear relationship between x and y)
H1: \rho \neq 0 (there is a linear relationship between x and y)
Test statistic:

t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}

Testing the slope of the regression line:
H0: \beta_1 = 0 (there is no linear relationship between x and y)
H1: \beta_1 \neq 0 (there is a linear relationship between x and y)
Test statistic:

t = \frac{b_1}{s_e \big/ \sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}}

Page 62: Dr. Ka-fu Wong


Three Tests for Linearity

The global F-test:
H0: There is no linear relationship between x and y.
H1: There is a linear relationship between x and y.
Test statistic:

F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)} = \frac{\sum_{i=1}^{n}(y_i^* - \bar{y})^2 \big/ 1}{\sum_{i=1}^{n}(y_i - y_i^*)^2 \big/ (n-2)}

[Variation in y] = SSR + SSE. A large F results from a large SSR; then much of the variation in y is explained by the regression model, the null hypothesis is rejected, and the model is judged valid.

Note: At the level of simple linear regression, the global F-test is equivalent to the t-test on b1. When we conduct regression analysis of multiple variables, the global F-test will take on a unique function.
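An illustrative Python computation of the global F test for the textbook example, using the ANOVA sums of squares shown above (SciPy assumed for the p-value):

    from scipy import stats

    ssr, sse, n = 393.4, 650.6, 8
    msr = ssr / 1
    mse = sse / (n - 2)
    F = msr / mse                        # about 3.63
    p = 1 - stats.f.cdf(F, 1, n - 2)     # about 0.105
    print(F, p)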

Page 63: Dr. Ka-fu Wong


Residual Analysis

Purposes: examine linearity and evaluate violations of assumptions.

Graphical analysis of residuals: plot the residuals versus the X_i values. A residual is the difference between the actual Y_i and the predicted Y_i^*.

Studentized residuals allow consideration for the magnitude of the residuals.

Page 64: Dr. Ka-fu Wong


Residual Analysis for Linearity

[Two residual-versus-X plots: a curved pattern in the residuals indicates the relationship is not linear; a patternless band around zero indicates the linear form is OK.]

Page 65: Dr. Ka-fu Wong


Residual Analysis for Homoscedasticity

[Two plots of standardized residuals (SR) against X: a funnel-shaped spread indicates heteroscedasticity; a constant spread indicates homoscedasticity (OK).]

When the requirement of a constant variance (homoscedasticity) is violated we have heteroscedasticity.

Page 66: Dr. Ka-fu Wong


Residual Analysis for Independence

[Two plots of standardized residuals (SR) against X: a systematic pattern indicates the errors are not independent; a random scatter indicates independence (OK).]

Page 67: Dr. Ka-fu Wong


A time series is constituted when data are collected over time.

When the residuals are examined over time, no pattern should be observed if the errors are independent.

When a pattern is detected, the errors are said to be autocorrelated.

Autocorrelation can be detected by graphing the residuals against time.

Non-independence of error variables

Page 68: Dr. Ka-fu Wong


[Two plots of residuals against time. In the first, note the runs of positive residuals replaced by runs of negative residuals. In the second, note the oscillating behavior of the residuals around zero.]

Patterns in the appearance of the residuals over time indicate that autocorrelation exists.

Page 69: Dr. Ka-fu Wong


The Durbin-Watson Statistic

Used when data are collected over time to detect autocorrelation (residuals in one time period are related to residuals in another period). It measures violation of the independence assumption.

D = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}

D should be close to 2. If not, examine the model for autocorrelation.
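A minimal Python sketch of the statistic; the residuals fed in are made-up numbers chosen to show positive autocorrelation (D well below 2):

    def durbin_watson(e):
        num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
        den = sum(r ** 2 for r in e)
        return num / den

    residuals = [1.0, 0.8, 0.9, 1.1, -0.9, -1.0, -0.8, -1.1]   # runs of same-signed residuals
    print(durbin_watson(residuals))                            # about 0.58, far from 2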

Page 70: Dr. Ka-fu Wong


Outliers

An outlier is an observation that is unusually small or large.

Several possibilities need to be investigated when an outlier is observed:
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.

Identify outliers from the scatter diagram. It is customary to suspect that an observation is an outlier if its |standardized residual| > 2.
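A rough Python sketch of this screening rule; it standardizes each residual by a simple overall scale estimate (a full treatment would also adjust for leverage), so treat it as illustrative only:

    from statistics import pstdev

    def flag_suspected_outliers(residuals, threshold=2.0):
        s = pstdev(residuals)                 # simple scale estimate (an assumption, not the slides' method)
        return [i for i, e in enumerate(residuals) if abs(e / s) > threshold]

    print(flag_suspected_outliers([1.2, -0.8, 0.5, -1.1, 5.9, -1.3, -2.4]))   # flags index 4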

Page 71: Dr. Ka-fu Wong


[Two scatter plots with fitted lines. In the first, an outlier causes a shift in the regression line. In the second, an influential observation pulls the line toward itself: some outliers may be very influential.]

Page 72: Dr. Ka-fu Wong


Nonnormality or heteroscedasticity can be remedied using transformations on the y variable.

The transformations can improve the linear relationship between the dependent variable and the independent variables.

Many computer software systems allow us to make the transformations easily.

Remedying violations of the required conditions

Page 73: Dr. Ka-fu Wong


A brief list of transformations

Y’ = log y (for y > 0) Use when the se increases with y, or

Use when the error distribution is positively skewed

y’ = y2

Use when the se2 is proportional to E(y), or

Use when the error distribution is negatively skewed

y’ = y1/2 (for y > 0) Use when the se

2 is proportional to E(y)

y’ = 1/y Use when se

2 increases significantly when y increases beyond some value.
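A brief Python sketch of applying one of these transformations (the log) before refitting; the data are invented purely to illustrate the mechanics:

    import math
    from scipy import stats

    x = [1, 2, 3, 4, 5, 6]
    y = [2.0, 2.9, 4.5, 6.6, 10.1, 14.8]      # made-up y values whose spread grows with y
    log_y = [math.log(v) for v in y]

    fit = stats.linregress(x, log_y)          # fit the straight line to log(y) instead of y
    print(fit.slope, fit.intercept)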

Page 74: Dr. Ka-fu Wong


- END -

Chapter Thirteen: Linear Regression and Correlation