Estimating Model Parameters


Transcript of Estimating Model Parameters

1Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 11: Linear Regression and Correlation Methods

Hildebrand, Ott and Gray
Basic Statistical Ideas for Managers

Second Edition

2Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Learning Objectives for Chapter 11

• Using the scatterplot in regression analysis
• Using the method of least squares for finding the best fitting line
• Understanding the underlying assumptions in regression analysis
• Determining whether or not any observations are high leverage points, y-outliers or high influence points
• Using the t-test for testing the significance of the slope coefficient
• Using the F-test for testing the slope coefficient
• Understanding the difference between the correlation coefficient and the coefficient of determination

3Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Sections 11.1 – 11.2
The Linear Regression Model; Estimating Model Parameters

4Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• Objective: Model the relationship between a response or dependent variable (Y) and one predictor or independent variable (x).

Examples:
• For consumer purchase decisions, let Y = market share and x = consumer’s degree of ‘top of mind’ brand awareness (% of consumers who name this brand first).
• In beta analysis, let Y = return on a security (IBM) over a period of time and x = return on the market (DJIA).
• For a particular corporation, let Y = sales revenue for a region at the year’s end and x = advertising expenditures for the year. Y is recorded in tens of thousands of dollars and x is recorded in thousands of dollars.

• In all three examples, can x be used to predict Y?

5Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• Consider the example where Y = sales revenue and x = advertising expenditures. The data follow:

Region  Sales  Adv Exp
A       1      1
B       1      2
C       2      1
D       2      3
E       3      2
F       3      4
G       4      3
H       4      5
I       5      5
J       5      6

a) Is there a linear relationship between Sales and Advertising Expenditures?

6Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• A scatterplot of the data follows:

• The scatterplot is used to assess whether or not there is a linear relationship between Sales and Advertising Expenditures.

[Scatterplot for Sales vs. Advertising Expenditures: Adv Exp (1 to 6) on the horizontal axis, Sales (1 to 5) on the vertical axis]
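For readers who want to reproduce the scatterplot outside Minitab, here is a minimal Python sketch (not part of the original slides); it assumes matplotlib is available, and the variable names are illustrative:

import matplotlib.pyplot as plt

adv_exp = [1, 2, 1, 3, 2, 4, 3, 5, 5, 6]   # x: advertising expenditures (thousands of $)
sales   = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]   # y: sales revenue (tens of thousands of $)

plt.scatter(adv_exp, sales)                 # one point per region
plt.xlabel("Adv Exp")
plt.ylabel("Sales")
plt.title("Scatterplot for Sales vs. Advertising Expenditures")
plt.show()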

7Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• Is a linear relationship feasible?

• From the scatterplot, it appears that as Advertising Expenditures increase, Sales increase linearly.

• The relationship between Sales and Advertising Expenditures is an example of a statistical relationship between two variables. If a straight line is fit to the points, not all of the points would fall on the line; there are other factors besides Advertising Expenditures that affect Sales.

• An example of a deterministic relationship is:
  Total Costs = Fixed Costs + Variable Costs

8Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

Fitted Model
• The general expression for the line to be fit is:

  ŷ = β̂₀ + β̂₁x

• Residuals are prediction errors in the sample.
• The residual for an observation (xᵢ, yᵢ) is the difference between the actual value of sales and the predicted value:

  Residualᵢ = yᵢ − ŷᵢ = yᵢ − (β̂₀ + β̂₁xᵢ)

• How should the fitted line be determined?

9Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

Possible criteria for fitting a line passing through (x̄, ȳ):

i. Minimize the sum of the residuals, or min Σ(yᵢ − ŷᵢ), where the sums run over i = 1, …, n.
   Deficiency: An infinite number of lines satisfy this criterion.

ii. Minimize the sum of the absolute values of the residuals, or min Σ|yᵢ − ŷᵢ|.
   Deficiency: Procedure not available in most statistical software.

iii. Minimize the sum of the squared residuals, or min Σ(yᵢ − ŷᵢ)².

Criterion (iii) is known as the Method of Least Squares.

10Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

Procedure to obtain the Intercept and Slope of the Fitted Model
• When the Least-Squares criterion is used, β̂₀ and β̂₁ are found from the following two expressions:

  β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²   and   β̂₀ = ȳ − β̂₁x̄

• To facilitate the arithmetic, note that

  Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − n·x̄·ȳ   and   Σ(xᵢ − x̄)² = Σxᵢ² − n·x̄²
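As an aside (not from the slides), the two formulas above translate directly into a short NumPy sketch; the names b0 and b1 are illustrative:

import numpy as np

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)   # Adv Exp
y = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)   # Sales
n = len(x)

sxy = np.sum(x * y) - n * x.mean() * y.mean()   # Σxy − n·x̄·ȳ  -> 20.0
sxx = np.sum(x ** 2) - n * x.mean() ** 2        # Σx² − n·x̄²   -> 27.6
b1 = sxy / sxx                                  # slope ≈ 0.7246
b0 = y.mean() - b1 * x.mean()                   # intercept ≈ 0.6812
print(b0, b1)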

11Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):

b) Find the least-squares regression line. Do the calculations by hand.

Region  Sales (y)  Adv Exp (x)  xy    x²
A       1          1            1     1
B       1          2            2     4
C       2          1            2     1
D       2          3            6     9
E       3          2            6     4
F       3          4            12    16
G       4          3            12    9
H       4          5            20    25
I       5          5            25    25
J       5          6            30    36
Sum     30         32           116   130

12Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

From the previous slide, Σxᵢyᵢ = 116, Σxᵢ² = 130, x̄ = 3.2, ȳ = 3.0.

Σxᵢyᵢ − n·x̄·ȳ = 116 − (10)(3.2)(3.0) = 20

Σxᵢ² − n·x̄² = 130 − (10)(3.2)² = 27.6

β̂₁ = [Σxᵢyᵢ − n·x̄·ȳ] / [Σxᵢ² − n·x̄²] = 20/27.6 = 0.725

β̂₀ = ȳ − β̂₁x̄ = 3.0 − (0.725)(3.2) = 0.681

13Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• The least-squares regression equation is:

  Ŷ = 0.681 + 0.725x   (0.681 is the y-intercept; 0.725 is the slope coefficient)

• Notice the similarity to the slope-intercept form of a straight line:

  y = mx + b

• In regression, the order of the terms is reversed: the intercept is written first.

14Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
c) State the equation of the least-squares regression line by using Minitab. The Minitab output follows:

Regression Analysis: Sales versus Adv Exp

The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor  Coef    SE Coef  T     P
Constant   0.6812  0.5694   1.20  0.266
Adv Exp    0.7246  0.1579   4.59  0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

The equation of the fitted line is: Predicted Sales = 0.681 + 0.725 Adv Exp
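The slides use Minitab; as a hedged aside, roughly comparable output can be obtained in Python with the statsmodels package (an assumption about the reader's environment, not part of the textbook):

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)
y = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)

fit = sm.OLS(y, sm.add_constant(x)).fit()   # ordinary least squares with an intercept
print(fit.summary())                        # coefficients ≈ 0.6812 and 0.7246, R-squared ≈ 0.725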

15Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
d) Obtain the fitted line plot by using Minitab.

The fitted line plot reinforces the appropriateness of using a linear model.

[Fitted line plot: Sales = 0.6812 + 0.7246 Adv Exp; S = 0.829702, R-Sq = 72.5%, R-Sq(adj) = 69.0%]

16Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):

e) Interpret the slope of the fitted line in the context of this problem.

Sales increase by 0.725 units for each 1-unit increase in Advertising Expenditures.

f) Predict sales for a region that has advertising expenditures of 3 units.

ŷ = 0.681 + 0.725(3.0) = 2.856 units

17Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

g) Determine the residual for Region A.

Residual = (Actual Sales) – (Predicted Sales)
         = (1.0) – [0.6812 + (0.725)(1.0)]
         = 1 – 1.4062
         = -0.4062

The fitted values and residuals for all regions follow.

Region  Adv Exp  Sales  Fit    Residual
A       1.00     1.000  1.406  -0.406
B       2.00     1.000  2.130  -1.130
C       1.00     2.000  1.406   0.594
D       3.00     2.000  2.855  -0.855
E       2.00     3.000  2.130   0.870
F       4.00     3.000  3.580  -0.580
G       3.00     4.000  2.855   1.145
H       5.00     4.000  4.304  -0.304
I       5.00     5.000  4.304   0.696
J       6.00     5.000  5.029  -0.029
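A small sketch (again not from the slides, assuming NumPy) of how the Fit and Residual columns above can be reproduced:

import numpy as np

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)
y = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
b0, b1 = 0.6812, 0.7246              # fitted coefficients from the Minitab output

fit = b0 + b1 * x                    # predicted sales for each region
resid = y - fit                      # residual = actual − predicted
for region, f, r in zip("ABCDEFGHIJ", fit, resid):
    print(region, round(f, 3), round(r, 3))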

18Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

h) On the fitted line plot, specify the residual for Region G.

[Fitted line plot (Sales = 0.6812 + 0.7246 Adv Exp; S = 0.829702, R-Sq = 72.5%, R-Sq(adj) = 69.0%) with the residual for Region G shown as the vertical distance from the point (3, 4) to the fitted line]

19Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• Corresponding to the fitted model is the population model:

[Graph of the population model: the unknown line E(Y) = β₀ + β₁x, with individual errors shown as vertical distances from the line and the probability distribution for Y at x = 5 sketched around the line]

20Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• Properties of the population model
  • At each value of x, there is a probability distribution of Y values.
  • The means, E(Y), of these probability distributions lie on a straight line, where β₀ is the intercept and β₁ is the slope.
  • The expression for E(Y) is:

    E(Y) = β₀ + β₁x,   or   Yᵢ = β₀ + β₁xᵢ + εᵢ

    where εᵢ is the error, or difference between Yᵢ and E(Yᵢ).

21Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• Assumptions
  1. The relation is in fact linear, so that the errors all have expected value zero: E(εᵢ) = 0 for all i.
  2. The errors all have the same variance: Var(εᵢ) = σε² for all i.
  3. The errors are independent of each other.

• The fitted line or model is an estimate of the population model.

22Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• σε² is also unknown and needs to be estimated.

• Since residuals estimate errors, use the variation in the residuals to estimate the variation in the errors.

• There are 2 constraints on the residuals:

  Σᵢ residualᵢ = 0   and   Σᵢ (xᵢ)(residualᵢ) = 0

• For the Sales vs. Advertising Expenditures example, these constraints are shown on the next slide.

• The residuals have (n − 2) degrees of freedom.

23Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• The variation in the residuals is:

  sε² = Σᵢ (residualᵢ − 0)² / (n − 2)

• Use sε to estimate σε.

Region  Sales  Adv Exp  Fit    Res     (Adv Exp)(Res)
A       1      1        1.406  -0.406  -0.406
B       1      2        2.130  -1.130  -2.261
C       2      1        1.406   0.594   0.594
D       2      3        2.855  -0.855  -2.565
E       3      2        2.130   0.870   1.739
F       3      4        3.580  -0.580  -2.319
G       4      3        2.855   1.145   3.435
H       4      5        4.304  -0.304  -1.522
I       5      5        4.304   0.696   3.478
J       5      6        5.029  -0.029  -0.174
Sum                             0.000   0.000
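A minimal sketch (assuming NumPy; not from the slides) that checks the two constraints and computes the residual standard deviation:

import numpy as np

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)
y = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
n = len(x)
b0, b1 = 0.6812, 0.7246
resid = y - (b0 + b1 * x)

print(resid.sum())                              # ≈ 0 (first constraint, up to rounding of b0, b1)
print((x * resid).sum())                        # ≈ 0 (second constraint)
s_eps = np.sqrt(np.sum(resid ** 2) / (n - 2))   # residual standard deviation
print(s_eps)                                    # ≈ 0.83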

24Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• Terminology for sε:
  – Sample standard deviation around the regression line
  – Standard error of estimate
  – Residual standard deviation

25Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
i) Identify the value of the sample standard deviation about the regression line from the Minitab output.

Regression Analysis: Sales versus Adv Exp

The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor  Coef    SE Coef  T     P
Constant   0.6812  0.5694   1.20  0.266
Adv Exp    0.7246  0.1579   4.59  0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

The value of the sample standard deviation about the regression line is s = 0.829702.

26Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.1 The Linear Regression Model
11.2 Estimating Model Parameters

• How is this useful?

“Like any other standard deviation, the residual standard deviation may be interpreted by the Empirical Rule. About 95% of the prediction errors will fall within +/-2 standard deviations of the mean error; the mean error is always 0 in the least-squares regression model. Therefore, a residual standard deviation of 0.83 means that about 95% of the prediction errors will be less than +/- 2(0.83) = +/-1.66” [Hildebrand, Ott and Gray]

27Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

• A high-leverage point is one for which the x-value is, in some sense, far away from most of the x-values.

[Sketch of a scatterplot of y vs. x: most of the x-values form one cluster, and a single point lies far to the right, labeled “A high leverage point”]

28Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

• MINITAB flags a high leverage point with an X symbol. The determination is made by looking at the leverage, denoted by hᵢ, for each observation:

  hᵢ = 1/n + (xᵢ − x̄)² / Σⱼ(xⱼ − x̄)²

  The numerator of the second term is the squared deviation of a particular xᵢ; the denominator is the variation in all x’s relative to x̄.

Some limits are built in:

  1/n ≤ hᵢ ≤ 1;   Σᵢ hᵢ = 2;   h̄ = 2/n

If hᵢ > 6/n, this observation is flagged with an X symbol. Why 6/n? Because 6/n = 3h̄.
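As an illustration (not from the slides, assuming NumPy), the leverages and the 6/n flag for the original data:

import numpy as np

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)   # Adv Exp
n = len(x)

h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)   # leverage of each observation
print(np.round(h, 3))    # e.g. region J (x = 6) has h ≈ 0.384
print(h.sum())           # the leverages sum to 2 in simple regression
print(h > 6 / n)         # True would mean the observation is flagged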

29Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
j) Find the leverage for region J, where the point is (6, 5).

  h_J = 1/10 + (6 − 3.2)²/27.6 = 0.384

• Is this a high leverage point? No, since h₁₀ = 0.384 < 0.6.

• A point with a large x-value is not necessarily a high leverage point.

30Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

• The leverage values for all 10 observations follow (HI1 gives the leverage for each observation):

Region  Sales  Adv Exp  SRES1     HI1
A       1      1        -0.57455  0.275362
B       1      2        -1.47969  0.152174
C       2      1         0.84130  0.275362
D       2      3        -1.08720  0.101449
E       3      2         1.13822  0.152174
F       3      4        -0.74617  0.123188
G       4      3         1.45574  0.101449
H       4      5        -0.41464  0.217391
I       5      5         0.94776  0.217391
J       5      6        -0.04451  0.384058

• All leverages are less than 6/n ⇒ no high leverage points.

31Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

• The fitted line plot follows. The graph suggests there are no high leverage points

[Fitted line plot: Sales = 0.6812 + 0.7246 Adv Exp; S = 0.829702, R-Sq = 72.5%, R-Sq(adj) = 69.0%]

32Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures): Region G has just sent an email stating that they reported incorrect values for sales and advertising expenditures. The bad news is that they actually spent $10,000 (10 units) on Advertising. The good news is that Sales were actually $80,000 (8 units). Is the revised data point for Region G a high leverage point?

The point (10, 8) is a high leverage point. The Minitab output follows.

  h_G = 1/10 + (10 − 3.9)²/68.9 = 0.64 > 0.60

33Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

• The regression output for the revised data follows.

Regression Analysis: Sales vs Adv Exp [Revised Data: Change (3,4) to (10,8)]

The regression equation is
Sales = 0.491 + 0.746 Adv Exp

Predictor  Coef     SE Coef  T     P
Constant   0.4906   0.4032   1.22  0.258
Adv Exp    0.74601  0.08577  8.70  0.000

S = 0.711965   R-Sq = 90.4%   R-Sq(adj) = 89.2%

Unusual Observations
Obs  AdvExp  Sales  Fit    SE Fit  Residual  St Resid
7    10.0    8.000  7.951  0.570   0.049     0.12 X

X denotes an observation whose X value gives it large influence.

34Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

• The leverage (HI) for all of the observations follows.

Region  Sales  Adv Exp  SRES      HI
A       1      1        -0.37674  0.222061
B       1      2        -1.49904  0.152395
C       2      1         1.21572  0.222061
D       2      3        -1.08582  0.111756
E       3      2         1.55218  0.152395
F       3      4        -0.70272  0.100145
G       8      10        0.11553  0.640058
H       4      5        -0.32986  0.117562
I       5      5         1.16534  0.117562
J       5      6         0.05128  0.164006

35Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

• The fitted line plot reinforces that Region G is a high leverage point

[Fitted line plot for the revised data: Sales_1 = 0.4906 + 0.7460 Adv Exp_1; S = 0.711965, R-Sq = 90.4%, R-Sq(adj) = 89.2%]

36Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

• A high leverage point is not necessarily “Bad.”

• A high leverage point has the potential to alter the fitted line.

• For the example, the potential was not realized. For the original data, the slope was 0.725; for the revised data, the slope is 0.746.

37Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

Standardized Residuals
• A residual is standardized as follows:

  SRᵢ = residualᵢ / (std. dev. of residualᵢ) = residualᵢ / (sε·√(1 − hᵢ))

• A Standardized Residual (SR) depends on the residual and the leverage.
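A short sketch (assuming NumPy; not part of the slides) that computes the standardized residuals for the original data from this formula:

import numpy as np

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)
y = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_eps = np.sqrt(np.sum(resid ** 2) / (n - 2))
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

sr = resid / (s_eps * np.sqrt(1 - h))   # standardized residuals (compare with Minitab's SRES column)
print(np.round(sr, 3))
print(np.abs(sr) > 2)                   # |SR| > 2 flags a y-outlier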

38Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

• A requirement is that the errors in the population regression model be normally distributed.

• Since the residuals estimate the errors, this implies the residuals should be normally distributed.

• Standardized Residuals can be viewed as values of a standard normal random variable.

• A Standardized Residual is considered large if |SRi| > 2.

39Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures): Region G has just sent another email stating that they reported incorrect values again for sales and advertising expenditures. The good news is that they only spent $3,000 (3 units) on Advertising. The better news is that Sales were actually $50,000 (5 units). Does this result in a y-outlier for Region G?

  SR_G = residual_G / (sε·√(1 − h_G)) = 2.043 / [(1.043)·√(1 − 0.1015)] = 2.07 > 2.0

Region G is now a y-outlier.

40Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

The Minitab output for the revised data follows.

Regression Analysis: Sales versus Adv Exp [Revised Data: Change (3,4) to (3,5)]

The regression equation is
Sales = 0.804 + 0.717 Adv Exp

Predictor  Coef    SE Coef  T     P
Constant   0.8043  0.7155   1.12  0.294
Adv Exp    0.7174  0.1985   3.61  0.007

S = 1.04257   R-Sq = 62.0%   R-Sq(adj) = 57.3%

Unusual Observations
Obs  AdvExp  Sales  Fit    SE Fit  Residual  St Resid
7    3.00    5.000  2.957  0.332   2.043     2.07 R

R denotes an observation with a large standardized residual.

41Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

The fitted line plot reinforces that Region G is a y-outlier.

[Fitted line plot: Sales = 0.8043 + 0.7174 Adv Exp; S = 1.04257, R-Sq = 62.0%, R-Sq(adj) = 57.3%; the point for Region G at (3, 5) is labeled as the y-outlier]

42Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

High Influence Points
• A high leverage point that also corresponds to a y-outlier is a high influence point.

Example (Sales vs. Advertising Expenditures): Region G just can’t get their act together. Region G has just sent another email stating that they reported incorrect values again for sales and advertising expenditures. The bad news is that they spent $10,000 (10 units) on Advertising. The really bad news is that Sales were still only $40,000 (4 units).

Does this result in a high influence point for Region G? From the Minitab output that follows, the answer is yes. The potential of the high leverage point to alter the regression line was realized.

43Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.2 Estimating Model Parameters

Regression Analysis: Sales versus Adv Exp [Revised Data: Change (3,4) to (10,4)]

The regression equation is
Sales = 1.47 + 0.392 Adv Exp

Predictor  Coef    SE Coef  T     P
Constant   1.4717  0.6145   2.39  0.044
Adv Exp    0.3919  0.1307   3.00  0.017

S = 1.08509   R-Sq = 52.9%   R-Sq(adj) = 47.0%

Unusual Observations
Obs  AdvExp  Sales  Fit    SE Fit  Residual  St Resid
7    10.0    4.000  5.390  0.868   -1.390    -2.14 RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

44Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Section 11.3
Inferences about Regression Parameters

45Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.3 Inferences about Regression Parameters

• Model:  E(Y) = β₀ + β₁x

• Model Parsimony: Is x useful in predicting Y?

• Hypotheses:  H₀: β₁ = 0  vs.  Ha: β₁ ≠ 0

• The sampling distribution of β̂₁ is needed to test H₀: β₁ = 0.

• Additional assumption for the population model: Errors (ε) are normally distributed.

46Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.3 Inferences about Regression Parameters

• Sampling distribution of β̂₁

The sampling distribution of β̂₁ is the probability distribution of the different values of β̂₁ which would be obtained with repeated sampling, when the values of the independent variable x are held constant for the repeated samples.

47Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.3 Inferences about Regression Parameters

• β̂₁ is normally distributed with:

  E(β̂₁) = β₁   and   Var(β̂₁) = σε² / Σ(xᵢ − x̄)²

• Substitute sε² for σε² to estimate Var(β̂₁).

48Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.3 Inferences about Regression Parameters

• The estimated standard error of β̂₁ is:

  sε / √Σ(xᵢ − x̄)²

• Minitab denotes this as “SE Coef”.

49Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.3 Inferences about Regression Parameters

• The distribution of

  (β̂₁ − β₁) / [Estimated Standard Error of β̂₁]

  is t with (n − 2) degrees of freedom.

• To test H₀: β₁ = 0, use

  t = (β̂₁ − 0) / [Estimated Standard Error of β̂₁]

50Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.3 Inferences about Regression Parameters

• The rejection region depends on the research hypothesis:

Research Hypothesis    Rejection Region
Ha: β₁ ≠ 0             Reject H₀ if t > tα/2,n-2 or if t < −tα/2,n-2
Ha: β₁ > 0             Reject H₀ if t > tα,n-2
Ha: β₁ < 0             Reject H₀ if t < −tα,n-2

51Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.3 Inferences about Regression Parameters

Example (Sales vs. Advertising Expenditures):

Test H₀: β₁ = 0 vs. Ha: β₁ ≠ 0 at the 5% significance level by using the Minitab output.

Regression Analysis: Sales versus Adv Exp

The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor  Coef    SE Coef  T     P
Constant   0.6812  0.5694   1.20  0.266
Adv Exp    0.7246  0.1579   4.59  0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

52Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.3 Inferences about Regression Parameters

• The test statistic is

  t = (β̂₁ − 0)/(SE Coef) = (0.7246 − 0)/0.1579 = 4.59

  Since 4.59 > t.025,8 = 2.306, reject H₀: β₁ = 0.

• Equivalently, since p-value = .002, reject H₀: β₁ = 0.

⇒ Advertising Expenditures is a significant predictor of Sales.
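As a check on the calculation above, a hedged Python sketch (assuming NumPy and SciPy are available; not part of the slides):

import numpy as np
from scipy import stats

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)
y = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_eps = np.sqrt(np.sum(resid ** 2) / (n - 2))

se_b1 = s_eps / np.sqrt(sxx)              # estimated standard error of the slope ≈ 0.1579
t = (b1 - 0) / se_b1                      # ≈ 4.59
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p-value ≈ 0.002
print(t, p)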

53Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.3 Inferences about Regression Parameters

• The F statistic can also be used to test H0: β1 = 0 vs. Ha: β1 ≠ 0

• The F test is equivalent to the t-test in simple regression.

• The F test is presented in the slides for Section 11.5.

54Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Section 11.4
Predicting New Y Values Using Regression

55Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.4 Predicting New Y Values Using Regression

Scenario 1: Find a confidence interval for E(Y) at xₙ₊₁

• A new value of x, denoted by xₙ₊₁, is specified.

[Graph: the line E(Y) plotted against x, with the new value xₙ₊₁ marked on the x-axis]

56Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.4 Predicting New Y Values Using Regression

Scenario 1:
• Point Estimate:  ŷₙ₊₁ = β̂₀ + β̂₁xₙ₊₁

• Interval Estimate:

  ŷₙ₊₁ ± tα/2,n-2 · sε · √( 1/n + (xₙ₊₁ − x̄)² / Σ(xᵢ − x̄)² )

• The further that xₙ₊₁ is from x̄, the wider the interval.
• The larger the range of x-values, the narrower the interval.
• The larger the number of data points, the narrower the interval.

57Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.4 Predicting New Y Values Using Regression

Example (Sales vs. Advertising Expenditures): With 95% confidence, within what limits is the average value for sales when advertising expenditures are $4,000?

• Point Estimate:  Ŷ = .681 + .725(4) = 3.58;  sε = .830;  Σ(xᵢ − x̄)² = 27.6

• Interval Estimate:

  3.58 ± (2.306)(.830)·√( 1/10 + (4 − 3.2)²/27.6 ),  or  [2.91, 4.25]

• This is a confidence interval estimate for the population average for sales regions with advertising expenditures of $4,000.
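A minimal sketch of the confidence-interval arithmetic (assuming NumPy and SciPy; not part of the slides):

import numpy as np
from scipy import stats

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)
n = len(x)
b0, b1, s_eps = 0.6812, 0.7246, 0.8297    # values from the Minitab output
x_new = 4.0

y_hat = b0 + b1 * x_new
sxx = np.sum((x - x.mean()) ** 2)
half = stats.t.ppf(0.975, n - 2) * s_eps * np.sqrt(1 / n + (x_new - x.mean()) ** 2 / sxx)
print(y_hat - half, y_hat + half)         # ≈ (2.91, 4.25)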

58Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.4 Predicting New Y Values Using Regression

Scenario 2: Find a prediction interval for Yₙ₊₁ at xₙ₊₁

[Graph: the line E(Y) plotted against x, with the new value xₙ₊₁ marked on the x-axis and an individual observation yₙ₊₁ shown off the line]

59Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.4 Predicting New Y Values Using Regression

• Predicting a specific value for Y at a given value of x.

• Point Estimate:  ŷₙ₊₁ = β̂₀ + β̂₁xₙ₊₁

• Interval Estimate:

  ŷₙ₊₁ ± tα/2,n-2 · sε · √( 1 + 1/n + (xₙ₊₁ − x̄)² / Σ(xᵢ − x̄)² )

• A prediction interval for Yₙ₊₁ is wider than a confidence interval for E(Yₙ₊₁).

60Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.4 Predicting New Y Values Using Regression

Example (Sales vs. Advertising Expenditures):

Suppose a new region is to be allowed advertising expenditures of $4,000. What sales revenue can be anticipated? Obtain a 95% prediction interval.

Point Estimate:  ŷ = 3.58;  sε = .830;  Σ(xᵢ − x̄)² = 27.6

Interval Estimate:

  3.58 ± (2.306)(.830)·√( 1 + 1/10 + (4 − 3.2)²/27.6 ),  or  [1.55, 5.61]

This is a 95% prediction interval for the sales of an individual region when advertising expenditures are $4,000.
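The same sketch with the extra “1 +” term inside the square root gives the prediction interval (assumptions as before):

import numpy as np
from scipy import stats

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)
n = len(x)
b0, b1, s_eps = 0.6812, 0.7246, 0.8297
x_new = 4.0

y_hat = b0 + b1 * x_new
sxx = np.sum((x - x.mean()) ** 2)
half = stats.t.ppf(0.975, n - 2) * s_eps * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / sxx)
print(y_hat - half, y_hat + half)         # ≈ (1.55, 5.61)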

61Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.4 Predicting New Y Values Using Regression

The Minitab output follows for the confidence and prediction intervals:

Values of Predictors for New Observations

New Obs  Adv Exp
1        4.00

Predicted Values for New Observations

New Obs  Fit    SE Fit  95% CI            95% PI
1        3.580  0.291   (2.908, 4.251)    (1.552, 5.607)

62Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.4 Predicting New Y Values Using Regression

The confidence and prediction intervals are shown in the following graph:

[Fitted line plot (Sales = 0.6812 + 0.7246 Adv Exp; S = 0.829702, R-Sq = 72.5%, R-Sq(adj) = 69.0%) with 95% confidence and 95% prediction bands around the regression line]

63Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Section 11.5
Correlation

64Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• Coefficient of Determination (r²yx or R²)

• Based on explained and unexplained deviation

• Of the total deviation, how much is explained by fitting the regression line and how much is left over?

  yᵢ − ȳ  =  (ŷᵢ − ȳ)  +  (yᵢ − ŷᵢ)
  Total deviation = Explained deviation + Unexplained deviation

65Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

Example (Sales vs. Advertising Expenditures): For region I, x = 5 and y = 5.

  ŷ₉ = 0.6812 + 0.7246(5) = 4.304

Total deviation:        (y₉ − ȳ) = (5 − 3) = 2
Explained deviation:    (ŷ₉ − ȳ) = (4.304 − 3) = 1.304
Unexplained deviation:  (y₉ − ŷ₉) = (5 − 4.304) = 0.696

66Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

This is shown on the fitted line plot:

[Fitted line plot (Sales = 0.6812 + 0.7246 Adv Exp; S = 0.829702, R-Sq = 72.5%, R-Sq(adj) = 69.0%) showing, for region I, the total deviation from ȳ split into the explained and unexplained deviations]

67Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• Square both sides to account for negative deviations:

  (yᵢ − ȳ)² = (ŷᵢ − ȳ)² + (yᵢ − ŷᵢ)² + [cross-product term]

• Sum over all observations (for the least-squares line, the cross-product term sums to zero):

  Σ(yᵢ − ȳ)²  =  Σ(ŷᵢ − ȳ)²  +  Σ(yᵢ − ŷᵢ)²
  Sum of Squares due to Total (SST) = Sum of Squares due to Regression (SSR) + Sum of Squares due to Error (SSE)

68Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• Each Sum of Squares has “degrees of freedom” (df) associated with it.

• SST = SSR + SSE:

  Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²

• Degrees of freedom for each Sum of Squares are:

  (n – 1) = 1 + (n – 2)

69Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• Mean Square = Sum of Squares/ df

• MSR = SSR/df = SSR /1 = SSR

• MSE = SSE/df = SSE/(n – 2)

• This leads to another test statistic, the F-statistic, for testing H0: β1 = 0.

• The F-Statistic is F = MSR/MSE.
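A brief sketch of the whole ANOVA decomposition for the example data (assuming NumPy and SciPy; not from the slides):

import numpy as np
from scipy import stats

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)
y = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
fit = y.mean() + b1 * (x - x.mean())      # fitted values

sst = np.sum((y - y.mean()) ** 2)         # total sum of squares        -> 20.0
ssr = np.sum((fit - y.mean()) ** 2)       # regression sum of squares   -> ≈ 14.49
sse = np.sum((y - fit) ** 2)              # error sum of squares        -> ≈ 5.51
msr, mse = ssr / 1, sse / (n - 2)
f_stat = msr / mse                        # ≈ 21.05
p = stats.f.sf(f_stat, 1, n - 2)          # ≈ 0.002
print(f_stat, p)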

70Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• Rationale: If SSR is large relative to SSE, this indicates the independent variable x has real predictive value.

• The F test is one-tailed.

• Rejection Region: Reject H0 : β1 = 0 if F-Statistic > Fα,1,n-2

Or, reject H0 if p-value < α.

71Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

Example: Sales and Advertising Expenditures
Locate the value of the F-Statistic, the associated p-value, and state your conclusion.

Regression Analysis: Sales versus Adv Exp

The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor  Coef    SE Coef  T     P
Constant   0.6812  0.5694   1.20  0.266
Adv Exp    0.7246  0.1579   4.59  0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   14.493  14.493  21.05  0.002
Residual Error  8   5.507   0.688
Total           9   20.000

72Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• F test = 14.493/0.688 = 21.05

Since 21.05 > F.05,1,8 = 5.32, reject H0: β1 = 0.

• Percentage points of the F distribution are in Table 6.

• Equivalently, since p-value = .002, reject H0: β1 = 0.

73Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• In simple regression, t² = F.
  Example: Sales and Advertising Expenditures
  t² = (4.59)² = 21.07 ≈ F (the small difference from 21.05 is rounding)

• In simple regression, the p-values for the F test and the t test are equal.
  Example: Sales and Advertising Expenditures
  p-value for F test = .002 = p-value for t test

74Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• The Coefficient of Determination, denoted by r²yx or R², is

  R² = SSR/SST

• R² specifies how much of the total variation is explained by the fitted line.

75Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

Example (Sales vs. Advertising Expenditures): Use the Minitab output to find the value of the coefficient of determination and interpret the value in the context of this problem.

Regression Analysis: Sales versus Adv Exp
The regression equation is
Sales = 0.681 + 0.725 Adv Exp
S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

From the output, R² = 72.5%.

Interpretation: 72.5% of the variation in Sales is explained by the regression model with Advertising Expenditures as the predictor.

76Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• The coefficient of determination, R², ranges from 0 to 1.

• The coefficient of correlation, denoted by r, is obtained from R²:

  r = √R²

  where the sign (+, −) of r is the same as the sign of β̂₁.
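A small illustration (assuming NumPy; not from the slides) of R² = SSR/SST and of attaching the slope's sign to r:

import numpy as np

x = np.array([1, 2, 1, 3, 2, 4, 3, 5, 5, 6], dtype=float)
y = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
fit = y.mean() + b1 * (x - x.mean())

r_squared = np.sum((fit - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)   # SSR/SST ≈ 0.725
r = np.sign(b1) * np.sqrt(r_squared)                                      # ≈ +0.851
print(r_squared, r)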

77Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• r takes on values from -1 to +1

• Correlation measures the strength of the linear relationship between x and Y.

• The coefficient of determination (R2) and the coefficient of correlation (r) have very different interpretations.

78Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• In correlation, both variables are on an equal footing.

• It does not matter which is labeled x and which is labeled Y. The objective is to measure the association between x and Y.

• This is in contrast to regression analysis, where the objective is to use x to predict Y.

79Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

Example (Sales vs. Advertising Expenditures)

  r = √R² = √0.725 = +0.851

• r has a positive sign because the slope is positive: β̂₁ = +0.725.

Warning: Correlation does not imply causation.

80Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• r denotes the sample correlation coefficient.

• r estimates the population correlation (ρ)

• Hypothesis testing on ρ requires certain assumptions.

In regression analysis, the values of x are predetermined constants. In correlation analysis, the values of x are randomly selected.

In correlation analysis, the x values also come from a normal distribution. More precisely, the random sample of (x,y) values is drawn from a bivariate normal distribution.

81Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

• To test H₀: ρ = 0, the test statistic is

  t = r√(n – 2) / √(1 – r²)

  where t has (n – 2) d.f.

• The rejection region depends on the form of Ha:
  • If Ha: ρ > 0, reject H₀ if t > tα,n-2
  • If Ha: ρ < 0, reject H₀ if t < -tα,n-2
  • If Ha: ρ ≠ 0, reject H₀ if | t | > tα/2,n-2

82Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

11.5 Correlation

Exercise 11.27: A survey of recent M.B.A. graduates of a business school obtained data on first-year salary and years of prior work experience. [The data are in Exercise 11.27. Assume that the 51 students were randomly selected.]

• The Minitab output follows:
  Pearson correlation of SALARY and EXPER = 0.703
  P-Value = 0.000

  t = r√(n – 2) / √(1 – r²) = 0.703·√(51 – 2) / √(1 – (.703)²) = 6.92

• For such a large t-value, the p-value is 0.000.
  Conclusion: Reject H₀: ρ = 0.
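A hedged sketch of the same test from the summary numbers alone (assuming NumPy and SciPy; the figures 0.703 and 51 are taken from the exercise):

import numpy as np
from scipy import stats

r, n = 0.703, 51
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)    # ≈ 6.92
p = 2 * stats.t.sf(abs(t), df=n - 2)            # two-sided p-value ≈ 0.000
print(t, p)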

83Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Keywords: Chapter 11

• Regression analysis
• Independent variable
• Predictor variable
• Dependent variable
• Response variable
• Scatterplot
• Simple regression
• Slope
• y-intercept
• Least-squares method
• High leverage points
• Y-outliers
• High influence points
• Standard error of the estimate
• t-test on slope
• F-test on slope
• Coefficient of determination
• Coefficient of correlation

84Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Summary of Chapter 11

• Understanding the role of the scatterplot
• Understanding the rationale of the least squares method for finding the best fitting line
• Understanding the impact of high leverage points, y-outliers and high influence points on the fitted line
• Understanding variability around the regression line
• Testing the slope coefficient using the t-test
• Testing the slope coefficient using the F-test
• Understanding the difference between the correlation coefficient and the coefficient of determination