Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11. Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Chapter 11: Linear Regression and Correlation Methods
Hildebrand, Ott and Gray
Basic Statistical Ideas for Managers
Second Edition
Learning Objectives for Chapter 11
• Using the scatterplot in regression analysis
• Using the method of least squares for finding the best-fitting line
• Understanding the underlying assumptions in regression analysis
• Determining whether any observations are high leverage points, y-outliers, or high influence points
• Using the t-test for testing the significance of the slope coefficient
• Using the F-test for testing the slope coefficient
• Understanding the difference between the correlation coefficient and the coefficient of determination
Sections 11.1 – 11.2: The Linear Regression Model; Estimating Model Parameters
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• Objective: Model the relationship between a response or dependent variable (Y) and one predictor or independent variable (x).
Examples:
• For consumer purchase decisions, let Y = market share and x = the consumer's degree of 'top of mind' brand awareness (% of consumers who name this brand first).
• In beta analysis, let Y = return on a security (IBM) over a period of time and x = return on the market (DJIA).
• For a particular corporation, let Y = sales revenue for a region at the year's end and x = advertising expenditures for the year. Y is recorded in tens of thousands of dollars and x is recorded in thousands of dollars.
• In all three examples, can x be used to predict Y?
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• Consider the example where Y = sales revenue and x = advertising expenditures. The data follow:
Region   Sales   Adv Exp
A        1       1
B        1       2
C        2       1
D        2       3
E        3       2
F        3       4
G        4       3
H        4       5
I        5       5
J        5       6
a) Is there a linear relationship between Sales and Advertising Expenditures?
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• A scatterplot of the data follows:
• The scatterplot is used to assess whether or not there is a linear relationship between Sales and Advertising Expenditures.
[Figure: Scatterplot for Sales vs. Advertising Expenditures; Sales (1 to 5) on the vertical axis, Adv Exp (1 to 6) on the horizontal axis]
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• Is a linear relationship feasible?
• From the scatterplot, it appears that as Advertising Expenditures increase, Sales increase linearly.
• The relationship between Sales and Advertising Expenditures is an example of a statistical relationship between two variables. If a straight line were fit to the points, not all of the points would fall on the line; there are other factors besides Advertising Expenditures that affect Sales.
• An example of a deterministic relationship is: Total Costs = Fixed Costs + Variable Costs.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Fitted Model
• The general expression for the line to be fit is:

      ŷ = β̂₀ + β̂₁x

• Residuals are prediction errors in the sample.
• The residual for an observation (xᵢ, yᵢ) is the difference between the actual value of sales and the predicted value:

      Residualᵢ = yᵢ − ŷᵢ = yᵢ − (β̂₀ + β̂₁xᵢ)

• How should the fitted line be determined?
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Possible criteria for fitting a line passing through (x̄, ȳ):

i.   Minimize the sum of the residuals:  min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)
     Deficiency: An infinite number of lines satisfy this criterion.

ii.  Minimize the sum of the absolute values of the residuals:  min Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|
     Deficiency: The procedure is not available in most statistical software.

iii. Minimize the sum of the squared residuals:  min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Criterion (iii) is known as the Method of Least Squares.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Procedure to obtain the Intercept and Slope of the Fitted Model
• When the least-squares criterion is used, β̂₀ and β̂₁ are found by solving the following two expressions:

      β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²   and   β̂₀ = ȳ − β̂₁x̄

• To facilitate the arithmetic, note that

      Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − n·x̄·ȳ   and   Σ(xᵢ − x̄)² = Σxᵢ² − n·x̄²
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
b) Find the least-squares regression line. Do the calculations by hand.

Region   Sales (y)   Adv Exp (x)   xy    x²
A        1           1             1     1
B        1           2             2     4
C        2           1             2     1
D        2           3             6     9
E        3           2             6     4
F        3           4             12    16
G        4           3             12    9
H        4           5             20    25
I        5           5             25    25
J        5           6             30    36
Sum      30          32            116   130
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

From the previous slide, Σxᵢyᵢ = 116, Σxᵢ² = 130, x̄ = 3.2, and ȳ = 3.0.

      Σxᵢyᵢ − n·x̄·ȳ = 116 − (10)(3.2)(3.0) = 20
      Σxᵢ² − n·x̄² = 130 − (10)(3.2)² = 27.6

      β̂₁ = [Σxᵢyᵢ − n·x̄·ȳ] / [Σxᵢ² − n·x̄²] = 20/27.6 = 0.725
      β̂₀ = ȳ − β̂₁x̄ = 3.0 − (0.725)(3.2) = 0.681
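The hand calculation above can be sketched in a few lines of Python (data values taken from the slides):

```python
# Least-squares fit by hand for the Sales vs. Advertising Expenditures data.
x = [1, 2, 1, 3, 2, 4, 3, 5, 5, 6]   # Adv Exp, regions A-J
y = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]   # Sales
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n                           # 3.2 and 3.0
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar  # 116 - 96 = 20
sxx = sum(xi * xi for xi in x) - n * xbar * xbar              # 130 - 102.4 = 27.6
b1 = sxy / sxx          # slope: 20/27.6, about 0.725
b0 = ybar - b1 * xbar   # intercept: 3.0 - 0.725(3.2), about 0.681
```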
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• The least-squares regression equation is:

      Ŷ = 0.681 + 0.725x

  where 0.681 is the y-intercept and 0.725 is the slope coefficient.

• Notice the similarity to the slope-intercept form of a straight line: y = mx + b.
• In regression, the order of the terms is reversed.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
c) State the equation of the least-squares regression line by using Minitab. The Minitab output follows:

Regression Analysis: Sales versus Adv Exp
The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    0.6812   0.5694    1.20   0.266
Adv Exp     0.7246   0.1579    4.59   0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

The equation of the fitted line is:

      Predicted Sales = 0.681 + 0.725 Adv Exp
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
Example (Sales vs. Advertising Expenditures):d) Obtain the fitted line plot by using Minitab.
The fitted line plot reinforces the appropriateness of using a linear model.
[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.6812 + 0.7246 Adv Exp; S = 0.829702, R-Sq = 72.5%, R-Sq(adj) = 69.0%]
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):

e) Interpret the slope of the fitted line in the context of this problem.
   Sales increase by 0.725 units for each 1-unit increase in Advertising Expenditures.

f) Predict sales for a region that has advertising expenditures of 3 units.

      ŷ = 0.681 + 0.725(3.0) = 2.856 units
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

g) Determine the residual for Region A.

      Residual = (Actual Sales) − (Predicted Sales)
               = (1.0) − [0.6812 + (0.725)(1.0)]
               = 1 − 1.4062
               = −0.4062

The fitted values and residuals for all regions follow.

Region   Adv Exp   Sales   Fit     Residual
A        1.00      1.000   1.406   -0.406
B        2.00      1.000   2.130   -1.130
C        1.00      2.000   1.406   0.594
D        3.00      2.000   2.855   -0.855
E        2.00      3.000   2.130   0.870
F        4.00      3.000   3.580   -0.580
G        3.00      4.000   2.855   1.145
H        5.00      4.000   4.304   -0.304
I        5.00      5.000   4.304   0.696
J        6.00      5.000   5.029   -0.029
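As a check, the Fit and Residual columns can be reproduced with a short Python sketch (coefficients taken, rounded, from the Minitab output; with unrounded coefficients the residuals sum to exactly zero):

```python
# Reproduce the fitted values and residuals from the table above.
b0, b1 = 0.6812, 0.7246   # rounded coefficients from the Minitab output
x = [1, 2, 1, 3, 2, 4, 3, 5, 5, 6]
y = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
fits = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fits)]
# Region A: 1 - (0.6812 + 0.7246 * 1) is about -0.406, matching the table.
```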
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

h) On the fitted line plot, specify the residual for Region G.

[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.6812 + 0.7246 Adv Exp, with the residual for Region G marked as the vertical distance from the point (3, 4) to the line]
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• Corresponding to the fitted model is the population model:

      E(Y) = β₀ + β₁x

[Figure: The unknown population regression line E(Y) = β₀ + β₁x, with the probability distribution for Y shown at x = 5; the errors are the vertical distances from the observations to the line]
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• Properties of the population model
  • At each value of x, there is a probability distribution of Y values.
  • The means, E(Y), of these probability distributions lie on a straight line, where β₀ is the intercept and β₁ is the slope.
  • The expression for the model is:

        E(Yᵢ) = β₀ + β₁xᵢ,   or   Yᵢ = β₀ + β₁xᵢ + εᵢ

    where εᵢ is the error, or difference between Yᵢ and E(Yᵢ).
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• Assumptions
  1. The relation is in fact linear, so that the errors all have expected value zero: E(εᵢ) = 0 for all i.
  2. The errors all have the same variance: Var(εᵢ) = σ²ε for all i.
  3. The errors are independent of each other.

• The fitted line or model is an estimate of the population model.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• σ²ε is also unknown and needs to be estimated.
• Since the residuals estimate the errors, use the variation in the residuals to estimate the variation in the errors.
• There are 2 constraints on the residuals:

      Σᵢ residualᵢ = 0   and   Σᵢ (xᵢ)(residualᵢ) = 0

• For the Sales vs. Advertising Expenditures example, these constraints are shown on the next slide.
• Because of these two constraints, the residuals have (n − 2) degrees of freedom.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• The variation in the residuals is:

      s²ε = Σᵢ₌₁ⁿ (residualᵢ − 0)² / (n − 2)

• Use sε to estimate σε.

Region   Sales   Adv Exp   Fit     Res      (Adv Exp)(Res)
A        1       1         1.406   -0.406   -0.406
B        1       2         2.130   -1.130   -2.261
C        2       1         1.406   0.594    0.594
D        2       3         2.855   -0.855   -2.565
E        3       2         2.130   0.870    1.739
F        3       4         3.580   -0.580   -2.319
G        4       3         2.855   1.145    3.435
H        4       5         4.304   -0.304   -1.522
I        5       5         4.304   0.696    3.478
J        5       6         5.029   -0.029   -0.174
Sum                                0.000    0.000
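The residual standard deviation can be verified from scratch; a sketch using only the data from the slides:

```python
import math

# s_eps = sqrt( sum of squared residuals / (n - 2) ) for the Sales data.
x = [1, 2, 1, 3, 2, 4, 3, 5, 5, 6]
y = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / \
     (sum(xi * xi for xi in x) - n * xbar * xbar)
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_eps = math.sqrt(sse / (n - 2))   # matches Minitab's S = 0.829702
```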
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• Terminology for sε:
  – Sample standard deviation around the regression line
  – Standard error of estimate
  – Residual standard deviation
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
i) Identify the value of the sample standard deviation about the regression line from the Minitab output.

Regression Analysis: Sales versus Adv Exp
The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    0.6812   0.5694    1.20   0.266
Adv Exp     0.7246   0.1579    4.59   0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

The value of the sample standard deviation about the regression line is s = 0.829702.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• How is this useful?
“Like any other standard deviation, the residual standard deviation may be interpreted by the Empirical Rule. About 95% of the prediction errors will fall within +/-2 standard deviations of the mean error; the mean error is always 0 in the least-squares regression model. Therefore, a residual standard deviation of 0.83 means that about 95% of the prediction errors will be less than +/- 2(0.83) = +/-1.66” [Hildebrand, Ott and Gray]
11.2 Estimating Model Parameters
• A high-leverage point is one for which the x-value is, in some sense, far away from most of the x-values.
[Figure: Scatterplot with most of the x-values clustered together and a single point far to the right of them; that isolated point is a high leverage point]
11.2 Estimating Model Parameters
• Minitab flags a high leverage point with an X symbol. The determination is made by looking at the leverage, denoted by hᵢ, for each observation:

      hᵢ = 1/n + (xᵢ − x̄)² / Σⱼ₌₁ⁿ (xⱼ − x̄)²

  The second term is the squared deviation of a particular xᵢ relative to the variation in all the x's.

• Some limits are built in:

      1/n ≤ hᵢ ≤ 1;   Σᵢ₌₁ⁿ hᵢ = 2;   h̄ = 2/n

• If hᵢ > 6/n, the observation is flagged with an X symbol. Why 6/n? Because 6/n = 3h̄.
11.2 Estimating Model Parameters
Example (Sales vs. Advertising Expenditures):
j) Find the leverage for Region J, where the point is (6, 5).

      h_J = 1/10 + (6 − 3.2)²/27.6 = 0.384

• Is this a high leverage point? No, since h₁₀ = 0.384 < 6/10 = 0.6.
• A point with a large x-value is not necessarily a high leverage point.
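The leverage computation, together with the 6/n screening rule, can be sketched as:

```python
# Leverage h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2), screened at 6/n.
x = [1, 2, 1, 3, 2, 4, 3, 5, 5, 6]   # Adv Exp, regions A-J
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)              # 27.6
lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
h_J = lev[-1]                                        # Region J (x = 6): about 0.384
flagged = [h > 6 / n for h in lev]                   # no leverage exceeds 0.6
```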
11.2 Estimating Model Parameters
• The leverage values for all 10 observations follow (the HI1 column gives the leverage for each observation):

Region   Sales   Adv Exp   SRES1      HI1
A        1       1         -0.57455   0.275362
B        1       2         -1.47969   0.152174
C        2       1         0.84130    0.275362
D        2       3         -1.08720   0.101449
E        3       2         1.13822    0.152174
F        3       4         -0.74617   0.123188
G        4       3         1.45574    0.101449
H        4       5         -0.41464   0.217391
I        5       5         0.94776    0.217391
J        5       6         -0.04451   0.384058

• All leverages are less than 6/n = 0.6 ⇒ no high leverage points.
11.2 Estimating Model Parameters
• The fitted line plot follows. The graph suggests there are no high leverage points.

[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.6812 + 0.7246 Adv Exp; no point is isolated in the x-direction]
11.2 Estimating Model Parameters
Example (Sales vs. Advertising Expenditures): Region G has just sent an email stating that they reported incorrect values for sales and advertising expenditures. The bad news is that they actually spent $10,000 (10 units) on Advertising. The good news is that Sales were actually $80,000 (8 units). Is the revised data point for Region G a high leverage point?
      h_G = 1/10 + (10 − 3.9)²/68.9 = 0.64 > 0.60

The point (10, 8) is a high leverage point. The Minitab output follows.
11.2 Estimating Model Parameters
• The regression output for the revised data follows.

Regression Analysis: Sales versus Adv Exp
[Revised Data: Change (3,4) to (10,8)]

The regression equation is
Sales = 0.491 + 0.746 Adv Exp

Predictor   Coef      SE Coef   T      P
Constant    0.4906    0.4032    1.22   0.258
Adv Exp     0.74601   0.08577   8.70   0.000

S = 0.711965   R-Sq = 90.4%   R-Sq(adj) = 89.2%

Unusual Observations
Obs   Adv Exp   Sales   Fit     SE Fit   Residual   St Resid
7     10.0      8.000   7.951   0.570    0.049      0.12 X

X denotes an observation whose X value gives it large influence.
11.2 Estimating Model Parameters
• The leverage (HI) for all of the observations follows.

Region   Sales   Adv Exp   SRES       HI
A        1       1         -0.37674   0.222061
B        1       2         -1.49904   0.152395
C        2       1         1.21572    0.222061
D        2       3         -1.08582   0.111756
E        3       2         1.55218    0.152395
F        3       4         -0.70272   0.100145
G        8       10        0.11553    0.640058
H        4       5         -0.32986   0.117562
I        5       5         1.16534    0.117562
J        5       6         0.05128    0.164006
11.2 Estimating Model Parameters
• The fitted line plot reinforces that Region G is a high leverage point.

[Figure: Fitted regression line for Sales_1 vs. Adv Exp_1, Sales_1 = 0.4906 + 0.7460 Adv Exp_1; the revised Region G point (10, 8) sits far to the right of the other x-values]
11.2 Estimating Model Parameters
• A high leverage point is not necessarily "bad."
• A high leverage point has the potential to alter the fitted line.
• For this example, the potential was not realized: with the original data the slope was 0.725, and with the revised data the slope is 0.746.
11.2 Estimating Model Parameters
Standardized Residuals
• A residual is standardized as follows:

      SRᵢ = residualᵢ / (std. dev. of residualᵢ) = residualᵢ / (sε √(1 − hᵢ))

• A standardized residual (SR) depends on both the residual and the leverage.
11.2 Estimating Model Parameters
• A requirement is that the errors in the population regression model be normally distributed.
• Since the residuals estimate the errors, this implies the residuals should be normally distributed.
• Standardized Residuals can be viewed as values of a standard normal random variable.
• A Standardized Residual is considered large if |SRi| > 2.
11.2 Estimating Model Parameters
Example (Sales vs. Advertising Expenditures): Region G has just sent another email stating that they again reported incorrect values for sales and advertising expenditures. The good news is that they only spent $3,000 (3 units) on Advertising. The better news is that Sales were actually $50,000 (5 units). Does this result in a y-outlier for Region G?

Region G is now a y-outlier:

      SR_G = residual_G / (sε √(1 − h_G)) = 2.043 / [(1.043) √(1 − 0.1015)] = 2.07 > 2.0
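A sketch of this standardized-residual check, plugging in the values reported for the revised Region G point (residual = 2.043, s = 1.04257, leverage at x = 3 of about 0.1015):

```python
import math

# Standardized residual for the revised Region G observation (3, 5).
residual_G = 2.043
s_eps = 1.04257      # Minitab's S for the revised data
h_G = 0.1015         # leverage at x = 3 (the x-values are unchanged)
sr_G = residual_G / (s_eps * math.sqrt(1 - h_G))   # about 2.07, so |SR| > 2
```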
11.2 Estimating Model Parameters
The Minitab output for the revised data follows.

Regression Analysis: Sales versus Adv Exp
[Revised Data: Change (3,4) to (3,5)]

The regression equation is
Sales = 0.804 + 0.717 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    0.8043   0.7155    1.12   0.294
Adv Exp     0.7174   0.1985    3.61   0.007

S = 1.04257   R-Sq = 62.0%   R-Sq(adj) = 57.3%

Unusual Observations
Obs   Adv Exp   Sales   Fit     SE Fit   Residual   St Resid
7     3.00      5.000   2.957   0.332    2.043      2.07 R

R denotes an observation with a large standardized residual.
11.2 Estimating Model Parameters
The fitted line plot reinforces that Region G is a y-outlier.
[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.8043 + 0.7174 Adv Exp, with the revised Region G point (3, 5) marked as a y-outlier well above the line]
11.2 Estimating Model Parameters
High Influence Points
• A high leverage point that is also a y-outlier is a high influence point.

Example (Sales vs. Advertising Expenditures): Region G just can't get their act together. They have sent yet another email stating that they reported incorrect values for sales and advertising expenditures. The bad news is that they spent $10,000 (10 units) on Advertising. The really bad news is that Sales were still only $40,000 (4 units).

Does this result in a high influence point for Region G? From the Minitab output that follows, the answer is yes: the potential of the high leverage point to alter the regression line was realized.
11.2 Estimating Model Parameters
Regression Analysis: Sales versus Adv Exp
[Revised Data: Change (3,4) to (10,4)]

The regression equation is
Sales = 1.47 + 0.392 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    1.4717   0.6145    2.39   0.044
Adv Exp     0.3919   0.1307    3.00   0.017

S = 1.08509   R-Sq = 52.9%   R-Sq(adj) = 47.0%

Unusual Observations
Obs   Adv Exp   Sales   Fit     SE Fit   Residual   St Resid
7     10.0      4.000   5.390   0.868    -1.390     -2.14 RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
Section 11.3: Inferences about Regression Parameters
11.3 Inferences about Regression Parameters
• Model:

      E(Y) = β₀ + β₁x

• Model parsimony: Is x useful in predicting Y?
• Hypotheses:

      H₀: β₁ = 0   vs.   Hₐ: β₁ ≠ 0

• The sampling distribution of β̂₁ is needed to test H₀: β₁ = 0.
• Additional assumption for the population model: the errors (ε) are normally distributed.
11.3 Inferences about Regression Parameters
• Sampling distribution of β̂₁:
  The sampling distribution of β̂₁ is the probability distribution of the different values of β̂₁ which would be obtained with repeated sampling, when the values of the independent variable x are held constant for the repeated samples.
11.3 Inferences about Regression Parameters
• β̂₁ is normally distributed with:

      E(β̂₁) = β₁   and   Var(β̂₁) = σ²ε / Σ(xᵢ − x̄)²

• Substitute s²ε for σ²ε to estimate Var(β̂₁).
11.3 Inferences about Regression Parameters
• The estimated standard error of β̂₁ is:

      sε / √( Σ(xᵢ − x̄)² )

• Minitab denotes this as "SE Coef".
11.3 Inferences about Regression Parameters
• The distribution of

      (β̂₁ − β₁) / [Estimated Standard Error of β̂₁]

  is t with (n − 2) degrees of freedom.

• To test H₀: β₁ = 0, use

      t = (β̂₁ − 0) / [Estimated Standard Error of β̂₁]
11.3 Inferences about Regression Parameters
• The rejection region depends on the research hypothesis:

  Research Hypothesis     Rejection Region
  Hₐ: β₁ ≠ 0              Reject H₀ if t > t(α/2, n−2) or if t < −t(α/2, n−2)
  Hₐ: β₁ > 0              Reject H₀ if t > t(α, n−2)
  Hₐ: β₁ < 0              Reject H₀ if t < −t(α, n−2)
11.3 Inferences about Regression Parameters
Example (Sales vs. Advertising Expenditures):
Test H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0 at the 5% significance level by using the Minitab output.

Regression Analysis: Sales versus Adv Exp
The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    0.6812   0.5694    1.20   0.266
Adv Exp     0.7246   0.1579    4.59   0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%
11.3 Inferences about Regression Parameters
• The test statistic is

      t = β̂₁ / (SE Coef) = 0.7246 / 0.1579 = 4.59

  Since 4.59 > t(.025, 8) = 2.306, reject H₀: β₁ = 0.

• Equivalently, since the p-value = 0.002 < 0.05, reject H₀: β₁ = 0.

⇒ Advertising Expenditures is a significant predictor of Sales.
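The t test above can be sketched numerically (estimates from the Minitab output; the critical value 2.306 is t(.025, 8) from a t table):

```python
# t test for H0: beta1 = 0 in the Sales vs. Adv Exp regression.
b1_hat = 0.7246    # Coef for Adv Exp
se_b1 = 0.1579     # SE Coef
t_stat = b1_hat / se_b1        # about 4.59, matching Minitab's T column
t_crit = 2.306                 # t(.025, 8)
reject_H0 = abs(t_stat) > t_crit   # True: the slope is significant
```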
11.3 Inferences about Regression Parameters
• The F statistic can also be used to test H0: β1 = 0 vs. Ha: β1 ≠ 0
• The F test is equivalent to the t-test in simple regression.
• The F test is presented in the slides for Section 11.5.
Section 11.4: Predicting New Y Values Using Regression
11.4 Predicting New Y Values Using Regression

Scenario 1: Find a confidence interval for E(Y) at x(n+1)
• A new value of x, denoted by x(n+1), is specified.

[Figure: The population regression line E(Y), with the mean E(Y) marked at x = x(n+1)]
11.4 Predicting New Y Values Using Regression

Scenario 1:
• Point Estimate:

      ŷ(n+1) = β̂₀ + β̂₁x(n+1)

• Interval Estimate:

      ŷ(n+1) ± t(α/2, n−2) · sε · √( 1/n + (x(n+1) − x̄)² / Σ(xᵢ − x̄)² )

• The further that x(n+1) is from x̄, the wider the interval.
• The larger the range of x-values, the narrower the interval.
• The larger the number of data points, the narrower the interval.
11.4 Predicting New Y Values Using Regression

Example (Sales vs. Advertising Expenditures): With 95% confidence, within what limits is the average value for sales when advertising expenditures are $4,000?

• Point Estimate:  Ŷ = 0.681 + 0.725(4) = 3.58, with sε = 0.830 and Σ(xᵢ − x̄)² = 27.6
• Interval Estimate:

      3.58 ± (2.306)(0.830) √( 1/10 + (4 − 3.2)²/27.6 ),   or   [2.91, 4.25]

• This is a confidence interval estimate for the population average for sales regions with advertising expenditures of $4,000.
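The interval arithmetic can be sketched as follows (coefficients, s, and summary quantities taken from the slides):

```python
import math

# 95% confidence interval for E(Y) at Adv Exp = 4.
b0, b1 = 0.6812, 0.7246
s_eps, t_crit, n = 0.829702, 2.306, 10
xbar, sxx = 3.2, 27.6
x_new = 4.0
fit = b0 + b1 * x_new                                          # 3.580
half = t_crit * s_eps * math.sqrt(1 / n + (x_new - xbar) ** 2 / sxx)
ci = (fit - half, fit + half)                                  # about (2.91, 4.25)
```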
11.4 Predicting New Y Values Using Regression

Scenario 2: Find a prediction interval for Y(n+1) at x(n+1)

[Figure: The population regression line E(Y), with an individual new observation y(n+1) marked at x = x(n+1)]
11.4 Predicting New Y Values Using Regression

• Predicting a specific value of Y at a given value of x.
• Point Estimate:

      ŷ(n+1) = β̂₀ + β̂₁x(n+1)

• Interval Estimate:

      ŷ(n+1) ± t(α/2, n−2) · sε · √( 1 + 1/n + (x(n+1) − x̄)² / Σ(xᵢ − x̄)² )

• A prediction interval for Y(n+1) is wider than a confidence interval for E(Y(n+1)).
11.4 Predicting New Y Values Using Regression

Example (Sales vs. Advertising Expenditures):
Suppose a new region is to be allowed advertising expenditures of $4,000. What sales revenue can be anticipated? Obtain a 95% prediction interval.

• Point Estimate:  ŷ = 3.58, with sε = 0.830 and Σ(xᵢ − x̄)² = 27.6
• Interval Estimate:

      3.58 ± (2.306)(0.830) √( 1 + 1/10 + (4 − 3.2)²/27.6 ),   or   [1.55, 5.61]

This is a 95% prediction interval for the sales of an individual region when advertising expenditures are $4,000.
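The prediction interval differs from the confidence interval only by the extra "1 +" under the square root; a sketch:

```python
import math

# 95% prediction interval for an individual region at Adv Exp = 4.
b0, b1 = 0.6812, 0.7246
s_eps, t_crit, n = 0.829702, 2.306, 10
xbar, sxx = 3.2, 27.6
x_new = 4.0
fit = b0 + b1 * x_new
half = t_crit * s_eps * math.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / sxx)
pi = (fit - half, fit + half)    # about (1.55, 5.61), matching Minitab's 95% PI
```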
11.4 Predicting New Y Values Using Regression

The Minitab output follows for the confidence and prediction intervals:

Values of Predictors for New Observations
New Obs   Adv Exp
1         4.00

Predicted Values for New Observations
New Obs   Fit     SE Fit   95% CI            95% PI
1         3.580   0.291    (2.908, 4.251)    (1.552, 5.607)
11.4 Predicting New Y Values Using Regression

The confidence and prediction intervals are shown in the following graph:

[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.6812 + 0.7246 Adv Exp, with 95% confidence bands and 95% prediction bands; the prediction bands lie outside the confidence bands at every x]
Section 11.5: Correlation
11.5 Correlation
• Coefficient of Determination (r²xy, or R²)
• Based on explained and unexplained deviation:

      yᵢ − ȳ = (ŷᵢ − ȳ) + (yᵢ − ŷᵢ)

      (Total deviation) = (Explained deviation) + (Unexplained deviation)

• Of the total deviation, how much is explained by fitting the regression line, and how much is left over?
Example (Sales vs. Advertising Expenditures): For region 9, x = 5 and y = 5 (with ȳ = 3).

ŷ_9 = 0.6812 + 0.7246(5) = 4.304

Total deviation: (y_9 − ȳ) = (5 − 3) = 2
Explained deviation: (ŷ_9 − ȳ) = (4.304 − 3) = 1.304
Unexplained deviation: (y_9 − ŷ_9) = (5 − 4.304) = 0.696
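The three deviations for region 9 can be checked numerically. This is a sketch; ȳ = 3 is implied by the slide's arithmetic rather than stated directly.

```python
b0, b1 = 0.6812, 0.7246   # fitted line: Sales = 0.6812 + 0.7246 Adv Exp
ybar = 3.0                # mean of Sales, implied by the slide's arithmetic (assumed)
x9, y9 = 5.0, 5.0         # observation for region 9

y_hat9 = b0 + b1 * x9         # fitted value
total = y9 - ybar             # total deviation
explained = y_hat9 - ybar     # explained deviation
unexplained = y9 - y_hat9     # unexplained deviation

# the decomposition holds exactly: total = explained + unexplained
print(y_hat9, total, explained, unexplained)
```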
This is shown on the fitted line plot:
[Fitted line plot of Sales vs. Adv Exp: Sales = 0.6812 + 0.7246 Adv Exp, with the total, explained, and unexplained deviations marked relative to ȳ; S = 0.829702, R-Sq = 72.5%, R-Sq(adj) = 69.0%]
• Square both sides to account for negative deviations:

(y_i − ȳ)² = (ŷ_i − ȳ)² + (y_i − ŷ_i)² + [cross-product term]

• Sum over all observations (the cross-product terms sum to zero):

Σ(y_i − ȳ)² = Σ(ŷ_i − ȳ)² + Σ(y_i − ŷ_i)²

Sum of Squares due to Total (SST) = Sum of Squares due to Regression (SSR) + Sum of Squares due to Error (SSE)
• SST = SSR + SSE:

Σ(y_i − ȳ)² = Σ(ŷ_i − ȳ)² + Σ(y_i − ŷ_i)²

• Each Sum of Squares has "degrees of freedom" (df) associated with it.
• The degrees of freedom for each Sum of Squares satisfy:

(n − 1) = 1 + (n − 2)
• Mean Square = Sum of Squares/ df
• MSR = SSR/df = SSR /1 = SSR
• MSE = SSE/df = SSE/(n – 2)
• This leads to another test statistic, the F-statistic, for testing H0: β1 = 0.
• F-Statistic: F = MSR/MSE
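With the sums of squares from the Minitab ANOVA table shown later in this section, the mean squares and F-statistic work out as follows (a sketch using the slide's SSR and SSE):

```python
n = 10
ssr, sse = 14.493, 5.507   # from the Minitab ANOVA table (Sales vs. Adv Exp)
sst = ssr + sse            # SST = SSR + SSE = 20.000

msr = ssr / 1              # MSR = SSR / 1
mse = sse / (n - 2)        # MSE = SSE / (n - 2), here 8 d.f.
f_stat = msr / mse         # F = MSR / MSE
print(round(sst, 3), round(mse, 3), round(f_stat, 2))
```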
• Rationale: If SSR is large relative to SSE, this indicates the independent variable x has real predictive value.
• The F test is one-tailed: H0 is rejected only for large values of F.
• Rejection Region: Reject H0 : β1 = 0 if F-Statistic > Fα,1,n-2
Or, reject H0 if p-value < α.
Example: Sales and Advertising Expenditures
Locate the value of the F-Statistic, the associated p-value, and state your conclusion.
Regression Analysis: Sales versus Adv Exp
The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor    Coef  SE Coef     T      P
Constant   0.6812   0.5694  1.20  0.266
Adv Exp    0.7246   0.1579  4.59  0.002
S = 0.829702 R-Sq = 72.5% R-Sq(adj) = 69.0%
Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  14.493  14.493  21.05  0.002
Residual Error   8   5.507   0.688
Total            9  20.000
• F = MSR/MSE = 14.493/0.688 ≈ 21.05
Since 21.05 > F.05,1,8 = 5.32, reject H0: β1 = 0.
• Percentage points of the F distribution are in Table 6.
• Equivalently, since p-value = .002, reject H0: β1 = 0.
• In simple regression, t² = F.
Example: Sales and Advertising Expenditures
t² = (4.59)² = 21.07 = F (up to rounding in the printout)
• In simple regression, the p-values for the F test and t test are equal.
Example: Sales and Advertising Expenditures
p-value for F test = .002 = p-value for t-test
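The t–F equivalence can be verified directly from the printed output; the small discrepancy is rounding in Minitab's displayed values.

```python
t = 4.59      # t-statistic for the slope, from the Minitab output
f = 21.05     # F-statistic, from the ANOVA table

# in simple regression t^2 = F exactly; here both are rounded in the printout
print(round(t**2, 2), f)
assert abs(t**2 - f) < 0.05
```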
• The Coefficient of Determination, denoted r²_yx or R², is

R² = SSR/SST

• R² specifies how much of the total variation is explained by the fitted line.
Example (Sales vs. Advertising Expenditures):Use the Minitab output to find the value of the coefficient of determination and interpret the value in the context of this problem.
Regression Analysis: Sales versus Adv Exp

The regression equation is
Sales = 0.681 + 0.725 Adv Exp

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

From the output, R² = 72.5%.

Interpretation: 72.5% of the variation in Sales is explained by the regression model with Advertising Expenditures as the predictor.
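R² can be computed directly from the ANOVA sums of squares (a sketch using the table's values):

```python
ssr, sst = 14.493, 20.000   # SSR and SST from the ANOVA table
r_sq = ssr / sst            # coefficient of determination, R^2 = SSR/SST
print(r_sq)                 # about 0.725, i.e. 72.5%
```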
• The coefficient of determination, R², ranges from 0 to 1.
• The coefficient of correlation, denoted by r, is obtained from R²:

r = ±√R²

where the sign (+, −) of r is the same as the sign of β̂1.
• r takes on values from -1 to +1
• Correlation measures the strength of the linear relationship between x and Y.
• The coefficient of determination (R2) and the coefficient of correlation (r) have very different interpretations.
• In correlation, both variables are on an equal footing.
• It does not matter which is labeled x and which is labeled Y. The objective is to measure the association between x and Y.
• This is in contrast to regression analysis, where the objective is to use x to predict Y.
Example (Sales vs. Advertising Expenditures)

r = √R² = √0.725 = +0.851

• r has a positive sign because the slope β̂1 = +0.725 is positive.

Warning: Correlation does not imply causation.
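The sign-carrying step can be written out explicitly; this sketch uses `copysign` to transfer the slope's sign onto the square root.

```python
from math import sqrt, copysign

r_sq = 0.725   # coefficient of determination from the output
b1 = 0.7246    # slope estimate; its sign determines the sign of r

r = copysign(sqrt(r_sq), b1)   # here b1 > 0, so r is positive
print(round(r, 3))             # about +0.851
```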
• r denotes the sample correlation coefficient.
• r estimates the population correlation (ρ)
• Hypothesis testing on ρ requires certain assumptions.
• In regression analysis, the values of x are predetermined constants. In correlation analysis, the values of x are randomly selected.
• In correlation analysis, the x values also come from a normal distribution. More precisely, the random sample of (x, y) values is drawn from a bivariate normal distribution.
• To test H0: ρ = 0, the test statistic is

t = r√(n − 2) / √(1 − r²)

where t has (n − 2) d.f.

• The rejection region depends on the form of Ha:
• If Ha: ρ > 0, reject H0 if t > tα,n−2
• If Ha: ρ < 0, reject H0 if t < −tα,n−2
• If Ha: ρ ≠ 0, reject H0 if | t | > tα/2,n−2
Exercise 11.27: A survey of recent M.B.A. graduates of a business school obtained data on first-year salary and years of prior work experience. [The data are in Exercise 11.27. Assume that the 51 students were randomly selected.]
• The Minitab output follows:

Pearson correlation of SALARY and EXPER = 0.703
P-Value = 0.000

• The test statistic is

t = r√(n − 2) / √(1 − r²) = 0.703√(51 − 2) / √(1 − (0.703)²) = 6.92

• For such a large t-value, the p-value is 0.000.
Conclusion: Reject H0: ρ = 0.
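The t computation for Exercise 11.27 can be verified with a short sketch, using the correlation and sample size from the output:

```python
from math import sqrt

r, n = 0.703, 51   # Pearson correlation and sample size from Exercise 11.27
t = r * sqrt(n - 2) / sqrt(1 - r**2)   # test statistic for H0: rho = 0
print(round(t, 2))   # about 6.92, with n - 2 = 49 d.f.
```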
Keywords: Chapter 11
• Regression analysis
• Independent variable
• Predictor variable
• Dependent variable
• Response variable
• Scatterplot
• Simple regression
• Slope
• y-intercept
• Least-squares method
• High leverage points
• Y-outliers
• High influence points
• Standard error of the estimate
• t-test on slope
• F-test on slope
• Coefficient of determination
• Coefficient of correlation
Summary of Chapter 11
• Understanding the role of the scatterplot
• Understanding the rationale of the least squares method for finding the best fitting line
• Understanding the impact of high leverage points, y-outliers and high influence points on the fitted line
• Understanding variability around the regression line
• Testing the slope coefficient using the t-test
• Testing the slope coefficient using the F-test
• Understanding the difference between the correlation coefficient and the coefficient of determination