Hildebrand, Ott & Gray, Basic Statistical Ideas for Managers, 2nd edition, Chapter 11. Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Chapter 11: Linear Regression and Correlation Methods
Hildebrand, Ott and Gray
Basic Statistical Ideas for Managers
Second Edition
Learning Objectives for Chapter 11
• Using the scatterplot in regression analysis
• Using the method of least squares for finding the best-fitting line
• Understanding the underlying assumptions in regression analysis
• Determining whether any observations are high leverage points, y-outliers, or high influence points
• Using the t-test for testing the significance of the slope coefficient
• Using the F-test for testing the slope coefficient
• Understanding the difference between the correlation coefficient and the coefficient of determination
Sections 11.1 – 11.2: The Linear Regression Model; Estimating Model Parameters
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• Objective: Model the relationship between a response or dependent variable (Y) and one predictor or independent variable (x).
Examples:
• For consumer purchase decisions, let Y = market share and x = the consumer's degree of 'top of mind' brand awareness (% of consumers who name this brand first).
• In beta analysis, let Y = return on a security (IBM) over a period of time and x = return on the market (DJIA).
• For a particular corporation, let Y = sales revenue for a region at the year's end and x = advertising expenditures for the year. Y is recorded in tens of thousands of dollars and x is recorded in thousands of dollars.
• In all three examples, can x be used to predict Y?
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• Consider the example where Y = sales revenue and x = advertising expenditures. The data follow:
Region   Sales   Adv Exp
A        1       1
B        1       2
C        2       1
D        2       3
E        3       2
F        3       4
G        4       3
H        4       5
I        5       5
J        5       6
a) Is there a linear relationship between Sales and Advertising Expenditures?
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• A scatterplot of the data follows:
• The scatterplot is used to assess whether or not there is a linear relationship between Sales and Advertising Expenditures.
[Figure: Scatterplot for Sales vs. Advertising Expenditures; Sales (1 to 5) on the vertical axis, Adv Exp (1 to 6) on the horizontal axis]
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• Is a linear relationship feasible?
• From the scatterplot, it appears that as Advertising Expenditures increase, Sales increase linearly.
• The relationship between Sales and Advertising Expenditures is an example of a statistical relationship between two variables. If a straight line were fit to the points, not all of the points would fall on the line; there are other factors besides Advertising Expenditures that affect Sales.
• An example of a deterministic relationship is: Total Costs = Fixed Costs + Variable Costs.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Fitted Model
• The general expression for the line to be fit is:

      ŷ = β̂₀ + β̂₁x

• Residuals are prediction errors in the sample.
• The residual for an observation (xᵢ, yᵢ) is the difference between the actual value of sales and the predicted value:

      Residualᵢ = yᵢ − ŷᵢ = yᵢ − (β̂₀ + β̂₁xᵢ)

• How should the fitted line be determined?
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Possible criteria for fitting a line passing through (x̄, ȳ):

i.   Minimize the sum of the residuals:  min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)
     Deficiency: An infinite number of lines satisfy this criterion.

ii.  Minimize the sum of the absolute values of the residuals:  min Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|
     Deficiency: The procedure is not available in most statistical software.

iii. Minimize the sum of the squared residuals:  min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Criterion (iii) is known as the Method of Least Squares.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Procedure to obtain the Intercept and Slope of the Fitted Model
• When the least-squares criterion is used, β̂₀ and β̂₁ are found by solving the following two expressions:

      β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²   and   β̂₀ = ȳ − β̂₁x̄

• To facilitate the arithmetic, note that

      Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − n·x̄·ȳ   and   Σ(xᵢ − x̄)² = Σxᵢ² − n·x̄²
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
b) Find the least-squares regression line. Do the calculations by hand.

Region   Sales (y)   Adv Exp (x)   xy    x²
A        1           1             1     1
B        1           2             2     4
C        2           1             2     1
D        2           3             6     9
E        3           2             6     4
F        3           4             12    16
G        4           3             12    9
H        4           5             20    25
I        5           5             25    25
J        5           6             30    36
Sum      30          32            116   130
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

From the previous slide, Σxᵢyᵢ = 116, Σxᵢ² = 130, x̄ = 3.2, and ȳ = 3.0.

      Σxᵢyᵢ − n·x̄·ȳ = 116 − (10)(3.2)(3.0) = 20
      Σxᵢ² − n·x̄² = 130 − (10)(3.2)² = 27.6

      β̂₁ = [Σxᵢyᵢ − n·x̄·ȳ] / [Σxᵢ² − n·x̄²] = 20/27.6 = 0.725
      β̂₀ = ȳ − β̂₁x̄ = 3.0 − (0.725)(3.2) = 0.681
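The hand calculation above can be sketched in a few lines of Python (data values taken from the slides):

```python
# Least-squares fit by hand for the Sales vs. Advertising Expenditures data.
x = [1, 2, 1, 3, 2, 4, 3, 5, 5, 6]   # Adv Exp, regions A-J
y = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]   # Sales
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n                           # 3.2 and 3.0
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar  # 116 - 96 = 20
sxx = sum(xi * xi for xi in x) - n * xbar * xbar              # 130 - 102.4 = 27.6
b1 = sxy / sxx          # slope: 20/27.6, about 0.725
b0 = ybar - b1 * xbar   # intercept: 3.0 - 0.725(3.2), about 0.681
```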
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• The least-squares regression equation is:

      Ŷ = 0.681 + 0.725x

  where 0.681 is the y-intercept and 0.725 is the slope coefficient.

• Notice the similarity to the slope-intercept form of a straight line: y = mx + b.
• In regression, the order of the terms is reversed.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
c) State the equation of the least-squares regression line by using Minitab. The Minitab output follows:

Regression Analysis: Sales versus Adv Exp
The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    0.6812   0.5694    1.20   0.266
Adv Exp     0.7246   0.1579    4.59   0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

The equation of the fitted line is:

      Predicted Sales = 0.681 + 0.725 Adv Exp
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
Example (Sales vs. Advertising Expenditures):d) Obtain the fitted line plot by using Minitab.
The fitted line plot reinforces the appropriateness of using a linear model.
[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.6812 + 0.7246 Adv Exp; S = 0.829702, R-Sq = 72.5%, R-Sq(adj) = 69.0%]
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):

e) Interpret the slope of the fitted line in the context of this problem.
   Sales increase by 0.725 units for each 1-unit increase in Advertising Expenditures.

f) Predict sales for a region that has advertising expenditures of 3 units.

      ŷ = 0.681 + 0.725(3.0) = 2.856 units
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

g) Determine the residual for Region A.

      Residual = (Actual Sales) − (Predicted Sales)
               = (1.0) − [0.6812 + (0.725)(1.0)]
               = 1 − 1.4062
               = −0.4062

The fitted values and residuals for all regions follow.

Region   Adv Exp   Sales   Fit     Residual
A        1.00      1.000   1.406   -0.406
B        2.00      1.000   2.130   -1.130
C        1.00      2.000   1.406   0.594
D        3.00      2.000   2.855   -0.855
E        2.00      3.000   2.130   0.870
F        4.00      3.000   3.580   -0.580
G        3.00      4.000   2.855   1.145
H        5.00      4.000   4.304   -0.304
I        5.00      5.000   4.304   0.696
J        6.00      5.000   5.029   -0.029
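As a check, the Fit and Residual columns can be reproduced with a short Python sketch (coefficients taken, rounded, from the Minitab output; with unrounded coefficients the residuals sum to exactly zero):

```python
# Reproduce the fitted values and residuals from the table above.
b0, b1 = 0.6812, 0.7246   # rounded coefficients from the Minitab output
x = [1, 2, 1, 3, 2, 4, 3, 5, 5, 6]
y = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
fits = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fits)]
# Region A: 1 - (0.6812 + 0.7246 * 1) is about -0.406, matching the table.
```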
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

h) On the fitted line plot, specify the residual for Region G.

[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.6812 + 0.7246 Adv Exp, with the residual for Region G marked as the vertical distance from the point (3, 4) to the line]
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• Corresponding to the fitted model is the population model:

      E(Y) = β₀ + β₁x

[Figure: The unknown population regression line E(Y) = β₀ + β₁x, with the probability distribution for Y shown at x = 5; the errors are the vertical distances from the observations to the line]
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• Properties of the population model
  • At each value of x, there is a probability distribution of Y values.
  • The means, E(Y), of these probability distributions lie on a straight line, where β₀ is the intercept and β₁ is the slope.
  • The expression for the model is:

        E(Yᵢ) = β₀ + β₁xᵢ,   or   Yᵢ = β₀ + β₁xᵢ + εᵢ

    where εᵢ is the error, or difference between Yᵢ and E(Yᵢ).
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• Assumptions
  1. The relation is in fact linear, so that the errors all have expected value zero: E(εᵢ) = 0 for all i.
  2. The errors all have the same variance: Var(εᵢ) = σ²ε for all i.
  3. The errors are independent of each other.

• The fitted line or model is an estimate of the population model.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• σ²ε is also unknown and needs to be estimated.
• Since the residuals estimate the errors, use the variation in the residuals to estimate the variation in the errors.
• There are 2 constraints on the residuals:

      Σᵢ residualᵢ = 0   and   Σᵢ (xᵢ)(residualᵢ) = 0

• For the Sales vs. Advertising Expenditures example, these constraints are shown on the next slide.
• Because of these two constraints, the residuals have (n − 2) degrees of freedom.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• The variation in the residuals is:

      s²ε = Σᵢ₌₁ⁿ (residualᵢ − 0)² / (n − 2)

• Use sε to estimate σε.

Region   Sales   Adv Exp   Fit     Res      (Adv Exp)(Res)
A        1       1         1.406   -0.406   -0.406
B        1       2         2.130   -1.130   -2.261
C        2       1         1.406   0.594    0.594
D        2       3         2.855   -0.855   -2.565
E        3       2         2.130   0.870    1.739
F        3       4         3.580   -0.580   -2.319
G        4       3         2.855   1.145    3.435
H        4       5         4.304   -0.304   -1.522
I        5       5         4.304   0.696    3.478
J        5       6         5.029   -0.029   -0.174
Sum                                0.000    0.000
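The residual standard deviation can be verified from scratch; a sketch using only the data from the slides:

```python
import math

# s_eps = sqrt( sum of squared residuals / (n - 2) ) for the Sales data.
x = [1, 2, 1, 3, 2, 4, 3, 5, 5, 6]
y = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / \
     (sum(xi * xi for xi in x) - n * xbar * xbar)
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_eps = math.sqrt(sse / (n - 2))   # matches Minitab's S = 0.829702
```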
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

• Terminology for sε:
  – Sample standard deviation around the regression line
  – Standard error of estimate
  – Residual standard deviation
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters

Example (Sales vs. Advertising Expenditures):
i) Identify the value of the sample standard deviation about the regression line from the Minitab output.

Regression Analysis: Sales versus Adv Exp
The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    0.6812   0.5694    1.20   0.266
Adv Exp     0.7246   0.1579    4.59   0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

The value of the sample standard deviation about the regression line is s = 0.829702.
11.1 The Linear Regression Model; 11.2 Estimating Model Parameters
• How is this useful?
“Like any other standard deviation, the residual standard deviation may be interpreted by the Empirical Rule. About 95% of the prediction errors will fall within +/-2 standard deviations of the mean error; the mean error is always 0 in the least-squares regression model. Therefore, a residual standard deviation of 0.83 means that about 95% of the prediction errors will be less than +/- 2(0.83) = +/-1.66” [Hildebrand, Ott and Gray]
11.2 Estimating Model Parameters
• A high-leverage point is one for which the x-value is, in some sense, far away from most of the x-values.
[Figure: Scatterplot with most of the x-values clustered together and a single point far to the right of them; that isolated point is a high leverage point]
11.2 Estimating Model Parameters
• Minitab flags a high leverage point with an X symbol. The determination is made by looking at the leverage, denoted by hᵢ, for each observation:

      hᵢ = 1/n + (xᵢ − x̄)² / Σⱼ₌₁ⁿ (xⱼ − x̄)²

  The second term is the squared deviation of a particular xᵢ relative to the variation in all the x's.

• Some limits are built in:

      1/n ≤ hᵢ ≤ 1;   Σᵢ₌₁ⁿ hᵢ = 2;   h̄ = 2/n

• If hᵢ > 6/n, the observation is flagged with an X symbol. Why 6/n? Because 6/n = 3h̄.
11.2 Estimating Model Parameters
Example (Sales vs. Advertising Expenditures):
j) Find the leverage for Region J, where the point is (6, 5).

      h_J = 1/10 + (6 − 3.2)²/27.6 = 0.384

• Is this a high leverage point? No, since h₁₀ = 0.384 < 6/10 = 0.6.
• A point with a large x-value is not necessarily a high leverage point.
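The leverage computation, together with the 6/n screening rule, can be sketched as:

```python
# Leverage h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2), screened at 6/n.
x = [1, 2, 1, 3, 2, 4, 3, 5, 5, 6]   # Adv Exp, regions A-J
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)              # 27.6
lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
h_J = lev[-1]                                        # Region J (x = 6): about 0.384
flagged = [h > 6 / n for h in lev]                   # no leverage exceeds 0.6
```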
11.2 Estimating Model Parameters
• The leverage values for all 10 observations follow (the HI1 column gives the leverage for each observation):

Region   Sales   Adv Exp   SRES1      HI1
A        1       1         -0.57455   0.275362
B        1       2         -1.47969   0.152174
C        2       1         0.84130    0.275362
D        2       3         -1.08720   0.101449
E        3       2         1.13822    0.152174
F        3       4         -0.74617   0.123188
G        4       3         1.45574    0.101449
H        4       5         -0.41464   0.217391
I        5       5         0.94776    0.217391
J        5       6         -0.04451   0.384058

• All leverages are less than 6/n = 0.6 ⇒ no high leverage points.
11.2 Estimating Model Parameters
• The fitted line plot follows. The graph suggests there are no high leverage points.

[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.6812 + 0.7246 Adv Exp; no point is isolated in the x-direction]
11.2 Estimating Model Parameters
Example (Sales vs. Advertising Expenditures): Region G has just sent an email stating that they reported incorrect values for sales and advertising expenditures. The bad news is that they actually spent $10,000 (10 units) on Advertising. The good news is that Sales were actually $80,000 (8 units). Is the revised data point for Region G a high leverage point?
      h_G = 1/10 + (10 − 3.9)²/68.9 = 0.64 > 0.60

The point (10, 8) is a high leverage point. The Minitab output follows.
11.2 Estimating Model Parameters
• The regression output for the revised data follows.

Regression Analysis: Sales versus Adv Exp
[Revised Data: Change (3,4) to (10,8)]

The regression equation is
Sales = 0.491 + 0.746 Adv Exp

Predictor   Coef      SE Coef   T      P
Constant    0.4906    0.4032    1.22   0.258
Adv Exp     0.74601   0.08577   8.70   0.000

S = 0.711965   R-Sq = 90.4%   R-Sq(adj) = 89.2%

Unusual Observations
Obs   Adv Exp   Sales   Fit     SE Fit   Residual   St Resid
7     10.0      8.000   7.951   0.570    0.049      0.12 X

X denotes an observation whose X value gives it large influence.
11.2 Estimating Model Parameters
• The leverage (HI) for all of the observations follows.

Region   Sales   Adv Exp   SRES       HI
A        1       1         -0.37674   0.222061
B        1       2         -1.49904   0.152395
C        2       1         1.21572    0.222061
D        2       3         -1.08582   0.111756
E        3       2         1.55218    0.152395
F        3       4         -0.70272   0.100145
G        8       10        0.11553    0.640058
H        4       5         -0.32986   0.117562
I        5       5         1.16534    0.117562
J        5       6         0.05128    0.164006
11.2 Estimating Model Parameters
• The fitted line plot reinforces that Region G is a high leverage point.

[Figure: Fitted regression line for Sales_1 vs. Adv Exp_1, Sales_1 = 0.4906 + 0.7460 Adv Exp_1; the revised Region G point (10, 8) sits far to the right of the other x-values]
11.2 Estimating Model Parameters
• A high leverage point is not necessarily "bad."
• A high leverage point has the potential to alter the fitted line.
• For this example, the potential was not realized: with the original data the slope was 0.725, and with the revised data the slope is 0.746.
11.2 Estimating Model Parameters
Standardized Residuals
• A residual is standardized as follows:

      SRᵢ = residualᵢ / (std. dev. of residualᵢ) = residualᵢ / (sε √(1 − hᵢ))

• A standardized residual (SR) depends on both the residual and the leverage.
11.2 Estimating Model Parameters
• A requirement is that the errors in the population regression model be normally distributed.
• Since the residuals estimate the errors, this implies the residuals should be normally distributed.
• Standardized Residuals can be viewed as values of a standard normal random variable.
• A Standardized Residual is considered large if |SRi| > 2.
11.2 Estimating Model Parameters
Example (Sales vs. Advertising Expenditures): Region G has just sent another email stating that they again reported incorrect values for sales and advertising expenditures. The good news is that they only spent $3,000 (3 units) on Advertising. The better news is that Sales were actually $50,000 (5 units). Does this result in a y-outlier for Region G?

Region G is now a y-outlier:

      SR_G = residual_G / (sε √(1 − h_G)) = 2.043 / [(1.043) √(1 − 0.1015)] = 2.07 > 2.0
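A sketch of this standardized-residual check, plugging in the values reported for the revised Region G point (residual = 2.043, s = 1.04257, leverage at x = 3 of about 0.1015):

```python
import math

# Standardized residual for the revised Region G observation (3, 5).
residual_G = 2.043
s_eps = 1.04257      # Minitab's S for the revised data
h_G = 0.1015         # leverage at x = 3 (the x-values are unchanged)
sr_G = residual_G / (s_eps * math.sqrt(1 - h_G))   # about 2.07, so |SR| > 2
```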
11.2 Estimating Model Parameters
The Minitab output for the revised data follows.

Regression Analysis: Sales versus Adv Exp
[Revised Data: Change (3,4) to (3,5)]

The regression equation is
Sales = 0.804 + 0.717 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    0.8043   0.7155    1.12   0.294
Adv Exp     0.7174   0.1985    3.61   0.007

S = 1.04257   R-Sq = 62.0%   R-Sq(adj) = 57.3%

Unusual Observations
Obs   Adv Exp   Sales   Fit     SE Fit   Residual   St Resid
7     3.00      5.000   2.957   0.332    2.043      2.07 R

R denotes an observation with a large standardized residual.
11.2 Estimating Model Parameters
The fitted line plot reinforces that Region G is a y-outlier.
[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.8043 + 0.7174 Adv Exp, with the revised Region G point (3, 5) marked as a y-outlier well above the line]
11.2 Estimating Model Parameters
High Influence Points
• A high leverage point that is also a y-outlier is a high influence point.

Example (Sales vs. Advertising Expenditures): Region G just can't get their act together. They have sent yet another email stating that they reported incorrect values for sales and advertising expenditures. The bad news is that they spent $10,000 (10 units) on Advertising. The really bad news is that Sales were still only $40,000 (4 units).

Does this result in a high influence point for Region G? From the Minitab output that follows, the answer is yes: the potential of the high leverage point to alter the regression line was realized.
11.2 Estimating Model Parameters
Regression Analysis: Sales versus Adv Exp
[Revised Data: Change (3,4) to (10,4)]

The regression equation is
Sales = 1.47 + 0.392 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    1.4717   0.6145    2.39   0.044
Adv Exp     0.3919   0.1307    3.00   0.017

S = 1.08509   R-Sq = 52.9%   R-Sq(adj) = 47.0%

Unusual Observations
Obs   Adv Exp   Sales   Fit     SE Fit   Residual   St Resid
7     10.0      4.000   5.390   0.868    -1.390     -2.14 RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
Section 11.3: Inferences about Regression Parameters
11.3 Inferences about Regression Parameters
• Model:

      E(Y) = β₀ + β₁x

• Model parsimony: Is x useful in predicting Y?
• Hypotheses:

      H₀: β₁ = 0   vs.   Hₐ: β₁ ≠ 0

• The sampling distribution of β̂₁ is needed to test H₀: β₁ = 0.
• Additional assumption for the population model: the errors (ε) are normally distributed.
11.3 Inferences about Regression Parameters
• Sampling distribution of β̂₁:
  The sampling distribution of β̂₁ is the probability distribution of the different values of β̂₁ which would be obtained with repeated sampling, when the values of the independent variable x are held constant for the repeated samples.
11.3 Inferences about Regression Parameters
• β̂₁ is normally distributed with:

      E(β̂₁) = β₁   and   Var(β̂₁) = σ²ε / Σ(xᵢ − x̄)²

• Substitute s²ε for σ²ε to estimate Var(β̂₁).
11.3 Inferences about Regression Parameters
• The estimated standard error of β̂₁ is:

      sε / √( Σ(xᵢ − x̄)² )

• Minitab denotes this as "SE Coef".
11.3 Inferences about Regression Parameters
• The distribution of

      (β̂₁ − β₁) / [Estimated Standard Error of β̂₁]

  is t with (n − 2) degrees of freedom.

• To test H₀: β₁ = 0, use

      t = (β̂₁ − 0) / [Estimated Standard Error of β̂₁]
11.3 Inferences about Regression Parameters
• The rejection region depends on the research hypothesis:

  Research Hypothesis     Rejection Region
  Hₐ: β₁ ≠ 0              Reject H₀ if t > t(α/2, n−2) or if t < −t(α/2, n−2)
  Hₐ: β₁ > 0              Reject H₀ if t > t(α, n−2)
  Hₐ: β₁ < 0              Reject H₀ if t < −t(α, n−2)
11.3 Inferences about Regression Parameters
Example (Sales vs. Advertising Expenditures):
Test H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0 at the 5% significance level by using the Minitab output.

Regression Analysis: Sales versus Adv Exp
The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor   Coef     SE Coef   T      P
Constant    0.6812   0.5694    1.20   0.266
Adv Exp     0.7246   0.1579    4.59   0.002

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%
11.3 Inferences about Regression Parameters
• The test statistic is

      t = β̂₁ / (SE Coef) = 0.7246 / 0.1579 = 4.59

  Since 4.59 > t(.025, 8) = 2.306, reject H₀: β₁ = 0.

• Equivalently, since the p-value = 0.002 < 0.05, reject H₀: β₁ = 0.

⇒ Advertising Expenditures is a significant predictor of Sales.
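The t test above can be sketched numerically (estimates from the Minitab output; the critical value 2.306 is t(.025, 8) from a t table):

```python
# t test for H0: beta1 = 0 in the Sales vs. Adv Exp regression.
b1_hat = 0.7246    # Coef for Adv Exp
se_b1 = 0.1579     # SE Coef
t_stat = b1_hat / se_b1        # about 4.59, matching Minitab's T column
t_crit = 2.306                 # t(.025, 8)
reject_H0 = abs(t_stat) > t_crit   # True: the slope is significant
```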
11.3 Inferences about Regression Parameters
• The F statistic can also be used to test H0: β1 = 0 vs. Ha: β1 ≠ 0
• The F test is equivalent to the t-test in simple regression.
• The F test is presented in the slides for Section 11.5.
Section 11.4: Predicting New Y Values Using Regression
11.4 Predicting New Y Values Using Regression

Scenario 1: Find a confidence interval for E(Y) at x(n+1)
• A new value of x, denoted by x(n+1), is specified.

[Figure: The population regression line E(Y), with the mean E(Y) marked at x = x(n+1)]
11.4 Predicting New Y Values Using Regression

Scenario 1:
• Point Estimate:

      ŷ(n+1) = β̂₀ + β̂₁x(n+1)

• Interval Estimate:

      ŷ(n+1) ± t(α/2, n−2) · sε · √( 1/n + (x(n+1) − x̄)² / Σ(xᵢ − x̄)² )

• The further that x(n+1) is from x̄, the wider the interval.
• The larger the range of x-values, the narrower the interval.
• The larger the number of data points, the narrower the interval.
11.4 Predicting New Y Values Using Regression

Example (Sales vs. Advertising Expenditures): With 95% confidence, within what limits is the average value for sales when advertising expenditures are $4,000?

• Point Estimate:  Ŷ = 0.681 + 0.725(4) = 3.58, with sε = 0.830 and Σ(xᵢ − x̄)² = 27.6
• Interval Estimate:

      3.58 ± (2.306)(0.830) √( 1/10 + (4 − 3.2)²/27.6 ),   or   [2.91, 4.25]

• This is a confidence interval estimate for the population average for sales regions with advertising expenditures of $4,000.
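The interval arithmetic can be sketched as follows (coefficients, s, and summary quantities taken from the slides):

```python
import math

# 95% confidence interval for E(Y) at Adv Exp = 4.
b0, b1 = 0.6812, 0.7246
s_eps, t_crit, n = 0.829702, 2.306, 10
xbar, sxx = 3.2, 27.6
x_new = 4.0
fit = b0 + b1 * x_new                                          # 3.580
half = t_crit * s_eps * math.sqrt(1 / n + (x_new - xbar) ** 2 / sxx)
ci = (fit - half, fit + half)                                  # about (2.91, 4.25)
```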
11.4 Predicting New Y Values Using Regression

Scenario 2: Find a prediction interval for Y(n+1) at x(n+1)

[Figure: The population regression line E(Y), with an individual new observation y(n+1) marked at x = x(n+1)]
11.4 Predicting New Y Values Using Regression

• Predicting a specific value of Y at a given value of x.
• Point Estimate:

      ŷ(n+1) = β̂₀ + β̂₁x(n+1)

• Interval Estimate:

      ŷ(n+1) ± t(α/2, n−2) · sε · √( 1 + 1/n + (x(n+1) − x̄)² / Σ(xᵢ − x̄)² )

• A prediction interval for Y(n+1) is wider than a confidence interval for E(Y(n+1)).
11.4 Predicting New Y Values Using Regression

Example (Sales vs. Advertising Expenditures):
Suppose a new region is to be allowed advertising expenditures of $4,000. What sales revenue can be anticipated? Obtain a 95% prediction interval.

• Point Estimate:  ŷ = 3.58, with sε = 0.830 and Σ(xᵢ − x̄)² = 27.6
• Interval Estimate:

      3.58 ± (2.306)(0.830) √( 1 + 1/10 + (4 − 3.2)²/27.6 ),   or   [1.55, 5.61]

This is a 95% prediction interval for the sales of an individual region when advertising expenditures are $4,000.
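The prediction interval differs from the confidence interval only by the extra "1 +" under the square root; a sketch:

```python
import math

# 95% prediction interval for an individual region at Adv Exp = 4.
b0, b1 = 0.6812, 0.7246
s_eps, t_crit, n = 0.829702, 2.306, 10
xbar, sxx = 3.2, 27.6
x_new = 4.0
fit = b0 + b1 * x_new
half = t_crit * s_eps * math.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / sxx)
pi = (fit - half, fit + half)    # about (1.55, 5.61), matching Minitab's 95% PI
```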
11.4 Predicting New Y Values Using Regression

The Minitab output follows for the confidence and prediction intervals:

Values of Predictors for New Observations
New Obs   Adv Exp
1         4.00

Predicted Values for New Observations
New Obs   Fit     SE Fit   95% CI            95% PI
1         3.580   0.291    (2.908, 4.251)    (1.552, 5.607)
11.4 Predicting New Y Values Using Regression

The confidence and prediction intervals are shown in the following graph:

[Figure: Fitted regression line for Sales vs. Adv Exp, Sales = 0.6812 + 0.7246 Adv Exp, with 95% confidence bands and 95% prediction bands; the prediction bands lie outside the confidence bands at every x]
Section 11.5: Correlation
11.5 Correlation
• Coefficient of Determination (r²xy, or R²)
• Based on explained and unexplained deviation:

      yᵢ − ȳ = (ŷᵢ − ȳ) + (yᵢ − ŷᵢ)

      (Total deviation) = (Explained deviation) + (Unexplained deviation)

• Of the total deviation, how much is explained by fitting the regression line, and how much is left over?
Example (Sales vs. Advertising Expenditures): For region 9, x = 5 and y = 5 (with ȳ = 3).

ŷ_9 = 0.6812 + 0.7246(5) = 4.304

Total deviation: (y_9 − ȳ) = (5 − 3) = 2
Explained deviation: (ŷ_9 − ȳ) = (4.304 − 3) = 1.304
Unexplained deviation: (y_9 − ŷ_9) = (5 − 4.304) = 0.696
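The three deviations for region 9 can be checked numerically. This is a sketch; ȳ = 3 is implied by the slide's arithmetic rather than stated directly.

```python
b0, b1 = 0.6812, 0.7246   # fitted line: Sales = 0.6812 + 0.7246 Adv Exp
ybar = 3.0                # mean of Sales, implied by the slide's arithmetic (assumed)
x9, y9 = 5.0, 5.0         # observation for region 9

y_hat9 = b0 + b1 * x9         # fitted value
total = y9 - ybar             # total deviation
explained = y_hat9 - ybar     # explained deviation
unexplained = y9 - y_hat9     # unexplained deviation

# the decomposition holds exactly: total = explained + unexplained
print(y_hat9, total, explained, unexplained)
```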
This is shown on the fitted line plot:
[Fitted line plot of Sales vs. Adv Exp: Sales = 0.6812 + 0.7246 Adv Exp, with the total, explained, and unexplained deviations marked relative to ȳ; S = 0.829702, R-Sq = 72.5%, R-Sq(adj) = 69.0%]
• Square both sides to account for negative deviations:

(y_i − ȳ)² = (ŷ_i − ȳ)² + (y_i − ŷ_i)² + [cross-product term]

• Sum over all observations (the cross-product terms sum to zero):

Σ(y_i − ȳ)² = Σ(ŷ_i − ȳ)² + Σ(y_i − ŷ_i)²

Sum of Squares due to Total (SST) = Sum of Squares due to Regression (SSR) + Sum of Squares due to Error (SSE)
• SST = SSR + SSE:

Σ(y_i − ȳ)² = Σ(ŷ_i − ȳ)² + Σ(y_i − ŷ_i)²

• Each Sum of Squares has "degrees of freedom" (df) associated with it.
• The degrees of freedom for each Sum of Squares satisfy:

(n − 1) = 1 + (n − 2)
• Mean Square = Sum of Squares/ df
• MSR = SSR/df = SSR /1 = SSR
• MSE = SSE/df = SSE/(n – 2)
• This leads to another test statistic, the F-statistic, for testing H0: β1 = 0.
• F-Statistic: F = MSR/MSE
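With the sums of squares from the Minitab ANOVA table shown later in this section, the mean squares and F-statistic work out as follows (a sketch using the slide's SSR and SSE):

```python
n = 10
ssr, sse = 14.493, 5.507   # from the Minitab ANOVA table (Sales vs. Adv Exp)
sst = ssr + sse            # SST = SSR + SSE = 20.000

msr = ssr / 1              # MSR = SSR / 1
mse = sse / (n - 2)        # MSE = SSE / (n - 2), here 8 d.f.
f_stat = msr / mse         # F = MSR / MSE
print(round(sst, 3), round(mse, 3), round(f_stat, 2))
```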
• Rationale: If SSR is large relative to SSE, this indicates the independent variable x has real predictive value.
• The F test is one-tailed: H0 is rejected only for large values of F.
• Rejection Region: Reject H0 : β1 = 0 if F-Statistic > Fα,1,n-2
Or, reject H0 if p-value < α.
Example: Sales and Advertising Expenditures
Locate the value of the F-Statistic, the associated p-value, and state your conclusion.
Regression Analysis: Sales versus Adv Exp
The regression equation is
Sales = 0.681 + 0.725 Adv Exp

Predictor    Coef  SE Coef     T      P
Constant   0.6812   0.5694  1.20  0.266
Adv Exp    0.7246   0.1579  4.59  0.002
S = 0.829702 R-Sq = 72.5% R-Sq(adj) = 69.0%
Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  14.493  14.493  21.05  0.002
Residual Error   8   5.507   0.688
Total            9  20.000
• F = MSR/MSE = 14.493/0.688 ≈ 21.05
Since 21.05 > F.05,1,8 = 5.32, reject H0: β1 = 0.
• Percentage points of the F distribution are in Table 6.
• Equivalently, since p-value = .002, reject H0: β1 = 0.
• In simple regression, t² = F.
Example: Sales and Advertising Expenditures
t² = (4.59)² = 21.07 = F (up to rounding in the printout)
• In simple regression, the p-values for the F test and t test are equal.
Example: Sales and Advertising Expenditures
p-value for F test = .002 = p-value for t-test
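The t–F equivalence can be verified directly from the printed output; the small discrepancy is rounding in Minitab's displayed values.

```python
t = 4.59      # t-statistic for the slope, from the Minitab output
f = 21.05     # F-statistic, from the ANOVA table

# in simple regression t^2 = F exactly; here both are rounded in the printout
print(round(t**2, 2), f)
assert abs(t**2 - f) < 0.05
```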
• The Coefficient of Determination, denoted r²_yx or R², is

R² = SSR/SST

• R² specifies how much of the total variation is explained by the fitted line.
Example (Sales vs. Advertising Expenditures):Use the Minitab output to find the value of the coefficient of determination and interpret the value in the context of this problem.
Regression Analysis: Sales versus Adv Exp

The regression equation is
Sales = 0.681 + 0.725 Adv Exp

S = 0.829702   R-Sq = 72.5%   R-Sq(adj) = 69.0%

From the output, R² = 72.5%.

Interpretation: 72.5% of the variation in Sales is explained by the regression model with Advertising Expenditures as the predictor.
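R² can be computed directly from the ANOVA sums of squares (a sketch using the table's values):

```python
ssr, sst = 14.493, 20.000   # SSR and SST from the ANOVA table
r_sq = ssr / sst            # coefficient of determination, R^2 = SSR/SST
print(r_sq)                 # about 0.725, i.e. 72.5%
```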
• The coefficient of determination, R², ranges from 0 to 1.
• The coefficient of correlation, denoted by r, is obtained from R²:

r = ±√R²

where the sign (+, −) of r is the same as the sign of β̂1.
• r takes on values from -1 to +1
• Correlation measures the strength of the linear relationship between x and Y.
• The coefficient of determination (R2) and the coefficient of correlation (r) have very different interpretations.
• In correlation, both variables are on an equal footing.
• It does not matter which is labeled x and which is labeled Y. The objective is to measure the association between x and Y.
• This is in contrast to regression analysis, where the objective is to use x to predict Y.
Example (Sales vs. Advertising Expenditures)

r = √R² = √0.725 = +0.851

• r has a positive sign because the slope β̂1 = +0.725 is positive.

Warning: Correlation does not imply causation.
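The sign-carrying step can be written out explicitly; this sketch uses `copysign` to transfer the slope's sign onto the square root.

```python
from math import sqrt, copysign

r_sq = 0.725   # coefficient of determination from the output
b1 = 0.7246    # slope estimate; its sign determines the sign of r

r = copysign(sqrt(r_sq), b1)   # here b1 > 0, so r is positive
print(round(r, 3))             # about +0.851
```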
• r denotes the sample correlation coefficient.
• r estimates the population correlation (ρ)
• Hypothesis testing on ρ requires certain assumptions.
• In regression analysis, the values of x are predetermined constants. In correlation analysis, the values of x are randomly selected.
• In correlation analysis, the x values also come from a normal distribution. More precisely, the random sample of (x, y) values is drawn from a bivariate normal distribution.
• To test H0: ρ = 0, the test statistic is

t = r√(n − 2) / √(1 − r²)

where t has (n − 2) d.f.

• The rejection region depends on the form of Ha:
• If Ha: ρ > 0, reject H0 if t > tα,n−2
• If Ha: ρ < 0, reject H0 if t < −tα,n−2
• If Ha: ρ ≠ 0, reject H0 if | t | > tα/2,n−2
Exercise 11.27: A survey of recent M.B.A. graduates of a business school obtained data on first-year salary and years of prior work experience. [The data are in Exercise 11.27. Assume that the 51 students were randomly selected.]
• The Minitab output follows:

Pearson correlation of SALARY and EXPER = 0.703
P-Value = 0.000

• The test statistic is

t = r√(n − 2) / √(1 − r²) = 0.703√(51 − 2) / √(1 − (0.703)²) = 6.92

• For such a large t-value, the p-value is 0.000.
Conclusion: Reject H0: ρ = 0.
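The t computation for Exercise 11.27 can be verified with a short sketch, using the correlation and sample size from the output:

```python
from math import sqrt

r, n = 0.703, 51   # Pearson correlation and sample size from Exercise 11.27
t = r * sqrt(n - 2) / sqrt(1 - r**2)   # test statistic for H0: rho = 0
print(round(t, 2))   # about 6.92, with n - 2 = 49 d.f.
```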
Keywords: Chapter 11
• Regression analysis
• Independent variable
• Predictor variable
• Dependent variable
• Response variable
• Scatterplot
• Simple regression
• Slope
• y-intercept
• Least-squares method
• High leverage points
• Y-outliers
• High influence points
• Standard error of the estimate
• t-test on slope
• F-test on slope
• Coefficient of determination
• Coefficient of correlation
Summary of Chapter 11
• Understanding the role of the scatterplot
• Understanding the rationale of the least squares method for finding the best fitting line
• Understanding the impact of high leverage points, y-outliers and high influence points on the fitted line
• Understanding variability around the regression line
• Testing the slope coefficient using the t-test
• Testing the slope coefficient using the F-test
• Understanding the difference between the correlation coefficient and the coefficient of determination