Statistics for Management
Simple Correlation and Regression Unit 10
UNIT 10: SIMPLE CORRELATION AND REGRESSION
UNIT STRUCTURE
10.1 Learning Objectives
10.2 Introduction
10.3 Correlation Analysis
10.3.1 Correlation and Causation
10.3.2 Types of Correlation
10.4 Methods of Measuring Correlation
10.4.1 Scatter diagram Method
10.4.2 Karl Pearson’s correlation coefficient method
10.4.3 Spearman’s Rank Correlation Coefficient
10.5 Probable error of correlation coefficient
10.6 Coefficient of determination
10.7 Regression Analysis
10.8 Regression Lines
10.8.1 Determination of the regression line of y on x
10.8.2 Determination of the regression line of x on y
10.8.3 Regression coefficient
10.9 Standard Error Of Estimate
10.10 Let Us Sum Up
10.11 Further Readings
10.12 Model Questions
10.1 LEARNING OBJECTIVES
After going through this unit, you will be able to:
• explain how correlation analysis expresses quantitatively the
degree and direction of the association between two variables
• compute and interpret different measures of correlation, namely Karl
Pearson’s Correlation Coefficient, Spearman’s Rank Correlation
Coefficient
• use regression analysis for estimating the relationship between
variables
• use the least squares method to estimate an equation for predicting
future values of the dependent variable
10.2 INTRODUCTION
In many business environments, we come across problems or
situations where two variables seem to move in the same direction such
as both are increasing or decreasing. At times an increase in one variable
is accompanied by a decline in another. Thus, if two variables are such that
when one changes the other also changes then the two variables are said
to be correlated. The knowledge of such a relationship is important to make
inferences from the relationship between variables in a given situation. For
example, a marketing manager may be interested to investigate the degree
of relationship between advertising expenditure and the sales volume. The
manager would like to know whether the money that he is going to
spend on advertising is justified in terms of the sales generated.
Correlation analysis is used as a statistical technique to ascertain
the association between two quantitative variables. Usually, in correlation,
the relationship between two variables is expressed by a pure number i.e.,
a number without having any unit of measurement. This pure number is
referred to as coefficient of correlation or correlation coefficient which
indicates the strength and direction of statistical relationship between
variables.
It may be noted that correlation analysis is one of the most widely
employed statistical devices adopted by applied statisticians and has been
used extensively not only in biological problems but also in agriculture,
economics, business and several other fields. In this unit we shall introduce
correlation analysis for two variables.
The importance of examining the relationship between two or more
variables can be stated in the form of following questions and accordingly
requires the statistical devices to arrive at conclusions:
(i) Does there exist an association between two or more variables? If
exists, what is the form and the degree of that relationship?
(ii) Is the relationship strong or significant enough to be useful to arrive
at a desirable conclusion?
(iii) Can the relationship be used for predictions of future events, that
is, to forecast the most likely value of a dependent variable
corresponding to the given value of independent variable or
variables?
The first two questions can be answered with the help of correlation
analysis while the final question can be answered by using the regression
analysis.
In case of correlation analysis, the data on values of two variables
must come from sampling in pairs, one for each of the two variables.
10.3 CORRELATION ANALYSIS
By the term ‘correlation’ we mean the relationship between two
variables. Two variables are said to be correlated if the change in one variable
results in a corresponding change in the other variable. In the practical field
we need to investigate the type of relationship that might exist between the
ages of husbands and wives, the heights of fathers and sons, the amount
of rainfall and the volume of production of a particular crop, the price of a
commodity and the demand for it, and so on. For example, if the price
of a commodity increases, there is a decline in its demand. Thus there exists
correlation between the two variables price and demand. In correlation, we
study the nature and the degree of relationship between the two variables.
Uses of Correlation Analysis:
In spite of certain limitations correlation analysis is a widely used
statistical device. With the help of correlation analysis one can ascertain
the existence as well as degree and direction of relations between two
variables. It is an indispensable tool of analysis for the people of Economics
and Business. Variables in Economics and Business are usually interrelated.
In order to study the nature (positive or negative) and degree (low, moderate
or high) of relationship between any two of such related variables correlation
analysis is used. In reality, besides Business and Economics, it is
extensively used in various other branches.
10.3.1 CORRELATION AND CAUSATION
Correlation analysis helps us to have an idea about the
degree and direction of the relationship between the two variables
under study. However, it fails to reflect upon the cause and effect
relationship between the variables. If there exists a cause and effect
relationship between the two variables, they are bound to vary in
sympathy with each other and, therefore, there is bound to be a high
degree of correlation between them. In other words, causation always
implies correlation. However, the converse is not true i.e., even a fairly
high degree of correlation between the two variables need not imply a
cause and effect relationship between them. The high degree of
correlation between the variables may be due to the following reasons:
(i) Mutual dependence: The phenomena under study may
mutually influence each other. Such situations are usually observed in
data relating to economic and business situations. For example,
variables like price, supply, and demand of a commodity are mutually
correlated. According to the principle of economics, as the price of a
commodity increases, its demand decreases, so price influences
the demand level. But if demand of a commodity increases due to
growth in population, then its price also increases. In this case
increased demand makes an effect on the price. However, the amount
of export of a commodity is influenced by an increase or decrease in
custom duties but the reverse is normally not true.
(ii) Pure chance: It may happen that a small randomly
selected sample from a bivariate distribution (i.e., a distribution in
which each unit of the series assumes two values) may show a fairly
high degree of correlation though, actually, such a relationship may
not exist in the universe. Such correlation may be due to chance
fluctuations. Moreover, the conscious or unconscious bias on the part
of the investigator, in the selection of the sample may also result in
high degree of correlation in the sample. It may be noted that in both
the phenomena a fairly high degree of correlation may be observed,
though it is not possible to conceive them as being causally related.
(iii) Influence of external factors: A high degree of
correlation may be observed between two variables due to the effect
or interaction of a third variable, or of a number of variables, on each of
these variables. For example, a fairly high degree of correlation may
be observed between the yield per hectare of two crops, say, rice
and potato, due to the effect of a number of factors like favourable
weather conditions, fertilizers used, irrigation facilities, etc., on each
of them. But neither of the two is the cause of the other.
10.3.2 TYPES OF CORRELATION
Correlation may be broadly classified into the following three
types:
(a) Positive, Negative and Zero Correlation,
(b) Linear and Non-linear Correlation,
(c) Simple, Partial and Multiple Correlation.
(a) Positive, Negative and Zero Correlation:
Positive Correlation: Two variables are said to be positively
or directly correlated if the values of the two variables deviate or move
in the same direction i.e., if the increase in the values of one variable
results, on an average, in a corresponding increase in the values of
the other variable or if a decrease in the values of one variable results,
on an average, in a corresponding decrease in the values of the other
variable. For example, there exists positive correlation between the
following pairs of variables.
(i) The income and expenditure of a family on luxury items,
(ii) Advertising expenditure and the sales volume of a company,
(iii) Amount of rainfall and yield of a crop,
(iv) Height and weight of a student
(v) Price and supply of a commodity,
(vi) Temperature and sale of cold drinks on different days
of a month in summer.
When the changes in two related variables are exactly
proportional and are in the same direction then we say that there is
perfect positive correlation between them. For example, there exists
perfect positive correlation between the following pairs of sets of data
where each set of data may be assumed to be the values of a variable.
X: 10 20 30 40 50
Y: 2 4 6 8 10
U: 40 35 30 25 20 15 10
V: 14 12 10 8 6 4 2
Negative Correlation: The correlation between two variables
is said to be negative or inverse if the variables deviate in the opposite
direction i.e., if the increase (decrease) in the values of one variable
results, on an average, in a corresponding decrease (increase) in the
values of the other variable. The following pairs of variables are
negatively correlated:
(i) Price and demand for a commodity,
(ii) Volume and pressure of a perfect gas,
(iii) Sale of woolen garments and the day temperature,
(iv) Number of workers and time required to complete a
work
When the changes in two related variables are exactly
proportional but are in the opposite directions then we say that there
is perfect negative correlation between them. The correlation between
each of the following pairs of variables is perfectly negative:
X: 60 50 40 30 20
Y: 2 4 6 8 10
U: 0 1 2 3
V: 2 -3 -8 -13
Zero Correlation: Two variables are said to have zero
correlation or no correlation if they tend to change with no connection
to each other. In such situation the variables are said to be
uncorrelated. For example, one should expect zero correlation
between the yield of crop and the heights of students, or between
price of rice and demand for sugar.
(b) Linear and Non-linear Correlation: The correlation
between two variables is said to be linear if corresponding to a unit
change in one variable, there is a constant change in the other variable
over the entire range of the values. The following example illustrates
a linear correlation between the two variables X and Y.
X: 10 20 30 40 50
Y: 40 60 80 100 120
When these pairs of values X and Y are plotted on a graph
paper, the line obtained by joining the points would be a straight line.
In general, two variables X and Y are said to be linearly related
if the relationship that exists between the two variables is of the form
given by,
Y= a +bX
Where ‘b’ is the slope and ‘a’ the intercept.
On the other hand, correlation is said to be non-linear when the
amount of change in the values of one variable does not bear a
constant ratio to the amount of change in the corresponding values
of the other variable. The following example illustrates a non-linear
correlation between the given variables.
X: 8 9 9 10 10 28 29 30
Y: 80 130 170 150 230 560 460 600
When these pair of values are plotted on a graph paper, the
line obtained by joining these points would not be a straight line, rather
it would be curvi-linear.
(c) Simple, Partial and Multiple Correlation: The distinction
amongst Simple, Partial and Multiple Correlation depends upon the
number of variables involved under study.
In Simple correlation only two variables are introduced to
study the relationship between them. A study on income with respect
to saving only, or sales revenue with respect to amount of money
spent on advertisement etc. are a few examples studied under Simple
Correlation.
When the study involves more than two variables then it is a
problem of either Partial or Multiple Correlation. In Partial Correlation,
we study relationship between two variables while the effect of other
variable is held constant. In other words, in Partial Correlation we
study the linear relationship between a dependent variable and one
particular independent variable out of a set of independent variables
when all other variables are held constant. For example, suppose our
study involves three variables X1, X2 and Y where X1 is the number of
hours studied, X2 is I.Q. and Y is the marks secured in the examination.
Now if we study the relationship between number of hours (X1) and
marks obtained (Y) by the student keeping the effect of I.Q. (X2)
constant then it is a problem of Partial Correlation.
In Multiple Correlation three or more than three variables are
studied simultaneously. For example, the study of the relationship
between the production of a particular crop on one side and rainfall
and use of fertilizer on the other side falls under Multiple Correlation.
CHECK YOUR PROGRESS
Q 1: State whether the following statements are true or false:
(i) Correlation helps to formulate the relationship between the variables.
(ii) If the relationship between variables x and y is positive, then as the
variable y decreases, the variable x increases.
(iii) In a negative relationship, as x increases, y decreases.
(iv) Multiple correlation deals with studying three or more variables
simultaneously.
10.4 METHODS OF MEASURING CORRELATION
Here we shall confine our discussion to methods of measuring
linear relationships only. The commonly used methods for studying the
correlation between two variables are:
(i) Scatter Diagram method
(ii) Karl Pearson’s correlation coefficient method
(iii) Spearman’s Rank correlation method
(iv) Concurrent Deviation method
10.4.1 SCATTER DIAGRAM METHOD
Scatter Diagram is one of the simplest methods of diagrammatic
representation of a bivariate distribution and is used to study the nature
(i.e., positive, negative or zero) and degree (i.e., weak or strong) of
correlation between two variables. A scatter diagram can be obtained
on a graph paper by plotting observed pairs of values of variables x
and y, considering the independent variable values on the x-axis and
the dependent variable values on the y-axis. Suppose we are given n
pairs of values (x1, y1), (x2, y2), ...., (xn, yn) of two variables X and
Y. These n points may be plotted as dots in the xy-plane. The
diagram of dots so obtained is known as a scatter diagram. From the
scatter diagram we can form a fairly good, though rough, idea about the
existence of a relationship between the two variables. After plotting the
points in the plane we may have one of the types of scatter diagrams
as shown below:
Now with the help of Scatter Diagram, we can interpret the
correlation between the two variables as:
(i) If the points are very close to each other, then we can expect
a fairly good amount of correlation between the two variables. On the
other hand, if there appears to be no obvious pattern of the points of
the scatter diagram then it indicates that there is either no correlation
or very low amount of correlation between the variables.
(ii) If the points on the scatter diagram reveal any trend (either
upward or downward), the variables are said to be correlated; the
variables are uncorrelated if no trend is revealed.
(iii) If there is an upward trend rising from lower left hand corner
to the upper right hand corner then we expect a positive correlation
because in such a situation both the variables move in the same
direction. On the other hand, if the points on the scatter diagram depict
a downward trend starting from upper left hand corner to the lower
right hand corner, the correlation is negative since in this case the
values of the two variables run in the opposite direction.
(iv) In particular, if all the points lie on a straight line starting
from the left bottom and going up towards the right top, the correlation
is perfect and positive, and if all the points lie on a straight line starting
from left top and coming down to right bottom, the correlation is perfect
and negative.
Remark: 1.The Scatter Diagram method enables us to form a
rough idea of the nature of the relationship between the two variables
simply by inspection of the graph. However, this method is not suitable
for situations involving a large number of observations.
2. The method of scatter diagram provides information only about
the nature of the relationship that is whether it is positive or negative
and whether it is high or low but fails to provide an exact measure of
the extent of the relationship between the two variables.
Example 10.1: The percentage examination scores of 10
students in Data analysis and Economics were as follows. Draw a
scatter diagram for the data and comment on the nature of correlation.
Student : A B C D E F G H I J
Data Analysis: 65 90 52 44 95 36 48 63 80 15
Economics : 62 71 58 58 64 40 42 66 67 55
Solution: A scatter diagram will give a preliminary indication of
whether linear correlation exists. We plot the ordered pairs (65, 62),
(90, 71),………., (15, 55) as shown in the following figure.
Since the points are very close to each other, we may expect a
high degree of correlation. Further, since the points reveal an upward
trend starting from left bottom to top right hand corner, the correlation
is positive. Hence, we conclude that there exists a high degree of
positive correlation between the scores of the students in Data Analysis
and Economics.
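The visual impression from the diagram can be cross-checked numerically. The following Python sketch (not part of the original text) computes Pearson's correlation coefficient, introduced in the next section, for the ten score pairs:

```python
import math

# Scores from Example 10.1: Data Analysis (x) and Economics (y)
x = [65, 90, 52, 44, 95, 36, 48, 63, 80, 15]
y = [62, 71, 58, 58, 64, 40, 42, 66, 67, 55]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Deviation form: sum of cross-products over the square root of the
# product of the sums of squared deviations
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # positive and fairly large, matching the upward trend
```

The value comes out clearly positive, consistent with the conclusion drawn from the diagram.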
10.4.2 KARL PEARSON’S CORRELATION
COEFFICIENT METHOD
The scatter diagram gives a rough indication of the nature and
extent/strength of the relationship between the two variables. The
quantitative measurement of the degree of linear relationship between
two variables, say ‘x’ and ‘y’, is given by a parameter called correlation
coefficient. It was developed by Karl Pearson, and his method of
measuring correlation between two variables is called the coefficient
of correlation or correlation coefficient. It is also known as the product
moment coefficient. For a set of n pairs of values of x and y, Karl
Pearson's correlation coefficient, usually denoted by r_XY, r_xy or
simply r, is defined by,
r = Cov(X, Y) / √[Var(X) Var(Y)]   ……………..(10.1)

or,   r = Cov(X, Y) / (σ_X σ_Y)

where   Cov(X, Y) = (1/n) Σ(X − X̄)(Y − Ȳ),
        Var(X) = σ_X² = (1/n) Σ(X − X̄)²,
        Var(Y) = σ_Y² = (1/n) Σ(Y − Ȳ)².

Substituting these values in the definition of r, we have

r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]   ..................... (10.2)

  = [n ΣXY − ΣX ΣY] / √[(n ΣX² − (ΣX)²)(n ΣY² − (ΣY)²)]   ..................... (10.3)

  = [ΣXY − n X̄ Ȳ] / √[(ΣX² − n X̄²)(ΣY² − n Ȳ²)]   ..................... (10.3′)
Step deviation method for ungrouped data: When the actual
means X̄ and Ȳ are in fractions, the calculation of Pearson's
correlation coefficient can be simplified by taking deviations of the x
and y values from their assumed means A and B, respectively. Thus,
when d_x = X − A and d_y = Y − B, where A and B are the assumed
means of the x and y values, the formula (10.2) becomes

r = [n Σd_x d_y − Σd_x Σd_y] / √[(n Σd_x² − (Σd_x)²)(n Σd_y² − (Σd_y)²)]   ..........(10.5)
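As a quick sketch (the data and the assumed means A, B below are illustrative, not from the text), the raw-sum formula (10.3) and the step-deviation formula give identical values, since r is unaffected by a shift of origin:

```python
import math

def pearson_raw(x, y):
    """Pearson's r from raw sums, as in equation (10.3)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))

def pearson_step_deviation(x, y, A, B):
    """Same r computed from deviations dx = x - A, dy = y - B."""
    dx = [a - A for a in x]
    dy = [b - B for b in y]
    # The step-deviation formula (10.5) has the same shape in dx, dy
    return pearson_raw(dx, dy)

x = [10, 20, 30, 40, 50]
y = [42, 58, 81, 98, 121]
print(pearson_raw(x, y))                          # close to +1
print(pearson_step_deviation(x, y, A=30, B=80))   # identical value
```

Any choice of A and B gives the same result, which is why the assumed means are chosen purely for arithmetic convenience.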
Assumptions of Using Pearson’s Correlation Coefficient:
Karl Pearson's correlation coefficient r_XY is based on the
following four assumptions:
(i) It is appropriate to calculate when both variables x and y are
measured on an interval or a ratio scale.
(ii) The random variables X and Y are normally distributed.
(iii) There is a linear relationship between the variables.
(iv) There is a cause and effect relationship between the two
variables that influences the distributions of both the variables.
Merits and Limitations of Correlation Coefficient: The
correlation coefficient is a number lying between -1 and +1
that summarizes the magnitude as well as the direction of association
between two variables. The chief merits of this method are given
below:
(i) Karl Pearson’s coefficient of correlation is a widely used
statistical device.
(ii) It summarizes the degree (high, moderate or low) in one
figure.
(iii) It is based on all the observations.
However, analysis based on the Pearsonian coefficient is subject
to certain severe limitations, which are presented as:
(i) A value of r which is near zero does not necessarily indicate
that the two variables X and Y are unrelated; it merely indicates
that there is no linear relationship between them. There may be a
curvilinear or some complex relationship between the two variables
which Pearson's formula cannot detect, as it is an instrument for
measuring linear correlation only.
(ii) Correlation is only a measure of the nature and degree of
relationship between two variables and it gives no indication of the
kind of cause and effect relationship that may exist between the two
variables. It fails to identify the variables as dependent or independent
variables. Correlation theory simply seeks to discover if a covariation
between two variables exists or not. Statistical correlation technique
may reveal a very close relationship between two variables, but it
cannot tell us about the cause and effect relationship between them,
or which variable causes the other to react.
(iii) Two causally unconnected variables may exhibit a high degree
of correlation between them. For example, the data relating to the yield
of rice and wheat may show a fairly high degree of positive correlation
although there is no connection between the two variables, viz., yield
of rice and yield of wheat. This may be due to the favourable impact
of extraneous factors like weather conditions, fertilizers used, irrigation
facilities, improved variety of seeds etc. on both of them.
(iv) Sometimes a high correlation between two variables may be
entirely spurious: it may exist purely due to chance, and such
correlations are accordingly termed chance correlations.
Properties of Correlation Coefficient:
Some of the important properties of Correlation coefficient are
given below:
(a) Correlation coefficient is a pure number.
(b) Correlation coefficient is independent of change of origin
and scale of measurement.
We shall now establish property (b).
Proof: Suppose we want to study the relationship between two
variables X and Y. Let these variables be transformed to the new
variables U and V by the change of origin and scale viz.,
U = (X − a)/h   and   V = (Y − b)/k   ....…………………….(10.6)

where a and b are known as the assumed means or origins, and h and k
(h, k > 0) are known as the scales of measurement.

Therefore from (10.6) we have,

X = a + hU   and   Y = b + kV   ………………………...(10.7)

Summing both sides and dividing by n we get,

X̄ = a + hŪ   and   Ȳ = b + kV̄   ………………………….(10.8)

Subtracting (10.8) from (10.7) we have,

X − X̄ = h(U − Ū)   and   Y − Ȳ = k(V − V̄)

Putting these values in equation (10.2) we get,

r_XY = Σ h(U − Ū) · k(V − V̄) / √[Σ h²(U − Ū)² · Σ k²(V − V̄)²]

     = hk Σ(U − Ū)(V − V̄) / [hk √(Σ(U − Ū)² Σ(V − V̄)²)]

     = Σ(U − Ū)(V − V̄) / √[Σ(U − Ū)² Σ(V − V̄)²] = r_UV
Since r_XY = r_UV, the correlation coefficient between the
two original variables X and Y is equal to the correlation coefficient
between the new variables U and V (where the new variables U and V
are obtained from X and Y respectively after changing the origin and
scale of X and Y). Hence, we conclude that Correlation coefficient is
independent of change of origin and scale of measurement.
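A short Python sketch of property (b), with hypothetical data and an arbitrary change of origin and scale (a = 20, h = 2, b = 10, k = 5):

```python
import math

def pearson(x, y):
    """Pearson's r in deviation form, as in equation (10.2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [12, 15, 19, 24, 30]
y = [3, 7, 8, 12, 15]

# Transformed variables u = (x - a)/h and v = (y - b)/k
u = [(a - 20) / 2 for a in x]
v = [(b - 10) / 5 for b in y]

print(pearson(x, y), pearson(u, v))  # the two values coincide
```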
(c) Correlation coefficient lies between -1 and +1, i.e., -1 ≤ r ≤ +1.

Proof: Let us consider two variables X and Y with arithmetic
means X̄ and Ȳ and standard deviations σ_X and σ_Y respectively.
Let us now consider the sum of squares

Σ [(X − X̄)/σ_X ± (Y − Ȳ)/σ_Y]²

which is always non-negative, i.e.,

Σ [(X − X̄)/σ_X ± (Y − Ȳ)/σ_Y]² ≥ 0

Expanding,

Σ(X − X̄)²/σ_X² + Σ(Y − Ȳ)²/σ_Y² ± 2 Σ(X − X̄)(Y − Ȳ)/(σ_X σ_Y) ≥ 0

Now dividing by n, we get

1 + 1 ± 2r ≥ 0,   i.e.,   2 ± 2r ≥ 0

and hence

r ≥ -1   and   r ≤ +1
Remarks: 1. This property provides us a check on our
calculations. If in any problem the obtained value of r lies outside
these limits, it implies that there is some mistake in our calculations.
2. r = +1 indicates perfect positive correlation between the
variables and r = -1 indicates perfect negative correlation between
the variables.
(d) If X and Y are two independent variables, then r_XY = 0, but the
converse is not true.

Proof: We have

r = Cov(X, Y) / (σ_X σ_Y) = E[(X − X̄)(Y − Ȳ)] / √{E[(X − X̄)²] E[(Y − Ȳ)²]}   ………………………(10.9)

Now,

E[(X − X̄)(Y − Ȳ)] = E[XY − X̄Y − ȲX + X̄Ȳ]
                 = E(XY) − X̄ E(Y) − Ȳ E(X) + X̄Ȳ
                 = E(XY) − X̄Ȳ
                 = E(X) E(Y) − X̄Ȳ   (since X and Y are given to be
                   independent, E(XY) = E(X) E(Y))
                 = X̄Ȳ − X̄Ȳ = 0

Therefore from (10.9), r = 0.

Thus when two variables are independent, they are uncorrelated,
i.e., r_XY = 0.
But the converse is not true: if two variables are uncorrelated,
they may not be independent. Since Karl Pearson's correlation
coefficient is a measure of only linear correlation between two
variables, r_XY = 0 indicates only that there exists no linear
relationship between the variables. There may, however, exist a
strong non-linear or curvilinear relationship between x and y even
though r_XY = 0.
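That the converse fails can be seen with a tiny Python example (hypothetical data): take y = x² on values of x symmetric about zero, so y is completely determined by x, yet every cross-product of deviations cancels and r comes out exactly zero.

```python
import math

def pearson(x, y):
    """Pearson's r in deviation form, as in equation (10.2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]  # perfect dependence, but not linear

r = pearson(x, y)
print(r)  # 0.0: uncorrelated, although y is a function of x
```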
(e) Correlation coefficient is symmetric, i.e., r_XY = r_YX.

Proof: We have,

r_XY = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]   ....(*)

Now interchanging X and Y we get,

r_YX = Σ(Y − Ȳ)(X − X̄) / √[Σ(Y − Ȳ)² Σ(X − X̄)²]   .......(**)

From (*) and (**) we find that r_XY = r_YX.
Interpretation of Various values of Correlation Coefficient:
The interpretations of various values of r_XY are as follows:
(i) 0 < r_XY < 1 implies that there is positive correlation between
X and Y. The closer the value of r_XY is to +1, the stronger is the
positive correlation.
(ii) r_XY = +1 implies that there exists perfect and positive
correlation between the variables.
(iii) -1 < r_XY < 0 implies that there is negative correlation
between X and Y. The closer the value of r_XY is to -1, the stronger is
the negative correlation.
(iv) r_XY = -1 indicates that the correlation between the variables
is perfect and negative.
(v) r_XY = 0 means that there is no correlation between the variables
and hence the variables are said to be uncorrelated.
CHECK YOUR PROGRESS
Q 2: Given Σ(X − X̄)(Y − Ȳ) = 43, Σ(X − X̄)² = 32 and Σ(Y − Ȳ)² = 72,
what is the correlation coefficient r?
Q 3: Given r = 0.25, Cov(X, Y) = 3.6 and Var(X) = 36, what is S.D.(Y)?
Example 10.2: The following data gives indices of industrial production
and number of registered unemployed people (in lakh). Determine Karl
Pearson’s correlation coefficient
Solution: To calculate the Karl Pearson’s correlation coefficient we prepare
the following table:
X̄ = ΣX/n = 832/8 = 104;   Ȳ = ΣY/n = 120/8 = 15

Thus Karl Pearson's Correlation Coefficient,

r_XY = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]
     = -92 / √(184 × 120) = -0.619

Since the coefficient of correlation r = -0.619 is moderately negative, it
indicates that there is a moderately strong inverse correlation between the
two variables. Hence, we conclude that as the production index increases,
the number of unemployed decreases and vice-versa.
Example 10.3: The following data relate to age of employees and the number
of days they reported sick in a month
Calculate Karl Pearson’s coefficient of correlation and interpret it.
Solution: Let age and sick days be represented by variables X and Y
respectively. Then Karl Pearson’s Correlation coefficient
22XY
YYXX
YYXXr
Now we prepare the following table for calculation
X̄ = ΣX/n = 466/10 = 46.6;   Ȳ = ΣY/n = 40/10 = 4

Therefore we have,

r_XY = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]
     = 230 / √(1092 × 64)
     = 0.870

Since the coefficient of correlation is close to 1 and positive, the age of
employees and the number of sick days are positively correlated to a high
degree. Hence we conclude that as the age of an employee increases, he is
likely to go on sick leave more often than others.
Example 10.4: The following data gives Sales and Net Profit for some of
the top Auto-makers during the quarter July-September 2006. Find the
correlation coefficient.
Solution: Let the average sales and average net profit of the given
automobiles be denoted by X and Y respectively. Then the correlation
coefficient is given by

r = [n ΣXY − ΣX ΣY] / √[(n ΣX² − (ΣX)²)(n ΣY² − (ΣY)²)]

[Using equation (10.3)]
Now we make the following table for calculation:
Therefore the correlation coefficient is given by,

r = [n ΣXY − ΣX ΣY] / √[(n ΣX² − (ΣX)²)(n ΣY² − (ΣY)²)]
  = 14352 / √(13216 × 19184)
  = 14352 / 15922.8
  = 0.901
Correlation Coefficient in case of Grouped Data:
In the case of a bivariate frequency distribution, if we are to deal with a
large volume of data, then the data are classified in the form of a two-way
frequency table known as a bivariate table or correlation table. Here, for
each of the variables, the values are classified into different classes
following the same considerations as in the case of a univariate
distribution. If there are m classes for the values of the variable X and n
classes for the values of the variable Y, then there will be m × n cells in
the two-way table. We shall now discuss the calculation of Karl Pearson's
correlation coefficient with the help of the following example.
Example 10.5: Family income and its percentage spent on food in the case
of 100 families gave the following bivariate frequency distribution. Calculate
the coefficient of correlation.
Solution: Let us denote the income (in Rs.) and the food expenditure (%)
by the variables X and Y respectively. Now, to calculate Karl Pearson's
coefficient of correlation, we follow the steps given below.
Step 1: Find the mid-points of the various classes of the X and Y series.
Step 2: Change the origin and scale of the X series and the Y series to the
new variables u and v by using the transformations:

u = (X − A)/h = (X − 450)/100   and   v = (Y − B)/k = (Y − 17.5)/5

where X and Y here denote the mid-points of the X series and the Y series
respectively, and h and k denote the magnitudes of the classes of the X
and Y series respectively.
Step 3: For each class of X, find the total of the cell frequencies of all the
classes of Y, and similarly for each class of Y find the total of the cell
frequencies of all the classes of X.
Step 4: Multiply the frequencies of X by the corresponding values of the
variable u and find the sum Σfu.
Step 5: Multiply the frequencies of Y by the corresponding values of the
variable v and find the sum Σfv.
Step 6: Multiply the frequency of each cell by the corresponding values of
u and v, and write the product fuv within a square in the right hand top
corner of each cell.
Step 7: Add together all the figures in the top corner squares as obtained in Step 6 to get the column fuv for each of the X and Y series. Finally, find the total of this column to get Σfuv.
Step 8: Multiply the values of fu and fv by the corresponding values of u and v to get the columns for fu² and fv². Add these values to obtain Σfu² and Σfv².
The above calculations are presented in the table. Then,
r_uv = [NΣfuv − (Σfu)(Σfv)] / {√[NΣfu² − (Σfu)²] √[NΣfv² − (Σfv)²]}
= [100 × 48 − (0)(−100)] / {√(100 × 120 − 0²) √(100 × 200 − (−100)²)}
= 4800 / (√12000 × √10000)
= 4800 / 10954.45 = 0.4381
Since the correlation coefficient is independent of change of origin and scale of measurement, r_XY = r_uv = 0.4381.
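The grouped-data shortcut above can be sketched in a few lines of Python. The marginal totals used (Σfu = 0, Σfv = −100, Σfu² = 120, Σfv² = 200, Σfuv = 48) are those reconstructed from the worked table of Example 10.5; the function is just formula-level arithmetic.

```python
import math

# A sketch of the grouped-data formula for r, using the coded totals of
# Example 10.5 (N = 100 families; totals read off the worked table).
def r_grouped(N, fu, fv, fu2, fv2, fuv):
    # r = [N*Σfuv - (Σfu)(Σfv)] / sqrt[N*Σfu² - (Σfu)²] sqrt[N*Σfv² - (Σfv)²]
    num = N * fuv - fu * fv
    den = math.sqrt(N * fu2 - fu ** 2) * math.sqrt(N * fv2 - fv ** 2)
    return num / den

r = r_grouped(N=100, fu=0, fv=-100, fu2=120, fv2=200, fuv=48)
print(round(r, 4))  # ~0.4382, agreeing with the 0.4381 obtained above
```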
10.4.3 SPEARMAN’S RANK CORRELATION COEFFICIENT
So far, we have confined our discussion to correlation between two variables which can be measured and quantified in appropriate units of money, time, etc. Sometimes, however, the data on two variables are given in the form of ranks based on some criterion. Here we introduce a method to study correlation between the ranks of the variables rather than their absolute values. This method was developed by the British psychologist Charles Edward Spearman in 1904. In other words, this method is used in situations in which a quantitative measure of certain qualitative factors, such as judgement, brand personalities, TV programmes, leadership, colour or taste, cannot be fixed, but individual observations can be arranged in a definite order. The ranking is assigned by using a set of ordinal rank numbers, with 1 for the individual observation ranked first, either in terms of quantity or quality, and n for the individual observation ranked last in a group of n pairs of observations. Mathematically, Spearman's rank correlation coefficient is defined as:
R = 1 − 6Σd² / [n(n² − 1)] ......................................(10.11)
where R = rank correlation coefficient,
d = the difference between the pairs of ranks of the same individual in the two characteristics, and
n = the number of pairs.
Advantages and Disadvantages of Spearman's Correlation Coefficient Method:
Advantages:
(i) It is easy to understand and its application is simpler than Pearson's method.
(ii) It can be used to study correlation when variables are expressed in qualitative terms like beauty, intelligence, honesty,
efficiency and so on.
(iii) It is appropriate for measuring the association between two variables if the data type is at least ordinal scaled (ranked).
(iv) The sample data of values of the two variables are converted into ranks, either in ascending or descending order, for calculating the degree of correlation between the two variables.
Disadvantages:
(i) Values of both variables are assumed to be normally distributed and to describe a linear rather than a non-linear relationship.
(ii) It is not applicable in the case of a bivariate frequency distribution.
(iii) It needs a large computational time when the number of pairs of values of the two variables exceeds 30.
Case I: When ranks are given
When observations in a data set are already arranged in a particular order (rank), take the differences d between the pairs of ranks of the same individual, square these differences and obtain the total Σd². Finally, apply formula (10.11) to calculate the correlation coefficient.
Example 10.6: An office has 12 clerks. The long-serving clerks feel that they should have a seniority increment based on length of service built into their salary structure. An assessment of their efficiency by their departmental manager and the personnel department produces a ranking of efficiency. This is shown below together with a ranking of their length of service.

Ranking according to length of service : 1 2 3 4 5 6 7 8 9 10 11 12
Ranking according to efficiency        : 2 3 5 1 9 10 11 12 8 7 6 4

Do the data support the clerks' claim for a seniority increment?
Solution: To determine whether the data support the clerks’
claim, we use Spearman’s correlation coefficient which is given by
1nnd6
1R 2
2
Since in the given data the ranks have already been assigned, we prepare the following table for calculation.
Therefore, Spearman’s correlation coefficient is given by,
R = 1 − (6 × 178) / [12(12² − 1)] = 1 − 1068/1716 = 0.378
Thus from the result we observe that there exists a low degree of positive correlation between length of service and efficiency. Therefore the claim of the clerks for a seniority increment based on length of service is not justified.
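The calculation of Example 10.6 can be checked with a short sketch in Python. The ranks and the 0.378 result are from the example; the function assumes there are no tied ranks.

```python
# Spearman's R = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied ranks.
def spearman_rank(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

service    = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
efficiency = [2, 3, 5, 1, 9, 10, 11, 12, 8, 7, 6, 4]
print(round(spearman_rank(service, efficiency), 3))  # 0.378
```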
Example 10.7: Ten competitors in a beauty contest are ranked by three judges in the following order:

1st Judge : 1 6 5 10 3 2 4 9 7 8
2nd Judge : 3 5 8 4 7 10 2 1 6 9
3rd Judge : 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to determine which pair of judges has the nearest approach to common tastes in beauty.
Solution: The pairs of judges can be formed in ³C₂ = 3 ways, as follows: (i) Judge 1 and Judge 2, (ii) Judge 2 and Judge 3, and (iii) Judge 3 and Judge 1. Now let R₁, R₂ and R₃ denote the ranks assigned by the first, second and third judges respectively, and let R_ij be the rank correlation coefficient between the ranks assigned by the ith and jth judges, i, j = 1, 2, 3. Let d_ij = R_i − R_j be the difference of the ranks of an individual given by the ith and jth judges.
We have n = 10. Applying the formula, Spearman's rank correlation coefficient for the first pair is given by
R₁₂ = 1 − 6Σd₁₂² / [n(n² − 1)] = 1 − (6 × 200)/(10 × 99) = −7/33 = −0.2121
Since the correlation coefficient R₁₃ is the maximum, the pair of first and third judges has the nearest approach to common
tastes in beauty.
Since R₁₂ and R₂₃ are negative, the pairs of judges (1, 2) and (2, 3) have opposite tastes in beauty.
Case II: When ranks are not given
Spearman’s Rank correlation coefficient can also be used evenif we are dealing with variables which are measured quantitativelyi.e., when the pairs of observations in the data set are not ranked asin case I. In such a situation, we shall have to assign ranks to thegiven set of data. The highest (smallest) observation is given the rank1. The next highest (next lowest) observation is given the rank 2 andso on. It is to be noted that the same approach (i.e., either ascendingor descending) should be followed for all the variables underconsiderations.
Example 10.8: Calculate Spearman's rank correlation coefficient between advertising cost and sales from the following data:

Advertisement cost ('000 Rs.) : 39 65 62 90 82 75 25 98 36 78
Sales (lakhs)                 : 47 53 58 86 62 68 60 91 51 84

Solution: Let the variable X denote the advertisement cost ('000 Rs.) and the variable Y denote the sales (lakhs). Let us now start ranking from the highest value for both the variables, as given below:
Here n = 10. Therefore, Spearman's rank correlation is
R = 1 − 6Σd² / [n(n² − 1)] = 1 − (6 × 30)/(10 × 99) = 1 − 180/990 = 0.82
The result shows a high degree of positive correlation between advertising cost and sales.
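The two steps of Case II — rank the raw values, then apply formula (10.11) — can be sketched as follows, using the data of Example 10.8 (there happen to be no ties, so every value gets a distinct rank):

```python
# Rank raw values (highest value gets rank 1), then apply
# R = 1 - 6*sum(d^2) / (n*(n^2 - 1)).  Data from Example 10.8.
def ranks_desc(values):
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]  # valid only when no ties

cost  = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]
rx, ry = ranks_desc(cost), ranks_desc(sales)
n = len(rx)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
R = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(R, 2))  # 30 0.82
```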
Case III: When ranks are equal
While ranking observations in the data set by considering either the highest value or the lowest value as rank 1, we may encounter a situation where more than one observation is of equal size. In such a case, the rank assigned to the individual observations is the average of the ranks which these observations would have got had they differed from each other. For example, if two observations are ranked equal at third place, then the average rank (3 + 4)/2 = 3.5 is assigned to both observations. Similarly, if three observations are ranked equal at third place, then the average rank (3 + 4 + 5)/3 = 4 is assigned to these three observations.
When equal ranks are assigned to a few observations in the data set, an adjustment is needed in Spearman's rank correlation coefficient formula, as given below:
R = 1 − 6[Σd² + (m₁³ − m₁)/12 + (m₂³ − m₂)/12 + ....] / [n(n² − 1)]
where mᵢ (i = 1, 2, .....) stands for the number of times an observation is repeated in the data set, for either variable.
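A minimal sketch of the tie-corrected formula, using a small hypothetical data set (not the data of Example 10.9, whose table is not reproduced here): tied values share the average of the ranks they occupy, and each group of m tied values adds (m³ − m)/12 to the correction term.

```python
from collections import Counter

def avg_ranks(values):
    # Rank ascending; tied values get the mean of the positions they occupy.
    order = sorted(values)
    return [sum(i + 1 for i, o in enumerate(order) if o == v) / values.count(v)
            for v in values]

def spearman_tied(x, y):
    rx, ry = avg_ranks(x), avg_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Groups with m = 1 contribute zero, so summing over all groups is safe.
    correction = sum((m ** 3 - m) / 12 for m in Counter(x).values()) \
               + sum((m ** 3 - m) / 12 for m in Counter(y).values())
    return 1 - 6 * (d2 + correction) / (n * (n ** 2 - 1))

# Hypothetical data: x has one tied pair (the two 2s share rank 2.5).
print(spearman_tied([1, 2, 2, 4], [1, 2, 3, 4]))  # 0.9
```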
Example 10.9: A financial analyst wanted to find out whether inventory turnover influences any company's earnings per share (in per cent). A random sample of 7 companies listed in a stock exchange was selected and the following data were recorded for each:
Find the strength of association between inventory turnover andearnings per share. Interpret the findings.
Solution: Let us start ranking from the lowest value of both the variables. Since there are tied ranks, the tied ranks are averaged and the average is assigned to each of the tied observations, as shown below:
It may be noted that the value 5 of variable x is repeated twice
(m₁ = 2) and the values 8 and 13 of variable y are also repeated twice, so m₂ = 2 and m₃ = 2. Applying the formula gives the rank correlation coefficient.
The result shows a very weak positive association between inventory turnover and earnings per share.
10.5 PROBABLE ERROR OF CORRELATION COEFFICIENT

Having determined the value of the correlation coefficient, the next step is to find the extent to which it is dependable. The probable error of the correlation coefficient, usually denoted by P.E.(r), is an old measure for testing the reliability of an observed value of the correlation coefficient, in so far as it depends upon the conditions of random sampling.
If r is the observed correlation coefficient in a sample of n pairs of observations, then its standard error, usually denoted by S.E.(r), is given by
S.E.(r) = (1 − r²)/√n
The probable error of the correlation coefficient is given by
P.E.(r) = 0.6745 × S.E.(r) = 0.6745 (1 − r²)/√n
We take the factor 0.6745 because in a normal distribution 50% of the observations lie in the range μ ± 0.6745σ, where μ is the mean and σ is the s.d.
Uses of Probable Error:
The important uses of the probable error of the correlation coefficient, P.E.(r), are given below:
(a) P.E.(r) may be used to determine the two limits (r ± P.E.(r)) within which there is a 50% chance that the correlation coefficients of randomly selected samples from the same population will lie.
(b) P.E.(r) may be used to test whether an observed value of the sample correlation coefficient is significant of any correlation in the population. The rules for testing the
significance of the population correlation coefficient are as below:
(i) If |r| < 6 P.E.(r), then the population correlation coefficient is not significant.
(ii) If |r| > 6 P.E.(r), then the population correlation coefficient is significant.
(iii) In other situations nothing can be concluded with certainty.
It is to be mentioned that one should use the probable error to test the significance of the population correlation coefficient only when the number of pairs of observations is fairly large. Moreover, the probable error can be applied only under the following conditions:
(a) The data must have been drawn from a normal population.
(b) The observations included in the sample must have been drawn randomly.
Example 10.10: The following are the marks obtained by 10 students in Mathematics and Statistics in an examination. Determine Karl Pearson's coefficient of correlation for these two series of marks. Calculate the probable error of this correlation coefficient and examine the reliability (significance) of the correlation coefficient. Also compute the limits within which the population correlation coefficient may be expected to lie.
Solution: Let the marks in Maths and the marks in Stats be denoted by the variables X and Y respectively. Let us shift both the origin and scale of the original variables X and Y to obtain the new variables U and V, given by
U = (X − 60)/5, V = (Y − 65)/5
(the scale 5 being a common factor of each of X and Y).
We have r_XY = r_UV. Now we prepare the following table to compute r_UV.
Then,
r_XY = r_UV = [nΣUV − (ΣU)(ΣV)] / {√[nΣU² − (ΣU)²] √[nΣV² − (ΣV)²]} = 0.9031
Again,
P.E.(r) = 0.6745 (1 − r²)/√n = 0.6745 × [1 − (0.9031)²]/√10 = (0.6745 × 0.19)/3.1623 = 0.128155/3.1623 = 0.0405
Reliability of the value of r:
We have r = 0.9031 and 6 × P.E.(r) = 6 × 0.0405 = 0.243. Since the value of r is much higher than 6 × P.E.(r), the value of r is highly significant.
Limits for the population correlation coefficient:
r ± P.E.(r) = 0.9031 ± 0.0405, i.e., 0.8626 and 0.9436.
This implies that if we take another sample of size 10 from the same population, then its correlation coefficient can be expected to lie between 0.8626 and 0.9436.
Note: When we say r is reliable or significant, it usually means that, on average, students getting good marks in Mathematics also get good marks in Statistics, and students getting poor marks in Mathematics also get poor marks in Statistics. We must not interpret it to mean that all the students getting good (poor) marks in Mathematics also get good (poor) marks in Statistics. This is because correlation indicates an average relationship between the two series only, and not a relationship between the individual items of the series.
10.6 COEFFICIENT OF DETERMINATION
The coefficient of correlation between two variables is a measure of the degree of linear relationship that may exist between them, and indicates the amount of variation of one variable which is associated with, or accounted for, by the other variable. A more useful and readily comprehensible measure for this purpose is the coefficient of determination, which gives the percentage of the total variability of the dependent variable that is accounted for, or explained, by the independent variable. In other words, the coefficient of determination gives the ratio of the explained variance to the total variance. The coefficient of determination is the square of the correlation coefficient, i.e., r². The value of r² lies between 0 and 1. For example, let two variables, say x and y, be inter-dependent, with variation in x causing variation in y, and let the correlation coefficient between them be 0.9. The coefficient of determination in this situation is r² = 0.81, which implies that 81% of the variation in the dependent variable y is due to, or explained by, variation in the independent variable x. The remaining 19% is due to, or explained by, other factors.
The various values of the coefficient of determination can be interpreted in the following way:
(i) r² = 0 indicates that no variation in y can be explained by the variable x, which in turn indicates that there exists no association between x and y.
(ii) r² = 1 indicates that the values of y are completely explained by x, which in turn indicates that there exists perfect association between x and y.
(iii) 0 < r² < 1 reveals the degree of explained variation in y as a result of variation in the values of x. A value of r² closer to 0 shows a low proportion of the variation in y explained by x, while a value of r² closer to 1 shows that x can predict the actual value of y well.
Example 10.11: Five students of a Management Programme at a certain Institute were selected at random. Their Intelligence Quotient (I.Q.) and the marks obtained by them in the paper on Decision Science (including Statistics) were as follows:
Calculate the coefficient of determination and interpret the result.
Solution: Here, we may consider I.Q. as the independent variable X, and Marks in Decision Science as the dependent variable Y. This is because the marks obtained would generally depend on the I.Q. of a student.
Now, we prepare the following table for calculation
Now, we have Coefficient of determination = r²
where
r = [nΣd_x d_y − (Σd_x)(Σd_y)] / {√[nΣd_x² − (Σd_x)²] √[nΣd_y² − (Σd_y)²]}
= 120 / (√88 × √250) = 120/148.32 = 0.809
so that r² = (0.809)² = 0.6545
This implies that 65.45% of the variation in the marks is explained by I.Q. The remaining 34.55% of the variation in the marks could be due to other factors, like preparation for the examination by the students, their mental frame during the examination, etc.
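The whole computation of r² from raw data can be sketched as below. The I.Q./marks pairs are hypothetical stand-ins (the textbook's table is not reproduced here); only the method mirrors this section.

```python
import math

# Coefficient of determination r^2 from raw data (hypothetical values).
def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

iq    = [100, 110, 120, 105, 115]   # hypothetical I.Q. scores
marks = [55, 62, 74, 58, 70]        # hypothetical marks
r = pearson_r(iq, marks)
print(round(r ** 2, 4))  # proportion of variation in marks explained by I.Q.
```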
CHECK YOUR PROGRESS

Q 4: Under what situation is the rank correlation coefficient used?
Q 5: What is the coefficient of determination? Interpret the meaning of r² = 0.49.
10.7 REGRESSION ANALYSIS
Correlation analysis deals with exploring the correlation that might exist between two or more variables and indicates the degree and direction of their association, but fails to answer the question:
Is there any functional relationship between the two variables? If yes, can it be used to estimate the most likely value of one variable, given the value of the other variable?
Thus the statistical technique that expresses the relationship between two or more variables in the form of an equation, to estimate the value of one variable based on the given value of another variable, is called regression analysis. The variable whose value is estimated using the algebraic equation is called the dependent variable, and the variable whose value is used to estimate this value is called the independent variable. The linear algebraic equation used for expressing a dependent variable in terms of an independent variable is called a linear regression equation.
In many business situations, it has been observed that decision making is based upon an understanding of the relationship between two or more variables. For example, a sales manager might be interested in knowing the impact of advertising on sales. Here, advertising can be considered as the independent variable and sales as the dependent variable. This is an example of simple linear regression, where a single independent variable is used to predict a single numerical dependent variable.
The meaning of the term regression is "stepping back towards the average." The term regression was first introduced by Sir Francis Galton in 1877. His study of the heights of one thousand fathers and sons exhibited an interesting result: he found that tall fathers tend to have tall sons and short fathers tend to have short sons, but the average height of the sons of a group of tall fathers was less than that of the fathers. Galton concluded that abnormally tall or short parents tend to "regress" or "step back" to the average population height.
Advantages of Regression Analysis:
Some of the important advantages of regression analysis are given below:
1. Regression analysis helps in developing a regression equation with the help of which the value of a dependent variable can be estimated for any given value of the independent variable.
2. It helps to determine the standard error of estimate, which measures the variability of the values of the dependent variable about the regression line, i.e., how well the line fits the data. When all the points fall on the line, the standard error of estimate becomes zero.
3. When the sample size is large (n ≥ 30), interval estimation for predicting the value of a dependent variable based on the standard error of estimate is considered to be acceptable. On changing the values of either x or y, the magnitude of r² remains constant regardless of the values of the two variables.
Correlation versus Regression:
(a) With the help of correlation one measures the covariation between two variables. In correlation, neither variable may be termed the dependent or the independent variable. Since correlation does not establish a relationship between the two variables as such, one cannot estimate the value of one variable corresponding to a given value of the other variable.
Regression establishes a functional relationship between the two variables, and hence one can estimate the value of one variable corresponding to a given value of the other variable.
(b) With the help of correlation analysis, one cannot study which variable is the cause and which is the effect. For example, a high degree of positive correlation between price and supply does not indicate whether supply is the effect of price or price is the effect of supply.
Regression analysis, in contrast to correlation, determines the cause-and-effect relationship between x and y; that is, a change in the value of the independent variable x causes a corresponding change in the value of the dependent variable y, if all other factors that affect y remain unchanged.
(c) The correlation coefficient between two variables x and y is always symmetric, i.e., r_XY = r_YX, but the regression coefficient is not symmetric in general, i.e., b_XY ≠ b_YX.
(d) The correlation coefficient is independent of change of both origin and scale; the regression coefficients are independent of change of origin only, but not of scale.
10.8 REGRESSION LINES
A regression line is the line from which one can get the best estimated value of the dependent variable corresponding to a given value of the independent variable. Thus a regression line is the line of best fit. The term best fit is interpreted in accordance with the Principle of Least Squares, which consists in minimising the sum of the squares of the residuals or errors of estimates.
In the case of two variables x and y, we usually have two regression lines, because each variable may be treated as the dependent as well as the independent variable. For example, consider two variables, price (P) and supply (S). We know that, other conditions remaining the same, if the price of a commodity increases (decreases), the supply of the commodity also increases (decreases); in this case P is the independent variable and S is the dependent variable. Also, other conditions remaining the same, when the supply of a commodity increases (decreases), its price decreases (increases); in this case supply is the independent variable and price is the dependent variable. Thus for two variables x and y we have two regression lines. It is to be noted that
(i) when r_XY = ±1, i.e., when there exists either perfect positive or perfect negative correlation between x and y, both the lines of regression coincide;
(ii) when r_XY = 0, i.e., when x and y are uncorrelated, the two lines of regression become perpendicular to each other.
Thus when we consider the variable x as the independent variable and y as the dependent variable, we get the regression equation of y on x; similarly, in the case of the regression equation of x on y, we have y as the independent variable and x as the dependent variable. Sometimes, of course, from two correlated variables it is not possible to obtain both the regression lines. For example, if the variable x denotes the amount of rainfall in some years and the variable y denotes the production of paddy in those years, then obviously x can be considered only as the independent variable and y only as the dependent variable.
The regression line of y on x is that line which gives the best estimated value of y corresponding to a given value of x.
The regression line of x on y is that line which gives the best estimated value of x corresponding to a given value of y.
10.8.1 DETERMINATION OF THE REGRESSION LINE OF y ON x
Let (x₁, y₁), (x₂, y₂), ........., (x_n, y_n) be n pairs of observations on the two variables x and y under study. Let
y = a + bx ….....…….(10.12)
be the line of regression (best fit) of y on x.
For any given point Pᵢ(xᵢ, yᵢ) in the scatter diagram, the error of estimate or residual given by the line of best fit (10.12) is PᵢHᵢ. Now, the x coordinates of Hᵢ and Pᵢ are the same, viz. xᵢ, and since Hᵢ lies on the line (10.12), the y coordinate of Hᵢ is a + bxᵢ. Hence the error of estimate for Pᵢ is given by
PᵢHᵢ = PᵢM − HᵢM = yᵢ − (a + bxᵢ)
which is the error for the ith point. We will have such errors for all the points in the scatter diagram. For the points which lie above the line the error is positive, and for the points which lie below the line the error is negative.
By applying the method of least squares, the unknown constants a and b in (10.12) need to be determined in such a manner that the sum of the squares of the errors of estimates is minimum. In other words, we have to minimise
E = Σᵢ₌₁ⁿ (PᵢHᵢ)² = Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)²
subject to variations in a and b.
E may also be expressed as
E = Σ(y − y_e)² = Σ(y − a − bx)² .............................(10.13)
where y_e is the estimated value of y as given by (10.12) for a given value of x, the summation (Σ) being taken over the n pairs of observations.
Using the principle of maxima and minima in differential calculus, E will have an optimum (maximum or minimum) for variations in a and b if its partial derivatives with respect to a and b vanish separately. Hence, from (10.13) we get
∂E/∂a = 0 ⟹ −2 Σ(y − a − bx) = 0
∂E/∂b = 0 ⟹ −2 Σ x(y − a − bx) = 0
On simplifying, we have
Σy = na + bΣx …………………(10.14)
and Σxy = aΣx + bΣx² …………….....(10.15)
Equations (10.14) and (10.15) are called the normal equations. From the given values of x and y we calculate Σx, Σy, Σx² and Σxy. Putting these values in (10.14) and (10.15) and solving these two equations simultaneously for a and b, we get
b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] ................(10.16)
and a = ȳ − b x̄
Now, putting these values in equation (10.12), we get the regression equation of y on x, which is given by
y − ȳ = b(x − x̄)
where b is called the regression coefficient of y on x and is generally denoted by the symbol b_yx. Thus, writing b_yx for b in the above equation, we have
y − ȳ = b_yx (x − x̄) ......................(10.17)
The regression line of y on x given by (10.17) is to be used to estimate the most probable or average value of y for any given value of x.
10.8.2 DETERMINATION OF THE REGRESSION LINE OF x ON y
Let the line of regression of x on y to be estimated for the given data be
x = a + by ..........................(10.18)
Applying the least squares method in a similar manner as discussed in the case of the regression line of y on x, we have the following two normal equations for determining the values of a and b:
Σx = na + bΣy ………………….(10.19)
Σxy = aΣy + bΣy² …………………..(10.20)
By solving these equations simultaneously for a and b and putting these values in equation (10.18), we obtain the regression line of x on y, given by
x − x̄ = b_xy (y − ȳ) ………………(10.21)
where
b_xy = [nΣxy − (Σx)(Σy)] / [nΣy² − (Σy)²] ……….(10.22)
Equation (10.21) is the regression line of x on y, where b_xy, given by equation (10.22), is called the regression coefficient of x on y.
The regression line of y on x given by (10.17) is to be used to estimate the most probable or average value of y for any given value of x. The regression line of x on y given by (10.21) is to be used to estimate the most probable or average value of x for any given value of y.
Remark: (i) When there exists either perfect positive or perfect negative correlation between x and y, i.e., r = ±1, we have b_yx = ±σ_y/σ_x, and consequently the regression equation of y on x becomes:
(y − ȳ)/σ_y = ±(x − x̄)/σ_x ……………………………..(*)
On the other hand, in such a situation the regression equation of x on y becomes:
(x − x̄)/σ_x = ±(y − ȳ)/σ_y ……………… …………..(**)
From equations (*) and (**) we conclude that we have the same line. Thus, if there exists either perfect positive or perfect negative correlation between the two variables, i.e., when r = ±1, the two regression lines coincide.
(ii) The two lines of regression pass through the common point (x̄, ȳ), since this point satisfies both the regression equations.
10.8.3 REGRESSION COEFFICIENT
The regression coefficient of y on x, i.e., b_yx, gives the amount of increase (decrease) in y corresponding to a one unit increase (decrease) in x when b_yx is positive. On the other hand, a negative value of b_yx gives the amount of decrease (increase) in y corresponding to a unit increase (decrease) in x.
Similarly, if b_xy is positive, it gives the amount of increase (decrease) in x corresponding to a unit increase (decrease) in y. Again, a negative value of b_xy gives the amount of decrease (increase) in x corresponding to a unit increase (decrease) in y.
Other expressions for the regression coefficients:
The regression coefficient of y on x, i.e., b_yx, can also be expressed as
b_yx = Cov(x, y)/σ_x² …………………………(A)
and b_yx = r σ_y/σ_x … ....………………….(B)
where r is the correlation coefficient between x and y.
Again, we have
Cov(x, y) = (1/n) Σ(x − x̄)(y − ȳ) and σ_x² = (1/n) Σ(x − x̄)²
∴ (A) ⟹ b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
= [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] ………………………..(C)
We are left with the same equation as obtained in equation (10.16).
Similarly, the regression coefficient of x on y can also be expressed as below. We have
b_xy = Cov(x, y)/σ_y² ……..…………………(A′)
b_xy = r σ_x/σ_y ………………………..(B′)
= [nΣxy − (Σx)(Σy)] / [nΣy² − (Σy)²] ………………………(C′)
which is the same expression as obtained in equation (10.22).
Step deviation method for ungrouped data: When the actual mean values x̄ and ȳ are in fractions, the calculation of the regression coefficients can be simplified by taking deviations of the x and y values from their assumed means A and B respectively. Thus, when d_x = X − A and d_y = Y − B, where A and B are the assumed means of the x and y values, then
b_yx = [nΣd_x d_y − (Σd_x)(Σd_y)] / [nΣd_x² − (Σd_x)²] ..........(D)
b_xy = [nΣd_x d_y − (Σd_x)(Σd_y)] / [nΣd_y² − (Σd_y)²] ..........(D′)
Properties of Regression Coefficients:
Property 1: The correlation coefficient is the geometric mean of the regression coefficients.
Proof: We have the regression coefficient of y on x,
b_yx = r σ_y/σ_x
and similarly the regression coefficient of x on y,
b_xy = r σ_x/σ_y
Multiplying these two expressions, we get
b_yx b_xy = r (σ_y/σ_x) × r (σ_x/σ_y) = r²
∴ r = ±√(b_yx b_xy)
Thus the correlation coefficient r is the geometric mean of the two regression coefficients b_yx and b_xy.
Property 2: The correlation coefficient and the two regression coefficients are simultaneously positive or simultaneously negative.
Proof: We have
b_yx = r σ_y/σ_x and b_xy = r σ_x/σ_y
A standard deviation, being the positive square root of a squared quantity, can never be negative. Here we assume that σ_x ≠ 0 and σ_y ≠ 0, so that σ_x > 0 and σ_y > 0.
Therefore, we observe that when r is positive, both b_yx and b_xy are positive, and when r is negative, both b_yx and b_xy are negative.
Thus we can conclude that when both b_yx and b_xy are positive, then r = +√(b_yx b_xy), and when both b_yx and b_xy are negative, then r = −√(b_yx b_xy).
Property 3: The product of the two regression coefficients cannot exceed unity.
Proof: We have r = ±√(b_yx b_xy) and −1 ≤ r ≤ 1.
If the product of the two regression coefficients exceeded unity, then |r| = √(b_yx b_xy) would exceed unity, as the square root of a number greater than one is itself greater than one. In that case r would be greater than 1 if b_yx and b_xy are positive, and less than −1 if b_yx and b_xy are negative, which is impossible since −1 ≤ r ≤ 1. This indicates that the product of the two regression coefficients cannot exceed unity.
Property 4: The two regression coefficients are independent of the change of origin but are dependent on the change of scale.
Proof: The property states that if we change the origin of the variables, the values of the regression coefficients remain unchanged, but if we change their scale, their values get changed.
Let u and v be the new variables obtained by changing the origin and the scale of the original variables x and y as follows:
u = (x − a)/h, v = (y − b)/k ……………….(i)
where a, b, h and k (h, k > 0) are constants.
r_xy = r_uv ……………….(ii)
Due to the transformation (i), σ_x = h σ_u and σ_y = k σ_v.
Now,
b_yx = r_xy (σ_y/σ_x) = r_uv (k σ_v)/(h σ_u) = (k/h) b_vu
i.e., b_yx = (k/h) b_vu ………………(iii)
Similarly,
b_xy = r_xy (σ_x/σ_y) = r_uv (h σ_u)/(k σ_v) = (h/k) b_uv
i.e., b_xy = (h/k) b_uv ……………………..(iv)
Hence, from equations (iii) and (iv), we conclude that the two regression coefficients are independent of the change of origin but not of the change of scale.
Property 5: The regression coefficient is not symmetric, i.e., in general, b_xy ≠ b_yx.
Proof: We have
b_xy = r σ_x/σ_y and b_yx = r σ_y/σ_x
We observe that, in general, b_xy ≠ b_yx. The regression coefficients b_yx and b_xy and the correlation coefficient r become equal only when σ_x = σ_y, which usually does not occur.
CHECK YOUR PROGRESS
Q 6: State whether the following statements are true or false:
(i) Regression analysis is a statistical technique that expresses the functional relationship in the form of an equation.
(ii) The correlation coefficient is the geometric mean of the regression coefficients.
(iii) If one of the regression coefficients is greater than one, the other must also be greater than one.
(iv) The product of the regression coefficients is always more than one.
(v) If b_xy is negative, then b_yx is negative.
10.9 STANDARD ERROR OF AN ESTIMATE
The regression equations enable us to estimate the value of the dependent variable for any given value of the independent variable. The estimates so obtained are, however, not perfect. A measure of the precision of the estimates obtained from the regression equations is provided by the Standard Error (S.E.) of the estimate. While the standard deviation of the values of a variable measures the variation or scatteredness of the values about their arithmetic mean, the standard error of estimate measures the variation or scatteredness of the points of the scatter diagram about the regression line. The more closely the dots cluster around the regression line, the more representative the line is of the relationship between the two variables, and the better the estimates based on the equation of this line. If all the dots lie on the regression line, then there is no variation about the line, and as a result the correlation between the variables will be perfect.
Thus, the standard error (S.E.) of estimate of y for given x, denoted by S_yx, is defined by

S_yx = √[ Σ(y − y_c)² / (n − 2) ]

where y_c denotes the estimated values of y obtained from the regression line.
To simplify the calculation of S_yx, the following equivalent formula is used:

S_yx = √[ (Σy² − aΣy − bΣxy) / (n − 2) ]

where a and b are respectively the intercept and the slope of the regression line of y on x, which are determined by the method of least squares.
Similarly, the standard error (S.E.) of estimate of x for given y, denoted by S_xy, is defined by

S_xy = √[ Σ(x − x_c)² / (n − 2) ]

To simplify the calculation of S_xy, the following equivalent formula is used:

S_xy = √[ (Σx² − aΣx − bΣxy) / (n − 2) ]

where a and b are respectively the intercept and the slope of the regression line of x on y, which are determined by the method of least squares.
Again, much more convenient formulas for numerical computation are given by

S_yx = σ_y √(1 − r²) and S_xy = σ_x √(1 − r²)
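One caveat worth noting: the shortcut σ_y√(1 − r²) is exact when the residual sum of squares is divided by n; with the (n − 2) divisor used in the definitions above, the two values differ by the factor √(n/(n − 2)). The sketch below (NumPy, synthetic data, not from the text) verifies this relationship:

```python
# Relation between the two versions of S_yx: with divisor n the identity
# S_yx = sigma_y * sqrt(1 - r^2) is exact, so the (n - 2)-divisor value
# equals it times sqrt(n / (n - 2)). Data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 3, 50)
y = 1.5 * x + rng.normal(0, 2, 50)
n = len(x)

b = np.cov(x, y, bias=True)[0, 1] / np.var(x)    # slope of y on x
a = y.mean() - b * x.mean()                      # intercept
resid = y - (a + b * x)

s_n2 = np.sqrt((resid ** 2).sum() / (n - 2))     # definition with n - 2
shortcut = y.std() * np.sqrt(1 - np.corrcoef(x, y)[0, 1] ** 2)  # divisor n

assert np.isclose(s_n2, shortcut * np.sqrt(n / (n - 2)))
print("identity verified")
```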
Example 10.12: A company is introducing a job evaluation scheme in which all jobs are graded by points for skill, responsibility, and so on. Monthly pay scales (Rs. in 1000's) are then drawn up according to the number of points allocated and other factors such as experience and local conditions. To date the company has applied this scheme to 9 jobs:
(a) Find the least squares regression line linking pay scales to points.
(b) Estimate the monthly pay for a job graded by 20 points.

Solution: We consider monthly pay (Y) as the dependent variable and job grade points (X) as the independent variable. The least squares regression line linking pay scales to points, i.e., the line of regression of Y on X, is given by

Y − Ȳ = b_YX (X − X̄)

Now, we prepare the following table for calculation.
(a) Here, X̄ = ΣX/n = 137/9 = 15.22; Ȳ = ΣY/n = 48.15/9 = 5.35

Since the mean values X̄ and Ȳ are non-integers, the deviations are taken from assumed means, as done in the above table.
b_YX = [ nΣd_X d_Y − (Σd_X)(Σd_Y) ] / [ nΣd_X² − (Σd_X)² ] = 0.133

where d_X and d_Y are the deviations of X and Y from their assumed means (the required sums are obtained from the table above).
Substituting these values of X̄, Ȳ and b_YX in the regression line, we have

Y − 5.35 = 0.133(X − 15.22)
Y = 5.35 + 0.133X − 2.024
Y = 0.133X + 3.326
(b) For job grade point X = 20, the estimated average pay scale is given by

Y = 3.326 + 0.133X = 3.326 + 0.133(20) = 5.986

Hence, the likely monthly pay for a job with grade points 20 is Rs. 5,986 (pay scales being in Rs. 1000's).
Example 10.13: In the estimation of the regression equations of two variables X and Y the following results were obtained:

X̄ = 90, Ȳ = 70, n = 10, Σx² = 6360, Σy² = 2860, Σxy = 3900, where x = X − X̄ and y = Y − Ȳ

Obtain the two regression equations.
Solution: We have the line of regression of Y on X given by

Y − Ȳ = b_YX (X − X̄) ……………….(i)

where

b_YX = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = Σxy/Σx² = 3900/6360 = 0.6132

From (i) the required regression equation is

Y − 70 = 0.6132(X − 90)
Y = 0.6132X + 70 − 55.188
Y = 0.6132X + 14.812
Similarly, the line of regression of X on Y is given by,
X − X̄ = b_XY (Y − Ȳ) ……………….(ii)

where

b_XY = Σ(X − X̄)(Y − Ȳ) / Σ(Y − Ȳ)² = Σxy/Σy² = 3900/2860 = 1.3636

From (ii) the required regression equation is

X − 90 = 1.3636(Y − 70)
X = 1.3636Y + 90 − 95.452
X = 1.3636Y − 5.452
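The computation of Example 10.13 can be reproduced directly from the summary sums; a minimal sketch (the small rounding differences from the text come from using unrounded coefficients):

```python
# Reproducing Example 10.13 from its summary statistics: with
# x = X - Xbar and y = Y - Ybar, the regression coefficients are
# b_YX = sum(xy)/sum(x^2) and b_XY = sum(xy)/sum(y^2).
X_bar, Y_bar = 90, 70
sum_x2, sum_y2, sum_xy = 6360, 2860, 3900

b_yx = sum_xy / sum_x2    # 0.6132..., coefficient of Y on X
b_xy = sum_xy / sum_y2    # 1.3636..., coefficient of X on Y

# Y on X: Y = Y_bar + b_yx * (X - X_bar)
intercept_y = Y_bar - b_yx * X_bar    # about 14.81 (text: 14.812, rounded slope)
# X on Y: X = X_bar + b_xy * (Y - Y_bar)
intercept_x = X_bar - b_xy * Y_bar    # about -5.45 (text: -5.452, rounded slope)
print(round(b_yx, 4), round(intercept_y, 2), round(b_xy, 4), round(intercept_x, 2))
```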
Example 10.14: The two lines of regression are:

4x − 5y + 30 = 0 and 20x − 9y − 107 = 0

Which of these is the line of regression of x on y? Determine r and σ_y when σ_x = 3.
Solution: We are given the regression lines as

4x − 5y + 30 = 0 ……………….(i)
20x − 9y − 107 = 0 ……………….(ii)

In order to determine the line of regression of x on y, we apply the property of regression coefficients that b_yx · b_xy = r² ≤ 1. In the given problem, let (i) be the line of regression of x on y and (ii) be the line of regression of y on x.

From (i), x = (5/4)y − 30/4, so b_xy = 5/4

From (ii), y = (20/9)x − 107/9, so b_yx = 20/9

Now, r² = b_xy · b_yx = (5/4)(20/9) = 2.7778

But 0 ≤ r² ≤ 1, therefore our assumption is wrong.
Hence (i) is the line of regression of y on x and (ii) is the line of regression of x on y.

Taking (i) as the line of regression of y on x, we have

5y = 4x + 30, i.e., y = (4/5)x + 6

Regression coefficient of y on x = b_yx = 4/5

Taking (ii) as the line of regression of x on y, we have

20x = 9y + 107, i.e., x = (9/20)y + 107/20

Regression coefficient of x on y = b_xy = 9/20

r² = b_yx · b_xy = (4/5)(9/20) = 0.36

r = 0.6 (since both the regression coefficients are positive, r must be positive)

Again, we have b_xy = r (σ_x / σ_y)

9/20 = 0.6 × (3 / σ_y) (since σ_x = 3)

σ_y = 0.6 × 3 × (20/9) = 4
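The decision logic of Example 10.14 can be sketched with exact fractions: try each assignment of the two lines and keep the one whose product of regression coefficients does not exceed unity.

```python
# Example 10.14's logic with exact arithmetic via the fractions module.
from fractions import Fraction as F

# Line (i)  4x - 5y + 30 = 0:  as y on x, b_yx = 4/5;  as x on y, b_xy = 5/4.
# Line (ii) 20x - 9y - 107 = 0: as y on x, b_yx = 20/9; as x on y, b_xy = 9/20.

assert F(5, 4) * F(20, 9) > 1        # wrong assignment: r^2 = 25/9 > 1

r2 = F(4, 5) * F(9, 20)              # correct assignment
assert r2 == F(36, 100)              # r^2 = 0.36, hence r = 0.6

# b_xy = r * sigma_x / sigma_y, so sigma_y = r * sigma_x / b_xy with sigma_x = 3
sigma_y = F(3, 5) * 3 / F(9, 20)     # r = 3/5
assert sigma_y == 4
print("r = 0.6, sigma_y = 4")
```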
Example 10.15: The following data relate to advertising expenditure (Rs. in lakh) and the corresponding sales (Rs. in crore):

Advertising expenditure: 10 12 15 23 20
Sales : 14 17 23 25 21

(a) Find the equation of the least squares line fitting the data.
(b) Estimate the value of sales corresponding to an advertising expenditure of Rs. 30 lakh.
(c) Calculate the standard error of estimate of sales on advertising expenditure.
Solution: (a) Let the advertising expenditure be denoted by x and sales by y. Then we obtain the least squares line of y on x, which is of the form

y − ȳ = b_yx (x − x̄) ……………….(i)
where

b_yx = [ nΣd_x d_y − (Σd_x)(Σd_y) ] / [ nΣd_x² − (Σd_x)² ] ……………….(ii)

and d_x, d_y are the deviations of x and y from their assumed means.
Now we construct the following table for calculation.
(a) Therefore, from (i), the regression equation of y on x is

y = 8.608 + 0.712x

which is the required least squares line of sales on advertising expenditure.
(b) The least squares line obtained in part (a) may be applied to estimate the sales corresponding to an advertising expenditure of Rs. 30 lakh as:

y = 8.608 + 0.712(30) = 29.968

i.e., estimated sales of about Rs. 29.97 crore.
(c) The standard error of estimate of sales (y) on advertising expenditure (x), denoted by S_yx, is defined by

S_yx = √[ (Σy² − aΣy − bΣxy) / (n − 2) ] …………………….(iii)
Now we make the following table to calculate S_yx.

From (iii),

S_yx = √[ (2080 − 8.608(100) − 0.712(1684)) / (5 − 2) ]
     = √[ (2080 − 860.8 − 1199.0) / 3 ]
     = 2.594
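Example 10.15 can be verified end to end from the raw data; a minimal sketch (small differences from the text's figures come from carrying unrounded a and b through the calculation):

```python
# Reproducing Example 10.15 with the simplified formula
# S_yx = sqrt((sum(y^2) - a*sum(y) - b*sum(xy)) / (n - 2)).
x = [10, 12, 15, 23, 20]   # advertising expenditure (Rs. lakh)
y = [14, 17, 23, 25, 21]   # sales (Rs. crore)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope, ~0.712
a = sum_y / n - b * sum_x / n               # intercept, ~8.61 (text: 8.608)

est = a + b * 30                            # part (b): about 29.97 (Rs. crore)
s_yx = ((sum_y2 - a * sum_y - b * sum_xy) / (n - 2)) ** 0.5  # ~2.595 (text: 2.594)
print(round(b, 3), round(est, 2), round(s_yx, 3))
```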
CHECK YOUR PROGRESS
Q 7: State whether the following statements are true or false:
(i) Standard error of estimate is a measure of the scatter of the observations about the regression line.
(ii) The standard error of estimate of y on x, S_yx, is equal to σ_y(1 − r²).
(iii) The smaller the value of S_yx, the better the line fits the data.
10.10 LET US SUM UP
Correlation means the existence of a relationship between variables. When two variables deviate in the same direction we have positive correlation, and when they move in opposite directions we say that there exists negative correlation between the variables.

We have learnt various methods with the help of which we can ascertain the existence of a relationship between variables. These methods include: (i) the scatter diagram method, (ii) Karl Pearson's correlation coefficient method, and (iii) Spearman's rank correlation coefficient method.
Scatter diagram is a graphic tool to portray the relationship between variables. Karl Pearson's correlation coefficient measures the strength of the linear association between variables, with values near zero indicating a lack of linearity while values near −1 or +1 suggest linearity. Karl Pearson's coefficient of correlation is designated by r.

We have also discussed a very important concept in correlation and regression analysis termed the coefficient of determination. It is defined as the fraction of the variation in one variable that is explained by the variation in the other variable; in other words, it measures the proportion of variation in the dependent variable that can be attributed to the independent variable. It ranges from 0 to 1 and is the square of the coefficient of correlation. Thus a coefficient of 0.82 suggests that 82% of the variation in Y is accounted for by X.

This unit also focuses on the process of developing a model, known as a regression model, under regression analysis, which is used to predict the value of a dependent variable from at least one independent variable.
10.11 FURTHER READINGS
1) Srivastava, T.N., Rego, S. (2008). Statistics for Management. New Delhi. Tata McGraw Hill Education Private Limited.

2) Sharma, J.K. (2007). Business Statistics. New Delhi. Pearson Education Ltd.

3) Hazarika, P.L. (2016). Essential Statistics for Economics and Business Studies. New Delhi. Akansha Publishing House.

4) Lind, D.A., Marshal, W.G., Wathen, S.A. (2009). Statistical Techniques in Business and Economics. New Delhi. Tata McGraw Hill Education Private Limited.

5) Bajpai, N. (2014). Business Statistics. New Delhi. Pearson Education Ltd.
10.12 ANSWERS TO CHECK YOUR PROGRESS
Ans. to Q No 1: (i) False, (ii) False, (iii) True, (iv) True.
Ans. to Q No 2:

r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² Σ(Y − Ȳ)² ] = 43/√(72 × 32) = 43/48 = 0.9
Ans. to Q No 3: We have r = Cov(X, Y)/(σ_X σ_Y)

0.25 = 3.66/(6 × σ_Y) (since Var(X) = 36, therefore σ_X = 6)

σ_Y = 3.66/1.5 = 2.44

Ans. to Q No 4: Rank correlation coefficient is used in situations in which quantitative measures of certain qualitative factors such as judgment, leadership, colour, taste, etc. cannot be fixed, but individual observations can be arranged in a definite order.
Ans. to Q No 5: Coefficient of determination is a statistical measure of the proportion of the variation in the dependent variable that is explained by the independent variable.

A coefficient of determination r² = 0.49, or 49%, indicates that only 49% of the variation in the dependent variable y can be accounted for by the variable x. The remaining 51% of the variability may be due to other factors.
Ans. to Q No 6: (i) True, (ii) True, (iii) False, (iv) False, (v) True.
Ans. to Q No 7: (i) True, (ii) False, (iii) True.
10.13 MODEL QUESTIONS
1. What is correlation? Define positive, negative and zero correlation.
2. What is a scatter diagram? Discuss by means of suitable scatter diagrams the different types of correlation that may exist between the variables in bivariate data.
3. What is Karl Pearson's correlation coefficient? How would you interpret the value of a coefficient of correlation?
4. Distinguish between the coefficient of determination and the coefficient of correlation. How would you interpret the value of a coefficient of determination?
5. What is the rank correlation coefficient method? Bring out its usefulness.
6. Explain the concept of regression and point out its usefulness in dealing with business problems.
7. Write a short note on the probable error of the correlation coefficient.
8. What is linear regression? Why are there two regression lines? When do these become identical?
9. Show that r² = b_xy · b_yx.
10. Find the coefficient of correlation from the following data:
Cost : 39 65 62 90 82 75 25 98 36 78
Sales : 47 53 58 86 62 68 60 91 51 84
Also interpret the result.
11. From the following data, calculate Karl Pearson's correlation coefficient:

Σ(Xᵢ − X̄)² = 120, Σ(Yᵢ − Ȳ)² = 346, Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = 193.2, where each sum runs over i = 1 to 9.
12. Calculate the coefficient of correlation and its probable error from the following data:

X: 1 2 3 4 5 6 7 8 9 10
Y: 20 16 14 10 10 9 8 7 6 5

13. Two departmental managers ranked a few trainees according to their perceived abilities. The rankings are given below:
Calculate an appropriate correlation coefficient to measure the consistency in the rankings.

14. You are given below the following information about advertisement expenditure and sales:

The correlation coefficient is 0.8.
(a) Obtain both the regression equations.
(b) Find the likely sales when advertisement expenditure is Rs. 25 crore.
(c) What should be the advertisement budget if the company wants to attain a sales target of Rs. 150 crore?

15. A company believes that the number of salespersons employed is a good predictor of sales. The following table exhibits sales (in thousands of Rs.) and the number of salespersons employed for different years:

Obtain a simple regression model to predict sales based on the number of salespersons employed.

16. The HR manager of a multinational company wants to determine the relationship between experience and income of employees. The following data are collected from 14 randomly selected employees.

(a) Develop a regression model to predict income based on the years of experience.
(b) Calculate the coefficient of determination and interpret the result.
(c) Calculate the standard error of estimate.
(d) Predict the income of an employee who has 22 years of experience.
*** ***** ***