Statistics for Management
Simple Correlation and Regression Unit 10
UNIT 10: SIMPLE CORRELATION AND REGRESSION
UNIT STRUCTURE
10.1 Learning Objectives
10.2 Introduction
10.3 Correlation Analysis
10.3.1 Correlation and Causation
10.3.2 Types of Correlation
10.4 Methods of Measuring Correlation
10.4.1 Scatter diagram Method
10.4.2 Karl Pearson’s correlation coefficient method
10.4.3 Spearman’s Rank Correlation Coefficient
10.5 Probable error of correlation coefficient
10.6 Coefficient of determination
10.7 Regression Analysis
10.8 Regression Lines
10.8.1 Determination of the regression line of y on x
10.8.2 Determination of the regression line of x on y
10.8.3 Regression coefficient
10.9 Standard Error Of Estimate
10.10 Let Us Sum Up
10.11 Further Readings
10.12 Model Questions
10.1 LEARNING OBJECTIVES
After going through this unit, you will be able to:
• explain how correlation analysis expresses quantitatively the
degree and direction of the association between two variables
• compute and interpret different measures of correlation, namely Karl
Pearson’s Correlation Coefficient, Spearman’s Rank Correlation
Coefficient
• use regression analysis for estimating the relationship between
variables
• use the least squares method to estimate an equation for predicting
future values of the dependent variable
10.2 INTRODUCTION
In many business environments, we come across problems or
situations where two variables seem to move in the same direction such
as both are increasing or decreasing. At times an increase in one variable
is accompanied by a decline in another. Thus, if two variables are such that
when one changes the other also changes then the two variables are said
to be correlated. The knowledge of such a relationship is important to make
inferences from the relationship between variables in a given situation. For
example, a marketing manager may be interested to investigate the degree
of relationship between advertising expenditure and the sales volume. The
manager would like to know whether the money that he is going to
spend on advertising is justified in terms of the sales generated.
Correlation analysis is used as a statistical technique to ascertain
the association between two quantitative variables. Usually, in correlation,
the relationship between two variables is expressed by a pure number i.e.,
a number without having any unit of measurement. This pure number is
referred to as coefficient of correlation or correlation coefficient which
indicates the strength and direction of statistical relationship between
variables.
It may be noted that correlation analysis is one of the most widely
employed statistical devices adopted by applied statisticians and has been
used extensively not only in biological problems but also in agriculture,
economics, business and several other fields. In this unit we shall introduce
correlation analysis for two variables.
The importance of examining the relationship between two or more
variables can be stated in the form of following questions and accordingly
requires the statistical devices to arrive at conclusions:
(i) Does there exist an association between two or more variables? If
exists, what is the form and the degree of that relationship?
(ii) Is the relationship strong or significant enough to be useful to arrive
at a desirable conclusion?
(iii) Can the relationship be used for predictions of future events, that
is, to forecast the most likely value of a dependent variable
corresponding to the given value of independent variable or
variables?
The first two questions can be answered with the help of correlation
analysis while the final question can be answered by using the regression
analysis.
In case of correlation analysis, the data on values of two variables
must come from sampling in pairs, one for each of the two variables.
10.3 CORRELATION ANALYSIS
By the term ‘correlation’ we mean the relationship between two
variables. Two variables are said to be correlated if the change in one variable
results in a corresponding change in the other variable. In the practical field
we need to investigate the type of relationship that might exist between the
ages of husbands and wives, the heights of fathers and sons, the amount
of rainfall and the volume of production of a particular crop, the price of a
commodity and the demand for it, and so on. For example, if the price
of a commodity increases, there is a decline in its demand. Thus there exists
correlation between the two variables price and demand. In correlation, we
study the nature and the degree of relationship between the two variables.
Uses of Correlation Analysis:
In spite of certain limitations correlation analysis is a widely used
statistical device. With the help of correlation analysis one can ascertain
the existence as well as degree and direction of relations between two
variables. It is an indispensable tool of analysis for the people of Economics
and Business. Variables in Economics and Business are usually interrelated.
In order to study the nature (positive or negative) and degree (low, moderate
or high) of relationship between any two of such related variables correlation
analysis is used. In reality, besides Business and Economics, it is
extensively used in various other branches.
10.3.1 CORRELATION AND CAUSATION
Correlation analysis helps us to have an idea about the
degree and direction of the relationship between the two variables
under study. However, it fails to reflect upon the cause and effect
relationship between the variables. If there exists a cause and effect
relationship between the two variables, they are bound to vary in
sympathy with each other and, therefore, there is bound to be a high
degree of correlation between them. In other words, causation always
implies correlation. However, the converse is not true i.e., even a fairly
high degree of correlation between the two variables need not imply a
cause and effect relationship between them. The high degree of
correlation between the variables may be due to the following reasons:
(i) Mutual dependence: The phenomena under study may
mutually influence each other. Such situations are usually observed in
data relating to economic and business situations. For example,
variables like price, supply, and demand of a commodity are mutually
correlated. According to the principle of economics, as the price of a
commodity increases, its demand decreases, so price influences
the demand level. But if demand of a commodity increases due to
growth in population, then its price also increases. In this case
increased demand makes an effect on the price. However, the amount
of export of a commodity is influenced by an increase or decrease in
custom duties but the reverse is normally not true.
(ii) Pure chance: It may happen that a small randomly
selected sample from a bivariate distribution (i.e., a distribution in
which each unit of the series assumes two values) may show a fairly
high degree of correlation though, actually, such a relationship may
not exist in the universe. Such correlation may be due to chance
fluctuations. Moreover, the conscious or unconscious bias on the part
of the investigator, in the selection of the sample may also result in
high degree of correlation in the sample. It may be noted that in both
the phenomena a fairly high degree of correlation may be observed,
though it is not possible to conceive them as being causally related.
(iii) Influence of external factors: A high degree of
correlation may be observed between two variables due to the effect
or interaction of a third variable, or of a number of variables, on each of
these variables. For example, a fairly high degree of correlation may
be observed between the yield per hectare of two crops, say, rice
and potato, due to the effect of a number of factors like favourable
weather conditions, fertilizers used, irrigation facilities, etc., on each
of them. But neither of the two is the cause of the other.
10.3.2 TYPES OF CORRELATION
Correlation may be broadly classified into the following three
types:
(a) Positive, Negative and Zero Correlation,
(b) Linear and Non-linear Correlation,
(c) Simple, Partial and Multiple Correlation.
(a) Positive, Negative and Zero Correlation:
Positive Correlation: Two variables are said to be positively
or directly correlated if the values of the two variables deviate or move
in the same direction i.e., if the increase in the values of one variable
results, on an average, in a corresponding increase in the values of
the other variable or if a decrease in the values of one variable results,
on an average, in a corresponding decrease in the values of the other
variable. For example, there exists positive correlation between the
following pairs of variables.
(i) The income and expenditure of a family on luxury items,
(ii) Advertising expenditure and the sales volume of a company,
(iii) Amount of rainfall and yield of a crop,
(iv) Height and weight of a student
(v) Price and supply of a commodity,
(vi) Temperature and sale of cold drinks on different days
of a month in summer.
When the changes in two related variables are exactly
proportional and are in the same direction then we say that there is
perfect positive correlation between them. For example, there exists
perfect positive correlation between the following pairs of sets of data
where each set of data may be assumed to be the values of a variable.
X: 10 20 30 40 50
Y: 2 4 6 8 10
U: 40 35 30 25 20 15 10
V: 14 12 10 8 6 4 2
Negative Correlation: The correlation between two variables
is said to be negative or inverse if the variables deviate in the opposite
direction i.e., if the increase (decrease) in the values of one variable
results, on an average, in a corresponding decrease (increase) in the
values of the other variable. The following pairs of variables are
negatively correlated:
(i) Price and demand for a commodity,
(ii) Volume and pressure of a perfect gas,
(iii) Sale of woolen garments and the day temperature,
(iv) Number of workers and time required to complete a
work
When the changes in two related variables are exactly
proportional but are in the opposite directions then we say that there
is perfect negative correlation between them. The correlation between
each of the following pairs of variables is perfectly negative:
X: 60 50 40 30 20
Y: 2 4 6 8 10
U: 0 1 2 3
V: 2 -3 -8 -13
Zero Correlation: Two variables are said to have zero
correlation or no correlation if they tend to change with no connection
to each other. In such situation the variables are said to be
uncorrelated. For example, one should expect zero correlation
between the yield of crop and the heights of students, or between
price of rice and demand for sugar.
(b) Linear and Non-linear Correlation: The correlation
between two variables is said to be linear if corresponding to a unit
change in one variable, there is a constant change in the other variable
over the entire range of the values. The following example illustrates
a linear correlation between the two variables X and Y.
X: 10 20 30 40 50
Y: 40 60 80 100 120
When these pairs of values X and Y are plotted on a graph
paper, the line obtained by joining the points would be a straight line.
In general, two variables X and Y are said to be linearly related
if the relationship that exists between the two variables is of the form
given by,
Y= a +bX
Where ‘b’ is the slope and ‘a’ the intercept.
On the other hand, correlation is said to be non-linear when the
amount of change in the values of one variable does not bear a
constant ratio to the amount of change in the corresponding values
of the other variable. The following example illustrates a non-linear
correlation between the given variables.
X: 8 9 9 10 10 28 29 30
Y: 80 130 170 150 230 560 460 600
When these pair of values are plotted on a graph paper, the
line obtained by joining these points would not be a straight line, rather
it would be curvi-linear.
(c) Simple, Partial and Multiple Correlation: The distinction
amongst Simple, Partial and Multiple Correlation depends upon the
number of variables involved under study.
In Simple correlation only two variables are introduced to
study the relationship between them. A study on income with respect
to saving only, or sales revenue with respect to amount of money
spent on advertisement etc. are a few examples studied under Simple
Correlation.
When the study involves more than two variables then it is a
problem of either Partial or Multiple Correlation. In Partial Correlation,
we study relationship between two variables while the effect of other
variable is held constant. In other words, in Partial Correlation we
study the linear relationship between a dependent variable and one
particular independent variable out of a set of independent variables
when all other variables are held constant. For example, suppose our
study involves three variables X1, X2 and Y where X1 is the number of
hours studied, X2 is I.Q. and Y is the marks secured in the examination.
Now if we study the relationship between number of hours (X1) and
marks obtained (Y) by the student keeping the effect of I.Q. (X2)
constant then it is a problem of Partial Correlation.
In Multiple Correlation three or more than three variables are
studied simultaneously. For example, the study of the relationship
between the production of a particular crop on one side and rainfall
and use of fertilizer on the other side falls under Multiple Correlation.
CHECK YOUR PROGRESS
Q 1: State whether the following statements are true or false:
(i) Correlation helps to formulate the relationship between the variables.
(ii) If the relationship between variables x and y is positive, then as the
variable y decreases, the variable x increases.
(iii) In a negative relationship, as x increases, y decreases.
(iv) Multiple correlation deals with studying three or more variables
simultaneously.
10.4 METHODS OF MEASURING CORRELATION
Here we shall confine our discussion to methods of measuring
linear relationships only. The commonly used methods for studying the
correlation between two variables are:
(i) Scatter Diagram method
(ii) Karl Pearson’s correlation coefficient method
(iii) Spearman’s Rank correlation method
(iv) Concurrent Deviation method
10.4.1 SCATTER DIAGRAM METHOD
Scatter Diagram is one of the simplest methods of diagrammatic
representation of a bivariate distribution and is used to study the nature
(i.e., positive, negative or zero) and degree (i.e., weak or strong) of
correlation between two variables. A scatter diagram can be obtained
on a graph paper by plotting observed pairs of values of variables x
and y, considering the independent variable values on the x-axis and
the dependent variable values on the y-axis. Suppose we are given n
pairs of values (x1, y1), (x2, y2), ...., (xn, yn) of two variables X and
Y. These n points may be plotted as dots in the xy-plane. The
diagram of dots so obtained is known as a scatter diagram. From the
scatter diagram we can form a fairly good, though rough, idea about the
existence of a relationship between the two variables. After plotting the
points in the plane we may have one of the types of scatter diagrams
as shown below:
Now with the help of Scatter Diagram, we can interpret the
correlation between the two variables as:
(i) If the points are very close to each other, then we can expect
a fairly good amount of correlation between the two variables. On the
other hand, if there appears to be no obvious pattern of the points of
the scatter diagram then it indicates that there is either no correlation
or very low amount of correlation between the variables.
(ii) If the points on the scatter diagram reveal any trend (either
upward or downward), the variables are said to be correlated; the
variables are uncorrelated if no trend is revealed.
(iii) If there is an upward trend rising from lower left hand corner
to the upper right hand corner then we expect a positive correlation
because in such a situation both the variables move in the same
direction. On the other hand, if the points on the scatter diagram depict
a downward trend starting from upper left hand corner to the lower
right hand corner, the correlation is negative since in this case the
values of the two variables run in the opposite direction.
(iv) In particular, if all the points lie on a straight line starting
from the left bottom and going up towards the right top, the correlation
is perfect and positive, and if all the points lie on a straight line starting
from left top and coming down to right bottom, the correlation is perfect
and negative.
Remark: 1.The Scatter Diagram method enables us to form a
rough idea of the nature of the relationship between the two variables
simply by inspection of the graph. However, this method is not suitable
for situations involving a large number of observations.
2. The method of scatter diagram provides information only about
the nature of the relationship that is whether it is positive or negative
and whether it is high or low but fails to provide an exact measure of
the extent of the relationship between the two variables.
Example 10.1: The percentage examination scores of 10
students in Data analysis and Economics were as follows. Draw a
scatter diagram for the data and comment on the nature of correlation.
Student : A B C D E F G H I J
Data Analysis: 65 90 52 44 95 36 48 63 80 15
Economics : 62 71 58 58 64 40 42 66 67 55
Solution: A scatter diagram will give a preliminary indication of
whether linear correlation exists. We plot the ordered pairs (65, 62),
(90, 71),………., (15, 55) as shown in the following figure.
Since the points are very close to each other, we may expect a
high degree of correlation. Further, since the points reveal an upward
trend starting from left bottom to top right hand corner, the correlation
is positive. Hence, we conclude that there exists a high degree of
positive correlation between the scores of the students in Data Analysis
and Economics.
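The visual impression from the diagram can be cross-checked numerically. The following Python sketch (not part of the original text) computes Pearson's correlation coefficient, introduced in the next section, for the ten score pairs:

```python
import math

# Scores from Example 10.1: Data Analysis (x) and Economics (y)
x = [65, 90, 52, 44, 95, 36, 48, 63, 80, 15]
y = [62, 71, 58, 58, 64, 40, 42, 66, 67, 55]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Deviation form: sum of cross-products over the square root of the
# product of the sums of squared deviations
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # positive and fairly large, matching the upward trend
```

The value comes out clearly positive, consistent with the conclusion drawn from the diagram.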
10.4.2 KARL PEARSON’S CORRELATION
COEFFICIENT METHOD
The scatter diagram gives a rough indication of the nature and
extent/strength of the relationship between the two variables. The
quantitative measurement of the degree of linear relationship between
two variables, say ‘x’ and ‘y’, is given by a parameter called correlation
coefficient. It was developed by Karl Pearson, and his method of
measuring correlation between two variables is called the coefficient
of correlation or correlation coefficient. It is also known as the product
moment coefficient. For a set of n pairs of values of x and y, Karl
Pearson's correlation coefficient, usually denoted by r_XY, r_xy or
simply r, is defined by,
r = Cov(X, Y) / √[Var(X) Var(Y)]   ……………..(10.1)

or,   r = Cov(X, Y) / (σ_X σ_Y)

where   Cov(X, Y) = (1/n) Σ(X − X̄)(Y − Ȳ),
        Var(X) = σ_X² = (1/n) Σ(X − X̄)²,
        Var(Y) = σ_Y² = (1/n) Σ(Y − Ȳ)².

Substituting these values in the definition of r, we have

r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]   ..................... (10.2)

  = [n ΣXY − ΣX ΣY] / √[(n ΣX² − (ΣX)²)(n ΣY² − (ΣY)²)]   ..................... (10.3)

  = [ΣXY − n X̄ Ȳ] / √[(ΣX² − n X̄²)(ΣY² − n Ȳ²)]   ..................... (10.3′)
Step deviation method for ungrouped data: When the actual
means X̄ and Ȳ are in fractions, the calculation of Pearson's
correlation coefficient can be simplified by taking deviations of the x
and y values from their assumed means A and B, respectively. Thus,
when d_x = X − A and d_y = Y − B, where A and B are the assumed
means of the x and y values, the formula (10.2) becomes

r = [n Σd_x d_y − Σd_x Σd_y] / √[(n Σd_x² − (Σd_x)²)(n Σd_y² − (Σd_y)²)]   ..........(10.5)
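As a quick sketch (the data and the assumed means A, B below are illustrative, not from the text), the raw-sum formula (10.3) and the step-deviation formula give identical values, since r is unaffected by a shift of origin:

```python
import math

def pearson_raw(x, y):
    """Pearson's r from raw sums, as in equation (10.3)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))

def pearson_step_deviation(x, y, A, B):
    """Same r computed from deviations dx = x - A, dy = y - B."""
    dx = [a - A for a in x]
    dy = [b - B for b in y]
    # The step-deviation formula (10.5) has the same shape in dx, dy
    return pearson_raw(dx, dy)

x = [10, 20, 30, 40, 50]
y = [42, 58, 81, 98, 121]
print(pearson_raw(x, y))                          # close to +1
print(pearson_step_deviation(x, y, A=30, B=80))   # identical value
```

Any choice of A and B gives the same result, which is why the assumed means are chosen purely for arithmetic convenience.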
Assumptions of Using Pearson’s Correlation Coefficient:
Karl Pearson's correlation coefficient r_XY is based on the
following four assumptions:
(i) It is appropriate to calculate when both variables x and y are
measured on an interval or a ratio scale.
(ii) The random variables X and Y are normally distributed.
(iii) There is a linear relationship between the variables.
(iv) There is a cause and effect relationship between the two
variables that influences the distributions of both the variables.
Merits and Limitations of Correlation Coefficient: The
correlation coefficient is a number lying between -1 and +1
that summarizes the magnitude as well as the direction of association
between two variables. The chief merits of this method are given
below:
(i) Karl Pearson’s coefficient of correlation is a widely used
statistical device.
(ii) It summarizes the degree (high, moderate or low) in one
figure.
(iii) It is based on all the observations.
However, analysis based on the Pearsonian coefficient is subject
to certain severe limitations, which are presented as:
(i) A value of r which is near zero does not necessarily indicate
that the two variables X and Y are unrelated; it merely indicates
that there is no linear relationship between them. There may be a
curvilinear or some complex relationship between the two variables
which Pearson's formula cannot detect, as it is an instrument for
measuring linear correlation only.
(ii) Correlation is only a measure of the nature and degree of
relationship between two variables and it gives no indication of the
kind of cause and effect relationship that may exist between the two
variables. It fails to identify the variables as dependent or independent
variables. Correlation theory simply seeks to discover if a covariation
between two variables exists or not. Statistical correlation technique
may reveal a very close relationship between two variables, but it
cannot tell us about the cause and effect relationship between them,
or which variable causes the other to react.
(iii) Two causally unconnected variables may exhibit a high degree
of correlation between them. For example, the data relating to the yield
of rice and wheat may show a fairly high degree of positive correlation
although there is no connection between the two variables, viz., yield
of rice and yield of wheat. This may be due to the favourable impact
of extraneous factors like weather conditions, fertilizers used, irrigation
facilities, improved variety of seeds etc. on both of them.
(iv) Sometimes a high correlation between two variables may be
entirely spurious: it may exist purely due to chance, and such
correlations are accordingly termed chance correlations.
Properties of Correlation Coefficient:
Some of the important properties of Correlation coefficient are
given below:
(a) Correlation coefficient is a pure number.
(b) Correlation coefficient is independent of change of origin
and scale of measurement.
We shall now establish property (b).
Proof: Suppose we want to study the relationship between two
variables X and Y. Let these variables be transformed to the new
variables U and V by the change of origin and scale viz.,
U = (X − a)/h   and   V = (Y − b)/k   ....…………………….(10.6)

where a and b are known as the assumed means or origins, and h and k
(h, k > 0) are known as the scales of measurement.

Therefore from (10.6) we have,

X = a + hU   and   Y = b + kV   ………………………...(10.7)

Summing both sides and dividing by n we get,

X̄ = a + hŪ   and   Ȳ = b + kV̄   ………………………….(10.8)

Subtracting (10.8) from (10.7) we have,

X − X̄ = h(U − Ū)   and   Y − Ȳ = k(V − V̄)

Putting these values in equation (10.2) we get,

r_XY = Σ h(U − Ū) · k(V − V̄) / √[Σ h²(U − Ū)² · Σ k²(V − V̄)²]

     = hk Σ(U − Ū)(V − V̄) / [hk √(Σ(U − Ū)² Σ(V − V̄)²)]

     = Σ(U − Ū)(V − V̄) / √[Σ(U − Ū)² Σ(V − V̄)²] = r_UV
Since r_XY = r_UV, the correlation coefficient between the
two original variables X and Y is equal to the correlation coefficient
between the new variables U and V (where the new variables U and V
are obtained from X and Y respectively after changing the origin and
scale of X and Y). Hence, we conclude that Correlation coefficient is
independent of change of origin and scale of measurement.
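A short Python sketch of property (b), with hypothetical data and an arbitrary change of origin and scale (a = 20, h = 2, b = 10, k = 5):

```python
import math

def pearson(x, y):
    """Pearson's r in deviation form, as in equation (10.2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [12, 15, 19, 24, 30]
y = [3, 7, 8, 12, 15]

# Transformed variables u = (x - a)/h and v = (y - b)/k
u = [(a - 20) / 2 for a in x]
v = [(b - 10) / 5 for b in y]

print(pearson(x, y), pearson(u, v))  # the two values coincide
```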
(c) Correlation coefficient lies between -1 and +1, i.e., -1 ≤ r ≤ +1.

Proof: Let us consider two variables X and Y with arithmetic
means X̄ and Ȳ and standard deviations σ_X and σ_Y respectively.
Let us now consider the sum of squares

Σ [(X − X̄)/σ_X ± (Y − Ȳ)/σ_Y]²

which is always non-negative, i.e.,

Σ [(X − X̄)/σ_X ± (Y − Ȳ)/σ_Y]² ≥ 0

Expanding,

Σ(X − X̄)²/σ_X² + Σ(Y − Ȳ)²/σ_Y² ± 2 Σ(X − X̄)(Y − Ȳ)/(σ_X σ_Y) ≥ 0

Now dividing by n, we get

1 + 1 ± 2r ≥ 0,   i.e.,   2 ± 2r ≥ 0

and hence

r ≥ -1   and   r ≤ +1
Remarks: 1. This property provides us a check on our
calculations. If in any problem the obtained value of r lies outside
these limits, it implies that there is some mistake in our calculations.
2. r = +1 indicates perfect positive correlation between the
variables and r = -1 indicates perfect negative correlation between
the variables.
(d) If X and Y are two independent variables, then r_XY = 0, but the
converse is not true.

Proof: We have

r = Cov(X, Y) / (σ_X σ_Y) = E[(X − X̄)(Y − Ȳ)] / √{E[(X − X̄)²] E[(Y − Ȳ)²]}   ………………………(10.9)

Now,

E[(X − X̄)(Y − Ȳ)] = E[XY − X̄Y − ȲX + X̄Ȳ]
                 = E(XY) − X̄ E(Y) − Ȳ E(X) + X̄Ȳ
                 = E(XY) − X̄Ȳ
                 = E(X) E(Y) − X̄Ȳ   (since X and Y are given to be
                   independent, E(XY) = E(X) E(Y))
                 = X̄Ȳ − X̄Ȳ = 0

Therefore from (10.9), r = 0.

Thus when two variables are independent, they are uncorrelated,
i.e., r_XY = 0.
But the converse is not true: if two variables are uncorrelated,
they may not be independent. Since Karl Pearson's correlation
coefficient is a measure of only linear correlation between two
variables, r_XY = 0 indicates only that there exists no linear
relationship between the variables. There may, however, exist a
strong non-linear or curvilinear relationship between x and y even
though r_XY = 0.
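That the converse fails can be seen with a tiny Python example (hypothetical data): take y = x² on values of x symmetric about zero, so y is completely determined by x, yet every cross-product of deviations cancels and r comes out exactly zero.

```python
import math

def pearson(x, y):
    """Pearson's r in deviation form, as in equation (10.2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]  # perfect dependence, but not linear

r = pearson(x, y)
print(r)  # 0.0: uncorrelated, although y is a function of x
```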
(e) Correlation coefficient is symmetric, i.e., r_XY = r_YX.

Proof: We have,

r_XY = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]   ....(*)

Now interchanging X and Y we get,

r_YX = Σ(Y − Ȳ)(X − X̄) / √[Σ(Y − Ȳ)² Σ(X − X̄)²]   .......(**)

From (*) and (**) we find that r_XY = r_YX.
Interpretation of Various values of Correlation Coefficient:
The interpretations of various values of r_XY are as follows:
(i) 0 < r_XY < 1 implies that there is positive correlation between
X and Y. The closer the value of r_XY is to +1, the stronger is the
positive correlation.
(ii) r_XY = +1 implies that there exists perfect and positive
correlation between the variables.
(iii) -1 < r_XY < 0 implies that there is negative correlation
between X and Y. The closer the value of r_XY is to -1, the stronger is
the negative correlation.
(iv) r_XY = -1 indicates that the correlation between the variables
is perfect and negative.
(v) r_XY = 0 means that there is no correlation between the variables
and hence the variables are said to be uncorrelated.
CHECK YOUR PROGRESS
Q 2: Given Σ(X − X̄)(Y − Ȳ) = 43, Σ(X − X̄)² = 32 and Σ(Y − Ȳ)² = 72,
what is the correlation coefficient r?
Q 3: Given r = 0.25, Cov(X, Y) = 3.6 and Var(X) = 36, what is S.D.(Y)?
Example 10.2: The following data gives indices of industrial production
and number of registered unemployed people (in lakh). Determine Karl
Pearson’s correlation coefficient
Solution: To calculate the Karl Pearson’s correlation coefficient we prepare
the following table:
X̄ = ΣX/n = 832/8 = 104;   Ȳ = ΣY/n = 120/8 = 15

Thus Karl Pearson's Correlation Coefficient,

r_XY = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]
     = -92 / √(184 × 120) = -0.619

Since the coefficient of correlation r = -0.619 is moderately negative, it
indicates that there is a moderately strong inverse correlation between the
two variables. Hence, we conclude that as the production index increases,
the number of unemployed decreases and vice-versa.
Example 10.3: The following data relate to age of employees and the number
of days they reported sick in a month
Calculate Karl Pearson’s coefficient of correlation and interpret it.
Solution: Let age and sick days be represented by variables X and Y
respectively. Then Karl Pearson’s Correlation coefficient
22XY
YYXX
YYXXr
Now we prepare the following table for calculation
X̄ = ΣX/n = 466/10 = 46.6;   Ȳ = ΣY/n = 40/10 = 4

Therefore we have,

r_XY = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]
     = 230 / √(1092 × 64)
     = 0.870

Since the coefficient of correlation is close to 1 and positive, the age of
employees and the number of sick days are positively correlated to a high
degree. Hence we conclude that as the age of an employee increases, he is
likely to go on sick leave more often than others.
Example 10.4: The following data gives Sales and Net Profit for some of
the top Auto-makers during the quarter July-September 2006. Find the
correlation coefficient.
Solution: Let the average sales and average net profit of the given
automobiles be denoted by X and Y respectively. Then the correlation
coefficient is given by

r = [n ΣXY − ΣX ΣY] / √[(n ΣX² − (ΣX)²)(n ΣY² − (ΣY)²)]

[Using equation (10.3)]
Now we make the following table for calculation:
Therefore the correlation coefficient is given by,

r = [n ΣXY − ΣX ΣY] / √[(n ΣX² − (ΣX)²)(n ΣY² − (ΣY)²)]
  = 14352 / √(13216 × 19184)
  = 14352 / 15922.8
  = 0.901
Correlation Coefficient in case of Grouped Data:
In the case of a bivariate frequency distribution, if we are to deal with a
large volume of data, then the data are classified in the form of a two-way
frequency table known as a bivariate table or correlation table. Here, for
each of the variables, the values are classified into different classes
following the same considerations as in the case of a univariate
distribution. If there are m classes for the values of the variable X and n
classes for the values of the variable Y, then there will be m × n cells in
the two-way table. We shall now discuss the calculation of Karl Pearson's
correlation coefficient with the help of the following example.
Example 10.5: Family income and its percentage spent on food in the case
of 100 families gave the following bivariate frequency distribution. Calculate
the coefficient of correlation.
Solution: Let us denote the income (in Rs.) and the food expenditure (%)
by the variables X and Y respectively. Now, to calculate Karl Pearson's
coefficient of correlation, we follow the steps given below.
Step 1: Find the mid-points of the various classes of the X and Y series.
Step 2: Change the origin and scale of the X series and the Y series to the
new variables u and v by using the transformations:

u = (X − A)/h = (X − 450)/100   and   v = (Y − B)/k = (Y − 17.5)/5

where X and Y here denote the mid-points of the X series and the Y series
respectively, and h and k denote the magnitudes of the classes of the X
and Y series respectively.
Step 3: For each class of X, find the total of the cell frequencies of all the
classes of Y, and similarly for each class of Y find the total of the cell
frequencies of all the classes of X.
Step 4: Multiply the frequencies of X by the corresponding values of the
variable u and find the sum Σfu.
Step 5: Multiply the frequencies of Y by the corresponding values of the
variable v and find the sum Σfv.
Step 6: Multiply the frequency of each cell by the corresponding values of
u and v, and write the product fuv within a square in the right hand top
corner of each cell.
Step 7: Add together all the figures in the top corner squares as obtained in Step 6 to get the column fuv for each of the X and Y series. Finally, find the total of this column to get Σfuv.
Step 8: Multiply the values of fu and fv by the corresponding values of u and v to get the columns for fu² and fv². Add these values to obtain Σfu² and Σfv².
The above calculations are presented in the table. Then,
r_uv = [NΣfuv − (Σfu)(Σfv)] / {√[NΣfu² − (Σfu)²] √[NΣfv² − (Σfv)²]}
= [100 × 48 − (0)(−100)] / {√(100 × 120 − 0²) √(100 × 200 − (−100)²)}
= 4800 / (√12000 × √10000)
= 4800 / 10954.45 = 0.4381
Since the correlation coefficient is independent of change of origin and scale of measurement, r_XY = r_uv = 0.4381.
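The grouped-data shortcut above can be sketched in a few lines of Python. The marginal totals used (Σfu = 0, Σfv = −100, Σfu² = 120, Σfv² = 200, Σfuv = 48) are those reconstructed from the worked table of Example 10.5; the function is just formula-level arithmetic.

```python
import math

# A sketch of the grouped-data formula for r, using the coded totals of
# Example 10.5 (N = 100 families; totals read off the worked table).
def r_grouped(N, fu, fv, fu2, fv2, fuv):
    # r = [N*Σfuv - (Σfu)(Σfv)] / sqrt[N*Σfu² - (Σfu)²] sqrt[N*Σfv² - (Σfv)²]
    num = N * fuv - fu * fv
    den = math.sqrt(N * fu2 - fu ** 2) * math.sqrt(N * fv2 - fv ** 2)
    return num / den

r = r_grouped(N=100, fu=0, fv=-100, fu2=120, fv2=200, fuv=48)
print(round(r, 4))  # ~0.4382, agreeing with the 0.4381 obtained above
```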
10.4.3 SPEARMAN’S RANK CORRELATION COEFFICIENT
So far, we have confined our discussion to correlation between two variables which can be measured and quantified in appropriate units of money, time, etc. Sometimes, however, the data on two variables are given in the form of ranks based on some criterion. Here we introduce a method to study correlation between the ranks of the variables rather than their absolute values. This method was developed by the British psychologist Charles Edward Spearman in 1904. In other words, this method is used in situations in which a quantitative measure of certain qualitative factors, such as judgement, brand personalities, TV programmes, leadership, colour or taste, cannot be fixed, but individual observations can be arranged in a definite order. The ranking is assigned by using a set of ordinal rank numbers, with 1 for the individual observation ranked first, either in terms of quantity or quality, and n for the individual observation ranked last in a group of n pairs of observations. Mathematically, Spearman's rank correlation coefficient is defined as:
R = 1 − 6Σd² / [n(n² − 1)] ......................................(10.11)
where R = rank correlation coefficient,
d = the difference between the pairs of ranks of the same individual in the two characteristics, and
n = the number of pairs.
Advantages and Disadvantages of Spearman's Correlation Coefficient Method:
Advantages:
(i) It is easy to understand and its application is simpler than Pearson's method.
(ii) It can be used to study correlation when variables are expressed in qualitative terms like beauty, intelligence, honesty,
efficiency and so on.
(iii) It is appropriate for measuring the association between two variables if the data type is at least ordinal scaled (ranked).
(iv) The sample data of values of the two variables are converted into ranks, either in ascending or descending order, for calculating the degree of correlation between the two variables.
Disadvantages:
(i) Values of both variables are assumed to be normally distributed and to describe a linear rather than a non-linear relationship.
(ii) It is not applicable in the case of a bivariate frequency distribution.
(iii) It needs a large computational time when the number of pairs of values of the two variables exceeds 30.
Case I: When ranks are given
When observations in a data set are already arranged in a particular order (rank), take the differences d between the pairs of ranks of the same individual, square these differences and obtain the total Σd². Finally, apply formula (10.11) to calculate the correlation coefficient.
Example 10.6: An office has 12 clerks. The long-serving clerks feel that they should have a seniority increment based on length of service built into their salary structure. An assessment of their efficiency by their departmental manager and the personnel department produces a ranking of efficiency. This is shown below together with a ranking of their length of service.

Ranking according to length of service : 1 2 3 4 5 6 7 8 9 10 11 12
Ranking according to efficiency        : 2 3 5 1 9 10 11 12 8 7 6 4

Do the data support the clerks' claim for a seniority increment?
Solution: To determine whether the data support the clerks’
claim, we use Spearman’s correlation coefficient which is given by
1nnd6
1R 2
2
Since in the given data the ranks have already been assigned, we prepare the following table for calculation.
Therefore, Spearman’s correlation coefficient is given by,
R = 1 − (6 × 178) / [12(12² − 1)] = 1 − 1068/1716 = 0.378
Thus from the result we observe that there exists a low degree of positive correlation between length of service and efficiency. Therefore the claim of the clerks for a seniority increment based on length of service is not justified.
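The calculation of Example 10.6 can be checked with a short sketch in Python. The ranks and the 0.378 result are from the example; the function assumes there are no tied ranks.

```python
# Spearman's R = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied ranks.
def spearman_rank(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

service    = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
efficiency = [2, 3, 5, 1, 9, 10, 11, 12, 8, 7, 6, 4]
print(round(spearman_rank(service, efficiency), 3))  # 0.378
```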
Example 10.7: Ten competitors in a beauty contest are ranked by three judges in the following order:

1st Judge : 1 6 5 10 3 2 4 9 7 8
2nd Judge : 3 5 8 4 7 10 2 1 6 9
3rd Judge : 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to determine which pair of judges has the nearest approach to common tastes in beauty.
Solution: The pairs of judges can be formed in ³C₂ = 3 ways, as follows: (i) Judge 1 and Judge 2, (ii) Judge 2 and Judge 3, and (iii) Judge 3 and Judge 1. Now let R₁, R₂ and R₃ denote the ranks assigned by the first, second and third judges respectively, and let R_ij be the rank correlation coefficient between the ranks assigned by the ith and jth judges, i, j = 1, 2, 3. Let d_ij = R_i − R_j be the difference of the ranks of an individual given by the ith and jth judges.
We have n = 10. Applying the formula, Spearman's rank correlation coefficient for the first pair is given by
R₁₂ = 1 − 6Σd₁₂² / [n(n² − 1)] = 1 − (6 × 200)/(10 × 99) = −7/33 = −0.2121
Since the correlation coefficient R₁₃ is the maximum, the pair of first and third judges has the nearest approach to common
tastes in beauty.
Since R₁₂ and R₂₃ are negative, the pairs of judges (1, 2) and (2, 3) have opposite tastes in beauty.
Case II: When ranks are not given
Spearman’s Rank correlation coefficient can also be used evenif we are dealing with variables which are measured quantitativelyi.e., when the pairs of observations in the data set are not ranked asin case I. In such a situation, we shall have to assign ranks to thegiven set of data. The highest (smallest) observation is given the rank1. The next highest (next lowest) observation is given the rank 2 andso on. It is to be noted that the same approach (i.e., either ascendingor descending) should be followed for all the variables underconsiderations.
Example 10.8: Calculate Spearman's rank correlation coefficient between advertising cost and sales from the following data:

Advertisement cost ('000 Rs.) : 39 65 62 90 82 75 25 98 36 78
Sales (lakhs)                 : 47 53 58 86 62 68 60 91 51 84

Solution: Let the variable X denote the advertisement cost ('000 Rs.) and the variable Y denote the sales (lakhs). Let us now start ranking from the highest value for both the variables, as given below:
Here n = 10. Therefore, Spearman's rank correlation is
R = 1 − 6Σd² / [n(n² − 1)] = 1 − (6 × 30)/(10 × 99) = 1 − 180/990 = 0.82
The result shows a high degree of positive correlation between advertising cost and sales.
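The two steps of Case II — rank the raw values, then apply formula (10.11) — can be sketched as follows, using the data of Example 10.8 (there happen to be no ties, so every value gets a distinct rank):

```python
# Rank raw values (highest value gets rank 1), then apply
# R = 1 - 6*sum(d^2) / (n*(n^2 - 1)).  Data from Example 10.8.
def ranks_desc(values):
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]  # valid only when no ties

cost  = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]
rx, ry = ranks_desc(cost), ranks_desc(sales)
n = len(rx)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
R = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(R, 2))  # 30 0.82
```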
Case III: When ranks are equal
While ranking observations in the data set by considering either the highest value or the lowest value as rank 1, we may encounter a situation where more than one observation is of equal size. In such a case, the rank assigned to the individual observations is the average of the ranks which these observations would have got had they differed from each other. For example, if two observations are ranked equal at third place, then the average rank (3 + 4)/2 = 3.5 is assigned to both observations. Similarly, if three observations are ranked equal at third place, then the average rank (3 + 4 + 5)/3 = 4 is assigned to these three observations.
When equal ranks are assigned to a few observations in the data set, an adjustment is needed in Spearman's rank correlation coefficient formula, as given below:
R = 1 − 6[Σd² + (m₁³ − m₁)/12 + (m₂³ − m₂)/12 + ....] / [n(n² − 1)]
where mᵢ (i = 1, 2, .....) stands for the number of times an observation is repeated in the data set, for either variable.
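A minimal sketch of the tie-corrected formula, using a small hypothetical data set (not the data of Example 10.9, whose table is not reproduced here): tied values share the average of the ranks they occupy, and each group of m tied values adds (m³ − m)/12 to the correction term.

```python
from collections import Counter

def avg_ranks(values):
    # Rank ascending; tied values get the mean of the positions they occupy.
    order = sorted(values)
    return [sum(i + 1 for i, o in enumerate(order) if o == v) / values.count(v)
            for v in values]

def spearman_tied(x, y):
    rx, ry = avg_ranks(x), avg_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Groups with m = 1 contribute zero, so summing over all groups is safe.
    correction = sum((m ** 3 - m) / 12 for m in Counter(x).values()) \
               + sum((m ** 3 - m) / 12 for m in Counter(y).values())
    return 1 - 6 * (d2 + correction) / (n * (n ** 2 - 1))

# Hypothetical data: x has one tied pair (the two 2s share rank 2.5).
print(spearman_tied([1, 2, 2, 4], [1, 2, 3, 4]))  # 0.9
```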
Example 10.9: A financial analyst wanted to find out whether inventory turnover influences any company's earnings per share (in per cent). A random sample of 7 companies listed in a stock exchange was selected and the following data were recorded for each:
Find the strength of association between inventory turnover andearnings per share. Interpret the findings.
Solution: Let us start ranking from the lowest value of both the variables. Since there are tied ranks, the tied ranks are averaged and the average is assigned to each of the tied observations, as shown below:
It may be noted that the value 5 of variable x is repeated twice
(m₁ = 2) and the values 8 and 13 of variable y are also repeated twice, so m₂ = 2 and m₃ = 2. Applying the formula gives the rank correlation coefficient.
The result shows a very weak positive association between inventory turnover and earnings per share.
10.5 PROBABLE ERROR OF CORRELATION COEFFICIENT

Having determined the value of the correlation coefficient, the next step is to find the extent to which it is dependable. The probable error of the correlation coefficient, usually denoted by P.E.(r), is an old measure for testing the reliability of an observed value of the correlation coefficient, in so far as it depends upon the conditions of random sampling.
If r is the observed correlation coefficient in a sample of n pairs of observations, then its standard error, usually denoted by S.E.(r), is given by
S.E.(r) = (1 − r²)/√n
The probable error of the correlation coefficient is given by
P.E.(r) = 0.6745 × S.E.(r) = 0.6745 (1 − r²)/√n
We take the factor 0.6745 because in a normal distribution 50% of the observations lie in the range μ ± 0.6745σ, where μ is the mean and σ is the s.d.
Uses of Probable Error:
The important uses of the probable error of the correlation coefficient, P.E.(r), are given below:
(a) P.E.(r) may be used to determine the two limits (r ± P.E.(r)) within which there is a 50% chance that the correlation coefficients of randomly selected samples from the same population will lie.
(b) P.E.(r) may be used to test whether an observed value of the sample correlation coefficient is significant of any correlation in the population. The rules for testing the
significance of the population correlation coefficient are as below:
(i) If |r| < 6 P.E.(r), then the population correlation coefficient is not significant.
(ii) If |r| > 6 P.E.(r), then the population correlation coefficient is significant.
(iii) In other situations nothing can be concluded with certainty.
It is to be mentioned that one should use the probable error to test the significance of the population correlation coefficient only when the number of pairs of observations is fairly large. Moreover, the probable error can be applied only under the following conditions:
(a) The data must have been drawn from a normal population.
(b) The observations included in the sample must have been drawn randomly.
Example 10.10: The following are the marks obtained by 10 students in Mathematics and Statistics in an examination. Determine Karl Pearson's coefficient of correlation for these two series of marks. Calculate the probable error of this correlation coefficient and examine the reliability (significance) of the correlation coefficient. Also compute the limits within which the population correlation coefficient may be expected to lie.
Solution: Let the marks in Maths and the marks in Stats be denoted by the variables X and Y respectively. Let us shift both the origin and scale of the original variables X and Y to obtain the new variables U and V, given by
U = (X − 60)/5, V = (Y − 65)/5
(the scale 5 being a common factor of each of X and Y).
We have r_XY = r_UV. Now we prepare the following table to compute r_UV.
Then,
r_XY = r_UV = [nΣUV − (ΣU)(ΣV)] / {√[nΣU² − (ΣU)²] √[nΣV² − (ΣV)²]} = 0.9031
Again,
P.E.(r) = 0.6745 (1 − r²)/√n = 0.6745 × [1 − (0.9031)²]/√10 = (0.6745 × 0.19)/3.1623 = 0.128155/3.1623 = 0.0405
Reliability of the value of r:
We have r = 0.9031 and 6 × P.E.(r) = 6 × 0.0405 = 0.243. Since the value of r is much higher than 6 × P.E.(r), the value of r is highly significant.
Limits for the population correlation coefficient:
r ± P.E.(r) = 0.9031 ± 0.0405, i.e., 0.8626 and 0.9436.
This implies that if we take another sample of size 10 from the same population, then its correlation coefficient can be expected to lie between 0.8626 and 0.9436.
Note: When we say r is reliable or significant, it usually means that, on average, students getting good marks in Mathematics also get good marks in Statistics, and students getting poor marks in Mathematics also get poor marks in Statistics. We must not interpret it to mean that all the students getting good (poor) marks in Mathematics also get good (poor) marks in Statistics. This is because correlation indicates an average relationship between the two series only, and not a relationship between the individual items of the series.
10.6 COEFFICIENT OF DETERMINATION
The coefficient of correlation between two variables is a measure of the degree of linear relationship that may exist between them, and indicates the amount of variation of one variable which is associated with, or accounted for, by the other variable. A more useful and readily comprehensible measure for this purpose is the coefficient of determination, which gives the percentage of the total variability of the dependent variable that is accounted for, or explained, by the independent variable. In other words, the coefficient of determination gives the ratio of the explained variance to the total variance. The coefficient of determination is the square of the correlation coefficient, i.e., r². The value of r² lies between 0 and 1. For example, let two variables, say x and y, be inter-dependent, with variation in x causing variation in y, and let the correlation coefficient between them be 0.9. The coefficient of determination in this situation is r² = 0.81, which implies that 81% of the variation in the dependent variable y is due to, or explained by, variation in the independent variable x. The remaining 19% is due to, or explained by, other factors.
The various values of the coefficient of determination can be interpreted in the following way:
(i) r² = 0 indicates that no variation in y can be explained by the variable x, which in turn indicates that there exists no association between x and y.
(ii) r² = 1 indicates that the values of y are completely explained by x, which in turn indicates that there exists perfect association between x and y.
(iii) 0 < r² < 1 reveals the degree of explained variation in y as a result of variation in the values of x. A value of r² closer to 0 shows a low proportion of the variation in y explained by x, while a value of r² closer to 1 shows that x can predict the actual value of y well.
Example 10.11: Five students of a Management Programme at a certain Institute were selected at random. Their Intelligence Quotient (I.Q.) and the marks obtained by them in the paper on Decision Science (including Statistics) were as follows:
Calculate the coefficient of determination and interpret the result.
Solution: Here, we may consider I.Q. as the independent variable X, and Marks in Decision Science as the dependent variable Y. This is because the marks obtained would generally depend on the I.Q. of a student.
Now, we prepare the following table for calculation
Now, we have Coefficient of determination = r²
where
r = [nΣd_x d_y − (Σd_x)(Σd_y)] / {√[nΣd_x² − (Σd_x)²] √[nΣd_y² − (Σd_y)²]}
= 120 / (√88 × √250) = 120/148.32 = 0.809
so that r² = (0.809)² = 0.6545
This implies that 65.45% of the variation in the marks is explained by I.Q. The remaining 34.55% of the variation in the marks could be due to other factors, like preparation for the examination by the students, their mental frame during the examination, etc.
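The whole computation of r² from raw data can be sketched as below. The I.Q./marks pairs are hypothetical stand-ins (the textbook's table is not reproduced here); only the method mirrors this section.

```python
import math

# Coefficient of determination r^2 from raw data (hypothetical values).
def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

iq    = [100, 110, 120, 105, 115]   # hypothetical I.Q. scores
marks = [55, 62, 74, 58, 70]        # hypothetical marks
r = pearson_r(iq, marks)
print(round(r ** 2, 4))  # proportion of variation in marks explained by I.Q.
```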
CHECK YOUR PROGRESS

Q 4: Under what situation is the rank correlation coefficient used?
Q 5: What is the coefficient of determination? Interpret the meaning of r² = 0.49.
10.7 REGRESSION ANALYSIS
Correlation analysis deals with exploring the correlation that might exist between two or more variables and indicates the degree and direction of their association, but fails to answer the question:
Is there any functional relationship between the two variables? If yes, can it be used to estimate the most likely value of one variable, given the value of the other variable?
Thus the statistical technique that expresses the relationship between two or more variables in the form of an equation, to estimate the value of one variable based on the given value of another variable, is called regression analysis. The variable whose value is estimated using the algebraic equation is called the dependent variable, and the variable whose value is used to estimate this value is called the independent variable. The linear algebraic equation used for expressing a dependent variable in terms of an independent variable is called a linear regression equation.
In many business situations, it has been observed that decision making is based upon an understanding of the relationship between two or more variables. For example, a sales manager might be interested in knowing the impact of advertising on sales. Here, advertising can be considered as the independent variable and sales as the dependent variable. This is an example of simple linear regression, where a single independent variable is used to predict a single numerical dependent variable.
The meaning of the term regression is "stepping back towards the average." The term regression was first introduced by Sir Francis Galton in 1877. His study of the heights of one thousand fathers and sons exhibited an interesting result: he found that tall fathers tend to have tall sons and short fathers tend to have short sons, but the average height of the sons of a group of tall fathers was less than that of the fathers. Galton concluded that abnormally tall or short parents tend to "regress" or "step back" to the average population height.
Advantages of Regression Analysis:
Some of the important advantages of regression analysis are given below:
1. Regression analysis helps in developing a regression equation with the help of which the value of a dependent variable can be estimated for any given value of the independent variable.
2. It helps to determine the standard error of estimate, which measures the variability of the values of the dependent variable about the regression line, i.e., how well the line fits the data. When all the points fall on the line, the standard error of estimate becomes zero.
3. When the sample size is large (n ≥ 30), interval estimation for predicting the value of a dependent variable based on the standard error of estimate is considered to be acceptable. On changing the values of either x or y, the magnitude of r² remains constant regardless of the values of the two variables.
Correlation versus Regression:
(a) With the help of correlation one measures the covariation between two variables. In correlation, neither variable may be termed the dependent or the independent variable. Since correlation does not establish a relationship between the two variables as such, one cannot estimate the value of one variable corresponding to a given value of the other variable.
Regression establishes a functional relationship between the two variables, and hence one can estimate the value of one variable corresponding to a given value of the other variable.
(b) With the help of correlation analysis, one cannot study which variable is the cause and which is the effect. For example, a high degree of positive correlation between price and supply does not indicate whether supply is the effect of price or price is the effect of supply.
Regression analysis, in contrast to correlation, determines the cause-and-effect relationship between x and y; that is, a change in the value of the independent variable x causes a corresponding change in the value of the dependent variable y, if all other factors that affect y remain unchanged.
(c) The correlation coefficient between two variables x and y is always symmetric, i.e., r_XY = r_YX, but the regression coefficient is not symmetric in general, i.e., b_XY ≠ b_YX.
(d) The correlation coefficient is independent of change of both origin and scale; the regression coefficients are independent of change of origin only, but not of scale.
10.8 REGRESSION LINES
A regression line is the line from which one can get the best estimated value of the dependent variable corresponding to a given value of the independent variable. Thus a regression line is the line of best fit. The term best fit is interpreted in accordance with the Principle of Least Squares, which consists in minimising the sum of the squares of the residuals or errors of estimates.
In the case of two variables x and y, we usually have two regression lines, because each variable may be treated as the dependent as well as the independent variable. For example, consider two variables, price (P) and supply (S). We know that, other conditions remaining the same, if the price of a commodity increases (decreases), the supply of the commodity also increases (decreases); in this case P is the independent variable and S is the dependent variable. Also, other conditions remaining the same, when the supply of a commodity increases (decreases), its price decreases (increases); in this case supply is the independent variable and price is the dependent variable. Thus for two variables x and y we have two regression lines. It is to be noted that
(i) when r_XY = ±1, i.e., when there exists either perfect positive or perfect negative correlation between x and y, both the lines of regression coincide;
(ii) when r_XY = 0, i.e., when x and y are uncorrelated, the two lines of regression become perpendicular to each other.
Thus when we consider the variable x as the independent variable and y as the dependent variable, we get the regression equation of y on x; similarly, in the case of the regression equation of x on y, we have y as the independent variable and x as the dependent variable. Sometimes, of course, from two correlated variables it is not possible to obtain both the regression lines. For example, if the variable x denotes the amount of rainfall in some years and the variable y denotes the production of paddy in those years, then obviously x can be considered only as the independent variable and y only as the dependent variable.
The regression line of y on x is that line which gives the best estimated value of y corresponding to a given value of x.
The regression line of x on y is that line which gives the best estimated value of x corresponding to a given value of y.
10.8.1 DETERMINATION OF THE REGRESSION LINE OF y ON x
Let (x₁, y₁), (x₂, y₂), ........., (x_n, y_n) be n pairs of observations on the two variables x and y under study. Let
y = a + bx ….....…….(10.12)
be the line of regression (best fit) of y on x.
For any given point Pᵢ(xᵢ, yᵢ) in the scatter diagram, the error of estimate or residual given by the line of best fit (10.12) is PᵢHᵢ. Now, the x coordinates of Hᵢ and Pᵢ are the same, viz. xᵢ, and since Hᵢ lies on the line (10.12), the y coordinate of Hᵢ is a + bxᵢ. Hence the error of estimate for Pᵢ is given by
PᵢHᵢ = PᵢM − HᵢM = yᵢ − (a + bxᵢ)
which is the error for the ith point. We will have such errors for all the points in the scatter diagram. For the points which lie above the line the error is positive, and for the points which lie below the line the error is negative.
By applying the method of least squares, the unknown constants a and b in (10.12) need to be determined in such a manner that the sum of the squares of the errors of estimates is minimum. In other words, we have to minimise
E = Σᵢ₌₁ⁿ (PᵢHᵢ)² = Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)²
subject to variations in a and b.
E may also be expressed as
E = Σ(y − y_e)² = Σ(y − a − bx)² .............................(10.13)
where y_e is the estimated value of y as given by (10.12) for a given value of x, the summation (Σ) being taken over the n pairs of observations.
Using the principle of maxima and minima in differential calculus, E will have an optimum (maximum or minimum) for variations in a and b if its partial derivatives with respect to a and b vanish separately. Hence, from (10.13) we get
∂E/∂a = 0 ⟹ −2 Σ(y − a − bx) = 0
∂E/∂b = 0 ⟹ −2 Σ x(y − a − bx) = 0
On simplifying, we have
Σy = na + bΣx …………………(10.14)
and Σxy = aΣx + bΣx² …………….....(10.15)
Equations (10.14) and (10.15) are called the normal equations. From the given values of x and y we calculate Σx, Σy, Σx² and Σxy. Putting these values in (10.14) and (10.15) and solving these two equations simultaneously for a and b, we get
b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] ................(10.16)
and a = ȳ − b x̄
Now, putting these values in equation (10.12), we get the regression equation of y on x, which is given by
y − ȳ = b(x − x̄)
where b is called the regression coefficient of y on x and is generally denoted by the symbol b_yx. Thus, writing b_yx for b in the above equation, we have
y − ȳ = b_yx (x − x̄) ......................(10.17)
The regression line of y on x given by (10.17) is to be used to estimate the most probable or average value of y for any given value of x.
10.8.2 DETERMINATION OF THE REGRESSION LINE OF x ON y
Let the line of regression of x on y to be estimated for the given data be
x = a + by ..........................(10.18)
Applying the least squares method in a similar manner as discussed in the case of the regression line of y on x, we have the following two normal equations for determining the values of a and b:
Σx = na + bΣy ………………….(10.19)
Σxy = aΣy + bΣy² …………………..(10.20)
By solving these equations simultaneously for a and b and putting these values in equation (10.18), we obtain the regression line of x on y, given by
x − x̄ = b_xy (y − ȳ) ………………(10.21)
where
b_xy = [nΣxy − (Σx)(Σy)] / [nΣy² − (Σy)²] ……….(10.22)
Equation (10.21) is the regression line of x on y, where b_xy, given by equation (10.22), is called the regression coefficient of x on y.
The regression line of y on x given by (10.17) is to be used to estimate the most probable or average value of y for any given value of x. The regression line of x on y given by (10.21) is to be used to estimate the most probable or average value of x for any given value of y.
Remark: (i) When there exists either perfect positive or perfect negative correlation between x and y, i.e., r = ±1, we have b_yx = ±σ_y/σ_x, and consequently the regression equation of y on x becomes:
(y − ȳ)/σ_y = ±(x − x̄)/σ_x ……………………………..(*)
On the other hand, in such a situation the regression equation of x on y becomes:
(x − x̄)/σ_x = ±(y − ȳ)/σ_y ……………… …………..(**)
From equations (*) and (**) we conclude that we have the same line. Thus, if there exists either perfect positive or perfect negative correlation between the two variables, i.e., when r = ±1, the two regression lines coincide.
(ii) The two lines of regression pass through the common point (x̄, ȳ), since this point satisfies both the regression equations.
10.8.3 REGRESSION COEFFICIENT
The regression coefficient of y on x, i.e., b_yx, gives the amount of increase (decrease) in y corresponding to a one unit increase (decrease) in x when b_yx is positive. On the other hand, a negative value of b_yx gives the amount of decrease (increase) in y corresponding to a unit increase (decrease) in x.
Similarly, if b_xy is positive, it gives the amount of increase (decrease) in x corresponding to a unit increase (decrease) in y. Again, a negative value of b_xy gives the amount of decrease (increase) in x corresponding to a unit increase (decrease) in y.
Other expressions for the regression coefficients:
The regression coefficient of y on x, i.e., b_yx, can also be expressed as
b_yx = Cov(x, y)/σ_x² …………………………(A)
and b_yx = r σ_y/σ_x … ....………………….(B)
where r is the correlation coefficient between x and y.
Again, we have
Cov(x, y) = (1/n) Σ(x − x̄)(y − ȳ) and σ_x² = (1/n) Σ(x − x̄)²
∴ (A) ⟹ b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
= [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] ………………………..(C)
We are left with the same equation as obtained in equation (10.16).
Similarly, the regression coefficient of x on y can also be expressed as below. We have
b_xy = Cov(x, y)/σ_y² ……..…………………(A′)
b_xy = r σ_x/σ_y ………………………..(B′)
= [nΣxy − (Σx)(Σy)] / [nΣy² − (Σy)²] ………………………(C′)
which is the same expression as obtained in equation (10.22).
Step deviation method for ungrouped data: When the actual mean values x̄ and ȳ are in fractions, the calculation of the regression coefficients can be simplified by taking deviations of the x and y values from their assumed means A and B respectively. Thus, when d_x = X − A and d_y = Y − B, where A and B are the assumed means of the x and y values, then
b_yx = [nΣd_x d_y − (Σd_x)(Σd_y)] / [nΣd_x² − (Σd_x)²] ..........(D)
b_xy = [nΣd_x d_y − (Σd_x)(Σd_y)] / [nΣd_y² − (Σd_y)²] ..........(D′)
Properties of Regression Coefficients:
Property 1: The correlation coefficient is the geometric mean of the regression coefficients.
Proof: We have the regression coefficient of y on x,
b_yx = r σ_y/σ_x
and similarly the regression coefficient of x on y,
b_xy = r σ_x/σ_y
Multiplying these two expressions, we get
b_yx b_xy = r (σ_y/σ_x) × r (σ_x/σ_y) = r²
∴ r = ±√(b_yx b_xy)
Thus the correlation coefficient r is the geometric mean of the two regression coefficients b_yx and b_xy.
Property 2: The correlation coefficient and the two regression coefficients are simultaneously positive or simultaneously negative.
Proof: We have
b_yx = r σ_y/σ_x and b_xy = r σ_x/σ_y
A standard deviation, being the positive square root of a squared quantity, can never be negative. Here we assume that σ_x ≠ 0 and σ_y ≠ 0, so that σ_x > 0 and σ_y > 0.
Therefore, we observe that when r is positive, both b_yx and b_xy are positive, and when r is negative, both b_yx and b_xy are negative.
Thus we can conclude that when both b_yx and b_xy are positive, then r = +√(b_yx b_xy), and when both b_yx and b_xy are negative, then r = −√(b_yx b_xy).
Property 3: The product of the two regression coefficients cannot exceed unity.
Proof: We have r = ±√(b_yx b_xy) and −1 ≤ r ≤ 1.
If the product of the two regression coefficients exceeded unity, then |r| = √(b_yx b_xy) would exceed unity, as the square root of a number greater than one is itself greater than one. In that case r would be greater than 1 if b_yx and b_xy are positive, and less than −1 if b_yx and b_xy are negative, which is impossible since −1 ≤ r ≤ 1. This indicates that the product of the two regression coefficients cannot exceed unity.
Property 4: The two regression coefficients are independent of the change of origin but are dependent on the change of scale.
Proof: The property states that if we change the origin of the variables, the values of the regression coefficients remain unchanged, but if we change their scale, their values get changed.
Let u and v be the new variables obtained by changing the origin and the scale of the original variables x and y as follows:
u = (x − a)/h, v = (y − b)/k ……………….(i)
where a, b, h and k (h, k > 0) are constants.
r_xy = r_uv ……………….(ii)
Due to the transformation (i), σ_x = h σ_u and σ_y = k σ_v.
Now,
b_yx = r_xy (σ_y/σ_x) = r_uv (k σ_v)/(h σ_u) = (k/h) b_vu
i.e., b_yx = (k/h) b_vu ………………(iii)
Similarly,
b_xy = r_xy (σ_x/σ_y) = r_uv (h σ_u)/(k σ_v) = (h/k) b_uv
i.e., b_xy = (h/k) b_uv ……………………..(iv)
Hence, from equations (iii) and (iv), we conclude that the two regression coefficients are independent of the change of origin but not of the change of scale.
Property 5: The regression coefficient is not symmetric, i.e., in general, b_xy ≠ b_yx.
Proof: We have
b_xy = r σ_x/σ_y and b_yx = r σ_y/σ_x
We observe that, in general, b_xy ≠ b_yx. The regression coefficients b_yx and b_xy and the correlation coefficient r become equal only when σ_x = σ_y, which usually does not occur.
CHECK YOUR PROGRESS
Q 6: State whether the following statements are true or false:
(i) Regression analysis is a statistical technique that expresses the functional relationship in the form of an equation.
(ii) The correlation coefficient is the geometric mean of the regression coefficients.
(iii) If one of the regression coefficients is greater than one, the other must also be greater than one.
(iv) The product of the regression coefficients is always more than one.
(v) If b_xy is negative, then b_yx is negative.
10.9 STANDARD ERROR OF AN ESTIMATE
The regression equations enable us to estimate the value of the dependent variable for any given value of the independent variable. The estimates so obtained are, however, not perfect. A measure of the precision of the estimates obtained from the regression equations is provided by the Standard Error (S.E.) of the estimate. While the standard deviation of the values of a variable measures the variation or scatteredness of the values about their arithmetic mean, the standard error of estimate measures the variation or scatteredness of the points of the scatter diagram about the regression line. The more closely the dots cluster around the regression line, the more representative the line is of the relationship between the two variables, and the better the estimates based on the equation of this line. If all the dots lie on the regression line, then there is no variation about the line, and as a result the correlation between the variables will be perfect.
Thus, the standard error (S.E.) of estimate of y for given x, denoted by S_yx, is defined by

S_yx = √[ Σ(y − y_c)² / (n − 2) ]

where y_c denotes the estimated values of y obtained from the regression line.
To simplify the calculation of S_yx, the following equivalent formula is used:

S_yx = √[ (Σy² − aΣy − bΣxy) / (n − 2) ]

where a and b are respectively the intercept and the slope of the regression line of y on x, which are determined by the method of least squares.
Similarly, the standard error (S.E.) of estimate of x for given y, denoted by S_xy, is defined by

S_xy = √[ Σ(x − x_c)² / (n − 2) ]

To simplify the calculation of S_xy, the following equivalent formula is used:

S_xy = √[ (Σx² − aΣx − bΣxy) / (n − 2) ]

where a and b are respectively the intercept and the slope of the regression line of x on y, which are determined by the method of least squares.
Again, much more convenient formulas for numerical computation are given by

S_yx = σ_y √(1 − r²) and S_xy = σ_x √(1 − r²)
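One caveat worth noting: the shortcut σ_y√(1 − r²) is exact when the residual sum of squares is divided by n; with the (n − 2) divisor used in the definitions above, the two values differ by the factor √(n/(n − 2)). The sketch below (NumPy, synthetic data, not from the text) verifies this relationship:

```python
# Relation between the two versions of S_yx: with divisor n the identity
# S_yx = sigma_y * sqrt(1 - r^2) is exact, so the (n - 2)-divisor value
# equals it times sqrt(n / (n - 2)). Data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 3, 50)
y = 1.5 * x + rng.normal(0, 2, 50)
n = len(x)

b = np.cov(x, y, bias=True)[0, 1] / np.var(x)    # slope of y on x
a = y.mean() - b * x.mean()                      # intercept
resid = y - (a + b * x)

s_n2 = np.sqrt((resid ** 2).sum() / (n - 2))     # definition with n - 2
shortcut = y.std() * np.sqrt(1 - np.corrcoef(x, y)[0, 1] ** 2)  # divisor n

assert np.isclose(s_n2, shortcut * np.sqrt(n / (n - 2)))
print("identity verified")
```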
Example 10.12: A company is introducing a job evaluation scheme in which all jobs are graded by points for skill, responsibility, and so on. Monthly pay scales (Rs. in 1000's) are then drawn up according to the number of points allocated and other factors such as experience and local conditions. To date the company has applied this scheme to 9 jobs:
(a) Find the least squares regression line linking pay scales to points.
(b) Estimate the monthly pay for a job graded by 20 points.

Solution: We consider monthly pay (Y) as the dependent variable and job grade points (X) as the independent variable. The least squares regression line linking pay scales to points, i.e., the line of regression of Y on X, is given by

Y − Ȳ = b_YX (X − X̄)

Now, we prepare the following table for calculation.
(a) Here, X̄ = ΣX/n = 137/9 = 15.22; Ȳ = ΣY/n = 48.15/9 = 5.35

Since the mean values X̄ and Ȳ are non-integers, the deviations are taken from assumed means, as done in the above table.
b_YX = [ nΣd_X d_Y − (Σd_X)(Σd_Y) ] / [ nΣd_X² − (Σd_X)² ] = 0.133

where d_X and d_Y are the deviations of X and Y from their assumed means (the required sums are obtained from the table above).
Substituting these values of X̄, Ȳ and b_YX in the regression line, we have

Y − 5.35 = 0.133(X − 15.22)
Y = 5.35 + 0.133X − 2.024
Y = 0.133X + 3.326
(b) For job grade point X = 20, the estimated average pay scale is given by

Y = 3.326 + 0.133X = 3.326 + 0.133(20) = 5.986

Hence, the likely monthly pay for a job with grade points 20 is Rs. 5,986 (pay scales being in Rs. 1000's).
Example 10.13: In the estimation of the regression equations of two variables X and Y the following results were obtained:

X̄ = 90, Ȳ = 70, n = 10, Σx² = 6360, Σy² = 2860, Σxy = 3900, where x = X − X̄ and y = Y − Ȳ

Obtain the two regression equations.
Solution: We have the line of regression of Y on X given by

Y − Ȳ = b_YX (X − X̄) ……………….(i)

where

b_YX = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = Σxy/Σx² = 3900/6360 = 0.6132

From (i) the required regression equation is

Y − 70 = 0.6132(X − 90)
Y = 0.6132X + 70 − 55.188
Y = 0.6132X + 14.812
Similarly, the line of regression of X on Y is given by,
X − X̄ = b_XY (Y − Ȳ) ……………….(ii)

where

b_XY = Σ(X − X̄)(Y − Ȳ) / Σ(Y − Ȳ)² = Σxy/Σy² = 3900/2860 = 1.3636

From (ii) the required regression equation is

X − 90 = 1.3636(Y − 70)
X = 1.3636Y + 90 − 95.452
X = 1.3636Y − 5.452
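The computation of Example 10.13 can be reproduced directly from the summary sums; a minimal sketch (the small rounding differences from the text come from using unrounded coefficients):

```python
# Reproducing Example 10.13 from its summary statistics: with
# x = X - Xbar and y = Y - Ybar, the regression coefficients are
# b_YX = sum(xy)/sum(x^2) and b_XY = sum(xy)/sum(y^2).
X_bar, Y_bar = 90, 70
sum_x2, sum_y2, sum_xy = 6360, 2860, 3900

b_yx = sum_xy / sum_x2    # 0.6132..., coefficient of Y on X
b_xy = sum_xy / sum_y2    # 1.3636..., coefficient of X on Y

# Y on X: Y = Y_bar + b_yx * (X - X_bar)
intercept_y = Y_bar - b_yx * X_bar    # about 14.81 (text: 14.812, rounded slope)
# X on Y: X = X_bar + b_xy * (Y - Y_bar)
intercept_x = X_bar - b_xy * Y_bar    # about -5.45 (text: -5.452, rounded slope)
print(round(b_yx, 4), round(intercept_y, 2), round(b_xy, 4), round(intercept_x, 2))
```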
Example 10.14: The two lines of regression are:

4x − 5y + 30 = 0 and 20x − 9y − 107 = 0

Which of these is the line of regression of x on y? Determine r and σ_y when σ_x = 3.
Solution: We are given the regression lines as

4x − 5y + 30 = 0 ……………….(i)
20x − 9y − 107 = 0 ……………….(ii)

In order to determine the line of regression of x on y, we apply the property of regression coefficients that b_yx · b_xy = r² ≤ 1. In the given problem, let (i) be the line of regression of x on y and (ii) be the line of regression of y on x.

From (i), x = (5/4)y − 30/4, so b_xy = 5/4

From (ii), y = (20/9)x − 107/9, so b_yx = 20/9

Now, r² = b_xy · b_yx = (5/4)(20/9) = 2.7778

But 0 ≤ r² ≤ 1, therefore our assumption is wrong.
Hence (i) is the line of regression of y on x and (ii) is the line of regression of x on y.

Taking (i) as the line of regression of y on x, we have

5y = 4x + 30, i.e., y = (4/5)x + 6

Regression coefficient of y on x = b_yx = 4/5

Taking (ii) as the line of regression of x on y, we have

20x = 9y + 107, i.e., x = (9/20)y + 107/20

Regression coefficient of x on y = b_xy = 9/20

r² = b_yx · b_xy = (4/5)(9/20) = 0.36

r = 0.6 (since both the regression coefficients are positive, r must be positive)

Again, we have b_xy = r (σ_x / σ_y)

9/20 = 0.6 × (3 / σ_y) (since σ_x = 3)

σ_y = 0.6 × 3 × (20/9) = 4
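The decision logic of Example 10.14 can be sketched with exact fractions: try each assignment of the two lines and keep the one whose product of regression coefficients does not exceed unity.

```python
# Example 10.14's logic with exact arithmetic via the fractions module.
from fractions import Fraction as F

# Line (i)  4x - 5y + 30 = 0:  as y on x, b_yx = 4/5;  as x on y, b_xy = 5/4.
# Line (ii) 20x - 9y - 107 = 0: as y on x, b_yx = 20/9; as x on y, b_xy = 9/20.

assert F(5, 4) * F(20, 9) > 1        # wrong assignment: r^2 = 25/9 > 1

r2 = F(4, 5) * F(9, 20)              # correct assignment
assert r2 == F(36, 100)              # r^2 = 0.36, hence r = 0.6

# b_xy = r * sigma_x / sigma_y, so sigma_y = r * sigma_x / b_xy with sigma_x = 3
sigma_y = F(3, 5) * 3 / F(9, 20)     # r = 3/5
assert sigma_y == 4
print("r = 0.6, sigma_y = 4")
```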
Example 10.15: The following data relate to advertising expenditure (Rs. in lakh) and the corresponding sales (Rs. in crore):

Advertising expenditure: 10 12 15 23 20
Sales : 14 17 23 25 21

(a) Find the equation of the least squares line fitting the data.
(b) Estimate the value of sales corresponding to an advertising expenditure of Rs. 30 lakh.
(c) Calculate the standard error of estimate of sales on advertising expenditure.
Solution: (a) Let the advertising expenditure be denoted by x and sales by y. Then we obtain the least squares line of y on x, which is of the form

y − ȳ = b_yx (x − x̄) ……………….(i)
where

b_yx = [ nΣd_x d_y − (Σd_x)(Σd_y) ] / [ nΣd_x² − (Σd_x)² ] ……………….(ii)

and d_x, d_y are the deviations of x and y from their assumed means.
Now we construct the following table for calculation.
(a) Therefore, from (i), the regression equation of y on x is

y = 8.608 + 0.712x

which is the required least squares line of sales on advertising expenditure.
(b) The least squares line obtained in part (a) may be applied to estimate the sales corresponding to an advertising expenditure of Rs. 30 lakh as:

y = 8.608 + 0.712(30) = 29.968

i.e., estimated sales of about Rs. 29.97 crore.
(c) The standard error of estimate of sales (y) on advertising expenditure (x), denoted by S_yx, is defined by

S_yx = √[ (Σy² − aΣy − bΣxy) / (n − 2) ] …………………….(iii)
Now we make the following table to calculate S_yx.

From (iii),

S_yx = √[ (2080 − 8.608(100) − 0.712(1684)) / (5 − 2) ]
     = √[ (2080 − 860.8 − 1199.0) / 3 ]
     = 2.594
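Example 10.15 can be verified end to end from the raw data; a minimal sketch (small differences from the text's figures come from carrying unrounded a and b through the calculation):

```python
# Reproducing Example 10.15 with the simplified formula
# S_yx = sqrt((sum(y^2) - a*sum(y) - b*sum(xy)) / (n - 2)).
x = [10, 12, 15, 23, 20]   # advertising expenditure (Rs. lakh)
y = [14, 17, 23, 25, 21]   # sales (Rs. crore)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope, ~0.712
a = sum_y / n - b * sum_x / n               # intercept, ~8.61 (text: 8.608)

est = a + b * 30                            # part (b): about 29.97 (Rs. crore)
s_yx = ((sum_y2 - a * sum_y - b * sum_xy) / (n - 2)) ** 0.5  # ~2.595 (text: 2.594)
print(round(b, 3), round(est, 2), round(s_yx, 3))
```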
CHECK YOUR PROGRESS
Q 7: State whether the following statements are true or false:
(i) Standard error of estimate is a measure of the scatter of the observations about the regression line.
(ii) The standard error of estimate of y on x, S_yx, is equal to σ_y(1 − r²).
(iii) The smaller the value of S_yx, the better the line fits the data.
10.10 LET US SUM UP
Correlation means the existence of a relationship between variables. When two variables deviate in the same direction we have positive correlation, and when they move in opposite directions we say that there exists negative correlation between the variables.

We have learnt various methods with the help of which we can ascertain the existence of a relationship between variables. These methods include: (i) the scatter diagram method, (ii) Karl Pearson's correlation coefficient method, and (iii) Spearman's rank correlation coefficient method.
Scatter diagram is a graphic tool to portray the relationship between variables. Karl Pearson's correlation coefficient measures the strength of the linear association between variables, with values near zero indicating a lack of linearity while values near −1 or +1 suggest linearity. Karl Pearson's coefficient of correlation is designated by r.

We have also discussed a very important concept in correlation and regression analysis termed the coefficient of determination. It is defined as the fraction of the variation in one variable that is explained by the variation in the other variable; in other words, it measures the proportion of variation in the dependent variable that can be attributed to the independent variable. It ranges from 0 to 1 and is the square of the coefficient of correlation. Thus a coefficient of 0.82 suggests that 82% of the variation in Y is accounted for by X.

This unit also focuses on the process of developing a model, known as a regression model, under regression analysis, which is used to predict the value of a dependent variable from at least one independent variable.
10.11 FURTHER READINGS
1) Srivastava, T.N., Rego, S. (2008). Statistics for Management. New Delhi. Tata McGraw Hill Education Private Limited.

2) Sharma, J.K. (2007). Business Statistics. New Delhi. Pearson Education Ltd.

3) Hazarika, P.L. (2016). Essential Statistics for Economics and Business Studies. New Delhi. Akansha Publishing House.

4) Lind, D.A., Marshal, W.G., Wathen, S.A. (2009). Statistical Techniques in Business and Economics. New Delhi. Tata McGraw Hill Education Private Limited.

5) Bajpai, N. (2014). Business Statistics. New Delhi. Pearson Education Ltd.
10.12 ANSWERS TO CHECK YOUR PROGRESS
Ans. to Q No 1: (i) False, (ii) False, (iii) True, (iv) True.
Ans. to Q No 2:

r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² Σ(Y − Ȳ)² ] = 43/√(72 × 32) = 43/48 = 0.9
Ans. to Q No 3: We have r = Cov(X, Y)/(σ_X σ_Y)

0.25 = 3.66/(6 × σ_Y) (since Var(X) = 36, therefore σ_X = 6)

σ_Y = 3.66/1.5 = 2.44

Ans. to Q No 4: Rank correlation coefficient is used in situations in which quantitative measures of certain qualitative factors such as judgment, leadership, colour, taste, etc. cannot be fixed, but individual observations can be arranged in a definite order.
Ans. to Q No 5: Coefficient of determination is a statistical measure of the proportion of the variation in the dependent variable that is explained by the independent variable.

A coefficient of determination r² = 0.49, or 49%, indicates that only 49% of the variation in the dependent variable y can be accounted for by the variable x. The remaining 51% of the variability may be due to other factors.
Ans. to Q No 6: (i) True, (ii) True, (iii) False, (iv) False, (v) True.
Ans. to Q No 7: (i) True, (ii) False, (iii) True.
10.13 MODEL QUESTIONS
1. What is correlation? Define positive, negative and zero correlation.
2. What is a scatter diagram? Discuss by means of suitable scatter diagrams the different types of correlation that may exist between the variables in bivariate data.
3. What is Karl Pearson's correlation coefficient? How would you interpret the value of a coefficient of correlation?
4. Distinguish between the coefficient of determination and the coefficient of correlation. How would you interpret the value of a coefficient of determination?
5. What is the rank correlation coefficient method? Bring out its usefulness.
6. Explain the concept of regression and point out its usefulness in dealing with business problems.
7. Write a short note on the probable error of the correlation coefficient.
8. What is linear regression? Why are there two regression lines? When do these become identical?
9. Show that r² = b_xy · b_yx.
10. Find the coefficient of correlation from the following data:
Cost : 39 65 62 90 82 75 25 98 36 78
Sales : 47 53 58 86 62 68 60 91 51 84
Also interpret the result.
11. From the following data, calculate Karl Pearson's correlation coefficient:

Σ(Xᵢ − X̄)² = 120, Σ(Yᵢ − Ȳ)² = 346, Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = 193.2, where each sum runs over i = 1 to 9.
12. Calculate the coefficient of correlation and its probable error from the following data:

X: 1 2 3 4 5 6 7 8 9 10
Y: 20 16 14 10 10 9 8 7 6 5

13. Two departmental managers ranked a few trainees according to their perceived abilities. The rankings are given below:
Calculate an appropriate correlation coefficient to measure the consistency in the rankings.

14. You are given below the following information about advertisement expenditure and sales:

The correlation coefficient is 0.8.
(a) Obtain both the regression equations.
(b) Find the likely sales when advertisement expenditure is Rs. 25 crore.
(c) What should be the advertisement budget if the company wants to attain a sales target of Rs. 150 crore?

15. A company believes that the number of salespersons employed is a good predictor of sales. The following table exhibits sales (in thousands of Rs.) and the number of salespersons employed for different years:

Obtain a simple regression model to predict sales based on the number of salespersons employed.

16. The HR manager of a multinational company wants to determine the relationship between experience and income of employees. The following data are collected from 14 randomly selected employees.

(a) Develop a regression model to predict income based on the years of experience.
(b) Calculate the coefficient of determination and interpret the result.
(c) Calculate the standard error of estimate.
(d) Predict the income of an employee who has 22 years of experience.
*** ***** ***