Statistic for Business
VIETNAM NATIONAL UNIVERSITY
INTERNATIONAL UNIVERSITY
-------o0o-------
GROUP PROJECT
SUBJECT: STATISTICS FOR BUSINESS
LECTURER: MR. TRƯƠNG BÁ HUY
GROUP: 04
MEMBERS' NAMES:
1. HỒ NGỌC TRÂN (BABAIU10146)
2. VÕ ĐẶNG LAN ANH (BABAIU10155)
3. LÝ THỊ MAI HƯƠNG (BABAIU10018)
4. VÕ NHƯ QUỲNH (BABAIU10116)
5. THẨM HỒNG VÂN (BABAIU10201)
QUESTION 1:
The question asks whether there is any indication of mean differences across the plants. We therefore use analysis of variance (ANOVA), the statistical method for determining whether differences exist among several population means. If ANOVA shows that the means differ, we will then conduct Tukey pairwise-comparison tests to determine which plants differ from which others, because the Tukey test lets us compare every pair of population means at a single level of significance.
We assume normally distributed populations, independent random samples, and equal population variances.
Ho : µ1 = µ2 = µ3 = µ4 = µ5
H1 : Not all µi (i = 1, 2, 3, 4, 5) are equal.
Using the computer and the ANOVA template, we have the following ANOVA table (α = 5%):
SOURCE     SS        df   MS       F        F critical   p-value
Between    26.007     4   6.5018   1.3834   2.5943       0.2561
Within     197.397   42   4.6999
Total      223.404   46
Based on the ANOVA table, the value of the test statistic is:
FT = F-ratio = 1.3834
At the level of significance α = 0.05, the critical value is:
Fc = F(0.05, 4, 42) = 2.5943
Since FT = 1.3834 < Fc = 2.5943 (equivalently, p-value = 0.2561 > 0.05), we cannot reject the null hypothesis Ho at the 0.05 level of significance. That is, based on the ANOVA table and the hypothesis test, we have no evidence of mean differences across the plants. Further analysis to determine which plants differ from which others is therefore not warranted, so it is not necessary to conduct the Tukey pairwise-comparison test.
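The arithmetic behind the ANOVA decision can be rechecked with a short sketch. It only rebuilds the mean squares and the F ratio from the sums of squares in the table; the critical value is copied from the table rather than computed, and the variable names are our own.

```python
# One-way ANOVA summary rebuilt from the sums of squares reported in the table.
ss_between, df_between = 26.007, 4
ss_within, df_within = 197.397, 42
f_critical = 2.5943  # F(0.05, 4, 42), copied from the report, not computed here

ms_between = ss_between / df_between   # mean square between plants
ms_within = ss_within / df_within      # mean square within plants
f_ratio = ms_between / ms_within       # test statistic
reject_h0 = f_ratio > f_critical       # decision at the 5% level

print(f"MS between = {ms_between:.4f}")
print(f"MS within  = {ms_within:.4f}")
print(f"F ratio    = {f_ratio:.4f}")
print("Reject Ho" if reject_h0 else "Cannot reject Ho")
```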
Y = Annual Sales ($ million); X1 = Number of Retail Outlets; X2 = Number of Automobiles Registered (millions); X3 = Personal Income ($ billions); X4 = Average Age of Automobiles (years); X5 = Number of Supervisors
Y        X1     X2      X3     X4    X5
37.702   1739    9.27   85.4   3.5    9
24.196   1221    5.86   60.7   5      5
32.055   1846    8.81   68.1   4.4    7
 3.611    120    3.81   20.2   4      5
17.625   1096   10.31   33.8   3.5    7
45.919   2290   11.62   95.1   4.1   13
29.6     1687    8.86   69.3   4.1   15
 8.114    241    6.28   16.3   5.9   11
20.116    649    7.77   34.9   5.5   16
12.994   1427   10.92   15.1   4.1   10
QUESTION 2 :
a. Find the data and test the independence among independent variables:
This is secondary data that we collected from a large automobile company in the US. A manager of that company supposes that there is a relationship between annual sales ($ million) and five independent variables: number of retail outlets, number of automobiles registered (millions), personal income ($ billions), average age of automobiles (years), and number of supervisors. He wants us to use multiple regression to test whether his hypothesis is correct.
Source: “Homework on Simulation and Regression”, link http://www.docstoc.com/docs/104769702/Homework-on-Simulation-and-Regression#
Using the computer and the multiple regression template, we have the following correlation matrix:
                                                  X1        X2        X3        X4        X5
Number of Retail Outlets, X1                      1.0000
Number of Automobiles Registered (millions), X2   0.7731    1.0000
Personal Income ($ billions), X3                  0.8249    0.4062    1.0000
Average Age of Automobiles (years), X4           -0.4894   -0.4452   -0.3495    1.0000
Number of Supervisors, X5                         0.1833    0.3895    0.1546    0.2907    1.0000
We see that the correlations for two pairs of variables, X2 - X1 (0.7731) and X3 - X1 (0.8249), are both larger than 0.5 (a threshold we chose arbitrarily). Hence these two pairs of variables are strongly positively correlated: when X1 increases, X2 and X3 also tend to increase. A correlation coefficient measures the strength of the linear association, not the proportion by which one variable changes.
We also notice that the average age of automobiles (X4) is negatively correlated with most of the other variables. A plausible reason is that when the average age of automobiles is higher, people are buying fewer new automobiles; therefore the number of retail outlets selling automobiles is lower (a negative relationship), and the same reasoning applies to the number of automobiles registered.
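The correlation matrix can be recomputed from the ten observations in the data table. The following is a minimal sketch, assuming the observations are as listed in that table; the helper name pearson is our own and is not part of the template used in the report.

```python
import math

# Ten observations per variable, taken from the report's data table.
data = {
    "X1": [1739, 1221, 1846, 120, 1096, 2290, 1687, 241, 649, 1427],
    "X2": [9.27, 5.86, 8.81, 3.81, 10.31, 11.62, 8.86, 6.28, 7.77, 10.92],
    "X3": [85.4, 60.7, 68.1, 20.2, 33.8, 95.1, 69.3, 16.3, 34.9, 15.1],
    "X4": [3.5, 5.0, 4.4, 4.0, 3.5, 4.1, 4.1, 5.9, 5.5, 4.1],
    "X5": [9, 5, 7, 5, 7, 13, 15, 11, 16, 10],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


names = list(data)
corr = {(a, b): pearson(data[a], data[b]) for a in names for b in names}

# Print the lower triangle, as in the report's matrix.
for i, a in enumerate(names):
    print(a, "  ".join(f"{corr[(a, b)]:7.4f}" for b in names[: i + 1]))
```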
b. Establish regression relationship, write down the regression equation.
The estimated regression relationship between annual sales and the five independent variables is:
Ŷ = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5
where Y is the annual sales ($ million), X1 is the Number of Retail Outlets, X2 is the Number of Automobiles Registered (millions), X3 is Personal Income ($ billions), X4 is Average Age of Automobiles (years), and X5 is Number of Supervisors.
Based on the manager's supposition, we have these hypotheses:
Ho: β1 = β2 = β3 = β4 = β5 = 0
H1: Not all βi (i = 1, 2, 3, 4, 5) are equal to 0
Using the computer and the multiple regression template, we have the following table.
ANOVA Table
Source   SS            df   MS            F            F critical    p-value
Regn.    1594.313012    5   318.8626024   148.637162   6.256056502   0.0001
Error    8.580965837    4   2.145241459
Total    1602.893978    9   178.0993308
We can see that R2 (0.9946) and the adjusted R2 (0.9880) are close to each other, which suggests that the regression model fits the data well. Also, the p-value (0.0001) is less than α = 1%, so we reject the null hypothesis Ho and accept H1. Hence at least one of the five independent variables (Number of Retail Outlets, Number of Automobiles Registered, Personal Income, Average Age of Automobiles, Number of Supervisors) is related to annual sales (Y).
In the table of coefficients that follows, X5 has the largest p-value (0.9028 > 5%), so this variable may not be related to Y. X5 may be dropped from the regression equation.
Variable                                          b              s(b)          t              p-value
Intercept                                         -19.5283       5.209359      -3.7487        0.0200
Number of Retail Outlets, X1                      -0.000584085   0.002532575   -0.230629055   0.8289
Number of Automobiles Registered (millions), X2   1.725068562    0.52701926    3.273255253    0.0307
Personal Income ($ billions), X3                  0.409445011    0.042267488   9.686996565    0.0006
Average Age of Automobiles (years), X4            2.0038068      0.847784475   2.363580437    0.0774
Number of Supervisors, X5                         -0.023384123   0.179784914   -0.130067213   0.9028
s = 1.464664, R2 = 0.9946, Adjusted R2 = 0.987955
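Each t statistic in the coefficient table is the coefficient divided by its standard error, and the variable with the smallest absolute t (largest p-value) is the first candidate to drop. A small sketch of that check follows; the coefficients and standard errors are rounded from the table, and the dictionary layout is our own.

```python
# (coefficient b, standard error s(b)) for the full five-variable model,
# rounded from the report's coefficient table.
coef = {
    "Intercept": (-19.5283, 5.209359),
    "X1": (-0.000584085, 0.002532575),
    "X2": (1.725068562, 0.52701926),
    "X3": (0.409445011, 0.042267488),
    "X4": (2.0038068, 0.847784475),
    "X5": (-0.023384123, 0.179784914),
}

# t statistic for Ho: beta_i = 0 is b / s(b).
t_stats = {name: b / se for name, (b, se) in coef.items()}

# The predictor with the smallest |t| has the largest p-value.
predictors = [name for name in coef if name != "Intercept"]
weakest = min(predictors, key=lambda name: abs(t_stats[name]))

for name, t in t_stats.items():
    print(f"{name:9s} t = {t:8.4f}")
print("First candidate to drop:", weakest)
```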
After dropping X5 from the regression equation,
we conduct another test for significance of
remaining parameters. We have results:
Variable                                          b         s(b)      t         p-value
Intercept                                         -19.176   3.98829   -4.8081   0.0048
Number of Retail Outlets, X1                      -0.0005   0.00208   -0.2171   0.8367
Number of Automobiles Registered (millions), X2   1.6825    0.3701    4.5456    0.0061
Personal Income ($ billions), X3                  0.4071    0.0344    11.836    0.0001
Average Age of Automobiles (years), X4            1.9414    0.6265    3.0987    0.0269
Once again, we perform ANOVA to test for the existence of a linear relationship between Y and any of the Xi (i = 1, 2, 3, 4):
Source   SS        df   MS       F        F critical   p-value
Regn.    1594.28    4   398.57   231.26   5.1922       0.0000
Error    8.61726    5   1.7235
Total    1602.89    9   178.1
R2 = 0.9946, Adjusted R2 = 0.9903, s = 1.3128
When the number of supervisors (X5) is dropped, R2 is essentially unchanged at 0.9946, while the adjusted R2 increases from 0.9880 to 0.9903. In addition, R2 and the adjusted R2 are close to each other in value. Thus this multiple regression model fits the data well.
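The R2, adjusted R2 and standard error figures follow directly from the ANOVA sums of squares, so the comparison between the five-variable and four-variable models can be rechecked as below. This is a sketch using the sums of squares from the two ANOVA tables; the function name fit_summary is our own.

```python
import math


def fit_summary(ss_regression, ss_error, n, k):
    """Return (R2, adjusted R2, s) for a regression with k predictors and n observations."""
    ss_total = ss_regression + ss_error
    r2 = ss_regression / ss_total
    adj_r2 = 1 - (ss_error / (n - k - 1)) / (ss_total / (n - 1))
    s = math.sqrt(ss_error / (n - k - 1))  # standard error of estimate
    return r2, adj_r2, s


# Sums of squares from the two ANOVA tables (n = 10 observations).
full = fit_summary(1594.313012, 8.580965837, n=10, k=5)   # X1..X5
reduced = fit_summary(1594.28, 8.61726, n=10, k=4)        # X5 dropped

for label, (r2, adj_r2, s) in (("X1..X5", full), ("X1..X4", reduced)):
    print(f"{label}: R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}, s = {s:.4f}")
```

Dropping a variable can never raise R2 itself; it is the adjusted R2, which penalizes the extra parameter, that improves here.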
In table 2 (the coefficient table for the model without X5), we see that the p-value of X1 (0.8367) is greater than 5%, so this variable may not be related to Y. X1 may be dropped from the regression equation.
After dropping X1, we test the significance of the remaining parameters again. The results are:
Variable                                          b         s(b)      t         p-value
Intercept                                         -18.947   3.52764   -5.371    0.0017
Number of Automobiles Registered (millions), X2   1.61618   0.19194   8.42032   0.0002
Personal Income ($ billions), X3                  0.4006    0.0152    26.331    0.0000
Average Age of Automobiles (years), X4            1.9632    0.5672    3.4609    0.0135
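As a cross-check on the three-variable model, the coefficients can be re-estimated by ordinary least squares from the ten observations in the data table. The sketch below solves the normal equations (X'X)b = X'y with a small Gaussian elimination routine of our own; it stands in for the spreadsheet template the report used and is not that template's method.

```python
# Observations taken from the report's data table.
y = [37.702, 24.196, 32.055, 3.611, 17.625, 45.919, 29.6, 8.114, 20.116, 12.994]
x2 = [9.27, 5.86, 8.81, 3.81, 10.31, 11.62, 8.86, 6.28, 7.77, 10.92]
x3 = [85.4, 60.7, 68.1, 20.2, 33.8, 95.1, 69.3, 16.3, 34.9, 15.1]
x4 = [3.5, 5.0, 4.4, 4.0, 3.5, 4.1, 4.1, 5.9, 5.5, 4.1]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Design matrix with a leading column of ones for the intercept.
rows = [[1.0, a, b, c] for a, b, c in zip(x2, x3, x4)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(4)]

b0, b2, b3, b4 = solve(xtx, xty)
sse = sum((yi - (b0 + b2 * r[1] + b3 * r[2] + b4 * r[3])) ** 2
          for r, yi in zip(rows, y))

print(f"Intercept = {b0:.3f}, b2 = {b2:.5f}, b3 = {b3:.4f}, b4 = {b4:.4f}")
print(f"SSE = {sse:.4f}")
```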
Once again, we perform ANOVA to test for the existence of a linear relationship between Y and any of the Xi (i = 2, 3, 4):
Source   SS        df   MS       F        F critical   p-value
Regn.    1594.2     3   531.4    366.54   4.7571       0.0000
Error    8.69853    6   1.4498
Total    1602.89    9   178.1
s = 1.2041, R2 = 0.9946, Adjusted R2 = 0.9919
When the number of retail outlets (X1) is dropped as well, R2 is again essentially unchanged at 0.9946, while the adjusted R2 increases from 0.9903 to 0.9919. In addition, R2 and the adjusted R2 are close to each other in value. Thus this multiple regression model fits the data well.
Regression Equation: Ŷ ($ million) = -18.947 + 1.61618X2 + 0.4006X3 + 1.9632X4
Use the regression equation to estimate a new value of the dependent variable
Let us predict annual sales when the number of automobiles registered is 8 million, personal income is $10 billion and the average age of automobiles is 6 years. Note that these values (8 million, $10 billion and 6 years) are hypothetical inputs chosen for illustration.
Ŷ = -18.947 + 1.61618X2 + 0.4006X3 + 1.9632X4
Ŷ = -18.947 + (1.61618)*8 + (0.4006)*10 + (1.9632)*6 = 9.768 ($ million)
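The prediction is a direct substitution into the fitted equation, which can be sketched as a small function. It uses the coefficients of the final regression equation (1.9632 for X4); the function name predict_sales is our own.

```python
def predict_sales(x2, x3, x4):
    """Predicted annual sales ($ million) from the final three-variable equation."""
    return -18.947 + 1.61618 * x2 + 0.4006 * x3 + 1.9632 * x4


# 8 million automobiles registered, $10 billion personal income, average age 6 years.
y_hat = predict_sales(x2=8, x3=10, x4=6)
print(f"Predicted annual sales: {y_hat:.3f} ($ million)")
```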
Notes: To complete the exercise, we read additional material from the book Statistical Techniques in Business and Economics, besides what we have learnt from our textbook Business Statistics. According to that book, instead of omitting all independent variables whose p-values are greater than 0.05 (5%) at once, we can omit them one at a time and re-check the others, to see whether the remaining variables are related to the dependent variable (Y). That is why, in this report, we omit X5 first and then X1, rather than omitting both of them at the same time.
QUESTION 3: Analyze case study 18 “The Nine Nations of North America” (Textbook page 684)
To determine whether the nine-nations segmentation is a useful alternative to the quadrants or the Census Bureau divisions of the country, we apply the chi-square test for independence to each segmentation and compare the results.
Ho: the distribution of values and the Nine Nations segmentation are independent.
H1: the distribution of values and the Nine Nations segmentation are not independent.
The computations of expected counts, test statistic and degrees of freedom are done using the Excel template. From those computed values, we can find the p-value.
At the 0.01 level of significance, p-value > α. We cannot reject H0. There is no statistical evidence that the distribution of values and the Nine Nations segmentation are not independent.
Ho: the distribution of values and the Quadrants segmentation are independent.
H1: the distribution of values and the Quadrants segmentation are not independent.
The computations of expected counts, test statistic and degrees of freedom are done using the Excel template. From those computed values, we can find the p-value.
We reject H0. There is statistical evidence that the distribution of values and the Quadrants segmentation are not independent.
Ho: the distribution of values and the Census Bureau segmentation are independent.
H1: the distribution of values and the Census Bureau segmentation are not independent.
The computations of expected counts, test statistic and degrees of freedom are done using the Excel template. From those computed values, we can find the p-value.
At the 0.01 level of significance, p-value < α. We can reject H0. There is statistical evidence that the distribution of values and the Census Bureau segmentation are not independent.
After analyzing the results presented in the exhibits, we conclude that the nine-nations segmentation is a useful alternative to the quadrants or the Census Bureau divisions of the country, because the chi-square test of independence gives a different result for it than for the other two. In the nine-nations case the test does not show that values depend on the segmentation, whereas for the quadrants and the Census Bureau divisions the values are dependent on the segmentation.
QUESTION 4 :
The chi-square goodness-of-fit test may be applied to testing any hypothesis about the distribution of a population.
H0 : The population has a normal distribution.
H1 : The population is not normally distributed.
We begin by defining boundaries with known probabilities for the standard normal random variable Z. We know that the probability that the value of Z will be between -1 and +1 is about 0.68. We also know that the probability that Z will be between -2 and +2 is about 0.95, and we know other such probabilities. We may use Appendix C, Table 2, to find more exact probabilities. Let us use the table and define several nonoverlapping intervals for Z with known probabilities. We will form intervals of about the same probability. The figure below shows one possible partition of the standard normal distribution into intervals and their probabilities, obtained from Table 2.
The partition was obtained as follows. We know that the area under the curve between 0 and 1 is 0.3413 (from Table 2). Looking for an area of about half that size, 0.1700, we find that the appropriate point is z = 0.44. A similar relationship exists on the negative side of the number line. Thus, using just the values 0.44 and 1 and their negatives, we get a complete partition of the Z scale into the six intervals: -∞ to -1, with associated probability of 0.1587; -1 to -0.44, with probability 0.1713; -0.44 to 0, with probability 0.1700; 0 to 0.44, with probability 0.1700; 0.44 to 1, with probability 0.1713; and, finally, 1 to ∞, with probability 0.1587. Breakdowns into other intervals may also be used.
[Figure: Intervals and Their Standard Normal Probabilities]
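The six interval probabilities can be reproduced without the printed normal table. The sketch below uses NormalDist from Python's standard statistics module in place of Appendix C, Table 2; the variable names are our own.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1
cuts = [-1.0, -0.44, 0.0, 0.44, 1.0]

# Cumulative probabilities at the cut points, padded with 0 and 1 for the tails.
cumulative = [0.0] + [z.cdf(c) for c in cuts] + [1.0]

# Probability of each of the six intervals between consecutive cut points.
probs = [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]

labels = ["-inf to -1", "-1 to -0.44", "-0.44 to 0",
          "0 to 0.44", "0.44 to 1", "1 to inf"]
for label, p in zip(labels, probs):
    print(f"{label:12s} {p:.4f}")
```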
Now we transform the Z scale values to interval boundaries for the original problem. Taking x̄ and s as if they were the mean and the standard deviation of the population, we use the transformation X = µ + σZ with x̄ = 945.38 and s = 11.156 substituted for the unknown parameters. The Z value boundaries we just obtained are substituted into the transformation, giving us the following cell boundaries:
x1 = 945.38 + (-1)(11.156) = 934.224
x2 = 945.38 + (-0.44)(11.156) = 940.47136
x3 = 945.38 + (0)(11.156) = 945.38
x4 = 945.38 + (0.44)(11.156) = 950.28864
x5 = 945.38 + (1)(11.156) = 956.536
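The same transformation X = x̄ + sZ can be applied to all five Z boundaries at once, as in this short sketch (variable names are our own):

```python
sample_mean = 945.38   # x-bar, used in place of the unknown population mean
sample_sd = 11.156     # s, used in place of the unknown population standard deviation
z_cuts = [-1, -0.44, 0, 0.44, 1]

# Cell boundaries on the original scale.
boundaries = [sample_mean + z * sample_sd for z in z_cuts]

for z, x in zip(z_cuts, boundaries):
    print(f"z = {z:5.2f}  ->  x = {x:.5f}")
```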
Cells and Their Expected Counts
Cell               Expected count
0-934.2            15.87
934.22-940.5       17.13
940.47-945.4       17
945.38-950.3       17
950.29-956.5       17.13
956.54 and above   15.87
The cells and their expected counts are given in the above table. Cell boundaries are rounded to the nearest hundredth. Recall that the expected count in each cell is equal to the cell probability times the sample size, Ei = npi. In this question, the pi are obtained from the normal table and are, in order, 0.1587, 0.1713, 0.1700, 0.1700, 0.1713, and 0.1587. Multiplying these probabilities by n = 100 gives us the expected counts. Note that all expected cell counts are above 5, and, therefore, the chi-square distribution is an adequate approximation to the distribution of the test statistic X² = Σ (Oi - Ei)² / Ei under the null hypothesis.
Observed Cell Counts
Cell               Observed count
0-934.2            14
934.22-940.5       16
940.47-945.4       21
945.38-950.3       16
950.29-956.5       15
956.54 and above   18
The table gives the observed counts of the scores falling in each of the cells. The table was obtained by the analyst by looking at each data point in the sample and classifying it into one of the chosen categories.
Cell               i   Oi   Ei      Oi-Ei   (Oi-Ei)²   (Oi-Ei)²/Ei
0-934.2            1   14   15.87   -1.87   3.5        0.22
934.22-940.5       2   16   17.13   -1.13   1.28       0.08
940.47-945.4       3   21   17.00    4      16         0.94
945.38-950.3       4   16   17.00   -1      1          0.06
950.29-956.5       5   15   17.13   -2.13   4.5        0.26
956.54 and above   6   18   15.87    2.13   4.5        0.28
Total                                                  1.84
To facilitate the computation of the chi-square statistic, we arrange the observed and expected cell counts in a single table and show the computations necessary for obtaining the value of the test statistic. This has been done in the above table. The sum of all the entries in the last column of the table is the value of the chi-square statistic. Because the mean and the standard deviation were estimated from the sample, the appropriate distribution has k - 3 = 6 - 3 = 3 degrees of freedom. We now consult the chi-square table, Appendix C, Table 4, and we find that the computed statistic value X² = 1.84 falls in the nonrejection region for any level of α in the table. At the 0.05 level of significance, we have p-value = 0.6051 > α = 0.05, so we cannot reject H0. There is therefore no statistical evidence that the population is not normally distributed.
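The goodness-of-fit computation can be replayed in a few lines. This sketch recomputes the statistic from the observed counts and cell probabilities, and obtains the p-value from the closed-form upper-tail probability of a chi-square variable with exactly 3 degrees of freedom, erfc(sqrt(x/2)) + sqrt(2x/π)·exp(-x/2); that formula is our substitute for the printed chi-square table and does not apply to other degrees of freedom.

```python
import math

observed = [14, 16, 21, 16, 15, 18]
cell_probs = [0.1587, 0.1713, 0.1700, 0.1700, 0.1713, 0.1587]

n = sum(observed)                          # sample size, 100
expected = [n * p for p in cell_probs]     # Ei = n * pi
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# k cells, minus 1, minus 2 parameters (mean and standard deviation) estimated.
df = len(observed) - 1 - 2


def chi2_upper_tail_df3(x):
    """P(chi-square with 3 degrees of freedom > x); valid for df = 3 only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_upper_tail_df3(chi2)
print(f"chi-square = {chi2:.4f}, df = {df}, p-value = {p_value:.4f}")
```

Computed without intermediate rounding, the statistic comes out slightly above the tabulated 1.84, which is consistent with the reported p-value of 0.6051.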
QUESTION 5:
The Thorndikes have submitted a bid to be the sole supplier of swimming goggles for the U.S. Olympic team. OptiView, Inc. has been supplying the goggles for many years, and the Olympic committee has said it will switch to Thorndike only if the Thorndike goggles are found to be significantly better in a standard leakage test.
For purposes of fairness, the committee has purchased 16 examples from each manufacturer in the retail marketplace. Testing involves installing the goggles on a surface that simulates the face of a swimmer, then submitting them to increasing water pressure (expressed in meters of water depth) until the goggles leak. Both companies have received copies of the test results and have an opportunity to offer their respective comments before the final decision is made. Ted Thorndike has just received his company's copy of the results, rounded to the nearest meter of water depth.
Thorndike Goggles (meters):
82 117 91 95 110 81 101 108 106 114 106 95 101 92 94 108
OptiView Goggles (meters):
73 95 83 106 70 103 86 100 92 108 94 77 109 90 107 73
(The greater the number of meters before leakage, the better the quality of the goggles.)
1. Based on analysis of these data, formulate a commentary that Ted Thorndike might wish to make to the committee.
2. Based on analysis of these data, formulate a commentary that OptiView might wish to make to the committee.
3. What would be your recommendation to the committee?
1. To determine whether the Thorndike goggles have better quality than the OptiView goggles, we use the template for testing the difference in means with the unequal-variances t-test. In the template, “Thorndike goggles” is sample 1 and “OptiView goggles” is sample 2.
Thus, the null and alternative hypotheses are:
H0: µ1 ≤ µ2 or µ1 - µ2 ≤ 0 (Thorndike goggles are not better than OptiView goggles)
H1: µ1 > µ2 or µ1 - µ2 > 0 (Thorndike goggles are better than OptiView goggles)
We conduct the hypothesis test using the template.
From the template printout, at the level of significance α = 0.05 the p-value for this test is 0.0296. Since the p-value is smaller than α, we reject the null hypothesis H0 and conclude that the Thorndike goggles are better than the OptiView goggles. The p-value means that, if the Thorndike goggles were in fact no better than the OptiView goggles, a sample difference this large in Thorndike's favor would occur with a probability of only about 2.96%. Based on these data, Ted Thorndike should ask the committee to switch to his company's product, because the data show that the Thorndike goggles are significantly better than the OptiView goggles.
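The unequal-variances (Welch) t-test can be reproduced from the two samples. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom, then approximates the one-tailed p-value by integrating the t density numerically with Simpson's rule; the numerical integration is our own stand-in for the spreadsheet template, whose exact rounding of the degrees of freedom is not known.

```python
import math
from statistics import mean, variance

thorndike = [82, 117, 91, 95, 110, 81, 101, 108, 106, 114, 106, 95, 101, 92, 94, 108]
optiview = [73, 95, 83, 106, 70, 103, 86, 100, 92, 108, 94, 77, 109, 90, 107, 73]


def welch(sample1, sample2):
    """Return (t, df) for the unequal-variances t-test of mean1 - mean2."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1) / n1, variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


t_stat, df = welch(thorndike, optiview)
p_value = upper_tail(t_stat, df)   # H1: Thorndike mean > OptiView mean
print(f"t = {t_stat:.4f}, df = {df:.2f}, one-tailed p-value = {p_value:.4f}")
```

Swapping the two samples changes the sign of t, so the one-tailed p-value for the reversed hypothesis is 1 minus this value.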
2. We conduct the hypothesis test again. This time, “OptiView goggles” is sample 1 and “Thorndike goggles” is sample 2 in the template.
Thus, the null and alternative hypotheses are:
H0: µ1 ≤ µ2 or µ1 - µ2 ≤ 0 (OptiView goggles are not better than Thorndike goggles)
H1: µ1 > µ2 or µ1 - µ2 > 0 (OptiView goggles are better than Thorndike goggles)
We conduct the hypothesis test using the template.
From the template printout, at the level of significance α = 0.05 the p-value for this test is 0.9704, which is larger than α. Thus we cannot reject the null hypothesis H0: there is no evidence that the OptiView goggles are better than the Thorndike goggles. Failing to reject H0 does not by itself prove that the OptiView goggles are worse, but this p-value is the complement of the one in part 1 (1 - 0.9704 = 0.0296), so the data favor Thorndike. Based on these data, OptiView must show the committee that it will improve its product as soon as possible, and persuade the committee to continue using its product.
3. After using the data for the two hypothesis tests, we have enough evidence, at the 0.05 level of significance (p-value = 0.0296), to conclude that the Thorndike goggles are better than the OptiView goggles, and no evidence that the OptiView goggles are better than the Thorndike goggles (p-value = 0.9704). There is therefore little reason to continue using the OptiView goggles, which show lower quality in the leakage test. So, based on the results of the quality test, we recommend that the committee switch from OptiView goggles to Thorndike goggles.