Statistics for Business

VIETNAM NATIONAL UNIVERSITY

INTERNATIONAL UNIVERSITY

-------o0o-------

GROUP PROJECT

SUBJECT: STATISTICS FOR BUSINESS

LECTURER: MR. TRƯƠNG BÁ HUY

GROUP: 04

MEMBERS’ NAME:

1. HỒ NGỌC TRÂN (BABAIU10146)

2. VÕ ĐẶNG LAN ANH (BABAIU10155)

3. LÝ THỊ MAI HƯƠNG (BABAIU10018)

4. VÕ NHƯ QUỲNH (BABAIU10116)

5. THẨM HỒNG VÂN (BABAIU10201)

QUESTION 1:

The question asks whether there is any indication of differences in the means across the plants. We therefore use analysis of variance (ANOVA), the statistical method for determining whether differences exist among several population means. If ANOVA shows that the means differ, we will then conduct the Tukey pairwise-comparisons test to determine which plants appear to differ from which others, because that test allows us to compare every pair of population means at a single level of significance.

We assume normally distributed populations, independent random samples, and equal population variances.

Ho : µ1 = µ2 = µ3 = µ4 = µ5

H1 : Not all µi (i = 1, 2, 3, 4, 5) are equal.

Using the computer and the ANOVA template, we obtain the following ANOVA table (α = 5%):

Source     SS        df   MS       F        F critical   p-value
Between    26.007     4   6.5018   1.3834   2.5943       0.2561
Within     197.397   42   4.6999
Total      223.404   46

From the ANOVA table, the value of the test statistic is:

FT = F-ratio = 1.3834

At the level of significance α = 0.05, the critical value is:

Fc = F(0.05, 4, 42) = 2.5943

Since FT = 1.3834 < Fc = 2.5943 (p-value = 0.2561 > 0.05), we cannot reject the null hypothesis Ho at the 0.05 level of significance. Based on the ANOVA table and the hypothesis test, we have no evidence of mean differences across the plants. A further analysis to determine which plants differ from which others is therefore not warranted, so it is not necessary to conduct the Tukey pairwise-comparisons test.
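The decision in Question 1 can be cross-checked by hand from the sums of squares and degrees of freedom in the ANOVA table. The short Python sketch below is only an illustration of that arithmetic; the variable names are ours, and the critical value is simply copied from the template output rather than computed.

```python
# One-way ANOVA check using the figures reported in the ANOVA table.
ss_between, df_between = 26.007, 4    # between-plants sum of squares and df
ss_within, df_within = 197.397, 42    # within-plants sum of squares and df
f_critical = 2.5943                   # F(0.05, 4, 42), copied from the template

ms_between = ss_between / df_between  # mean square between
ms_within = ss_within / df_within     # mean square within
f_ratio = ms_between / ms_within      # test statistic FT

# Reject Ho only if the test statistic exceeds the critical value.
reject_h0 = f_ratio > f_critical
print(f"F = {f_ratio:.4f}, critical = {f_critical}, reject Ho: {reject_h0}")
```

Because the F-ratio is well below the critical value, the sketch reaches the same "cannot reject Ho" decision as the report.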

Data for Question 2:

Y = Annual Sales ($ million); X1 = Number of Retail Outlets; X2 = Number of Automobiles Registered (millions); X3 = Personal Income ($ billions); X4 = Average Age of Automobiles (years); X5 = Number of Supervisors

Y        X1     X2      X3     X4    X5
37.702   1739    9.27   85.4   3.5    9
24.196   1221    5.86   60.7   5      5
32.055   1846    8.81   68.1   4.4    7
3.611     120    3.81   20.2   4      5
17.625   1096   10.31   33.8   3.5    7
45.919   2290   11.62   95.1   4.1   13
29.6     1687    8.86   69.3   4.1   15
8.114     241    6.28   16.3   5.9   11
20.116    649    7.77   34.9   5.5   16
12.994   1427   10.92   15.1   4.1   10

QUESTION 2 :

a. Find the data and test the independence among independent variables:

These are secondary data collected from a large automobile company in the US. A manager of that company believes that annual sales ($ million) are related to five independent variables: number of retail outlets, number of automobiles registered (millions), personal income ($ billions), average age of automobiles (years), and number of supervisors. He wants us to use multiple regression to test whether his hypothesis is correct.

Source: “Homework on Simulation and Regression”, link http://www.docstoc.com/docs/104769702/Homework-on-Simulation-and-Regression#

Using the computer and the multiple regression template, we obtain the following correlation matrix.

                                                  X1        X2        X3        X4        X5
Number of Retail Outlets, X1                      1.0000
Number of Automobiles Registered (millions), X2   0.7731    1.0000
Personal Income ($ billions), X3                  0.8249    0.4062    1.0000
Average Age of Automobiles (years), X4           -0.4894   -0.4452   -0.3495    1.0000
Number of Supervisors, X5                         0.1833    0.3895    0.1546    0.2907    1.0000
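For readers without the spreadsheet template, the correlation matrix can be recomputed directly from the ten observations. The sketch below is our own illustration in plain Python: the rows are transcribed from the report's data table, and the helper name pearson is an assumption of ours, not part of the template.

```python
from math import sqrt

# Rows transcribed from the report's data table: (Y, X1, X2, X3, X4, X5).
DATA = [
    (37.702, 1739, 9.27, 85.4, 3.5, 9),
    (24.196, 1221, 5.86, 60.7, 5.0, 5),
    (32.055, 1846, 8.81, 68.1, 4.4, 7),
    (3.611, 120, 3.81, 20.2, 4.0, 5),
    (17.625, 1096, 10.31, 33.8, 3.5, 7),
    (45.919, 2290, 11.62, 95.1, 4.1, 13),
    (29.6, 1687, 8.86, 69.3, 4.1, 15),
    (8.114, 241, 6.28, 16.3, 5.9, 11),
    (20.116, 649, 7.77, 34.9, 5.5, 16),
    (12.994, 1427, 10.92, 15.1, 4.1, 10),
]
NAMES = ["X1", "X2", "X3", "X4", "X5"]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Transpose the rows into columns and drop Y (the first column).
columns = list(zip(*DATA))[1:]
corr = {
    (a, b): pearson(col_a, col_b)
    for a, col_a in zip(NAMES, columns)
    for b, col_b in zip(NAMES, columns)
}

# Print the lower triangle, as in the report's matrix.
for i, a in enumerate(NAMES):
    print(a, "  ".join(f"{corr[(a, b)]:7.4f}" for b in NAMES[: i + 1]))
```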

The correlations for two pairs of variables, X2 - X1 (0.7731) and X3 - X1 (0.8249), are both larger than 0.5, a cutoff we chose arbitrarily. These two pairs are therefore positively correlated: when X1 increases, X2 and X3 also tend to increase. The coefficients measure the strength of the linear association, not the proportion by which one variable changes with the other.

We also see that the average age of automobiles (X4) is negatively correlated with X1, X2, and X3. A plausible reason is that when the average age of automobiles increases, people are buying fewer new automobiles; the number of retail outlets selling automobiles therefore tends to be lower (a negative relationship), and the same reasoning applies to the number of automobiles registered.

b. Establish the regression relationship and write down the regression equation.

The estimated regression relationship between annual sales and the five independent variables is:

Ŷ = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5

where Y is the annual sales ($ million), X1 is the Number of Retail Outlets, X2 is the Number of Automobiles Registered (millions), X3 is Personal Income ($ billions), X4 is the Average Age of Automobiles (years), and X5 is the Number of Supervisors. Based on the manager's belief, we test the following hypotheses:

Ho: β1 = β2 = β3 = β4 = β5 = 0

H1: Not all βi (i = 1, 2, 3, 4, 5) are equal to 0

Using the computer and the multiple regression template, we obtain the following table.

ANOVA Table

Source   SS            df   MS            F            F critical    p-value
Regn.    1594.313012   5    318.8626024   148.637162   6.256056502   0.0001
Error    8.580965837   4    2.145241459
Total    1602.893978   9    178.0993308

R2 (0.9946) and adjusted R2 (0.9880) are close to each other, which indicates that the regression model fits these data well. The p-value (0.0001) is also less than α = 1%, so we reject the null hypothesis. Hence, at least one of the five independent variables (Number of Retail Outlets, Number of Automobiles Registered, Personal Income, Average Age of Automobiles, Number of Supervisors) is related to annual sales (Y).

Reject Ho, accept H1.
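The overall F statistic and R2 follow directly from the sums of squares in the regression ANOVA table. The sketch below is only a hand check of that arithmetic in Python; the variable names are ours, and the critical value is copied from the template output.

```python
# Overall significance of the five-variable regression, from the ANOVA table.
ss_regression = 1594.313012
ss_error = 8.580965837
n_obs, k_vars = 10, 5               # 10 observations, 5 independent variables
f_critical = 6.256056502            # copied from the template output

ms_regression = ss_regression / k_vars
ms_error = ss_error / (n_obs - k_vars - 1)
f_ratio = ms_regression / ms_error  # F statistic for Ho: all slopes are zero

ss_total = ss_regression + ss_error
r_squared = ss_regression / ss_total  # share of variation in Y explained

print(f"F = {f_ratio:.3f}, R2 = {r_squared:.4f}, reject Ho: {f_ratio > f_critical}")
```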

Among the individual coefficients, X5 has the largest p-value (0.9028 > 5%), so this variable may not be related to Y. X5 may be dropped from the regression equation.

X1 = Number of Retail Outlets; X2 = Number of Automobiles Registered (millions); X3 = Personal Income ($ billions); X4 = Average Age of Automobiles (years); X5 = Number of Supervisors

          Intercept   X1             X2            X3            X4            X5
b         -19.5283    -0.000584085   1.725068562   0.409445011   2.0038068     -0.023384123
s(b)      5.209359    0.002532575    0.52701926    0.042267488   0.847784475   0.179784914
t         -3.7487     -0.230629055   3.273255253   9.686996565   2.363580437   -0.130067213
p-value   0.0200      0.8289         0.0307        0.0006        0.0774        0.9028

s = 1.464664, R2 = 0.9946, Adjusted R2 = 0.987955
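Each t value in the coefficient table is the estimate divided by its standard error, and the variable dropped first is the one with the largest p-value. The sketch below illustrates that selection step in Python; the dictionary layout and names are our own, and the p-values are copied from the template output rather than recomputed.

```python
# Coefficient estimates, standard errors, and p-values copied from the table.
coefficients = {
    "X1": (-0.000584085, 0.002532575, 0.8289),
    "X2": (1.725068562, 0.52701926, 0.0307),
    "X3": (0.409445011, 0.042267488, 0.0006),
    "X4": (2.0038068, 0.847784475, 0.0774),
    "X5": (-0.023384123, 0.179784914, 0.9028),
}
alpha = 0.05

# t statistic for Ho: beta_i = 0 is the estimate divided by its standard error.
t_stats = {name: b / sb for name, (b, sb, _) in coefficients.items()}

# Candidate for removal: the variable with the largest p-value, if above alpha.
weakest = max(coefficients, key=lambda name: coefficients[name][2])
drop_candidate = weakest if coefficients[weakest][2] > alpha else None

for name, t in t_stats.items():
    print(f"{name}: t = {t:8.4f}, p-value = {coefficients[name][2]:.4f}")
print("Drop first:", drop_candidate)
```

Dropping one variable at a time matters because the remaining p-values change once the model is refitted.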

After dropping X5 from the regression equation, we test the significance of the remaining parameters again. The results are:

          Intercept   X1        X2       X3       X4
b         -19.176     -0.0005   1.6825   0.4071   1.9414
s(b)      3.98829     0.00208   0.3701   0.0344   0.6265
t         -4.8081     -0.2171   4.5456   11.836   3.0987
p-value   0.0048      0.8367    0.0061   0.0001   0.0269

Once again, we perform ANOVA to test for the existence of a linear relationship between Y and any of the Xi (i = 1, 2, 3, 4).

Source   SS        df   MS       F        F critical   p-value
Regn.    1594.28   4    398.57   231.26   5.1922       0.0000
Error    8.61726   5    1.7235
Total    1602.89   9    178.1

R2 = 0.9946, Adjusted R2 = 0.9903, s = 1.3128

R2 remains 0.9946 and adjusted R2 rises from 0.9880 to 0.9903 when the number of supervisors (X5) is dropped, so removing X5 costs essentially no explanatory power. In addition, R2 and adjusted R2 are close to each other in value. Thus, this multiple regression model fits the data well.

In the second coefficient table, the p-value of X1 is 0.8367 > 5%, so this variable may not be related to Y. X1 may be dropped from the regression equation.

After dropping X1, we test the significance of the remaining parameters again. The results are:

          Intercept   X2        X3       X4
b         -18.947     1.61618   0.4006   1.9632
s(b)      3.52764     0.19194   0.0152   0.5672
t         -5.371      8.42032   26.331   3.4609
p-value   0.0017      0.0002    0.0000   0.0135
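The final three-variable fit can be reproduced from the ten observations with ordinary least squares. The sketch below is our own plain-Python illustration, not the template's method: it builds the normal equations and solves them by Gaussian elimination, and the helper name solve_linear_system is an assumption of ours. The rows are transcribed from the report's data table.

```python
# Rows transcribed from the data table: (Y, X2, X3, X4).
ROWS = [
    (37.702, 9.27, 85.4, 3.5), (24.196, 5.86, 60.7, 5.0),
    (32.055, 8.81, 68.1, 4.4), (3.611, 3.81, 20.2, 4.0),
    (17.625, 10.31, 33.8, 3.5), (45.919, 11.62, 95.1, 4.1),
    (29.6, 8.86, 69.3, 4.1), (8.114, 6.28, 16.3, 5.9),
    (20.116, 7.77, 34.9, 5.5), (12.994, 10.92, 15.1, 4.1),
]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


# Design matrix with a leading 1 for the intercept.
design = [[1.0, x2, x3, x4] for _, x2, x3, x4 in ROWS]
y = [row[0] for row in ROWS]
p = len(design[0])

# Normal equations: (X'X) b = X'y.
xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
coef = solve_linear_system(xtx, xty)

# Error sum of squares of the fitted model.
sse = sum((yi - sum(b * v for b, v in zip(coef, r))) ** 2 for r, yi in zip(design, y))
print("intercept, b2, b3, b4 =", [round(b, 4) for b in coef])
print("SSE =", round(sse, 4))
```

Solving the normal equations directly is adequate for a problem this small; for larger or ill-conditioned data a QR-based routine would be the safer choice.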

Once again, we perform ANOVA to test for the existence of a linear relationship between Y and any of the Xi (i = 2, 3, 4).

Source   SS        df   MS       F        F critical   p-value
Regn.    1594.2    3    531.4    366.54   4.7571       0.0000
Error    8.69853   6    1.4498
Total    1602.89   9    178.1
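The goodness-of-fit figures of the three successive models can be compared using only the error sums of squares from their ANOVA tables. The sketch below is our own illustration; the model labels and the function name adjusted_r2 are assumptions, while the numbers are copied from the tables in this report.

```python
N_OBS = 10
SS_TOTAL = 1602.893978  # total sum of squares, identical for all three models

# model label -> (number of independent variables, error sum of squares)
MODELS = {
    "X1-X5": (5, 8.580965837),
    "X1-X4": (4, 8.61726),
    "X2-X4": (3, 8.69853),
}


def r2(sse):
    """Coefficient of determination."""
    return 1.0 - sse / SS_TOTAL


def adjusted_r2(sse, k):
    """R2 penalised for the number of independent variables k."""
    return 1.0 - (sse / (N_OBS - k - 1)) / (SS_TOTAL / (N_OBS - 1))


results = {name: (r2(sse), adjusted_r2(sse, k)) for name, (k, sse) in MODELS.items()}
for name, (plain, adjusted) in results.items():
    print(f"{name}: R2 = {plain:.4f}, adjusted R2 = {adjusted:.4f}")
```

R2 can never increase when a variable is removed, so it is the adjusted R2 that shows whether dropping X5 and then X1 improved the model.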

R2 remains 0.9946 and adjusted R2 rises from 0.9903 to 0.9919 when the number of retail outlets (X1) is dropped. In addition, R2 and adjusted R2 are close to each other in value. Thus, this multiple regression model fits the data well.

s = 1.2041, R2 = 0.9946, Adjusted R2 = 0.9919

Regression equation: Ŷ ($ million) = -18.947 + 1.61618X2 + 0.4006X3 + 1.9632X4

c. Use the regression equation to estimate a new value of the dependent variable.

Let us predict annual sales when the number of automobiles registered is 8 million, personal income is $10 billion, and the average age of automobiles is 6 years. These input values (8 million, $10 billion, and 6 years) are hypothetical and chosen only for illustration.

Ŷ = -18.947 + 1.61618X2 + 0.4006X3 + 1.9632X4

Ŷ = -18.947 + (1.61618)*8 + (0.4006)*10 + (1.9632)*6 = 9.768 ($ million)
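The prediction is a direct substitution into the fitted equation, which the short Python sketch below spells out. The function name predict_sales is our own; the coefficients are those of the final three-variable model in this report (with 1.9632 as the coefficient of X4).

```python
# Coefficients of the final model: intercept, X2, X3, X4.
B0, B2, B3, B4 = -18.947, 1.61618, 0.4006, 1.9632


def predict_sales(autos_registered, personal_income, average_age):
    """Predicted annual sales ($ million) from the fitted regression equation.

    autos_registered is in millions, personal_income in $ billions,
    average_age in years.
    """
    return B0 + B2 * autos_registered + B3 * personal_income + B4 * average_age


predicted = predict_sales(8, 10, 6)
print(f"Predicted annual sales: {predicted:.3f} ($ million)")
```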

Notes: To complete the exercise, we read additional material from the book Statistical Techniques in Business and Economics, besides what we learned from our textbook Business Statistics. According to that book, instead of omitting all independent variables whose p-values are greater than 0.05 (5%) at once, we can omit them one at a time and re-check the others to see whether each remaining variable is related to the dependent variable (Y). That is why, in this report, we omit X5 first and then X1 rather than omitting both at the same time.

QUESTION 3: Analyze case study 18, “The Nine Nations of North America” (textbook page 684)

To determine whether the nine-nations segmentation is a useful alternative to the quadrants or the Census Bureau divisions of the country, we use the chi-square test for independence. We test each segmentation against the distribution of values and compare the results to decide whether the nine-nations segmentation can replace the quadrants or the Census Bureau divisions.

Ho: the distribution of values and the nine-nations segmentation are independent.

H1: the distribution of values and the nine-nations segmentation are not independent.

The expected counts, test statistic, and degrees of freedom are computed using the Excel template. From those computed values, we find the p-value.

At the 0.01 level of significance, p-value > α, so we cannot reject H0. There is no statistical evidence that the distribution of values depends on the nine-nations segmentation.
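The computation behind the template can be sketched as follows. Since the exhibit counts are not reproduced in this report, the contingency table in the example is hypothetical and serves only to show how the expected counts, the chi-square statistic, and the degrees of freedom are obtained; the function name is also our own.

```python
def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for a table.

    observed is a list of rows of counts; the expected count of a cell is
    (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    expected = [
        [r * c / grand_total for c in col_totals] for r in row_totals
    ]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Hypothetical 2 x 3 table (NOT the case-study data): rows could be two
# values, columns three regions.
example_table = [
    [30, 20, 10],
    [20, 30, 40],
]
chi2_stat, chi2_df, expected_counts = chi_square_independence(example_table)
print(f"chi-square = {chi2_stat:.4f} with {chi2_df} degrees of freedom")
```

The p-value is then the upper-tail area of the chi-square distribution with that many degrees of freedom, which the template looks up.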

Ho: the distribution of values and the quadrants segmentation are independent.

H1: the distribution of values and the quadrants segmentation are not independent.

The expected counts, test statistic, and degrees of freedom are computed using the Excel template. From those computed values, we find the p-value.

At the 0.01 level of significance, p-value < α, so we can reject H0. There is statistical evidence that the distribution of values and the quadrants segmentation are not independent.

Ho: the distribution of values and the Census Bureau segmentation are independent.

H1: the distribution of values and the Census Bureau segmentation are not independent.

The expected counts, test statistic, and degrees of freedom are computed using the Excel template. From those computed values, we find the p-value.

At the 0.01 level of significance, p-value < α, so we can reject H0. There is statistical evidence that the distribution of values and the Census Bureau segmentation are not independent.

After analyzing the results presented in the exhibits, we conclude that the nine-nations segmentation is a useful alternative to the quadrants or the Census Bureau divisions of the country, because the chi-square test of independence gives a different result for it than for the other two. For the nine-nations segmentation, the distribution of values is independent of the segmentation, whereas for the quadrants and the Census Bureau divisions the distribution of values depends on the segmentation.

QUESTION 4:

The chi-square goodness-of-fit test may be applied to testing any hypothesis about the distribution of a population.

H0 : The population has a normal distribution.

H1 : The population is not normally distributed.

We begin by defining boundaries with known probabilities for the standard normal random variable Z. We know that the probability that the value of Z will be between -1 and +1 is about 0.68. We also know that the probability that Z will be between -2 and +2 is about 0.95, and we know other such probabilities. We may use Appendix C, Table 2, to find more exact probabilities. Let us use the table and define several nonoverlapping intervals for Z with known probabilities. We will form intervals of about the same probability. The figure shows one possible partition of the standard normal distribution into intervals and their probabilities, obtained from Table 2.

The partition was obtained as follows. We know that the area under the curve between 0 and 1 is 0.3413 (from Table 2). Looking for an area of about half that size, 0.1700, we find that the appropriate point is z = 0.44. A similar relationship exists on the negative side of the number line. Thus, using just the values 0.44 and 1 and their negatives, we get a complete partition of the Z scale into six intervals: -∞ to -1, with associated probability 0.1587; -1 to -0.44, with probability 0.1713; -0.44 to 0, with probability 0.1700; 0 to 0.44, with probability 0.1700; 0.44 to 1, with probability 0.1713; and, finally, 1 to ∞, with probability 0.1587. Breakdowns into other intervals may also be used.

[Figure: Intervals and Their Standard Normal Probabilities]

Now we transform the Z scale values to interval boundaries for the original problem. Taking x̄ and s as if they were the mean and the standard deviation of the population, we use the transformation X = µ + σZ with x̄ = 945.38 and s = 11.156 substituted for the unknown parameters. The Z value boundaries we just obtained are substituted into the transformation, giving us the following cell boundaries:

x1 = 945.38 + (-1)(11.156) = 934.224

x2 = 945.38 + (-0.44)(11.156) = 940.47136

x3 = 945.38 + (0)(11.156) = 945.38

x4 = 945.38 + (0.44)(11.156) = 950.28864

x5 = 945.38 + (1)(11.156) = 956.536
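The partition and the cell boundaries can be cross-checked without the printed normal table. The sketch below is our own illustration and uses NormalDist from the Python standard library in place of Appendix C, Table 2, so the probabilities agree with the table values only after rounding to four decimals.

```python
from statistics import NormalDist

x_bar, s = 945.38, 11.156               # sample mean and standard deviation
z_cuts = [-1.0, -0.44, 0.0, 0.44, 1.0]  # cut points on the Z scale

# Transform each Z cut point to the original scale: X = mean + sd * Z.
boundaries = [x_bar + s * z for z in z_cuts]

# Cell probabilities are differences of the standard normal CDF,
# with 0 and 1 added for the two open-ended tails.
std_normal = NormalDist()
cdf_values = [0.0] + [std_normal.cdf(z) for z in z_cuts] + [1.0]
cell_probs = [upper - lower for lower, upper in zip(cdf_values, cdf_values[1:])]

for z, x in zip(z_cuts, boundaries):
    print(f"z = {z:5.2f}  ->  x = {x:.5f}")
print("cell probabilities:", [round(p, 4) for p in cell_probs])
```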

Cells and Their Expected Counts

Cell               Expected count
0-934.2            15.87
934.22-940.5       17.13
940.47-945.4       17
945.38-950.3       17
950.29-956.5       17.13
956.54 and above   15.87

The cells and their expected counts are given in the above table. Cell boundaries are rounded to two decimal places. Recall that the expected count in each cell is equal to the cell probability times the sample size, Ei = npi. In this question, the pi are obtained from the normal table and are, in order, 0.1587, 0.1713, 0.1700, 0.1700, 0.1713, and 0.1587. Multiplying these probabilities by n = 100 gives us the expected counts. Note that all expected cell counts are above 5, and, therefore, the chi-square distribution is an adequate approximation to the distribution of the test statistic X² under the null hypothesis.

Observed Cell Counts

Cell               Observed count
0-934.2            14
934.22-940.5       16
940.47-945.4       21
945.38-950.3       16
950.29-956.5       15
956.54 and above   18

The table gives the observed counts of the scores falling in each of the cells. It was obtained by looking at each data point in the sample and classifying it into one of the chosen categories.

Cell               i   Oi   Ei      Oi-Ei   (Oi-Ei)²   (Oi-Ei)²/Ei
0-934.2            1   14   15.87   -1.87   3.5        0.22
934.22-940.5       2   16   17.13   -1.13   1.28       0.08
940.47-945.4       3   21   17.00    4      16         0.94
945.38-950.3       4   16   17.00   -1      1          0.06
950.29-956.5       5   15   17.13   -2.13   4.5        0.26
956.54 and above   6   18   15.87    2.13   4.5        0.28
Sum                                                    1.84

To facilitate the computation of the chi-square statistic, we arrange the observed and expected cell counts in a single table and show the computations necessary for obtaining the value of the test statistic. This has been done in the above table. The sum of all the entries in the last column of the table is the value of the chi-square statistic. Because two parameters (the mean and the standard deviation) were estimated from the sample, the appropriate distribution has k – 3 = 6 – 3 = 3 degrees of freedom. We now consult the chi-square table, Appendix C, Table 4, and we find that the computed statistic value X² = 1.84 falls in the nonrejection region for any level of α in the table. At the 0.05 level of significance, the p-value is 0.6051 > α = 0.05, so we cannot reject H0. There is therefore no statistical evidence that the population is not normally distributed.
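The test statistic and its p-value can be recomputed from the observed counts and the table probabilities. The sketch below is our own illustration; the closed-form upper-tail formula it uses is valid only for 3 degrees of freedom, which is the case here, and the function name is an assumption of ours. Working with unrounded terms gives a statistic slightly above the 1.84 obtained from the rounded table entries.

```python
from math import erfc, exp, pi, sqrt

observed = [14, 16, 21, 16, 15, 18]
cell_probs = [0.1587, 0.1713, 0.1700, 0.1700, 0.1713, 0.1587]  # from the normal table
n = sum(observed)                                              # sample size, 100

expected = [n * p for p in cell_probs]
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# k cells, minus 1, minus 2 parameters estimated from the sample.
df = len(observed) - 1 - 2


def chi2_upper_tail_df3(x):
    """P(chi-square with 3 degrees of freedom > x), closed form for df = 3 only."""
    return erfc(sqrt(x / 2.0)) + sqrt(2.0 * x / pi) * exp(-x / 2.0)


p_value = chi2_upper_tail_df3(chi2_stat)
print(f"chi-square = {chi2_stat:.4f}, df = {df}, p-value = {p_value:.4f}")
```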

QUESTION 5:

The Thorndikes have submitted a bid to be the sole supplier of swimming goggles for the U.S. Olympic team. OptiView, Inc. has been supplying the goggles for many years, and the Olympic committee has said it will switch to Thorndike only if the Thorndike goggles are found to be significantly better in a standard leakage test. For purposes of fairness, the committee has purchased 16 examples from each manufacturer in the retail marketplace. Testing involves installing the goggles on a surface that simulates the face of a swimmer, then submitting them to increasing water pressure (expressed in meters of water depth) until the goggles leak. Both companies have received copies of the test results and have an opportunity to offer their respective comments before the final decision is made. Ted Thorndike has just received his company’s copy of the results, rounded to the nearest meter of water depth.

Thorndike Goggles (meters): 82, 117, 91, 95, 110, 81, 101, 108, 106, 114, 106, 95, 101, 92, 94, 108

OptiView Goggles (meters): 73, 95, 83, 106, 70, 103, 86, 100, 92, 108, 94, 77, 109, 90, 107, 73

(The greater the number of meters before leakage, the better the quality of the goggles.)

1. Based on analysis of these data, formulate a commentary that Ted Thorndike might wish to make to the committee.

2. Based on analysis of these data, formulate a commentary that OptiView might wish to make to the committee.

3. What would be your recommendation to the committee?

1. To determine whether the Thorndike goggles are of better quality than the OptiView goggles, we use the template for testing the difference in means with the unequal-variances t-test. In the template, “Thorndike goggles” is sample 1 and “OptiView goggles” is sample 2.

The null and alternative hypotheses are:

H0: µ1 ≤ µ2 or µ1 - µ2 ≤ 0 (the Thorndike goggles are not better than the OptiView goggles)

H1: µ1 > µ2 or µ1 - µ2 > 0 (the Thorndike goggles are better than the OptiView goggles)

We conduct the hypothesis test using the template. In the template printout, at the level of significance α = 0.05, the p-value for this test is 0.0296. The p-value is smaller than α, so we can reject the null hypothesis H0 and conclude that the Thorndike goggles are better than the OptiView goggles. The p-value means that if the Thorndike goggles were in fact no better, there would be only a 2.96% probability of observing a sample difference this large in Thorndike's favor. Based on these data, Ted Thorndike should ask the committee to switch to his company’s product, because the data show that the Thorndike goggles are better than the OptiView goggles.
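The test statistic behind the template output can be recomputed from the two samples. The sketch below is our own illustration of the unequal-variances (Welch) t-test in plain Python; the function name welch_t is an assumption of ours, and the p-value itself is left to the template because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

thorndike = [82, 117, 91, 95, 110, 81, 101, 108, 106, 114, 106, 95, 101, 92, 94, 108]
optiview = [73, 95, 83, 106, 70, 103, 86, 100, 92, 108, 94, 77, 109, 90, 107, 73]


def welch_t(sample1, sample2):
    """Return (t statistic, approximate degrees of freedom) for mean1 - mean2."""
    n1, n2 = len(sample1), len(sample2)
    # Squared standard error contributed by each sample (sample variance / n).
    v1 = variance(sample1) / n1
    v2 = variance(sample2) / n2
    t_stat = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


t_stat, df = welch_t(thorndike, optiview)
print(f"mean difference = {mean(thorndike) - mean(optiview):.4f} meters")
print(f"t = {t_stat:.4f}, df = {df:.2f}")
```

A t value of this size with roughly 28 degrees of freedom lies between the one-tailed 5% and 2.5% critical values, which is consistent with the reported p-value of 0.0296.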

2. We conduct the hypothesis test again. This time, “OptiView goggles” is sample 1 and “Thorndike goggles” is sample 2.

The null and alternative hypotheses are:

H0: µ1 ≤ µ2 or µ1 - µ2 ≤ 0 (the OptiView goggles are not better than the Thorndike goggles)

H1: µ1 > µ2 or µ1 - µ2 > 0 (the OptiView goggles are better than the Thorndike goggles)

We conduct the hypothesis test using the template. In the template printout, at the level of significance α = 0.05, the p-value for this test is 0.9704, which is larger than α. Thus we cannot reject the null hypothesis H0: there is no evidence that the OptiView goggles are better than the Thorndike goggles, and they may be worse. This p-value is the complement of the one in the first test (0.9704 = 1 - 0.0296). Based on these data, OptiView must show the committee that it will improve its product as soon as possible, and persuade the committee to continue using its product.

3. The two hypothesis tests give enough evidence that the Thorndike goggles are better than the OptiView goggles: the one-tailed test in Thorndike's favor is significant at the 0.05 level (p-value = 0.0296), while the test in OptiView's favor is far from significant (p-value = 0.9704). There is therefore no reason to continue using the OptiView goggles, which are of lower quality than the Thorndike goggles. Based on the results of the leakage test, we recommend that the committee switch from OptiView goggles to Thorndike goggles, since the Thorndike goggles perform better.
