Pairing Things Down with Multiple Comparisons

May 15

In This Chapter

When and how to follow up ANOVA with multiple comparisons

Comparing two well-known multiple comparison procedures

You’re comparing the means of not two, but k independent populations, and you find out (using ANOVA; see Chapter 9) that you reject Ho: All the population means are equal, and you conclude Ha: At least two of the population means are different. Now you gotta know: which of those populations are different? Answering this question requires a follow-up procedure to ANOVA called multiple comparisons, which makes sense because you want to compare the multiple means you have and see which ones are different.

In this chapter, you figure out when you need to use a multiple comparison procedure. You see two of the most well-known multiple comparison procedures: Fisher’s LSD (least significant difference) and Tukey’s test. They can help you answer that burning question: So some of the means are different, but which ones are different?

Following Up after ANOVA

This section runs through the ANOVA procedure in the case where Ho is rejected and leads you to the next step: multiple comparisons.

Suppose you want to compare the average number of cell-phone minutes used per month for children and young adults, where the age groups are the following:

Group 1: 19 years old and under

Group 2: 20-39 years old

Group 3: Adult males 40-59 years old

Group 4: Adult females 60 years old and over

You collect data on a random sample of 10 people from each group (where no one knows anyone else to keep independence), and you record the number of minutes each person used their cell phone in one month. The first ten lines of a hypothetical data set are shown in Table 10-1.

Table 10-1 One Month’s Cell Phone Minutes for Four Age Groups

19 and Under   20-39       40-59       60 and Over
(Group 1)      (Group 2)   (Group 3)   (Group 4)
800            250         700         200
850            350         700         120
800            375         750         150
650            320         650         90
750            430         550         20
680            380         580         150
800            325         700         200
750            410         700         130
690            450         590         160
710            390         650         30

The means and standard deviations of the sample data are shown in Figure 10-1, as well as confidence intervals for each of the population means separately (see Chapter 3 for info on confidence intervals). Looking at Figure 10-1, it appears that all four means are different, with 19 and under heading the pack, with 40- to 59-year-olds not far behind, and with 20- to 39-year-olds and those over 60 bringing up the rear (in that order).

Knowing that man can’t live by sample results alone, you decide that ANOVA is needed to see whether any differences that appear in the samples can be extended to the population (see Chapter 9). By using the ANOVA procedure, you test whether the average cell minutes used is the same across all groups. The results of the ANOVA, using the data from Table 10-1, are shown in Figure 10-2.

Figure 10-1: Basic statistics and confidence intervals for the cell-phone data.

Level     N    Mean     StDev
Group 1   10   748.00   64.60
Group 2   10   368.00   59.08
Group 3   10   657.00   64.99
Group 4   10   125.00   62.41

[Plot: individual 95% CIs for each mean, based on pooled StDev, drawn on an axis marked 200, 400, 600, 800.]
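The summary columns of Figure 10-1 can be recomputed directly from Table 10-1. The following is a minimal sketch using only Python's standard library; the names `groups` and `summarize` are made up for illustration, and the confidence-interval plot is not reproduced.

```python
from statistics import mean, stdev

# Data copied from Table 10-1 (cell-phone minutes in one month).
groups = {
    "Group 1": [800, 850, 800, 650, 750, 680, 800, 750, 690, 710],
    "Group 2": [250, 350, 375, 320, 430, 380, 325, 410, 450, 390],
    "Group 3": [700, 700, 750, 650, 550, 580, 700, 700, 590, 650],
    "Group 4": [200, 120, 150, 90, 20, 150, 200, 130, 160, 30],
}


def summarize(samples):
    """Return {name: (n, sample mean, sample standard deviation)}."""
    return {name: (len(x), mean(x), stdev(x)) for name, x in samples.items()}


summary = summarize(groups)
for name, (n, m, s) in summary.items():
    print(f"{name}  N={n}  Mean={m:.2f}  StDev={s:.2f}")
```

The printed means and standard deviations should line up with the Mean and StDev columns in Figure 10-1.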

Looking at Figure 10-2, the F-test for equality of all four population means has a p-value of 0.000, meaning it is less than 0.001. That says at least two of these groups have a significant difference in their cell-phone use (see Chapter 9 for info on the F-test and its results).

Figure 10-2: ANOVA results for comparing cell-phone use for four age groups.

One-way ANOVA: Group 1, Group 2, Group 3, Group 4

Source   DF   SS        MS       F        P
Factor   3    2416010   805337   204.13   0.000
Error    36   142030    3945
Total    39   2558040

S = 62.81   R-Sq = 94.5%   R-Sq(adj) = 93.99%
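To see where the numbers in Figure 10-2 come from, here is a rough sketch of the one-way ANOVA arithmetic applied to the Table 10-1 data. The function name `one_way_anova` and the dictionary keys are hypothetical, and the sketch stops at the F statistic: turning F into a p-value needs the F distribution, which Python's standard library doesn't provide.

```python
from math import sqrt
from statistics import mean

# Data copied from Table 10-1, one list per age group.
groups = [
    [800, 850, 800, 650, 750, 680, 800, 750, 690, 710],  # Group 1
    [250, 350, 375, 320, 430, 380, 325, 410, 450, 390],  # Group 2
    [700, 700, 750, 650, 550, 580, 700, 700, 590, 650],  # Group 3
    [200, 120, 150, 90, 20, 150, 200, 130, 160, 30],     # Group 4
]


def one_way_anova(samples):
    """Compute the sums of squares, mean squares, and F for a one-way ANOVA."""
    k = len(samples)
    n_total = sum(len(x) for x in samples)
    grand_mean = sum(sum(x) for x in samples) / n_total
    # Between-groups ("Factor") and within-groups ("Error") sums of squares.
    ss_factor = sum(len(x) * (mean(x) - grand_mean) ** 2 for x in samples)
    ss_error = sum((v - mean(x)) ** 2 for x in samples for v in x)
    df_factor, df_error = k - 1, n_total - k
    ms_factor = ss_factor / df_factor
    ms_error = ss_error / df_error
    return {
        "df_factor": df_factor,
        "df_error": df_error,
        "ss_factor": ss_factor,
        "ss_error": ss_error,
        "ms_factor": ms_factor,
        "ms_error": ms_error,
        "f": ms_factor / ms_error,
        "s": sqrt(ms_error),  # pooled standard deviation
    }


anova = one_way_anova(groups)
print(f"SS Factor = {anova['ss_factor']:.0f}, SS Error = {anova['ss_error']:.0f}")
print(f"F = {anova['f']:.2f}, S = {anova['s']:.2f}")
```

The sums of squares, F, and S printed here should agree with the Factor and Error rows of Figure 10-2.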

Okay, so what’s your next question? You just found out that the average number of cell-phone minutes per month isn’t the same across these four groups. Remember, this doesn’t mean all four groups are different (see Chapter 9). However, it does mean that at least two groups are significantly different in their cell-phone use. So your questions are: Which groups are different, and how are they different?

Determining which populations have differing means after Ho has been rejected in ANOVA involves a new data-analysis technique called multiple comparisons. While many different multiple comparison procedures are out there, statisticians have their favorites, which I present in the next section.

Don’t attempt to explore the data with a multiple comparison procedure if the hypothesis of equal population means isn’t rejected. In that case, you must conclude that you don’t have enough evidence to say the population means aren’t equal, so you must stop there. Always look at the p-value of the F-test on the ANOVA output before moving on to conduct any multiple comparisons.

Pinpointing Differing Means with Fisher and Tukey

You’ve conducted ANOVA to see whether a group of k populations have the same mean, and you rejected Ho. You conclude that at least two of those populations have different means. But you don’t have to stop there; you can go on to find out how many and which means are different by conducting multiple comparison tests.

In this section, you see two of the most well-known multiple comparison procedures: Fisher’s paired differences (also known as Fisher’s test or Fisher’s LSD) and Tukey’s simultaneous confidence intervals (also known as Tukey’s test).

Although I only discuss two procedures in this section, tons of other multiple comparison procedures are out there. Although the other procedures’ methods differ a great deal, their overall goal is the same: to figure out which population means differ by comparing their sample means.

Fishing for differences with Fisher’s LSD

In this section, I outline Fisher’s LSD and apply it to the cell-phone example.

Examining Fisher’s LSD procedure

Suppose you’re comparing k population means. Fisher’s LSD (short for least significant difference) conducts a t-test on each of the k(k - 1)/2 pairs of populations in the study, each one at level α = 0.05. For example, if you have four populations labeled A, B, C, D, you would have 4(4 - 1)/2 = 6 t-tests to perform: A versus B; A versus C; A versus D; B versus C; B versus D; and C versus D.

The number of tests is calculated by knowing that you have k possible means for the first one in the pair, then k - 1 left for the second one in the pair. Because the order of the means doesn’t matter, you can divide by 2 to avoid overcounting.
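The counting argument can be checked by simply listing the pairs. This is a small sketch using `itertools.combinations` from Python's standard library; the function name `pairwise_comparisons` is made up for illustration.

```python
from itertools import combinations


def pairwise_comparisons(labels):
    """List every unordered pair of populations; order within a pair doesn't matter."""
    return list(combinations(labels, 2))


labels = ["A", "B", "C", "D"]
pairs = pairwise_comparisons(labels)
k = len(labels)

for first, second in pairs:
    print(f"{first} versus {second}")
print(f"{len(pairs)} tests; k(k - 1)/2 = {k * (k - 1) // 2}")
```

For four populations the list has six entries, the same six t-tests named above.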

Fisher’s LSD is very straightforward, easy to conduct, and easy to understand. However, Fisher’s LSD has some issues. Because each t-test is conducted at level α = 0.05, each test done has a 5 percent chance of making a Type I error (rejecting Ho when you shouldn’t have; see Chapter 3). Although a 5-percent error rate for each test doesn’t seem too bad, the errors have a multiplicative effect as the number of tests increases. For example, the chance of making at least one Type I error with six t-tests, each at level α = 0.05, is about 26.5 percent, which would be your overall error rate for the procedure.

You could help lower the error rate for Fisher’s test if you lower the value of α for each test from 0.05 to, say, 0.01. However, doing so makes it harder to reject Ho for each pair of means. A lower value of α also doesn’t solve the error-rate problem; it just slows it down for a bit, until the number of tests gets larger, and the error rate goes back up again.

If you want or need to know how I arrived at the number 26.5 percent as the overall error rate in that last example, here it goes: The probability of making a Type I error for each test is 0.05. The chance of making at least one error in six tests equals one minus the probability of making no errors in six tests. The chance of not making an error in one test is 1 - α = 0.95. The chance of no error in six tests is this quantity raised to the sixth power, or (0.95)^6, which equals 0.735. Now take one minus this quantity to get 1 - 0.735 = 0.265, or about 26.5 percent.
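The same calculation is easy to repeat for other values of α and other numbers of tests. The sketch below follows the multiplication used in the text, which treats the tests as independent; the function name `overall_error_rate` is hypothetical.

```python
def overall_error_rate(alpha, num_tests):
    """Chance of at least one Type I error in num_tests tests,
    assuming (as the multiplication in the text does) independent tests."""
    return 1 - (1 - alpha) ** num_tests


# Six tests at alpha = 0.05: the example from the text.
print(f"{overall_error_rate(0.05, 6):.4f}")

# Lowering alpha to 0.01 helps, but the rate climbs again as tests pile up.
for num_tests in (6, 15, 45):
    print(num_tests, f"{overall_error_rate(0.01, num_tests):.4f}")
```

Without rounding (0.95)^6 to 0.735 first, the six-test rate comes out as 0.2649 to four decimal places, which is the same "about 26.5 percent" figure.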

To conduct Fisher’s LSD in Minitab, go to Stat>ANOVA>One-way or One-way unstacked. (If your data appears in two columns, with Column 1 representing the population number and Column 2 representing the response, just click One-way because your data is stacked. If your data is shown in k columns, one for each of the k populations, click One-way unstacked.) In either case, the next step is to highlight the data for the groups you’re comparing and click Select. Then click on Comparisons. Click on Fisher’s. The individual error rate is listed at 5 (percent), which is typical. If you want to change it, type in the desired error rate (between 0.5 and 0.001) and click OK. You may type in your error rate as a decimal, such as 0.05, or as a number greater than one, such as 5. Numbers greater than one are interpreted as a percentage.
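If the difference between stacked and unstacked data is unclear, the sketch below shows the same values in both layouts, using the first three rows of Table 10-1. It is plain Python rather than anything Minitab-specific, and the helper names `stack` and `unstack` are made up for illustration.

```python
# Unstacked: one column per population (first three rows of Table 10-1).
unstacked = {
    1: [800, 850, 800],
    2: [250, 350, 375],
    3: [700, 700, 750],
    4: [200, 120, 150],
}


def stack(columns):
    """Turn k columns into (population number, response) rows."""
    return [(group, value) for group, values in columns.items() for value in values]


def unstack(rows):
    """Turn (population number, response) rows back into one column per population."""
    columns = {}
    for group, value in rows:
        columns.setdefault(group, []).append(value)
    return columns


stacked = stack(unstacked)
for group, value in stacked[:4]:
    print(group, value)
```

Either layout carries the same information; the choice only determines which menu item reads it correctly.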

Applying Fisher’s LSD to cell phones

An ANOVA procedure was done on the cell-phone data presented in Table 10-1 to compare the mean number of minutes used for four age groups. Looking at Figure 10-2, you see Ho (all the population means are equal) was rejected. The next step is to conduct multiple comparisons by using Fisher’s LSD to see which population means differ. Figure 10-3 shows the Minitab output.

The first block of results shows "Group 1 subtracted from" where Group 1 = age 19 and under. Each line after that represents the other age groups (Group 2 = 20- to 39-year-olds, Group 3 = 40- to 59-year-olds, and Group 4 = 60 and over). Each line shows the results of comparing the mean for the other group minus the mean for Group 1. For example, the first line shows Group 2 being compared with Group 1.

Moving to the right in that same row, you see the confidence interval for the difference in these two means, which turns out to be -436.97 to -323.03. Because 0 isn’t contained in this interval, you conclude that these two means are different in the populations also. You can also say, because this difference μ2 - μ1 is negative, that μ2 is less than μ1. Or, a better way to think of it may be that μ1 is greater than μ2. That is, Group 1’s mean is greater than Group 2’s mean.

Figure 10-3: Output showing Fisher’s LSD applied to the cell-phone data.

Fisher 95% Individual Confidence Intervals
All Pairwise Comparisons
Simultaneous confidence level = 80.32%

Group 1 subtracted from:
          Lower     Center    Upper
Group 2   -436.97   -380.00   -323.03
Group 3   -147.97   -91.00    -34.03
Group 4   -679.97   -623.00   -566.03

Group 2 subtracted from:
          Lower     Center    Upper
Group 3   232.03    289.00    345.97
Group 4   -299.97   -243.00   -186.03

Group 3 subtracted from:
          Lower     Center    Upper
Group 4   -588.97   -532.00   -475.03

[Plot: each interval drawn on an axis marked -350, 0, 350, 700.]

Each subsequent row in the "Group 1 subtracted from" section of Figure 10-3 shows similar results. None of the confidence intervals contain 0, so you conclude that the mean cell-phone use for Group 1 isn’t equal to the mean cell-phone use for any other group. Moreover, because all confidence intervals are in negative territory, you can conclude that the mean cell-phone use time for those 19 and under is greater than all the others. This process continues as you move down through the output until all six pairs of means are compared. Then you put them all together into one conclusion.

For example, in the second portion of the output, Group 2 is subtracted from Groups 3 and 4. You see the confidence interval for the "Group 3" line is 232.03 to 345.97; this gives possible values for Group 3’s mean minus Group 2’s mean. The interval is entirely positive, so you conclude that Group 3’s mean is greater than Group 2’s mean (according to this data). On the next line, the interval for Group 4 minus Group 2 is -299.97 to -186.03. All these numbers are negative, so you conclude Group 4’s mean is less than Group 2’s. Combine conclusions to say that Group 3’s mean is greater than Group 2’s, which is greater than Group 4’s.

In the cell-phone example, none of the means are equal to each other, and based on the signs of confidence intervals and the results of all the individual pairwise comparisons, the following order of cell-phone mean usage prevails: μ1 > μ3 > μ2 > μ4. (Hypothetical data aside, it might be the case that 40- to 59-year-olds use a lot of cell phone time because of their jobs.)

Notice near the top of Figure 10-3 that you see "simultaneous confidence level = 80.32 percent." That means the overall error rate for this procedure is 1 – 0.8032 = 0.1968, which is close to 20 percent.
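The Fisher intervals in Figure 10-3 can be approximated by hand from the summary numbers in Figures 10-1 and 10-2. The sketch below is one way to do it, under stated assumptions: each interval is the difference in sample means plus or minus a t cutoff times the pooled standard error, and the cutoff 2.0281 is a t-table value (97.5th percentile, 36 degrees of freedom) typed in by hand rather than computed. The function name `fisher_lsd_intervals` is hypothetical.

```python
from itertools import combinations
from math import sqrt

means = {1: 748.0, 2: 368.0, 3: 657.0, 4: 125.0}  # sample means, Figure 10-1
n_per_group = 10                                   # sample size in each group
ms_error = 142030 / 36                             # SS Error / DF Error, Figure 10-2
T_CRIT = 2.0281  # ASSUMPTION: t-table cutoff for 95% confidence with 36 df


def fisher_lsd_intervals(group_means, n, mse, t_crit):
    """Return {(a, b): (lower, center, upper)} for mean of b minus mean of a."""
    margin = t_crit * sqrt(mse * (2 / n))  # same margin for every pair (equal n)
    intervals = {}
    for a, b in combinations(sorted(group_means), 2):
        center = group_means[b] - group_means[a]
        intervals[(a, b)] = (center - margin, center, center + margin)
    return intervals


intervals = fisher_lsd_intervals(means, n_per_group, ms_error, T_CRIT)
for (a, b), (lower, center, upper) in intervals.items():
    verdict = "differ" if lower > 0 or upper < 0 else "no evidence of a difference"
    print(f"Group {b} - Group {a}: {lower:.2f} {center:.2f} {upper:.2f}  ({verdict})")
```

Each key `(a, b)` corresponds to a "Group a subtracted from" block and its "Group b" row, so the first line printed can be compared against the first row of Figure 10-3.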

Separating the turkeys with Tukey’s test

This section dives into Tukey’s test and applies it to the cell-phone example.

Setting up Tukey’s test

The basic idea behind Tukey’s test is to provide a series of simultaneous confidence intervals for the differences in the means. Like Fisher’s LSD, it examines all possible pairs of means; unlike Fisher’s LSD, it keeps the overall error rate (also known as the familywise error rate) at α. To do that, it holds each individual pair of means to a stricter standard: in Figure 10-4, the individual confidence level is 98.93 percent rather than 95 percent. This difference takes care of a lot of issues raised with Fisher’s LSD procedure (refer to the preceding section).

Although the details of the formulas used for Tukey’s test are beyond the scope of this book, they’re not based on the t-test, but rather on something called a Studentized range statistic, which is based on the highest and lowest means in the group, and their difference. The overall error rate is held at 0.05 because Tukey developed a cutoff value for his test statistic that accounts for all pairwise comparisons (no matter how many means are being compared).

If you calculate the results by hand, you can look at tables to make your conclusions. However, all applications I have ever seen both in the classroom and outside of it use a computer for these calculations. (For sanity’s sake, I suggest you do the same.)

To conduct Tukey’s test in Minitab, go to Stat>ANOVA>One-way or One-way unstacked. (If your data appears in two columns, with Column 1 representing the population number and Column 2 representing the response, just click One-way because your data is stacked. If your data is shown in k columns, one for each of the k populations, click One-way unstacked.) The next step is to highlight the data for the groups you’re comparing and click Select. Then click on Comparisons. Click on Tukey’s. The familywise (overall) error rate is listed at 5 (percent), which is typical. If you want to change it, type in the desired error rate (between 0.5 and 0.001) and click OK. You may type in your error rate as a decimal, such as 0.05, or as a number greater than one, such as 5. Numbers greater than one are interpreted as a percentage.

Doing Tukey’s test on the cell phone data

The Minitab output for comparing the groups regarding cell-phone use by using Tukey’s test appears in Figure 10-4. Looking at Figure 10-4, you see that its results can be interpreted in the same way as for Figure 10-3. Some of the numbers in the confidence intervals are different, but in this case, the main conclusions are the same: Those 19 and under use their cell phones most, followed by 40- to 59-year-olds, then 20- to 39-year-olds, and finally those 60 and over.

The results of Fisher and Tukey don’t always agree, usually because the overall error rate of Fisher’s procedure is larger than Tukey’s (except when only two means are involved). Most statisticians I know prefer Tukey’s procedure over Fisher’s. That doesn’t mean they don’t have other procedures they like even better than Tukey’s, but Tukey’s is the most common procedure, and many people like to use it.

Figure 10-4: Output for Tukey’s test used to compare cell-phone usage.

Tukey 95% Simultaneous Confidence Intervals
All Pairwise Comparisons
Individual confidence level = 98.93%

Group 1 subtracted from:
          Lower     Center    Upper
Group 2   -455.68   -380.00   -304.32
Group 3   -166.68   -91.00    -15.32
Group 4   -698.68   -623.00   -547.32

Group 2 subtracted from:
          Lower     Center    Upper
Group 3   213.32    289.00    364.68
Group 4   -318.68   -243.00   -167.32

Group 3 subtracted from:
          Lower     Center    Upper
Group 4   -607.68   -532.00   -456.32

[Plot: each interval drawn on an axis marked -700, -350, 0, 350.]
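Reading output like Figure 10-4 is mechanical enough to sketch in code. The example below copies the Tukey interval endpoints from the figure, checks each one against 0, and then orders the groups by how many other groups each one beats. The helper `rank_groups` is hypothetical, and counting "wins" this way gives a clean ordering only when every pair differs, as happens here.

```python
# (minuend, subtrahend): (lower, upper) for mean(minuend) - mean(subtrahend),
# copied from the Tukey output in Figure 10-4.
tukey = {
    ("Group 2", "Group 1"): (-455.68, -304.32),
    ("Group 3", "Group 1"): (-166.68, -15.32),
    ("Group 4", "Group 1"): (-698.68, -547.32),
    ("Group 3", "Group 2"): (213.32, 364.68),
    ("Group 4", "Group 2"): (-318.68, -167.32),
    ("Group 4", "Group 3"): (-607.68, -456.32),
}


def rank_groups(intervals):
    """Count, for each group, how many groups it is significantly greater than."""
    wins = {}
    for (minuend, subtrahend), (lower, upper) in intervals.items():
        wins.setdefault(minuend, 0)
        wins.setdefault(subtrahend, 0)
        if lower > 0:        # interval entirely positive: minuend's mean is greater
            wins[minuend] += 1
        elif upper < 0:      # interval entirely negative: subtrahend's mean is greater
            wins[subtrahend] += 1
        # otherwise the interval contains 0: no conclusion for this pair
    ranking = sorted(wins, key=wins.get, reverse=True)
    return ranking, wins


ranking, wins = rank_groups(tukey)
print(" > ".join(ranking))
```

The resulting order matches the conclusion in the text: 19 and under first, then 40- to 59-year-olds, then 20- to 39-year-olds, then those 60 and over.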

Minitab lists another multiple comparison procedure when you ask it to do multiple comparisons: Dunnett’s test. Dunnett’s test is a special multiple comparison procedure used in a designed experiment that contains a control group. The test compares each treatment group to the control group and determines which treatments differ from the control. Dunnett’s test is better able to find real differences in this situation than other multiple comparison procedures, because it focuses only on the differences between each treatment and the control, not the differences between every single pair of treatments in the entire study.
