In This Chapter

^ Understanding what goodness-of-fit really means

^ Using the Chi-square model to test for goodness-of-fit

^ Looking at the conditions for goodness-of-fit tests

Many phenomena in life may appear to be random in the short term, but actually occur according to some preconceived, preselected, or predestined model over the long term. For example, while you don’t know whether it will rain tomorrow, your local meteorologist can give you her model for the percentage of days that it rains, snows, is sunny, or is cloudy, based on the last five years. Whether or not this model is still relevant this year is anyone’s guess, but it’s a model nonetheless. As another example, a biologist can produce a model for predicting the number of goslings raised by a pair of geese per year, even though you have no idea what the pair in your backyard will do. Is his model correct? Here’s your chance to find out.

In this chapter, you build models for the proportion of outcomes that fall into each category for a categorical variable. You then test these models by collecting data and comparing what you observe in your data to what you expect from the model. You do this through a goodness-of-fit test that’s based on the Chi-square distribution. In a way, a goodness-of-fit test is like a reality check of a model for categorical data.

Finding the Goodness-of-Fit Statistic

The general idea of a goodness-of-fit procedure involves determining what you expect to find and comparing it to what you actually observe in your own sample through the use of a test statistic. This test statistic is called the goodness-of-fit test statistic because it measures how well your model (what you expected) fits your actual data (what you observed).

In this section, you see how to figure out the numbers that you should expect in each category given your proposed model, and you also see how to put those expected values together with your observed values to form the goodness-of-fit test statistic.

What’s observed versus what’s expected

For an example of something that can be observed versus what’s expected, look no further than a bag of tasty M&M’S Milk Chocolate Candies. (A ton of different kinds of M&M’S are out there, and each kind has its own variation of colors and tastes. But for this study, any reference I give to M&M’S is to the original milk chocolate candy – my favorite.) The percentage of each color of M&M’S that appears in a bag is something Mars (the company that makes M&M’S) spends a lot of time thinking about. Mars has specific percentages of each color that it wants in its M&M’S bags, which it determines through comprehensive marketing research based on what people like and want to see. Mars then posts its current percentages for each color of M&M’S on its Web site. Table 15-1 shows the percentage of M&M’S of each color in 2006.

Table 15-1  Expected Percentage of Each Color of M&M’S Milk Chocolate Candies (2006)

Color     Percentage
Brown     13%
Yellow    14%
Red       13%
Blue      24%
Orange    20%
Green     16%

Now that you know what to expect from a bag of M&M’S, the next question is how does Mars deliver? If you open a bag of M&M’S right now, would you get the percentages of each color that you’re supposed to get? You know from your previous studies in statistics that sample results vary (for a quick review of this idea, see Chapter 3). So you can’t expect each bag of M&M’S to have exactly the correct number of each color of M&M’S as listed in Table 15-1. However, in order to keep customers happy, Mars should get close to the expectations. How can you determine how close they do get?

You now know what percentages are expected to fall into each category in the entire population of all M&M’S (that means every single M&M’S Milk Chocolate Candy that’s currently being made), from Table 15-1. This set of percentages is called the expected model for the data. You want to see whether the percentages in the expected model are actually occurring in the packages you buy. To start this process, you can take a sample of M&M’S (after all, you can’t check every single one in the population) and make a table showing what percentage of each color you observed. Then you can compare this table of observed percentages to the expected model.

The expected percentages are either given to you, as they are for the M&M’S, or you can figure them out by using math techniques. For example, if you’re examining a single die to determine whether or not it’s a fair die, you know that if the die is fair, you should expect 1/6 of the outcomes to fall into each category of 1, 2, 3, 4, 5, and 6.

As an example, I examined one 1.69-ounce bag of plain, milk-chocolate M&M’S (tough job, but someone has to do it), and you can see my results in Table 15-2. (Think of this bag as a random sample of M&M’S, even though it’s not technically the same as reaching into a silo filled with M&M’S and pulling out a true random sample of 1.69 ounces. For the sake of argument, one bag is okay.)

Table 15-2  Percentage of M&M’S Observed in One Bag (1.69 oz.)

Color     Number Observed   Percentage Observed
Brown     4                 7.14%
Yellow    10                17.86%
Red       4                 7.14%
Blue      10                17.86%
Orange    15                26.79%
Green     13                23.21%
TOTAL     56                100.00%

Now you look at what I observed in my sample (Table 15-2) and compare it to what I expected to get (Table 15-1, last column). Notice that I observed a lower percentage of brown, red, and blue M&M’S than expected. I also observed a higher percentage of yellow, orange, and green M&M’S than expected. You know that sample results vary by random chance, from sample to sample, and that the difference I observed may just be due to this chance variation. But could the differences indicate that the expected percentages, reported by Mars, aren’t being followed?
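The comparison between the two tables can be sketched in a few lines of Python. This is only an illustration: the variable and function names are made up for this example, and the numbers are the ones from Tables 15-1 and 15-2.

```python
# Expected percentages (Table 15-1) and observed counts (Table 15-2).
# All names here are illustrative, not part of any library.
expected_pct = {"Brown": 13, "Yellow": 14, "Red": 13,
                "Blue": 24, "Orange": 20, "Green": 16}
observed_count = {"Brown": 4, "Yellow": 10, "Red": 4,
                  "Blue": 10, "Orange": 15, "Green": 13}

def observed_percentages(counts):
    """Convert cell counts to percentages of the sample total."""
    total = sum(counts.values())
    return {cell: 100 * count / total for cell, count in counts.items()}

n = sum(observed_count.values())          # number of candies in the bag
observed_pct = observed_percentages(observed_count)

for color in expected_pct:
    diff = observed_pct[color] - expected_pct[color]
    direction = "higher" if diff > 0 else "lower"
    print(f"{color:<7} observed {observed_pct[color]:6.2f}%  "
          f"expected {expected_pct[color]:3d}%  ({direction} than expected)")
```

The printout only tells you the direction of each difference; it says nothing yet about whether the differences are bigger than chance alone would produce.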

It stands to reason that if the differences between what you observed and what you expected are small, you should attribute that difference to chance and let the expected model stand. On the other hand, if the differences between what you observed and what you expected are large enough, you may have enough evidence to indicate that the expected model has some problems. How do you know which conclusion to make? The operative phrase is "if the differences are large enough." You need to quantify this term large enough. Doing so takes a bit more machinery, so keep reading.

Calculating the goodness-of-fit statistic

The goodness-of-fit statistic is one number that sums up the total difference between the number you expect in each cell and the number you observe. The term cell is used to express each individual category within a table format. For example, in the M&M’S example, the first column of Tables 15-1 and 15-2 contains six cells, one for each color of M&M’S. For any cell, the number of items you observe in that cell is called the observed cell count. The number of items you expect in that cell (under the given model) is called the expected cell count for that cell. You get the expected cell count by taking the expected cell percentage times the sample size.

The expected cell count is just a proportion of the total, so it doesn’t have to be a whole number. For example, if you roll a fair die 200 times, you should expect to roll ones 1/6, or 16.67 percent, of the time. In terms of the number of ones you expect, it should be (1/6) * 200 = 33.33. Use the 33.33 in your calculations for goodness-of-fit; don’t round to a whole number. Your final answer is more accurate that way.

The goodness-of-fit statistic is based on the number in each cell rather than the percentage in each cell because percents are a bit deceiving. If you know that 8 out of 10 people support a certain view, that’s 80 percent. But 80 out of 100 is also 80 percent. Which one would you feel is a more precise statistic? The 80 out of 100, because it uses more information. Using percents alone disregards the sample size. Using the counts (the number in each group) keeps track of the amount of precision you have.

For example, if you roll a fair die, you expect the proportion of ones to be 1/6. If you roll that fair die 600 times, the expected number of ones will be (1/6) * 600 = 100. That number (100) is the expected cell count for the cell that represents the outcome of one. If you roll this die 600 times and get 95 ones, then 95 is the observed cell count for that cell.
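If you want to check the die arithmetic yourself, here is a minimal Python sketch. The function name expected_count is made up for this illustration, and exact fractions are used so that 1/6 isn’t rounded along the way.

```python
from fractions import Fraction

def expected_count(proportion, sample_size):
    """Expected cell count = expected proportion times the sample size."""
    return proportion * sample_size

one_sixth = Fraction(1, 6)   # proportion of ones for a fair die

# 200 rolls: the expected count is not a whole number, and that is fine.
print(float(expected_count(one_sixth, 200)))   # about 33.33

# 600 rolls: the expected count happens to be a whole number.
print(expected_count(one_sixth, 600))          # 100
```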

The formula for the goodness-of-fit statistic is given by the following:

$$\sum_{\text{all cells}} \frac{(O - E)^2}{E}$$

where E is the expected number in a cell and O is the observed number in a cell. The steps for this calculation are as follows:

1. For the first cell, find the expected number for that cell (E) by taking the percentage expected in that cell times the sample size.

2. Take the observed value in the first cell (O) minus the number of items that are expected in that cell (E).

3. Square that difference.

4. Divide the answer by the number that’s expected in that cell.

5. Repeat steps 1 through 4 for each cell.

6. Add up the results to get the goodness-of-fit statistic.

The reason you divide by the expected cell count in the goodness-of-fit statistic (step four) is to take into account the magnitude of any differences you find. For example, if you expect 100 items to fall in a certain cell and you get 95, the difference is 5. But in terms of a percentage, this difference is only 5/100 = 5 percent. However, if you expected 10 items to fall into that cell and you observed 5 items, the difference is still 5, but in terms of a percentage, it’s 5/10 = 50 percent. This difference is much larger in terms of its impact. The goodness-of-fit statistic operates much like a percentage difference. The only added element is to square the difference to make it positive. (That’s done because whether you expected 10 and got 15, or expected 10 and got 5, makes no difference; either way, you’re off by 50 percent.)
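As a worked check of why dividing by E matters, here are the two cells from the paragraph above written out as contributions to the goodness-of-fit statistic, using the same observed-minus-expected, squared, divided-by-expected recipe from the steps:

```latex
\[
\frac{(95 - 100)^2}{100} = \frac{25}{100} = 0.25
\qquad\text{versus}\qquad
\frac{(5 - 10)^2}{10} = \frac{25}{10} = 2.5
\]
```

Both cells are off by 5, but the cell with the smaller expected count contributes ten times as much to the statistic.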

Table 15-3 shows the step-by-step calculation of the goodness-of-fit statistic for the M&M’S example, where O indicates observed cell counts and E indicates expected cell counts. To get the expected cell counts, you take the expected percentages shown in Table 15-1 and multiply by 56, because 56 is the number of M&M’S I had in my sample. The observed cell counts are the ones found in my sample, shown in Table 15-2.

Table 15-3  Goodness-of-Fit Statistic for M&M’S Example

Color     O     E                      O – E                  (O – E)^2    (O – E)^2 / E
Brown     4     0.13 * 56 = 7.28       4 – 7.28 = -3.28       10.76        1.48
Yellow    10    0.14 * 56 = 7.84       10 – 7.84 = 2.16       4.67         0.60
Red       4     0.13 * 56 = 7.28       4 – 7.28 = -3.28       10.76        1.48
Blue      10    0.24 * 56 = 13.44      10 – 13.44 = -3.44     11.83        0.88
Orange    15    0.20 * 56 = 11.20      15 – 11.20 = 3.80      14.44        1.29
Green     13    0.16 * 56 = 8.96       13 – 8.96 = 4.04       16.32        1.82
TOTAL     56    56                                                         7.55

The goodness-of-fit statistic for the M&M’S example turns out to be 7.55, the number in the lower-right corner of Table 15-3. This number represents the total squared difference between what I expected and what I observed, adjusted for the magnitude of each expected cell count. The next question is how to interpret this value of 7.55. Is it large enough to indicate that colors of M&M’S in the bag aren’t following the percentages posted by Mars? The next section addresses how to make sense of these results.
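The six steps and the Table 15-3 arithmetic can be sketched in Python as follows. The function name goodness_of_fit and the dictionary layout are choices made for this illustration only. Note that the code carries full precision, so its total comes out at about 7.54; the 7.55 in Table 15-3 results from adding up the last column after each entry has been rounded to two decimals.

```python
# Expected proportions (Table 15-1) and observed counts (Table 15-2).
expected_prop = {"Brown": 0.13, "Yellow": 0.14, "Red": 0.13,
                 "Blue": 0.24, "Orange": 0.20, "Green": 0.16}
observed = {"Brown": 4, "Yellow": 10, "Red": 4,
            "Blue": 10, "Orange": 15, "Green": 13}

def goodness_of_fit(observed, expected_prop):
    """Return the goodness-of-fit statistic and each cell's contribution."""
    n = sum(observed.values())                       # sample size
    contributions = {}
    for cell, o in observed.items():
        e = expected_prop[cell] * n                  # step 1: expected count
        contributions[cell] = (o - e) ** 2 / e       # steps 2-4
    return sum(contributions.values()), contributions  # step 6: add them up

statistic, contributions = goodness_of_fit(observed, expected_prop)
for color, value in contributions.items():
    print(f"{color:<7} {value:.2f}")
print(f"Total   {statistic:.2f}")
```

The per-color values printed here match the last column of Table 15-3.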

Interpreting the Goodness-of-Fit Statistic By Using Chi-Square

After you get your goodness-of-fit statistic, your next job is to interpret it. To do this, you need to figure out the possible values you could have gotten and where your statistic fits in among them. You can accomplish this task with a Chi-square goodness-of-fit test.

The values of a goodness-of-fit statistic actually follow a Chi-square distribution with K - 1 degrees of freedom, where K is the number of categories in your particular population (see Chapter 14 for the full details on Chi-square). You can use the Chi-square table (Table A-3 in the Appendix) to determine how far out your particular goodness-of-fit statistic is, compared to all the others that were possible to get. If your Chi-square statistic is large compared to other values on the Chi-square distribution, the model doesn’t fit; there’s too much of a difference between what you observed and what you expected under the model. However, if your goodness-of-fit statistic is small, you can’t reject the model. (What constitutes a high or low value of a Chi-square test statistic varies for each problem.) This section provides the details on using the Chi-square distribution to test for goodness-of-fit.

The goodness-of-fit statistic follows the main characteristics of the Chi-square distribution. The smallest possible value of the goodness-of-fit statistic is zero. If the M&M’S found in my sample (continuing the example from the previous section) followed the exact percentages found in Table 15-1, the goodness-of-fit statistic would be zero. That’s because the observed counts and the expected counts would be the same, so every value of the observed cell count minus the expected cell count would be zero, and the goodness-of-fit statistic would add up to zero.

The largest possible value of Chi-square isn’t specified, although some values are more likely to occur than others. Each Chi-square distribution has its own set of likely values, as you can see in Figure 15-1. Figure 15-1 shows a simulated Chi-square distribution with 6 – 1 = 5 degrees of freedom (relevant to the M&M’S example). This figure basically gives a breakdown of all the possible values you could have for the goodness-of-fit statistic in this situation and how often they occur. You can see on Figure 15-1 that a Chi-square test statistic of 7.55 isn’t unusually high, indicating that the model for M&M’S colors probably can’t be rejected. However, more particulars are needed before you can formally make that conclusion.

Checking the conditions before you start

Every statistical technique seems to have a catch, and this case is no exception. In order to use the Chi-square distribution to interpret your goodness-of-fit statistic, you have to be sure you have enough information to work with in each cell. The stat gurus usually recommend that the expected count for each cell be greater than or equal to five. If it isn’t, one option is to combine categories to increase the numbers.

In the M&M’S example, the expected cell counts are all above seven (see Table 15-3), so the conditions are met. If this weren’t the case, you could use a larger sample size, because you calculate the expected cell counts by taking the expected percentage in that cell times the sample size. If you increase the sample size, you increase the expected cell count. A higher sample size also increases your chances of detecting a real deviation from the model. This idea is related to the power of the test (see Chapter 3 for information on power).

After you collect your data, it’s not really right to go back and take a new and larger sample. It’s best to set up your sample size ahead of time, and you can do this by determining what sample size you need to get the expected cell counts to be at least five. For example, if you roll a fair die, you expect 1/6 of the outcomes to be ones. If you only take a sample of six rolls, you have an expected cell count of (1/6) * 6 = 1, which isn’t enough. However, if you roll the die 30 times, your expected cell count is (1/6) * 30 = 5, which is just enough to meet the condition.
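Planning the sample size ahead of time comes down to one observation: the cell with the smallest expected proportion is the one that sets the minimum. Here is a small Python sketch of that idea. The function name min_sample_size is invented for this example, and exact fractions are used so that a value like 1/6 doesn’t pick up rounding error before the result is rounded up.

```python
import math
from fractions import Fraction

def min_sample_size(expected_props, min_expected=5):
    """Smallest sample size n for which every expected cell count
    (expected proportion times n) is at least min_expected."""
    smallest = min(expected_props)            # the hardest cell to fill
    return math.ceil(Fraction(min_expected) / smallest)

# Fair die: six categories, each with expected proportion 1/6.
fair_die = [Fraction(1, 6)] * 6
print(min_sample_size(fair_die))              # 30

# M&M'S model from Table 15-1, written as exact fractions.
mm_model = [Fraction(p, 100) for p in (13, 14, 13, 24, 20, 16)]
print(min_sample_size(mm_model))              # 39
```

The bag of 56 candies in the M&M’S example is comfortably above the minimum this sketch reports for that model.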

The steps of the Chi-square goodness-of-fit test

Assuming the necessary condition is met (see the previous section), you can get down to actually conducting a formal goodness-of-fit test.

The general version of the null hypothesis for the goodness-of-fit test is Ho: The model holds for all categories, versus the alternative hypothesis Ha: The model doesn’t hold for at least one category. Each situation will dictate what proportions should be listed in Ho for each category. (For example, if you’re rolling a fair die, you have Ho: proportion of 1s = 1/6; proportion of 2s = 1/6; . . . ; proportion of 6s = 1/6.)

Following are the general steps for the Chi-square goodness-of-fit test, with the M&M’S example illustrating how you can carry out each step:

1. Write down Ho using the percentages that you expect in your model for each category.

Using a subscript to indicate the proportion (p) of M&M’S you expect to fall into each category (see Table 15-1), your null hypothesis is Ho: pBrown = 0.13, pYellow = 0.14, pRed = 0.13, pBlue = 0.24, pOrange = 0.20, and pGreen = 0.16. All these proportions must hold in order for the model to be upheld.

2. Write your Ha: This model doesn’t hold for at least one of the percentages.

Your alternative hypothesis, Ha, in this case, would be: One (or more) of the probabilities given in Ho isn’t correct. In other words, at least one of the colors of M&M’S has a different proportion than what is stated in the model.

3. Calculate the goodness-of-fit statistic using the steps in the previous section.

The goodness-of-fit statistic for M&M’S, from the previous section, is 7.55. As a reminder, you take the observed number in each cell minus the expected number in that cell, square it, and divide by the expected number in that cell. Do that for every cell in the table and add up the results. For the M&M’S example that total is equal to 7.55, the goodness-of-fit statistic.

4. Look up the Chi-square distribution with K - 1 degrees of freedom, where K is the number of categories you have (use Table A-3 in the Appendix).

You compare this statistic (7.55) to the Chi-square distribution with 6 – 1 = 5 degrees of freedom (because you have K = 6 possible colors of M&M’S).

Looking at Figure 15-1 you can see that the value of 7.55 is nowhere near the high end of this distribution, so you likely don’t have enough evidence to reject the model provided by Mars for M&M’S colors.

5. Find the p-value of your goodness-of-fit statistic.

You can use Table A-3 in the Appendix to find the p-value (the probability of being beyond your test statistic; see Chapter 3) of your test statistic using the Chi-square distribution. (For more info on the Chi-square distribution, see Chapter 14.)

Because the Chi-square table (Table A-3 in the Appendix) can only list a certain number of results for each of the degrees of freedom, the exact p-value for your test statistic may fall between two p-values listed on the table.

To find the p-value for the test statistic in the M&M’S example (7.55), you go to Table A-3 (Appendix) and find the row for 5 degrees of freedom and look at the numbers (the degrees of freedom is K - 1 = 6 – 1 = 5, where K is the number of categories). You see that the number 7.55 is less than the first value in the row (9.24), which has a p-value of 0.10. (Find the p-value by looking at the column heading above the number.) So the p-value for 7.55, which is the area to the right of 7.55 on Figure 15-1, must be greater than 0.10, because 7.55 is to the left of 9.24 on that Chi-square distribution.

Many computer programs exist (online or via a graphing calculator) that will find exact p-values for a Chi-square test, saving time and headaches when you have access to them (the technology, not the headaches). Using one such online "p-value calculator" I found that the exact p-value for the goodness-of-fit test for the M&M’S example (test statistic 7.55, 5 degrees of freedom for Chi-square) is 0.1828, or about 0.18. To find online p-value calculators, simply type in the name of the distribution and the word p-value in an Internet search engine. For this example, type in Chi-square p-value.

6. If your p-value is less than your predetermined cutoff (α), reject Ho. The model doesn’t hold. If your p-value is greater than α, you can’t reject the model.

A typical value of α is 0.05. Some data analysts might use a higher value (up to 0.10) and others might go lower (for example, 0.01). See Chapter 3 for more information on choosing α and comparing your p-value to it.

Going again to the M&M’S example, the p-value, 0.18, is greater than 0.05, so you fail to reject Ho. You can’t say the model is wrong. So, Mars does appear to deliver on the percentages of M&M’S of each color, as advertised. At least you can’t say they don’t. (I’m sure Mars already knew that.)

While some hypothesis tests are two-sided tests, the goodness-of-fit test is always a right-tailed test, meaning that only large values of the test statistic count as evidence against Ho (see Chapter 3 for the skinny on hypothesis testing). You’re only looking at the right tail of the Chi-square distribution when you’re doing a goodness-of-fit test. That’s because a small value of the goodness-of-fit statistic means that the observed data and the expected model don’t differ much, so you stick with the model. If the value of the goodness-of-fit statistic is way out on the right tail of the Chi-square distribution, however, that’s a different story. That situation means the difference between what you observed and what you expected is larger than what you should get by chance, and, therefore, you have enough evidence to say the expected model is wrong.
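If you’d rather not hunt for an online calculator, the whole test can be sketched in Python using only the standard library. Treat this as an illustration under stated assumptions, not the method any particular program uses: the function names are invented here, and the right-tail area is computed with a textbook power series for the incomplete gamma function (a statistics package would normally do this part for you).

```python
import math

def chi_square_right_tail(x, df):
    """Area to the right of x under a Chi-square curve with df degrees of
    freedom, via a power series for the lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:          # add terms until they stop mattering
        term *= z / (a + n)
        total += term
        n += 1
    lower_area = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return 1.0 - lower_area

def goodness_of_fit_test(observed, expected_props, alpha=0.05):
    """Steps 3 through 6: statistic, degrees of freedom, p-value, decision."""
    n = sum(observed)
    expected = [p * n for p in expected_props]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1               # K - 1 categories
    p_value = chi_square_right_tail(statistic, df)
    return statistic, df, p_value, p_value < alpha

# M&M'S example: brown, yellow, red, blue, orange, green.
observed = [4, 10, 4, 10, 15, 13]
expected_props = [0.13, 0.14, 0.13, 0.24, 0.20, 0.16]

statistic, df, p_value, reject = goodness_of_fit_test(observed, expected_props)
print(f"statistic = {statistic:.2f}, df = {df}, p-value = {p_value:.4f}")
print("Reject Ho" if reject else "Fail to reject Ho")

# The rounded statistic from the text, for comparison with the online calculator.
print(f"{chi_square_right_tail(7.55, 5):.4f}")   # 0.1828
```

Because the code doesn’t round the cell contributions, its statistic is about 7.54 rather than 7.55, and its p-value differs from 0.1828 only in the fourth decimal place. Either way the p-value is well above 0.05, so the conclusion is the same.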

In This Chapter

^ Testing for independence in the population (not just the sample)

^ Using the Chi-square distribution

^ Discovering the connection between the Z-test and the Chi-square test

You’ve seen these hasty judgments before: people who collect one sample of data and try to use it to make conclusions about the whole population. When it comes to two qualitative variables (where data fall into categories and don’t represent measurements), the problem seems to be even more widespread.

For example, a TV news show finds that out of 1,000 presidential voters, 200 females are voting Republican, 300 females are voting Democrat, 300 males are voting Republican, and 200 males are voting Democrat. The news anchor shows the data and then states that 30 percent (300/1,000) of all voters are females voting Democrat (and so on for the other counts). This conclusion is misleading. It is true that in this sample of 1,000 voters, 30 percent of them are females voting Democrat. However, this result doesn’t automatically mean that 30 percent of the entire population of voters are females voting Democrat. Results change from sample to sample.

People often understand that they can expect sample results to change, yet they don’t seem to realize that some conclusions come out differently due to even small changes in the sample results. For example, if you ask ten people about their views on an issue, you may get six people in favor (the majority) and four against. But the next time you take a sample of ten people, the results may reverse, and you’ll have four people in favor and six people against (the majority). This inconsistency is especially prone to happening if the sample size is small.

In this chapter, you see how to move beyond just summarizing the sample results from a two-way table (discussed in Chapter 13) to using those results in a hypothesis test to make conclusions about an entire population. This process requires a new probability distribution called the Chi-square distribution, which you get very familiar with in this chapter. You also find out how to answer a very popular question among researchers: Are these two categorical (qualitative) variables independent (not related to each other) in the entire population?

A Hypothesis Test for Independence

A recent survey conducted by American Demographics asked men and women about the color of their next house. The results showed that 36 percent of the men wanted to paint their houses white, and 25 percent of the women wanted to paint their houses white. Table 14-1 illustrates the results from a sample of 1,000 people (500 men and 500 women).

Table 14-1  Gender and House-Paint Preference: Observed Cell Counts

                          White Paint   Nonwhite Paint   Marginal Row Totals
Men                       180           320              500
Women                     125           375              500
Marginal Column Totals    305           695              1,000 (Grand Total)

The marginal row totals represent the total number in each row; the marginal column totals represent the total number in each column (see Chapter 13 for more information on row and column marginal totals). Notice that of the males, the percentage who want to paint their houses white is 180/500 = 0.36, or 36 percent, as stated previously. And the percentage of females who want to paint their houses white is 125/500 = 0.25, or 25 percent. (Both of these percentages represent conditional probabilities as explained in Chapter 13.)

The American Demographics report concluded from this data that ". . . men and women agree on exterior house paint colors; the main exception being the top male choice, white (36 percent would paint their next house white versus 25 percent of women)." This type of conclusion is commonly formed, but it’s an overgeneralization of the results at this point. You know that in this sample, more men wanted to paint their houses white than women, but is 180 really that different from 125, with a sample size of 1,000 people whose results will vary the next time you do the survey? How do you know these results carry over to the population of all men and women? That question can’t be answered without a formal statistical procedure called a hypothesis test (see Chapter 3 for the basics on hypothesis tests).

To show that men and women in the population differ according to favorite house color, first note that you have two qualitative variables — gender (male or female) and paint color (white or nonwhite). What you really want to know is whether these two variables are related to each other or not. If they are related, then favorite paint color depends on gender, which means these two variables are dependent. If they aren’t related, then favorite paint color doesn’t depend on gender, and the two variables are independent.

To test whether two qualitative variables are independent, you need a Chi-square test. The steps for the Chi-square test are the following, with full details supplied in the next sections (note that Minitab can conduct this test for you also, from step three on down):

1. Collect your data and summarize it in a two-way table.

These numbers represent the observed cell counts. (For more on two-way tables, see Chapter 13.)

2. Set up your null hypothesis, Ho: Variables are independent; and the alternative hypothesis, Ha: Variables are dependent.

3. Calculate the expected cell counts under the assumption of independence.

The expected cell count for a cell is the row total times the column total divided by the grand total.

4. Check the conditions of the Chi-square test before proceeding; each expected cell count must be greater than or equal to five.

5. Figure the Chi-square test statistic.

This statistic finds the observed cell count minus the expected cell count, squares the difference, and divides it by the expected cell count. Do these steps for each cell and then add them all up.

6. Look up your test statistic on the Chi-square table (Table A-3 in the Appendix) and find the p-value (or one that’s close).

7. If your result is less than your prespecified cutoff (the α level), usually 0.05, reject Ho and conclude dependence of the two variables.

If your result is greater than the α level, fail to reject Ho; the variables can’t be deemed dependent.


To conduct a Chi-square test in Minitab, enter your data in the spreadsheet exactly as it appears in your two-way table (see Chapter 13 for setting up a two-way table for qualitative data). Go to Stat>Tables>Chi-Square Test. Click on the two variable names in the left-hand box corresponding to your column variables in the spreadsheet. They appear in the Columns Contained in the Table box. Then click on OK.
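If you don’t have Minitab handy, steps three through five of the list above can be sketched in plain Python using the observed counts from Table 14-1. The function names are made up for this illustration, and the sketch deliberately stops before the table lookup in step six.

```python
def expected_counts(table):
    """Step 3: expected count = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Steps 4 and 5: check the condition, then add up (O - E)^2 / E."""
    expected = expected_counts(table)
    if any(e < 5 for row in expected for e in row):
        raise ValueError("an expected cell count is below 5; "
                         "the Chi-square condition is not met")
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

# Table 14-1: columns are white paint and nonwhite paint.
house_paint = [[180, 320],    # men
               [125, 375]]    # women

print(expected_counts(house_paint))   # [[152.5, 347.5], [152.5, 347.5]]
print(round(chi_square_statistic(house_paint), 2))
```

Under independence, men and women would have the same expected counts in each column because the two row totals are equal; the statistic measures how far the observed counts stray from that picture.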

Collecting and organizing the data

The first step toward any data analysis is collecting your data. In the case of two categorical (qualitative) variables, you collect data on the two variables at the same time for each person. In the house-color example from the previous section, you note each person’s gender, and then ask each person his or her preference for exterior house color. Keeping the data together in pairs (for example: male, white paint; female, nonwhite paint), you then organize it into a two-way table where the rows represent the categories of one qualitative variable (for example, males and females for gender), and the columns represent the categories of the other qualitative variable (for example, white paint and nonwhite paint).

The data for the house-paint example is organized in Table 14-1. You can see by looking at the grand total in the lower-right-hand corner of the table that 1,000 people participated in the survey; you see by the row totals that the 1,000 people consisted of 500 men and 500 women. The connection between the two pieces of information collected is kept by organizing the data into one two-way table versus two individual tables, one for gender and one for house-paint preference. That way, you can look at the relationship between the two variables. (For the full details on organizing and interpreting the results from a two-way table, see Chapter 13.)

Determining the hypotheses

Every hypothesis test (whether it be a Chi-square test or some other test) has two hypotheses:

A null hypothesis, which you have to believe unless someone shows you otherwise. The notation for this hypothesis is Ho.

An alternative hypothesis, which you want to conclude in the event that you can’t support the null hypothesis anymore. The notation for this hypothesis is Ha.

For a full discussion of hypothesis testing, see my other book Statistics For Dummies (Wiley) or your intro stats textbook. For a quick review, see Chapter 3 of this book.

In the case where you’re testing for the independence of two qualitative variables, the null hypothesis is that no relationship exists between them. In other words, they’re independent. The alternative hypothesis is that the two variables are related, or dependent.

For the paint color example from the previous section, you write Ho: gender and paint color are independent versus Ha: gender and paint color are dependent. You have now completed step two of the Chi-square test.

Figuring expected cell counts

When you’ve collected your data and set up your two-way table (for example, see Table 14-1), you already know what the observed values are for each cell in the table. Now you need something to compare them to. You’re now ready for step three of the Chi-square test: finding expected cell counts. The null hypothesis says that the two variables X and Y are independent. That’s the same as saying X and Y have no relationship. Assuming independence, you can determine which numbers should be in each cell of the table by using a formula for what is called the expected cell counts. (Each individual square in a two-way table is called a cell, and the number that falls into each cell is called the cell count; see Chapter 13 for more information.)

Standing alone: Independent data

In general, independence means that you can find no major difference in the way the rows look, as you move down a column. That is, the proportion of the data falling into each column across the row is about the same for each row. So to find the expected cell counts for any two-way table, take the row total times the column total divided by the grand total, and do this process for each cell in the table.

Table 14-2 shows an example of independent data from a two-way table. Suppose that in this case the table represents data collected from men and women regarding whether they agree with a certain policy (yes or no). The proportion of all men who said yes is 10/60 = 0.17, or 17 percent. When you look at the same percentage for the women, you get the same number, 0.17. For both males and females, you get 50/60 = 0.83, or 83 percent, for the No group. Because males and females voted exactly the same way, these variables are likely going to be independent in the population as well as the sample.

Table 14-2

Gender and Opinion: Observed Cell Counts = Expected Cell Counts (Independent)

                          Yes    No     Marginal Row Totals
Men                       10     50     60
Women                     10     50     60
Marginal Column Totals    20     100    120 (Grand Total)

To get the expected cell count for the upper-left cell in Table 14-2, take 60 (row one total) times 20 (column one total) divided by 120 (grand total) = 10. For the next cell in the first row, you multiply 60 by 100/120 = 50. The same results occur in row two, because the numbers are all the same as in row one. Because Table 14-2 represents two independent variables, you get the same expected cell counts for each row.

Under independence, you can find no difference between what you observed and what you expected.

The expected cell-count formula can actually make sense if you look at it the right way. That is, if the two variables are independent, the proportion of the data falling into each column across the row is about the same for each row. So to find the expected cell count for any cell, you take the row total for the row that cell is in, and you multiply that total by the proportion of the table that falls into the column that cell is in (that is, the column total divided by the grand total).
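If you'd like to see that formula at work, here is a minimal Python sketch. The function name expected_counts and the list-of-rows layout are illustrative choices, not anything that comes from Minitab or from this book's tables.

```python
def expected_counts(observed):
    """Expected cell counts under independence for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Each cell: row total times column total, divided by the grand total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Table 14-2 (rows: Men, Women; columns: Yes, No)
table_14_2 = [[10, 50], [10, 50]]
expected_14_2 = expected_counts(table_14_2)
print(expected_14_2)
```

For Table 14-2 the expected counts come out identical to the observed counts (10 and 50 in each row), which is what perfectly independent data looks like.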

Tying the knot: Dependent data

If two variables are dependent, then the value of one variable affects the value of the other variable. For example, suppose you believe women chew gum more than men. Then gender and gum chewing would be dependent, because if you knew someone’s gender, that would change the probability of them being a gum chewer. Dependent variables affect each other’s probabilities. In the end, the cell counts you actually observe from variables that are dependent won’t match what you expected the cell counts to look like under Ho: The variables are independent. Big differences between observed and expected cell counts suggest that the variables are dependent.

Table 14-3 shows some data that is dependent because the relationship isn’t the same for each row. More men in the sample said no to gum chewing (35/60 = 58 percent) than women in this sample (25/60 = 42 percent). However, this may not hold for all men and women in the population.

Table 14-3

Gum Chewing: Observed Cell Counts

                          Yes    No     Marginal Row Totals
Men                       25     35     60
Women                     35     25     60
Marginal Column Totals    60     60     120 (Grand Total)

Making conclusions about the population based on the sample (observed) data in a two-way table is taking too big of a leap. You need to conduct a Chi-square test in order to broaden your conclusions to the entire population. Ignoring the fact that sample results vary is where the media, and even some researchers, can get into trouble. Stopping with the sample results only and going merrily on your way can lead to conclusions that others can’t confirm when they take new samples.

To check whether a two-way table is dependent, you first find the expected cell counts by taking the row total times the column total divided by the grand total and do this for each cell in the table. For Table 14-3, the expected cell count for the males who chew gum is 60 * 60/120 = 30. The expected cell count for the males who don’t chew gum is 60 * 60/120 = 30. For the females who chew gum, you take 60 * 60/120 = 30, and the same for females who don’t chew gum. If gender and gum chewing are independent, you should expect to observe 30 in each cell (on average).

Next you compare the expected cell counts to the actual observed cell counts by looking at their differences (see Table 14-3 for the observed cell counts and Table 14-4 for the expected cell counts for the gum chewing example). You can see by Table 14-3 that the observed cell counts are 25, 35, 35, and 25. The expected cell count is 30 for each cell, as you can see in Table 14-4. The differences between the observed and expected cell counts are 25 – 30 = -5; 35 – 30 = 5; 35 – 30 = 5; and 25 – 30 = -5. These differences appear to be small with the naked eye, which may indicate gum chewing preference knows no gender. However, until you do a Chi-square test for independence (Chapter 15), you can never really know for sure.

Table 14-4

Gum Chewing: Expected Cell Counts

                          Yes                   No                    Marginal Row Totals
Men                       60 * (60/120) = 30    60 * (60/120) = 30    60
Women                     60 * (60/120) = 30    60 * (60/120) = 30    60
Marginal Column Totals    60                    60                    120 (Grand Total)
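The comparison of observed and expected counts for the gum-chewing example can be sketched in a few lines of Python. The variable names are illustrative; the counts are the ones from Tables 14-3 and 14-4.

```python
observed = [[25, 35], [35, 25]]   # Table 14-3 (rows: Men, Women; columns: Yes, No)
expected = [[30, 30], [30, 30]]   # Table 14-4

# Observed minus expected, cell by cell
differences = [
    [o - e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
print(differences)

# The differences always add up to zero, because the observed and expected
# counts share the same grand total.
total_difference = sum(sum(row) for row in differences)
print(total_difference)
```

Because the raw differences cancel each other out, they can't be added up as they are to measure distance from independence; the Chi-square test statistic squares them first.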

Checking the conditions for the test

The time has come for step four of the Chi-square test: checking conditions. The Chi-square test has one main condition that must be met in order to test for independence on a two-way table: The expected count for each cell must be at least five, that is, greater than or equal to five. Expected cell counts that fall below five aren’t reliable in terms of the variability that can take place. This problem is similar to trying to predict the outcome of only five flips of a coin — almost anything can happen. But if you flip the coin more times, you have a better idea of what you can expect to flip.

If you’re analyzing data and you find that your data set doesn’t meet the expected cell count of at least five for one or more cells, you can combine some of your rows and/or columns. This combination makes your table smaller, but it increases the cell counts for the cells that you do have, and that helps.
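As a rough sketch, the condition check could be automated like this in Python. The function name expected_counts_ok is hypothetical, and the second table holds made-up counts chosen only to show a failing case.

```python
def expected_counts_ok(observed, minimum=5):
    """Return True if every expected cell count is at least `minimum`."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    return all(e >= minimum for row in expected for e in row)


house_paint = [[180, 320], [125, 375]]  # observed counts shown in Figure 14-1
small_table = [[3, 7], [2, 8]]          # hypothetical counts, too few people

print(expected_counts_ok(house_paint))
print(expected_counts_ok(small_table))
```

The house-paint table passes easily; the hypothetical small table fails because its first column has an expected count of only 2.5 in each row.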

Calculating the Chi-square test statistic

Every hypothesis test uses data to make the decision about whether or not to reject Ho in favor of Ha. In every hypothesis test, you take information from the data and put it together into a test statistic. The test statistic, in general, finds the distance between your observed results (your data) and the results you expect if Ho were true. If that difference is large, then you reject Ho in favor of Ha. If that difference is small, you fail to reject Ho. (For more information on test statistics, see another book I wrote, Statistics For Dummies [Wiley], or your intro stats book.)

In the case of testing for independence in a two-way table, you use a hypothesis test based on the Chi-square test statistic. In the following sections, you can see the steps for calculating and interpreting the Chi-square test statistic, which is step five of the Chi-square test.

Working out the formula

A major component of the Chi-square test statistic is the expected cell count for each cell in the table. The formula for finding the expected cell count, $E_{ij}$, for the cell in row i, column j is

$$E_{ij} = \frac{\text{row } i \text{ total} \times \text{column } j \text{ total}}{\text{grand total}}.$$

Note that the values of i and j vary for each cell in the table. In a two-way table, the upper-left cell of the table is in row one, column one. The cell in the upper-right corner is in row one, column two. The cell in the lower-left corner is in row two, column one, and the lower-right-hand cell is in row two, column two.

The formula for the Chi-square test statistic is

$$\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$

where $O_{ij}$ is the observed cell count for the cell in row i, column j, and $E_{ij}$ is the expected cell count for the cell in row i, column j.

When you calculate the expected cell count for some cells, you typically get a number that has some digits after the decimal point (in other words, the number isn’t a whole number). Don’t round this number off, despite the temptation to do so. This expected cell count is actually an overall-average expected value, and you can keep the count as it is, with decimal included.

Here are the major steps in how the Chi-square test statistic is calculated (Minitab does these steps for you as well):

1. Subtract the expected cell count from the observed cell count for the upper-left-hand cell in the table.

2. Square the result from step one to make the number positive.

3. Divide the result from step two by the expected cell count.

4. Repeat this process for all the cells in the table and add up all the results.

The final sum that you get is your Chi-square test statistic.

The reason you divide by the expected cell count in the Chi-square test statistic is to account for cell-count sizes. If you expect a big cell count, say 100, and are off by only 5 for the observed count of that cell, that difference shouldn’t count as much as if you expected a small cell count (like 10) and the observed cell count was off by 5. Dividing by the expected cell count puts a more fair weight on the differences that go into the Chi-square test statistic.
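The steps for building the test statistic can be sketched in Python as follows. This is an illustration of the arithmetic only, not a reproduction of what Minitab does internally, and the function name chi_square_statistic is an assumption made for the example.

```python
def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over every cell of a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, observed_count in enumerate(row):
            expected_count = row_totals[i] * col_totals[j] / grand_total
            difference = observed_count - expected_count   # step 1
            squared = difference ** 2                      # step 2
            statistic += squared / expected_count          # steps 3 and 4
    return statistic


house_paint = [[180, 320], [125, 375]]  # observed counts shown in Figure 14-1
gum_chewing = [[25, 35], [35, 25]]      # Table 14-3

print(round(chi_square_statistic(house_paint), 3))
print(round(chi_square_statistic(gum_chewing), 2))
```

The expected counts are kept as decimals (152.5 and 347.5 for the house-paint data) rather than being rounded, which is why the function works with floating-point numbers throughout.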

To perform a Chi-square test in Minitab, enter the raw data (the data on each person) in two columns. The first column is the values of your first variable in your data set. (For example, if your first variable is gender, go down the column entering the gender of each person.) Then enter your second variable in the second column, using the same row to represent each person in the data set. (If your second variable is paint preference, for example, enter each person’s house-paint preference in column two, keeping the data from each person together in each row.) Go to Stat>Tables>Cross-tabulation and Chi-square. (But don’t stop here: Keep reading.)

On the left-hand side, click on the variable that you wish to be in the rows of your two-way table (you may click on the first variable if you wish). Click Select, and the variable name appears in the row variable portion of the table on the right. Then go to the column variable blank on the right-hand side and click on it. You will be asked to choose your column variable. Go to the left-hand side and click on the name of your second variable. Click Select. Then click on the Chi-square button and choose Chi-square analysis by checking the box. If you want the expected cell counts included, check that box also. Then click OK, and OK.

The Chi-square test statistic can never be negative, because it’s built on sums of squares of differences in the numerator and expected cell counts in the denominator (which are always positive).

The Minitab output for the Chi-square analysis for the house-paint example (from Table 14-1) is shown in Figure 14-1. You can pick out quite a few numbers from the output in Figure 14-1 that are especially important. First, you see three numbers listed in each cell. The first (top) number is the observed cell count for that cell; this matches the observed cell count for each cell shown in Table 14-1. (Notice the marginal row and column totals of Figure 14-1 also match those from Table 14-1.)

The second number in each cell of Figure 14-1 is the expected cell count for that cell; you find it by taking the row total times the column total divided by the grand total (see the section "Figuring expected cell counts"). For example, the expected cell count for the upper-left cell (males who prefer white house paint) is 500 * 305/1000 = 152.50.

The third number in each cell of Figure 14-1 is that part of the Chi-square test statistic that comes from that cell. (See steps one through three of the previous section, "Working out the formula.") The sum of the third numbers in each cell equals the value of the Chi-square statistic listed in the last line of the output. (For the house-paint example, the Chi-square test statistic is 14.27.)

Interpreting the Chi-square test statistic is step six of the Chi-square test; you work through that process in the next section.

Figure 14-1: Minitab output for the house-paint data.

Chi-Square Test: Gender, House-Paint Preference

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

         White Paint   Nonwhite Paint   Total
M                180              320     500
              152.50           347.50
               4.959            2.176

F                125              375     500
              152.50           347.50
               4.959            2.176

Total            305              695    1000

Chi-Sq = 14.271, DF = 1, P-Value = 0.000

Finding your results on the Chi-square table

The only way to be able to make an assessment about your Chi-square test statistic is to compare it to all the possible Chi-square test statistics you would get if you had a two-way table with the same row and column totals, yet you distributed the numbers in the cells in every way possible. (You can do that in your sleep, right?) Some resulting tables give large Chi-square test statistics, and some give small Chi-square test statistics.

Putting all these Chi-square test statistics together gives you what’s called a Chi-square distribution. You find your particular test statistic on that distribution (step six of the Chi-square test), and see where it stands compared to the rest. If your test statistic is large enough that it appears way out on the right tail of the Chi-square distribution (boldly going where no test statistic has gone before), you reject Ho. If the test statistic isn’t that far out, then you can’t reject Ho.

In the next sections, you find out more about the Chi-square distribution and how it behaves, so you can make a decision about the independence of your two variables based on your Chi-square statistic.


Determining degrees of freedom

Each type of two-way table has its own Chi-square distribution, depending on the number of rows and columns it has, and each Chi-square distribution is identified by its degrees of freedom. In general, a two-way table with r rows and c columns uses a Chi-square distribution with (r – 1) * (c – 1) degrees of freedom. A two-way table with two rows and two columns uses a Chi-square distribution with one degree of freedom. Notice that 1 = (2 – 1) * (2 – 1). A two-way table with three rows and two columns uses a Chi-square distribution with (3 – 1) * (2 – 1) = 2 degrees of freedom.

Understanding why degrees of freedom are calculated this way is likely to be beyond the scope of your statistics class. But if you really want to know, the degrees of freedom represent the number of cells in the table that are flexible, or "free," given all the marginal row and column totals. For example, suppose that a two-by-two table has all row and column totals equal to 100 and the upper-left cell is 70. Then the upper-right cell must be 100 (row total) – 70 = 30. Because the column one total is 100, and the upper-left cell count is 70, the lower-left cell count must be 100 – 70 = 30. Similarly, the lower-right cell count must be 70.

So you have only one free cell in a two-by-two table after you have the marginal totals set up. That’s why the degrees of freedom for a two-by-two table is 1. In general, you always lose one row and one column because of knowing the marginal totals, because these last row and column values can be calculated through subtraction. That’s where the formula (r – 1) * (c – 1) comes from. (That’s more than you wanted to know, isn’t it?)
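To make the "free cell" idea concrete, here is a small Python sketch. Both function names are made up for this illustration; the second one shows how a single cell plus the marginal totals pins down the rest of a two-by-two table by subtraction.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a two-way table: (r - 1) * (c - 1)."""
    return (n_rows - 1) * (n_cols - 1)


def fill_two_by_two(upper_left, row_totals, col_totals):
    """Fill in a two-by-two table from its one free cell and the marginal totals."""
    upper_right = row_totals[0] - upper_left
    lower_left = col_totals[0] - upper_left
    lower_right = row_totals[1] - lower_left
    return [[upper_left, upper_right], [lower_left, lower_right]]


print(degrees_of_freedom(2, 2))
print(degrees_of_freedom(3, 2))
# All row and column totals equal to 100, upper-left cell equal to 70
print(fill_two_by_two(70, row_totals=(100, 100), col_totals=(100, 100)))
```

With every total equal to 100 and an upper-left cell of 70, the only table that fits is 70 and 30 in the first row and 30 and 70 in the second.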

Discovering how Chi-square distributions behave

Figure 14-2 shows pictures of Chi-square distributions with one, two, four, six, eight, and ten degrees of freedom, respectively. Here are some important points about Chi-square distributions:

For one degree of freedom, the distribution looks like a hyperbola (see Figure 14-2, top left); for two degrees of freedom, it starts at its highest point at zero and falls steadily; for more than two degrees of freedom, it looks like a mound that has a long right tail (see Figure 14-2, lower right).

All the values are greater than or equal to zero.

The shape is always skewed to the right (tail going off to the right).

As the number of degrees of freedom increases, the mean (the overall average) increases (moves to the right) and the variance increases (resulting in more spread).

No matter what the degrees of freedom are, the height of the Chi-square distribution (known as the density) approaches zero for increasingly larger Chi-square values. That means that larger and larger Chi-square values are less and less likely to happen.
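If you want to check these points numerically, the following Python sketch uses the standard formula for the Chi-square density, which this chapter doesn't state but which is a well-known result. The function names and the crude midpoint-rule averaging are illustrative choices, not a recommended numerical method.

```python
import math


def chi_square_density(x, df):
    """Height of the Chi-square density curve at x > 0 for `df` degrees of freedom."""
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))


def approximate_mean(df, upper=200.0, n_steps=20000):
    """Midpoint-rule approximation of the mean (the density-weighted average of x)."""
    step = upper / n_steps
    midpoints = ((k + 0.5) * step for k in range(n_steps))
    return sum(x * chi_square_density(x, df) * step for x in midpoints)


# Heights shrink toward zero as the Chi-square value grows
for df in (1, 2, 4, 10):
    print(df, [chi_square_density(x, df) for x in (10, 20, 30)])

# The mean moves to the right as the degrees of freedom increase
print(approximate_mean(4), approximate_mean(8))
```

The approximate means land very close to 4 and 8, reflecting the fact that the mean of a Chi-square distribution equals its degrees of freedom.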

Figure 14-2: Chi-square distributions with 1, 2, 4, 6, 8, and 10 degrees of freedom (moving from upper left to lower right).

Using the Chi-square table

After you find your Chi-square test statistic and its degrees of freedom, you want to determine how large your statistic is, relative to its corresponding distribution. (You’re now venturing into step seven of the Chi-square test.) If you think about it graphically, you want to find the probability of being beyond (getting a larger number than) your test statistic. If that probability is small, your Chi-square test statistic is something unusual — it’s out there — and you can reject Ho. You then conclude that your two variables are not independent (they are related somehow).

In case you’re following along at home, the Chi-square test statistic for the independent data from Table 14-2 is zero, because the observed cell counts are equal to the expected cell counts for each cell, and their differences are always equal to zero. (This result never happens in real life!) This scenario represents a perfectly independent situation and results in the smallest possible value of a Chi-square test statistic.

If the probability of being to the right of your Chi-square test statistic (on a graph) isn’t small enough, you don’t have enough evidence to reject Ho. You then stick with Ho; you can’t reject it. You conclude that you have no evidence of a relationship, so you continue to treat your two variables as independent (unrelated).

How small of a probability do you need to reject Ho? For most hypothesis tests, statisticians generally use 0.05 as the cutoff. (For more information on cutoff values, also known as α levels, flip to Chapter 3, or check out my other book Statistics For Dummies [Wiley].)

Your job now is to find the probability of being beyond your Chi-square test statistic on the corresponding Chi-square distribution with (r – 1) * (c – 1) degrees of freedom. Each Chi-square distribution is different, and because the number of possible degrees of freedom is infinite, showing every single value of every Chi-square distribution isn’t possible. In Table A-3 (in the Appendix in the back of this book), you see some of the most important values on each Chi-square distribution with degrees of freedom from 1 to 50.

To use the Chi-square table (Table A-3 in the Appendix), you find the row that represents your degrees of freedom (abbreviated df). Move across that row until you reach the value that is closest to your Chi-square test statistic, without going over. (It’s like a game show, when you’re trying to win the showcase by guessing the price.) Then go to the top of the column you’re in. That number represents the area to the right of (above) the Chi-square value you found in the table. The area above your particular Chi-square test statistic is less than or equal to this number. This result is the approximate p-value of your Chi-square test.

Using the house-paint example (see Figure 14-1), the Chi-square test statistic was 14.27. You have (2 – 1) * (2 – 1) = 1 degree of freedom. On Table A-3 (in the Appendix), you go to the row for df = 1, and go across to the number closest to 14.27 (without going over). That number is 7.88, in the last column. (This number is much less than 14.27, but it’s the biggest number on the table for that row.) The number at the top of that column is 0.005.

Drawing your conclusions

You have two alternative ways to draw conclusions from the Chi-square test statistic. You can look up your test statistic on the Chi-square table (located in Table A-3 in the Appendix) and see the probability of being greater than that. This method is known as approximating the p-value. (The p-value of a test statistic is the probability of being at or beyond your test statistic on the distribution to which the test statistic is being compared; in this case, the Chi-square distribution.) Or you can have the computer calculate the exact p-value for your test. (For more on p-values and α levels, see my other book Statistics For Dummies. For a quick review on these topics, see Chapter 3 of this book.)

Before you do anything though, set your α, the cutoff probability for your p-value, in advance. If your p-value is less than your α level, reject Ho. If it is more, you can’t reject Ho.

Approximating the p-value from the table

For the house-paint example (see Figure 14-1), the Chi-square test statistic was 14.27 with 1 df (degree of freedom). The closest number in row one of Table A-3 (in the Appendix), without going over, is 7.88 (in the last column). The number at the top of that column is 0.005. This number is less than your typical α level of 0.05, so you reject Ho. You know that your p-value is less than 0.005 because your test statistic was more than 7.88. In other words, if 7.88 is the minimum evidence you need to reject Ho, you have more evidence than that with a value of 14.27. More evidence against Ho means a smaller p-value. However, because Table A-3 only gives a few values for each Chi-square distribution, the best you can say using this table is that your p-value for this test is less than 0.005.
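The table lookup can be mimicked with a short Python sketch. The row of values below is an assumption: it holds standard Chi-square cutoffs for 1 degree of freedom at common right-tail areas, and only the last pair (7.88 at 0.005) is quoted in this chapter, so the columns may not match Table A-3 exactly. The function name is also hypothetical.

```python
# (area to the right, Chi-square value) pairs for df = 1.
# Assumed layout; only the 0.005 / 7.88 entry is confirmed by the text.
ROW_DF_1 = [(0.10, 2.71), (0.05, 3.84), (0.025, 5.02), (0.01, 6.63), (0.005, 7.88)]


def p_value_upper_bound(statistic, row=ROW_DF_1):
    """Move across the row to the largest table value that doesn't go over the
    test statistic, and return the area at the top of that column.
    Returns None if the statistic is smaller than every value in the row."""
    bound = None
    for area, table_value in row:
        if table_value <= statistic:
            bound = area
    return bound


print(p_value_upper_bound(14.27))  # house-paint example
print(p_value_upper_bound(3.33))   # gum-chewing example
print(p_value_upper_bound(1.00))   # smaller than anything in the row
```

For the house-paint statistic the lookup stops at the last column, so all the table can tell you is that the p-value is at most 0.005. A result of None means the p-value is larger than every area listed in the row.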

Here’s the big news: Because your p-value is less than 0.05, you can conclude based on this data that gender and house-paint color are likely to be related in the population (dependent), like the Demographics Survey said (located at the beginning of this chapter). Only now, you have a formal statistical analysis that says this result found in the sample is also likely to occur in the entire population. This statement is much stronger!

If your data shows you can reject Ho, you only know at that point that the two variables have some relationship. The Chi-square test statistic doesn’t tell you what that relationship is. In order to explore the relationship between the two variables, you find the conditional probabilities in your two-way table (see Chapter 13). You can use those results to give you some ideas as to what may be happening in the population. For example, in the house-paint data (because paint preference is related to gender), you can examine the relationship further by first finding the percentage of men that prefer white houses, which comes out to 180/500 = 0.36, or 36 percent, calculated from Table 14-1. Now compare this result to the percentage of women who prefer white houses: 125/500 = 0.25, or 25 percent. You can now conclude that in this population (not just the sample), men prefer white houses more than women do.
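As a quick illustration, the conditional proportions for the house-paint data can be computed row by row in Python. The dictionary layout and names are illustrative; the counts are the observed values shown in Figure 14-1.

```python
# Observed counts: gender -> (white paint, nonwhite paint)
house_paint = {"Men": (180, 320), "Women": (125, 375)}

# Proportion of each gender preferring white paint (conditional on the row)
prefers_white = {
    gender: white / (white + nonwhite)
    for gender, (white, nonwhite) in house_paint.items()
}

for gender, proportion in prefers_white.items():
    print(f"{gender}: {proportion:.0%} prefer white")
```

Dividing by each row's own total, rather than by the grand total, is what makes these conditional proportions: they compare men with men and women with women.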

Extracting the p-value from computer output

After Minitab calculates the test statistic for you, it reports the exact p-value for your hypothesis test. The p-value measures the likelihood that your results were found just by chance while Ho is still true. It tells you how much strength you have against Ho. If the p-value is 0.001, for example, you have much more strength against Ho than if the p-value is, say, 0.10.

Looking at the Minitab output for the house-paint data in Figure 14-1, the p-value is reported to be 0.000. This means that the p-value is smaller than 0.001; for example, it may be 0.0004. That’s a very small p-value! (Minitab only reports results to three decimal places, which is typical of many statistical software packages.)

The Chi-square test for the gum-chewing data from Table 14-3 results in a p-value of 0.068. This result is what statisticians call a marginal result, because it’s just on the other side of 0.05. (The test statistic turned out to be only 3.33, and that didn’t seem to be very large.) This p-value is larger than the typical α of 0.05, but not a lot larger. Technically speaking, you can’t reject Ho at level α = 0.05. In practical terms, even though gum chewing and gender seem to be dependent in the sample, you can’t say that you can expect to find this relationship in the population.
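If you don't have Minitab handy, the exact p-value for a two-by-two table (1 degree of freedom) can be sketched with Python's standard library. The shortcut relies on a known fact that this chapter doesn't derive: a Chi-square value with 1 degree of freedom behaves like the square of a standard normal value, so the right-tail area can be written with the complementary error function. The function name is hypothetical, and the shortcut doesn't apply to other degrees of freedom.

```python
import math


def chi_square_p_value_1df(statistic):
    """Area to the right of `statistic` on the Chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(statistic / 2))


house_paint_p = chi_square_p_value_1df(14.271)  # statistic from Figure 14-1
gum_chewing_p = chi_square_p_value_1df(10 / 3)  # gum-chewing statistic, 3.33

print(house_paint_p < 0.001)
print(round(gum_chewing_p, 3))
```

The house-paint p-value comes out far below 0.001, which is why the output displays it as 0.000, and the gum-chewing p-value rounds to 0.068.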

I’ve seen situations where people who get a result that isn’t quite what they want (like a p-value of 0.068) do some tweaking to get what they want. What they do is change their α level from 0.05 to 0.10 after the fact. This change makes the p-value less than the α level, and they feel they can reject Ho and say that a relationship exists. But what’s wrong with this? They changed the α after they looked at the data, which isn’t allowed. That’s like changing your bet in blackjack after you find out what the dealer’s cards look like. (Tempting, but a serious no-no.) Always be wary of large α levels, and make sure that you always choose your α before collecting any data, and stick to it. The good news is that when p-values are reported, anyone reading them can make his own conclusion; no cut-and-dried rejection and acceptance region is set in stone. But setting an α level once, then changing it after the fact to get a better conclusion is never good!

Comparing Two Tests for Comparing Two Proportions

You can use the Chi-square test to check whether two population proportions are equal (for example, is the proportion of female cell-phone users the same as the proportion of male cell-phone users?). Now you may be thinking, "But wait a minute, don’t statisticians already have a test for two proportions? I seem to remember it from my intro stats course. . . I’m thinking. . . yeah, it’s the Z-test for two proportions. What’s that test got to do with a Chi-square test?" In this section, you answer that question, and use both methods to investigate a possible gender gap in cell-phone use.

Getting reacquainted with the Z-test for two population proportions

The way that most people figure out how to test the equality of two population proportions is to use a Z-test for two population proportions (where you collect a random sample from each of the two populations, find and subtract their two sample proportions, and divide by their pooled standard error; see your intro stats book for details on this particular test). This test is possible to do as long as the sample sizes from the two populations are large — at least five successes and five failures in each sample.

The null hypothesis for the Z-test for two population proportions is Ho: p1 = p2, where p1 is the proportion of the first population that falls into the category of interest and p2 is the proportion of the second population that falls into the category of interest. And as always, the alternative hypothesis is one of the following choices Ha: not equal to, greater than, or less than.

Suppose you want to compare the proportion of cell-phone users for men versus women. You let p1 be the proportion of all males who own a cell phone, and p2 be the proportion of all females who own a cell phone. You collect data, find the sample proportions from each group, $\hat{p}_1$ and $\hat{p}_2$, take their difference, and make a Z-statistic out of it using the formula

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \quad \text{where } \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}.$$

Here, $x_1$ and $x_2$ are the number of individuals from samples one and two, respectively, with the desired characteristic; $n_1$ and $n_2$ are the two sample sizes.

Suppose that you collect data on 100 men and 100 women and find 45 male cell-phone owners and 55 female cell-phone owners. This means that the sample proportion for males equals 45/100 = 0.45, and the sample proportion for females equals 55/100 = 0.55. Your samples have at least five successes (having the desired characteristic; in this case, cell-phone ownership) and five failures (not having the desired characteristic). So you go ahead and compute the Z-statistic for comparing the two population proportions (males versus females). Based on this data, the Z-statistic is -1.41, as shown on the last line of the Minitab output in Figure 14-3.

Figure 14-3: Minitab output comparing proportion of male and female cell-phone owners.

Test Cell Phone for Two Proportions

Sample    X     N    Sample p
M        45   100    0.450000
F        55   100    0.550000

Difference = p (1) – p (2)
Estimate for difference: -0.1
95% CI for difference: (-0.237896, 0.0378957)
Test for difference = 0 (vs not = 0): Z = -1.41 P-Value = 0.157

The p-value for the test statistic of Z = -1.41 is 0.157 (calculated by Minitab, or by looking up the area below the Z-value of -1.41 on a Z-table and doubling it, because the test is two-tailed; see your intro stats text for one of those tables). This p-value (0.157) is greater than the typical α level (prespecified cutoff) of 0.05, so you can’t reject Ho. You can’t say that the two population proportions aren’t equal. That is, you must conclude that the proportion of cell-phone owners for males is no different than for females. Even though the sample seemed to have evidence for a difference (after all, 45 percent isn’t equal to 55 percent), you don’t have enough evidence in the data to say that this same difference carries over to the population. So you can’t lay claim to a gender gap in cell-phone use, at least with this sample.
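Here is a minimal Python sketch of the Z-test for two proportions with a pooled standard error and a two-tailed p-value. The function name and return format are illustrative; this follows the textbook calculation, not Minitab's actual code.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled Z-statistic and two-tailed p-value for Ho: p1 = p2."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / standard_error
    # Two-tailed p-value: area beyond |z| in both tails of the standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 45 of 100 men and 55 of 100 women own a cell phone
z, p_value = two_proportion_z_test(45, 100, 55, 100)
print(round(z, 2), round(p_value, 3))
```

Rounded, the results agree with the last line of the Minitab output in Figure 14-3: Z = -1.41 and a p-value of 0.157.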

Equating Chi-square tests and Z-tests for a two-by-two table

Here’s the key to relating the Z-test to a Chi-square test for independence. If you use the Z-test to see whether the proportion of male cell-phone owners is equal to the proportion of female cell-phone owners, you’re really looking at whether you can expect the same proportion of cell-phone owners despite gender (after you take the sample sizes into account). And that means you are testing whether gender (male or female) is independent of cell-phone ownership (yes or no).

If the proportion of female cell-phone owners equals the proportion of male cell-phone owners, then the proportion of cell-phone owners is the same regardless of gender, so gender and cell-phone ownership are independent. On the other hand, if you find the proportion of male cell-phone owners to be unequal to the proportion of female cell phone owners, then you can say that cell-phone use differs by gender — so gender and cell-phone ownership are dependent.

Therefore, the Z-test for two proportions and the Chi-square test for independence in a two-by-two table (one with two rows and two columns) are equivalent if the sample sizes from the two populations are large enough; that is, when the number of successes and the number of failures in each of the two samples is at least five.

With the cell-phone data from the previous section, you have 45 males using cell phones (out of 100 males) and 55 females using cell phones (out of 100 females). The Minitab output for the Chi-square test for independence (complete with observed and expected cell counts, degrees of freedom, test statistic, and p-value) is shown in Figure 14-4. The p-value for this test is 0.157, which is greater than the typical α level (prespecified cutoff) of 0.05, so you can’t reject Ho.

Because the Chi-square test for independence and the Z-test are equivalent when you have a two-by-two table, the p-value from the Chi-square test for independence is identical to the p-value from the Z-test for two proportions. If you compare the p-values from Figures 14-3 and 14-4, you can see that for yourself.

Figure 14-4: Minitab output testing independence of gender and cell-phone ownership.

Chi-Square Test: Gender, Cell Phone

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             Y        N    Total
M           45       55      100
         50.00    50.00
         0.500    0.500

F           55       45      100
         50.00    50.00
         0.500    0.500

Total      100      100      200

Chi-Sq = 2.000, DF = 1, P-Value = 0.157

Also, note that if you take the Z-test statistic for this example (from Figure 14-3), which is -1.41, and square it, you get 1.99, which matches the Chi-square test statistic of 2.00 for the same data (last line of Figure 14-4) once you allow for the rounding of Z (the unrounded value is about -1.414). It is always the case that the square of the Z-test statistic (when testing for the equality of two proportions) is equal to the corresponding Chi-square test statistic for independence.
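You can confirm the link between the two statistics with a short Python sketch that computes both from the same cell-phone counts. The variable names are illustrative, and the calculation assumes the pooled version of the Z-statistic.

```python
import math

# Rows: Men, Women; columns: cell-phone owner (Y), not an owner (N)
observed = [[45, 55], [55, 45]]

# Chi-square test statistic for independence
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
chi_square = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / grand_total) ** 2
    / (row_totals[i] * col_totals[j] / grand_total)
    for i in range(2)
    for j in range(2)
)

# Pooled Z-statistic for comparing the two proportions of owners
p1_hat = observed[0][0] / row_totals[0]
p2_hat = observed[1][0] / row_totals[1]
pooled = col_totals[0] / grand_total
z = (p1_hat - p2_hat) / math.sqrt(
    pooled * (1 - pooled) * (1 / row_totals[0] + 1 / row_totals[1])
)

print(chi_square, z ** 2)
```

Both numbers come out at 2 (up to floating-point error), so the small gap you get from squaring -1.41 by hand is purely a rounding effect.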

Researchers are doing a great deal of study of the effects of cell-phone use while driving. One study published in the New England Journal of Medicine observed and recorded data in 1997 on 699 drivers who had cell phones and were involved in motor vehicle collisions resulting in substantial property damage but no personal injury. Each person’s cell-phone calls on the day of the collision and during the previous week were analyzed through the use of detailed billing records. A total of 26,798 cell-phone calls were made during the 14-month study period.

One conclusion the researchers made was that ". . . the risk of a collision when using a cell phone is four times higher than the risk of a collision when a cell phone was not being used." They basically conducted a test to see whether cell-phone use and having a collision are independent, and when they found out they were not, they were able to examine the relationship further using appropriate ratios. In particular, they found that the risk of a collision is four times higher for those drivers using cell phones than for those who aren’t.

Researchers also found out that the relative risk was similar for drivers who differed in personal characteristics, such as age and driving experience. (This finding means that they conducted similar tests to see whether the results were the same for drivers of different age groups and drivers of different levels of experience, and the results always came out about the same. Therefore, age and the experience of the driver were not related to the collision outcome.)

The research also shows that ". . . calls made close to the time of the collision were found to be particularly hazardous (p < 0.001). Hands-free cell phones offered no safety advantage over hand-held units (p-value not significant) . . ." Note: The items in parentheses show the typical way that researchers report their results: using p-values. The p in each set of parentheses represents the p-value of that test.

In the first case, the p-value is very tiny, less than 0.001, indicating strong evidence for a relationship between collisions and cell-phone use at the time. The second p-value in parentheses was stated to be insignificant, meaning that it was substantially more than 0.05, the usual α level people use. This second result indicates that whether or not the drivers used hands-free equipment didn’t affect the chances of a collision happening. That is, the proportions of collisions using hands-free cell phones versus using regular cell phones were found to be statistically the same (any difference could’ve easily occurred by chance under independence). Whether you use a regular or hands-free cell phone, may this study be a lesson to everyone!

The Chi-square test and Z-test are equivalent only if the table is a two-by-two table (two rows and two columns) and if the Z-test is two tailed (the alternative hypothesis is that the two proportions aren’t equal, instead of using Ha: one proportion is greater than or less than the other). If the Z-test is not two tailed, a Chi-square test isn’t appropriate. If the two-way table has more than two rows or columns, use the Chi-square test for independence (because you no longer have only two proportions if you have many categories, so the Z-test isn’t applicable).

In This Chapter

^ Reading and interpreting two-way tables

^ Figuring probabilities and checking for independence

^ Watching out for Simpson’s Paradox

Looking for relationships between two categorical (qualitative) variables is a very common goal for researchers. For example, many medical studies center on how some characteristic about a person either raises or lowers his chance of getting some disease. Marketers ask questions like, "Who is more likely to buy our product: males or females?" Sports stat freaks wonder about things like "Does winning the coin toss at the beginning of a football game increase your team’s chance of winning the game?"

To answer each of the above questions, you must first collect data (from a random sample) on the two categorical variables being compared (call them X and Y). Then you organize that data into a table that contains columns and rows, showing how many individuals from the sample appear in each combination of X and Y. Finally, you use the information in the table to conduct a hypothesis test (called the Chi-square test). Using the Chi-square test, you can determine whether you can see a relationship between X and Y in the population from which the data was drawn. This last step needs the machinery from Chapter 14 to accomplish it. The goals of this chapter are to understand what it means for two qualitative variables (X and Y) to be associated and to discover how to use percentages to determine whether a sample data set appears to show a relationship between X and Y.

Suppose you’re collecting data on cell-phone users, and you want to find out whether more females use cell phones than males. A study of 508 randomly selected male cell-phone users and 508 randomly selected female cell-phone users conducted by a wireless company found that women tend to use their phones for personal calls more than men (big shocker). The survey showed that 427 of the women said they used their wireless phones primarily to talk with friends and family, while only 325 of the men admitted to doing so.

But you can’t stop there. You need to break down this information, calculate some percentages, and compare them to see how close they really are. Sample results vary from sample to sample, and differences can appear by chance.

In this chapter, you find out how to organize data from qualitative variables (data based on categories rather than measurements) into a table format. This skill is especially useful when you’re trying to look for relationships between two qualitative variables, such as using a cell phone for personal calls (a yes or no category) and gender (male or female). You also summarize the data to answer your questions. And, finally, you get to figure out, once and for all, what’s going on with that Simpson’s Paradox thing.

Breaking Down a Two-Way Table

A two-way table is a table that contains rows and columns, which help you organize data from categorical (qualitative) variables in the following ways:

The rows represent the possible categories for one categorical variable, such as males and females.

The columns represent the possible categories for a second categorical variable, such as using your cell phone for personal calls, or not.

Here I review the basic ideas of organizing and filling in a two-way table.

Organizing data into a two-way table

To organize your data into a two-way table, first set up the rows and columns. Table 13-1 shows the setup for the cell-phone data (refer to the example I give at the beginning of the chapter).

Table 13-1 Two-Way Table Set Up for the Cell-Phone Data

           Personal Calls: Yes    Personal Calls: No
Males
Females

Notice that Table 13-1 has four empty cells inside of it (not counting the empty space in the upper-left corner). Because gender has two choices (male or female), and personal cell-phone use has two choices (yes or no), the resulting two-way table has 2 * 2 = 4 cells.

To figure out the number of cells in any two-way table, multiply the number of possible categories for the row variable by the number of possible categories for the column variable.

Fitting in the cell counts

After you set up the table with the appropriate number of rows and columns, you need to fill in the appropriate numbers in each of the cells of the two-way table. The number in each cell of a two-way table is called the cell count for that cell. The upper-left cell in the two-way table shown in Table 13-1 represents the number of males who use their cell phones for personal calls. With the information you have in the cell-phone problem, the cell count for this cell is 325. Because you know that 427 females use their cell phones for personal calls, this number goes into the lower-left cell.

Now, to figure out the numbers in the remaining two cells, you do a bit of subtraction. You know from the information given that the total number of male cell-phone users in the survey is 508. Each male either uses his cell phone for personal calls (falling into the Yes group), or he doesn’t (falling into the No group). Because 325 males fall into the Yes group, and you have 508 males total, 183 males (508 – 325 = 183) don’t use their cell phones for personal calls. This number is the cell count for the upper-right cell of the two-way table. Finally, because 508 females took the survey, and 427 of them use their cell phones for personal calls, you know that the rest of them (508 – 427 = 81) don’t. Therefore, 81 is the cell count for the lower-right cell of the table. Table 13-2 shows the completed table for the cell-phone user problem, with the four cell counts filled in.

Table 13-2 Completed Two-Way Table for the Cell-Phone Data

           Personal Calls: Yes    Personal Calls: No
Males      325                    183 (508 – 325)
Females    427                    81 (508 – 427)

Just to save you a little time: if you have the total number in a group and know how many of those individuals fall into one of the categories of the two-way table, you can determine the number falling into the remaining category by subtracting the number in the given category from the total number in the group. You can complete this process for each remaining group in the table.

Making marginal totals

One of the most important aspects of a two-way table is to have easy access to all the pertinent totals. Because every two-way table is made up of rows and columns, you can imagine that the totals for each row and the totals for each column are important. Also, the grand total is important to know.

If you take a single row and add up all the cell counts in the cells of that row, you get what is called a marginal row total for that row. Where does this marginal row total go on the table? You guessed it: out in the margin at the end of that row. You can find the marginal row totals for every row in the table and put them into the margins at the end of the rows. This group of marginal row totals for each row represents what statisticians call the marginal distribution for the row variable. The marginal row totals should add up to the grand total, which is the total number of individuals in the study. (The individuals may be people, cities, dogs, companies, and so on, depending on the scenario of the problem at hand.)

Similarly, if you take a single column and add up all the cell counts in the cells of that column, you get the marginal column total for that column. This number goes in the margin at the bottom of the column. Follow this pattern for each column in the table, and you have the marginal distribution for the column variable. Again, the sum of all the marginal column totals equals the grand total. The grand total is always located in the lower-right corner of the two-way table.

The marginal row totals, marginal column totals, and the grand total for the cell-phone example are shown in Table 13-3.

Table 13-3 Marginal and Grand Totals for the Cell-Phone Data

                          Personal Calls: Yes    Personal Calls: No    Marginal Row Totals
Males                     325                    183 (508 – 325)       508
Females                   427                    81 (508 – 427)        508
Marginal Column Totals    752                    264                   1,016 (Grand Total)

The marginal row totals add the cell counts in each row; yet the marginal row totals show up as a column in the two-way table. This phenomenon occurs because when summing the cell counts in a row, you put the result in the margin at the end of the row, and when you do this for each row, you’re stacking the row totals into a column. Similarly, the marginal column totals add the cell counts in each column; yet they show up as a row in the two-way table. Don’t let this be a source of confusion when you’re trying to navigate or set up a two-way table. It’s always a good idea to label your totals as marginal row, marginal column, or grand total to help keep it clear.
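If you keep your cell counts on a computer, the totals are a few lines of work. Here is a small Python sketch for the cell-phone counts; the dictionary layout and names are just one way to hold the table, not anything the book prescribes.

```python
# Cell counts from Table 13-2 (rows: gender; columns: personal calls, yes or no).
counts = {
    "Males":   {"Yes": 325, "No": 183},
    "Females": {"Yes": 427, "No": 81},
}

# Marginal row totals: add across each row.
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}

# Marginal column totals: add down each column.
col_totals = {}
for cells in counts.values():
    for col, n in cells.items():
        col_totals[col] = col_totals.get(col, 0) + n

# The grand total can be reached from either margin.
grand_total = sum(row_totals.values())
assert grand_total == sum(col_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The built-in check that both margins reach the same grand total is a cheap way to catch a mistyped cell count.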

Breaking Down the Probabilities

A percentage, when applied to a two-way table, represents the portion of the individuals in the sample falling into a certain group. This idea can be expanded to a probability, which gives the chance that an individual person selected from this group falls into a certain category.

A two-way table gives you the opportunity to find many different kinds of probabilities to help you find the answers to different questions about your data or to look at the data another way. In this section, I cover the three most important types of probabilities found in a two-way table: marginal probabilities, joint probabilities, and conditional probabilities. (If you need more info on these terms, check out Probability For Dummies [Wiley].)

When you find probabilities based on a sample, as you do in this chapter, you have to realize that those probabilities pertain to that sample only. They do not transfer automatically to the population being studied. For example, if you take a random sample of 1,000 adults and find that 55 percent of them watch reality TV, this study doesn’t mean that 55 percent of all adults in the entire population watch reality TV. (The media makes this mistake every day.) You need to take into account the fact that sample results vary. In Chapters 14 and 15, you do just that. But this chapter zeros in on summarizing the information in your sample, which is the first step toward that end (but not the last step in terms of making conclusions about your corresponding population).

Marginal probabilities

A marginal probability makes a probability out of the marginal total, for either the rows or the columns. A marginal probability represents the proportion of the entire group that belongs in that single row or column category. Each marginal probability represents only one category for only one variable — it doesn’t consider the other variable at all. In the cell-phone example, you have four possible marginal probabilities (refer to Table 13-3):

Marginal probability of female (508/1,016 = 0.50). That means 50 percent of all the cell-phone users in this sample were females.

Marginal probability of male (508/1,016 = 0.50). That means 50 percent of all the cell-phone users in this sample were males.

Marginal probability of using a cell phone for personal calls (752/1,016 = 0.74). Therefore, 74 percent of all cell-phone users in this sample make personal calls with their cell phones.

Marginal probability of not using a cell phone for personal calls (264/1,016 = 0.26). In other words, 26 percent of all the cell-phone users in this sample don’t make personal calls with their cell phones.

Statisticians use shorthand notation for all probabilities. If you let M = male, F = female, Yes = personal cell-phone use, and No = no personal cell-phone use, then each of the preceding marginal probabilities is written this way:

P(F) = 0.50

P(M) = 0.50

P(Yes) = 0.74

P(No) = 0.26

Notice that P(F) and P(M) add up to 1.00. This result is no coincidence, because these two categories make up the entire gender variable. Similarly, P(Yes) and P(No) sum up to 1.00 because those choices are the only two for the personal cell-phone use variable. Everyone has to be classified somewhere.
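As a quick check on the arithmetic, this Python sketch divides each marginal total from Table 13-3 by the grand total. The names are illustrative only.

```python
# Marginal totals taken from Table 13-3.
row_totals = {"M": 508, "F": 508}
col_totals = {"Yes": 752, "No": 264}
grand_total = 1016

# A marginal probability is a marginal total divided by the grand total.
marginal = {name: total / grand_total
            for name, total in {**row_totals, **col_totals}.items()}

for name, p in marginal.items():
    print(f"P({name}) = {p:.2f}")
```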


Be advised that some probabilities aren’t useful in terms of discovering information about the population in general. For example, P(F) = 0.50 in the previous example because the researchers determined ahead of time that they wanted exactly 508 females and exactly 508 males. The fact that 50 percent of the sample is female and 50 percent of the sample is male doesn’t mean that in the entire population of cell-phone users 50 percent are males and 50 percent are females. The sample was just set up that way. If you want to study what proportion of cell-phone users are females and males, you need to take a combined sample instead of two separate ones, and see how many males and females appear in the combined sample.

Joint probabilities

A joint probability gives the probability of the intersection of two categories, one from the row variable and one from the column variable. It’s the probability that someone selected from the whole group has two particular characteristics at the same time. A joint probability is found by taking the cell count for those having both characteristics and dividing by the grand total. In other words, both characteristics happen jointly, or together.

The cell-phone example has four joint probabilities:

The probability that someone from the entire group is male and uses his cell phone for personal calls. This probability is 325/1,016 = 0.32, meaning that 32 percent of all the cell-phone users in this sample are males using their cell phones for personal calls.

The probability that someone from the entire group is male and doesn’t use his cell phone for personal calls is 183/1,016 = 0.18.

The probability that someone from the entire group is female and makes personal calls with her cell phone is 427/1,016 = 0.42.

The probability that someone from the entire group is female and doesn’t make personal calls with her cell phone is 81/1,016 = 0.08.

The notation for the joint probabilities previously listed is as follows, where + represents the intersection of the two categories listed:

P(M + Yes) = 0.32

P(M + No) = 0.18

P(F + Yes) = 0.42

P(F + No) = 0.08

The sum of all the joint probabilities for any two-way table should be 1.00, unless you have a little round-off error, which makes it very close to, but not exactly, 1.00. The sum is 1.00, because everyone in the group is classified somewhere with respect to both variables. It’s like dividing the entire group into four parts and showing which proportion falls into each part.
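Here is a short Python sketch of the same calculation: each cell count divided by the grand total, followed by a check that the four pieces account for everyone. The tuple keys are simply my way of labeling the cells.

```python
# Cell counts from Table 13-3, keyed by (gender, personal calls).
counts = {
    ("M", "Yes"): 325, ("M", "No"): 183,
    ("F", "Yes"): 427, ("F", "No"): 81,
}
grand_total = sum(counts.values())

# A joint probability is a cell count divided by the grand total.
joint = {cell: n / grand_total for cell, n in counts.items()}

for (gender, answer), p in joint.items():
    print(f"P({gender} + {answer}) = {p:.2f}")

# Before rounding, the joint probabilities add up to 1 (up to floating-point error).
total_probability = sum(joint.values())
print(f"Sum = {total_probability:.2f}")
```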

Conditional probabilities

A conditional probability is what you use if you want to compare subgroups in the sample. In other words, if you want to break down the table further, a conditional probability is what you use. Each row has a conditional probability for each cell within the row, and each column has a conditional probability for each cell within that column.

Note: Because conditional probability is one of the sticking points for a lot of students, I want to spend extra time on it. My goal in this section is for you to have a good understanding of what a conditional probability really means and how you can use it in the real world (something many statistics textbooks neglect to mention, I have to say).

Figuring conditional probabilities

Consider the cell-phone example in Table 13-3. Suppose you want to look at just the males who took the survey. The total number of males is 508. You can break this group down into two subgroups by using conditional probability. You can find the probability of using cell phones for personal calls (males only), and you can find the probability of not using cell phones for personal calls (males only). Similarly, you can break down the females by those females who use cell phones for personal calls and those females who don’t.

In each case, to find a conditional probability, you first look at a single row or column of the table that represents the known characteristic about the individuals. The marginal total for that row or column now represents your new grand total, because this group becomes your entire universe when you examine it. Then take each cell count from that row or column and divide it by that row or column’s marginal total.

In the cell-phone example, you have the following conditional probabilities when you break the table down by gender:

The conditional probability that a male uses a cell phone for personal calls is 325/508 = 0.64.

The conditional probability that a male doesn’t use a cell phone for personal calls is 183/508 = 0.36.

The conditional probability that a female uses a cell phone for personal calls is 427/508 = 0.84.

The conditional probability that a female doesn’t use a cell phone for personal calls is 81/508 = 0.16.

To interpret these results, you say that within this sample if you’re male, you’re more likely than not to use your cell phone for personal calls (64 percent compared to 36 percent). However, the percentage of personal-call makers is higher for females (84 percent versus 16 percent).

The conclusions you can make from two-way tables in this chapter must refer only to the sample, not the population it came from. Before going on to make general statements about the conditional probability within a population, you need to find a confidence interval or conduct a hypothesis test for a population proportion (which is equivalent to a probability). See Chapter 3 or your intro stats book for information on a hypothesis test for a population proportion.

Notice that for the males in the previous example, the two probabilities (0.64 and 0.36) add up to 1.00. This is no coincidence. The males have been broken down by cell-phone use for personal calls, and because everyone in the study is a cell-phone user, each male has to be classified in one group or the other. Similarly, the two probabilities for the females sum to 1.00.
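The row-by-row breakdown can be sketched in Python like this. The helper name conditional_given_row is made up for illustration; it treats the chosen row's marginal total as the new grand total.

```python
# Cell counts from Table 13-3 (rows: gender; columns: personal calls).
counts = {
    "M": {"Yes": 325, "No": 183},
    "F": {"Yes": 427, "No": 81},
}

def conditional_given_row(table, row):
    """Divide each cell count in one row by that row's marginal total."""
    row_total = sum(table[row].values())
    return {col: n / row_total for col, n in table[row].items()}

given_male = conditional_given_row(counts, "M")
given_female = conditional_given_row(counts, "F")

print({answer: round(p, 2) for answer, p in given_male.items()})
print({answer: round(p, 2) for answer, p in given_female.items()})
```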

Notation for conditional probabilities

Conditional probabilities are denoted by a straight up-and-down line that lists and separates the event that is known to have happened (what’s given) and the event for which you want to find the probability. You can write the notation like this: P(XX|XX). You place the given event to the right of the line and the event for which you want to find the probability to the left of the line. For example, suppose you know someone is female (F) and you want to find out the chance she is a Democrat (D). In this case, you’re looking for P(D|F). On the other hand, say you know a person is a Democrat and you want the probability that person is female; in that case, you’re looking for P(F|D).

The straight up-and-down line in the conditional probability notation isn’t a division sign; the line is just a line separating events A and B. Also, be careful of the order in which you place A and B into the conditional probability notation. In general, P(A|B) ≠ P(B|A).

Following is the notation used for the conditional probabilities in the cell-phone example:

P(Yes | M) = 0.64. You can say it this way: "The probability of Yes given Male is 0.64."

P(No | M) = 0.36. In human terms, say "The probability of No given Male is 0.36."

P(Yes | F) = 0.84. Say this one with gusto: "The probability of Yes given Female is 0.84."

P(No | F) = 0.16. You translate this notation by saying "The probability of No given Female is 0.16."

You can see that P(Yes | M) + P(No | M) = 1.00 because you’re breaking all males into two groups: those using cell phones for personal calls (Yes) and those not (No). Notice however, that P(Yes | M) + P(Yes | F) doesn’t sum to 1.00. In the first case, you’re looking only at the males, and in the second case, only at the females.

Comparing two groups with conditional probabilities

One of the most common questions regarding two categorical (qualitative) variables is this: Are they related? To answer this question, you use conditional probabilities. You set up and find the conditional probabilities you need to see whether two variables are related.

To compare the conditional probabilities, take one variable and find the conditional probabilities based on the other variable. Do this for each category of the first variable. Compare those conditional probabilities (you can even graph them for the two groups) and see whether they’re different or the same. (If the conditional probabilities are the same for each group, the variables aren’t related in the sample. If they’re different, the variables are related in the sample.) To be able to generalize the results, you need to use the sample results to draw a conclusion about the overall population involved by doing a Chi-square test (see Chapter 14).

Revisiting the cell-phone example from the previous section, you can ask specifically: Is personal use related to gender? You know that you want to compare cell-phone use for males and females to find out whether use is related to gender. However, it’s very difficult to compare cell counts — for example, 325 males use their phones for personal calls, compared to 427 females. In fact, it’s impossible to compare these numbers without using some total for perspective. Three hundred twenty-five out of what?

You have no way of comparing the cell counts in two groups without creating percentages (dividing each cell count by the appropriate total). Percentages give you a means of comparing two numbers on equal terms. For example, suppose you give a one-question opinion survey (yes, no, no opinion) to a random sample of 1,099 people; 465 respondents said yes, 357 said no, and 277 had no opinion. To truly interpret this information, you’re probably trying in your head to compare these numbers to each other. That’s what percentages do for you. Showing the percentage in each group in a side-by-side fashion gives you a relative comparison of the groups with each other.
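To save you the mental arithmetic on the opinion-survey numbers, here is a tiny Python sketch that turns the three counts into percentages of the 1,099 respondents. The variable names are my own.

```python
# Counts from the one-question opinion survey described in the text.
responses = {"yes": 465, "no": 357, "no opinion": 277}
sample_size = sum(responses.values())

# Each percentage is the count divided by the sample size, times 100.
percentages = {answer: 100 * n / sample_size for answer, n in responses.items()}

for answer, pct in percentages.items():
    print(f"{answer}: {pct:.1f}%")
```

Side by side, the split is roughly 42 percent yes, 32 percent no, and 25 percent no opinion, which is far easier to take in than the raw counts.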

But first, you need to bring conditional probabilities into the mix. In the cell-phone example, if you want the percentage of females who use their cell phones for personal calls, you take 427 divided by the total number of females (508) to get 84 percent. Similarly, to get the percentage of males who use their cell phones for personal calls, take the cell count (325) and divide it by that row total for males (508), which gives you 64 percent. This percentage is the conditional probability of using a cell phone for personal calls, given the person is male.

Now you’re ready to compare the males and females by using conditional probabilities. Take the percentage of females who use their cell phones for personal calls and compare it to the percentage of males who use their cell phones for personal calls. By finding these conditional probabilities, you can easily compare the two groups and say that in this sample at least, more females (84 percent) use their cell phones for personal calls than males (64 percent).

Using graphs to display conditional probabilities

One way to highlight conditional probabilities as a tool for comparing two groups is to use graphs such as a pie chart comparing the results of the other variable for each group or a bar chart comparing the results of the other variable for each group.

Figures 13-1a and 13-1b use two pie charts to compare males and females on cell-phone use. Figure 13-1a shows cell-phone use for only the males; this pie chart shows the conditional distribution of use for (given) males. Figure 13-1b shows the conditional distribution of cell-phone use for (given) females. A comparison of Figures 13-1a and 13-1b shows the slices for cell-phone use aren’t equal (or even close) for males compared to females. That result means that gender and cell-phone use for personal calls are dependent in this sample.

You may be wondering how close the two pie charts need to look (in terms of how close the slice amounts are for one pie compared to the other) in order to say the variables are independent. This question isn’t one you can answer completely until you conduct a hypothesis test for the proportions themselves (see the Chi-square test in Chapter 14). For now, with respect to your sample data, if the difference in the appearance of the slices for the two graphs is enough that you would write a newspaper article about it, then I’d go for dependence. Otherwise, conclude independence.

You can also make a bar chart to show the same idea. (For more info on pie charts and bar charts, see Statistics For Dummies [written by me and published by Wiley] or your intro stats textbook.)

Another way you can make comparisons is to break down the two-way table by the column variable. (You don’t always have to use the row variable for comparisons.) In the cell-phone example (Table 13-3), you can compare the group of personal-call makers to the group of no-personal-call makers and see what percentage in each group is male and female. This type of comparison puts a different spin on the information, because you’re comparing the behaviors to each other, in terms of gender.

With this new breakdown of the two-way table, you get the following:

The conditional probability of being male, given you use your cell phone for personal calls, is P(M | Yes) = 325/752 = 0.43. Note: The denominator is 752, the total number of people who make personal calls with their cell phones.

The conditional probability of being female, given you use your cell phone for personal calls, is P(F | Yes) = 427/752 = 0.57.

The conditional probability of being male, given you don’t use your cell phone for personal calls, is P(M | No) = 183/264 = 0.69.

The conditional probability of being female, given you don’t use your cell phone for personal calls, is P(F | No) = 81/264 = 0.31.

The first two probabilities add up to 1.00, because you’re breaking down the personal-call makers according to gender (male or female), and the last two probabilities sum to 1.00, because you’re breaking down the non-personal-call makers by gender (male and female).

Figure 13-1: Pie charts comparing male versus female personal cell-phone use.

The overall conclusions are similar to those found in the previous section, but the specific percentages and the interpretation are different. Interpreting the data this way, if you use your cell phone for personal calls, you’re more likely to be female than male (57 percent compared to 43 percent). And if you don’t use your cell phone to make personal calls, you’re more likely to be male (69 percent versus 31 percent).
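The column-by-column breakdown works the same way as the row version, just with column totals as the denominators. Here is a Python sketch; the helper name conditional_given_column is hypothetical.

```python
# Cell counts from Table 13-3 (rows: gender; columns: personal calls).
counts = {
    "M": {"Yes": 325, "No": 183},
    "F": {"Yes": 427, "No": 81},
}

def conditional_given_column(table, col):
    """Divide each cell count in one column by that column's marginal total."""
    col_total = sum(row[col] for row in table.values())
    return {row_name: row[col] / col_total for row_name, row in table.items()}

given_yes = conditional_given_column(counts, "Yes")  # out of 752 personal-call makers
given_no = conditional_given_column(counts, "No")    # out of 264 non-personal-call makers

print({gender: round(p, 2) for gender, p in given_yes.items()})
print({gender: round(p, 2) for gender, p in given_no.items()})
```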

To get the correct answer for any probability in a two-way table, here’s the trick: Always be sure to identify the group that is being examined. What is the probability "out of"? In the cell phone example (refer to Table 13-3):

If you want the percentage of all users who are males using their phones for personal calls, then you take the cell count 325, and divide by 1,016, the grand total.

If you want the percentage of males who are using their cell phones for personal calls, you take 325 divided by 508, the total number of males.

If you want the percentage of personal-call makers who are male, you take 325 divided by 752 (the total number of people who make personal calls with their cell phones).

In each of these three cases, the numerator is the same, but the denominators are different, leading you to very different answers. Deciding which number to divide by is a very common source of confusion for people, and this trick can really help give you an edge on keeping it straight.
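The "out of what?" trick is easy to see in a few lines of Python. This sketch keeps the numerator fixed at 325 and swaps in the three denominators from Table 13-3; the group labels are mine.

```python
# One numerator (males who make personal calls), three possible "out of" groups.
cell_count = 325
denominators = {
    "all users": 1016,            # grand total: a joint probability
    "males": 508,                 # row total: P(Yes | M)
    "personal-call makers": 752,  # column total: P(M | Yes)
}

answers = {group: cell_count / total for group, total in denominators.items()}

for group, p in answers.items():
    print(f"325 out of {group}: {p:.2f}")
```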

Trying to be Independent

Independence is a big deal in statistics. The term generally means that two items have outcomes whose probabilities don’t affect each other. The items could be events A and B, variables X and Y, or survey results from two people selected at random from a population, and so on. If the outcomes of the two items do affect each other, statisticians call those two items dependent (or not independent). In this section, you check for and interpret independence of two categories of qualitative variables in a sample, and you check for and interpret independence of two qualitative variables in a sample.

Checking for independence between two categories

Statistics instructors often have students check to see whether two categories (one from a qualitative variable X and the other from a qualitative variable Y) are independent. I prefer to just compare the two groups and talk about how similar or different the percentages are, broken down by another variable. However, to cover all the bases and make sure you can answer this very popular question, here’s the official definition of independence, straight from the statistician’s mouth: Two categories are independent if their joint probability equals the product of their marginal probabilities. The only caveat here is that neither of the categories can be completely empty.

For example, if being female is independent of being a Democrat, then P(F + D) = P(F) * P(D), where D = Democrat and F = Female. So, to show that two categories are independent, find the joint probability and compare it to the product of the two marginal probabilities. If you get the same answer both times, the categories are independent. If not, then the categories are not independent, but rather, they are dependent.

You may be wondering: Don’t all probabilities work this way, where the joint probability equals the product of the marginals? No, they don’t. For example, if you draw a card from a standard 52-card deck, you get a red card with probability 1/2. You draw a black card with probability 1/2. The chance, though, of drawing both a black and red card with one draw is 0, while the product of the probabilities for black times red comes out to 1/2 * 1/2 = 1/4.

Now, if you look at a red card that is a two, the joint probability of a red two, which is 2/52 = 1/26, equals the probability of a red card (1/2) times the probability of a two, which is 4/52 (because 1/2 * 4/52 = 2/52 = 1/26).
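You can verify both card examples by brute force. The Python sketch below builds a 52-card deck and counts outcomes, using exact fractions so that rounding can't blur the comparison. The deck representation and helper names are assumptions made for illustration.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suit_colors = {"hearts": "red", "diamonds": "red",
               "clubs": "black", "spades": "black"}
deck = [(rank, suit) for rank in ranks for suit in suit_colors]

def prob(event):
    """Probability that one card drawn at random satisfies the event."""
    favorable = sum(1 for card in deck if event(card))
    return Fraction(favorable, len(deck))

def is_red(card):
    return suit_colors[card[1]] == "red"

def is_black(card):
    return suit_colors[card[1]] == "black"

def is_two(card):
    return card[0] == "2"

# Red and black: the joint probability is 0, but the product is 1/4 (dependent).
p_red_and_black = prob(lambda c: is_red(c) and is_black(c))
p_red_times_black = prob(is_red) * prob(is_black)

# Red and two: the joint probability equals the product (independent).
p_red_and_two = prob(lambda c: is_red(c) and is_two(c))
p_red_times_two = prob(is_red) * prob(is_two)

print(p_red_and_black, p_red_times_black)
print(p_red_and_two, p_red_times_two)
```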

Another way to check for independence is to compare the conditional probability to the marginal probability. Specifically, if you want to check whether being female is independent of being Democrat, check either of the following two situations (they’ll both work if the variables are independent):

Is P(F | D) = P(F)? That is, if you know someone is a Democrat, does that leave the chance that they will also be female unchanged? If the two probabilities are equal, F and D are independent. If not, F and D are dependent.

Is P(D | F) = P(D)? This question is asking whether being female leaves your chances of being a Democrat unchanged. If the two probabilities are equal, D and F are independent. If not, D and F are dependent.

Is knowing that you’re in one category going to change the probability of being in another category? If so, the two categories aren’t independent. If it doesn’t affect the probability, then the two categories are independent.

Checking for independence between two variables

The discussion in the previous section focuses on checking if two specific categories are independent in a sample. If you want to extend this idea to showing that two entire categorical variables are independent, you must check the independence conditions for every combination of categories in those variables. All of them must work, or independence is lost. The first case where dependence is found between two categories means that the two variables are dependent. If you find that the first case shows independence, you must continue checking all the combinations before declaring independence.

Suppose a doctor’s office wants to know whether calling patients to confirm their appointments is related to whether they actually show up. The variables are X = called the patient (called or didn’t call) and Y = patient showed up for their appointment (showed or didn’t show). Here are the four conditions that need to hold before you declare independence:

1. P(showed) = P(showed | called)

2. P(showed) = P(showed | didn’t call)

3. P(didn’t show) = P(didn’t show | called)

4. P(didn’t show) = P(didn’t show | didn’t call)

If any one of these conditions isn’t met, you stop there and declare the two variables to be dependent in the sample. If (and only if) all the conditions are met, you declare the two variables independent in the sample.

You can see the results of a sample of 100 randomly selected patients in Table 13-4.

Table 13-4

Confirmation Calls Related to Showing Up for the Appointment

| | Called | Didn’t Call | Row Totals |
|---|---|---|---|
| Showed | 57 | 33 | 90 |
| Didn’t Show | 3 | 7 | 10 |
| Column Totals | 60 | 40 | 100 |

Checking the conditions for independence, you can start at the first condition and check to see whether P(showed) = P(showed | called). From the last column of Table 13-4, you can see that P(showed) is equal to 90/100 = 0.90, or 90 percent. Next, you can find P(showed | called) by looking at the first column of Table 13-4. This probability is 57/60 = 0.95, or 95 percent. Because these two probabilities aren’t equal (although they’re close), you say that showing up and calling first are dependent. In other words, people come a little more often when you call them first. (To determine whether these sample results carry through to the population, which also takes care of the question of how close the probabilities need to be in order to conclude independence, see Chapter 14.)
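To check all four conditions at once rather than stopping at the first failure, you could tabulate them with a short Python sketch like the following. The counts come from Table 13-4. The dictionary keys and variable names are just labels chosen for this illustration.

```python
from fractions import Fraction

# Counts from Table 13-4, stored as counts[outcome][call_status].
# The key names are labels chosen for this sketch.
counts = {
    "showed": {"called": 57, "not_called": 33},
    "no_show": {"called": 3, "not_called": 7},
}

grand_total = sum(sum(row.values()) for row in counts.values())
column_totals = {
    status: sum(row[status] for row in counts.values())
    for status in ("called", "not_called")
}

# Pair each marginal probability with each matching conditional probability.
conditions = []
for outcome, row in counts.items():
    marginal = Fraction(sum(row.values()), grand_total)      # e.g. P(showed)
    for status, cell in row.items():
        conditional = Fraction(cell, column_totals[status])  # e.g. P(showed | called)
        conditions.append((outcome, status, marginal, conditional))

for outcome, status, marginal, conditional in conditions:
    verdict = "equal" if marginal == conditional else "not equal"
    print(f"P({outcome}) = {marginal} vs P({outcome} | {status}) = {conditional}: {verdict}")

# Independence in the sample requires every one of the four conditions to hold.
independent_in_sample = all(m == c for _, _, m, c in conditions)
print(independent_in_sample)  # False
```

In this sample none of the four conditions holds, so the variables are dependent in the sample. Whether that difference is large enough to matter in the population is a separate question.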

Demystifying Simpson’s Paradox

Simpson’s Paradox is a phenomenon where results appear to be in direct contradiction to one another, which can make even the best student’s heart race. This situation can go unnoticed unless three variables (or more) are examined, in which case you organize the results into a three-way table, with columns within columns or rows within rows.

Simpson’s Paradox is a favorite among statistics instructors (because it’s so mystical and magical — and the numbers get so gooey and complex) but Simpson’s Paradox is a nonfavorite among many students, mainly because of the following two reasons (in my opinion):

Due to the way Simpson’s Paradox is presented in most statistics courses, you can easily get buried in the details and have no hope of seeing the big picture: Simpson’s Paradox presents a big problem in terms of interpreting data, and you need to understand it fully in order to avoid it.

Most textbooks do a good job of showing you examples of Simpson’s Paradox, but they do a not-so-good job of explaining why it occurs (some even neglect to explain the why part at all).

My goals in this section are for you to know what Simpson’s Paradox is, to be able to understand and explain why and how it happens, and to know how to be watchful for it. This is a tall order, I know, but stick with me.

Experiencing Simpson’s Paradox

Simpson’s Paradox was described in 1951 by a British statistician named E. H. Simpson. He realized that if you analyze some data sets one way, by breaking them down by two variables only, you can get one result, but when you break the data down further by a third variable, the results switch direction. That’s why his result is called Simpson’s Paradox; a paradox is an apparent contradiction in results.

In the following sections, you can see Simpson’s Paradox play out in an example, with all the details worked through along the way.

Simpson’s Paradox in action: Video games and the gender gap

Suppose I am interested in finding out who is better at playing video games, men or women. I watch males and females choose and play a variety of video games, and each time someone plays a video game, I record whether he or she wins or loses. Suppose I record the results of 200 video games, as seen in Table 13-5. (Note that the females played 120 games, and the males played 80 games.)

Table 13-5

Video Games Won and Lost for Males versus Females

| All Games | Won | Lost | Marginal Row Totals |
|---|---|---|---|
| Males | 44 | 36 | 80 |
| Females | 84 | 36 | 120 |
| Marginal Column Totals | 128 | 72 | 200 (Grand Total) |

Looking at Table 13-5, you see the proportion of males who won their video games, P(Won | Male), is 44/80 = 0.55. The proportion of females who won their video games, P(Won | Female), is 84/120 = 0.70. So overall, the females won more of their video games than the males did. Does this finding mean that women are better than men at video games in general in the sample?
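The two conditional probabilities above are just row-wise divisions, which you can sketch in a few lines of Python. The counts come from Table 13-5. The function name win_rate is a label invented for this illustration.

```python
from fractions import Fraction

# Counts from Table 13-5 as (won, lost) for each group.
two_way = {"males": (44, 36), "females": (84, 36)}


def win_rate(won, lost):
    """Conditional probability of winning, given the group: won / games played."""
    return Fraction(won, won + lost)


rates = {group: win_rate(won, lost) for group, (won, lost) in two_way.items()}

for group, rate in rates.items():
    print(f"P(Won | {group}) = {rate} = {float(rate):.2f}")
```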

Not so fast, my friend. Notice that the people in the study were allowed to choose the video games they played. This factor blows the study wide open. Suppose females and males choose different types of video games: Can this affect the results? The answer may be yes. Considering other variables that could be related to the results but weren’t included in the original study (or at least not in the original data analysis) is important. These additional variables that cloud the results are called confounding variables.

Factoring in difficulty level

Many people may expect the video game results from the previous section to be turned around, with men coming out better at playing video games than women. According to the research, men spend more time playing video games, on average, and are by far the primary purchasers of video games, compared to women. So what explains the eyebrow-raising results in this study? Is there another possible explanation? Is important information missing that is relevant to this case?

One of the variables that wasn’t considered when I made Table 13-5 was the difficulty level of the video game being played. Suppose I go back and include the difficulty level of the chosen game each time, along with each result (won or lost). Level one indicates easy video games, comparable to the level of Ms. Pac Man (games that are my speed), and level two means more challenging video games (like war games or sophisticated strategy games).

Table 13-6 represents the results with this new information added on difficulty level of games played. You have three variables now: level of difficulty (one or two); gender (male or female); and outcome (won or lost). Statisticians therefore call Table 13-6 a three-way table.

Table 13-6

A Three-Way Table for Gender, Game Level, and Game Outcome

| | Level-One Games: Won | Level-One Games: Lost | Level-Two Games: Won | Level-Two Games: Lost |
|---|---|---|---|---|
| Males | 9 | 1 | 35 | 35 |
| Females | 72 | 18 | 12 | 18 |

Note in Table 13-6 that the number of level-one video games chosen was 9 + 1 + 72 + 18 = 100, and the number of level-two video games chosen was 35 + 35 + 12 + 18 = 100. But now you need to look at who chose which level of game. The next section probes this very issue.
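As a quick check that Table 13-6 really is Table 13-5 with one more variable added, you could collapse the three-way counts over game level and see that the original totals come back. The sketch below is one way to do that in Python. The nested-dictionary layout and the names are assumptions made for this illustration.

```python
# Counts from Table 13-6, stored as three_way[gender][level] = (won, lost).
three_way = {
    "males": {"level one": (9, 1), "level two": (35, 35)},
    "females": {"level one": (72, 18), "level two": (12, 18)},
}

# Games played at each level, summed over gender and outcome.
games_per_level = {}
for levels in three_way.values():
    for level, (won, lost) in levels.items():
        games_per_level[level] = games_per_level.get(level, 0) + won + lost

# Collapse over level to recover the two-way table of gender by outcome.
two_way = {
    gender: (
        sum(won for won, _ in levels.values()),
        sum(lost for _, lost in levels.values()),
    )
    for gender, levels in three_way.items()
}

print(games_per_level)  # {'level one': 100, 'level two': 100}
print(two_way)          # {'males': (44, 36), 'females': (84, 36)}
```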

Comparing success rates with conditional probabilities

To compare the success rates for males versus females using Table 13-6, you can figure out the appropriate conditional probabilities, first for level-one games and then for level-two games.

For level-one games (only), the conditional probability of winning given male is P(Won | Male) = 9/10 = 0.90. So for the level-one games, males won 90 percent of the games they played. For level-one games, the percentage of games won by the females is P(Won | Female) = 72/90 = 0.80, or 80 percent. These results mean that at level one, the males did 10 percentage points better than the females at winning their games. But this percentage appears to contradict the results found in Table 13-5. (Just wait: the contradictions don’t end here!)

Now figure the conditional probabilities for the level-two video games won. For the men, the percentage of males winning level-two games was 35/70 = 0.50, or 50 percent. For the ladies, the percentage of women winning level-two games was 12/30 = 0.40, or 40 percent. Once again, the males outdid the females!

Step back and think about this scenario for a minute. Table 13-5 shows that females won a higher percentage of the video games they played overall. But Table 13-6 shows that males won more of the level-one games and that males won more of the level-two games. What’s going on? No need to check your math. No mistakes were made, and no tricks were pulled. This inconsistency in results happens in real life from time to time in situations where an important third variable is left out of a study, a situation aptly named Simpson’s Paradox. (See why it’s called a paradox?)
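If you’d like to confirm the reversal without doing the division by hand, here is a minimal Python sketch that computes the win rates within each level and overall from the Table 13-6 counts. The data layout and variable names are assumptions chosen for this illustration.

```python
from fractions import Fraction

# Counts from Table 13-6, stored as three_way[gender][level] = (won, lost).
three_way = {
    "males": {"level one": (9, 1), "level two": (35, 35)},
    "females": {"level one": (72, 18), "level two": (12, 18)},
}


def win_rate(won, lost):
    """Proportion of games won out of games played."""
    return Fraction(won, won + lost)


# Win rate within each difficulty level.
by_level = {
    gender: {level: win_rate(*cell) for level, cell in levels.items()}
    for gender, levels in three_way.items()
}

# Win rate over all games, ignoring difficulty level.
overall = {
    gender: win_rate(
        sum(won for won, _ in levels.values()),
        sum(lost for _, lost in levels.values()),
    )
    for gender, levels in three_way.items()
}

males_lead_each_level = all(
    by_level["males"][level] > by_level["females"][level]
    for level in ("level one", "level two")
)
females_lead_overall = overall["females"] > overall["males"]

print(males_lead_each_level, females_lead_overall)  # True True
```

Both comparisons come out true at the same time, which is exactly the apparent contradiction that Simpson’s Paradox describes.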

Asking why: Simpson’s Paradox

Confounding variables are the underlying cause of Simpson’s Paradox. (A confounding variable is a third variable that’s related to each of the other two variables and can affect the results if not accounted for.)

In the video game example, when you look at the video game outcomes (won or lost) broken down by gender only (Table 13-5), females won a higher percentage of their overall games than males (70 percent overall winning percentage for females compared to 55 percent overall winning percentage for males). Yet, when you split up the results by the level of the video game (level one or level two; see Table 13-6), the results reverse themselves, and you see that males did better than females on the level-one games (90 percent to 80 percent), and males also did better on the level-two games (50 percent versus 40 percent).

To see why this seemingly impossible result happens, take a look at the marginal row probabilities versus the marginal row totals in Table 13-6 (for the level-one games). The percentage of times a male won when he played an easy video game was 90 percent. However, males chose level-one video games only 10 times (out of 80 total games played by men; that’s only 12.5 percent).

To break this idea down further, the males’ non-stellar performance on the challenging video games (50 percent — but still better than the females) coupled with the fact that the males chose challenging video games 70 out of 80 = 87.5 percent of the time really brought down that overall winning percentage (55 percent). And even though the men did really well on the level-one video games, they didn’t play many of them (compared to the females), so their high winning percentage on level-one video games (90 percent) didn’t count much toward their overall winning percentage.

Meanwhile, in Table 13-6, you see that females chose level-one video games 90 times (out of 120). Even though the females only won 72 out of the 90 games (80 percent, a lower percentage than the males), they chose to play many more of the level-one games, boosting their overall winning percentage.

Now the opposite situation happens when you look at the level-two video games in Table 13-6. The males chose the harder video games 70 times (out of 80), while the females only chose the harder ones 30 times out of 120. The males did better than the females on level-two video games (winning 50 percent of them versus 40 percent for the females). However, level-two video games are harder to win than level-one video games. This factor means that the males’ winning percentage on level-two video games, being only 50 percent, pulls their overall winning percentage down, because most of the games they played were level two. Meanwhile, the low winning percentage for females on level-two video games doesn’t hurt them much, because they didn’t play many level-two video games.

The bottom line is that the occurrence or non-occurrence of Simpson’s Paradox is a matter of weights. In the overall totals from Table 13-5, the males don’t look as good as the females. But when you add in the difficulty of the games (shown in Table 13-6), you see that most of the males’ wins came from harder games (which have a lower winning percentage). The females played many more of the easier games on average, and easy games have a higher chance of winning no matter who plays them. So it all boils down to this: Which games did the males choose to play, and which games did the females choose to play? The males chose harder games, which contributed in a negative way to their overall winning percentage and made the females look better than they actually were.
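One way to make the idea of weights concrete is to write each overall winning percentage as a weighted average: the share of games played at each level times the winning percentage at that level, added up over the levels. The Python sketch below does this with the Table 13-6 counts. The function name weighted_breakdown is invented for this illustration.

```python
from fractions import Fraction

# Counts from Table 13-6, stored as three_way[gender][level] = (won, lost).
three_way = {
    "males": {"level one": (9, 1), "level two": (35, 35)},
    "females": {"level one": (72, 18), "level two": (12, 18)},
}


def weighted_breakdown(levels):
    """Return per-level (level, weight, win rate) tuples and the overall win rate."""
    total_games = sum(won + lost for won, lost in levels.values())
    parts = []
    for level, (won, lost) in levels.items():
        played = won + lost
        weight = Fraction(played, total_games)  # share of this group's games at this level
        rate = Fraction(won, played)            # win rate at this level
        parts.append((level, weight, rate))
    overall = sum(weight * rate for _, weight, rate in parts)
    return parts, overall


for gender, levels in three_way.items():
    parts, overall = weighted_breakdown(levels)
    terms = " + ".join(f"{float(w):.3f} * {float(r):.2f}" for _, w, r in parts)
    print(f"{gender}: {terms} = {float(overall):.2f}")
```

For the males, the heavy weight (87.5 percent) sits on the 50 percent level-two rate. For the females, the heavy weight (75 percent) sits on the 80 percent level-one rate. That difference in weights is what flips the overall comparison.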

Level of game wasn’t included in the original summary, Table 13-5, but it should have been included because it’s a variable that affected the results. Level of game, in this case, was the confounding variable.

Keeping one eye open for Simpson’s Paradox

Simpson’s Paradox shows you the importance of including data about possible confounding variables when attempting to look at relationships between qualitative variables.

In the video game example I use in previous sections, level of difficulty of the game was a confounding variable; more men chose to play the more difficult games, which are harder to win, thereby lowering their overall success rate.

You can avoid Simpson’s Paradox by making sure that obvious confounding variables are included in a study; that way, when you look at the data you get the relationships right the first time, and no room exists for misconstruing the results. And as with all other statistical results, if it looks too good to be true, or too simple to be correct, it probably is! Beware of anyone who tries to oversimplify a result. While three-way tables are more difficult to examine, they are often worth using.