In This Chapter
Reading and interpreting two-way tables
Figuring probabilities and checking for independence
Watching out for Simpson’s Paradox
Looking for relationships between two categorical (qualitative) variables is a very common goal for researchers. For example, many medical studies center on how some characteristic about a person either raises or lowers his chance of getting some disease. Marketers ask questions like, "Who is more likely to buy our product: males or females?" Sports stat freaks wonder about things like "Does winning the coin toss at the beginning of a football game increase your team’s chance of winning the game?"
To answer each of the above questions, you must first collect data (from a random sample) on the two categorical variables being compared (call them X and Y). Then you organize that data into a table that contains columns and rows, showing how many individuals from the sample appear in each combination of X and Y. Finally, you use the information in the table to conduct a hypothesis test (called the Chi-square test). Using the Chi-square test, you can determine whether you can see a relationship between X and Y in the population from which the data was drawn. This last step needs the machinery from Chapter 14 to accomplish it. The goals of this chapter are to understand what it means for two qualitative variables (X and Y) to be associated and to discover how to use percentages to determine whether a sample data set appears to show a relationship between X and Y.
Suppose you’re collecting data on cell-phone users, and you want to find out whether more females than males use their cell phones for personal calls. A study of 508 randomly selected male cell-phone users and 508 randomly selected female cell-phone users conducted by a wireless company found that women tend to use their phones for personal calls more than men (big shocker). The survey showed that 427 of the women said they used their wireless phones primarily to talk with friends and family, while only 325 of the men admitted to doing so.
But you can’t stop there. You need to break down this information, calculate some percentages, and compare them to see how close they really are. Sample results vary from sample to sample, and differences can appear by chance.
In this chapter, you find out how to organize data from qualitative variables (data based on categories rather than measurements) into a table format. This skill is especially useful when you’re trying to look for relationships between two qualitative variables, such as using a cell phone for personal calls (a yes or no category) and gender (male or female). You also summarize the data to answer your questions. And, finally, you get to figure out, once and for all, what’s going on with that Simpson’s Paradox thing.
Breaking Down a Two-Way Table
A two-way table is a table that contains rows and columns, which help you organize data from categorical (qualitative) variables in the following ways:
The rows represent the possible categories for one categorical variable, such as males and females.
The columns represent the possible categories for a second categorical variable, such as using your cell phone for personal calls, or not.
Here I review the basic ideas of organizing and filling in a two-way table.
Organizing data into a two-way table
To organize your data into a two-way table, first set up the rows and columns. Table 13-1 shows the setup for the cell-phone data (refer to the example I give at the beginning of the chapter).
Table 13-1 Two-Way Table Set Up for the Cell-Phone Data

|         | Personal Calls: Yes | Personal Calls: No |
|---------|---------------------|--------------------|
| Males   |                     |                    |
| Females |                     |                    |
Notice that Table 13-1 has four empty cells inside of it (not counting the empty space in the upper-left corner). Because gender has two choices (male or female), and personal cell-phone use has two choices (yes or no), the resulting two-way table has 2 * 2 = 4 cells.

To figure out the number of cells in any two-way table, multiply the number of possible categories for the row variable by the number of possible categories for the column variable.
Fitting in the cell counts
After you set up the table with the appropriate number of rows and columns, you need to fill in the appropriate numbers in each of the cells of the two-way table. The number in each cell of a two-way table is called the cell count for that cell. The upper-left cell in the two-way table shown in Table 13-1 represents the number of males who use their cell phones for personal calls. With the information you have in the cell-phone problem, the cell count for this cell is 325. Because you know that 427 females use their cell phones for personal calls, this number goes into the lower-left cell.
Now, to figure out the numbers in the remaining two cells, you do a bit of subtraction. You know from the information given that the total number of male cell-phone users in the survey is 508. Each male either uses his cell phone for personal calls (falling into the Yes group), or he doesn’t (falling into the No group). Because 325 males fall into the Yes group, and you have 508 males total, 183 males (508 – 325 = 183) don’t use their cell phones for personal calls. This number is the cell count for the upper-right cell of the two-way table. Finally, because 508 females took the survey, and 427 of them use their cell phones for personal calls, you know that the rest of them (508 – 427 = 81) don’t. Therefore, 81 is the cell count for the lower-right cell of the table. Table 13-2 shows the completed table for the cell-phone user problem, with the four cell counts filled in.
Table 13-2 Completed Two-Way Table for the Cell-Phone Data

|         | Personal Calls: Yes | Personal Calls: No |
|---------|---------------------|--------------------|
| Males   | 325                 | 183 (508 – 325)    |
| Females | 427                 | 81 (508 – 427)     |
Just to save you a little time: if you know the total number in a group and how many of those individuals fall into one of the categories of the two-way table, you can determine the number falling into the remaining category by subtracting the number in the given category from the total number in the group. You can complete this process for each remaining group in the table.
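The subtraction step can be sketched in a few lines of Python. This is only an illustration using the survey numbers from the text; the variable names (`group_totals`, `yes_counts`, `table`) are my own labels, not anything from the study.

```python
# Known from the survey: the size of each group and its "Yes" count.
group_totals = {"Males": 508, "Females": 508}
yes_counts = {"Males": 325, "Females": 427}

# Each "No" cell count is the group total minus the "Yes" cell count.
table = {
    group: {
        "Yes": yes_counts[group],
        "No": group_totals[group] - yes_counts[group],
    }
    for group in group_totals
}

for group, cells in table.items():
    print(group, cells)
```

The result matches the four cell counts in Table 13-2.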
Making marginal totals
One of the most important aspects of a two-way table is to have easy access to all the pertinent totals. Because every two-way table is made up of rows and columns, you can imagine that the totals for each row and the totals for each column are important. Also, the grand total is important to know.
If you take a single row and add up all the cell counts in the cells of that row, you get what is called a marginal row total for that row. Where does this marginal row total go on the table? You guessed it: out in the margin at the end of that row. You can find the marginal row totals for every row in the table and put them into the margins at the end of the rows. This group of marginal row totals represents what statisticians call the marginal distribution for the row variable. The marginal row totals should add up to the grand total, which is the total number of individuals in the study. (The individuals may be people, cities, dogs, companies, and so on, depending on the scenario of the problem at hand.)
Similarly, if you take a single column and add up all the cell counts in the cells of that column, you get the marginal column total for that column. This number goes in the margin at the bottom of the column. Follow this pattern for each column in the table, and you have the marginal distribution for the column variable. Again, the sum of all the marginal column totals equals the grand total. The grand total is always located in the lower-right corner of the two-way table.
The marginal row totals, marginal column totals, and the grand total for the cell-phone example are shown in Table 13-3.
Table 13-3 Marginal and Grand Totals for the Cell-Phone Data

|                        | Personal Calls: Yes | Personal Calls: No | Marginal Row Totals  |
|------------------------|---------------------|--------------------|----------------------|
| Males                  | 325                 | 183 (508 – 325)    | 508                  |
| Females                | 427                 | 81 (508 – 427)     | 508                  |
| Marginal Column Totals | 752                 | 264                | 1,016 (Grand Total)  |

The marginal row totals add the cell counts in each row; yet the marginal row totals show up as a column in the two-way table. This phenomenon occurs because when summing the cell counts in a row, you put the result in the margin at the end of the row, and when you do this for each row, you’re stacking the row totals into a column. Similarly, the marginal column totals add the cell counts in each column; yet they show up as a row in the two-way table. Don’t let this be a source of confusion when you’re trying to navigate or set up a two-way table. It’s always a good idea to label your totals as marginal row, marginal column, or grand total to help keep it clear.
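If you keep the cell counts in code, the marginal totals are just sums across rows and down columns. Here is a minimal Python sketch using the cell-phone counts; the nested-dictionary layout and the names `row_totals`, `column_totals`, and `grand_total` are assumptions made for illustration.

```python
# Cell counts from the cell-phone example: rows are gender, columns are personal calls.
table = {
    "Males": {"Yes": 325, "No": 183},
    "Females": {"Yes": 427, "No": 81},
}
columns = ["Yes", "No"]

# Marginal row totals: add across each row.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Marginal column totals: add down each column.
column_totals = {col: sum(table[row][col] for row in table) for col in columns}

# The grand total can be reached either way; both sums must agree.
grand_total = sum(row_totals.values())
assert grand_total == sum(column_totals.values())

print(row_totals, column_totals, grand_total)
```

Checking that the row totals and the column totals give the same grand total is a cheap way to catch an arithmetic slip.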
Breaking Down the Probabilities
A percentage, when applied to a two-way table, represents the portion of the individuals in the sample falling into a certain group. This idea can be expanded to a probability, which gives the chance that an individual person selected from this group falls into a certain category.
A two-way table gives you the opportunity to find many different kinds of probabilities to help you find the answers to different questions about your data or to look at the data another way. In this section, I cover the three most important types of probabilities found in a two-way table: marginal probabilities, joint probabilities, and conditional probabilities. (If you need more info on these terms, check out Probability For Dummies [Wiley].)

When you find probabilities based on a sample, as you do in this chapter, you have to realize that those probabilities pertain to that sample only. They do not transfer automatically to the population being studied. For example, if you take a random sample of 1,000 adults and find that 55 percent of them watch reality TV, this study doesn’t mean that 55 percent of all adults in the entire population watch reality TV. (The media makes this mistake every day.) You need to take into account the fact that sample results vary. In Chapters 14 and 15, you do just that. But this chapter zeros in on summarizing the information in your sample, which is the first step toward that end (but not the last step in terms of making conclusions about your corresponding population).
Marginal probabilities
A marginal probability makes a probability out of the marginal total, for either the rows or the columns. A marginal probability represents the proportion of the entire group that belongs in that single row or column category. Each marginal probability represents only one category for only one variable; it doesn’t consider the other variable at all. In the cell-phone example, you have four possible marginal probabilities (refer to Table 13-3):
Marginal probability of female (508/1,016 = 0.50). That means 50 percent of all the cell-phone users in this sample were females.
Marginal probability of male (508/1,016 = 0.50). That means 50 percent of all the cell-phone users in this sample were males.
Marginal probability of using a cell phone for personal calls (752/1,016 = 0.74). Therefore, 74 percent of all cell-phone users in this sample make personal calls with their cell phones.
Marginal probability of not using a cell phone for personal calls (264/1,016 = 0.26). In other words, 26 percent of all the cell-phone users in this sample don’t make personal calls with their cell phones.
Statisticians use shorthand notation for all probabilities. If you let M = male, F = female, Yes = personal cell-phone use, and No = no personal cell-phone use, then each of the preceding marginal probabilities is written this way:
P(F) = 0.50
P(M) = 0.50
P(Yes) = 0.74
P(No) = 0.26
Notice that P(F) and P(M) add up to 1.00. This result is no coincidence, because these two categories make up the entire gender variable. Similarly, P(Yes) and P(No) sum up to 1.00 because those choices are the only two for the personal cell-phone use variable. Everyone has to be classified somewhere.
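As a quick check on the arithmetic, each marginal probability is a marginal total divided by the grand total. The sketch below assumes the totals from Table 13-3; the dictionary names are illustrative only.

```python
# Marginal totals and grand total from Table 13-3.
row_totals = {"M": 508, "F": 508}
column_totals = {"Yes": 752, "No": 264}
grand_total = 1016

# A marginal probability is a marginal total divided by the grand total.
marginal = {
    name: total / grand_total
    for name, total in {**row_totals, **column_totals}.items()
}

for name, p in marginal.items():
    print(f"P({name}) = {p:.2f}")
```

The probabilities for the categories of one variable always sum to 1, which the code confirms for both gender and personal-call use.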
Be advised that some probabilities aren’t useful in terms of discovering information about the population in general. For example, P(F) = 0.50 in the previous example because the researchers determined ahead of time that they wanted exactly 508 females and exactly 508 males. The fact that 50 percent of the sample is female and 50 percent of the sample is male doesn’t mean that in the entire population of cell-phone users 50 percent are males and 50 percent are females. The sample was just set up that way. If you want to study what proportion of cell-phone users are females and males, you need to take a combined sample instead of two separate ones, and see how many males and females appear in the combined sample.
Joint probabilities
A joint probability gives the probability of the intersection of two categories, one from the row variable and one from the column variable. It’s the probability that someone selected from the whole group has two particular characteristics at the same time. A joint probability is found by taking the cell count for those having both characteristics and dividing by the grand total. In other words, both characteristics happen jointly, or together.
The cell-phone example has four joint probabilities:
The probability that someone from the entire group is male and uses his cell phone for personal calls. This probability is 325/1,016 = 0.32, meaning that 32 percent of all the cell-phone users in this sample are males using their cell phones for personal calls.
The probability that someone from the entire group is male and doesn’t use his cell phone for personal calls is 183/1,016 = 0.18.
The probability that someone from the entire group is female and makes personal calls with her cell phone is 427/1,016 = 0.42.
The probability that someone from the entire group is female and doesn’t make personal calls with her cell phone is 81/1,016 = 0.08.
The notation for the joint probabilities previously listed is as follows, where ∩ represents the intersection of the two categories listed:
P(M ∩ Yes) = 0.32
P(M ∩ No) = 0.18
P(F ∩ Yes) = 0.42
P(F ∩ No) = 0.08
The sum of all the joint probabilities for any two-way table should be 1.00, unless you have a little round-off error, which makes it very close to, but not exactly, 1.00. The sum is 1.00, because everyone in the group is classified somewhere with respect to both variables. It’s like dividing the entire group into four parts and showing which proportion falls into each part.
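The same calculation in Python: divide every cell count by the grand total. This is a sketch under the assumption that the counts are stored in a dictionary keyed by (row category, column category); that layout is my choice, not something the survey prescribes.

```python
# Cell counts keyed by (gender, personal calls).
counts = {
    ("M", "Yes"): 325,
    ("M", "No"): 183,
    ("F", "Yes"): 427,
    ("F", "No"): 81,
}
grand_total = sum(counts.values())

# A joint probability is a cell count divided by the grand total.
joint = {cell: n / grand_total for cell, n in counts.items()}

for (gender, calls), p in joint.items():
    print(f"P({gender} and {calls}) = {p:.2f}")

# Unrounded, the joint probabilities sum to 1 (up to floating-point error).
print(sum(joint.values()))
```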
Conditional probabilities
A conditional probability is what you use if you want to compare subgroups in the sample. In other words, if you want to break down the table further, a conditional probability is what you use. Each row has a conditional probability for each cell within the row, and each column has a conditional probability for each cell within that column.
Note: Because conditional probability is one of the sticking points for a lot of students, I want to spend extra time on it. My goal in this section is for you to have a good understanding of what a conditional probability really means and how you can use it in the real world (something many statistics textbooks neglect to mention, I have to say).
Figuring conditional probabilities
Consider the cell-phone example in Table 13-3. Suppose you want to look at just the males who took the survey. The total number of males is 508. You can break this group down into two subgroups by using conditional probability. You can find the probability of using cell phones for personal calls (males only), and you can find the probability of not using cell phones for personal calls (males only). Similarly, you can break down the females by those females who use cell phones for personal calls and those females who don’t.
In each case, to find a conditional probability, you first look at a single row or column of the table that represents the known characteristic about the individuals. The marginal total for that row or column now represents your new grand total, because this group becomes your entire universe when you examine it. Then take each cell count from that row or column and divide it by that row or column’s marginal total.
In the cell-phone example, you have the following conditional probabilities when you break the table down by gender:
The conditional probability that a male uses a cell phone for personal calls is 325/508 = 0.64.
The conditional probability that a male doesn’t use a cell phone for personal calls is 183/508 = 0.36.
The conditional probability that a female uses a cell phone for personal calls is 427/508 = 0.84.
The conditional probability that a female doesn’t use a cell phone for personal calls is 81/508 = 0.16.
To interpret these results, you say that within this sample if you’re male, you’re more likely than not to use your cell phone for personal calls (64 percent compared to 36 percent). However, the percentage of personal-call makers is higher for females (84 percent versus 16 percent).

The conclusions you can make from two-way tables in this chapter must refer only to the sample, not the population it came from. Before going on to make general statements about the conditional probability within a population, you need to construct a confidence interval or conduct a hypothesis test for a population proportion (which is equivalent to a probability). See Chapter 3 or your intro stats book for information on a hypothesis test for a population proportion.
Notice that for the males in the previous example, the two probabilities (0.64 and 0.36) add up to 1.00. This is no coincidence. The males have been broken down by cell-phone use for personal calls, and because everyone in the study is a cell-phone user, each male has to be classified in one group or the other. Similarly, the two probabilities for the females sum to 1.00.
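To see the "new grand total" idea in action, here is a small Python sketch that conditions on a row: it divides each cell count in the row by that row’s marginal total. The function name `conditional_given_row` is hypothetical, chosen just for this example.

```python
# Cell counts from the cell-phone example.
table = {
    "M": {"Yes": 325, "No": 183},
    "F": {"Yes": 427, "No": 81},
}

def conditional_given_row(table, row):
    """Divide each cell count in `row` by that row's marginal total."""
    row_total = sum(table[row].values())  # the row total acts as the new grand total
    return {col: n / row_total for col, n in table[row].items()}

for gender in table:
    probs = conditional_given_row(table, gender)
    for calls, p in probs.items():
        print(f"P({calls} | {gender}) = {p:.2f}")
```

Within each row the conditional probabilities sum to 1, because the row is the whole universe once you condition on it.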
Notation for conditional probabilities
Conditional probabilities are denoted by a straight up-and-down line that lists and separates the event that is known to have happened (what’s given) and the event for which you want to find the probability. You can write the notation like this: P(XX | XX). You place the given event to the right of the line and the event for which you want to find the probability to the left of the line. For example, suppose you know someone is female (F) and you want to find out the chance she is a Democrat (D). In this case, you’re looking for P(D | F). On the other hand, say you know a person is a Democrat and you want the probability that person is female; then you’re looking for P(F | D).
The straight up-and-down line in the conditional probability notation isn’t a division sign; the line is just a line separating events A and B. Also, be careful of the order in which you place A and B into the conditional probability notation. In general, P(A | B) ≠ P(B | A).
Following is the notation used for the conditional probabilities in the cell-phone example:
P(Yes | M) = 0.64. You can say it this way: "The probability of Yes given Male is 0.64."
P(No | M) = 0.36. In human terms, say "The probability of No given Male is 0.36."
P(Yes | F) = 0.84. Say this one with gusto: "The probability of Yes given Female is 0.84."
P(No | F) = 0.16. You translate this notation by saying "The probability of No given Female is 0.16."
You can see that P(Yes | M) + P(No | M) = 1.00 because you’re breaking all males into two groups: those using cell phones for personal calls (Yes) and those not (No). Notice, however, that P(Yes | M) + P(Yes | F) doesn’t sum to 1.00. In the first case, you’re looking only at the males, and in the second case, only at the females.
Comparing two groups with conditional probabilities
One of the most common questions regarding two categorical (qualitative) variables is this: Are they related? To answer this question, you use conditional probabilities. You set up and find the conditional probabilities you need to see whether two variables are related.
To compare the conditional probabilities, take one variable and find the conditional probabilities based on the other variable. Do this for each category of the first variable. Compare those conditional probabilities (you can even graph them for the two groups) and see whether they’re different or the same. (If the conditional probabilities are the same for each group, the variables aren’t related in the sample. If they’re different, the variables are related in the sample.) To be able to generalize the results, you need to use the sample results to draw a conclusion about the overall population involved by doing a Chi-square test (see Chapter 14).
Revisiting the cell-phone example from the previous section, you can ask specifically: Is personal use related to gender? You know that you want to compare cell-phone use for males and females to find out whether use is related to gender. However, it’s very difficult to compare cell counts — for example, 325 males use their phones for personal calls, compared to 427 females. In fact, it’s impossible to compare these numbers without using some total for perspective. Three hundred twenty-five out of what?
You have no way of comparing the cell counts in two groups without creating percentages (dividing each cell count by the appropriate total). Percentages give you a means of comparing two numbers on equal terms. For example, suppose you give a one-question opinion survey (yes, no, no opinion) to a random sample of 1,099 people; 465 respondents said yes, 357 said no, and 277 had no opinion. To truly interpret this information, you’re probably in your head trying to compare these numbers to each other. That’s what percentages do for you. Showing the percentage in each group in a side-by-side fashion gives you a relative comparison of the groups with each other.
But first, you need to bring conditional probabilities into the mix. In the cell-phone example, if you want the percentage of females who use their cell phones for personal calls, you take 427 divided by the total number of females (508) to get 84 percent. Similarly, to get the percentage of males who use their cell phones for personal calls, take the cell count (325) and divide it by that row total for males (508), which gives you 64 percent. This percentage is the conditional probability of using a cell phone for personal calls, given the person is male.
Now you’re ready to compare the males and females by using conditional probabilities. Take the percentage of females who use their cell phones for personal calls and compare it to the percentage of males who use their cell phones for personal calls. By finding these conditional probabilities, you can easily compare the two groups and say that, in this sample at least, a higher percentage of females (84 percent) than males (64 percent) use their cell phones for personal calls.
Using graphs to display conditional probabilities
One way to highlight conditional probabilities as a tool for comparing two groups is to use graphs such as a pie chart comparing the results of the other variable for each group or a bar chart comparing the results of the other variable for each group.
Figures 13-1a and 13-1b use two pie charts to compare males and females on cell-phone use. Figure 13-1a shows cell-phone use for only the males; this pie chart shows the conditional distribution of use for (given) males. Figure 13-1b shows the conditional distribution of cell-phone use for (given) females. A comparison of Figures 13-1a and 13-1b shows the slices for cell-phone use aren’t equal (or even close) for males compared to females. That result means that gender and cell-phone use for personal calls are dependent in this sample.
You may be wondering how close the two pie charts need to look (in terms of how close the slice amounts are for one pie compared to the other) in order to say the variables are independent. This question isn’t one you can answer completely until you conduct a hypothesis test for the proportions themselves (see the Chi-square test in Chapter 14). For now, with respect to your sample data, if the difference in the appearance of the slices for the two graphs is enough that you would write a newspaper article about it, then I’d go for dependence. Otherwise, conclude independence.
You can also make a bar chart to show the same idea. (For more info on pie charts and bar charts, see Statistics For Dummies [written by me and published by Wiley] or your intro stats textbook.)
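If you don’t have graphing software handy, even a rough text bar chart makes the comparison visible. The sketch below is one hypothetical way to do it in plain Python; the helper `text_bar` and its 50-character width are assumptions for illustration, not a standard tool.

```python
def text_bar(label, proportion, width=50):
    """Return one line of a text bar chart; bar length is proportional to `proportion`."""
    filled = round(proportion * width)
    bar = "#" * filled + " " * (width - filled)
    return f"{label:<8}|{bar}| {proportion:.0%}"

# Conditional probabilities of personal-call use, given gender.
yes_given_group = {"Males": 325 / 508, "Females": 427 / 508}

lines = [text_bar(group, p) for group, p in yes_given_group.items()]
print("\n".join(lines))
```

The two bars are drawn on the same scale, so a visibly longer bar for females reflects the gap between 84 percent and 64 percent.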
Another way you can make comparisons is to break down the two-way table by the column variable. (You don’t always have to use the row variable for comparisons.) In the cell-phone example (Table 13-3), you can compare the group of personal-call makers to the group of no-personal-call makers and see what percentage in each group is male and female. This type of comparison puts a different spin on the information, because you’re comparing the behaviors to each other, in terms of gender.
With this new breakdown of the two-way table, you get the following:
The conditional probability of being male, given you use your cell phone for personal calls, is P(M | Yes) = 325/752 = 0.43. Note: The denominator is 752, the total number of people who make personal calls with their cell phones.
The conditional probability of being female, given you use your cell phone for personal calls, is P(F | Yes) = 427/752 = 0.57.
The conditional probability of being male, given you don’t use your cell phone for personal calls, is P(M | No) = 183/264 = 0.69.
The conditional probability of being female, given you don’t use your cell phone for personal calls, is P(F | No) = 81/264 = 0.31.
Again, the first two probabilities add up to 1.00, because you’re breaking down the personal-call makers according to gender (male or female), and the last two probabilities sum to 1.00, because you’re breaking down the non-personal-call makers by gender (male or female).
Figure 13-1: Pie charts comparing male versus female personal cell-phone use.
The overall conclusions are similar to those found in the previous section, but the specific percentages and the interpretation are different. Interpreting the data this way, if you use your cell phone for personal calls, you’re more likely to be female than male (57 percent compared to 43 percent). And if you don’t use your cell phone to make personal calls, you’re more likely to be male (69 percent versus 31 percent).
To get the correct answer for any probability in a two-way table, here’s the trick: Always be sure to identify the group that is being examined. What is the probability "out of"? In the cell phone example (refer to Table 13-3):
If you want the percentage of all users who are males using their phones for personal calls, then you take the cell count 325, and divide by 1,016, the grand total.
If you want the percentage of males who are using their cell phones for personal calls, you take 325 divided by 508, the total number of males.
If you want the percentage of personal-call makers who are male, you take 325 divided by 752 (the total number of people who make personal calls with their cell phones).
In each of these three cases, the numerator is the same, but the denominators are different, leading you to very different answers. Deciding which number to divide by is a very common source of confusion for people, and this trick can really help give you an edge on keeping it straight.
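The three cases above can be put side by side in a short Python sketch: one numerator, three denominators, three different answers. The labels in the dictionary are my own wording for the "out of" group.

```python
# Same numerator every time: males who use their phones for personal calls.
count = 325

# The denominator depends on which group the probability is "out of".
denominators = {
    "all users (joint)": 1016,
    "males (conditional on gender)": 508,
    "personal-call makers (conditional on Yes)": 752,
}

results = {group: count / total for group, total in denominators.items()}

for group, p in results.items():
    print(f"325 out of {group}: {p:.2f}")
```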
Trying to be Independent
Independence is a big deal in statistics. The term generally means that two items have outcomes whose probabilities don’t affect each other. The items could be events A and B, variables X and Y, or survey results from two people selected at random from a population, and so on. If the outcomes of the two items do affect each other, statisticians call those two items dependent (or not independent). In this section, you check for and interpret independence of two categories of qualitative variables in a sample, and you check for and interpret independence of two qualitative variables in a sample.
Checking for independence between two categories
Statistics instructors often have students check to see whether two categories (one from a qualitative variable X and the other from a qualitative variable Y) are independent. I prefer to just compare the two groups and talk about how similar or different the percentages are, broken down by another variable. However, to cover all the bases and make sure you can answer this very popular question, here’s the official definition of independence, straight from the statistician’s mouth: Two categories are independent if their joint probability equals the product of their marginal probabilities. The only caveat here is that neither of the categories can be completely empty.
For example, if being female is independent of being a Democrat, then P(F ∩ D) = P(F) * P(D), where D = Democrat, F = Female, and ∩ denotes the intersection (both at once). So, to show that two categories are independent, find the joint probability and compare it to the product of the two marginal probabilities. If you get the same answer both times, the categories are independent. If not, then the categories are not independent, but rather, they are dependent.
You may be wondering: Don’t all probabilities work this way, where the joint probability equals the product of the marginals? No, they don’t. For example, if you draw a card from a standard 52-card deck, you get a red card with probability 1/2. You draw a black card with probability 1/2. The chance, though, of drawing both a black and red card with one draw is 0, while the product of the probabilities for black times red comes out to 1/2 * 1/2 = 1/4.
Now, if you look at a red card that is a two, the joint probability of a red two, which is 2/52 = 1/26, equals the probability of a red card (1/2) times the probability of a two, which is 4/52 (because 1/2 * 4/52 = 1/26).
Another way to check for independence is to compare the conditional probability to the marginal probability. Specifically, if you want to check whether being female is independent of being Democrat, check either of the following two situations (they’ll both work if the variables are independent):
Is P(F | D) = P(F)? That is, if you know someone is a Democrat, does the chance that the person is female stay the same? If yes, F and D are independent. If not, F and D are dependent.
Is P(D | F) = P(D)? This question is asking whether the chance of being a Democrat stays the same when you know the person is female. If yes, D and F are independent. If not, D and F are dependent.
Is knowing that you’re in one category going to change the probability of being in another category? If so, the two categories aren’t independent. If it doesn’t affect the probability, then the two categories are independent.
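The card-deck examples from this section make a handy test of the product rule. The sketch below uses Python’s `fractions.Fraction` so the comparison is exact rather than subject to rounding; the helper name `is_independent` is hypothetical.

```python
from fractions import Fraction

def is_independent(joint, marginal_a, marginal_b):
    """Two categories are independent when the joint equals the product of the marginals."""
    return joint == marginal_a * marginal_b

# Standard 52-card deck: 26 red cards, 26 black cards, 4 twos, 2 red twos.
p_red = Fraction(26, 52)
p_black = Fraction(26, 52)
p_two = Fraction(4, 52)

p_red_and_black = Fraction(0)      # one card can't be both colors
p_red_and_two = Fraction(2, 52)    # two of hearts, two of diamonds

print(is_independent(p_red_and_black, p_red, p_black))  # red vs. black
print(is_independent(p_red_and_two, p_red, p_two))      # red vs. two
```

Red and black fail the check (0 versus 1/4), while red and two pass it (1/26 on both sides). With sample data, exact equality is rare, which is why the formal test in Chapter 14 is needed to judge how close is close enough.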
Checking for independence between two variables
The discussion in the previous section focuses on checking if two specific categories are independent in a sample. If you want to extend this idea to showing that two entire categorical variables are independent, you must check the independence conditions for every combination of categories in those variables. All of them must work, or independence is lost. The first case where dependence is found between two categories means that the two variables are dependent. If you find that the first case shows independence, you must continue checking all the combinations before declaring independence.
Suppose a doctor’s office wants to know whether calling patients to confirm their appointments is related to whether they actually show up. The variables are X = called the patient (called or didn’t call) and Y = patient showed up for their appointment (showed or didn’t show). Here are the four conditions that need to hold before you declare independence:
1. P(showed) = P(showed | called)
2. P(showed) = P(showed | didn’t call)
3. P(didn’t show) = P(didn’t show | called)
4. P(didn’t show) = P(didn’t show | didn’t call)
If any one of these conditions isn’t met, you stop there and declare the two variables to be dependent in the sample. If (and only if) all the conditions are met, you declare the two variables independent in the sample.
You can see the results of a sample of 100 randomly selected patients in Table 13-4.
Table 13-4 Confirmation Calls Related to Showing Up for the Appointment

|               | Called | Didn’t Call | Row Totals |
|---------------|--------|-------------|------------|
| Showed        | 57     | 33          | 90         |
| Didn’t Show   | 3      | 7           | 10         |
| Column Totals | 60     | 40          | 100        |
Checking the conditions for independence, you can start at the first condition and check to see whether P(showed) = P(showed | called). From the last column of Table 13-4, you can see that P(showed) is equal to 90/100 = 0.90, or 90 percent. Next, you can find P(showed | called) by looking at the first column of Table 13-4. This probability is 57/60 = 0.95, or 95 percent. Because these two probabilities aren’t equal (although they’re close), you say that showing up and calling first are dependent in the sample. In other words, people come a little more often when you call them first. (To determine whether these sample results carry through to the population, which also takes care of the question of how close the probabilities need to be in order to conclude independence, see Chapter 14.)
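If you want to run all four checks at once, the comparison can be sketched in a few lines of Python. This is only an illustration, not part of the original study: the names `table_13_4` and `independence_checks` are made up for this example, and the counts are copied from Table 13-4. Exact fractions are used so that "equal" really means equal and isn't blurred by rounding.

```python
from fractions import Fraction

# Counts from Table 13-4: outer keys are outcomes (rows),
# inner keys are call status (columns).
table_13_4 = {
    "showed": {"called": 57, "didn't call": 33},
    "didn't show": {"called": 3, "didn't call": 7},
}

def independence_checks(counts):
    """Compare P(row) with P(row | column) for every cell of a two-way table."""
    grand_total = sum(sum(row.values()) for row in counts.values())
    columns = list(next(iter(counts.values())))
    column_totals = {c: sum(row[c] for row in counts.values()) for c in columns}
    checks = []
    for outcome, row in counts.items():
        # Marginal probability of the row category, ignoring the columns.
        marginal = Fraction(sum(row.values()), grand_total)
        for column in columns:
            # Conditional probability of the row category within one column.
            conditional = Fraction(row[column], column_totals[column])
            checks.append((outcome, column, marginal, conditional))
    return checks

checks = independence_checks(table_13_4)
for outcome, column, marginal, conditional in checks:
    verdict = "equal" if marginal == conditional else "not equal"
    print(f"P({outcome}) = {float(marginal):.3f}, "
          f"P({outcome} | {column}) = {float(conditional):.3f} -> {verdict}")

# The variables are independent in the sample only if every check passes.
independent_in_sample = all(m == c for _, _, m, c in checks)
print("Independent in the sample:", independent_in_sample)
```

The sketch prints every comparison for the sake of illustration; when you check by hand, you can stop at the first pair of probabilities that doesn’t match.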
Demystifying Simpson’s Paradox
Simpson’s Paradox is a phenomenon where results appear to be in direct contradiction to one another, which can make even the best student’s heart race. This situation can go unnoticed unless three variables (or more) are examined, in which case you organize the results into a three-way table, with columns within columns or rows within rows.
Simpson’s Paradox is a favorite among statistics instructors (because it’s so mystical and magical — and the numbers get so gooey and complex) but Simpson’s Paradox is a nonfavorite among many students, mainly because of the following two reasons (in my opinion):
Due to the way Simpson’s Paradox is presented in most statistics courses, you can easily get buried in the details and have no hope of seeing the big picture: Simpson’s Paradox presents a big problem in terms of interpreting data, and you need to understand it fully in order to avoid it.
Most textbooks do a good job of showing you examples of Simpson’s Paradox, but they do a not-so-good job of explaining why it occurs (some even neglect to explain the why part at all).
My goals in this section are for you to know what Simpson’s Paradox is, to be able to understand and explain why and how it happens, and to know how to be watchful for it. This is a tall order, I know, but stick with me.
Experiencing Simpson’s Paradox
Simpson’s Paradox was described in 1951 by a British statistician named E. H. Simpson. He realized that if you analyze some data sets one way, by breaking them down by two variables only, you can get one result, but when you break the data down further by a third variable, the results switch direction. That’s why his result is called Simpson’s Paradox; a paradox is an apparent contradiction in results.
In the following sections, you can see Simpson’s Paradox play out in an example and all the details in between.
Simpson’s Paradox in action: Video games and the gender gap
Suppose I am interested in finding out who is better at playing video games, men or women. I watch males and females choose and play a variety of video games, and each time someone plays a video game, I record whether he or she wins or loses. Suppose I record the results of 200 video games, as seen in Table 13-5. (Note that the females played 120 games, and the males played 80 games.)
Table 13-5: Video Games Won and Lost for Males versus Females

| All Games | Won | Lost | Marginal Row Totals |
|---|---|---|---|
| Males | 44 | 36 | 80 |
| Females | 84 | 36 | 120 |
| Marginal Column Totals | 128 | 72 | 200 (Grand Total) |
Looking at Table 13-5, you see the proportion of males who won their video games, P(Won | Male), is 44/80 = 0.55. The proportion of females who won their video games, P(Won | Female), is 84/120 = 0.70. So overall, the females won more of their video games than the males did. Does this finding mean that women are better than men at video games in general in the sample?
Not so fast, my friend. Notice that the people in the study were allowed to choose the video games they played. This factor blows the study wide open. Suppose females and males choose different types of video games: Can this affect the results? The answer may be yes. Considering other variables that could be related to the results but weren’t included in the original study (or at least not in the original data analysis) is important. These additional variables that cloud the results are called confounding variables.
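The overall winning proportions from Table 13-5 are easy to recompute. The following is a minimal sketch; the names `table_13_5` and `win_rate` are invented for this illustration, and the counts are simply copied from the table.

```python
# Counts from Table 13-5: (won, lost) for each group, all games combined.
table_13_5 = {"Males": (44, 36), "Females": (84, 36)}

def win_rate(won, lost):
    """Proportion of games won out of all games played."""
    return won / (won + lost)

overall = {group: win_rate(won, lost) for group, (won, lost) in table_13_5.items()}
for group, rate in overall.items():
    print(f"P(Won | {group}) = {rate:.2f}")
```

Notice that nothing in this calculation knows which games were played. That missing piece is exactly where a confounding variable can hide.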
Factoring in difficulty level
Many people may expect the video game results from the previous section to be turned around, with men being better at playing video games than women. According to the research, men spend more time playing video games, on average, and are by far the primary purchasers of video games, compared to women. So what explains the eyebrow-raising results in this study? Is there another possible explanation? Is important information missing that is relevant to this case?
One of the variables that wasn’t considered when I made Table 13-5 was the difficulty level of the video game being played. Suppose I go back and include the difficulty level of the chosen game each time, along with each result (won or lost). Level one indicates easy video games, comparable to the level of Ms. Pac Man (games that are my speed), and level two means more challenging video games (like war games or sophisticated strategy games).
Table 13-6 represents the results with this new information added on difficulty level of games played. You have three variables now: level of difficulty (one or two); gender (male or female); and outcome (won or lost). Statisticians therefore call Table 13-6 a three-way table.
Table 13-6: A Three-Way Table for Gender, Game Level, and Game Outcome

| | Level-One Games: Won | Level-One Games: Lost | Level-Two Games: Won | Level-Two Games: Lost |
|---|---|---|---|---|
| Males | 9 | 1 | 35 | 35 |
| Females | 72 | 18 | 12 | 18 |
Note in Table 13-6 that the number of level-one video games chosen was 9 + 1 + 72 + 18 = 100, and the number of level-two video games chosen was 35 + 35 + 12 + 18 = 100. But now you need to look at who chose which level of game. The next section probes this very issue.
Comparing success rates with conditional probabilities
To compare the success rates for males versus females using Table 13-6, you can figure out the appropriate conditional probabilities, first for level-one games and then for level-two games.
For level-one games (only), the conditional probability of winning given male is P(Won | Male) = 9/10 = 0.90. So for the level-one games, males won 90 percent of the games they played. For level-one games, the percentage of games won by the females is P(Won | Female) = 72/90 = 0.80, or 80 percent. These results mean that at level one, the males did 10 percentage points better than the females at winning their games. But this result appears to contradict the results found in Table 13-5. (Just wait; the contradictions don’t end here!)
Now figure the conditional probabilities for the level-two video games won. For the men, the percentage of level-two games won was 35/70 = 0.50, or 50 percent. For the ladies, the percentage of level-two games won was 12/30 = 0.40, or 40 percent. Once again, the males outdid the females!
Step back and think about this scenario for a minute. Table 13-5 shows that females won a higher percentage of the video games they played overall. But Table 13-6 shows that males won a higher percentage of the level-one games and that males won a higher percentage of the level-two games. What’s going on? No need to check your math. No mistakes were made, and no tricks were pulled. This inconsistency in results happens in real life from time to time in situations where an important third variable is left out of a study, a situation aptly named Simpson’s Paradox. (See why it’s called a paradox?)
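You can confirm the reversal yourself with a short Python sketch. The variable and function names here (`table_13_6`, `win_rate`, `by_level`, `pooled`) are made up for illustration, and the counts are copied from Table 13-6. The code computes the winning percentage within each level, then pools the two levels to recover the overall percentages from Table 13-5.

```python
# Counts from Table 13-6: (won, lost) by game level and gender.
table_13_6 = {
    "Level one": {"Males": (9, 1), "Females": (72, 18)},
    "Level two": {"Males": (35, 35), "Females": (12, 18)},
}

def win_rate(won, lost):
    """Proportion of games won out of all games played."""
    return won / (won + lost)

# Conditional winning percentages within each difficulty level.
by_level = {
    level: {group: win_rate(won, lost) for group, (won, lost) in groups.items()}
    for level, groups in table_13_6.items()
}

# Pooled winning percentages, ignoring difficulty level (this rebuilds Table 13-5).
pooled = {}
for group in ("Males", "Females"):
    won = sum(table_13_6[level][group][0] for level in table_13_6)
    lost = sum(table_13_6[level][group][1] for level in table_13_6)
    pooled[group] = win_rate(won, lost)

for level, rates in by_level.items():
    print(f"{level}: males {rates['Males']:.2f}, females {rates['Females']:.2f}")
print(f"All games: males {pooled['Males']:.2f}, females {pooled['Females']:.2f}")

# Simpson's Paradox: one group leads within every level but trails overall.
males_lead_every_level = all(r["Males"] > r["Females"] for r in by_level.values())
females_lead_overall = pooled["Females"] > pooled["Males"]
print("Reversal present:", males_lead_every_level and females_lead_overall)
```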
Asking why: Simpson’s Paradox
Confounding variables are the underlying cause of Simpson’s Paradox. (A confounding variable is a third variable that’s related to each of the other two variables and can affect the results if not accounted for.)
In the video game example, when you look at the video game outcomes (won or lost) broken down by gender only (Table 13-5), females won a higher percentage of their overall games than males (70 percent overall winning percentage for females compared to 55 percent overall winning percentage for males). Yet, when you split up the results by the level of the video game (level one or level two; see Table 13-6), the results reverse themselves, and you see that males did better than females on the level-one games (90 percent to 80 percent), and males also did better on the level-two games (50 percent versus 40 percent).
To see why this seemingly impossible result happens, take a look at the marginal row probabilities versus the marginal row totals in Table 13-6 (for the level-one games). The percentage of times a male won when he played an easy video game was 90 percent. However, males chose level-one video games only 10 times (out of 80 total games played by men; that’s only 12.5 percent).
To break this idea down further, the males’ non-stellar performance on the challenging video games (50 percent — but still better than the females) coupled with the fact that the males chose challenging video games 70 out of 80 = 87.5 percent of the time really brought down that overall winning percentage (55 percent). And even though the men did really well on the level-one video games, they didn’t play many of them (compared to the females), so their high winning percentage on level-one video games (90 percent) didn’t count much toward their overall winning percentage.
Meanwhile, in Table 13-6, you see that females chose level-one video games 90 times (out of 120). Even though the females only won 72 out of the 90 games (80 percent, a lower percentage than the males), they chose to play many more of the level-one games, boosting their overall winning percentage.
Now the opposite situation happens when you look at the level-two video games in Table 13-6. The males chose the harder video games 70 times (out of 80), while the females only chose the harder ones 30 times out of 120. The males did better than the females on level-two video games (winning 50 percent of them versus 40 percent for the females). However, level-two video games are harder to win than level-one video games. This factor means that the males’ winning percentage on level-two video games, being only 50 percent, doesn’t do much to raise their overall winning percentage, even though most of their games were at this level. Meanwhile, the low winning percentage for females on level-two video games doesn’t hurt them much, because they didn’t play many level-two video games.
The bottom line is that the occurrence or non-occurrence of Simpson’s Paradox is a matter of weights. In the overall totals from Table 13-5, the males don’t look as good as the females. But when you add in the difficulty of the games (shown in Table 13-6), you see that most of the males’ wins came from harder games (which have a lower winning percentage). The females played many more of the easier games on average, and easy games have a higher chance of winning no matter who plays them. So it all boils down to this: Which games did the males choose to play, and which games did the females choose to play? The males chose harder games, which contributed in a negative way to their overall winning percentage and made the females look better than they actually were.
Level of game wasn’t included in the original summary, Table 13-5, but it should have been included because it’s a variable that affected the results. Level of game, in this case, was the confounding variable.
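The "matter of weights" idea can be made concrete: each group’s overall winning percentage is a weighted average of its level-one and level-two winning percentages, where the weights are the shares of games played at each level. The sketch below is only an illustration; the names `games` and `weighted_breakdown` are invented here, and the counts come from Table 13-6.

```python
from fractions import Fraction

# Counts from Table 13-6, regrouped by gender: (won, lost) at each level.
games = {
    "Males": {"Level one": (9, 1), "Level two": (35, 35)},
    "Females": {"Level one": (72, 18), "Level two": (12, 18)},
}

def weighted_breakdown(levels):
    """Return per-level (weight, win rate) pairs and the overall win rate."""
    total_played = sum(won + lost for won, lost in levels.values())
    parts = {}
    for level, (won, lost) in levels.items():
        played = won + lost
        weight = Fraction(played, total_played)  # share of games at this level
        rate = Fraction(won, played)             # win rate at this level
        parts[level] = (weight, rate)
    # Overall win rate = sum over levels of (weight x win rate).
    overall = sum(weight * rate for weight, rate in parts.values())
    return parts, overall

for group, levels in games.items():
    parts, overall = weighted_breakdown(levels)
    terms = " + ".join(
        f"{float(weight):.3f} x {float(rate):.2f}" for weight, rate in parts.values()
    )
    print(f"{group}: {terms} = {float(overall):.2f}")
```

For the males, the 50 percent level-two rate carries a weight of 87.5 percent, while for the females the 80 percent level-one rate carries a weight of 75 percent. Those weights, not the within-level rates, are what push the overall percentages to 55 and 70.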
Keeping one eye open for Simpson’s Paradox
Simpson’s Paradox shows you the importance of including data about possible confounding variables when attempting to look at relationships between qualitative variables.
In the video game example I use in previous sections, level of difficulty of the game was a confounding variable; more men chose to play the more difficult games, which are harder to win, thereby lowering their overall success rate.
You can avoid Simpson’s Paradox by making sure that obvious confounding variables are included in a study; that way, when you look at the data you get the relationships right the first time, and no room exists for misconstruing the results. And as with all other statistical results, if it looks too good to be true, or too simple to be correct, it probably is! Beware of anyone who tries to oversimplify a result. While three-way tables are more difficult to examine, they are often worth using.