In This Chapter
Recognizing and avoiding mistakes when interpreting statistical results
Knowing how to decide whether or not someone’s conclusions are credible
Intermediate statistics is all about building models and doing data analysis. It focuses on looking at data and figuring out the story behind it. It’s about making sure that the story is told correctly, fairly, and comprehensively. In this chapter, I discuss some of the most common errors I’ve seen as a teacher and statistical consultant for many moons. You can use this list to pull ideas together for homework and reports or as a quick review before a quiz or exam. Trust me: your professor will love you for it!
These Statistics Prove…
Be skeptical of anyone who uses the words "these statistics" and "prove" in the same sentence. The word "prove" is a definitive, end-all-be-all, case-closed, lead-pipe-lock sort of concept, and statistics by nature isn’t definitive. Instead, statistics gives you evidence for or against your theory, model, or claim, based on the data you collected; then it leaves you to your own conclusions. Because the evidence is based on data, and data changes from sample to sample, the results can change as well; that’s the challenge, the beauty, and sometimes the frustration of statistics. The best you can say is that your statistics suggest, lead you to believe, or give you sufficient evidence to conclude something, but never go as far as to say that your statistics prove anything.
It’s Not Technically Statistically Significant, But…
After you set up your model and test it with your data, you have to stand by the conclusions no matter how much you believe they’re wrong. Statistics must lend objectivity to every process.
Suppose Barb, a researcher, has just collected and analyzed the heck out of her data, and she still can’t find anything. However, she knows in her heart that her theory holds true, even if her data can’t confirm it. Barb’s theory is that dogs have ESP — in other words, a "sixth sense." She bases this theory on the fact that her dog seems to know when she’s leaving the house, when he’s going to the vet, and when a bath is imminent, because he gets sad and finds a corner to hide in.
Barb tests her ESP theory by studying ten dogs, placing a piece of dog food under one of two bowls and asking each dog to find the food by pushing on a bowl. (Assume the bowl is thick enough that the dogs can’t cheat by smelling the food.) She repeats this process ten times with each dog and records the number of correct answers. If the dogs don’t have ESP, you would expect that they would be right 50 percent of the time, because each dog has two bowls to choose from and each bowl has an equal chance of being selected.
As it turns out, the dogs were right 55 percent of the time. Now this percentage is technically higher than the long-term expected value of 50 percent, but it’s not enough (especially with so few dogs and so few trials) to warrant statistical significance. In other words, Barb doesn’t have enough evidence for the ESP theory. But when Barb presents her results at the next conference she attends, she puts a spin on her results by saying "The dogs were correct 55 percent of the time, which is more than 50 percent. These results are technically not enough to be statistically significant, but I believe they do show some evidence that dogs have ESP."
Some statistically incorrect researchers use this kind of conclusion all the time — skating around the statistics when they don’t go their way. This game is very dangerous, because the next time someone tries to replicate Barb’s results (and believe me, someone always does), they find out what you knew from the beginning (through ESP?): When Barb starts packing for a trip, her dog senses trouble coming and hides. That’s all.
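To see how far 55 percent falls short, here’s a rough sketch of the calculation in Python. It assumes all 100 trials (ten dogs times ten tries) are independent 50-50 guesses under the no-ESP theory, which is a simplification; the function name is only for illustration, and the exact binomial tail is my choice of test rather than anything Barb is described as using.

```python
from math import comb


def upper_tail_binomial(successes: int, trials: int, p: float) -> float:
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumption: 10 dogs x 10 trials treated as 100 independent guesses.
trials = 10 * 10
correct = 55  # 55 percent of 100 trials
p_value = upper_tail_binomial(correct, trials, 0.5)

print(f"P(at least {correct} of {trials} correct by pure guessing) = {p_value:.3f}")
```

The probability comes out to roughly 0.18, well above the usual 0.05 cutoff, so a result like Barb’s is easy to get by guessing alone.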
This Means X Causes Y
Do you see the word that makes statisticians nervous? Because the words "this" and "means" seem pretty tame, and X and Y are just letters of the alphabet, it’s got to be that word "cause." Of all the words on a final exam that aren’t supposed to be there, "cause" probably tops the list.
Here’s an example of what I mean. For your final report in stats class, you study which factors are related to your final exam score. You collect data on 500 statistics students, asking each one a variety of questions, such as "What was your grade on the midterm?"; "How much sleep did you get the night before the final?"; and "What is your GPA?" You conduct a multiple linear regression analysis (using techniques from Chapter 5), and you conclude that study time and the amount of sleep the night before are the most important factors in determining exam scores. You write up all your analyses in a paper, and at the very end you say, "These results demonstrate that more study time and a good night of sleep the night before cause your exam grade to be higher."
I was with you until you said the word "cause." You can’t say that more sleep or more study time causes an increase in exam score. The data you collected shows that people who get a lot of sleep and study a lot do get good grades, and those who don’t don’t get the good grades. But that result doesn’t mean you can take a flunky and just have him sleep and study more, and all will be okay. This theory is like saying that because an increase in height is related to an increase in weight, you can get taller by gaining weight.
The problem is that you didn’t take an individual person, change his sleep time and study habits, and see what happened in terms of exam performance (using two different exams of the same difficulty). That study requires a designed experiment. When you conduct a survey, you have no way of controlling other related factors going on, which can muddy the waters.
The only way to control for other factors is to do a randomized experiment (complete with a treatment group, a control group, and controls for other factors that may ordinarily affect the outcome). Claiming causation without conducting a randomized experiment is a very common error some researchers make when they draw conclusions.
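The random part of a randomized experiment can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the roster of 20 students is hypothetical, the helper name is made up, and a real study would also need the other controls mentioned above.

```python
import random


def assign_groups(subjects, seed=None):
    """Randomly split subjects into treatment and control groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the caller's list is left alone
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical roster; the seed is fixed only so the split can be reproduced.
students = [f"student_{i}" for i in range(1, 21)]
treatment, control = assign_groups(students, seed=42)

print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Because chance alone decides who lands in which group, other factors (GPA, motivation, caffeine habits) tend to balance out between the groups rather than line up with the treatment.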
I Assumed the Data Was Normal…
The operative word here is "assumed." To break it down simply, an assumption is something you believe without checking. Assumptions can lead to wrong analyses and incorrect results, all without the person doing the assuming even knowing it.
Many analyses have certain requirements. For example, data should come from a normal distribution (the classic distribution that has a bell shape to it). If someone says "I assumed the data was normal," she just assumed that the data came from a normal distribution. But is having a normal distribution an assumption you just make and then move on, or is more work involved? You guessed it — more work.
For example, in order to conduct a one-sample t-test (see Chapter 3), your data must come from a normal distribution unless your sample size is large, in which case the sample mean has an approximately normal distribution anyway by the Central Limit Theorem (remember those three words from intro stats?). Here, you aren’t making an assumption, but examining a condition (something you check before proceeding). You plot the data, see if it meets the condition, and if it does, you proceed. If not, you can use nonparametric methods instead (Chapter 16).
Nearly every statistical technique for analyzing data has at least some condition(s) on the data in order for you to use it. Always find out what those conditions are, and check to see whether your data meets them. Be aware that many statistics textbooks wrongly use the word "assumption" when they actually mean "condition." It’s a subtle, but very important, difference.
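Checking the normality condition usually means looking at a plot. As a stand-in for eyeballing a normal probability plot, the sketch below computes the correlation between the sorted data and the normal quantiles they would line up with if the data were normal; values very close to 1 suggest a roughly straight plot. This is a swapped-in numerical shortcut, not a formal test, and both data sets are simulated purely for illustration.

```python
import random
from statistics import NormalDist, mean


def normal_plot_correlation(data):
    """Correlation between sorted data and matching standard normal quantiles."""
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n stay strictly between 0 and 1.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = mean(theoretical), mean(ordered)
    sxy = sum((x - mx) * (y - my) for x, y in zip(theoretical, ordered))
    sxx = sum((x - mx) ** 2 for x in theoretical)
    syy = sum((y - my) ** 2 for y in ordered)
    return sxy / (sxx * syy) ** 0.5


# Simulated samples (assumptions for the demo, not real data).
rng = random.Random(1)
bell_shaped = [rng.gauss(70, 10) for _ in range(200)]
skewed = [rng.expovariate(1 / 10) for _ in range(200)]

r_bell = normal_plot_correlation(bell_shaped)
r_skewed = normal_plot_correlation(skewed)
print(f"bell-shaped sample: r = {r_bell:.3f}")
print(f"skewed sample:      r = {r_skewed:.3f}")
```

The bell-shaped sample should give a correlation noticeably closer to 1 than the skewed one. A number like this is a supplement to the plot, not a replacement for looking at it.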
I’m Only Reporting "Important" Results
As a data analyst, you must not only avoid the pitfall of reporting only the significant, exciting, and meaningful results, but you also have to be able to detect when someone else is doing so. Some number crunchers examine every possible option and look at their data in every possible way before settling on the analysis that got them the desired result.
You can probably see the problem here. Every technique has a chance for error along with it. If you’re doing a t-test, for example, and the α level is 0.05, over the long term 5 out of every 100 t-tests you conduct on data where nothing is really going on will result in a false alarm just by chance (you declare a statistically significant result when it wasn’t really there). So, if an eager researcher conducts 20 hypothesis tests on the same data set, he can expect about one of those tests to result in a false alarm just by chance. As this researcher conducts more and more tests, he’s unfairly increasing his odds of "finding something" and running the risk of a wrong conclusion in the process.
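The arithmetic behind the 20-test example is short enough to check directly. This sketch assumes the 20 tests are independent and that nothing real is going on in any of them; tests run on the same data set are usually not fully independent, so treat the numbers as a rough guide.

```python
alpha = 0.05    # chance of a false alarm on any single test
num_tests = 20  # tests run by the eager researcher

# Long-run average number of false alarms across the 20 tests.
expected_false_alarms = alpha * num_tests

# Chance that at least one of the 20 tests gives a false alarm
# (assumes the tests are independent).
prob_at_least_one = 1 - (1 - alpha) ** num_tests

print(f"Expected false alarms: {expected_false_alarms:.1f}")
print(f"P(at least one false alarm): {prob_at_least_one:.2f}")
```

Under these assumptions the researcher averages one false alarm per 20 tests and has about a 64 percent chance of getting at least one.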
It’s not all the eager researcher’s fault. He’s pressured by a result-driven system. It’s a sad state of affairs when the only results that get broadcast on the news and appear in journal articles are the ones that show a statistically significant result (when H₀ is rejected). Perhaps it was a bad choice when statisticians came up with the term "significance" to denote rejecting H₀, as if to say that rejecting H₀ is the only important conclusion you can come to. What about all the times when H₀ couldn’t be rejected? For example, when doctors failed to conclude that drinking diet cola causes weight gain, or when pollsters didn’t find that people were unhappy with the president? The public would be better served if researchers and the media were encouraged to spend at least some time reporting the statistically insignificant but still important results, along with the statistically significant ones.
The bottom line is this: In order to find out whether a statistical conclusion is correct, you can’t just look at the analysis the researcher is showing you. You also have to find out about the analyses and results they’re not showing you and ask questions. Avoid the urge to rush to reject H₀.
A Bigger Sample Is Always Better
Bigger is better in some things, but not always with sample sizes. On one hand, the bigger your sample is, the more precise the results are (if no bias is present). A bigger sample also increases the ability of your data analysis to detect differences from a model or to deny some claim about a population (in other words, to reject H₀ when you’re supposed to). This ability to detect true differences from H₀ is called the power of a test (see Chapter 3). However, some researchers can (and often do) take the idea of power too far. They increase the sample size to the point where even the tiniest difference from H₀ sends them screaming to press that all-important reject H₀ button.
Suppose research claims that the typical in-house dog watches an average of ten hours of TV per week. Bob thinks the true average is more, based on the fact that his dog Fido watches at least ten hours of cooking shows alone each week. Bob sets up the following hypothesis test: H₀: μ = 10 versus Hₐ: μ > 10. He takes a random sample of 100 dogs and has their owners record how much TV their dogs watch per week. The sample mean turns out to be 10.1 hours, and the sample standard deviation is 0.8 hours. This result isn’t what Bob hoped for because 10.1 is so close to 10. He calculates the test statistic for this test using the formula

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

and comes up with a value of

$$t = \frac{10.1 - 10.0}{0.8/\sqrt{100}} = \frac{0.1}{0.08} = 1.25.$$

Because the test is a right-tailed test (> in Hₐ), he can reject H₀ at α = 0.05 if t is beyond 1.645, and his t-value of 1.25 is far short of that value. Note that because n = 100 here, you find the value of 1.645 by looking at the very last row of the t-distribution table (Table A-1 in the Appendix). The row is marked with the infinity sign to indicate a large sample. So Bob can’t reject H₀.
To add insult to injury, Bob’s friend Joe conducts the same study and gets the same sample mean and standard deviation as Bob did, but Joe uses a random sample of 500 dogs rather than 100. Consequently, Joe’s t-value is

$$t = \frac{10.1 - 10.0}{0.8/\sqrt{500}} \approx \frac{0.1}{0.036} \approx 2.78.$$

Because 2.78 is greater than 1.645, Joe gets to reject H₀ (to Bob’s dismay).
Why did Joe’s test find a result that Bob’s didn’t? The only difference was the sample size. Joe’s sample was bigger, and a bigger sample size always makes the standard error smaller (see Chapter 3). The standard error sits in the denominator of the t-formula (as you just saw), so as it gets smaller, the t-value gets larger. A larger t-value makes it easier to reject H₀. (See Chapter 3 for more on precision and margin of error.)
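The effect of sample size on the t-value is easy to reproduce. The sketch below plugs Bob’s and Joe’s numbers into the one-sample t-formula and compares each result with the 1.645 cutoff quoted in the text; the function and variable names are mine. Joe’s value lands near 2.80 here because the standard error isn’t rounded to 0.036 along the way.

```python
from math import sqrt


def one_sample_t(sample_mean, claimed_mean, sample_sd, n):
    """t = (sample mean - claimed mean) / (sample sd / sqrt(n))."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - claimed_mean) / standard_error


CRITICAL_VALUE = 1.645  # right-tailed cutoff quoted in the text

t_values = {}
for name, n in [("Bob", 100), ("Joe", 500)]:
    t = one_sample_t(10.1, 10.0, 0.8, n)
    t_values[name] = t
    verdict = "reject H0" if t > CRITICAL_VALUE else "cannot reject H0"
    print(f"{name}: n = {n}, t = {t:.2f}, {verdict}")
```

Same sample mean, same standard deviation, different verdicts: the only thing that changed between the two runs is n.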
Now, Joe could technically give a big press conference or write an article on his results (his mom would be so proud), but you know better. You know that Joe’s results are technically statistically significant, but not practically significant; they don’t mean squat to any person or dog. After all, who cares that he was able to show evidence that dogs watch just a tiny bit more than ten hours of TV per week? This news isn’t exactly earth-shattering.
Sample sizes should be large enough to provide precision and repeatability of your results, but there is such a thing as being too large, believe it or not. You can always take sample sizes big enough to reject any null hypothesis, even when the actual deviation from it is embarrassingly small. What can you do about this? When you read or hear that a result was deemed statistically significant, ask what the sample mean actually was (before it was put into the t-formula) and see how significant it is to you from a practical standpoint. Beware of someone who says, "These results are statistically significant, and the large sample size of 100,000 gives even stronger evidence for that."
It’s Not Technically Random, But…
When you take a sample on which to build statistical results, the operative word is "random." You want the sample to be randomly selected from the population. The problem is that people oftentimes collect a sample that they think is mostly random or sort of random or random enough, and that doesn’t cut it. The plan for taking a sample is either random or it isn’t.
One day I gave each student in my class of 50 a number from 1 to 50, and I drew two numbers randomly from a hat. The two students I picked sat in the first row, and not only that, they sat right next to each other. Students immediately cried foul!
After these seemingly odd results appeared, I took the opportunity to talk to my class about truly random samples. A random sample is chosen in such a way that every member of the original population has an equal chance of being selected. Sometimes people who sit next to each other are chosen. In fact, if these seemingly strange results never happen, you may worry about the process; in a truly random process, you’re going to get results that may seem odd, weird, or even fixed. That’s part of the game.

In my consulting experiences, I always ask how my clients chose or plan to choose their samples. They always say they’ll make sure it’s random. But when I ask them how they’ll do this, I sometimes get less-than-stellar answers. For example, someone needed to get a random sample from a population of 500 free-range chickens in a farmyard. He needed five chickens and said that he’d select them randomly by choosing the five that came up to him first. The problem is, animals that come up to you may be friendlier, more docile, older, or perhaps more tame. These characteristics aren’t present in every chicken in the yard, so choosing a sample this way isn’t random. The results are likely biased in this case.
Always ask the researcher how she selected a sample, and when you select your own samples, stay true to the definition of random. And don’t use your own judgment to choose a random sample; use a computer to do it for you!
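Letting a computer pick the sample takes only a few lines. Here’s a minimal sketch for the chicken example, assuming each of the 500 chickens has been tagged with a number from 1 to 500 (the tagging scheme is my assumption, not part of the story).

```python
import random

# Assumption: chickens are tagged 1 through 500.
population = list(range(1, 501))

# The seed is fixed only so the draw can be reproduced and audited;
# leave it out for a fresh draw each time.
rng = random.Random(2024)
chosen = rng.sample(population, 5)  # 5 distinct tags, each chicken equally likely

print("Chickens to catch:", sorted(chosen))
```

The friendly chickens and the skittish ones now have exactly the same chance of ending up in the sample.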
1,000 Responses Is 1,000 Responses
A newspaper article on the latest survey says that 50 percent of the respondents said blah blah blah. The fine print says the results are based on a survey of 1,000 adults in the United States. But wait — is 1,000 the actual number of people selected for the sample, or is it the final number of respondents? You may need to take a second look; those two numbers hardly ever match.
For example, Jenny wants to know what percentage of people in the U.S. have ever knowingly cheated on their taxes. In her statistics class, she found out that if she gets a sample of 1,000 people, the margin of error for her survey is only plus or minus 3 percent, which she thinks is groovy. So she sets out to achieve the goal of 1,000 responses to her survey. She knows that these days it’s hard to get people to respond to a survey, and she’s worried that she may lose a great deal of her sample that way, so she has an idea. Why not send out more surveys than she needs, so that she gets 1,000 surveys back?
Jenny looks at several survey results in the newspapers, magazines, and on the Internet, and she finds that the response rate (the percentage of people who actually responded to the survey) is typically around 25 percent. (In terms of the real world, I’m being generous with this number, believe it or not. But think about it: How many surveys have you thrown away lately? Don’t worry, I’m guilty of it too.) So, Jenny does the math and figures that if she sends out 4,000 surveys and gets 25 percent of them back, she has the 1,000 surveys she needs to do her analysis, answer her question, and have that small margin of error of plus or minus 3 percent.
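Jenny’s planning math can be sketched as follows. The margin-of-error function assumes the usual 95 percent formula for a proportion with the conservative value p = 0.5, which is my assumption about where her "plus or minus 3 percent" comes from; the names are illustrative.

```python
from math import ceil, sqrt


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95 percent margin of error for a sample proportion."""
    return z * sqrt(p * (1 - p) / n)


target_responses = 1000
response_rate = 0.25  # typical rate Jenny found

surveys_to_send = ceil(target_responses / response_rate)
moe = margin_of_error(target_responses)

print(f"Surveys to send: {surveys_to_send}")
print(f"Margin of error with {target_responses} responses: +/- {moe:.1%}")
```

The formula only knows the number of responses. It has no idea how many surveys went unanswered or why, which is exactly where the trouble starts.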
Jenny conducts her survey, and just like clockwork, out of the 4,000 surveys she sends out, 1,000 come back. She goes ahead with her analysis and finds that 400 of those people reported cheating on their taxes (40 percent). She adds her margin of error, and reports, "Based on my survey data, 40 percent of Americans cheat on their taxes, plus or minus 3 percentage points."

Now hold the phone, Jenny. She only knows what those 1,000 people who returned the survey said. She has no idea what the other 3,000 people said. And here’s the kicker: Whether or not someone responds to a survey is often related to the reason the survey is being done. It’s not a random thing. Those nonrespondents (people who don’t respond to a survey) carry a lot of weight in terms of what they’re not taking time to tell you.
For the sake of argument, suppose that 2,000 of the people who originally got the survey were uncomfortable with the question because they do cheat on their taxes, and they just didn’t want anyone to know about it, so they threw the survey in the trash. Suppose that the other 1,000 people don’t cheat on their taxes, so they didn’t think it was an issue and didn’t return the survey. If these two scenarios were true, the results would look like this:
Cheaters = 400 (surveyed) + 2,000 (nonrespondents) = 2,400
These results raise the total percentage of cheaters to 2,400 divided by 4,000 — 60 percent. That’s a huge difference!
You could go completely the other way with the 3,000 nonrespondents. You can suppose that none of them cheat, but they just didn’t take time to say so. If you knew this info, you would get 600 (surveyed) + 3,000 (nonrespondents) = 3,600 noncheaters. Out of 4,000 surveyed, this is 90 percent. The truth is likely to be somewhere between the two examples I just gave you, but nonrespondents make it too hard to tell.
And the worst part is that the formulas Jenny uses for margin of error don’t know that the information she put into them is based on biased data, so her reported 3 percent margin of error is wrong. The formulas happily crank out results no matter what. It’s up to you to make sure that what you put into the formulas is good, clean info.
Getting 1,000 results when you send out 4,000 surveys is nowhere near as good as getting 1,000 results when sending out 1,000 surveys (or even 100 results from 100 surveys). Plan your survey based on how much follow-up you can do with people to get the job done, and if it takes a smaller sample size, so be it. At least the results have a better chance of being statistically correct.
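To see how much room the 3,000 nonrespondents leave, the sketch below tallies the percentage of cheaters among all 4,000 people under the scenarios from Jenny’s example, plus the extreme case where every nonrespondent cheats (that last case is my addition to show the upper limit). The variable and function names are illustrative.

```python
sent = 4000
returned = 1000
admitted_cheating = 400
nonrespondents = sent - returned


def cheater_share(cheaters_among_nonrespondents):
    """Share of all 4,000 people who cheat, given a guess about the silent 3,000."""
    return (admitted_cheating + cheaters_among_nonrespondents) / sent


reported = admitted_cheating / returned        # what Jenny announced
scenario_2000 = cheater_share(2000)            # 2,000 silent cheaters
lowest = cheater_share(0)                      # no nonrespondent cheats
highest = cheater_share(nonrespondents)        # every nonrespondent cheats

print(f"Reported by Jenny:        {reported:.0%}")
print(f"2,000 silent cheaters:    {scenario_2000:.0%}")
print(f"No silent cheaters:       {lowest:.0%}")
print(f"All nonrespondents cheat: {highest:.0%}")
```

With that much spread in the possible answers, a plus-or-minus 3 percent margin of error on the reported 40 percent badly understates the uncertainty.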
Of Course These Results Apply to the General Population!
Making conclusions about a much broader population than your sample actually represents is one of the biggest no-no’s in statistics. This kind of problem is called generalization, and it occurs more often than you may think. People want their results instantly; they don’t want to wait for them, so well-planned surveys and experiments take a back seat to instant Web surveys and convenience samples.

For example, a researcher wants to know how cable news channels have influenced the way Americans get their news. He also happens to be a statistics professor at a large research institution and has 1,000 students in his class. He decides that instead of taking a random sample of Americans, which would be difficult, time-consuming, and expensive, he just puts a question on his final exam to get his students’ answers. His data analysis shows him that only 5 percent of his students read the newspaper and/or watch network news programs anymore; the rest watch cable news. For his class, the ratio of students who exclusively watch cable news compared to those students who don’t is 95 to 5, or roughly 20 to 1. The professor reports this and sends out a press release about it. The cable news channels pick up on it and the next day are reporting, "Americans choose cable news channels over newspapers and network news by a 20 to 1 margin!"
Do you see what’s wrong with this picture? The problem is that the professor’s conclusions go way beyond his study, which is wrong. He used the students in his statistics class to obtain the data that serves as the basis for his entire report and the resulting headline. Yet the professor reports the results about all Americans. I think it’s safe to say that a sample of 1,000 college students taking a statistics class at the same time at the same college doesn’t represent a cross section of America.
If the professor wants to make conclusions in the end about America, he has to select a random sample of Americans to take his survey. If he uses 1,000 students from his class, then his conclusions can only be made about that class and no one else.
To avoid or detect generalization, identify the population that you’re intending to make conclusions about and make sure the sample you selected represents that population. If the sample represents a smaller group within that population, then the conclusions have to be downsized in scope also.
I Just Decided to Leave It Out
It seems easier sometimes to just leave information out. I see this all too often when I read articles and reports based on statistics. But this error isn’t the fault of only one person or group. The guilty parties can include the following:
The producers: The researchers out there leave items out for a variety of reasons, including time and space constraints. After all, you can’t write about every element of the experiment from beginning to end. However, other items they leave out may be indicative of a bigger problem. For example, reports often say very little about how they collected the data or chose the sample. Or they may discuss the results of a survey but not show the actual questions they asked. Ten out of 100 people may have dropped out of their experiment, and they don’t tell you why. All these items are important to know before making a decision about the credibility of someone’s results.
Another way in which some data analysts leave information out is by removing data that doesn’t fit the intended model (in other words, "fudging" the data). Suppose a researcher records the amount of time surfing the Internet and relates it to age. He fits a nice line to his data indicating that younger people surf the Internet much more than older people and that surf time decreases as age increases. All is good except for Claude the outlier, who is 80-years-old and surfs the Internet day and night, leading his own bingo chat rooms and everything. What to do with Claude? If not for him, the relationship looks beautiful on the graph; what harm would it do to remove him? After all, he’s only one person, right?
No way. Everything is wrong with this idea. Removing undesired data points from a data set is not only very wrong but also very risky. The only time it’s okay to remove an observation from a data set is if you’re certain beyond doubt that the observation is just plain wrong. For example, someone writes on a survey that she spends 30 hours a day surfing the Internet or that her IQ is 2,200.
The communicators: When reporting statistical results, the media leaves out important information all the time, which is often due to space limitations and fast deadlines. However, part of it is a result of the current, fast-paced society that feeds itself on sound bites. The best example is survey results, where they often leave out the size of the sample. You can’t calculate margin of error without it.
The consumers: The general public also plays a role in the leave-things-out mindset. People hear a news story and instantly believe it’s true, ignoring any chance for error or bias in the results. You need to make a decision about what car to buy, and you ask your neighbors and friends rather than examine the research and the meticulous, comprehensive ratings that have resulted. Everyone neglects to ask questions as much as he should, at one time or another, which indirectly feeds the entire problem.
In the chain of statistical information, the producers (researchers) need to be comprehensive and forthcoming about the process they conducted and the results they got. The communicators of that information (the media) need to critically evaluate the accuracy of the information they’re getting and report it fairly. The consumers of statistical information (the rest of us) need to stop taking results for granted and to rely on credible sources of statistical studies and analyses to help make those important life decisions.
In the end, if a data set looks too good to be true, it probably is. If the model fits too perfectly, be suspicious. If it fits exactly right, run and don’t look back! Sometimes what is left out speaks much louder than what is put in.