<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ankar&#8217;s Blog &#187; Data Analysis and Model-Building Basics</title>
	<atom:link href="http://ankar.info/category/data-analysis-and-model-building-basics/feed/" rel="self" type="application/rss+xml" />
	<link>http://ankar.info</link>
	<description>Еби гусей, спасай Россию</description>
	<lastBuildDate>Wed, 08 Feb 2012 09:02:43 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.4</generator>
		<item>
		<title>Sorting through Statistical Techniques</title>
		<link>http://ankar.info/2010/05/15/sorting-through-statistical-techniques/</link>
		<comments>http://ankar.info/2010/05/15/sorting-through-statistical-techniques/#comments</comments>
		<pubDate>Sat, 15 May 2010 18:30:14 +0000</pubDate>
		<dc:creator>Анкар</dc:creator>
				<category><![CDATA[Data Analysis and Model-Building Basics]]></category>

		<guid isPermaLink="false">http://ankar.info/2010/05/15/sorting-through-statistical-techniques/</guid>
		<description><![CDATA[In This Chapter: Deciphering the difference between qualitative and quantitative variables; choosing appropriate statistical techniques for the task at hand; evaluating bias and precision levels; interpreting the results properly. One of the most critical elements of statistics and data analysis is the ability to choose the right statistical technique for each [...]]]></description>
			<content:encoded><![CDATA[<sape_index><p><b><i>In This Chapter</i></b></p>
<ul>
<li>Deciphering the difference between qualitative and quantitative variables</li>
<li>Choosing appropriate statistical techniques for the task at hand</li>
<li>Evaluating bias and precision levels</li>
<li>Interpreting the results properly</li>
</ul>
<p>One of the most critical elements of statistics and data analysis is the ability to choose the right statistical technique for each job. Carpenters and mechanics know the importance of having the right tool when they need it and the problems that can occur if they use the wrong tool. They also know that the right tool increases their odds of getting the results they want the first time around, using the &quot;work smarter, not harder&quot; approach.</p>
<p>In this chapter, you look at some of the major statistical analysis techniques from the point of view of the mechanics and carpenters: knowing what each statistical tool is meant to do, how to use it, and when to use it. You also zoom in on mistakes some number crunchers make in applying the wrong analysis or doing too many analyses. Knowing how to spot these problems can help you avoid making the same mistakes, and it also helps you steer your way through the ocean of statistics that may await you in your job and in everyday life.</p>
<p>If many of the ideas you find in this chapter seem like a foreign language to you and you feel like you need more background information, don&#8217;t fret. Before continuing on in this chapter, head to your nearest intro stats book or check out another one of my books, <i>Statistics For Dummies </i>(Wiley).</p>
<p><b><i>Qualitative Versus Quantitative Variables in Statistical Analysis</i></b></p>
<p>After you&#8217;ve collected all the data you need from your sample, you want to organize it, summarize it, and analyze it. Before plunging in to do all the number crunching, though, you first need to identify the type of data you&#8217;re dealing with. The type of data you have points you to the proper types of graphs, statistics, and analyses you&#8217;re able to use.</p>
<p>Before I begin, here&#8217;s an important piece of jargon: Statisticians call any quantity or characteristic you measure on an individual a <i>variable</i>. The data collected on a variable is expected to vary from person to person (hence the creative name).</p>
<p>The two major types of variables are the following:</p>
<p><b>Qualitative: </b>A qualitative variable classifies the individual based on categories. For example, political affiliation may be classified into four categories: Democrat, Republican, Independent, and other; gender as a variable takes on two possible categories: male and female. A person may be categorized as a female Republican, which means that, regarding the gender variable, she falls into the female category, and regarding the political affiliation variable, she falls into the Republican category. Another name for a qualitative variable is a <i>categorical variable</i>.</p>
<p><b>Quantitative: </b>A quantitative variable measures or counts a quantifiable characteristic, such as height, weight, number of children you have, your GPA in college, or the number of hours of sleep you got last night. The quantitative variable value represents a quantity (count) or a measurement and has numerical meaning. That is, you can add, subtract, multiply, or divide the values of a quantitative variable, and the results make sense as numbers. This characteristic isn&#8217;t true of qualitative variables, which can take on numerical values only as placeholders.</p>
<p>Because the two types of variables represent such different types of data, it makes sense that each type has its own set of statistics. Qualitative variables, such as gender, are somewhat limited in terms of the statistics that can be performed on them. For example, suppose you have a sample of 500 classmates classified by gender: 180 of them are male, and 320 are female. How can you summarize this information? You already have the total number in each category (this statistic is called the <i>frequency</i>). You&#8217;re off to a good start, but frequencies are hard to interpret because you find yourself trying to compare them to a total in your mind in order to get a proper comparison. In the previous example, you may be thinking &quot;One hundred and eighty males out of what? Let&#8217;s see, it&#8217;s out of 500. Hmmm . . . what percentage is that? I can&#8217;t think.&quot;</p>
<p>The next step is to find a means to relate these numbers to each other in an easy way. You can do this by using what is called a relative frequency. The <i>relative frequency</i> is the percentage of data that falls into a specific category of a qualitative variable. You can find a category&#8217;s relative frequency by dividing the frequency by the sample total (500, using this example) and multiplying by 100. In this case, you have \( \frac{180}{500} = 0.36 \) and \( 0.36 \times 100 = 36 \) percent males, and \( \frac{320}{500} = 0.64 \) and \( 0.64 \times 100 = 64 \) percent females.</p>
<p>You can also express the relative frequency as a proportion in each group by leaving the result in decimal form and not multiplying by 100. This statistic is called the <i>sample proportion</i>. If you continue with the same example, the sample proportion of males is 0.36, and the sample proportion of females is 0.64.</p>
<p>You mainly summarize qualitative variables by using two statistics — the number in each category (frequency) and the percentage (relative frequency) in each category.</p>
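<p>If you&#8217;d rather let a computer do the dividing, the classmates example can be sketched in a few lines of Python. This is only an illustration; the variable names are made up for this sketch, and the counts are the 180 males and 320 females from the example.</p>

```python
# Frequencies from the classmates example in the text.
frequencies = {"male": 180, "female": 320}
total = sum(frequencies.values())  # 500 classmates in all

# Sample proportion = frequency / total; relative frequency = proportion * 100.
proportions = {group: count / total for group, count in frequencies.items()}
relative_frequencies = {group: 100 * p for group, p in proportions.items()}

for group in frequencies:
    print(f"{group}: frequency={frequencies[group]}, "
          f"proportion={proportions[group]:.2f}, "
          f"relative frequency={relative_frequencies[group]:.0f}%")
```

<p>The proportions always add up to 1 (and the relative frequencies to 100 percent), which makes a handy check that no category was left out.</p>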
<p><b><i>Statistics for Qualitative Variables</i></b></p>
<p>The types of statistics done on qualitative data may seem to be limited; however, the wide variety of analyses you can perform using frequencies and relative frequencies offers answers to an extensive range of possible questions you may want to explore.</p>
<p>In this section, you see that the proportion in each group is the number-one statistic for summarizing qualitative data. Beyond that, you see how you can use proportions to estimate, compare, and look for relationships between the groups that compose the qualitative data.</p>
<p><b><i>Comparing proportions</i></b></p>
<p>Researchers, the media, and even everyday folk like you and me love to compare groups (whether you like to admit it or not). For example, what proportion of Democrats support oil drilling in Alaska, compared to Republicans? What percentage of women watch college football versus men? What proportion of readers of <i>Intermediate Statistics For Dummies</i> pass their stats exams with flying colors, compared to nonreaders? To answer these questions, you need to compare the sample proportions using a hypothesis test for two proportions (see Chapter 3 or your intro stat textbook).</p>
<p>Suppose you&#8217;ve collected data on a random sample of 1,000 United States voters. You may want to compare the proportion of female voters to the proportion of male voters and find out whether they&#8217;re equal. Suppose in your sample you find that the proportion of females is 0.53, and the proportion of males is 0.47. So for this sample of 1,000 people, you have a higher proportion of females than males. But here&#8217;s the big question: Are these sample proportions different enough to say that the entire population of U.S. voters has more females in it than males? After all, sample results vary from sample to sample. The answer to this question requires comparing the sample proportions by using a hypothesis test for two proportions. I demonstrate and expand on this technique in Chapter 3.</p>
<p><b><i>Estimating a proportion</i></b></p>
<p>You can also use relative frequencies (check out the section &quot;Qualitative versus Quantitative Variables in Statistical Analysis&quot;) to make estimates about a single population proportion.</p>
<p>Say, for example, you want to know what proportion of females in the United States are Democrats. According to a sample of 29,839 female voters from the U.S. conducted by the Pew Research Foundation in 2003, the percentage of female Democrats was 36. Because the Pew researchers based these results on only a sample of the population and not on the entire population, these results may vary from sample to sample. The amount of variability is measured by the <i>margin of error</i> (the amount that you add and subtract from your sample statistic), which for this sample is only about 0.5 percent. (To find out how to calculate margin of error, explore Chapter 3.) That means the percentage of female Democrats in the U.S. voting population is estimated to be somewhere between 35.5 percent and 36.5 percent.</p>
<p>The margin of error, combined with the sample proportion, forms what statisticians call a confidence interval for the population proportion. Recall from intro stats that a <i>confidence interval</i> is a range of likely values for a population parameter, formed by taking the sample statistic plus or minus the margin of error. (For more on confidence intervals, see Chapter 3.)</p>
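<p>To see roughly where a margin of error of about 0.5 percent could come from, here&#8217;s a minimal Python sketch. It assumes the usual large-sample formula for a proportion and a 95 percent confidence level (a multiplier of 1.96); neither assumption is spelled out in this chapter, and the function name is invented for the illustration. The sample size and sample proportion are the Pew figures quoted above.</p>

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes the normal approximation; z=1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.36   # sample proportion of female Democrats (from the text)
n = 29839      # number of female voters sampled (from the text)

moe = margin_of_error(p_hat, n)
lower, upper = p_hat - moe, p_hat + moe

print(f"margin of error: {100 * moe:.2f} percentage points")
print(f"confidence interval: {100 * lower:.1f}% to {100 * upper:.1f}%")
```

<p>Under those assumptions the margin of error works out to roughly half a percentage point, in line with the figure quoted for this sample.</p>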
<p><b><i>Looking for relationships between qualitative variables</i></b></p>
<p>Suppose you want to know whether two qualitative variables are related (for example, is gender related to political affiliation?). Answering this question requires putting the sample data into a two-way table (using rows and columns to represent the two variables), and analyzing the data by using a Chi-square test (see Chapter 14). By following this process, you can determine whether two categorical variables are independent (unrelated) or whether a relationship exists between them. If you find a relationship, you can use percentages to describe it.</p>
<p>Table 2-1 shows an example of data organized in a two-way table. The data was collected by the Pew Research Foundation.</p>
<table class=msonormaltable border=1 cellpadding=0 style='mso-cellspacing:1.5pt; mso-yfti-tbllook:1184' frame=box rules=all>
<tr>
<td>
<p><b>Table 2-1</b></p>
</td>
<td colspan=3>
<p><b>Gender and Political Affiliation for 56,735 U.S. Voters</b></p>
</td>
</tr>
<tr>
<td>
<p><b><i>Gender</i></b></p>
</td>
<td>
<p><b><i>Republican</i></b></p>
</td>
<td>
<p><b><i>Democrat</i></b></p>
</td>
<td>
<p><b><i>Other</i></b></p>
</td>
</tr>
<tr>
<td>
<p>Males</p>
</td>
<td>
<p>32%</p>
</td>
<td>
<p>27%</p>
</td>
<td>
<p>41%</p>
</td>
</tr>
<tr>
<td>
<p>Females</p>
</td>
<td>
<p>29%</p>
</td>
<td>
<p>36%</p>
</td>
<td>
<p>35%</p>
</td>
</tr>
</table>
<p>Notice that the percentage of male Republicans in the sample is 32 and the percentage of female Republicans in the sample is 29. These percentages are quite close in relative terms. However, the percentage of female Democrats seems much higher than the percentage of male Democrats (36 percent versus 27 percent); also, the percentage of males in the &quot;Other&quot; category is quite a bit higher than the percentage of females in the &quot;Other&quot; category (41 percent versus 35 percent). These large differences in the percentages indicate that gender and political affiliation are related in the sample. But do these trends carry over to the population of all U.S. voters? This question requires a hypothesis test to answer. The particular hypothesis test you need in this situation is a Chi-square test, which I discuss in detail in Chapter 14.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-16.jpg" width="62" height="60" class=""/></p>
<p>To make a two-way table from a data set by using Minitab, first enter the data in two columns, where column one is the row variable (continuing with the previous example, this variable would be gender) and column two is the column variable (in this case, political affiliation). For example, suppose the first person is a male Democrat. In row one of Minitab, enter <i>M </i>(for male) in column one and <i>D </i>(Democrat) in column two. Then go to Stat&gt;Tables&gt;Cross Tabulation and Chi-square. Highlight column one and click Select to enter this variable in the For Rows line. Highlight column two and click Select to enter this variable in the For Columns line. Click on OK.</p>
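<p>If you don&#8217;t have Minitab handy, the same kind of two-way table can be sketched in plain Python. The eight people below are hypothetical and exist only to show the mechanics; they aren&#8217;t taken from the Pew sample. As in the Minitab steps, each person contributes one entry for the row variable (gender) and one for the column variable (political affiliation).</p>

```python
from collections import Counter

# Hypothetical raw data, one (gender, party) pair per person.
# M/F = male/female; D/R/O = Democrat/Republican/Other.
people = [("M", "D"), ("F", "D"), ("F", "R"), ("M", "R"),
          ("F", "D"), ("M", "O"), ("F", "O"), ("M", "R")]

counts = Counter(people)                 # cell counts keyed by (row, column)
rows = sorted({g for g, _ in people})    # row variable: gender
cols = sorted({p for _, p in people})    # column variable: party

print("gender " + " ".join(f"{c:>6}" for c in cols))
for g in rows:
    row_total = sum(counts[(g, c)] for c in cols)
    # Show row percentages, the same layout as Table 2-1.
    cells = " ".join(f"{100 * counts[(g, c)] / row_total:5.1f}%" for c in cols)
    print(f"{g:<6} {cells}")
```

<p>Each row of percentages adds up to 100, because every person in a row falls into exactly one column.</p>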
<p><img src="/wp-content/uploads/intermediate statistics for dummies-17.jpg" width="57" height="54" class=""/></p>
<p>People often use the word <i>correlation</i> to discuss relationships between variables, but in the statistical world, you can use correlation only to discuss the relationship between two quantitative (numerical) variables, not two qualitative (categorical) variables. Correlation measures how closely the relationship between two quantitative variables, such as height and weight, follows a straight line and tells you the direction of that line as well. In total, for any two quantitative variables, <i>x</i> and <i>y</i>, the correlation measures the strength and direction of their linear relationship. As one increases, what does the other one do?</p>
<p>Because qualitative variables don&#8217;t have a numerical order to them, they don&#8217;t increase or decrease in value. For example, just because male = 1 and female = 2 doesn&#8217;t mean that a female is worth twice a male. (Although some women may want to disagree.) These numbers represent categories, not values. Therefore, you can&#8217;t use the word <i>correlation</i> to describe the relationship between, say, gender and political affiliation. The appropriate term to describe the relationships of qualitative variables is <i>association</i>. You can say that political affiliation is associated with gender, and explain how. (For full details on association, see Chapter 13. For more information on correlation, see Chapter 4.)</p>
<p><b><i>Building models to make predictions</i></b></p>
<p>You can also build models to predict the value of a qualitative variable based on other related information. In this case, building models is more than a lot of little plastic pieces and some irritatingly sticky glue. When you build a model, you look for variables that help explain, estimate, or predict some response you&#8217;re interested in (the variables that do this are called <i>explanatory variables</i>). You sort through the explanatory variables and figure out which ones do the best job of predicting the response, and you put them together into a type of equation like <i>y</i> = 2<i>x</i> + 4, where <i>x</i> = shoe size and <i>y</i> = length of your calf. That equation is a <i>model</i>.</p>
<p>For example, what if you want to know which factors or variables can help you predict someone&#8217;s political affiliation? Is a woman without children more likely to be a Republican or a Democrat? What about a middle-aged man who proclaims Hinduism as his religion? In order for you to compare these complex relationships, you must build a model to evaluate each group&#8217;s impact on political affiliation (or some other qualitative variable). This kind of model building is explored in more depth in Chapter 8, where I discuss the topic of logistic regression.</p>
<p>Logistic regression builds models to predict the outcome of a qualitative variable, such as political affiliation. If you want to make predictions about a quantitative variable, such as income, you need to use the standard type of regression (check out Chapters 4 and 5).</p>
<p>In 2003, the Pew Research Foundation studied the following variables in terms of their relationship with political affiliation: gender, race, state of residence, income level, age, education, religion, marital status, and whether or not you have children. While you can do individual Chi-square analyses to examine possible connections between each of these variables and political affiliation separately, you can&#8217;t find out which combinations of these variables increase the likelihood of someone being a Democrat, Republican, or other.</p>
<p>For example, the Foundation found that women are more likely to be Democrats than men, but age is also a factor. Younger people tend to be more inclined to be Republican, and older people lean toward being Democrat. However, if you look at the combination of gender and age, you can see mixed results; males who are older are more likely than young females to be Democrat rather than Republican, for example. This kind of result is called an <i>interaction effect</i> between gender and age group. An interaction effect occurs when certain combinations of variables produce different results than other combinations. The only way to look for these kinds of more-complex relationships is to do model building, which allows you to examine the combinations of variables and their impact on political affiliation. The Pew Foundation was able to make conclusions about the United States population based on its model linking political affiliation, age, and gender, as well as their interactions.</p>
<p><b><i>Statistics for Quantitative Variables</i></b></p>
<p>Quantitative variables, unlike qualitative variables, have a wider range of statistics that you can do, depending on what questions you want to ask. The main reason for this wider range is that <i>quantitative data</i> are numbers that represent measurements or counts, so it makes sense that you can order, add or subtract, and multiply or divide them, and the results all have numerical meaning. Examining quantitative data opens up a whole world of possibilities for analysis. In this section, I present the major data-analysis techniques for quantitative data. I further expand each technique in later chapters of this book.</p>
<p><b><i>Making comparisons</i></b></p>
<p>Suppose you want to look at income (a quantitative variable) and how it relates to a qualitative variable, such as gender or region of the country. Your first question may be: Do males still make more money than females? In this case, you can compare the mean incomes of two populations, males and females. This assessment requires a hypothesis test of two means (oftentimes called a <i>t</i>-test for independent samples). I present more information on this technique in Chapter 3.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-18.jpg" width="63" height="60" class=""/></p>
<p>When comparing the means of <i>more</i> than two groups, don&#8217;t simply look at all the possible <i>t</i>-tests that you can do on the pairs of means, because you have to control for an overall error rate in your analysis. Too many analyses can result in errors that add up to disaster. For example, if you conduct 100 hypothesis tests, each one with a 5 percent error rate, then 5 of those 100 tests give wrong results on average, just by chance.</p>
<p>If you want to compare the average wage in different regions of the country (the East, the Midwest, the South, and the West, for example), this comparison requires a more sophisticated analysis, because you&#8217;re looking at four groups rather than just two. The procedure you can use to compare more than two means is called <i>analysis of variance</i> (ANOVA), and I discuss this method in detail in Chapters 9 and 10.</p>
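<p>The arithmetic behind that warning is quick to check. The Python sketch below uses the 100 tests and the 5 percent error rate from the example; the second calculation adds an assumption the chapter doesn&#8217;t state, namely that the tests are independent and that nothing is really going on in any of them.</p>

```python
alpha = 0.05      # error rate of each individual test (from the text)
num_tests = 100   # number of hypothesis tests (from the text)

# Average number of tests that give a wrong result just by chance.
expected_wrong = alpha * num_tests

# Chance that at least one test goes wrong.
# Assumes the tests are independent and every null hypothesis is true.
prob_at_least_one = 1 - (1 - alpha) ** num_tests

print(f"expected wrong results: {expected_wrong:.0f}")
print(f"chance of at least one wrong result: {prob_at_least_one:.3f}")
```

<p>Under those assumptions, getting through all 100 tests without a single false alarm is very unlikely, which is why procedures that control the overall error rate matter.</p>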
<p><b><i>Finding connections</i></b></p>
<p>Suppose you&#8217;re an avid golfer and you want to figure out how much time you should spend on your putting game. The question is this: Is the number of putts related to your total score? If the answer is yes, then spending time on your putting game makes sense. If not, then you can slack off on it a bit. Both of these variables are quantitative variables, and you&#8217;re looking for a connection between them. You collect data on 100 rounds of golf played by golfers at your favorite course over a weekend. Table 2-2 shows the first few lines of your data set.</p>
<table class=msonormaltable border=1 cellpadding=0 style='mso-cellspacing:1.5pt; mso-yfti-tbllook:1184' frame=box rules=all>
<tr>
<td>
<p><b>Table 2-2</b></p>
</td>
<td>
<p><b>First Ten Golf Scores (ordered)</b></p>
</td>
</tr>
<tr>
<td>
<p><b><i>Number of Putts</i></b></p>
</td>
<td>
<p><b><i>Total Score</i></b></p>
</td>
</tr>
<tr>
<td>
<p>23</p>
</td>
<td>
<p>76</p>
</td>
</tr>
<tr>
<td>
<p>27</p>
</td>
<td>
<p>80</p>
</td>
</tr>
<tr>
<td>
<p>28</p>
</td>
<td>
<p>80</p>
</td>
</tr>
<tr>
<td>
<p>29</p>
</td>
<td>
<p>80</p>
</td>
</tr>
<tr>
<td>
<p>30</p>
</td>
<td>
<p>80</p>
</td>
</tr>
<tr>
<td>
<p>29</p>
</td>
<td>
<p>82</p>
</td>
</tr>
<tr>
<td>
<p>30</p>
</td>
<td>
<p>83</p>
</td>
</tr>
<tr>
<td>
<p>31</p>
</td>
<td>
<p>83</p>
</td>
</tr>
<tr>
<td>
<p>33</p>
</td>
<td>
<p>83</p>
</td>
</tr>
<tr>
<td>
<p>26</p>
</td>
<td>
<p>84</p>
</td>
</tr>
</table>
<p>The first step in looking for a connection between putts and total scores (or any other quantitative variables) is to make what is called a <i>scatterplot</i> of the data, which graphs your data set in two-dimensional space by using an <i>x</i> and <i>y</i> plane. You can take a look at the scatterplot of the golf data in Figure 2-1. Here, <i>x</i> represents the number of putts, and <i>y</i> represents the total score. For example, the point in the lower-left corner of the graph represents someone who had only 23 putts and a total score of 75. (For instructions on making a scatterplot by using Minitab, see Chapter 4.)</p>
<p><b>Figure 2-1:</b></p>
<p>A scatterplot is a two-dimensional graph you can use to look for relationships in data.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-19.jpg" width="340" height="228" class=""/></p>
<p>According to Figure 2-1, it appears that as the number of putts increases, so does total score. The relationship seems pretty strong — the number of putts plays a big part in determining the total score.</p>
<p>Now you need a measure of how strong the relationship is between <i>x</i> and <i>y</i> and whether it goes uphill or downhill. Correlation is the number that measures how closely the points follow a straight line. Correlation is always between -1.0 and +1.0, and the more closely the points follow a straight line, the closer the correlation is to -1.0 or +1.0. A positive correlation means that as <i>x</i> increases on the <i>x</i>-axis, <i>y</i> also increases on the <i>y</i>-axis. Statisticians call this type of relationship an <i>uphill relationship</i>. A negative correlation means that as <i>x</i> increases on the <i>x</i>-axis, <i>y</i> goes down. Statisticians call this type of relationship (you guessed it) a <i>downhill relationship</i>.</p>
<p>For the golf data set, the correlation is 0.896, or about 0.90, which is extremely high as correlations go. This strong correlation (close to +1.0) is a good thing because it means number of putts can do a great job of predicting total score. Because the sign of the correlation is positive, it means as you increase number of putts, your total score increases (an uphill relationship). For instructions on calculating a correlation in Minitab, see Chapter 4.</p>
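<p>For readers who like to see the gears turn, here&#8217;s a minimal Python sketch of a correlation calculation. It uses only the ten rounds listed in Table 2-2, so don&#8217;t expect it to reproduce the 0.896 reported for all 100 rounds. The function below is the standard Pearson correlation formula written out by hand; the chapter itself leaves the calculation to Minitab.</p>

```python
import math

# The ten rounds shown in Table 2-2 (the full data set has 100 rounds).
putts = [23, 27, 28, 29, 30, 29, 30, 31, 33, 26]
scores = [76, 80, 80, 80, 80, 82, 83, 83, 83, 84]

def correlation(x, y):
    """Pearson correlation: strength and direction of a linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = correlation(putts, scores)
print(f"r = {r:.3f}")
```

<p>The result is positive, so even these ten rounds show an uphill relationship, though a weaker one than the full data set shows.</p>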
<p><b><i>Making predictions</i></b></p>
<p>If you want to predict some response variable (<i>y</i>) using one explanatory variable (<i>x</i>), and you want to use a straight line to do it, you can use <i>simple linear regression</i> (see Chapter 4 for all the fine points on this topic). Linear regression finds the best-fitting line that cuts through the data set, called the <i>regression line</i>. After you get the regression line, you can plug in a value of <i>x</i> and get your prediction for <i>y</i>. (For instructions on using Minitab to find the best-fitting line for your data, see Chapter 4.)</p>
<p>To use the golf example from the previous section, suppose you want to predict the total score you can get for a certain number of putts. In this case, you want to calculate the linear regression line. After you run a regression analysis on the data set shown in Table 2-2, the computer tells you that the best line to use to predict total score using number of putts is the following:</p>
<p>Total score = 39.6 + 1.52 * Number of putts</p>
<p>So if you have 35 putts in an 18-hole golf course, your total score is predicted to be about 39.6 + 1.52 * 35 = 92.8, or 93. (Not bad for 18 holes!)</p>
<p>Notice that the slope of the regression line tells you what you really want to know: How much does your total score increase with every additional putt? In other words, how much damage is done when you miss the hole on your first, or second, or third putt? The slope of the regression line for the golf data set is 1.52. Because the slope of a line is the ratio of the change in <i>y</i> (total score) to the change in <i>x</i> (number of putts), this means that every additional putt you need results in an overall increase in total score of 1.52. Maybe that&#8217;s why Tiger Woods spends so much time on his short game.</p>
<p>Don&#8217;t try to predict <i>y</i> for <i>x</i>-values that fall outside the range of where the data was collected; you have no guarantee that the line still works outside of that range, or that it will even make sense. For the golf example, you can&#8217;t say that if <i>x</i> (the number of putts) = 0 the total score would be 39.6 + 1.52 * 0 = 39.6 (unless you just call it good after your ball hits the green). This mistake is called <i>extrapolation</i>.</p>
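<p>Here&#8217;s one way the prediction and the warning about extrapolation might look in Python. The intercept and slope come from the regression line above. The observed range of 20 to 40 putts is purely hypothetical, because the chapter doesn&#8217;t report the smallest and largest number of putts in the 100 rounds; the function name is also just an illustration.</p>

```python
INTERCEPT = 39.6   # from the regression line in the text
SLOPE = 1.52       # extra strokes per additional putt, from the text

def predict_score(putts, min_putts, max_putts):
    """Predict total score, refusing to extrapolate beyond the observed range."""
    if not (min_putts <= putts <= max_putts):
        raise ValueError(
            f"{putts} putts is outside the observed range "
            f"{min_putts}-{max_putts}; predicting here would be extrapolation"
        )
    return INTERCEPT + SLOPE * putts

# Hypothetical observed range; the text does not state the real one.
OBSERVED_MIN, OBSERVED_MAX = 20, 40

print(round(predict_score(35, OBSERVED_MIN, OBSERVED_MAX), 1))
```

<p>Asking this function about 0 putts raises an error instead of quietly returning 39.6, which is the point: the line says nothing trustworthy outside the range of the data that produced it.</p>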
<p>You can discover more about simple linear regression, and expansions on it, in Chapters 4 and 5.</p>
<p><b><i>Avoiding Bias</i></b></p>
<p>Bias is the bane of a statistician&#8217;s existence; it&#8217;s easy to create and very hard to deal with, if not impossible in most situations. The statistical definition of <i>bias</i> is the systematic overestimation or underestimation of the actual value. In language the rest of us can understand, it means that the results are always off by a certain amount in a certain direction. For example, a bathroom scale may always report a weight that&#8217;s five pounds more than it should be (I&#8217;m convinced this is true of my doctor&#8217;s office scale); this consistent adding of five pounds to every outcome represents a systematic overestimation of the actual weight.</p>
<p>The most important idea when dealing with bias is prevention, or at least minimizing it. Bias is like weeds in a garden: After they&#8217;re present, they&#8217;re very hard to deal with, and it&#8217;s always better to eliminate them from the start. In this section, you see ways bias can creep into a data set, or even into a statistic, and what you can do about it.</p>
<p><b><i>Looking at bias through statistical glasses</i></b></p>
<p>Bias can show up in a data set in a variety of ways. Here are some of the most common ways bias can creep into your data:</p>
<p><b>Selecting the sample from the population: </b>Bias occurs when you leave some intended groups out of the process, and/or give certain groups too much weight.</p>
<p>For example, TV surveys (the ones where they ask you to phone in your opinion) are biased because no one has selected a prior sample of people to represent the population; people call in on their own. When people participate in a survey on their own, they&#8217;re more likely to have stronger opinions than those who don&#8217;t choose to participate. Such samples are called <i>self-selected samples</i> and are typically very biased.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-20.jpg" width="57" height="60" class=""/></p>
<p><b>Designing the data-collection instrument: </b>Poorly designed instruments (including surveys) can result in inconsistent or even incorrect data.</p>
<p>For example, a survey question&#8217;s wording plays a large role in whether or not results are biased. A leading question can make people feel like they should answer a certain way. Consider this one: &quot;Don&#8217;t you think that the president should be allowed to have a line-item veto to prevent government spending waste?&quot; Who would feel they should say <i>no</i> to that?</p>
<p><b>Collecting the data: </b>In this case, bias can infiltrate the results if someone makes errors in the recording of the data or if interviewers deviate from the script.</p>
<p><b>Deciding how and when the data is collected: </b>The time and place you collect data can affect whether your results are biased. For example, if you conduct a telephone survey during the middle of the day, people who work from nine to five aren&#8217;t able to participate. Depending on the issue, the timing of this survey could lead to biased results.</p>
<p>Bias can creep into a data set very easily. The best way to deal with bias is to avoid it in the first place. You can do this in two major ways:</p>
<p><b>Use a random process to select the sample from the population. </b>The only way a sample is truly random is if every single member of the population has an equal chance of being selected. Self-selected samples aren&#8217;t random.</p>
<p><b>Make sure that the data is collected in a fair and consistent way. </b>Be sure to use neutral question wording and time the survey properly.</p>
<p><b><i>Settling the variance controversy: The battle of n - 1 versus n</i></b></p>
<p>Not all statistical formulas are free of bias. In other words, some statistics have good characteristics (like offering great precision) and some not-so-good characteristics (like not giving the best possible result in all situations). Statisticians definitely prefer statistics that are both precise and unbiased, and the techniques you find in this book have both qualities. However, precise and unbiased statistics don&#8217;t always happen naturally; sometimes the basic idea requires a little tweaking to get a statistic that actually meets the standards of the statistical powers that be (of which I am not one). The classic example of this need to fine-tune is the formula for the variance of a data set, which I describe in the following section.</p>
<p><b><i>The problem</i></b></p>
<p>Statistics textbooks sometimes show two formulas for the variance of a data set. One formula shown for the variance is</p>
<p>\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \]</p>
<p>where <i>n</i> is the sample size, the values of <i>x<sub>i</sub></i> are the data values, and the sample mean (or the average of all the values of the data set) is</p>
<p>\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]</p>
<p>This formula for variance, you may note, contains an <i>n</i> all by itself in the denominator. The fact that the denominator is <i>n</i> and not <i>n</i> - 1 makes a teacher&#8217;s job of explaining variance a whole lot easier, because it represents the average squared distance from the mean. In this case, the values being squared are the differences between the data values and their mean. You get the average of these squared values by summing them up and dividing by <i>n</i>, the sample size.</p>
<p>However, this version of the formula for variance, as it&#8217;s written, is biased. In a statistical sense, that means that in the long term, the results are off target on average. If you take repeated samples, find the variance, and do this over and over, the results on average are a little smaller than they should be. (Statisticians can prove this, but you don&#8217;t have to worry about that. I&#8217;m sure you have better things to do.)</p>
<p><b><i>The solution</i></b></p>
<p>Because statisticians prefer results being correct to results that can be more easily explained, they decided to do something about this bias problem in the formula for the sample variance. A group of stat bigwigs figured out that dividing by <i>n</i> was the problem, and if you divide by <i>n</i> - 1 rather than <i>n</i>, you can get answers that are right on target, on average. That&#8217;s how the following commonly used formula for sample variance came into being:</p>
<p>\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]</p>
<p>Notice that an <i>n</i> - 1 rather than an <i>n</i> is now in the denominator. However, trying to explain why the formula isn&#8217;t dividing by <i>n</i> does tend to open up a can of worms for statistics professors (and explains why biased statistics are a topic left for the intermediate-level students, like you!).</p>
<p>Because statistics can be biased too, in terms of the results they create through their formulas alone, it&#8217;s always a good idea to check with a statistician or someone else in the know whether a particular statistic is unbiased before you use it.</p>
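<p>If you&#8217;d rather see the bias than take it on faith, a small simulation can help. The sketch below is only an illustration under assumptions that aren&#8217;t in the text: samples of size 5 drawn from a normal population whose true variance is 1, and a helper name (<code>average_variance_estimates</code>) invented for this example. It averages both versions of the variance over many repeated samples.</p>

```python
import random
import statistics

def average_variance_estimates(n, trials, seed=0):
    """Long-run averages of the divide-by-n and divide-by-(n - 1) variances."""
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        # Assumed population: normal with mean 0 and variance 1
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total_n += statistics.pvariance(sample)          # divides by n
        total_n_minus_1 += statistics.variance(sample)   # divides by n - 1
    return total_n / trials, total_n_minus_1 / trials

avg_n, avg_n_minus_1 = average_variance_estimates(n=5, trials=10000)
print(f"divide by n:     {avg_n:.3f}")
print(f"divide by n - 1: {avg_n_minus_1:.3f}")
```

<p>With these settings, the divide-by-<i>n</i> average should land noticeably below the true variance of 1, while the divide-by-(<i>n</i> - 1) average should sit close to it. The gap shrinks as the sample size grows.</p>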
<p><img src="/wp-content/uploads/intermediate statistics for dummies-21.jpg" width="62" height="60" class=""/></p>
</p>
<p>An animal science researcher came to me one time with a data set he was so proud of. He was studying cows and the variables involved in helping determine their longevity. He came in with a super-mega data set that contained over 100,000 observations. He was thinking &quot;Wow, this is gonna be great! I&#8217;ve been collecting this data for years and years, and I can finally have it analyzed. There&#8217;s got to be loads of information I can get out of this. The papers I&#8217;ll write, the talks I&#8217;ll be invited to give. . . the raise I&#8217;ll get!&quot; He turned his precious data over to me with an expectant smile and sparkling eyes.</p>
<p>But after looking at his data for a few minutes I made a terrible realization: all of his data came from exactly one cow. With no other cows to compare with and a sample size of just one, he had no way to even measure how much those results would vary if he wanted to apply them to another cow. His results were so biased toward that one animal that I couldn&#8217;t do anything with the data. After I summoned up the courage to tell him, it took a while to peel him off the floor. The moral of the story, I suppose, is to find a statistician and check out your big plans with her before you go down a cow path like this guy did.</p>
<p><b><i>Getting Good Precision</i></b></p>
<p><i>Precision</i> is the amount of movement you expect to have in your sample results if you repeat your entire study again with a new sample. <i>Low precision</i> means that you expect your sample results to move a lot (not a good thing). <i>High precision</i> means you expect your sample results to remain fairly close in the repeated samples (a good thing). In this section, you find out what precision does and doesn&#8217;t measure, and you see how to measure the precision of a statistic in general terms.</p>
<p><b><i>Understanding precision from a statistical point of view</i></b></p>
<p>You may think that precision means the level of correctness you have in your statistical results. But precision only measures the <i>level of consistency</i> in the results from sample to sample. Your results can be consistently correct or consistently incorrect.</p>
<p>For example, a field-goal kicker on a football team may consistently kick the ball two feet to the right of the goalposts every single time. Even though he&#8217;s consistent, he never gets to score, because his results are systematically off by the same amount each time. In other words, his results are biased, even though they&#8217;re precise.</p>
<p>A statistic can be precise with or without bias, and it can be unbiased with or without precision. The best situation is when your results are both precise (consistent) and unbiased (on target). That goal is what statisticians always strive for. How often does it happen? You can have a lot of control over the precision part by simply taking a larger sample. However, the goal of completely unbiased results is rarely achieved, but that doesn&#8217;t stop statisticians from trying. And you do have ways to minimize bias (keep reading).</p>
<p><b><i>Measuring precision with margin of error</i></b></p>
<p>You can measure precision by the margin of error. The <i>margin of error</i> is the amount that you expect your statistical results to change from one sample to the next. While you always hope, and may even assume, that statistical results shouldn&#8217;t change much with another sample, that&#8217;s not always the case. It&#8217;s like a commercial that tries to sell a weight-loss product by showing a person who lost 50 pounds in a single weekend; then in small letters at the bottom of the screen, you see the words &quot;results will vary.&quot; Before you report or try to interpret any statistical results, you need to have some measurement of how much those results are expected to vary from sample to sample.</p>
<p>The following sections show how to calculate the precision of your statistic and how to come up with a margin of error.</p>
<p><b><i>Calculating precision</i></b></p>
<p>The exact formulas for margin of error differ depending on the type of data that you&#8217;re analyzing; however, they all contain two major components:</p>
<p>Confidence coefficient</p>
<p>Standard error of the statistic</p>
<p>The general structure of a formula for margin of error is the following, where standard error is the standard deviation of the population divided by the square root of the sample size (you can see all the details on margin of error in Chapter 3):</p>
<p>Margin of error = ± Confidence coefficient * Standard error</p>
<p>The big idea is that the confidence coefficient tells you the number of standard errors you&#8217;re willing to add and subtract in order to have a certain level of confidence in your results. If you want to be more confident in your results, you add or subtract more standard errors. If you don&#8217;t have to be as confident, you don&#8217;t have to add or subtract as many standard errors. Typically, you add and subtract about two standard errors if you want to be 95 percent confident and three standard errors if you want to be more than 99 percent confident. This rule of thumb follows a statistical result called the <i>Empirical Rule</i>, also known as the <i>68-95-99.7 Rule</i>.</p>
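<p>To see where the &quot;about two&quot; and &quot;about three&quot; come from, here is a minimal sketch that looks up the confidence coefficient from a standard normal distribution. It assumes the statistic is approximately normal, and the function names are made up for this illustration rather than taken from any particular package.</p>

```python
from statistics import NormalDist

def confidence_coefficient(confidence):
    """Number of standard errors to add and subtract for a two-sided confidence level."""
    tail = (1.0 - confidence) / 2.0          # area left over in each tail
    return NormalDist().inv_cdf(1.0 - tail)  # assumes a normal-shaped statistic

def margin_of_error(confidence, standard_error):
    """Margin of error = confidence coefficient * standard error."""
    return confidence_coefficient(confidence) * standard_error

for level in (0.68, 0.95, 0.997):
    print(f"{level:.1%} confidence: about {confidence_coefficient(level):.2f} standard errors")
```

<p>The three coefficients come out close to 1, 2, and 3, which is the 68-95-99.7 pattern. For example, <code>margin_of_error(0.95, 2.0)</code> multiplies a standard error of 2 by roughly 1.96.</p>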
<p><img src="/wp-content/uploads/intermediate statistics for dummies-22.jpg" width="63" height="60" class=""/></p>
<p>The <i>standard error</i> is the average amount of movement in the statistic you&#8217;re using. It&#8217;s a function of two quantities:</p>
<p><b>Sample size: </b>Sample size is perhaps the most important factor in controlling margin of error. The sample size is in the denominator of the standard error, meaning that as your sample size increases, the standard error goes down, and that&#8217;s why the margin of error goes down.</p>
<p>This result makes sense, because having a larger sample means having more information in your analysis, which should lead to greater precision. As the sample size decreases, the margin of error goes up, because you have less information to work with and that makes for less-precise results.</p>
<p><b>Standard deviation in the population: </b>Standard deviation is close to the average distance from the mean. If the population you took your sample from has a large amount of variability, the standard deviation is large, and the margin of error for your statistic goes up (because standard deviation is in the numerator of the margin of error). If the population is more homogeneous, your sample results are more homogeneous as well, and the margin of error goes down (because the standard error gets smaller).</p>
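<p>The two effects just described can be sketched in a few lines. This example assumes you&#8217;re estimating a mean, that the population standard deviation is known, and that 1.96 is used as the 95 percent confidence coefficient; the standard deviations and sample sizes are invented purely for illustration.</p>

```python
import math

def standard_error_of_mean(population_sd, sample_size):
    """Population standard deviation divided by the square root of the sample size."""
    return population_sd / math.sqrt(sample_size)

def margin_of_error_95(population_sd, sample_size):
    # 1.96 is the usual 95 percent confidence coefficient (about two standard errors)
    return 1.96 * standard_error_of_mean(population_sd, sample_size)

# Hypothetical standard deviations and sample sizes
for sd in (10.0, 20.0):
    for n in (25, 100, 400):
        moe = margin_of_error_95(sd, n)
        print(f"sd={sd:5.1f}  n={n:4d}  margin of error={moe:.2f}")
```

<p>Because the sample size sits under a square root, you have to quadruple the sample size to cut the margin of error in half, whereas doubling the standard deviation doubles the margin of error outright.</p>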
</p>
<p>The Gallup Organization states its survey results in a universal, statistically correct format. Using a specific example from a recent survey it conducted, you can see the language it uses to report its results:</p>
<p>&quot;These results are based on telephone interviews with a randomly selected national sample of 1,002 adults, aged 18 years and older, conducted June 9-11, 2006. For results based on this sample, one can say with 95% confidence that the maximum error attributable to sampling and other random effects is ±3 percentage points. In addition to sampling error, question wording and practical difficulties in conducting surveys can introduce error or bias into the findings of public opinion polls.&quot;</p>
<p>The first sentence of the quote refers to how the Gallup Organization collected the data, as well as the size of the sample. As you can guess, precision is related to the sample size, as seen in the section &quot;Calculating precision.&quot;</p>
<p>The second sentence of the quote refers to the precision measurement: How much did Gallup expect these sample results to vary? The fact that Gallup is 95 percent confident means that if this process were repeated a large number of times, in about 5 percent of the cases the results would miss the true population value by more than the margin of error, just by chance. This happens when the sample selected for the analysis doesn&#8217;t represent the population, not because of bias but because of chance alone (more on this in Chapter 3).</p>
<p>(Check out the section &quot;Bias not included&quot; to get the info on why the third sentence is included in this quote.)</p>
<p>For more details on how to calculate margin of error in various statistical techniques, see Chapter 3.</p>
<p><b><i>Interpreting margin of error</i></b></p>
<p>Finding the margin of error is one thing — figuring out what it means is a whole other ball o&#8217; wax. But don&#8217;t fear; it&#8217;s actually not so bad. To interpret the margin of error, just think of it as the amount of play you allow in your results to cover most of the other samples you could have taken.</p>
<p>Suppose you&#8217;re trying to estimate the proportion of people in the population who support a certain issue, and you want to be 95 percent confident in your results. You sample 1,002 individuals and find that 65 percent support the issue. The margin of error for this survey turns out to be plus or minus 3 percentage points (you can find the details of this calculation in Chapter 3). That result means that you can expect the sample proportion of 65 percent to change by as much as 3 percentage points either way if you took a different sample of 1,002 individuals. In other words, you believe the actual population proportion is somewhere between 65 &#8211; 3 = 62 percent and 65 + 3 = 68 percent. That&#8217;s the best you can say.</p>
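<p>The arithmetic behind that plus or minus 3 points can be sketched as follows. The formula used here is the standard normal-approximation margin of error for a proportion, which this chapter doesn&#8217;t spell out (the details are deferred to Chapter 3), so treat the function as an assumption about how the figure was obtained rather than a quote from the book.</p>

```python
import math

def proportion_margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

n = 1002       # sample size from the example
p_hat = 0.65   # sample proportion supporting the issue

moe = proportion_margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe
print(f"margin of error: {moe:.1%}")
print(f"interval: {low:.1%} to {high:.1%}")

# Pollsters often quote the worst case, which occurs at p_hat = 0.5
print(f"maximum margin of error: {proportion_margin_of_error(0.5, n):.1%}")
```

<p>Both the margin for 65 percent and the worst-case margin round to about 3 percentage points, which matches the interval of roughly 62 to 68 percent.</p>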
<p><img src="/wp-content/uploads/intermediate statistics for dummies-23.jpg" width="57" height="54" class=""/></p>
<p><b><i>Bias not included!</i></b></p>
<p>It&#8217;s extremely important to realize that the margin of error measures only the consistency (precision) of a statistic, not its level of bias. In other words, a margin of error can appear on paper to be very small while the results are actually way off target because of bias in the data that was collected. (In the nearby sidebar, you can see that Gallup discusses margin of error and bias separately.)</p>
<p>Any reported margin of error was calculated on the basis of having zero bias in the data. However, this assumption is rarely true. Before interpreting any margin of error, check first to be sure that the sampling process and the data-collection process don&#8217;t contain any obvious sources of bias. If a great deal of bias exists, you should ignore the results, or take them with a great deal of skepticism.</p>
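<p>A quick simulation shows how badly a margin of error can mislead when bias is present. Everything numeric here is hypothetical: a population in which exactly 50 percent support an issue, and a flawed survey that only reaches a group in which 58 percent do (think of calling only in the middle of the day). The sketch counts how often the usual 95 percent interval actually contains the true value.</p>

```python
import math
import random

def coverage(true_p, sampled_p, n, trials, seed=1):
    """Fraction of 95 percent intervals containing true_p when the sample
    is really drawn from a group whose support is sampled_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        successes = sum(sampled_p > rng.random() for _ in range(n))
        p_hat = successes / n
        moe = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / n)
        if moe >= abs(p_hat - true_p):   # interval contains the true value
            hits += 1
    return hits / trials

honest = coverage(true_p=0.50, sampled_p=0.50, n=1000, trials=1000)
skewed = coverage(true_p=0.50, sampled_p=0.58, n=1000, trials=1000)
print(f"unbiased sampling: {honest:.0%} of intervals contain the truth")
print(f"biased sampling:   {skewed:.0%} of intervals contain the truth")
```

<p>The margin of error is about the same size in both cases, because it depends on the sample size and not on who was left out. Only the unbiased survey delivers anything like the advertised 95 percent.</p>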
<p><b><i>Making Conclusions and Knowing Your Limitations</i></b></p>
<p>The most important goal of any data analyst is to remain focused on the big picture — the question that you or someone else is asking — and make sure that the data analysis used is appropriate and comprehensive enough to answer that question correctly and fairly.</p>
<p>Here are some tips for analyzing data and interpreting the results, in terms of the statistical procedures and techniques that you may use — at school, in your job, and in everyday life. These tips are implemented and reinforced throughout this book:</p>
<p><b>Be sure that the research question being asked is clear and definitive. </b>Some researchers don&#8217;t want to be pinned down on any particular set of questions because they have the intent of mining the data (looking for any relationship they can find, and then stating their results after the fact). This can lead to overanalyzing the data, making the results subject to skepticism by statisticians.</p>
<p><b>Double-check that you clearly understand the type of data being collected. </b>Is the data qualitative or quantitative? The type of data used drives the approach that you take in the analysis.</p>
<p><b>Make sure that the statistical technique you use is designed to answer the research question. </b>If you want to make comparisons between two groups and your data is quantitative, use a hypothesis test for two means. If you want to compare five groups, use analysis of variance (ANOVA). You can use this book as a resource to help you determine the technique you need.</p>
<p><b>Look for the limitations of the data analysis. </b>For example, if the researcher wants to know whether negative political ads affect the population of voters, and she bases her study on a group of college students, you can find severe limitations here. For starters, student reactions to negative ads don&#8217;t necessarily carry over to all voters in the population. And even if the population were limited to all student voters, the students from this particular class don&#8217;t represent all students. In this case, it&#8217;s best to limit the conclusions to college students in that class (which no researcher would ever want to do). Ultimately, what needs to be done is to design the study so that the sample represents the intended population of all voters in the first place (a much more difficult task, but well worth it).</p>
<p>One of the hardest parts of my job as a statistical consultant is dealing with analyses after the design was already done — and done incorrectly. It&#8217;s much better to put in a little work to get a good design together first, and then the analysis will take care of itself.</p></p>
</sape_index><!--c715886456-->]]></content:encoded>
			<wfw:commentRss>http://ankar.info/2010/05/15/sorting-through-statistical-techniques/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Beyond Number Crunching: The Art and Science of Data Analysis</title>
		<link>http://ankar.info/2010/05/15/beyond-number-crunching-the-art-and-science-of-data-analysis/</link>
		<comments>http://ankar.info/2010/05/15/beyond-number-crunching-the-art-and-science-of-data-analysis/#comments</comments>
		<pubDate>Sat, 15 May 2010 18:30:14 +0000</pubDate>
		<dc:creator>Анкар</dc:creator>
				<category><![CDATA[Data Analysis and Model-Building Basics]]></category>

		<guid isPermaLink="false">http://ankar.info/2010/05/15/beyond-number-crunching-the-art-and-science-of-data-analysis/</guid>
		<description><![CDATA[In This Chapter ^ Realizing your role as a data analyst ^ Avoiding statistical faux pas ^ Delving into the jargon of intermediate statistics Because you&#8217;re reading this book, you&#8217;re likely familiar with the basics of statistics. You&#8217;re now ready to take it up a notch. That next level involves using what you know, picking [...]]]></description>
			<content:encoded><![CDATA[<sape_index><p><b><i>In This Chapter</i></b></p>
<p>^ Realizing your role as a data analyst</p>
<p>^ Avoiding statistical faux pas</p>
<p>^ Delving into the jargon of intermediate statistics</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-5.jpg" width="29" height="39" class=""/></p>
<p>Because you&#8217;re reading this book, you&#8217;re likely familiar with the basics of statistics. You&#8217;re now ready to take it up a notch. That next level involves using what you know, picking up a few more tools and techniques at the intermediate level, and finally putting it all to use to help you answer more realistic questions by using real data.</p>
<p>In statistical terms, you&#8217;re ready to enter the world of the <i>data analyst</i>. This world&#8217;s an exciting one, with many options to explore and many tools available. But, as you may have guessed, you have to navigate this world very carefully, choosing the right methods for each situation. In this book, you can see that I&#8217;m including the underlying theories and ideas behind the methods where necessary to help you make good decisions, and not just get into the point-and-click mode that today&#8217;s software packages offer.</p>
<p>In this chapter, you review the terms involved in statistics as they pertain to data analysis at the intermediate level. You get a glimpse of the impact that your results can have by seeing what these analysis techniques can do. You also gain insight into some of the common misuses of data analysis and their effects.</p>
<p><b><i>Data Analysis: It&#8217;s Not Just for Statisticians Anymore</i></b></p>
<p>It used to be that statisticians were the only ones who really analyzed data. The reason is that the only computer programs that were available then were very complicated to use, requiring a great deal of knowledge about statistics to set up and carry out. The calculations were tedious and at times unpredictable and required a thorough understanding of the theories and methods behind the calculations to get correct and reliable answers.</p>
<p>Today, anyone who wants to analyze data can do it easily. Many user-friendly statistical software packages are made expressly for that purpose — Microsoft Excel, Minitab, SAS, and SPSS, just to name a few. Free online programs are even available, such as Stat Crunch, to help you do just what it says — crunch your numbers and get an answer. As you see in this section, the modern easy-to-use statistical packages are good in some ways, and not-so-good in other ways.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-6.jpg" width="57" height="60" class=""/></p>
<p>The most important idea when applying statistical techniques to analyze data is to know what&#8217;s going on behind the number crunching, so you (not the computer) are in control of the analysis. That&#8217;s why knowledge of intermediate statistics is so critical.</p>
<p><b><i>Remembering the old days</i></b></p>
<p>In the old days, in order to determine whether methods gave different results, you had to write a computer program to do it, using code that you had to take a class to learn. You had to type in your data in a specific way that the computer program demanded, and you had to submit your program to a mainframe computer and wait for the printer to print out your results. This method was time consuming and a general all-around pain.</p>
<p>I remember the day in college when I reached bottom. I was just learning to write those sophisticated programs you needed to do the simplest analysis. No matter how hard I tried to write the perfect program, the computer kept spitting my work back at me without doing my analysis, noting error after error in the way I typed the commands. The last straw came when I gave my program to the computer for the umpteenth time: At the end of the printout, the computer told me on the very last line: &quot;Error #34410: Too many errors.&quot;</p>
<p>Now, don&#8217;t get the idea that your author doesn&#8217;t know what she&#8217;s doing. I had all the statistical methods right; I just wasn&#8217;t very good at writing computer programs. So for anyone out there who&#8217;s ever been frustrated by a computer, I feel your pain, and I try to minimize your troubles throughout this book.</p>
<p>Enough lamenting about having to walk to school uphill both ways in the snow with plastic bags on my feet instead of boots. The point is, statistical software packages have undergone an incredible evolution in the last 10 to 15 years, to the point where you can now enter your data quickly and easily in almost any format. Moreover, the choices for data analysis are well organized and listed in pull-down menus. Now almost anyone (even me) can quickly see how to find the necessary procedure and tell the computer what to do. The results come instantly and successfully, and you can cut and paste them into a word-processing document without blinking an eye. For example, comparing the weight loss for people on different weight-loss programs now takes less than three clicks of the mouse to perform, which is great news for folks like me.</p>
<p>Many very useful and efficient statistical software packages exist, including SAS, SPSS, Data Desk, Stat Crunch, MS Excel, and Minitab, and each one has its own pros and cons (and its own users and protesters). My software of choice, and the one I reference throughout this book, is Minitab, because it&#8217;s very easy to use, the results are correct, the output is very clear and professional looking, and the software&#8217;s loaded with all the data-analysis techniques that are used in intermediate statistics as well as in this book. While a site license for Minitab can be expensive, the downloadable student version is available for rent for only a few bucks a semester.</p>
<p><b><i>The downside of today&#8217;s statistical software</i></b></p>
<p>You may be wondering where the downside is in all of this. Is it too good to be true that what was once a tedious, complicated process for analyzing data has now become as easy as checking your e-mail on your cell phone? Yes and no. Yes, it&#8217;s too good to be true that the software practically does everything for you, if you don&#8217;t pay attention to what the programs are really doing. Yes, it&#8217;s too good to be true if you don&#8217;t understand that conditions need to be checked in every situation before an analysis should be applied. Yes, it&#8217;s too good to be true if you take all the results as complete and utter gospel (as too many statistician wannabes do).</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-7.jpg" width="57" height="60" class=""/></p>
<p>Bottom line: Today&#8217;s software packages are too good to be true if you don&#8217;t have a clear and thorough understanding of the intermediate-level statistics that lie underneath them.</p>
<p>Here&#8217;s the good news, though. By reading this book, you gain the understanding you need to set you up for success. You get enough of the underlying intermediate statistical concepts to be empowered, but not be dangerous. You find out what conditions need to be checked on the data before applying an analysis and how to check them. You get a good feel for which analyses to use to answer your question (and which ones can cause you trouble), and you become aware of the kinds of results you can expect. Most importantly, you discover what&#8217;s possible and appropriate to conclude from your analysis and what limitations and caveats you need to make.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-8.jpg" width="52" height="63" class=""/></p>
<p><b><i>Rule #1: Look Before You Crunch</i></b></p>
<p>Many people don&#8217;t realize that statistical software can&#8217;t tell you when to use and not to use a certain statistical technique. You have to determine that on your own. As a result, people think they&#8217;re doing their analyses correctly, but they can end up making all kinds of mistakes. Statistical software packages are centered on mathematical formulas, and mathematical formulas aren&#8217;t smart enough to know how you&#8217;re applying them or to warn you when you&#8217;re doing something wrong (that&#8217;s where this book comes in).</p>
<p>In this section, I give some examples of some of the major situations where innocent data analyses can go wrong and why it&#8217;s important to know what&#8217;s happening behind the scenes from a statistical standpoint before you start crunching numbers.</p>
<p><b><i>Nothing (even a straight line) lasts forever</i></b></p>
<p>After you get a statistical equation, or <i>model</i>, that tries to explain or predict some random phenomenon, you need to specify for what values the equation applies and for what values it doesn&#8217;t. Equations don&#8217;t know when they work and when they don&#8217;t; it&#8217;s up to the data analyst to determine that. This idea is the same for applying the results of any data analysis that you do.</p>
<p>Bill Prediction is a statistics student studying the effect of study time on exam score. Based on his experience, and that of a few friends, Bill comes up with the equation <i>y</i> = 10<i>x</i> + 30, where <i>y</i> represents the test score you get if you study a certain number of hours (<i>x</i>). This equation is Bill&#8217;s model for predicting exam score using study time. Notice that this model is the equation of a straight line with a <i>y</i>-intercept of 30 and a slope of 10.</p>
<p>So Bill predicts, using this model, that if you don&#8217;t study at all, you&#8217;ll get a 30 on the exam (plugging <i>x</i> = 0 into the equation and solving for <i>y</i>; this point represents the <i>y</i>-intercept of the line). And he predicts, using this model, that if you study for five hours, you&#8217;ll get an exam score of <i>y</i> = 10 * 5 + 30 = 80. So, the point (5, 80) is also on this line. (I won&#8217;t talk in detail at this point about how well Bill&#8217;s model does at predicting exam score, but you can just say he&#8217;s got some work to do on this and leave it at that for now.)</p>
<p>I&#8217;m sure you would agree that because <i>x</i> is the amount of study time, <i>x</i> can never be a number less than zero. If you plug a negative number in for <i>x</i>, say <i>x</i> = -10, you get <i>y</i> = 10 * -10 + 30 = -70, which makes no sense. The worst possible score, according to Bill&#8217;s model, is 30, which occurs when <i>x</i> equals 0. And you can&#8217;t study a negative number of hours, so a negative number for <i>x</i> itself isn&#8217;t even possible.</p>
<p>On the other side of the coin, <i>x</i> probably isn&#8217;t a number in the two-digit range (10 or more). Why is this? Say someone did study ten hours for this exam. Plugging in 10 for <i>x</i> in Bill&#8217;s equation, you get <i>y</i> = 10 * 10 + 30, which equals 130. Remember, <i>y</i> is the predicted exam score. Because most exams are out of 100 possible points, a score of 130 isn&#8217;t possible. (I&#8217;m all for extra credit on exams, but 30 points of extra credit is too much, even for me.)</p>
<p>The point is that there are limits on the values of <i>x</i> that make sense in this equation. However, the equation itself, <i>y</i> = 10<i>x</i> + 30, doesn&#8217;t know that, and if you graph this line, it&#8217;ll go on forever in both the positive and negative directions (see Figure 1-1).</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-9.png" width="192" height="191" class=""/></p>
<p><b>Figure 1-1:</b> The line <i>y</i> = 10<i>x</i> + 30, for all possible values of <i>x</i>, passing through the points (-10, -70), (0, 30), and (10, 130).</p>
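<p>One way to keep an equation from being used where it doesn&#8217;t apply is to build the limits into the code that evaluates it. The sketch below does that for Bill&#8217;s model. The upper limit of 7 hours is an assumption chosen for this illustration (it&#8217;s the study time at which the predicted score reaches 100); the text only says that values of 10 or more don&#8217;t make sense.</p>

```python
def predict_exam_score(hours, min_hours=0.0, max_hours=7.0):
    """Bill's straight-line model, restricted to study times where it makes sense.

    The default limits are illustrative assumptions, not part of Bill's model.
    """
    if min_hours > hours or hours > max_hours:
        raise ValueError(f"model not valid for {hours} hours of study")
    return 10.0 * hours + 30.0   # slope of 10, y-intercept of 30

print(predict_exam_score(0))   # the y-intercept
print(predict_exam_score(5))   # the point (5, 80) on the line

# Values outside the sensible range are rejected instead of silently extrapolated
for bad_hours in (-10, 10):
    try:
        predict_exam_score(bad_hours)
    except ValueError as error:
        print(error)
```

<p>The equation still doesn&#8217;t know its own limits; the analyst who wrote the check does. Deciding where those limits belong is a judgment call that depends on the data the model was built from.</p>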
<p><b><i>Data snooping isn&#8217;t cool</i></b></p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-10.jpg" width="57" height="60" class=""/></p>
<p>Statisticians have come up with a saying that you may have heard of: &quot;Figures don&#8217;t lie. Liars figure.&quot; Make sure that you find out about all the analyses that were performed on a data set, not just the ones reported as being statistically significant.</p>
<p>Suppose Bill Prediction tries to apply his simple model (from the preceding section) to predict exam scores for his whole class, based on their reported amounts of study time, and he finds out that his results fall flat. He figures out that he needs more information, so he tries to uncover what other factors help determine exam score on a statistics test besides study time. Bill measures everything from soup to nuts. His set of possible variables includes study time, GPA, previous experience in statistics, math grades in high school, attitudes toward statistics, whether you listen to classical music while studying, shoe size, whether you chew gum during the exam, and even what your favorite color is (after all, you never know, he figures). For good measure, he includes 11 other variables, for a total of 20 possible factors that he thinks may relate to exam score.</p>
<p>Bill starts out by looking for relationships between each of these variables and exam score, so he does 20 correlations. (<i>Correlation</i> is a measure of the linear relationship between two variables; see the section on correlation later in this chapter.) He finds out that four variables have a statistically significant relationship with exam score (that means each result is supposed to have a 95 percent chance of being correct, but only if he collected the data properly and did the analysis correctly).</p>
<p>The variables that Bill found to be related to exam score were study time, math grades in high school, GPA, and whether the person chews gum during the exam. It turns out that his new model fits pretty well (by criteria I discuss in Chapter 5 on multiple linear regression models). Bill now thinks he&#8217;s scored a home run and has answered that all-elusive question: How can I do better on my statistics test?</p>
<p>But as they said in <i>Apollo 13</i>, &quot;Houston, we have a problem.&quot; By looking at all possible correlations between his 20 variables and exam score, Bill is actually doing 20 separate statistical analyses. Under typical conditions (I describe these conditions in Chapter 3), each statistical analysis has a 5 percent chance of being wrong just by chance (this value of 5 percent is called the <i>significance level</i> of the test).</p>
<p>Because 5 percent of 20 analyses is equal to one, you can expect that when you do 20 statistical analyses, one of them will give the wrong result, just by chance, over the long term. I bet you can guess which one of Bill&#8217;s correlations likely came out wrong in this case. Of course, study time has <i>Nothing </i>To do with exam score, and gum-chewing is the answer to all of our problems, right? (If that were the case, all statisticians would be out of business and working for chewing-gum companies instead.)</p>
<p>What Bill is doing is called <i>data snooping</i> in the data-analysis business. Bill looks around until he finds something, and then he believes the result. This strategy is dangerous, but it&#8217;s used all too often in the real world. One of the reasons data snooping is running rampant today is that everyone and his brother is out there collecting data and analyzing it, and everyone wants to find something. They&#8217;re using statistical software that allows them to just point and click to do as many analyses as they want, without any warning about what statisticians call the <i>overall error rate</i> (that is, the probability of making an error due to chance during any step of the entire analysis, not just the probability of making an error due to chance on any single analysis).</p>
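<p>To put a rough number on the overall error rate, here is a minimal Python sketch. It assumes the 20 tests are independent, which Bill&#8217;s correlations on the same students are not exactly, so treat the result as an approximation. The function name <code>overall_error_rate</code> is made up for this illustration.</p>

```python
# Chance of at least one false positive across k tests, each run at
# significance level alpha. Independence of the tests is an assumption.
def overall_error_rate(alpha: float, k: int) -> float:
    return 1 - (1 - alpha) ** k

alpha = 0.05   # significance level of each single test
k = 20         # Bill's 20 correlations

# Expected number of wrong results due to chance: 5 percent of 20.
print(alpha * k)

# Probability that at least one of the 20 results is wrong by chance.
print(round(overall_error_rate(alpha, k), 3))
```

<p>Under that independence assumption, the chance of at least one chance finding among 20 tests is far higher than the 5 percent that applies to any single test, which is why snooping through many analyses is so risky.</p>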
<p><b><i>No (data) fishing allowed</i></b></p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-11.jpg" width="57" height="60" class=""/></p>
<p>Redoing analyses in different ways to try to get the results you want is called <i>data fishing</i> in the statistics business, and folks in the stat biz consider it to be a major no-no (however, people unfortunately do it all too often in the name of research).</p>
<p>For example, Ellen Go-getter is convinced that dissolving sugar in the water helps cut flowers last longer. She performs an experiment to prove her hypothesis. She cuts two dozen roses and puts one rose in each vase. She fills each vase with 3 cups of water, but in 12 of the vases she adds 1 tablespoon of sugar (the other 12 vases constitute the control group, meaning that Ellen doesn&#8217;t apply any new treatment to them to show what happens if she adds nothing). In the next sections, you follow Ellen through her experiment, keeping an eye on the statistical analyses that pop up along the way.</p>
<p><b><i>Examining Ellen&#8217;s data</i></b></p>
<p>Ellen counts how many days the flowers still look nice and uses the same criteria for each flower. After ten days, all the flowers have withered to the point where they need to be thrown away, so the experiment is over. You can see Ellen&#8217;s data in Table 1-1.</p>
<table class=msonormaltable border=1 cellpadding=0 style='mso-cellspacing:1.5pt; mso-yfti-tbllook:1184' frame=box rules=all>
<tr>
<td colspan=3>
<p><b>Table 1-1 Ellen&#8217;s Data: Days Roses Lasted in Sugar Water versus Regular Water (Control Group)</b></p>
</td>
</tr>
<tr>
<td><p><b><i>Observation</i></b></p></td>
<td><p><b><i>Days Lasted: Water Only</i></b></p></td>
<td><p><b><i>Days Lasted: Sugar Water</i></b></p></td>
</tr>
<tr>
<td><p>1</p></td>
<td><p>3</p></td>
<td><p>5</p></td>
</tr>
<tr>
<td><p>2</p></td>
<td><p>3</p></td>
<td><p>5</p></td>
</tr>
<tr>
<td><p>3</p></td>
<td><p>4</p></td>
<td><p>5</p></td>
</tr>
<tr>
<td><p>4</p></td>
<td><p>4</p></td>
<td><p>4</p></td>
</tr>
<tr>
<td><p>5</p></td>
<td><p>4</p></td>
<td><p>4</p></td>
</tr>
<tr>
<td><p>6</p></td>
<td><p>4</p></td>
<td><p>4</p></td>
</tr>
<tr>
<td><p>7</p></td>
<td><p>3</p></td>
<td><p>3</p></td>
</tr>
<tr>
<td><p>8</p></td>
<td><p>3</p></td>
<td><p>4</p></td>
</tr>
<tr>
<td><p>9</p></td>
<td><p>2</p></td>
<td><p>3</p></td>
</tr>
<tr>
<td><p>10</p></td>
<td><p>4</p></td>
<td><p>3</p></td>
</tr>
<tr>
<td><p>11</p></td>
<td><p>4</p></td>
<td><p>5</p></td>
</tr>
<tr>
<td><p>12</p></td>
<td><p>4</p></td>
<td><p>5</p></td>
</tr>
</table>
<p><b><i>Setting the hypothesis</i></b></p>
<p>Ellen wants to compare the two methods, water and sugar, to see whether the roses that had sugar added lasted longer than the regular water group. She needs to conduct a hypothesis test whose null hypothesis is Ho: There is no difference in days lasted for sugar group versus control group. Her alternative hypothesis, which she hopes to show, is Ha: The roses in the sugar group lasted longer than the control group. She figures a two-sample <i>t</i>-test is in order here. (I discuss hypothesis tests in Chapter 3.)</p>
<p><b><i>Checking the conditions</i></b></p>
<p>Ellen has taken a few statistics classes before and knows that before she plunges into an analysis, she needs to check the proper conditions. For a comparison of two groups, she has to plot the data from each group on a <i>histogram</i> (a bar graph showing the number of days the flowers lasted, organized into groupings in numerical order versus the number of flowers that lasted each number of days). According to what she knows about a two-sample <i>t</i>-test, the data in each group has to have a normal distribution before she starts. That is, the data has to have a bell-shaped curve when you look at the histogram. Ellen plots the data in histograms for the two groups and gets the following results (see Figures 1-2a and 1-2b).</p>
<p><b>Figure 1-2:</b> Histograms showing number of days roses lasted, using water only versus sugar added. Panel a: Histogram of Days Lasted: Water Only. Panel b: Histogram of Days Lasted: Sugar Group. The horizontal axis of each panel shows days lasted, from 2 to 5.</p>
<p><b><i>Getting the bad news</i></b></p>
<p>As you can see in Figures 1-2a and 1-2b, Ellen&#8217;s data doesn&#8217;t follow the typical bell-shaped curve. One of the problems is her data only takes on values that are positive whole numbers, so numbers like 1.2, 2.3, and the like aren&#8217;t possible. (Normal distributions are supposed to have many possible values.) The other problem is that the data has no values outside the typical two-, three-, four-, or five-day range, so the histogram doesn&#8217;t have a chance to take on a bell shape. Perhaps more data would have curbed this problem. At any rate, Ellen knows that the conditions for a two-sample <i>t</i>-test aren&#8217;t met here; namely that the data doesn&#8217;t have a normal distribution and is, in fact, <i>skewed</i> (meaning set off to one side or the other).</p>
<p><b><i>Going nonparametric</i></b></p>
<p>Undaunted by this turn of events, Ellen employs a nonparametric test of her data, which is the right thing to do. Statisticians use <i>nonparametric statistics</i> in situations where the assumptions of the typical analyses aren&#8217;t met (like not having a normal distribution). However, nonparametric stats often give more conservative (albeit more accurate) results than the typical (parametric) procedures you&#8217;re used to using. (I discuss nonparametrics a bit more in the last section of this chapter. Nonparametric procedures are discussed in full detail in Chapters 16-19.)</p>
<p>Because Ellen&#8217;s data doesn&#8217;t have a normal distribution or even a <i>symmetric distribution</i> (meaning one that looks the same on each side when you split it down the middle), the mean (or average) isn&#8217;t a good measure of the center of the data, so a two-sample <i>t</i>-test isn&#8217;t possible. As an alternative, she can test whether the two histograms are the same or not, if she compares the histograms of the two populations in question (all roses given water, versus all roses given sugar water).</p>
<p>Because she&#8217;s comparing two groups, Ellen uses a Wilcoxon Rank Sum test, also known as the Mann-Whitney test (see Chapter 19). The Wilcoxon Rank Sum test checks whether two populations have the same distribution (meaning whether the two histograms look the same) versus one of the populations shifting to the right or left. Ellen&#8217;s theory is that the sugar group lasts longer, so she tests Ho: Sugar group and control group have the same distribution versus Ha: Sugar group is shifted to the right of the control group.</p>
<p><b><i>Ellen strikes out</i></b></p>
<p>To cut to the chase, the Wilcoxon Rank Sum test unfortunately fails to reject Ellen&#8217;s null hypothesis. She didn&#8217;t prove what she wanted to confirm by her experiment. Not enough roses in the sugar group lasted longer than those roses in the control group. You can see the underlying reason for this result by comparing the medians of the two groups. When you find the median of each of the data sets in Table 1-1, you get the value of 4 in each case. Because the medians of the two data sets are equal, it&#8217;s unlikely that Ellen can find a statistically significant result by using this test.</p>
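<p>The median comparison is easy to check yourself. The short Python sketch below uses the values from Table 1-1 (reading the first row as observation 1 with 3 days for water only) and the standard library&#8217;s <code>statistics</code> module. The list names are chosen for this illustration.</p>

```python
from statistics import mean, median

# Days lasted, copied from Table 1-1 (observations 1 through 12).
water_only = [3, 3, 4, 4, 4, 4, 3, 3, 2, 4, 4, 4]
sugar_water = [5, 5, 5, 4, 4, 4, 3, 4, 3, 3, 5, 5]

# The medians are the same in both groups.
print(median(water_only), median(sugar_water))

# The means are not the same, which matters for any test built on means.
print(mean(water_only), round(mean(sugar_water), 2))
```

<p>Both medians come out to 4, as the text says, while the two sample means differ. That gap between the means is what a test based on means picks up, whether or not the conditions for such a test are met.</p>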
<p><b><i>Breaking the rules</i></b></p>
<p>According to the rules that all good statisticians live by, Ellen&#8217;s story should end there. She may still be convinced that sugar indeed helps roses last longer. She may use sugar with her roses for the rest of time and tell her friends to use it too. But, she isn&#8217;t allowed to say that sugar water gives statistically different results than water alone; her analysis failed to show that.</p>
<p>But remember, Ellen&#8217;s last name is Go-getter, so she&#8217;s out to get those results. She knows that nonparametric tests usually give more conservative results than regular tests, and despite the fact that the conditions aren&#8217;t met, she decides to analyze her data again, this time using the two-sample <i>t</i>-test.</p>
<p>Putting her data into a two-sample <i>t</i>-test takes only two more clicks of the mouse, and Ellen&#8217;s results give her a p-value of 0.043. Her p-value is less than the usual significance level for hypothesis tests, 0.050, so she can reject Ho. (In a two-sample <i>t</i>-test, Ho is that there&#8217;s no difference in the means of the two groups. And her Ha in this case is that the mean of the sugar group is larger than the mean of the control group.) So Ellen gleefully cheers herself on for getting the results she wanted and decides there&#8217;s no harm in trying a different analysis when all else fails.</p>
<p><b><i>Seeing the error of Ellen&#8217;s ways</i></b></p>
<p>But again, &quot;Houston . . .&quot; You know the rest. Ellen&#8217;s problem is that she cheated her way to getting a result that&#8217;s incorrect. She knew that the conditions for the two-sample <i>t</i>-test weren&#8217;t met, but when the correct analysis failed to get the results she wanted, she found an analysis that did. The trouble is, the results of the two-sample <i>t</i>-test are bogus.</p>
<p>Now it may not be a life-and-death situation whether your roses actually do last a little bit longer on sugar or not. (Incidentally, the gardening crowd says they don&#8217;t, and that sugar in fact can encourage the growth of stem-clogging bacteria so the flower can&#8217;t take in water.) But imagine a situation where doctors are trying to test to see whether a certain medication helps people get over an illness faster or whether some procedure helps cancer patients live longer. Now you&#8217;re talking about results with a very serious impact.</p>
<p>Using the wrong data analysis for the sake of getting the results you desire results in two major problems:</p>
<p>You mislead your audience into thinking that your hypothesis is actually correct, which it may not be.</p>
<p>Sooner or later someone is going to try to replicate those results and will find out that they can&#8217;t be replicated. This discovery will result in a loss of your credibility <i>big time</i>. And unfortunately, you mislead many people in the meantime.</p>
<p><b><i>Getting the Big Picture: An Overview of Intermediate Statistics</i></b></p>
<p>Because of the dangers and lingering effects of using the wrong techniques in the wrong situation to analyze data to answer questions, knowing what&#8217;s happening behind the scenes of any data analysis and staying within the rules of well-chosen techniques and appropriate practices is very important. In other words, it&#8217;s crucial for you to take your knowledge of statistics to the next level.</p>
<p>Intermediate statistics is an extension of introductory statistics, so the jargon follows suit and the techniques build on what you already know. If you&#8217;ve been able to grasp the ideas from the first course, you&#8217;ll find no trouble with the terminology for intermediate statistics. If you&#8217;re still unsure about some of the terms from introductory statistics, you can consult your textbook from your first course or see my other book, <i>Statistics For Dummies </i>(Wiley), for a complete rundown.</p>
<p>In this section, you get an introduction to the terminology you use in intermediate statistics, and you get a broad overview of the techniques that statisticians use for the purpose of analyzing data and the big picture behind them.</p>
<p><b><i>Population parameter</i></b></p>
<p>A <i>parameter</i> is a number that summarizes the population (the entire group you&#8217;re interested in investigating). Examples of parameters include the mean of a population, the median of a population, or the proportion of the population that falls into a certain category.</p>
<p>Suppose you want to determine the average length of a cell-phone call among teenagers (ages 13 to 18). You&#8217;re not interested in making any comparisons; you just want to make a good guesstimate as to what the average time is. So you want to estimate a population parameter (such as the mean or average). The population is all cell-phone users between the ages of 13 and 18 years old. The parameter is the average length of a phone call this population makes.</p>
<p><b><i>Sample statistic</i></b></p>
<p>You normally can&#8217;t study every member of an entire population (how would you like to measure and record the length of every single cell-phone call made by all teenagers?). So you can&#8217;t determine population parameters exactly; you can only estimate them. But all is not lost; by taking a sample (a subset of individuals) from the population and studying them, you can come up with a good guess (estimate) of the population parameter, if you play your cards right. A subset of this population is called a <i>sample</i>. A <i>sample statistic</i> is a single number that summarizes that subset of the population.</p>
<p>For example, in the cell-phone scenario, you select a sample of teenagers and measure the length of their cell-phone calls over a period of time (or look at their cell-phone records if you can gain access legally). You take the average of the cell-phone call lengths. For example, the average length of 100 cell-phone calls may be 12.2 minutes; this average is a statistic. This particular statistic is called the <i>sample mean</i>, because it&#8217;s the average value from your sample data.</p>
<p>You can also find a statistic called the <i>sample proportion</i> (the proportion of individuals in the sample that have a certain characteristic; for example, the percentage of female teens who use cell phones). Many different statistics are available (which you probably picked up in intro stats) to study different characteristics of a sample, such as the median, variance, and standard deviation.</p>
<p><b><i>Confidence interval</i></b></p>
<p>A <i>confidence interval</i> is a range of values that provides reasonable estimates for a population parameter. A confidence interval is based on a sample and the statistics that come from that sample. The main reason you want to provide a range of possible values rather than a single number is that sample results vary from sample to sample.</p>
<p>For example, say you want to estimate the percentage of people who eat chocolate. According to the Simmons Research Bureau, 78 percent of adults reported eating chocolate, and of those, 18 percent admitted to eating sweets frequently. What&#8217;s missing in these results? These numbers are only a single sample of people, and those sample results are guaranteed to vary from sample to sample. You need some measure of how much you can expect those results to move if you were to repeat the study.</p>
<p>This expected movement in your statistic is measured by the <i>margin of error</i>, which reflects a certain number of standard deviations of your statistic you add and subtract to have a certain confidence in your results (see Chapter 3 for more on margin of error). If the chocolate-eater results were based on 1,000 people, the margin of error would be approximately 3 percent, meaning the actual percentage of people who eat chocolate in the entire population is expected to be 78 percent, plus or minus 3 percent. In other words, it&#8217;s somewhere between 75 percent and 81 percent. Now if you only base these results on a sample of 100 people, the margin of error balloons to 10 percent, meaning the percentage of chocolate eaters can only be reported to be between 68 and 88 percent. Notice how much wider the interval becomes when a smaller sample size is used. This result confirms that more data means more precision in your results (provided the data is collected properly).</p>
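<p>The text doesn&#8217;t say which formula produces the 3 percent and 10 percent figures. One common approach that reproduces them is the conservative margin of error for a proportion, which plugs in 0.5 for the proportion and uses 1.96 standard errors for 95 percent confidence. The Python sketch below works under that assumption; the function name is invented for this illustration.</p>

```python
import math

def conservative_margin_of_error(n: int, z: float = 1.96) -> float:
    """Worst-case margin of error for a sample proportion (uses p = 0.5)."""
    return z * math.sqrt(0.5 * 0.5 / n)

p_hat = 0.78  # sample proportion of chocolate eaters quoted in the text

for n in (1000, 100):
    moe = conservative_margin_of_error(n)
    low, high = p_hat - moe, p_hat + moe
    # Sample size, margin of error, and the resulting interval.
    print(n, round(moe, 3), round(low, 2), round(high, 2))
```

<p>Because the sample size sits under a square root, cutting the sample from 1,000 to 100 people roughly triples the margin of error rather than multiplying it by ten.</p>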
<p><b><i>Hypothesis test</i></b></p>
<p>A <i>hypothesis test</i> is a statistical procedure that you use to test an existing claim about the population, using your data. The claim is noted by Ho (the null hypothesis). If your data support the claim, you fail to reject Ho. If your data don&#8217;t support the claim, you reject Ho and conclude an alternative hypothesis, Ha. The reason most people conduct a hypothesis test is not to merely show that their data support an existing claim, but rather to show that the existing claim is false, in favor of the alternative hypothesis.</p>
<p>The Pew Research Center studied the percentage of people who go to ESPN for their sports news. Their statistics, based on a survey of about 1,000 people, found that in 2000, 23 percent of people said they go to ESPN; while in 2004, only 20 percent reported going to ESPN. The question is this: Does this 3-percent reduction in viewers from 2000 to 2004 represent a significant trend that ESPN should worry about?</p>
<p>To test these differences formally, you can set up a hypothesis test. You set up your null hypothesis as the result you have to believe without your study, Ho: No difference exists between 2000 and 2004 data for ESPN viewership. Your alternative hypothesis (Ha) is that a difference exists.</p>
<p>In very general terms, here&#8217;s what&#8217;s happening with a hypothesis test. You have the sample data, and you find the statistics that are relevant. In this case, you have two sample percentages, one for 2000 and one for 2004. You take the difference between the two samples (3 percent), and divide it by the standard error for the difference. The standard error measures how much the difference in the statistics is expected to change from sample to sample. In this case, the standard error comes to about 1.8 percent (for specific calculations see Chapter 3).</p>
<p>Taking the difference in the statistics (3 percent = 0.03) divided by the standard error (1.8 percent = 0.018) gives you the value of 1.67 (called the <i>test statistic</i>). This value represents the difference between the two statistics, in terms of number of standard errors. This result has a universal interpretation. Roughly speaking, if your test statistic falls between -2.00 and +2.00, the results you found don&#8217;t differ enough to get excited about, because when no real difference exists, a value in this range turns up about 95 percent of the time just by chance. (And this example falls right into that situation.) After you take the variability of the sample results into account, the difference in these particular samples doesn&#8217;t transfer over to the populations they represent. So, because you can&#8217;t reject Ho, you have to say the percentage of viewers of ESPN in the entire population probably didn&#8217;t change from 2000 to 2004.</p>
<p>Because you have a 95 percent confidence level, this test uses a significance level (&#945; level) of 1 &#8211; 0.95 = 0.05 or 5 percent. This percentage is the chance you accept of rejecting Ho when it&#8217;s actually true, just because of an unusual sample.</p>
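<p>The arithmetic behind the test statistic can be sketched in a few lines of Python. The standard-error formula for a difference of two proportions is the usual textbook one, and the sample size of 1,000 per year is an assumption (the text says only that the survey covered about 1,000 people), so take the output as a rough check rather than the exact calculation from Chapter 3.</p>

```python
import math

# Sample proportions from the text; n = 1000 for each year is an assumption.
p_2000, p_2004, n = 0.23, 0.20, 1000

difference = p_2000 - p_2004

# Standard error of the difference between two independent sample proportions.
standard_error = math.sqrt(
    p_2000 * (1 - p_2000) / n + p_2004 * (1 - p_2004) / n
)

# Test statistic: the difference measured in standard errors.
test_statistic = difference / standard_error

print(round(standard_error, 3))
print(round(test_statistic, 2))

# Rough rule of thumb from the text: get excited only beyond about 2.
print(abs(test_statistic) > 2.0)
```

<p>With the unrounded standard error, the test statistic lands slightly below the 1.67 you get from dividing 0.03 by the rounded 0.018. Either way it falls between -2.00 and +2.00, so the conclusion is the same.</p>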
<p>The trouble is that people often just report the sample statistics and give no regard to the expected amount of change with a new sample. This disregard leads to big mistakes in the conclusions (more on hypothesis testing in Chapter 3).</p>
<p><b><i>Analysis of Variance (ANOVA)</i></b></p>
<p>ANOVA is the acronym for <i>analysis of variance</i>. You use ANOVA in situations where you want to compare the means of more than two populations. For example, you want to compare the lifetime of four brands of tires, in number of miles. You take a random sample of 50 tires from each group, for a total of 200 tires, and set up an experiment to compare the lifetime of each tire, and record it. You have four means and four standard deviations now, one for each data set. But you have different types of variability in your data, each measured by using various sums of squares. (Remember from your intro stats that the variance of a data set is the total of all the squared distances between the data and the mean, all divided by <i>n</i> &#8211; 1.)</p>
<p>One of the types of variability in your data is called the variability <i>between</i> treatments (also known as <i>SST</i>, the treatment sums of squares). SST measures the variation in the average lifetimes of each brand of tire, compared to the overall average lifetime. If SST is large, you have a chance that there&#8217;s a difference in lifetimes due to the treatment (in this case, the brand of tire).</p>
<p>Next, you have the variability <i>within</i> the treatments (also known as <i>SSE</i>, the error sums of squares). SSE measures the overall average amount of variability of the tire lifetimes within each particular brand (after all, not all tires are created equal, even if they&#8217;re of the same brand). If SSE is large, you have so much variability within the tire brands themselves, that it will be harder to see any real difference between the brands, even if it actually exists.</p>
<p>And finally, you have the <i>total</i> overall variability in the data values if you just put them all together into one big data set. This variability is known as <i>SSTO</i>, the total sums of squares. ANOVA splits up the total variability (SSTO) into the between-groups variability (SST) plus the within-groups variability (SSE).</p>
<p>Then, to test for differences in average lifetime for the four brands of tires, you compare the mean sums of squares for treatments (MST) to the mean sums of squares for error (MSE) in a ratio called the <i>F-statistic</i>. If this ratio is large, then the variability between the brands is more than the variability within the brands, giving evidence that not all the means are the same for the different tire brands. If the <i>F</i>-statistic is small, that means not enough difference exists between the treatment means, compared to the general variability within the treatments themselves. In this case, you can&#8217;t say that the means are different for the groups. (I give you the full scoop on ANOVA in Chapters 9 and 10.)</p>
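<p>To see how the sums of squares fit together, here is a small Python sketch of a one-way ANOVA calculation. The tire lifetimes are hypothetical numbers invented for this illustration (four tires per brand, in thousands of miles), not data from the text, and the degrees of freedom are the standard ones for one-way ANOVA.</p>

```python
from statistics import mean

# Hypothetical tire lifetimes in thousands of miles (not data from the text).
groups = {
    "Brand A": [50, 52, 48, 51],
    "Brand B": [55, 57, 54, 56],
    "Brand C": [49, 47, 50, 48],
    "Brand D": [60, 58, 61, 59],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# SST: between-treatments variability (group means versus the grand mean).
sst = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
# SSE: within-treatments variability (each value versus its own group mean).
sse = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
# SSTO: total variability (each value versus the grand mean).
ssto = sum((x - grand_mean) ** 2 for x in all_values)

k, n = len(groups), len(all_values)
mst = sst / (k - 1)   # mean sum of squares for treatments
mse = sse / (n - k)   # mean sum of squares for error
f_statistic = mst / mse

print(round(sst, 2), round(sse, 2), round(ssto, 2))
print(round(f_statistic, 2))
```

<p>In this made-up data set the brands differ a lot relative to the spread within each brand, so the F-statistic comes out large. Whatever numbers you plug in, SST plus SSE always adds back up to SSTO.</p>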
<p><b><i>Multiple comparisons</i></b></p>
<p>Suppose you conduct ANOVA, and you find a difference in the average lifetimes of the four brands of tire (see preceding section). Your next questions would probably be, which brands are different, and how different are they? To answer these questions, you use multiple-comparison procedures.</p>
<p>A <i>multiple-comparison procedure</i> is a statistical technique that compares means to each other and finds out which ones are different and which ones aren&#8217;t. You&#8217;re then able to put the groups in order, from those with the largest mean to those with the smallest mean, realizing that sometimes two or more groups were too close to tell and so you put them in the same group.</p>
<p>Suppose you compare the exam scores of four different classes (call them class one, class two, class three, and class four), and your ANOVA procedure finds out that not all the means were the same. That means the <i>F</i>-statistic is large. Next, you use multiple-comparison procedures in order to make separate comparisons and figure out which classes were about the same and which ones were different, and come up with an ordering of the classes. It may be, for example, that class four was statistically higher than all the others; classes two and three were statistically equivalent, but both were lower than class four. And class one was in a group all by itself at the bottom. The ordering is: class four (highest average), classes two and three (tied for second highest), and class one (the lowest average).</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-12.jpg" width="57" height="60" class=""/></p>
<p>Never take that second step to compare the means of the groups if the ANOVA procedure doesn&#8217;t find any significant results during the first step. (See Chapter 11 for more information.)</p>
<p>Many different multiple-comparison procedures exist to compare individual means and come up with an ordering in the event that your <i>F</i>-statistic does find that some difference exists. Some of the multiple-comparison procedures include Tukey&#8217;s test, LSD, and pairwise <i>t</i>-tests. (While these tests&#8217; names may cause you to raise an eyebrow, don&#8217;t worry. They&#8217;re legitimate statistical tests.) Some procedures are better than others, depending on the conditions and your goal as a data analyst. I discuss multiple-comparison procedures in detail in Chapter 11.</p>
<p><b><i>Interaction effects</i></b></p>
<p>An <i>interaction effect</i> in statistics operates the same way that it does in the world of medicine. Sometimes if you take two different medicines at the same time, the combined effect is much different than if you take the two individual medications separately.</p>
<p>Interaction effects come up when you have a model that includes two or more variables, and you&#8217;re using those variables to explain differences or to make comparisons regarding some outcome. When you have two or more variables in a model, you can&#8217;t automatically study the effect of each variable separately; you also have to take into account the way those variables interact in terms of the outcome. In other words, you have to examine whether or not an interaction effect is present.</p>
<p>For example, suppose medical researchers are studying a new drug for depression and want to know how this drug affects the change in blood pressure for a low dose versus a high dose of the drug. They also compare the effects for children versus adults. In total, the model being studied has one response variable, an increase in blood pressure, and two factors that may possibly explain changes in the outcome, namely age group (adults versus children) and dosage level (low versus high). It could be that dosage level affects the blood pressure of adults differently than the blood pressure of children. This type of model is called a <i>two-way ANOVA model</i>, with a possible interaction effect between the two factors (age group and dosage level). See Chapter 11 for more.</p>
<p>One of the first things statisticians do when they have a two-way ANOVA is to plot the mean outcomes for each group they&#8217;re comparing and look for patterns. This is called an <i>interaction plot</i>. One interaction plot for the drug-study scenario is in Figure 1-3.</p>
<p><b>Figure 1-3:</b> Interaction between age group and dosage level when studying the effect on blood pressure.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-13.png" width="284" height="216" class=""/></p>
<p>Horizontal axis: Dosage Level (Low, High).</p>
<p>As you can see by Figure 1-3, the lines cross. If you look at the line representing children, you can see that the mean increase in blood pressure is low for the low dose of the drug, but for the high dose of the drug, the increase in blood pressure goes way up. In contrast, the reaction is the exact opposite for adults; on the low dose, the mean increase in blood pressure is very high, but for the high dose, the increase is very low. If the doctors neglected to study children as well as adults, the results of this study could be extremely damaging to children if doctors applied the rules for adults to children. This example shows that interaction effects are very important to look at.</p>
<p>Figure 1-4 shows the situation where you have no interaction effect for this drug. The lines are parallel, which tells you that the mean blood pressure increases more on a higher dosage of the drug for both adults and children. Because the line for the adults is higher up than the line for children, that means that overall, the increase in blood pressure is more for adults than the increase in blood pressure for children, no matter what the dosage level is.</p>
<p><b>Figure 1-4:</b> No interaction between age group and dosage level when studying the effect on blood pressure.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-14.png" width="319" height="216" class=""/></p>
<p>Horizontal axis: Dosage Level (Low, High).</p>
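<p>Crossing versus parallel lines can also be checked with simple arithmetic on the group means. The Python sketch below uses hypothetical mean increases in blood pressure (illustrative numbers, not values read from Figures 1-3 or 1-4) and compares the dosage effect for adults with the dosage effect for children. The function name is made up for this example.</p>

```python
# Hypothetical mean increases in blood pressure for each cell
# (illustrative numbers, not read from Figures 1-3 or 1-4).
crossing = {("children", "low"): 5, ("children", "high"): 25,
            ("adults", "low"): 25, ("adults", "high"): 5}
parallel = {("children", "low"): 5, ("children", "high"): 15,
            ("adults", "low"): 15, ("adults", "high"): 25}

def interaction_contrast(cell_means: dict) -> float:
    """Dosage effect for adults minus dosage effect for children."""
    adult_effect = cell_means[("adults", "high")] - cell_means[("adults", "low")]
    child_effect = cell_means[("children", "high")] - cell_means[("children", "low")]
    return adult_effect - child_effect

print(interaction_contrast(crossing))   # -40: the lines are not parallel
print(interaction_contrast(parallel))   # 0: the lines are parallel
```

<p>A contrast of zero means the dosage effect is the same for both age groups (parallel lines). A contrast far from zero hints at an interaction, but only the two-way ANOVA can tell you whether it&#8217;s statistically significant.</p>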
<p><b><i>Correlation</i></b></p>
<p>The term <i>correlation</i> is often misused. Statistically speaking, the correlation measures the strength and direction of the linear relationship between two quantitative variables (variables that represent counts or measurements only).</p>
<p>You aren&#8217;t supposed to use the word <i>correlation</i> to talk about relationships of any other kind. For example, it&#8217;s wrong to say that a correlation exists between eye color and hair color. While these variables may be related in some way, they&#8217;re not quantitative variables, so you can&#8217;t discuss their relationship in terms of a correlation. (In this case, you would use the term <i>association</i>; in Chapter 14, you see how to test for association of two categorical variables.)</p>
<p>The long and short of correlation is the following: <i>Correlation</i> is a number between -1.0 and +1.0. Positive one indicates a perfect positive relationship; in other words, as you increase one variable, the other one increases in perfect sync. On the other side of the coin, a correlation that is -1.0 indicates a perfect negative relationship between the variables. As one variable increases, the other one decreases in perfect sync. A correlation of zero indicates that you found no linear relationship at all between the variables. Most correlations in the real world aren&#8217;t exactly +1.0, -1.0, or 0; they fall somewhere in between. The closer to +1.0 or -1.0, the stronger the relationship is; the closer to 0, the weaker the relationship is.</p>
<p>Figure 1-5 shows an example of a plot showing the number of coffees sold at football games in Buffalo, New York, as well as the air temperature (in Fahrenheit) at each game. This data set seems to follow a downhill straight line fairly well, indicating a negative correlation. When you calculate the correlation, you get the value of -0.741. This value says that coffees sold has a fairly strong negative relationship with the temperature of the football game. This makes sense, because on days when the temperature is low, people will get cold and want more coffee. On days when the temperature is higher, people will tend to drink less coffee and perhaps tend more toward soft drinks, which are cold. I discuss correlation further, as it applies to model building, in Chapter 4.</p>
<p><b>Figure 1-5:</b> Coffees sold at various air temperatures on football game day. (Scatterplot titled &quot;Number of Coffees Sold versus Temperature&quot;; x-axis: Temperature (°F), from -10 to 70; y-axis: Coffees, from 0 to 70,000.)</p>
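<p>To make the idea of a correlation concrete, here is a minimal Python sketch of the usual (Pearson) correlation calculation. The function name <i>pearson_r</i> and the six data points are assumptions made up for illustration; they are not the data behind Figure 1-5, so the result won&#8217;t match the -0.741 quoted above.</p>

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical game-day data (NOT the values plotted in Figure 1-5)
temps = [10, 20, 30, 40, 50, 60]                        # degrees Fahrenheit
coffees = [45000, 36000, 35000, 25000, 24000, 15000]    # coffees sold

r = pearson_r(temps, coffees)
print(round(r, 3))   # negative: sales fall as temperature rises
```

<p>Because the made-up points lie close to a downhill straight line, the result lands near -1.0; real data with more scatter would fall closer to 0.</p>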
<p><b><i>Linear regression</i></b></p>
<p>After you&#8217;ve determined that two variables have a fairly strong linear relationship, you may want to try to make predictions for one variable based on the value of the other variable. For example, if you know that a fairly strong negative linear relationship exists between coffees sold and the air temperature at a football game, you may want to use this information to predict how much coffee is needed for a game, just by knowing the temperature. To do that, you find the straight line that best fits the data; this method of finding the best-fitting line is called <i>linear regression</i>.</p>
<p>In the coffees and temperature example (see Figure 1-5), the best-fitting line has the equation <i>y</i> = 49,337 &#8211; 554 * <i>x</i>, where <i>x</i> is temperature and <i>y</i> is the number of coffees sold. So when the temperature (<i>x</i>) is zero degrees, you can expect to sell around 49,337 coffees (this is how you interpret the y-intercept of the line). To interpret the slope of this line, think of -554 as -554 divided by one and use the old rise-over-run idea using coffees and degrees of temperature. Applied here, it means that for every one degree increase in temperature, you can expect the coffee sales to decrease by 554. You can use this line to make predictions for reasonable values of the temperature (<i>x</i>). For example, if the temperature is a cold 20 degrees Fahrenheit, you can predict that the number of coffees sold will be around 49,337 &#8211; 554 * 20 = 38,257.</p>
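<p>The prediction above is simple enough to check with a few lines of Python. This is only a sketch: the intercept and slope come from the text, while the names <i>INTERCEPT</i>, <i>SLOPE</i>, and <i>predict_coffees</i> are made up for illustration.</p>

```python
INTERCEPT = 49337   # from the text: expected coffees sold at 0 degrees F
SLOPE = -554        # from the text: change in coffees sold per 1-degree increase

def predict_coffees(temp_f):
    """Predicted number of coffees sold at a temperature in degrees Fahrenheit."""
    return INTERCEPT + SLOPE * temp_f

print(predict_coffees(0))                          # 49337 (the y-intercept)
print(predict_coffees(20))                         # 38257
print(predict_coffees(21) - predict_coffees(20))   # -554 (the slope)
```

<p>As the text notes, the line is only trustworthy for reasonable temperatures, roughly the range covered by the data.</p>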
<p>When you use only one variable to predict the response, the method of regression is called <i>simple linear regression</i>. (I review the basics of simple linear regression in Chapter 4. But many other types of regression are out there, many of which I discuss in this book.)</p>
<p>Most researchers use more than one variable to predict a response; this technique is called <i>multiple linear regression</i>. (Check out Chapter 5 for the details about multiple linear regression.) Multiple linear regression has many issues of its own because some variables you can use in the model may be related to each other, making overlapping contributions to the response. That possibility of overlapping makes their individual contributions hard to track. You also have to watch for interaction effects when using more than one variable to predict a response.</p>
<p>Simple and multiple linear regression assume that the response variable (the one being studied) is quantitative in nature (that is, it measures or counts something). However, you may be interested in making predictions about a variable that has only two outcomes: yes or no. For example, you may want to predict whether or not a certain horse will win a race, whether a baby will be a girl or a boy, or whether or not a tropical storm is going to make landfall. These situations require a different kind of regression called <i>logistic regression</i>. (I discuss logistic regression in Chapter 8.)</p>
<p>Finally, you may be interested in building a model for which a straight line doesn&#8217;t fit. For example, you may want to predict miles per gallon, using the speed of the car. While high speeds get low miles per gallon, low speeds can get low miles per gallon as well. So the relationship between speed and miles per gallon actually follows that of a <i>parabola</i> (an upside-down bowl, in this case). This kind of relationship is called a <i>quadratic relationship</i>. More generally speaking, relationships that don&#8217;t follow a straight line are called <i>nonlinear relationships</i>, and the technique you use to handle these situations is called (no surprise) <i>nonlinear regression</i>. I get into the meat of this technique in detail in Chapter 7.</p>
<p><b><i>Chi-square tests</i></b></p>
<p>Correlation and regression techniques all assume that the variable being studied in most detail (the response variable) is quantitative. That is, the variable measures or counts something. However, you can run into many situations where the data being studied isn&#8217;t quantitative, but rather qualitative. In other words, the data themselves represent categories, not measurements or counts.</p>
<p>For example, suppose you want to compare the views of the president by political affiliation. Say that in this particular year, the president is a Republican, and you select a random sample of 150 Republicans, 150 Democrats, and 150 Independents to find out their views on the president. The data may look like Table 1-2.</p>
<table class=msonormaltable border=1 cellpadding=0 style='mso-cellspacing:1.5pt; mso-yfti-tbllook:1184' frame=box rules=all>
<tr>
<td colspan=4>
<p><b>Table 1-2: Views on a (Republican) President by Political Affiliation</b></p>
</td>
</tr>
<tr>
<td>
<p><b><i>Political Affiliation</i></b></p>
</td>
<td>
<p><b><i>Approve</i></b></p>
</td>
<td>
<p><b><i>Neutral</i></b></p>
</td>
<td>
<p><b><i>Disapprove</i></b></p>
</td>
</tr>
<tr>
<td>
<p>Republican</p>
</td>
<td>
<p>100</p>
</td>
<td>
<p>40</p>
</td>
<td>
<p>10</p>
</td>
</tr>
<tr>
<td>
<p>Democrat</p>
</td>
<td>
<p>40</p>
</td>
<td>
<p>10</p>
</td>
<td>
<p>100</p>
</td>
</tr>
<tr>
<td>
<p>Independent</p>
</td>
<td>
<p>50</p>
</td>
<td>
<p>50</p>
</td>
<td>
<p>50</p>
</td>
</tr>
</table>
<p>In looking at how the numbers appear across the columns for various rows in Table 1-2, you may suspect that something is up. It appears that Republicans tend to approve of the president, while Democrats tend to disapprove, and Independents are split down the middle. (So much for the spirit of bipartisanship. . . .)</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-15.jpg" width="57" height="52" class=""/></p>
<p>Now does this association you found in the data set for this sample of 450 people carry over to the entire population? In order to answer this question, you need to conduct a hypothesis test. And not just any hypothesis test — a <i>Chi-square test for independence. </i>You&#8217;re testing to see whether the two qualitative variables, political affiliation and views on the president, are related or not. If they are related, the variables are deemed not independent; if they are unrelated, the variables are independent.</p>
<p>A Chi-square test basically does the following: It figures out the number of values that you expect to see in each cell of the table if the variables are independent (these values are brilliantly called the <i>expected cell counts</i>). The Chi-square test then compares these expected cell counts to what you actually saw in the data (called the <i>observed cell counts</i>) and combines the differences into a single Chi-square statistic (see Chapter 14).</p>
<p>If the Chi-square test statistic is large, you have evidence of an association between the two variables, because the total differences between the observed and expected cell counts are large. In other words, you conclude that the variables are not independent, and you can look at the observed cell counts to discuss the relationship you see. If the Chi-square test statistic is small, then you can&#8217;t conclude you&#8217;ve found a relationship; the data are consistent with the two variables being independent.</p>
<p>In the case of political affiliation and views on the president, the Chi-square test statistic is huge, and you conclude a relationship is there somewhere. You can say that in the population, Republicans tend to support the president, Democrats tend to oppose the president, and the Independents are split down the middle. (You can find the details of how to find the expected counts and conduct the Chi-square test in Chapter 14.)</p>
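<p>To see just how huge the statistic is, here is a rough Python sketch that works it out for Table 1-2. It assumes the standard textbook recipe (expected count = row total times column total divided by the grand total, then summing (observed - expected)&#178; / expected over all cells); the function name <i>chi_square_independence</i> is made up for illustration, and Chapter 14 remains the place for the full procedure.</p>

```python
# Observed cell counts from Table 1-2: [Approve, Neutral, Disapprove]
observed = {
    "Republican":  [100, 40, 10],
    "Democrat":    [40, 10, 100],
    "Independent": [50, 50, 50],
}

def chi_square_independence(table):
    """Return (expected counts, Chi-square statistic, degrees of freedom)."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(rows, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return expected, statistic, df

expected, statistic, df = chi_square_independence(observed)
print([round(e, 1) for e in expected[0]])   # expected counts for one row
print(round(statistic, 1), df)
```

<p>Every row has the same expected counts (about 63.3, 33.3, and 53.3) because every group has 150 people. The statistic comes out at roughly 135 with 4 degrees of freedom, far beyond the 0.05 cutoff of about 9.49 found in a standard Chi-square table.</p>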
<p>You can also use the Chi-square test to see whether your theory about what percent of each group falls into a certain category is true or not. For example, can you guess what percentage of M&amp;Ms fall into each color category? More on these Chi-square variations, as well as the M&amp;Ms question, in Chapter 15.</p>
<p><b><i>Nonparametrics</i></b></p>
<p><i>Nonparametrics</i> is an entire area of statistics that provides analysis techniques to use when the conditions for the more traditional and commonly used methods aren&#8217;t met. For example, in order to use a <i>t</i>-test, the data needs to be collected from a population that has a normal distribution (that is, it has to have a bell-shaped curve). In order to do a hypothesis test for two means, the data from each group must be from its own normal population. In fact, almost all of the commonly used data-analysis procedures have conditions that must be met in order to use them.</p>
<p>The trouble with these requirements is that many times people forget or just don&#8217;t bother to check those conditions, and if the conditions are actually not met, the entire analysis goes out the window, and the researcher doesn&#8217;t even know it. Or, someone finds out that the conditions aren&#8217;t being met, yet she still goes ahead and uses the procedures anyway (for more on this faux pas, see the section in this chapter &quot;No [data] fishing allowed&quot;).</p>
<p>While many of the traditional methods are what statisticians call <i>robust</i> with respect to violations of their conditions (that&#8217;s fancy terminology for the fact that they&#8217;re pretty forgiving), you can only push the envelope so far. Proceeding to use a statistical procedure that isn&#8217;t appropriate causes a great deal of trouble with respect to the correctness of the conclusions and the credibility of the researcher.</p>
<p>Have no fear, nonparametrics comes to your rescue. If the conditions aren&#8217;t met for a data-analysis procedure that you want to do, chances are that an equivalent nonparametric procedure is waiting in the wings. And the good news is that they&#8217;re generally pretty tame, in terms of formulas, and most statistical software packages can do them just as easily as the regular (parametric) procedures.</p>
<p>Statistical software packages don&#8217;t check conditions automatically before doing a data analysis. It&#8217;s up to the user to check any and all appropriate conditions, and if they&#8217;re seriously violated, to take another course of action. Many times a nonparametric procedure is just the ticket. For much more information on different nonparametric procedures, see Chapters 16 through 19.</p></p>
</sape_index><!--c715886456-->]]></content:encoded>
			<wfw:commentRss>http://ankar.info/2010/05/15/beyond-number-crunching-the-art-and-science-of-data-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building Confidence and Testing Models</title>
		<link>http://ankar.info/2010/05/15/building-confidence-and-testing-models/</link>
		<comments>http://ankar.info/2010/05/15/building-confidence-and-testing-models/#comments</comments>
		<pubDate>Sat, 15 May 2010 18:30:14 +0000</pubDate>
		<dc:creator>Анкар</dc:creator>
				<category><![CDATA[Data Analysis and Model-Building Basics]]></category>

		<guid isPermaLink="false">http://ankar.info/2010/05/15/building-confidence-and-testing-models/</guid>
		<description><![CDATA[In This Chapter: Utilizing confidence intervals to estimate parameters; Testing models by using hypothesis tests; Finding the probability of getting it right and getting it wrong; Discovering power in a large sample size. One of the major goals in statistics is to use the information you collect from a sample in [...]]]></description>
			<content:encoded><![CDATA[<sape_index><p><b><i>In This Chapter</i></b></p>
<ul>
<li>Utilizing confidence intervals to estimate parameters</li>
<li>Testing models by using hypothesis tests</li>
<li>Finding the probability of getting it right and getting it wrong</li>
<li>Discovering power in a large sample size</li>
</ul>
<p>One of the major goals in statistics is to use the information you collect from a sample in order to get a better idea of what&#8217;s going on in the entire population you&#8217;re studying (because populations are generally large and exact info is often unknown). The most common items to study are the mean of the population, the proportion of the population that has a certain characteristic, or a comparison of the means or proportions from two different populations. These unknown values that summarize the population are called <i>population parameters</i>. Researchers typically either want to get a handle on what those parameters are, or they want to test a hypothesis about the population parameters. In introductory statistics, you typically go over confidence intervals and hypothesis tests for one and two population means and one and two population proportions. Your instructor hopefully emphasized that no matter which parameters you&#8217;re trying to estimate or test, the general process is the same. If not, don&#8217;t worry; that&#8217;s what this chapter&#8217;s all about.</p>
<p>The most important idea you can gain from this chapter is that intermediate statistics focuses on building and testing models. You&#8217;re typically faced with some random phenomenon, and you&#8217;re trying to build a model that explains or predicts that phenomenon. The situation is more complex than it was in intro stats, where you used one variable to predict another variable in simple linear regression. Intermediate statistics takes it up a notch to using many variables to predict another one. But as long as you keep the big picture of how the process works in your mind, you&#8217;ll be okay.</p>
<p>It all comes down in the end to testing hypotheses to see whether certain models fit, and if they do, to using confidence intervals to estimate certain values in the population or to make predictions based on the model that you built.</p>
<p>This chapter reviews the basic concepts of confidence intervals and hypothesis tests, including the probabilities of making errors by chance. I also discuss how statisticians measure the ability of a statistical procedure to do a good job — of detecting a real difference in the populations, for example. Hang on — you&#8217;re in for quite a ride.</p>
<p><b><i>Estimating Parameters by Using Confidence Intervals</i></b></p>
<p>Confidence intervals are a statistician&#8217;s way of covering themselves when it comes to estimating a population parameter. For example, instead of just giving a one-number guess as to what the average household income is in the United States, a statistician would give a range of likely values for this number. Statisticians do this for two reasons:</p>
<p>All good statisticians know sample results vary from sample to sample, so a one-number estimate isn&#8217;t any good.</p>
<p>Statisticians have developed some awfully nice formulas you can use to give a range of likely values, so why not use them?</p>
<p>In this section, you get the general formula for a confidence interval, including the margin of error, and a good look at the common approach to building confidence intervals. I also discuss interpretation and the chance of making an error.</p>
<p><b><i>Getting the basics: The general form of a confidence interval</i></b></p>
<p>The big idea of a confidence interval is coming up with a range of likely values for a population parameter. The <i>confidence level</i> represents the chance that if you repeated your sample-taking over and over, you&#8217;d get a range of likely values that contains the actual population parameter. In other words, it&#8217;s the long-term chance of being correct.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-25.jpg" width="57" height="61" class=""/></p>
<p>The general formula for a confidence interval is the following:</p>
<p>Confidence interval = Sample statistic ± Margin of error</p>
<p>The confidence interval has a certain level of precision (measured by the margin of error). Precision describes how close you expect your results to be to the truth.</p>
<p>For example, you want to know the average amount of time a student at Ohio State University spends listening to music per day, using an MP3 player. The average time for the entire population of OSU students that are MP3-player users is the parameter you&#8217;re looking for. Certain that you can&#8217;t ask every student who uses an MP3 player at OSU this question, you take a random sample of students and find the average from there.</p>
<p>Suppose the average time a student uses an MP3 player per day to listen to music based on a random sample of 1,000 OSU students is 2.5 hours, and the standard deviation is 0.5 hours. Is it right to say that the population of all OSU-student MP3-player owners use their players an average of 2.5 hours per day for music listening? No. You hope and may assume that the average for the whole population is close to 2.5, but it probably isn&#8217;t exact. After all, you&#8217;re only sampling a tiny fraction of the 60,000 member population of all OSU students. The fact is that sample results vary from sample to sample.</p>
<p>What&#8217;s the solution to this problem? The solution is to not only report the average from your sample, but along with it, report some measure of how much you expect that sample average to vary from one sample to the next, with a certain level of confidence. You want to cover your bases, so to speak (at least most of the time). The number that you use to represent this level of precision in your results is called the <i>margin of error</i>. You take your sample average and add and subtract the margin of error (to get that plus-or-minus factor going), which gives you a confidence interval for the average time all OSU students use their MP3 players.</p>
<p><b><i>Finding the confidence interval for a population mean</i></b></p>
<p>The sample statistic part of the confidence-interval formula is fairly straightforward. If you want to estimate the population mean, you use the sample mean. If you want to estimate the population proportion, use the sample proportion. If you want to find the difference of two population means, take two samples, find their sample means, and subtract them.</p>
<p>In the case of the population mean, you use the sample mean to estimate it. The sample mean has a standard error of \(\sigma/\sqrt{n}\). In this formula, you can see the population standard deviation (σ) and the sample size (<i>n</i>).</p>
<p>If you think about it though, why would you know the standard deviation of the population, σ, when you don&#8217;t even know the mean (recall that the mean is what you&#8217;re trying to estimate)? To handle this additional unknown, do what statisticians always do: estimate it and move on. So you estimate σ, the population standard deviation, using (what else?) the standard deviation of the sample, denoted by <i>s</i>. So you replace σ by <i>s</i> in the formula for the standard error of the mean.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-26.jpg" width="52" height="63" class=""/></p>
<p>To estimate the population mean by using a confidence interval when σ is unknown, you use the formula \(\bar{x} \pm t_{n-1} \frac{s}{\sqrt{n}}\). This formula contains the sample standard deviation (<i>s</i>), the sample size (<i>n</i>), and a <i>t</i>-value representing how many standard errors you want to add and subtract to get the confidence you need. To get the margin of error for the mean, you see the standard error, \(s/\sqrt{n}\), is being multiplied by a factor of <i>t</i>. Notice that <i>t</i> has <i>n</i> - 1 as a subscript to indicate which of the myriad <i>t</i>-distributions you use for your confidence interval. The <i>n</i> - 1 is called <i>degrees of freedom</i>, where <i>n</i> is the sample size.</p>
<p>The value of <i>t</i> in this case represents the number of standard errors you add and subtract to or from the sample mean to get the confidence you want. If you want to be 95 percent confident, for example, you add and subtract about two of those standard errors. If you want to be 99.7 percent confident, you add or subtract about three of them. (Table A-1 in the Appendix presents the <i>t</i>-distribution from which you can find <i>t</i>-values for any confidence level you want.)</p>
<p>If you do know the population standard deviation for some reason, you would certainly use it. In that case, you use the corresponding number from the Z-distribution (standard normal distribution) in the confidence interval formula. (The Z-distribution from your intro stat book can give you the numbers you need.) Or if you know σ and have a large sample size, you can simply use the bottom line of the <i>t</i>-distribution, because a <i>t</i>-distribution with a large number of degrees of freedom gives very similar values to the Z-distribution.</p>
<p>For the MP3 player example from the preceding section, a random sample of 1,000 OSU students spends an average of 2.5 hours using their MP3 players to listen to music. The standard deviation is 0.5 hours. Plugging this information into the formula for a confidence interval, you get \(2.5 \pm 1.96 \cdot \frac{0.5}{\sqrt{1{,}000}} = 2.5 \pm 0.03\) hours. You can conclude that OSU MP3-player owners spent an average of between 2.47 and 2.53 hours listening to music on their players. (The value for <i>t</i> in this example came from the last line of Table A-1 in the Appendix, because this line represents the situation where <i>n</i> is large.)</p>
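<p>The arithmetic in the MP3 example can be reproduced with a short Python sketch. The function name <i>mean_confidence_interval</i> is an assumption for illustration, and the multiplier 1.96 is simply the large-sample value quoted in the text.</p>

```python
import math

def mean_confidence_interval(sample_mean, sample_sd, n, multiplier):
    """Return (low, high, margin of error) for a population mean."""
    standard_error = sample_sd / math.sqrt(n)   # s divided by the square root of n
    margin = multiplier * standard_error        # t (or z) times the standard error
    return sample_mean - margin, sample_mean + margin, margin

# Values from the text: mean 2.5 hours, s = 0.5 hours, n = 1,000, multiplier 1.96
low, high, moe = mean_confidence_interval(2.5, 0.5, 1000, 1.96)
print(round(moe, 2), round(low, 2), round(high, 2))   # 0.03 2.47 2.53
```

<p>Swapping in a different multiplier or sample size shows right away how the interval widens or narrows.</p>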
<p><b><i>What changes the margin of error?</i></b></p>
<p>What do you need to know in order to come up with a margin of error? Margin of error, in general, depends on three elements:</p>
<p>The standard deviation of the population, σ (or an estimate of it, denoted by <i>s</i>, the sample standard deviation)</p>
<p>The sample size, <i>n</i></p>
<p>The level of confidence you need</p>
<p>You can see these elements in action in the following formula for margin of error of the sample mean: \(t_{n-1} \cdot \frac{s}{\sqrt{n}}\). Here I assume that σ isn&#8217;t known; \(t_{n-1}\) represents the value on the <i>t</i>-distribution (Table A-1 in the Appendix) with <i>n</i> - 1 degrees of freedom.</p>
<p>Each of these three elements has a major role in determining how large the margin of error will be when you estimate the mean of a population. At times it may seem that different elements work against each other (and they do!), but you can find ways around that. In the following sections, I show how each of the elements of the margin of error formula work separately and together to affect the size of the margin of error.</p>
<p><b><i>The population standard deviation&#8217;s effect on margin of error</i></b></p>
<p>The standard deviation of the population is typically combined with the sample size in the margin of error formula, with the population standard deviation on top of the fraction and the square root of <i>n</i> on the bottom. (In this case, the standard deviation of the population, σ, is estimated by the standard deviation of the sample, <i>s</i>, because σ is typically unknown.)</p>
<p>This combination of standard deviation of the population and sample size is known as the <i>standard error</i> of your statistic. It measures how much the sample statistic deviates from its mean in the long term.</p>
<p>How does the standard deviation of the population (σ) affect margin of error? As the standard deviation of the population (or its estimate, <i>s</i>) gets larger, the margin of error increases, so your range of likely values is wider. That&#8217;s why you typically see the population standard deviation in the numerator of margin of error formulas. The formula for the margin of error for one population mean is an example of this.</p>
<p>Suppose you have two gas stations, one on a busy corner (gas station #1) and one farther off the main drag (gas station #2). You want to estimate the average time between customers at each station. At the busy gas station #1, customers are constantly using the gas pumps, so you basically have no time between customers, and that pattern holds day after day. At gas station #2, customers sometimes come all at once, and sometimes you don&#8217;t see a single person for an hour or more. So the time between customers varies quite a bit.</p>
<p>For which gas station would it be easier to estimate the overall average time between customers as a whole? Gas station #1 has much more consistency, which represents a smaller standard deviation of times between customers. Gas station #2 has much more heterogeneity of times between customers, so that one is harder to get a handle on. That means σ for gas station #1 is smaller than σ for gas station #2.</p>
<p><b><i>Sample size and margin of error</i></b></p>
<p>Sample size affects margin of error in a very intuitive way. Suppose you&#8217;re trying to estimate the average number of pets per household in your city. Which sample size would give you better information: 10 homes or 100 homes? You&#8217;d agree that 100 homes would give more precise information (as long as the data on those 100 homes was collected properly).</p>
<p>If you have more data to base your conclusions on, and that data is collected properly, your results will be more precise. Precision is measured by margin of error; so as the sample size increases, the margin of error of your estimate goes down. That&#8217;s why you typically see an <i>n</i> (sample size) in the denominator of margin of error formulas. In the formula for the margin of error of the sample mean, you can see <i>n</i> in the denominator.</p>
<p>Bigger is only better in terms of sample size if the data is collected properly. That is, you should find no bias in the way the members of the sample were selected or in the way the data was collected on those subjects. If the quality of the data can&#8217;t be maintained with a larger sample size, it does no good to have it.</p>
<p><b><i>Confidence level and margin of error</i></b></p>
<p>The amount of confidence you need to have differs from problem to problem. Suppose you&#8217;re estimating the mean weight that an elevator can hold. You would want to be pretty confident about your results, right? But, if you wanted to estimate the percentage of females that may come to your party on Saturday night, you may not need to be so confident (despite the desperation you see in your single buddies&#8217; eyes). For each problem at hand, you have to address how confident you need to be in your results over the long term, and, of course, more confidence comes with a price in the margin of error formula. This level of confidence in your results over the long term is reflected in a number called the confidence level, reported as a percentage. In general, more confidence requires a wider range of likely values. Ninety-five percent is the most common confidence level statisticians use.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-27.jpg" width="57" height="60" class=""/></p>
<p>Every margin of error is interpreted as plus or minus a certain number of standard errors. The number of standard errors added and subtracted is determined by the confidence level. If you need more confidence, you add and subtract more standard errors. If you need less confidence, you add and subtract fewer standard errors. The number that represents how many standard errors to add and subtract is different from situation to situation. For one population mean, you use a value on the <i>t</i>-distribution, represented by \(t_{n-1}\), where <i>n</i> is the sample size. See Table A-1 in the Appendix.</p>
<p>Here&#8217;s an example. Suppose you have a sample size of 20, and you want to estimate the mean of a population. The number of standard errors you add and subtract is represented by \(t_{n-1}\), which in this case is \(t_{19}\). Suppose your confidence level is 90 percent. To find the value of <i>t</i>, you look at row 19 in the <i>t</i>-distribution table (Table A-1 in the Appendix). The table uses the area to the right, so that area in this case is 0.05. (You get this value because 90 percent is within the confidence interval, so 10 percent is outside of it. Half of that 10 percent lies above the confidence interval, and the other half lies below it.) So look at row 19 and the column headed by the value 0.05. You get the value of <i>t</i> = 1.73. So to be 90 percent confident with a sample size of 20, you need to add and subtract 1.73 standard errors.</p>
<p>Now suppose you want to be 95 percent confident in your results, with the same sample size of <i>n</i> = 20. The area above the interval is now half of 5 percent, which is 2.5 percent or 0.025. Row 19 and column 0.025 in Table A-1 gives you the value of \(t_{19}\) = 2.09. Notice that this value of <i>t</i> is larger than the value of <i>t</i> for 90 percent confidence, because in order to be more confident, you need to go out more standard errors on the <i>t</i>-distribution to cover more possible results.</p>
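<p>If you don&#8217;t have Table A-1 handy, the same <i>t</i>-values can be approximated numerically. The sketch below is one possible approach, not the way the table was built: it integrates the <i>t</i>-distribution&#8217;s density with Simpson&#8217;s rule and then uses bisection to find the point with the desired area to its right. All function names are made up for illustration, and the use of <i>math.gamma</i> limits it to modest sample sizes (it overflows for very large degrees of freedom).</p>

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail_area(x, df, steps=2000):
    """Area to the right of x (for x >= 0), via Simpson's rule on [0, x]."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_value(confidence, n):
    """Approximate t multiplier with n - 1 degrees of freedom."""
    df = n - 1
    alpha = (1 - confidence) / 2       # area above the interval
    lo, hi = 0.0, 100.0
    for _ in range(60):                # bisection on the upper-tail area
        mid = (lo + hi) / 2
        if upper_tail_area(mid, df) > alpha:
            lo = mid                   # too much area to the right: move outward
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_value(0.90, 20), 2))   # 1.73
print(round(t_value(0.95, 20), 2))   # 2.09
```

<p>Both results agree with the row-19 values read from the table in the example above.</p>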
<p><b><i>Large confidence, narrow intervals — just the right size</i></b></p>
<p>A narrow confidence interval is much more desirable than a wide one. For example, if you said that you think the average cost of a new home is $150,000 plus or minus $100,000, that wouldn&#8217;t be helpful at all because this makes your estimate anywhere between $50,000 and $250,000. (Who has an extra hundred grand to throw around?) But you <i>have</i> to be 99 percent confident, so your statistician has to add and subtract more standard errors to get there, which makes the interval that much wider (a downer). She tells you to be happy with 95 percent confidence, but no!</p>
<p>Wait, don&#8217;t panic — you can have your cake and eat it too! If you know you want to have a high level of confidence, but you don&#8217;t want a wide confidence interval, just increase your sample size to meet that level of confidence. The effect of sample size and the effect of confidence level cancel each other out, so you can have a precise (narrow) confidence interval and a high level of confidence at the same time. It all depends on sample size, something you can control (up to the size of your pocketbook of course).</p>
<p>For example, say the standard deviation of the house prices from a previous study is <i>s</i> = $15,000, and you want to be 95 percent confident in your estimate of average house price. Using a large sample size, your value of <i>t</i> (from the last row of Table A-1 in the Appendix) would be 1.96. With a sample of 100 homes, your margin of error would be plus or minus 1.96 times $15,000 divided by the square root of 100, which comes out to $2,940. If this is too large for you but you still want 95 percent confidence, crank up your value of <i>n</i>. If you sample 500 homes, the margin of error decreases to plus or minus 1.96 times $15,000 divided by the square root of 500, which brings you down to $1,314.81.</p>
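<p>Here is the house-price calculation as a small Python sketch, so you can try other sample sizes yourself. The function name <i>margin_of_error</i> is made up for illustration; the standard deviation and the 1.96 multiplier are the ones used in the example.</p>

```python
import math

def margin_of_error(sd, n, multiplier=1.96):
    """Margin of error for a mean: multiplier times sd over the square root of n."""
    return multiplier * sd / math.sqrt(n)

# s = $15,000 from the example; compare the two sample sizes in the text
for n in (100, 500):
    print(n, round(margin_of_error(15000, n), 2))
# 100 2940.0
# 500 1314.81
```

<p>Notice that multiplying the sample size by 5 cuts the margin of error by a factor of only about 2.24 (the square root of 5), so extra precision gets expensive quickly.</p>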
<p>You can actually use a formula to find the sample size you need to meet a desired margin of error. That formula is <i>n</i> = (<i>t</i><sub><i>n</i>&#8211;1</sub> &#215; <i>s</i> / MOE)<sup>2</sup>, where MOE is the desired margin of error (in the same units as <i>s</i>), <i>s</i> is the sample standard deviation, and <i>t</i> is the value on the <i>t</i>-distribution that corresponds with the confidence level you want. Round the result up to the next whole number. (You can use the last line of Table A-1 in the Appendix, which will work fine, assuming that your sample size is well beyond 30.)</p>
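<p>Here is a minimal sketch of that sample-size formula, continuing the house-price numbers. The $1,000 target margin of error is a made-up value for illustration, and <code>sample_size_for_moe</code> is a hypothetical helper name.</p>

```python
import math

def sample_size_for_moe(critical_value, std_dev, moe):
    """Smallest whole n for which critical_value * std_dev / sqrt(n) <= moe."""
    return math.ceil((critical_value * std_dev / moe) ** 2)

# Assumed target: a margin of error of $1,000 at 95 percent confidence,
# with s = $15,000 as in the house-price example.
n_needed = sample_size_for_moe(1.96, 15_000, 1_000)
achieved_moe = 1.96 * 15_000 / math.sqrt(n_needed)

print(f"homes to sample: {n_needed}")
print(f"margin of error at that n: ${achieved_moe:,.2f}")
```

<p>Rounding up (rather than to the nearest whole number) keeps the achieved margin of error at or below the target.</p>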
<p><b><i>Interpreting a confidence interval</i></b></p>
<p>Interpreting a confidence interval involves a couple of subtle but important issues, which I discuss in this section. The big idea is that a <i>confidence interval</i> presents a range of likely values for the population parameter, based on your sample. It includes this range because your sample results are going to vary, and you want to address that. A 95 percent confidence interval, for example, provides a range of likely values for the parameter such that the parameter is included in the interval 95 percent of the time in the long term.</p>
<p>A 95 percent confidence interval doesn&#8217;t mean that your particular confidence interval has a 95 percent chance of capturing the actual value of the parameter; after the sample has been taken, it&#8217;s either in the interval or it isn&#8217;t. A confidence interval represents the long-term chances of capturing the actual value of the population parameter over many different samples.</p>
<p>Suppose a polling organization wants to estimate the percentage of people in the United States who drive a car with more than 100,000 miles on it, and it wants to be 95 percent confident in its results. The organization takes a random sample of 1,200 people and finds that 420 of them (35 percent) drive a much-driven car.</p>
<p>The meaty part of the interpretation lies in the confidence level (in this case, 95 percent). Because the organization took a sample of 1,200 people in the U.S., asked each of them whether his or her car has more than 100,000 miles on it, and made a confidence interval out of it, the polling organization is, in essence, accounting for all of the other samples out there that it could have gotten by building in the margin of error (±3 percent). The organization wants to cover its bases on 95 percent of those other situations, and the ±3 percent satisfies that.</p>
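<p>The ±3 percent can be checked with a quick calculation. The sketch below assumes the usual large-sample formula for a proportion (sample proportion plus or minus 1.96 standard errors); the text doesn&#8217;t spell that formula out, so treat it as my assumption about how the poll&#8217;s margin of error was found.</p>

```python
import math

n = 1200                 # people sampled
p_hat = 420 / n          # sample proportion with a much-driven car
z_star = 1.96            # large-sample value for 95 percent confidence

# Assumed formula: standard error of a sample proportion
std_error = math.sqrt(p_hat * (1 - p_hat) / n)
moe = z_star * std_error
lower, upper = p_hat - moe, p_hat + moe

print(f"sample proportion: {p_hat:.2f}")
print(f"margin of error:   {moe:.3f}")
print(f"95% interval:      ({lower:.3f}, {upper:.3f})")
```

<p>The margin of error comes out a little under 0.03, which rounds to the ±3 percent quoted for the poll.</p>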
<p>Another way of thinking about the confidence interval is to say that if the organization sampled 1,200 people over and over again and made a confidence interval from its results each time, 95 percent of those confidence intervals would be right. (You just have to hope that yours is one of those right results.)</p>
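<p>You can watch this &quot;over and over again&quot; idea play out in a small simulation. The sketch below pretends the true percentage is exactly 35 percent (an assumption made only so there is a known answer to check against), runs 2,000 imaginary polls of 1,200 people each, and counts how many intervals capture the truth. The function name and the seed are arbitrary choices.</p>

```python
import math
import random

def coverage_rate(true_p, n, num_polls, z_star=1.96, seed=0):
    """Share of simulated confidence intervals that contain true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_polls):
        # One simulated poll: count the "yes" answers among n people
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        moe = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - moe <= true_p <= p_hat + moe:
            hits += 1
    return hits / num_polls

rate = coverage_rate(true_p=0.35, n=1200, num_polls=2000)
print(f"share of intervals that captured the true value: {rate:.3f}")
```

<p>The share should land close to 0.95, though not exactly on it, because the simulation itself is subject to chance.</p>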
<p>Using stat notation, you can write confidence levels as 1 &#8211; α. So if you want 95 percent confidence, you write it as 1 &#8211; 0.05. The number that α represents is the chance that your confidence interval is one of the wrong ones. This number, α, is also related to the chance of making a certain kind of error with a hypothesis test, which I explain in the hypothesis-testing section.</p>
<p><b><i>Setting Up and Testing Models</i></b></p>
<p>A <i>model</i> is an equation that attempts to describe how a population behaves. It can be a claim that&#8217;s made about a population parameter; for example, a shipping company might say that its packages are on time 95 percent of the time, or a campus official claims that 75 percent of students live off campus. It is important to test these models to see whether they actually hold up in the population, which you can do by using hypothesis tests.</p>
<p>In this section, you see the big ideas of hypothesis testing that are the basis for the data-analysis techniques in this book. You review and expand on the concepts involved in a hypothesis test, including the hypotheses, the test statistic, and the p-value.</p>
<p><b><i>What do Ho and Ha represent — really?</i></b></p>
<p>The big idea here is that you set up a hypothesis test to see whether your model fits the population, based on your data. In the intro stat course, you tested simple hypotheses — like whether the population mean is equal to ten. At the intermediate statistics level, you get to look at much more sophisticated and relevant models that involve several variables and/or several different populations in a variety of situations. The good news, though, is that the basic ideas from intro stats apply here as well. (If you need a brief refresher before barreling through this section, feel free to flip through your intro stats book or check out my other book <i>Statistics For Dummies </i>[Wiley].)</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-28.jpg" width="52" height="63" class=""/></p>
<p>You use a hypothesis test in situations where you have a certain model in mind, and you want to see whether that model fits your data. Your model may be one that just revolves around the population mean (testing whether that mean is equal to ten, for example). Your model may be testing the slope of a regression line (whether or not it&#8217;s zero, for example, with zero meaning you find no relationship between <i>x</i> and <i>y</i>). You may be trying to use several different variables to predict the marketability of a product, and you believe a model using customer age, price, and shelf location can help predict it, so you need to run one or more hypothesis tests to see whether that model works. (This process is called multiple regression; more info on this in Chapter 5.)</p>
<p>A hypothesis test is made up of two hypotheses:</p>
<p><b>The null hypothesis (Ho): </b>Ho symbolizes the current situation — the one that everyone assumed was true until you got involved.</p>
<p><b>The alternative hypothesis (Ha): </b>Ha represents the alternative model that you want to consider. It stands for the researcher&#8217;s hypothesis, and the burden of proof lies on the researcher to prove it.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-29.jpg" width="57" height="60" class=""/></p>
<p>Ho is the model that&#8217;s on trial. If you get enough evidence against it, you conclude Ha, which is the model you&#8217;re claiming is the right one. If you don&#8217;t get enough evidence against Ho, then you can&#8217;t say that your model (Ha) is the right one.</p>
<p><b><i>Gathering your evidence into a test statistic</i></b></p>
<p>A <i>test statistic</i> is the statistic from your sample, standardized so you can look it up on a table, basically. While each hypothesis test is a little different, the main thought is the same. For whatever model you&#8217;re trying to test, you come up with a statistic that you use to test that model. Take that statistic and standardize it (take the statistic minus its expected value from Ho and divide all that by the standard error). Then look up your test statistic on a table to see where it stands. That table may be the <i>t</i>-table (Table A-1 in the Appendix), it may be the Chi-square table (Table A-3 in the Appendix), or it may be a different table. The type of test you need to use on your data dictates which table you use.</p>
<p>In the case of testing a hypothesis for a population mean, μ, you use the sample mean, <i>x̄</i>, as your statistic. To standardize it, you take <i>x̄</i> and convert it to a value of <i>t</i> by using the formula <i>t</i><sub><i>n</i>&#8211;1</sub> = (<i>x̄</i> &#8211; μ<sub>0</sub>) / (<i>s</i> / √<i>n</i>), where μ<sub>0</sub> is the value in Ho, <i>s</i> is the sample standard deviation, and <i>n</i> is the sample size. This value is your test statistic. You compare your test statistic to the <i>t</i>-distribution (check out Table A-1 in the Appendix).</p>
<p><b><i>Determining strength of evidence with a p-value</i></b></p>
<p>If you want to know whether your data has the brawn to stand up against Ho, you want to figure out the p-value and compare it to a prespecified cutoff, α (typically 0.05). The <i>p-value</i> is a measure of the strength of your evidence against Ho. You can calculate the <i>p</i>-value by doing the following:</p>
<p><b>1.&nbsp;Calculate the test statistic. </b>See the preceding section for more info on this.</p>
<p><b>2.&nbsp;Look up the test statistic on the appropriate table (such as the <i>t</i>-table, Table A-1 in the Appendix).</b></p>
<p><b>3.&nbsp;Find the percentage of values on the table that fall beyond your test statistic. </b>This percentage is the <i>p</i>-value.</p>
<p>Suppose you&#8217;re conducting a hypothesis test and have already decided you will reject Ho at level α = 0.05. You collect your data and find the test statistic (see preceding section). If your test statistic is extremely high or extremely low compared to other values on the table (whatever that table is), then you reject Ho.</p>
<p>For example, say the cutoff value for rejecting Ho at a level α = 0.05 is 1.645, where you&#8217;re testing for the mean of one population. If you get a test statistic of 1.7, you reject Ho. If you get a test statistic of 2.7, you <i>really</i> reject Ho. That is, you have more evidence against Ho with a test statistic of 2.7 than with a test statistic of 1.7. The results for test statistics of 1.7 and 2.7 are what statisticians call <i>marginally significant</i> and <i>highly significant</i> results respectively, to use proper terms.</p>
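<p>To put numbers on &quot;marginally&quot; versus &quot;highly&quot; significant, here is a small sketch that turns those two test statistics into p-values. It assumes a large sample (so the standard normal curve stands in for the last row of the <i>t</i>-table) and an upper-tailed test, which is what the 1.645 cutoff suggests.</p>

```python
from statistics import NormalDist

def upper_tail_p_value(test_statistic):
    """Area under the standard normal curve beyond the test statistic."""
    return 1 - NormalDist().cdf(test_statistic)

alpha = 0.05
cutoff = NormalDist().inv_cdf(1 - alpha)   # about 1.645 for an upper-tailed test
print(f"cutoff for alpha = {alpha}: {cutoff:.3f}")

for stat in (1.7, 2.7):
    p = upper_tail_p_value(stat)
    decision = "reject Ho" if p < alpha else "fail to reject Ho"
    print(f"test statistic {stat}: p-value = {p:.4f} -> {decision}")
```

<p>Both results fall below 0.05, but the p-value for 1.7 only just squeaks under the cutoff, while the one for 2.7 is more than ten times smaller.</p>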
<p>Your friend α is the cutoff for your p-value, and the star of this chapter. (α is typically set at 0.05, sometimes 0.10.) If your p-value is less than your predetermined value of α, reject Ho, because you have sufficient evidence against it. If your p-value is greater than or equal to α, you can&#8217;t reject Ho.</p>
<p>For example, if your <i>p</i>-value is 0.002, then your test statistic is so far from what Ho predicts that, if Ho were true, you would see a result this extreme only 2 times out of 1,000. So, you conclude that Ho is very likely to be false. However, if your p-value turns out to be 0.30, then a result like yours happens 30 percent of the time when Ho is true, so you see no red flags there, and you can&#8217;t reject Ho. You don&#8217;t have enough evidence against it. It doesn&#8217;t mean Ho is true, but you don&#8217;t have the evidence to say it&#8217;s false (a subtle, but important, difference).</p>
<p>When I compare the p-value to α (the cutoff value), I like to think of a football analogy, assuming that Ho is &quot;the opposing team can&#8217;t make a touchdown.&quot; The burden is on the other team to show enough evidence to reject Ho. Now, imagine that their running back makes a touchdown by pushing the ball just barely over the goal line, so close that his team needs to have a referee review the film before calling it a touchdown. This situation is equivalent to rejecting Ho with a p-value just below your prespecified value of α = 0.05. In this case, the p-value is close to the borderline, say 0.045. But, if their team makes a touchdown by catching a pass deep in the end zone, no one has any doubt about the result because the ball was obviously past the goal line, which is equivalent to the p-value being very small, say something like 0.001. The opposing team&#8217;s showing a lot of evidence against Ho (and your team could be in a lot of trouble).</p>
<p><b><i>Deconstructing Type I and Type II errors</i></b></p>
<p>Any technique you use in statistics to make a conclusion about a population based on a sample of data has the chance of making an error. The errors I am talking about, Type I and Type II errors, are due to random chance.</p>
<p>For example, you could flip a fair coin ten times and get all heads, making you think that the coin isn&#8217;t fair at all. This thinking would result in an error, because the coin actually was fair, but the data just wasn&#8217;t confirming that due to chance. On the other hand, another coin may be unfair, and, just by chance, you flip it ten times and get exactly five heads, which makes you think that particular coin is equally balanced and doesn&#8217;t present any problem. (This tells you strange things can happen, especially when the sample size is small.)</p>
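<p>Just how strange are those two coin results? A quick binomial calculation gives a feel for it. The fair coin&#8217;s chance of heads is 0.5 by definition; the unfair coin&#8217;s chance of heads (0.7) is a value I made up for illustration, because the example doesn&#8217;t say how unfair that coin is.</p>

```python
from math import comb

def binomial_pmf(k, n, p):
    """Chance of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Fair coin: ten heads in ten flips
p_all_heads_fair = binomial_pmf(10, 10, 0.5)

# Hypothetical unfair coin with P(heads) = 0.7: exactly five heads in ten flips
p_five_heads_unfair = binomial_pmf(5, 10, 0.7)

print(f"fair coin, 10 heads in 10 flips:   {p_all_heads_fair:.6f}")
print(f"unfair coin, 5 heads in 10 flips:  {p_five_heads_unfair:.4f}")
```

<p>Under these assumptions the fair coin fools you about once in a thousand tries, while the unfair coin looks perfectly balanced roughly one time in ten. With only ten flips, missing a real problem is much easier than raising a false alarm.</p>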
<p>The way you set up your test can help to reduce these kinds of errors, but they are always out there. As a data analyst, you need to know how to measure and understand the impact of the errors that can occur with a hypothesis test and what you can do to possibly make those errors smaller. In the following sections, I show you how you can do just that.</p>
<p><b><i>Making false alarms with Type I errors</i></b></p>
<p>A Type I error represents the situation where the coin was actually fair (using the example from the preceding section), but your data led you to conclude that it wasn&#8217;t, just by chance. I think of a Type I error as a false alarm: You blew the whistle when you shouldn&#8217;t have.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-30.jpg" width="61" height="60" class=""/></p>
<p>To include a definition that makes all those stat experts happy, the probability of a Type I error is the conditional probability of rejecting Ho, given that Ho is true.</p>
<p>The chance of making a Type I error is equal to α, which is predetermined before you begin collecting your data. This α is the same α that represents the chance of missing the boat in a confidence interval. It makes some sense that these two probabilities are both equal, because the probability of rejecting Ho when you shouldn&#8217;t (Type I error) is the same as the chance that the true population parameter falls out of the range of likely values when it shouldn&#8217;t. That chance is α.</p>
<p>Say someone claims that the mean time to deliver packages for a company is 3.0 days (so Ho is μ = 3.0), but you believe it&#8217;s not equal to that (so Ha is μ ≠ 3.0). Your alpha level is 0.05, and because you have a two-sided test, this means you have 0.025 on each side. Your sample of 100 packages has a mean of 3.5 days with a standard deviation of 1.5 days. You find the test statistic <i>t</i><sub><i>n</i>&#8211;1</sub> = (<i>x̄</i> &#8211; μ<sub>0</sub>) / (<i>s</i> / √<i>n</i>) = (3.5 &#8211; 3.0) / (1.5 / √100), which equals 3.33. This value falls beyond 1.96 (the value on the last row and the 0.025 column of the <i>t</i>-distribution, Table A-1 in the Appendix). So you don&#8217;t think 3.0 is a likely value for the mean time of delivery, over all possible packages, and you reject Ho. Your data led you to that decision and you stick to it.</p>
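<p>The package-delivery test can be sketched in code as follows. The numbers come straight from the example; using the standard normal curve for the cutoff and p-value is my simplification, matching the &quot;last row of the table&quot; shortcut for large samples.</p>

```python
import math
from statistics import NormalDist

x_bar = 3.5    # sample mean delivery time (days)
mu_0 = 3.0     # value claimed in Ho
s = 1.5        # sample standard deviation (days)
n = 100        # packages sampled
alpha = 0.05

# Standardize: (statistic - value in Ho) / standard error
test_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided test: alpha is split into 0.025 in each tail
cutoff = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(test_stat)))
reject = abs(test_stat) > cutoff

print(f"test statistic: {test_stat:.2f}")
print(f"cutoff:         {cutoff:.2f}")
print(f"p-value:        {p_value:.5f}")
print("decision:      ", "reject Ho" if reject else "fail to reject Ho")
```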
<p>But suppose your sample just by chance contained some longer than normal delivery times, and that in reality, the company&#8217;s claim is right. You just made a Type I error. You made a false alarm about the company&#8217;s claim.</p>
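<p>To see how often this kind of false alarm happens, you can simulate a world where the company&#8217;s claim really is true. The sketch below assumes delivery times follow a normal distribution with mean 3.0 and standard deviation 1.5 (the normal shape is my assumption; the text doesn&#8217;t say), draws 2,000 samples of 100 packages, and counts how often the test rejects Ho anyway.</p>

```python
import math
import random
from statistics import mean, stdev

def false_alarm_rate(mu_0=3.0, sigma=1.5, n=100, cutoff=1.96,
                     trials=2000, seed=1):
    """Share of samples that reject Ho even though Ho is true."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(trials):
        # The true mean equals mu_0, so every rejection is a Type I error
        sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
        t = (mean(sample) - mu_0) / (stdev(sample) / math.sqrt(n))
        if abs(t) > cutoff:
            false_alarms += 1
    return false_alarms / trials

rate = false_alarm_rate()
print(f"share of samples that wrongly rejected Ho: {rate:.3f}")
```

<p>The share should come out near 0.05, the α level built into the 1.96 cutoff.</p>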
<p>To reduce the chance of a Type I error, reduce your value of α. However, I wouldn&#8217;t recommend reducing α too far. On the positive side, this reduction makes it harder to reject Ho, because you need more evidence in your data to do so. On the negative side, by reducing your chance of a Type I error, you increase the chance of another type of error: the Type II error. To tackle Type II errors, keep reading!</p>
<p><b><i>Missing an opportunity with a Type II error</i></b></p>
<p>A Type II error represents the situation where (continuing with the coin example) the coin was actually unfair, but your data didn&#8217;t have enough evidence to catch it, just by chance. You can think of a Type II error as a missed opportunity: you didn&#8217;t blow the whistle when you should have. In statistical terms, the probability of a Type II error is the conditional probability of not rejecting Ho, given that Ho is false. I call it a missed opportunity, because you were supposed to be able to find a problem with Ho and reject it, but you didn&#8217;t.</p>
<p>The chance of making a Type II error depends on a couple of things:</p>
<p><b>Sample size: </b>If you have more data, you&#8217;re less likely to miss something that&#8217;s going on. For example, if a coin actually is unfair (and you don&#8217;t know it), flipping the coin only ten times may not reveal the problem, because results can go all over the place when the sample size is small. But if you flip the coin 1,000 times, you have a good chance of seeing a pattern that favors heads over tails or vice versa.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-31.jpg" width="27" height="15" class=""/><img src="/wp-content/uploads/intermediate statistics for dummies-32.jpg" width="52" height="63" class=""/></p>
<p><b>Actual value of the parameter: </b>A Type II error is also related to how big the problem is that you&#8217;re trying to uncover. For example, suppose a company claims that the average delivery time for packages is 3.5 days. If the actual average delivery time is 5 days, you won&#8217;t have a very hard time detecting that with your sample (even a small sample). Evidence will mount up fast for rejecting Ho, which is exactly what you&#8217;re supposed to do in this situation. But if the actual average delivery time is 4.0 days, you have to do more work to actually detect the problem. Note that you never do know the actual value of a parameter, but you want to protect yourself against the different possibilities, which is why you consider them.</p>
<p>To reduce the chance of a Type II error, take a larger sample size. A greater sample size makes it easier to reject Ho when Ho is false, while the chance of a Type I error stays at whatever α you chose. For a fixed sample size, however, Type I and Type II errors sit on opposite ends of a seesaw: as one goes up, the other goes down. To try to meet in the middle, choose a large sample size (the bigger, the better; see Figures 3-1 and 3-2) and a small α level (0.05 or less) for your hypothesis test.</p>
<p><b><i>Getting empowered by the power of a hypothesis test</i></b></p>
<p>Type II errors (see preceding section) show the downside of a hypothesis test. Statisticians, despite what many may think, actually try to look on the bright side once in a while, and this case is one of those times. Instead of looking at the chance of <i>missing</i> a difference from Ho that actually is there, you can look at the chance of <i>detecting</i> a difference that really is there. This detection is called the <i>power of a hypothesis test</i>.</p>
<p>The power of a hypothesis test is one minus the probability of making a Type II error. So <i>power</i> is a number between 0 and 1 that represents the chance that you reject Ho when Ho is false. (You can even sing about it &quot;If Ho is false and you know it, clap your hands. . . .&quot;) Remember that power (just like Type II errors) depends on two elements: the sample size and the actual value of the parameter (see the preceding section for a description of these elements).</p>
<p>In the following sections, you discover what <i>power</i> means in statistics (not being one of the big wigs, mind you); you also find out how to quantify power by using a power curve.</p>
<p><b><i>Quantifying power with a power curve</i></b></p>
<p>The specific calculations for the power of a hypothesis test are beyond the scope of this book (so, take that sigh of relief), but computer programs and graphs are available online to show you what the power is for different hypothesis tests and various sample sizes (just type &quot;power curve for the [blah blah blah] test&quot; into an Internet search engine). These graphs are called <i>power curves</i> for a hypothesis test. A power curve gives you an idea of how much of a difference from Ho you can detect with the sample size that you have. Because the precision of your test statistic increases as your sample size increases, sample size is directly related to power. But power also depends on how much of a difference from Ho you&#8217;re trying to detect. For example, if a package delivery company claims that its packages arrive in 2 days or less, do you want to blow the whistle if it&#8217;s actually 2.1 days? Or wait until it&#8217;s 3 days? You need a much larger sample size to detect the 2.1-days situation versus the 3-days situation just because of the precision level needed.</p>
<p>In Figure 3-1, you can see the power curve for a particular test of Ho: μ = 0 versus Ha: μ &gt; 0. You can assume that σ (the standard deviation of the population) is equal to two (I give you this value in each problem) and doesn&#8217;t change. I set the sample size at ten throughout.</p>
<p>The horizontal (<i>x</i>) axis on the power curve shows a range of actual values of μ. For example, you hypothesize that μ is equal to 0, but it may actually be 0.5, 1.0, 2.0, 3.0, or any other possible value. If μ equals 0, then Ho is true, and the chance of detecting this (rejecting Ho) is equal to 0.05, the set value of α. You work from that baseline. So, on the graph in Figure 3-1, when <i>x</i> = 0, you get a <i>y</i>-value of 0.05.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-33.png" width="359" height="160" class=""/></p>
<p>Suppose that μ is actually 0.5, not 0, as you hypothesized. A computer tells you that the chance of rejecting Ho (what you&#8217;re supposed to do here) is 0.197 ≈ 0.20, which is the power. So, you have about a 20 percent chance of detecting this difference with a sample size of ten. As you move to the right, away from zero on the horizontal (<i>x</i>) axis, you can see that the power goes up, and the <i>y</i>-values get closer and closer to 1.0.</p>
<p>For example, if the actual value of μ is 1.0, the difference from 0 is easier to detect than if it&#8217;s 0.50. In fact, the power at 1.0 is equal to 0.475 ≈ 0.48, so you have almost a 50 percent chance of catching the difference from Ho in this case. And as the values of the mean increase, the power gets closer and closer to 1.0. Power never reaches 1.0, because statistics can never prove anything with 100 percent accuracy. But you can get close to 1.0 if the actual value is far enough from your hypothesis.</p>
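<p>One way to convince yourself of these power values is to simulate them. The sketch below assumes normally distributed data and a test that treats σ = 2 as known (so the cutoff comes from the standard normal curve); those are my assumptions about the setup behind Figure 3-1. It draws many samples of size ten from a population whose mean really is 0.5 or 1.0 and counts how often Ho: μ = 0 gets rejected.</p>

```python
import math
import random
from statistics import NormalDist, mean

def simulated_power(actual_mu, n=10, mu_0=0.0, sigma=2.0, alpha=0.05,
                    trials=5000, seed=2):
    """Share of simulated samples that reject Ho: mu = mu_0 in favor of mu > mu_0."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha)   # upper-tailed cutoff, about 1.645
    std_error = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(actual_mu, sigma) for _ in range(n)]
        z = (mean(sample) - mu_0) / std_error
        if z > cutoff:
            rejections += 1
    return rejections / trials

power_at_half = simulated_power(0.5)
power_at_one = simulated_power(1.0)
print(f"simulated power when mu is 0.5: {power_at_half:.3f}")
print(f"simulated power when mu is 1.0: {power_at_one:.3f}")
```

<p>The simulated shares should land near 0.20 and 0.48, give or take a little simulation noise.</p>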
<p><b><i>Controlling the sample size</i></b></p>
<p>You don&#8217;t have any control over what the actual value of the parameter is, though, because that number is unknown. So what do you have control over? The sample size. As the sample size increases, it becomes easier to detect a real difference from Ho.</p>
<p>Figure 3-2 shows the power curve with the same numbers as Figure 3-1, except for the sample size (<i>n</i>), which is 100 instead of 10. Notice that the curve increases much more quickly and approaches 1.0 when the actual mean is 1.0, compared to your hypothesis of 0. You want to see this kind of curve: one that moves up quickly toward the value of 1.0, while the actual values of the parameter increase on the <i>x</i>-axis.</p>
<p><img src="/wp-content/uploads/intermediate statistics for dummies-34.png" width="248" height="133" class=""/></p>
<p><b>Figure 3-2:</b> Power curve for Ho: μ = 0 versus Ha: μ &gt; 0, for <i>n</i> = 100 and σ = 2. The horizontal axis shows the actual value of the parameter (0.5 to 3.0); the vertical axis shows power.</p>
<p>If you compare the power of your test when μ is 1.0 for the <i>n</i> = 10 situation (in Figure 3-1) versus the <i>n</i> = 100 situation (in Figure 3-2), you see that the power increases from 0.475 to more than 0.999. Table 3-1 shows the different values of power for the <i>n</i> = 10 case versus the <i>n</i> = 100 case, when you test Ho: μ = 0 versus Ha: μ &gt; 0, assuming a value of σ = 2.</p>
<table class=msonormaltable border=1 cellpadding=0 style='mso-cellspacing:1.5pt; mso-yfti-tbllook:1184' frame=box rules=all>
<tr>
<td colspan="3">
<p><b>Table 3-1: Comparing the Values of Power for <i>n</i> = 10 versus <i>n</i> = 100 (Ho is μ = 0)</b></p>
</td>
</tr>
<tr>
<td>
<p><b><i>Actual Value of μ</i></b></p>
</td>
<td>
<p><b><i>Power when n = 10</i></b></p>
</td>
<td>
<p><b><i>Power when n = 100</i></b></p>
</td>
</tr>
<tr>
<td><p>0.00</p></td>
<td><p>0.050 ≈ 0.05</p></td>
<td><p>0.050 ≈ 0.05</p></td>
</tr>
<tr>
<td><p>0.50</p></td>
<td><p>0.197 ≈ 0.20</p></td>
<td><p>0.804 ≈ 0.80</p></td>
</tr>
<tr>
<td><p>1.00</p></td>
<td><p>0.475 ≈ 0.48</p></td>
<td><p>Approx. 1.0</p></td>
</tr>
<tr>
<td><p>1.50</p></td>
<td><p>0.766 ≈ 0.77</p></td>
<td><p>Approx. 1.0</p></td>
</tr>
<tr>
<td><p>2.00</p></td>
<td><p>0.935 ≈ 0.94</p></td>
<td><p>Approx. 1.0</p></td>
</tr>
<tr>
<td><p>3.00</p></td>
<td><p>0.999 ≈ 1.0</p></td>
<td><p>Approx. 1.0</p></td>
</tr>
</table>
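<p>The entries in Table 3-1 can be approximated with a short calculation. This sketch assumes an upper-tailed test with σ = 2 treated as known, so the power comes from the standard normal curve; the function name <code>power_upper_tail</code> is hypothetical. Under that assumption the results agree with the table to within about 0.001.</p>

```python
import math
from statistics import NormalDist

def power_upper_tail(actual_mu, n, mu_0=0.0, sigma=2.0, alpha=0.05):
    """Chance of rejecting Ho: mu = mu_0 (vs. mu > mu_0) when the mean is actual_mu."""
    z = NormalDist()
    cutoff = z.inv_cdf(1 - alpha)                        # about 1.645
    shift = (actual_mu - mu_0) / (sigma / math.sqrt(n))  # difference in standard errors
    return 1 - z.cdf(cutoff - shift)

actual_values = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
power_table = [
    (mu, power_upper_tail(mu, n=10), power_upper_tail(mu, n=100))
    for mu in actual_values
]

print("actual mu   n = 10   n = 100")
for mu, p10, p100 in power_table:
    print(f"{mu:9.2f}   {p10:6.3f}   {p100:7.3f}")
```

<p>Notice that the first row is simply α: when Ho is true, the only way to reject it is by false alarm.</p>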
<p>You can find power curves for a variety of hypothesis tests under many different scenarios. Each has the same general look and feel to it: starting at the value of α when Ho is true, increasing in an S-shape as you move from left to right on the <i>x</i>-axis, and finally approaching the value of 1.0 at some point. Power curves with large sample sizes approach 1.0 faster than power curves with low sample sizes.</p>
<p>You can have too much power. For example, if you make the power curve for <i>n</i> = 10,000 and compare it to Figures 3-1 and 3-2, you can find that it&#8217;s practically at 1.0 already for almost any number other than 0.0 for the mean. In other words, the actual mean could be 0.05 and with your hypothesis Ho: μ = 0.00, you would very likely reject Ho, because of the huge sample size you&#8217;ve got. If you zoom in enough, you can always detect something, even if that something makes no practical difference. If the sample size is incredibly large, it can inflate power to the point where you can detect differences from Ho that are smaller than you really want, from a practical standpoint. Beware of surveys and experiments that have what appears to be an excessive sample size (for example, in the tens of thousands). They may be reporting &quot;statistically significant&quot; results that don&#8217;t mean diddly.</p>
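<p>To get a feel for how sample size shrinks the difference you can detect, the sketch below computes the smallest shift from Ho that an upper-tailed test picks up with a given power. The 80 percent power target is my choice for illustration, and σ = 2 and α = 0.05 are carried over from the power-curve example; the function name is hypothetical.</p>

```python
import math
from statistics import NormalDist

def smallest_detectable_shift(n, sigma=2.0, alpha=0.05, power=0.80):
    """Shift from Ho that an upper-tailed z-test detects with the given power."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sigma / math.sqrt(n)

for n in (10, 100, 10_000):
    shift = smallest_detectable_shift(n)
    print(f"n = {n:>6}: detects a shift of about {shift:.3f} with 80% power")
```

<p>With 10,000 observations, a shift of roughly 0.05 is already detectable most of the time, whether or not a difference that small matters to anyone.</p>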
<p><img src="/wp-content/uploads/intermediate statistics for dummies-35.jpg" width="63" height="60" class=""/></p>
<p>The power of a test plays a role in the manufacturing process. Manufacturers often have very strict specifications regarding the size, weight, and/or quality of their products. During the manufacturing process, manufacturers want to be able to detect deviations from these specifications, even small ones, so they must think about how much of a difference from Ho they want to detect, and then figure out the sample size they need in order to detect that difference when it appears. For example, if the candy bar is supposed to weigh 2.0 ounces, the manufacturer may want to blow the whistle if the actual average weight shifts to, say, 2.5 ounces. Statisticians can work backwards from the power they want and find the sample size they need in order to know when to stop the process.</p>
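<p>Here is a rough sketch of what &quot;working backwards&quot; can look like for the candy-bar example. The shift from 2.0 to 2.5 ounces comes from the text; the standard deviation of 0.8 ounces, the 90 percent power target, and the upper-tailed test with known σ are all assumptions I made up to have something to compute with.</p>

```python
import math
from statistics import NormalDist

def power_upper_tail(shift, sigma, n, alpha=0.05):
    """Power of an upper-tailed z-test to detect a given shift from Ho."""
    z = NormalDist()
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - shift / (sigma / math.sqrt(n)))

def sample_size_for_power(shift, sigma, alpha=0.05, power=0.90):
    """Smallest n whose upper-tailed z-test reaches the target power."""
    z = NormalDist()
    raw = ((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sigma / shift) ** 2
    return math.ceil(raw)

shift = 2.5 - 2.0   # ounces: shift the manufacturer wants to catch
sigma = 0.8         # hypothetical standard deviation of candy-bar weights

n_needed = sample_size_for_power(shift, sigma)
print(f"candy bars to sample: {n_needed}")
print(f"power at that n:      {power_upper_tail(shift, sigma, n_needed):.3f}")
```

<p>A smaller shift or a noisier process pushes the required sample size up quickly, because both enter the formula squared.</p>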
<p>Medical scientists also think about power when they set up their studies (called clinical trials). Suppose they&#8217;re checking to see whether an antidepressant adversely affects blood pressure (as a side effect of taking the drug). Scientists need to be able to detect small differences in blood pressure, because for some patients, any change in blood pressure is important to note.</p>
</sape_index><!--c715886456-->]]></content:encoded>
			<wfw:commentRss>http://ankar.info/2010/05/15/building-confidence-and-testing-models/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

