In This Chapter

^ Making your goal concise

^ Discovering how to hypnotise yourself

^ Determining when self-hypnosis is appropriate

^ Writing your own hypnotherapy scripts

^ Practising self-hypnosis

There’s something absolutely fascinating about the first time you successfully hypnotise yourself. You feel that you’ve done something that you previously thought impossible. Then there’s the satisfaction of achieving your goal. I’ll never forget the first time that I (Mike) successfully hypnotised myself. I had had writer’s block on a project for several weeks, and desperately needed to overcome it because a deadline was imminent. I was alone in a hotel room and, after ten minutes of self-hypnosis, I immediately began writing – pages and pages!

Most people who learn self-hypnosis from books start by reading generalised scripts. You may have already read some scripts elsewhere in this book. After reading this chapter, you will know how to customise scripts, and even create your own, so that you can address your specific needs.

So, get ready for a step-by-step explanation of what you need to do to hypnotise yourself.

Connecting to Your Unconscious

Self-hypnosis is a relatively quick and marvellous way to access your unconscious mind, which is where the actual changes take hold in your life.

Your unconscious is the non-emotional part of your mind, which is simply about the business of preserving and protecting you. The intention of your unconscious mind is always to make your life better in some way. It is literally open to suggestion!

But to get there from here, so to speak, and enter self-hypnosis, you must bypass the critical factor of your analytical conscious mind. This simple process, a skill actually, becomes very easy with practice. The moment the critical factor is pushed aside, voilà! You’re in direct communication with your unconscious mind.

Setting Your Goal

The first step in self-hypnosis involves forming a clear understanding of your goal. Though you may have many goals, it is best to address them one at a time. So make a list if you want to, but focus on only one at a time. Give each goal the exclusive time and attention it deserves.

Think about what you want to achieve or change and state your goal in a single sentence. Making your goal concise and to the point lets you repeat it and remember it easily. Your unconscious mind can then absorb the goal and begin to help you find your own ways of achieving the outcome you want.

Keep your goals positive and use the present tense.

Some examples of single sentence goals:

^ I am calm and peaceful when I lie down at night, and drift to sleep easily.

^ I remember all I’ve studied when I take an exam, and can recall the information at will.

^ I have greater public speaking confidence because I am knowledgeable, and the audience wants to hear what I have to say.

^ I honour my health and vitality by selecting foods that are nutritious.

One method helps to clarify goals brilliantly – the magic wand question. It goes like this:

If you had a magic wand and could change one thing about yourself, and one thing about your immediate world, what would be different after you used your wand?

This question immediately forces:

^ A concise focus on the problem.

^ An awareness of how your perception affects your reality.

^ An ability to focus and visualise the change you want to make.

When I ask clients the magic wand question, I can see a slight physical change that indicates they are entering a light trance state.

Use the magic wand question to formulate your goal for self-hypnosis.

Hypnotising Yourself

If you understand the concept of trance, which we explain in Chapter 1, you already have a firm grasp on self-hypnosis. And, after you know what trance feels like, you can easily hypnotise yourself.

The basic steps for self-hypnosis are similar to those you undergo in a normal session with a hypnotherapist, except that you are the hypnotherapist! (Chapter 13 goes step by step through an initial hypnotherapy session.)

The following sections cover what you go through when you experience self-hypnosis.

A couple of tips can help you establish your self-hypnosis practice:

^ Establish a place to practise: Choose a place where you can be completely comfortable, whether sitting in a chair or lying down. The environment you choose should be free of distractions and potential interruptions. Your skin becomes sensitive when you are in trance, so be sure that the room temperature is just right (better to be a little warm than too cool). Though not necessary, some people prefer soft lighting, soothing music, or even a scented candle. Self-hypnosis is your gift to you. Whenever possible, indulge yourself in total comfort!

^ Set a time limit: Mentally give yourself the following suggestion: ‘Exactly 10 (or 15) minutes from now, my eyelids open automatically and I feel calm, rested, and refreshed. I am ready to take on the rest of the day, or I am ready to drift off to sleep’ (whichever you prefer). Don’t worry about looking at a clock. Your unconscious mind knows how to measure time and will, with practice, reliably disengage you from hypnosis in the precise time that you allotted.

Dealing with distracting thoughts

Don’t be discouraged if distracting or unwelcome thoughts float into your mind during your first few self-hypnosis sessions. Your thoughts are accustomed to bombarding you throughout the day in a rather undisciplined, sometimes chaotic, way. With self-hypnosis, you’re retraining your mind so that you can choose which thoughts to attend to and which thoughts to discard.

A helpful technique is to create a mental inbox and outbox. If the thought that crosses your mind is truly important, simply put it into your mental inbox to attend to later. Don’t worry, you won’t forget about it. If the thought is frivolous, mentally put it in your outbox where you don’t have to consider it again.

Inducing your own trance

An induction is the method used to put yourself into trance. In self-hypnosis you induce yourself into trance.

You can choose from a variety of induction techniques, many of which you can easily teach yourself. The next subsections offer some induction methods you can try.

As you read more about hypnosis, you may come across induction scripts that use generic phrases that sound harmless, but in certain cases are to be avoided, such as:

^ If you’re obese or worried about your weight, avoid the word ‘heavy’. Don’t think to yourself ‘I am feeling heavy and tired’, just ‘I am feeling tired’.

^ If you are depressed, avoid the word ‘down’. Don’t say ‘I will sink down into trance’, but ‘I will go into a pleasant trance’.

Progressive relaxation

Using the progressive relaxation induction technique, you focus on gradually relaxing muscles over every part of your body. This relaxation helps you to go into trance.

1. Begin by simply closing your eyes and taking a few deep breaths.

Imagine that with each breath you are exhaling bodily tension, which will help you to increasingly relax.

2. Start progressively relaxing all your muscles, from head to toe, or toe to head, whichever you prefer.

Give yourself repeated suggestions to relax all your muscle groups.

Keep in mind that it is not an anatomy test. Forgetting to relax a specific body part – your knees, or elbows, or toes, or whatever – isn’t crucial.

Your unconscious will fill in any parts you forget, if you think of your whole body being relaxed, after taking yourself through this script.

You can use phrases such as ‘Let them relax’ and ‘Let them go limp and slack’. Deliver these quite neutral phrases in a very permissive tone.

The nearby sidebar, ‘Sampling progressive relaxation’, offers a script to follow.

The goal of progressive relaxation is to create an overall feeling of comfort from head to toe.

Sampling progressive relaxation

This sidebar contains an example of a progressive relaxation script. You don’t have to use it word for word – feel free to adapt it in any way that works well for you.

‘I’m now letting go of all unnecessary tension in my body… relaxing all my muscles from the top of my head to the bottom of my feet… letting them go nice and relaxed… my head and face are now going nice and slack… my forehead and eyes and eyelids… my cheeks, mouth and jaw muscles… it’s a wonderful feeling as I let my face totally relax… I can actually feel the skin settling, smoothing out… I’m just letting it happen… unclenching my teeth and relaxing my tongue… the more I physically relax, the more I can mentally relax… my neck and shoulder muscles now… becoming completely relaxed… the tops of my arms… letting all tensions drain away… down through my elbows… into my forearms… down through my wrists and into my hands… right the way down into the very tips of my fingers and thumbs… just letting all those muscles go nice and relaxed… even my breathing is becoming slower as I relax… more and more… all tension in my chest area is leaving my body… relaxing my stomach muscles… relaxing my back muscles… down to my waist… my abdomen… down to my buttocks and my thigh muscles… becoming nice and relaxed… so are my knees… down through to my shins and calves… all becoming nice and loose… allowing all those areas to relax and let go… down on through to my ankles, my feet… into the very tips of my toes… all the muscles of my body beautifully relaxed and easy…’

Eye fixation technique

Possibly the simplest of all self-hypnosis methods is to choose a spot ahead of you – a picture on a nearby wall, for example – and stare at it until your eyes tire. When your eyes tire, relax them by closing them and let your whole body relax too. Then allow yourself to slow your breathing down, and go into a nice relaxed trance state.

Deepening your trance

Once you achieve a light trance state, you need to deepen and maintain the trance. Following is a very easy deepener you can use:

The ten-to-one countdown is probably one of the simplest ways of deepening trance for beginners once a light trance has begun. Basically, you count down from ten to one and tell yourself that with each number you’ll become more relaxed, both physically and mentally, and go deeper into trance. The nearby sidebar, ‘Counting down’, has a sample script.

Counting down

This is a sample script for counting down to deepen your trance:

‘In a few moments’ time… I will count down from ten to one… with each descending number… between ten and one… I’ll become one-tenth more relaxed… ten per cent more relaxed… with each descending number… and each descending number… will help me to go… one-tenth deeper… into a wonderful hypnotic state of relaxation… a light trance state… this will become deeper and deeper… as I count on… and, while I am counting… I will begin to experience a very pleasant… physical sensation… as if floating down… into an ever-deepening state… of physical and mental relaxation… that will become deeper… and deeper… as I count on… Ready… 10… 9… deeper, deeper… 8… 7… 6… drifting down… ever more deeply relaxed… 5… 4… 3… deeper and deeper still… 2… 1… and all the way, deep down relaxed…’

Alternative deepeners may involve:

^ Imagining yourself in a relaxing scene.

^ Imagining walking down steps, and at the bottom is a comfortable place to rest.

^ Making a fist, and as you release the fist, imagining a soothing feeling being released throughout your entire body.

You may now even begin to invent your own deepeners!

Trusting your unconscious mind to carry out your suggestion

When you’re in a deepened trance state, you start using the goal statement you devised for your self-hypnosis session. (See the ‘Setting Your Goal’ section at the start of this chapter.) Now you realise why we tell you to state the goal in a single sentence. When in the trance state, you want to minimise words to allow your unconscious – the non-verbal part of you – to work its magic.

At this stage, just remember your single sentence goal statement. Then simply let go. Say the goal statement a few times before starting the trance, allowing it to sink into your mind. Then let it pass from your conscious mind, trusting that you have handed it over to your unconscious mind, and that this wise part of you will now solve the problem.

This is the focal point of self-hypnosis. Don’t just think your goal statement: imagine hearing it, seeing it, and experiencing the change actually occurring.

Use as many of your senses as possible to incorporate your goal into your trance state. If you can visualise yourself having made the changes, that’s even better. The point is to ruminate over your goal and make it as vivid as possible in your imagination. Your unconscious mind will do the work you have given it, if you are clear, focused, and concise on what you want.

Strengthening your ego

Ego strengthening is the icing on the cake after the main therapy. This is where you encourage yourself to feel happier, more confident, and all the other ‘feel good’ statements. Add these after you’ve repeated and imagined your goal statement. It can be a very powerful thing to give your unconscious mind positive messages for a change!

Waking yourself from trance

Although you may not feel it necessary, it is a good idea to count yourself awake, and tell yourself that you’re no longer in trance. This helps you to disconnect from the self-hypnosis experience and return to a fully alert state.

Try counting up from one to ten. Counting up essentially reverses the ten-to-one countdown you use to deepen your trance. Your mind responds to it because it is the opposite of how you entered trance.

You can tell yourself:

‘With each ascending number from one to ten, I will become more awake, and confident that my unconscious mind is already seeking new ways to obtain my goal.’

Using awakening scripts helps you to come out of trance and back into your normal conscious state. These scripts also give you confidence of success.

For a few minutes after awakening from self-hypnosis, you are still in a highly suggestible state. Use that time to reinforce how relaxed and calm you feel, and how pleased you are that your unconscious mind is helping you reach your goal.

Examining the Pros and Cons of Self-Hypnosis

One main difference between this book and others is that although we acknowledge the power of self-hypnosis, we still advocate that serious problems are best dealt with in conjunction with a professional clinical hypnotherapist. In the following sections, we describe when self-hypnosis is and isn’t appropriate.

When self-hypnosis is appropriate

We want to encourage you to enjoy the amazing benefits of self-hypnosis. Even though you may not have access to a professional hypnotherapist, it doesn’t mean that hypnotherapy is out of the question. Self-hypnosis can be an extremely beneficial tool when used appropriately.

Some appropriate goals for self-hypnosis are:

^ Doing homework assigned by your hypnotherapist.

^ Boosting your confidence.

^ Encouraging healthier living and eating choices.

^ Enhancing your creativity.

^ Controlling pain.

^ Lifting your performance in sports, school, the arts, and so on.

Of course you can use self-hypnosis in many other ways, but you may find these suggestions helpful in choosing a goal for your own self-hypnosis.

When self-hypnosis isn’t appropriate

It is important to know the limits of self-hypnosis. You should not attempt to hypnotise yourself in certain situations, and it’s important to be clear on those occasions.

Following are examples of when not to attempt self-hypnosis:

^ If you have a serious mental illness (for example, schizophrenia).

^ If you have issues relating to serious trauma (for example, rape, violence, childhood abuse).

^ If your problems involve relations between you and other people.

^ If you have serious phobias.

In any of these situations, we encourage you to work with a professional hypnotherapist. Why? Because with serious problems, it is very difficult indeed to resolve them alone. A professional hypnotherapist has the expertise to help you to achieve your goals and overcome problems that may have roots outside of your conscious awareness.

Developing Your Own Scripts

Hypnotherapy scripts must be individually tailored to be effective.

One of the most exciting things you can do at this stage is to choose a script and rewrite it so the words and message feel natural to you. We present several sample scripts throughout this book that help you understand how hypnotic suggestions are phrased to help you achieve your therapeutic goals.

Take any script in this book that interests you and re-write it in the language that you use when you think or speak to a close friend. Using your own language and phrasing makes it more likely that your unconscious will absorb the suggestions and start searching for change.

Follow these general guidelines for script writing:

^ Phrase sentences like you breathe – don’t be too wordy, and use short phrases.

^ Aim for the simplest language possible.

^ Avoid using negatives such as ‘no’, ‘never’, ‘not’, and ‘won’t’ – state goals in the positive.

^ Avoid being too specific about how to achieve your goal. Trust your unconscious to find its own solution.

^ Avoid setting deadlines for achieving your goal. Again, trust your internal clock.

^ Be realistic about your goals.

Believe that you will succeed and you will.

Ongoing Self-Hypnosis

How can you best reach your goals? By being true to yourself and discovering the best way that you absorb new information. Hypnosis is a lot like going to school. The difference is that with hypnosis, you are learning a new behaviour.

We are all different, and what works well for one person won’t work as well for another. The following subsections offer tips that may be helpful for you to think about when deciding what works best for you. A great deal of material on self-hypnosis is available, some of it contradictory.

The old saying ‘be true to yourself’ applies here strongly. It is really important to be true to yourself when trying different self-hypnosis scripts and techniques. Don’t try a script that doesn’t feel right to you. It just won’t be as effective as one that you really believe in.

Making your hypnosis work

If you want to be really successful, you should:

^ Try to be hypnotised by a hypnotherapist before trying self-hypnosis.

^ Practise self-hypnosis regularly.

^ Set realistic and simple goals.

As with any newly acquired or desired skill, it is very important to persevere with your practice. Praise yourself for practising regularly, and don’t punish yourself if you miss a practice session; just keep persevering!


Establishing a routine

If your hypnotist teaches you self-hypnosis, she will give you direct advice about how often to practise, and tips on how to get the most from your practice.

As a beginner, you first need to prove that you can induce trance. At this beginning stage, you can keep the hypnosis brief – maybe two or three times a day for 10 to 15 minutes at a time.

Just before bed, and after waking up, are excellent times to practise self-hypnosis.

As you get better at hypnosis, you will get quicker and be able to hypnotise yourself in seconds. But be patient: this takes a good deal of practice.

At the risk of sounding obvious, the secret is to practise as much as you can without overdoing things. If you practise too often you don’t give your unconscious enough time to process your previous self-hypnosis session. You must trust that even if you don’t get instant results, your unconscious self is working on your goal on its own timetable.

Regular practice, over a period of time, is more effective than huge gaps of time with no practice and then overdoing it in a single day to compensate.

Improving your effectiveness

The main way to deepen your trance is to read scripts, and to see a variety of approaches to the problem, or goal, that you are trying to work on. The Appendix can point you to books and other resources that offer a broad range of techniques.

A technique called pseudo orientation in time helps you visualise yourself in the near future, having achieved your goal. Using a hypnotic trance to see yourself in the future without the problem greatly increases your chances for success.

To use the pseudo orientation in time technique, you hypnotise yourself to go into the near future, with the change having been made some time ago. Then you simply experience the feelings and changes made after achieving your goal. You then return to the present with these feelings of change embedded in your unconscious. (This technique is the bedrock of much hypnotherapy, and was one of the most frequent components of the work of Milton Erickson.)

Practising seeing your problem in the past under hypnosis activates your unconscious to move you towards the solutions and goals you want to achieve. This is one of the most tangible proofs that your hypnosis is working – when you find that you have suddenly solved your problem, without actually mapping out a conscious strategy to do so!

In This Chapter

^ Extending the t-test for comparing two means by using ANOVA

^ Discovering and utilizing the ANOVA process

^ Carrying out an F-test

^ Navigating the ANOVA table

One of the most commonly used statistical techniques at the intermediate level is analysis of variance (affectionately known as ANOVA).

Because the name has the word variance in it, you may think that this technique has something to do with variance, and you would be right. Analysis of variance is all about examining the amount of variability in a y (response) variable and trying to understand where that variability is coming from.

One way that you can use ANOVA is to compare several populations regarding some quantitative variable, y. The populations you want to compare constitute different groups (denoted by an x variable), such as political affiliations, age groups, or different brands of a product. ANOVA is also particularly suitable for situations involving an experiment where you apply certain treatments (x) to subjects, and you measure a response (y).

In this chapter, you start with the t-test for two population means, the precursor to ANOVA. Then you move on to the basic concepts of ANOVA: sums of squares, the F-test, and the ANOVA table. You apply these basics to the one-factor or one-way ANOVA, where you compare the responses based only on one treatment variable. (In Chapter 11, you can see them applied to a two-way ANOVA, which has two treatment variables.)

Comparing Two Means with a t-Test

The two-sample t-test is designed to test whether two population means are different. The conditions for the two-sample t-test are the following:

^ The two populations are independent (in other words, their outcomes don’t affect each other).

^ The response variable (y) is a quantitative variable (meaning that its values represent counts or measurements).

^ The y-values for each population have a normal distribution (however, their means may be different; that is what the t-test determines).

^ The variances of the two normal distributions are equal.

For large sample sizes when you know the variances, you use a Z-test for the two population means. However, a t-test allows you to test two population means when the variances are unknown or the sample sizes are small. This occurs quite often in situations where an experiment is performed and the number of subjects is limited.

Although you have seen t-tests before in your intro stats class, it may be good to review the main ideas. The t-test tests the hypotheses $H_0\colon \mu_1 = \mu_2$ versus $H_a\colon \mu_1$ is $<$, $>$, or $\neq$ $\mu_2$, where the situation dictates which of these hypotheses you use. (Just a note that with ANOVA, you extend this idea to k different means from k different populations, and the only version of $H_a$ of interest is $\neq$.)

To conduct the two-sample t-test, you collect two data sets from the two populations, using two independent samples. To form the test statistic (the t-statistic), you subtract the two sample means and divide by the standard error (a combination of the two standard deviations from the two samples and their sample sizes). You compare the t-statistic to the t-distribution with $n_1 + n_2 - 2$ degrees of freedom and find the p-value.

If the p-value is less than the prespecified $\alpha$ level, say 0.05, you have enough evidence to say the population means are different. (For information on hypothesis tests, see Chapter 3.)

For example, suppose you’re at a watermelon seed spitting contest where contestants each put watermelon seeds in their mouths and spit them as far as they can. Results are measured in inches and are treated with the reverence of the shot-put results at the Olympics. You want to compare the watermelon seed spitting distances of female and male adults. Your data set includes ten people from each group.

You can see the results of the t-test in Figure 9-1. The mean spitting distance for females was 47.8 inches; the mean for males was 56.5 inches. The t-statistic for the difference in the two means (females – males) is t = -2.23, which has a p-value of 0.039 (see last line of Figure 9-1 output). At a level of $\alpha$ = 0.05, this difference is significant (because 0.039 < 0.05). You conclude that males and females differ with respect to their mean watermelon seed spitting distance. And you can say males are likely spitting farther because their sample mean was higher.

Figure 9-1: A t-test comparing mean watermelon seed spitting distances for females versus males.

Two-sample T for females vs males

            N    Mean   StDev   SE Mean
Females    10   47.80    9.02       2.9
Males      10   56.50    8.45       2.7

Difference = mu (females) - mu (males)
Estimate for difference: -8.70000
95% CI for difference: (-16.90914, -0.49086)
T-Test of difference = 0 (vs not =): T-Value = -2.23  P-Value = 0.039  DF = 18
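As a rough check on the output in Figure 9-1, the t-statistic can be rebuilt from the summary statistics alone. The sketch below is only an illustration: it assumes the equal-variance (pooled) form of the test, which is what DF = 18 = 10 + 10 - 2 suggests, and it approximates the two-sided p-value by numerically integrating the t-distribution density with Simpson's rule. The function names are hypothetical, and because the standard deviations in the figure are rounded, the results should match the printed output only to about the digits shown.

```python
import math


def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for the equal-variance two-sample t-test."""
    df = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df


def t_density(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """Approximate P(|T| >= |t|) as 1 minus the area between -|t| and |t|."""
    a, b = -abs(t), abs(t)
    h = (b - a) / steps  # steps must be even for Simpson's rule
    total = t_density(a, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(a + i * h, df)
    return 1 - total * h / 3


# Summary statistics as printed in Figure 9-1 (females first, then males).
t_value, df = pooled_t_statistic(47.80, 9.02, 10, 56.50, 8.45, 10)
p_value = two_sided_p_value(t_value, df)
print(f"t = {t_value:.2f}, df = {df}, two-sided p-value = {p_value:.3f}")
```

In practice a statistics package does this for you; the point here is only to show where the standard error and the degrees of freedom come from.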

Evaluating More Means with ANOVA

Now that you can compare two independent populations inside and out, at some point two populations will not be enough. Suppose you want to compare more than two populations regarding some response variable (y). This idea kicks the t-test up a notch into the territory of ANOVA. The ANOVA procedure is built around a hypothesis test called the F-test, which compares how much the groups differ from each other, compared to how much variability is in each group. In this section, I set up an example of when to use ANOVA and show you the steps involved in the ANOVA process. You can then apply the ANOVA steps to the following example throughout the rest of the chapter.

Spitting seeds: A situation just waiting for ANOVA

Before you can jump into using ANOVA, you must figure out what question you want answered and collect the necessary data.

Suppose you want to compare the watermelon seed spitting distances for four different age groups: 6-8, 9-11, 12-14, and 15-17. The hypotheses for this example are $H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4$ versus $H_a$: At least two of these means are different, where the population means $\mu_i$ represent those from the age groups, respectively. Over the years of this contest, you have collected data on 200 children from each age group, so you have some prior ideas about what the distances typically look like. This year, you have 20 entrants, 5 in each age group. You can see the data from this year, in inches, in Table 9-1.

Table 9-1: Watermelon Seed Spitting Distances for Four Child Age Groups (Measured in Inches)

6-8 Years   9-11 Years   12-14 Years   15-17 Years
38          38           44            44
39          39           43            47
42          40           40            45
40          44           44            45
41          43           45            46

Do you think you see a difference in distances for these age groups based on this data? If you just combined all the data, you would see quite a bit of difference (the range of the combined data goes from 38 inches to 47 inches). Perhaps accounting for which age groups each contestant is in does explain at least some of what’s going on. But don’t stop there. In the next section, you see the official steps you need to do to answer your question.
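One quick way to look for a difference is to summarise each age group separately. The sketch below is a minimal illustration, assuming the values of Table 9-1 are read down each age-group column (five children per group); the variable names are made up for this example.

```python
# Table 9-1, one list per age group (assumed column-by-column reading).
distances = {
    "6-8": [38, 39, 42, 40, 41],
    "9-11": [38, 39, 40, 44, 43],
    "12-14": [44, 43, 40, 44, 45],
    "15-17": [44, 47, 45, 45, 46],
}

# Mean distance within each age group.
group_means = {age: sum(values) / len(values) for age, values in distances.items()}

# Range of all 20 distances once the groups are lumped together.
combined = [d for values in distances.values() for d in values]
combined_range = (min(combined), max(combined))

for age, mean in group_means.items():
    print(f"{age:>5} years: mean = {mean:.1f} inches")
print(f"Combined data run from {combined_range[0]} to {combined_range[1]} inches")
```

If the group means differ by a fair amount relative to the spread inside each group, that is the kind of pattern ANOVA is designed to test formally.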


Walking through the steps of ANOVA

You have decided on the quantitative response variable (y) you want to compare for your k various population (or treatment) means, and you collected a random sample of data from each population. Now you’re ready to conduct ANOVA on your data to see whether the population means are different for your response variable, y.

The characteristic that defines these populations is called the treatment variable, x. Statisticians use the word treatment in this context because one of the biggest uses of ANOVA is for designed experiments where subjects are randomly assigned to treatments, and the responses are compared for the various treatment groups. So statisticians oftentimes use the word treatment even when the study isn’t an experiment, and they’re comparing regular populations. Hey, don’t blame me! I’m just following the proper statistical terminology.

Just to get a feeling for what an ANOVA procedure involves and to give you a quick reference for a later time, here are the general steps in a one-way ANOVA:

1. Check the ANOVA conditions, using the data collected from each of the k populations.

See the next section, "Checking the Conditions," for the specifics on these conditions.

2. Set up the hypotheses $H_0\colon \mu_1 = \mu_2 = \dots = \mu_k$ versus $H_a$: At least two of the population means are different.

Another way to state your alternative hypothesis is by saying $H_a$: At least two of $\mu_1, \mu_2, \dots, \mu_k$ are different.

3. Collect data from k random samples, one from each population.

4. Conduct an F-test on the data from step three, using the hypotheses from step two, and find the p-value.

See the section "Doing the F-test" later in this chapter for these instructions.

5. Make your conclusions: If you reject $H_0$ (when your p-value is less than 0.05 or your prespecified $\alpha$ level), you conclude that at least two of the population means are different; otherwise, you conclude that you didn’t have enough evidence to reject $H_0$ (you can’t say the means are different).

If these steps look like a foreign language to you, don’t fear; I describe each of these steps in detail in the sections to follow.
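To make step four a little more concrete, here is a minimal sketch of how an F-statistic can be computed for the data in Table 9-1. It uses the usual one-way ANOVA decomposition (a between-groups sum of squares and a within-groups sum of squares), which this section names but has not yet spelled out, so treat the formulas, the function name `one_way_anova`, and the column-by-column reading of the table as assumptions. The p-value is left out because it requires the F-distribution, which statistical software supplies.

```python
def one_way_anova(groups):
    """Return (ss_treatment, ss_error, df_treatment, df_error, f_statistic)."""
    all_values = [value for group in groups for value in group]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(group) / len(group) for group in groups]

    # Between-groups variability: how far each group mean sits from the grand mean.
    ss_treatment = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-groups variability: how far each value sits from its own group mean.
    ss_error = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_treatment = k - 1
    df_error = n_total - k
    f_statistic = (ss_treatment / df_treatment) / (ss_error / df_error)
    return ss_treatment, ss_error, df_treatment, df_error, f_statistic


# Table 9-1, one list per age group (6-8, 9-11, 12-14, 15-17 years).
seed_distances = [
    [38, 39, 42, 40, 41],
    [38, 39, 40, 44, 43],
    [44, 43, 40, 44, 45],
    [44, 47, 45, 45, 46],
]

ss_treatment, ss_error, df_treatment, df_error, f_statistic = one_way_anova(seed_distances)
print(f"SS treatment = {ss_treatment:.2f} on {df_treatment} df")
print(f"SS error     = {ss_error:.2f} on {df_error} df")
print(f"F            = {f_statistic:.2f}")
```

A large F-statistic means the group means differ by more than the within-group variability would suggest; step five then turns on the p-value attached to that F.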

Checking the Conditions

Step one of ANOVA is checking to be sure all necessary conditions are met before diving into the data analysis. The conditions for using ANOVA are just an extension of the conditions for a t-test (see the section "Comparing Two Means with a t-Test"). The following conditions all need to hold in order for ANOVA to be conducted:

^ The k populations are independent (in other words, their outcomes don’t affect each other).

^ The k populations each have a normal distribution.

^ The variances of the k normal distributions are equal.

I go into more detail about these conditions in the following sections.

Checking off independence

To check the first condition, examine how the data was collected from each of the separate populations. In order to maintain independence, the outcomes from one population can’t affect the outcomes of the other populations. If the data has been collected by using a separate random sample from each population (random here meaning that each individual in the population had an equal chance of being selected), this factor ensures independence at the strongest level.

In the watermelon seed spitting data (see Table 9-1), the data isn’t randomly sampled from each age group because the data represents everyone who participated in the contest. But you can argue that the seed spitting distances from one age group don’t affect the seed spitting distances from the other age groups, so the independence assumption is okay here also.

Looking for what’s normal

The second ANOVA condition is that each of the k populations has a normal distribution. To check this condition, make a separate histogram of the data from each group and see whether it resembles a normal distribution. Data from a normal distribution should look symmetric (in other words, if you split the histogram down the middle, it looks the same on each side) and have a bell shape. Don’t expect the data in each histogram to follow a normal distribution exactly (remember it’s only a sample), but it shouldn’t be extremely different from a normal, bell-shaped distribution.

Since the data contains only five children per age group, checking conditions can be iffy. But in this case, you have past data for 200 children in each age group, so you can use that to check the conditions. The histograms and descriptive statistics of the seed spitting data for the four age groups are shown in Figure 9-2, all in one panel, so you can easily compare them to each other on the same scale. Looking at the four histograms in Figure 9-2, you can see that each graph resembles a bell shape; the normality condition isn’t being violated here. (Red flags should come up if you see two peaks in the data, or a skewed shape where the peak is off to one side, or if the histogram is flat, for example.)

You can use Minitab to make histograms for each of your samples and have all of them appear on one large panel, all using the same scale. To do this, go to Graph>Histogram and click OK. Choose the variables that represent data from each sample by highlighting them in the left-hand box and clicking Select. Then click on Multiple Graphs, and a new window opens. Under the Show Graph Variables option, check the following box: In separate panels of the same graph. On the Same Scales for Graphs option, check the box for X and the box for Y. This option gives you the same scale on both the X and Y axes for all the histograms. Then click OK.

Figure 9-2: Checking ANOVA conditions by using histograms and descriptive statistics.

[Histograms of Age Group 1, Age Group 2, Age Group 3, and Age Group 4, shown in separate panels on the same scale; horizontal axis from 36 to 51, vertical axis from 0 to 20.]

Descriptive Statistics: Age Group 1, Age Group 2, Age Group 3, Age Group 4

Variable       Total Count    Mean      Variance
Age Group 1    200            40.116    4.256
Age Group 2    200            41.880    4.994
Age Group 3    200            44.165    3.249
Age Group 4    200            47.405    5.154

Taking note of spread

The third condition for ANOVA is that the variance in each of the k populations is the same. To check this out on your data, use Minitab to find the variance in each sample and compare them. The variances for each sample should be close. What does close mean? A hypothesis test can handle this question; however, it falls outside the scope of most intermediate statistics courses. So you are left with a judgment call. Compare all the variances as a group and look for any glaring differences. If a difference is large enough for you to write home about (say 10 percent or more), this variance indicates a problem. (Not only do you have a problem with the ANOVA conditions, but if you’re writing your mom about your stats problems you might need to get a bit of a life.) If no big differences exist in the variances, you can say that the equal variance condition is met. The variances for the seed spitting data are shown in Figure 9-2 for each age group. They are quite close, so this condition is met.

To find descriptive statistics for each sample, go to Stat>Basic Statistics> Display Descriptive Statistics. Click on each variable in the left-hand box for which you want the descriptive statistics and then click Select. Click on the Statistics option, and a new window appears with tons of different types of statistics. Click on the ones you want and click off the ones you don’t want. Click OK. Then click OK again. Your descriptive statistics are calculated.

Note that you don’t need the sample sizes in each group to be equal to carry out ANOVA; however, in intermediate stats, you’ll typically see what statisticians call a balanced design, where each sample from each population has the same sample size. (For more precision in your data, the larger the sample sizes, the better; see Chapter 3.)

Setting Up the Hypotheses

Step two of ANOVA is setting up the hypotheses to be tested. You’re testing to see whether or not all the population means can be deemed equal to each other. The null hypothesis for ANOVA is that all the population means are equal. That is, Ho: $\mu_1 = \mu_2 = \cdots = \mu_k$, where $\mu_1$ is the mean of the first population, $\mu_2$ is the mean of the second population, and so on until you reach $\mu_k$ (the mean of the kth population).

Now what appears in the alternative hypothesis (Ha) must be the opposite of what is in the null hypothesis (Ho). What’s the opposite of having all k of the population means equal to each other? You may think the opposite is that they’re all different. But that’s not the case. In order to blow Ho wide open, all you need is for at least two of those means to not be equal. The alternative hypothesis, Ha, is that at least two of the population means are different from each other. That is, Ha: At least two of $\mu_1, \mu_2, \ldots, \mu_k$ are different.

Note that Ho and Ha for ANOVA are an extension of the hypotheses for a two-sample t-test (which only compares two independent populations). And while the alternative hypothesis in a t-test may be that one mean is greater than, less than, or not equal to the other, you don’t consider any alternative other than $\neq$ in ANOVA. You only want to know whether or not the means are equal, at this stage of the game anyway. After you reach the conclusion that Ho is rejected in ANOVA, you can proceed to figure out how the means are different, which ones are bigger than others, and so on, using multiple comparisons. Those details appear in Chapter 10.

Doing the F-Test

Step three, collecting the data, includes taking k random samples, one from each population. Step four of ANOVA is doing the F-test on this data, which is the heart of the ANOVA procedure. This test is the actual hypothesis test of Ho: $\mu_1 = \mu_2 = \cdots = \mu_k$ versus Ha: At least two of $\mu_1, \mu_2, \ldots, \mu_k$ are different.

You have to carry out three major steps in order to complete the F-test (don’t get these steps confused with the main ANOVA steps; consider the F-test a few steps within a step):

1. Break down the variance of y into sums of squares.

2. Find the mean sums of squares.

3. Put the mean sums of squares together to form the F-statistic.

I describe each step of the F-test in detail and apply it to the example of comparing watermelon seed spitting distances (see Table 9-1) in the following sections.

Because data analysts rely heavily on computer software to conduct each step of the F-test, you can do the same. All computer software packages organize and summarize the important information from the F-test into a table format for you. This table of results for ANOVA is called (what else?) the ANOVA table. Because the ANOVA table is a critical part of the entire ANOVA process, I start the following sections out by describing how to run ANOVA in Minitab to get the ANOVA table, and I continue to reference this section as I describe each step of the ANOVA process.

Running ANOVA in Minitab

Using Minitab to run ANOVA, you first have to enter the data from the k samples. You can enter the data in one of two ways:

Stacked data means that you enter all the data into two columns. Column one includes the number indicating what sample the data value is from (1 to k), and the responses (y) are in column two. To analyze this data, go to Stat>ANOVA>One-Way Stacked. Highlight the response (y) variable and click Select. Highlight the factor (population) variable and click Select. Click OK.

Unstacked is the other method of entering data: a separate column for the data in each sample. To analyze the data entered this way, go to Stat>ANOVA>One-Way Unstacked. Highlight the names of the columns where your data are located. Click OK.

I typically use the unstacked version just because I think it helps visualize the data. However, the choice is up to you, and the results come out the same no matter which one you choose.
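The two layouts carry the same information, which a short sketch can make concrete. The Python below is only an illustration (it is not what Minitab runs internally); the group names, the numbers, and the helper functions `stack` and `unstack` are all made up for this example and are not taken from Table 9-1.

```python
# Hypothetical unstacked data: one column (list) per sample.
unstacked = {
    "Group 1": [38, 39, 42],
    "Group 2": [41, 43, 40],
    "Group 3": [44, 45, 47],
}

def stack(columns):
    """Convert unstacked columns into stacked (sample number, response) rows."""
    rows = []
    for sample_number, values in enumerate(columns.values(), start=1):
        for y in values:
            rows.append((sample_number, y))
    return rows

def unstack(rows):
    """Convert stacked (sample number, response) rows back into one list per sample."""
    columns = {}
    for sample_number, y in rows:
        columns.setdefault(sample_number, []).append(y)
    return [columns[key] for key in sorted(columns)]

stacked = stack(unstacked)
print(stacked[:4])       # first few (sample number, response) rows
print(unstack(stacked))  # back to one list per sample
```

Converting one way and then back returns the original columns, which is why the analysis comes out the same either way.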

Breaking down the variance into sums of squares

The first step of the F-test is splitting up the variability in the y variable into portions that define where the variability is coming from. The term analysis of variance is a great description for exactly how you conduct a test of k population means. With the overall goal of testing whether k population (or treatment) means are equal, you take a random sample from each of the k populations. You first put all the data together into one big group and measure how much total variability there is; this variability is called the sums of squares total, or SSTO. If the data are really diverse, SSTO is large. If the data are very similar, SSTO is small.

Now the total variability in the combined data set (SSTO) can be split into two parts:

SST: The variability between the groups, known as the sums of squares for treatment

SSE: The variability within the groups, known as the sums of squares for error

This splitting up of the variability in your data results in one of the most important equalities in ANOVA. That equality is SSTO = SST + SSE.

The formula for SSTO is the numerator of the formula for $s^2$, the variance of a single data set, so $\text{SSTO} = \sum_i \sum_j (x_{ij} - \bar{x})^2$, where $x_{ij}$ is the jth value in the sample from the ith population and $\bar{x}$ is the overall mean. SSTO represents the total squared distance between the data values and their overall mean. The formula for SST is $\text{SST} = \sum_i n_i (\bar{x}_i - \bar{x})^2$, where $n_i$ is the size of the sample coming from the ith population. SST represents the total squared distance between the means from each sample and the overall mean. The formula for SSE is $\text{SSE} = \sum_i \sum_j (x_{ij} - \bar{x}_i)^2$, where $x_{ij}$ is the jth value in the sample from the ith population and $\bar{x}_i$ is the mean of the sample coming from the ith population. This formula represents the total squared distance between the values in each sample and their corresponding sample means. Using algebra, you can show (with some serious elbow grease) that SSTO = SST + SSE.
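If you'd rather see the equality than prove it, the three sums of squares can be computed directly from their definitions. The following is a minimal sketch in Python; the data values are made up for illustration (they are not the values in Table 9-1), and the function name `sums_of_squares` is just a label chosen here.

```python
# Made-up data for illustration only: three samples of five values each.
samples = [
    [38, 39, 42, 40, 41],
    [41, 43, 40, 42, 44],
    [44, 45, 47, 43, 46],
]

def mean(values):
    return sum(values) / len(values)

def sums_of_squares(groups):
    """Return (SSTO, SST, SSE) for a list of samples."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # SSTO: squared distances between every value and the overall mean.
    ssto = sum((x - grand_mean) ** 2 for x in all_values)
    # SST: squared distances between each sample mean and the overall mean,
    # weighted by the sample size.
    sst = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # SSE: squared distances between each value and its own sample mean.
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return ssto, sst, sse

ssto, sst, sse = sums_of_squares(samples)
print(f"SSTO = {ssto:.2f}, SST = {sst:.2f}, SSE = {sse:.2f}")
print("SSTO equals SST + SSE:", abs(ssto - (sst + sse)) < 1e-9)
```

The comparison uses a small tolerance rather than exact equality because floating-point arithmetic can introduce tiny rounding differences.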

The Minitab output for the watermelon seed spitting contest for the four age groups is shown in Figure 9-3. Under the Source column of the ANOVA table, you see Factor listed in row one. The factor variable (as described by Minitab) represents the treatment or population variable. In column three of the Factor row, you see the SST, which is equal to 89.75. In the Error row (row two), you locate the SSE in column three, which equals 56.80. In row three (Total), column three, you see the SSTO, which is 146.55. Using the values of SST, SSE, and SSTO from the Minitab output, you can verify that SST + SSE = SSTO.

Figure 9-3: ANOVA Minitab output for the watermelon seed spitting example.

One-Way ANOVA: Age Group 1, Age Group 2, Age Group 3, Age Group 4

Source    DF    SS        MS       F       P
Factor     3    89.75     29.92    8.43    0.001
Error     16    56.80      3.55
Total     19    146.55

S = 1.884   R-Sq = 61.24%   R-Sq(adj) = 53.97%

Now you’re ready to use these sums of squares to complete the next step of the F-test (keep reading).

Locating those mean sums of squares

After you have the sums of squares for treatment, SST, and the sums of squares for error, SSE (see preceding section for more on these), you want to compare them to see whether the variability in the y-values that is due to the model (SST) is large compared to the amount of error left over in the data after the groups have been accounted for (SSE). So you ultimately want a ratio comparing SST to SSE somehow. To make this ratio form a statistic that statisticians know how to work with (in this case, an F-statistic), they decided to find the mean of each of SST and SSE and work with that. Finding the mean sums of squares is the second step of the F-test.

MST is the mean sums of squares for treatments, which measures the mean variability that occurs between the different treatments (the different samples in the data). What you’re looking for is the amount of variability in the data as you move from one sample to another. A great deal of variability between samples (treatments) may indicate that the populations are different as well. You can find MST by taking SST and dividing by k - 1 (where k is the number of treatments).

MSE is the mean sums of squares for error, which measures the mean within-treatment variability. The within-treatment variability is the amount of variability that you see within each sample itself, due to chance and/or other factors not included in the model. You can find MSE by taking SSE divided by n - k (where n is the total sample size and k is the number of treatments). The values of k - 1 and n - k, respectively, are called the degrees of freedom for SST and SSE. Minitab calculates and posts the degrees of freedom for SST and SSE, as well as the values of MST and MSE, in the ANOVA table in columns two and four, respectively.

From the ANOVA table for the seed spitting data in Figure 9-3, you can see that column two has the heading DF, which stands for degrees of freedom. You can find the degrees of freedom for SST in the Factor row (row one); this value is equal to k - 1 = 4 - 1 = 3. The degrees of freedom for SSE is found to be n - k = 20 - 4 = 16. (Remember you have four age groups and five children in each group for a total of n = 20 data values.) The degrees of freedom for SSTO is n - 1 = 20 - 1 = 19 (found in the Total row under DF). You can verify that the degrees of freedom for SSTO = degrees of freedom for SST + degrees of freedom for SSE.

The values of MST and MSE are shown in column four of Figure 9-3, with the heading MS. You can see the MST in the Factor row, which is 29.92. This value was calculated by taking SST = 89.75, and dividing it by degrees of freedom, 3. You can see MSE in the Error row, equal to 3.55. MSE is found by taking SSE = 56.80 and dividing that value by its degrees of freedom, 16.
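The arithmetic behind the DF and MS columns is short enough to check by hand or in a few lines of code. This sketch uses only the sums of squares reported in Figure 9-3 and the group counts given in the text; the variable names are just labels chosen here.

```python
# Sums of squares reported in Figure 9-3.
sst, sse = 89.75, 56.80
k = 4   # number of age groups (treatments)
n = 20  # total number of data values (5 children in each of 4 groups)

df_treatment = k - 1  # degrees of freedom for SST
df_error = n - k      # degrees of freedom for SSE
df_total = n - 1      # degrees of freedom for SSTO

mst = sst / df_treatment
mse = sse / df_error

print(f"df: {df_treatment} + {df_error} = {df_total}")
print(f"MST = {mst:.2f}, MSE = {mse:.2f}")
```

Rounded to two decimal places, the results match the MS column of the Minitab output.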

By finding the mean sums of squares, you’ve completed step two of the F-test, but don’t stop here! You need to continue to the next section if you want to complete the process.

Figuring the F-statistic

The test statistic for the test of the equality of the k population means is $F = \frac{\text{MST}}{\text{MSE}}$. The result of this formula is called the F-statistic. The F-statistic has an F-distribution, which is equivalent to the square of a t-statistic (when the numerator degrees of freedom is 1). All F-distributions start at zero and are skewed to the right. The degree of curvature and the height of the curvature of each F-distribution is reflected in two degrees of freedom, represented by k - 1 and n - k. (These come from the denominators of MST and MSE, respectively, where n is the total sample size and k is the total number of treatments or populations.) A shorthand way of denoting the F-distribution for this test is F(k - 1, n - k).

In the watermelon seed spitting example, you’re comparing four means and have a sample of size five from each population. Figure 9-4 shows the corresponding F-distribution, which has degrees of freedom 4 - 1 = 3 and 20 - 4 = 16; in other words, F(3, 16).

You can see the F-statistic on the Minitab ANOVA output (see Figure 9-3) in the Factor row, under the column indicated by F. For the seed spitting example, the value of the F-statistic is 8.43. This number was found by taking MST = 29.92 divided by MSE = 3.55. You can then locate 8.43 on the F-distribution in Figure 9-4 to see where it stands. (More on that in the next section.)

Figure 9-4: The F-distribution with (3, 16) degrees of freedom, F(3, 16).

Making conclusions from ANOVA

If you’ve completed the F-test and found your F-statistic (step four in the ANOVA process), you’re ready for step five of ANOVA: making conclusions for your hypothesis test of the k population means. If you haven’t already, you can compare the F-statistic to the corresponding F-distribution with (k - 1, n - k) degrees of freedom, to see where it stands and make a conclusion. You can make the conclusion in one of two ways: the p-value approach or the critical-value approach. (The approach you use depends primarily on whether you have access to a computer, especially during exams.) I describe these two approaches in the following sections.

Using the p-value approach

On Minitab ANOVA output (see Figure 9-3), the value of the F-statistic is located in the Factor row, under the column noted by F. The associated p-value for the F-test is located in the Factor row under the column headed by P. The p-value tells you whether or not you can reject Ho. If the p-value is less than your prespecified $\alpha$ (typically 0.05), reject Ho. Conclude that the k population means aren’t all equal and that at least two of them are different. If the p-value is greater than $\alpha$, then you can’t reject Ho. You don’t have enough evidence in your data to say the k population means have any differences.

The F-statistic for comparing the mean watermelon seed spitting distances for the four age groups is 8.43. The p-value as indicated in Figure 9-3 is 0.001. That means the results are highly statistically significant. You reject Ho and conclude that at least one pair of age groups differs in mean watermelon seed spitting distance. (You would hope that a 17-year-old could do a lot better than a 6-year-old, but maybe those 6-year-olds have a lot more spitting going on in their lives than 17-year-olds do.)

Using Figure 9-4, you see how the F-statistic of 8.43 stands on the F-distribution with (4 - 1, 20 - 4) = (3, 16) degrees of freedom. You can see it’s way off to the right, out of sight. It makes sense that the p-value, which measures the probability of being beyond that F-statistic, is 0.001.
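To see where a p-value like this comes from, you can approximate the area to the right of the F-statistic yourself. The sketch below uses only Python's standard library: it relies on the standard identity between the upper tail of an F-distribution and the regularized incomplete beta function, and approximates the integral with Simpson's rule. The function name `f_upper_tail` and the step count are choices made for this illustration; it is not how Minitab computes the value, and in practice you would let statistical software do this.

```python
import math

def f_upper_tail(f_stat, df1, df2, steps=2000):
    """Approximate P(F > f_stat) for an F(df1, df2) distribution.

    Uses the identity P(F > f) = I_x(df2/2, df1/2), where
    x = df2 / (df2 + df1 * f) and I_x is the regularized incomplete beta
    function. The integral is approximated with Simpson's rule, so `steps`
    must be even. This sketch assumes f_stat > 0 and df2 >= 2.
    """
    a, b = df2 / 2, df1 / 2
    x = df2 / (df2 + df1 * f_stat)
    h = x / steps

    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    incomplete_beta = h / 3 * total
    complete_beta = math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
    return incomplete_beta / complete_beta

# MST and MSE as reported in Figure 9-3.
mst, mse = 29.92, 3.55
f_stat = mst / mse
p_value = f_upper_tail(f_stat, df1=3, df2=16)

print(f"F = {f_stat:.2f}, approximate p-value = {p_value:.4f}")
print("Reject Ho" if p_value < 0.05 else "Can't reject Ho")
```

The approximate area comes out far below 0.05, in line with the 0.001 that Minitab reports to three decimal places.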

If you’ve gotta use critical values…

If you’re in a situation where you don’t have access to a computer (as is still the case in many statistics courses today when it comes to taking exams), finding the exact p-value for the F-statistic isn’t possible. However, statistical software packages automatically calculate all p-values exactly (so on any computer output you can see them as such).

To approximate the p-value from your F-statistic (in the event you don’t have a computer or computer output available), you find a cutoff value on the F-distribution with (k - 1, n - k) degrees of freedom that draws a line in the sand between rejecting Ho and not rejecting Ho. This cutoff (also known as the critical value) is determined by your prespecified $\alpha$ (typically 0.05). You choose the critical value so that the area to its right on the F-distribution is equal to $\alpha$.

Table A-5 in the Appendix shows the critical values of the F-distribution with various degrees of freedom, all using $\alpha$ = 0.05. Other F-distribution tables are available in various statistics textbooks and Internet links for other values of $\alpha$; however, $\alpha$ = 0.05 is by far the most common $\alpha$ level used for the F-distribution and is sufficient for your purposes.

This table of values for the F-distribution is called the F-table (students are typically given these with their exams). For the seed spitting example, the F-statistic has an F-distribution with degrees of freedom (3, 16), which I calculate in a previous section. To find the critical value, go to Table A-5 in the Appendix. Because the degrees of freedom are (3, 16), go to column 3 and row 16 on the F-table. The critical value is 3.2389 (or 3.24). Your F-statistic for the seed spitting example is 8.43, which is well beyond this critical value (you can see how 8.43 compares to 3.24 by looking at Figure 9-4). Your conclusion is to reject Ho at level $\alpha$. At least two of the age groups differ on mean seed spitting distances.

With the critical value approach, any F-statistic that lies beyond the critical value results in rejecting Ho, no matter how far or close to the line it is. If your F-statistic is beyond the value found in Table A-5, then you reject Ho and say at least two of the treatments (or populations) have different means.
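The idea that the critical value is the cutoff with area $\alpha$ to its right can also be checked numerically. The sketch below is self-contained and uses only Python's standard library: it approximates the upper-tail area of the F-distribution (via the regularized incomplete beta function and Simpson's rule) and then bisects for the cutoff whose tail area equals $\alpha$. The function names and the search bounds are assumptions made for this illustration; on an exam you would simply read the value from the F-table.

```python
import math

def f_upper_tail(f_value, df1, df2, steps=2000):
    """Approximate P(F > f_value) for F(df1, df2); assumes f_value > 0, df2 >= 2."""
    a, b = df2 / 2, df1 / 2
    x = df2 / (df2 + df1 * f_value)
    h = x / steps  # Simpson's rule on [0, x]; steps must be even

    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    complete_beta = math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
    return (h / 3 * total) / complete_beta

def f_critical_value(alpha, df1, df2):
    """Bisect for the cutoff whose upper-tail area equals alpha."""
    low, high = 0.0, 1000.0  # assumed search range, wide enough for this example
    for _ in range(60):
        mid = (low + high) / 2
        if f_upper_tail(mid, df1, df2) > alpha:
            low = mid    # too much area to the right: move the cutoff right
        else:
            high = mid   # too little area to the right: move the cutoff left
    return (low + high) / 2

critical = f_critical_value(0.05, df1=3, df2=16)
f_stat = 8.43  # F-statistic from Figure 9-3

print(f"Critical value is about {critical:.2f}")
print("Reject Ho" if f_stat > critical else "Can't reject Ho")
```

The cutoff found this way agrees with the tabulated 3.24 to two decimal places, and 8.43 lies well beyond it.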

What’s next?

After you’ve rejected Ho in the F-test and concluded that not all the population means are the same, your next question may be: Which ones are different? You can answer that question by using a statistical technique called multiple comparisons. Statisticians use many different multiple comparison procedures to further explore the means themselves after Ho has been rejected by the F-test. I discuss and apply some of the more common multiple comparison techniques in Chapter 10.

Checking the Fit of the ANOVA Model

As with any other model, you must determine how well the ANOVA model fits before you can use its results with confidence. In the case of ANOVA, the model basically boils down to a treatment variable (also known as the population you’re in) plus an error term. To assess how well that model fits the data, see the values of $R^2$ and $R^2$ adjusted on the last line of the ANOVA output below the ANOVA table. For the seed spitting data, you see those values at the bottom of Figure 9-3.

The value of $R^2$ measures the percentage of the variability in the response variable (y) explained by the explanatory variable (x). In the case of ANOVA, the x variable is the factor due to treatment (where the treatment can represent a population being compared). A high value of $R^2$ (say above 80 percent) means this model fits well. The value of $R^2$ adjusted, the preferred measure, takes $R^2$ and adjusts it for the number of variables in the model. In the case of one-way ANOVA, you have only one variable, the factor due to treatment, so $R^2$ and $R^2$ adjusted won’t be very far apart. For more on $R^2$ and $R^2$ adjusted, see Chapter 5.

For the watermelon seed spitting data, the value of $R^2$ adjusted (as found in the last row of Figure 9-3) is only 53.97 percent. That means age group (while shown to be statistically significant by the F-test; see the section "Making conclusions from ANOVA") explains just over half of the variability in the watermelon seed spitting distances. Because age group alone explains only a little over half of what’s going on in the seed spitting distances, you may find other variables you can examine in addition to age group, making an even better model.
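Both fit measures can be recovered from the sums of squares in the ANOVA table. The sketch below uses the standard definitions ($R^2$ = SST/SSTO, and $R^2$ adjusted = 1 minus MSE divided by SSTO/(n - 1)) together with the values reported in Figure 9-3; the variable names are just labels chosen here.

```python
# Values reported in Figure 9-3.
sst, sse, ssto = 89.75, 56.80, 146.55
n, k = 20, 4  # total sample size and number of age groups

# Share of the total variability explained by the factor (age group).
r_sq = sst / ssto

# Adjusted version: compares the error variance, SSE / (n - k),
# with the total variance, SSTO / (n - 1).
r_sq_adj = 1 - (sse / (n - k)) / (ssto / (n - 1))

print(f"R-Sq = {r_sq:.2%}, R-Sq(adj) = {r_sq_adj:.2%}")
```

Both results agree with the last line of the Minitab output once rounded to two decimal places.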

The results of the t-test done to compare the spitting distances of males and females in the section "Comparing Two Means with a t-Test" (see Figure 9-1) showed that males and females were significantly different on mean seed spitting distances. So I would venture a guess that if you include gender as well as age group, thereby creating what statisticians call a two-factor ANOVA (or two-way ANOVA), the resulting model would fit the data even better, resulting in higher values of $R^2$ and $R^2$ adjusted. (See Chapter 11 for two-way ANOVA.)

Many medical and psychological studies use designed experiments to compare the responses of several different treatments, looking for differences. A designed experiment is a study in which subjects are randomly assigned to treatments (experimental conditions) and their responses are recorded. The results are used to compare treatments to see which one(s) work best, which ones work equally well, and so on.

One example of such an experiment that employs ANOVA is from The Ohio State University research press release Web site. The experiment tested three traditional principles of writing refusal letters:

Using a buffer — a neutral or positive sentence that delays the negative information

Placing the reason before the refusal

Ending the letter on a positive note as a way of reselling the business

Subjects were randomly assigned to treatments, and their responses to the rejection letters were compared (likely on some sort of scale such as 1 = very negative to 7 = very positive with 4 being a neutral response).

This scenario can be analyzed by using ANOVA. It compares three treatments (forms of the rejection letters) on some quantitative variable (response to the letter). You can argue that this isn’t a continuous variable, but it has enough possible values that ANOVA isn’t unreasonable. The data were also shown to have a bell shape.

The null hypothesis would be Ho: Mean responses to the three types of rejection letters are equal, versus Ha: At least two forms of the rejection letter resulted in different mean responses.

In the end, the researcher did find some significant results. In other words, the different ways the rejection letter was written affected the participants in different ways. Using multiple comparison procedures (see Chapter 10), you would be able to go in and determine which forms of the rejection letters gave different responses and how the responses differed.

So in case you have to write a rejection letter at some point, the researcher recommends the following guidelines for writing it:

Don’t use buffers to begin negative messages.

Give a reason for the refusal when it makes the sender’s boss look good.

Present the negative positively but clearly; offer an alternative or compromise if possible.

A positive ending isn’t necessary.

In This Chapter

^ Recognizing and avoiding mistakes when interpreting statistical results

^ Knowing how to decide whether or not someone’s conclusions are credible

Intermediate statistics is all about building models and doing data analysis. It focuses on looking at data and figuring out the story behind it. It’s about making sure that the story is told correctly, fairly, and comprehensively. In this chapter, I discuss some of the most common errors I’ve seen as a teacher and statistical consultant for many moons. You can use this list to pull ideas together for homework and reports or as a quick review before a quiz or exam. Trust me: your professor will love you for it!

These Statistics Prove…

Be skeptical of anyone who uses the words "these statistics" and "prove" in the same sentence. The word "prove" is a definitive, end-all-be-all, case-closed, lead-pipe-lock sort of concept, and statistics by nature isn’t definitive. Instead, statistics gives you evidence for or against your theory, model, or claim, based on the data you collected; then it leaves you to your own conclusions. Because the evidence is based on data, and data changes from sample to sample, the results can change as well; that’s the challenge, the beauty, and sometimes the frustration of statistics. The best you can say is that your statistics suggest, lead you to believe, or give you sufficient evidence to conclude, but never go as far as to say that your statistics prove anything.

It’s Not Technically Statistically Significant, But…

After you set up your model and test it with your data, you have to stand by the conclusions no matter how much you believe they’re wrong. Statistics must lend objectivity to every process.

Suppose Barb, a researcher, has just collected and analyzed the heck out of her data, and she still can’t find anything. However, she knows in her heart that her theory holds true, even if her data can’t confirm it. Barb’s theory is that dogs have ESP — in other words, a "sixth sense." She bases this theory on the fact that her dog seems to know when she’s leaving the house, when he’s going to the vet, and when a bath is imminent, because he gets sad and finds a corner to hide in.

Barb tests her ESP theory by studying ten dogs, placing a piece of dog food under one of two bowls and asking each dog to find the food by pushing on a bowl. (Assume the bowl is thick enough that the dogs can’t cheat by smelling the food.) She repeats this process ten times with each dog and records the number of correct answers. If the dogs don’t have ESP, you would expect that they would be right 50 percent of the time, because each dog has two bowls to choose from and each bowl has an equal chance of being selected.

As it turns out, the dogs were right 55 percent of the time. Now this percentage is technically higher than the long-term expected value of 50 percent, but it’s not enough (especially with so few dogs and so few trials) to warrant statistical significance. In other words, Barb doesn’t have enough evidence for the ESP theory. But when Barb presents her results at the next conference she attends, she puts a spin on her results by saying "The dogs were correct 55 percent of the time, which is more than 50 percent. These results are technically not enough to be statistically significant, but I believe they do show some evidence that dogs have ESP."
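To get a feel for how unremarkable 55 percent is, you can work out how often pure guessing would do at least that well. The sketch below is one simple way to frame it, under assumptions that go beyond the story: it treats all 10 x 10 = 100 trials as independent guesses with a 50 percent chance of success each (ignoring any differences between dogs), takes 55 percent to mean 55 correct out of 100, and uses a one-sided exact binomial calculation. The function name is a label chosen here.

```python
from math import comb

def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) when X is binomial with the given trials and chance p."""
    return sum(
        comb(trials, x) * p ** x * (1 - p) ** (trials - x)
        for x in range(successes, trials + 1)
    )

trials = 10 * 10  # ten dogs, ten attempts each (assumed independent)
correct = 55      # 55 percent of 100 attempts

p_value = binomial_upper_tail(correct, trials)
print(f"Chance of {correct} or more correct by guessing alone: {p_value:.3f}")
print("Statistically significant at 0.05" if p_value < 0.05 else "Not statistically significant at 0.05")
```

Under these assumptions, guessing alone produces a result at least this good far more often than 5 percent of the time, which is why the data don't support the ESP claim.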

Some statistically incorrect researchers use this kind of conclusion all the time — skating around the statistics when they don’t go their way. This game is very dangerous, because the next time someone tries to replicate Barb’s results (and believe me, someone always does), they find out what you knew from the beginning (through ESP?): When Barb starts packing for a trip, her dog senses trouble coming and hides. That’s all.

This Means X Causes Y

Do you see the word that makes statisticians nervous? Because the words "this" and "means" seem pretty tame, and X and Y are just letters of the alphabet, it’s got to be that word "cause." Of all the words on a final exam that aren’t supposed to be there, "cause" probably tops the list.

Here’s an example of what I mean. For your final report in stats class, you study which factors are related to your final exam score. You collect data on 500 statistics students, asking each one a variety of questions, such as "What was your grade on the midterm?"; "How much sleep did you get the night before the final?"; and "What is your GPA?" You conduct a multiple linear regression analysis (using techniques from Chapter 5), and you conclude that study time and the amount of sleep the night before are the most-important factors in determining exam scores. You write up all your analyses in a paper, and at the very end you say, "These results demonstrate that more study time and a good night of sleep the night before causes your exam grade to be higher."

I was with you until you said the word "cause." You can’t say that more sleep or more study time causes an increase in exam score. The data you collected shows that people who get a lot of sleep and study a lot do get good grades, and those who don’t don’t get the good grades. But that result doesn’t mean you can take a flunky and just have him sleep and study more, and all will be okay. This theory is like saying that because an increase in height is related to an increase in weight, you can get taller by gaining weight.

The problem is that you didn’t take an individual person, change his sleep time and study habits, and see what happened in terms of exam performance (using two different exams of the same difficulty). That study requires a designed experiment. When you conduct a survey, you have no way of controlling other related factors going on, which can muddy the waters.

The only way to control for other factors is to do a randomized experiment (complete with a treatment group, a control group, and controls for other factors that may ordinarily affect the outcome). Claiming causation without conducting a randomized experiment is a very common error some researchers make when they draw conclusions.

I Assumed the Data Was Normal…

The operative word here is "assumed." To break it down simply, an assumption is something you believe without checking. Assumptions can lead to wrong analyses and incorrect results, all without the person doing the assuming even knowing it.

Many analyses have certain requirements. For example, data should come from a normal distribution (the classic distribution that has a bell shape to it). If someone says "I assumed the data was normal," she just assumed that the data came from a normal distribution. But is having a normal distribution an assumption you just make and then move on, or is more work involved? You guessed it — more work.

For example, in order to conduct a one-sample t-test (see Chapter 3), your data must come from a normal distribution unless your sample size is large, in which case you get an approximate normal distribution anyway by the Central Limit Theorem (remember those three words from intro stats?). Here, you aren’t making an assumption, but examining a condition (something you check before proceeding). You plot the data, see if it meets the condition, and if it does, you proceed. If not, you can use nonparametric methods instead (Chapter 16).

Nearly every statistical technique for analyzing data has at least some condition(s) on the data in order for you to use it. Always find out what those conditions are, and check to see whether your data meets them. Be aware that many statistics textbooks wrongly use the word assumption when they actually mean condition. It’s a subtle, but very important, difference.

I’m Only Reporting "Important" Results

As a data analyst, you must not only avoid the pitfall of reporting only the significant, exciting, and meaningful results, but you also have to be able to detect when someone else is doing so. Some number crunchers examine every possible option and look at their data in every possible way before settling on the analysis that got them the desired result.

You can probably see the problem here. Every technique has a chance for error along with it. If you’re doing a t-test, for example, and the α level is 0.05, over the long term 5 out of every 100 t-tests you conduct on data where nothing is really going on will result in a false alarm just by chance (you declare a statistically significant result when it wasn’t really there). So, if an eager researcher conducts 20 hypothesis tests on the same data set, he can expect about one of those tests to result in a false alarm just by chance, even if no real effect exists. As this researcher conducts more and more tests, he’s unfairly increasing his odds of "finding something" and running the risk of a wrong conclusion in the process.
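To see how quickly the false alarms pile up, here is a minimal Python sketch. It assumes the tests are independent and that Ho is true in every one of them; neither assumption comes from the example itself, and tests run on the same data set are rarely fully independent, so treat the numbers as a rough guide. The function names are made up for illustration.

```python
def expected_false_alarms(num_tests, alpha=0.05):
    """Average number of false alarms when Ho is true in every test."""
    return num_tests * alpha


def prob_at_least_one_false_alarm(num_tests, alpha=0.05):
    """Chance of one or more false alarms, ASSUMING independent tests with Ho true."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 20, 100):
    print(
        f"{k:>3} tests: expect {expected_false_alarms(k):.2f} false alarms, "
        f"P(at least one) = {prob_at_least_one_false_alarm(k):.3f}"
    )
```

Under these assumptions, 20 tests give roughly a 64 percent chance of at least one false alarm, which is why a single "significant" finding out of many attempts deserves a second look.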

It’s not all the eager researcher’s fault. He’s pressured by a result-driven system. It’s a sad state of affairs when the only results that get broadcast on the news and appear in journal articles are the ones that show a statistically significant result (when Ho is rejected). Perhaps it was a bad choice when statisticians came up with the term significance to denote rejecting Ho, as if to say that rejecting Ho is the only important conclusion you can come to. What about all the times when Ho couldn’t be rejected? For example, when doctors failed to conclude that drinking diet cola causes weight gain, or when pollsters didn’t find that people were unhappy with the president? The public would be better served if researchers and the media were encouraged to spend at least some time reporting the statistically insignificant but still important results, along with the statistically significant ones.

The bottom line is this: In order to find out whether a statistical conclusion is correct, you can’t just look at the analysis the researcher is showing you. You also have to find out about the analyses and results they’re not showing you and ask questions. Avoid the urge to rush to reject Ho.

A Bigger Sample Is Always Better

Bigger is better in some things, but not always with sample sizes. On one hand, the bigger your sample is, the more precise the results are (if no bias is present). A bigger sample also increases the ability of your data analysis to detect differences from a model or to deny some claim about a population (in other words, to reject Ho when you’re supposed to). This ability to detect true differences from Ho is called the power of a test (see Chapter 3). However, some researchers can (and often do) take the idea of power too far. They increase the sample size to the point where even the tiniest difference from Ho sends them screaming to press that all-important reject Ho button.

Suppose research claims that the typical in-house dog watches an average of ten hours of TV per week. Bob thinks the true average is more, based on the fact that his dog Fido watches at least ten hours of cooking shows alone each week. Bob sets up the following hypothesis test: Ho: μ = 10 versus Ha: μ > 10. He takes a random sample of 100 dogs and has their owners record how much TV their dogs watch per week. The result turns out that the sample mean is 10.1 hours, and the sample standard deviation is 0.8 hours. This result isn’t what Bob hoped for because 10.1 is so close to 10. He calculates the test statistic for this test using the formula $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ and comes up with a value of

$$t = \frac{10.1 - 10.0}{0.8/\sqrt{100}} = \frac{0.1}{0.08} = 1.25$$

Because the test is a right-tailed test (> in Ha), he can reject Ho at α = 0.05 if t is beyond 1.645, and his t-value of 1.25 is far short of that value. Note that because n = 100 here, you find the value of 1.645 by looking at the very last row of the t-distribution table (Table A-1 in the Appendix). The row is marked with the infinity sign to indicate a large sample. So Bob can’t reject Ho.

To add insult to injury, Bob’s friend Joe conducts the same study and gets the same sample mean and standard deviation as Bob did, but Joe uses a random sample of 500 dogs rather than 100. Consequently, Joe’s t-value is

$$t = \frac{10.1 - 10.0}{0.8/\sqrt{500}} \approx \frac{0.1}{0.036} \approx 2.78$$

Because 2.78 is greater than 1.645, Joe gets to reject Ho (to Bob’s dismay).

Why did Joe’s test find a result that Bob’s didn’t? The only difference was the sample size. Joe’s sample was bigger, and a bigger sample size always makes the standard error smaller (see Chapter 3). The standard error sits in the denominator of the t-formula (as you just saw), so as it gets smaller, the t-value gets larger. A larger t-value makes it easier to reject Ho. (See Chapter 3 for more on precision and margin of error.)
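The effect of sample size on the test statistic is easy to check for yourself. The following Python sketch recomputes Bob’s and Joe’s t-values from the numbers in the example; the function name and the constant for the 1.645 cutoff are just illustrative labels.

```python
import math


def one_sample_t(sample_mean, null_mean, sample_sd, n):
    """One-sample t statistic: (x_bar - mu_0) / (s / sqrt(n))."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - null_mean) / standard_error


CRITICAL_VALUE = 1.645  # right-tailed cutoff quoted in the example

bob_t = one_sample_t(10.1, 10.0, 0.8, 100)  # Bob: 100 dogs
joe_t = one_sample_t(10.1, 10.0, 0.8, 500)  # Joe: 500 dogs, same mean and sd

print(f"Bob: t = {bob_t:.2f}, reject Ho? {bob_t > CRITICAL_VALUE}")
print(f"Joe: t = {joe_t:.2f}, reject Ho? {joe_t > CRITICAL_VALUE}")
```

Carried at full precision, Joe’s value comes out near 2.80; the 2.78 in the example comes from rounding the standard error to 0.036 first. Either way, the same 0.1-hour difference goes from "not significant" to "significant" purely because n grew.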

Now, Joe could technically give a big press conference or write an article on his results (his mom would be so proud), but you know better. You know that Joe’s results are technically statistically significant, but not practically significant; they don’t mean squat to any person or dog. After all, who cares that he was able to show evidence that dogs watch just a tiny bit more than ten hours of TV per week? This news isn’t exactly earth-shattering.

Sample sizes should be large enough to provide precision and repeatability of your results, but there is such a thing as being too large, believe it or not. You can always take sample sizes big enough to reject any null hypothesis, even when the actual deviation from it is embarrassingly small. What can you do about this? When you read or hear that a result was deemed statistically significant, ask what the sample mean actually was (before it was put into the t-formula) and see how significant it is to you from a practical standpoint. Beware of someone who says, "These results are statistically significant, and the large sample size of 100,000 gives even stronger evidence for that."

It’s Not Technically Random, But…

When you take a sample on which to build statistical results, the operative word is random. You want the sample to be randomly selected from the population. The problem is that people oftentimes collect a sample that they think is mostly random or sort of random or random enough, and that doesn’t cut it. The plan for taking a sample is either random or it isn’t.

One day I gave each student in my class of 50 a number from 1 to 50, and I drew two numbers randomly from a hat. The two students I picked sat in the first row, and not only that, they sat right next to each other. Students immediately cried foul!

After these seemingly odd results appeared, I took the opportunity to talk to my class about truly random samples. A random sample is chosen in such a way that every member of the original population has an equal chance of being selected. Sometimes people who sit next to each other are chosen. In fact, if these seemingly strange results never happen, you may worry about the process; in a truly random process, you’re going to get results that may seem odd, weird, or even fixed. That’s part of the game.

In my consulting experiences, I always ask how my clients chose or plan to choose their samples. They always say they’ll make sure it’s random. But when I ask them how they’ll do this, I sometimes get less-than-stellar answers. For example, someone needed to get a random sample from a population of 500 free-range chickens in a farmyard. He needed five chickens and said that he’d select them randomly by choosing the five that came up to him first. The problem is, animals that come up to you may be friendlier, more docile, older, or perhaps more tame. These characteristics aren’t present in every chicken in the yard, so choosing a sample this way isn’t random. The results are likely biased in this case.

Always ask the researcher how she selected a sample, and when you select your own samples, stay true to the definition of random. And don’t use your own judgment to choose a random sample; use a computer to do it for you!
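Letting a computer do the choosing can be as simple as the following Python sketch, which draws five chickens from a yard of 500. It assumes each chicken carries a tag numbered 1 through 500 (the tags are hypothetical; the example doesn’t say how the chickens are identified), and the fixed seed is there only so the same draw can be reproduced later.

```python
import random

# Hypothetical ID tags for the 500 chickens in the yard.
population_ids = list(range(1, 501))

# A seeded generator makes the draw repeatable; drop the seed for a fresh draw.
rng = random.Random(2024)

# sample() picks without replacement, and every chicken has the same chance.
chosen = rng.sample(population_ids, k=5)

print("Chickens to catch:", sorted(chosen))
```

The hard part then becomes catching the specific chickens the computer picked rather than the five friendliest ones, which is exactly the point.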

1,000 Responses Is 1,000 Responses

A newspaper article on the latest survey says that 50 percent of the respondents said blah blah blah. The fine print says the results are based on a survey of 1,000 adults in the United States. But wait — is 1,000 the actual number of people selected for the sample, or is it the final number of respondents? You may need to take a second look; those two numbers hardly ever match.

For example, Jenny wants to know what percentage of people in the U.S. have ever knowingly cheated on their taxes. In her statistics class, she found out that if she gets a sample of 1,000 people, the margin of error for her survey is only plus or minus 3 percent, which she thinks is groovy. So she sets out to achieve the goal of 1,000 responses to her survey. She knows that these days it’s hard to get people to respond to a survey, and she’s worried that she may lose a great deal of her sample that way, so she has an idea. Why not send out more surveys than she needs, so that she gets 1,000 surveys back?

Jenny looks at several survey results in the newspapers, magazines, and on the Internet, and she finds that the response rate (the percentage of people who actually responded to the survey) is typically around 25 percent. (In terms of the real world, I’m being generous with this number, believe it or not. But think about it: How many surveys have you thrown away lately? Don’t worry, I’m guilty of it too.) So, Jenny does the math and figures that if she sends out 4,000 surveys and gets 25 percent of them back, she has the 1,000 surveys she needs to do her analysis, answer her question, and have that small margin of error of plus or minus 3 percent.

Jenny conducts her survey, and just like clockwork, out of the 4,000 surveys she sends out, 1,000 come back. She goes ahead with her analysis and finds that 400 of those people reported cheating on their taxes (40 percent). She adds her margin of error, and reports, "Based on my survey data, 40 percent of Americans cheat on their taxes, plus or minus 3 percentage points."

Now hold the phone, Jenny. She only knows what those 1,000 people who returned the survey said. She has no idea what the other 3,000 people said. And here’s the kicker: Whether or not someone responds to a survey is often related to the reason the survey is being done. It’s not a random thing. Those nonrespondents (people who don’t respond to a survey) carry a lot of weight in terms of what they’re not taking time to tell you.

For the sake of argument, suppose that 2,000 of the people who originally got the survey were uncomfortable with the question because they do cheat on their taxes, and they just didn’t want anyone to know about it, so they threw the survey in the trash. Suppose that the other 1,000 people don’t cheat on their taxes, so they didn’t think it was an issue and didn’t return the survey. If these two scenarios were true, the results would look like this:

Cheaters = 400 (surveyed) + 2,000 (nonrespondents) = 2,400

These results raise the total percentage of cheaters to 2,400 divided by 4,000 — 60 percent. That’s a huge difference!

You could go completely the other way with the 3,000 nonrespondents. You can suppose that none of them cheat, but they just didn’t take time to say so. If you knew this info, you would get 600 (surveyed) + 3,000 (nonrespondents) = 3,600 noncheaters. Out of 4,000 surveyed, this is 90 percent noncheaters, which leaves only 10 percent cheaters. The truth is likely to be somewhere between the two examples I just gave you, but nonrespondents make it too hard to tell.

And the worst part is that the formulas Jenny uses for margin of error don’t know that the information she put into them is based on biased data, so her reported 3 percent margin of error is wrong. The formulas happily crank out results no matter what. It’s up to you to make sure that what you put into the formulas is good, clean info.
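Here is a short Python sketch that puts Jenny’s numbers side by side. The margin-of-error formula is the usual one for a sample proportion; the 95 percent confidence level (z = 1.96) and the conservative planning value of 0.5 are assumptions, because the example gives only the "plus or minus 3 percent" figure. The function names are invented for illustration.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a sample proportion (ASSUMES 95% confidence, z = 1.96)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


SENT = 4000             # surveys mailed out
RETURNED = 1000         # surveys that came back
CHEATERS_REPORTED = 400 # respondents who admitted cheating


def cheater_share(nonrespondent_cheaters):
    """Share of all 4,000 people who cheat, given a guess about the nonrespondents."""
    return (CHEATERS_REPORTED + nonrespondent_cheaters) / SENT


# Conservative planning value p = 0.5 gives the familiar "about 3 percent".
moe = margin_of_error(0.5, RETURNED)

print(f"Reported: {CHEATERS_REPORTED / RETURNED:.0%} plus or minus {moe:.1%}")
print(f"If 2,000 nonrespondents cheat: {cheater_share(2000):.0%}")
print(f"If no nonrespondents cheat:    {cheater_share(0):.0%}")
```

The two scenarios from the example land at 60 percent and 10 percent, far outside the reported 37-to-43 percent window. The formula has no way of knowing that 3,000 people never answered.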

Getting 1,000 results when you send out 4,000 surveys is nowhere near as good as getting 1,000 results when sending out 1,000 surveys (or even 100 results from 100 surveys). Plan your survey based on how much follow-up you can do with people to get the job done, and if it takes a smaller sample size, so be it. At least the results have a better chance of being statistically correct.

Of Course These Results Apply to the General Population!

Making conclusions about a much broader population than your sample actually represents is one of the biggest no-no’s in statistics. This kind of problem is called generalization, and it occurs more often than you may think. People want their results instantly; they don’t want to wait for them, so well-planned surveys and experiments take a back seat to instant Web surveys and convenience samples.

For example, a researcher wants to know how cable news channels have influenced the way Americans get their news. He also happens to be a statistics professor at a large research institution and has 1,000 students in his class. He decides that instead of taking a random sample of Americans, which would be difficult, time-consuming, and expensive, he’ll just put a question on his final exam to get his students’ answers. His data analysis shows him that only 5 percent of his students read the newspaper and/or watch network news programs anymore; the rest watch cable news. For his class, the ratio of students who exclusively watch cable news compared to those students who don’t is 19 to 1. The professor reports this and sends out a press release about it. The cable news channels pick up on it and the next day are reporting, "Americans choose cable news channels over newspapers and network news by a 19 to 1 margin!"

Do you see what’s wrong with this picture? The problem is that the professor’s conclusions go way beyond his study, which is wrong. He used the students in his statistics class to obtain the data that serves as the basis for his entire report and the resulting headline. Yet the professor reports the results about all Americans. I think it’s safe to say that a sample of 1,000 college students taking a statistics class at the same time at the same college doesn’t represent a cross section of America.

If the professor wants to make conclusions in the end about America, he has to select a random sample of Americans to take his survey. If he uses 1,000 students from his class, then his conclusions can only be made about that class and no one else.

To avoid or detect generalization, identify the population that you’re intending to make conclusions about and make sure the sample you selected represents that population. If the sample represents a smaller group within that population, then the conclusions have to be downsized in scope also.

I Just Decided to Leave It Out

It seems easier sometimes to just leave information out. I see this all too often when I read articles and reports based on statistics. But this error isn’t the fault of only one person or group. The guilty parties can include the following:

The producers: The researchers out there leave items out for a variety of reasons, including time and space constraints. After all, you can’t write about every element of the experiment from beginning to end. However, other items they leave out may be indicative of a bigger problem. For example, reports often say very little about how they collected the data or chose the sample. Or they may discuss the results of a survey but not show the actual questions they asked. Ten out of 100 people may have dropped out of their experiment, and they don’t tell you why. All these items are important to know before making a decision about the credibility of someone’s results.

Another way in which some data analysts leave information out is by removing data that doesn’t fit the intended model (in other words, "fudging" the data). Suppose a researcher records the amount of time surfing the Internet and relates it to age. He fits a nice line to his data indicating that younger people surf the Internet much more than older people and that surf time decreases as age increases. All is good except for Claude the outlier, who is 80 years old and surfs the Internet day and night, leading his own bingo chat rooms and everything. What to do with Claude? If not for him, the relationship looks beautiful on the graph; what harm would it do to remove him? After all, he’s only one person, right?

No way. Everything is wrong with this idea. Removing undesired data points from a data set is not only very wrong but also very risky. The only time it’s okay to remove an observation from a data set is if you’re certain beyond doubt that the observation is just plain wrong. For example, someone writes on a survey that she spends 30 hours a day surfing the Internet or that her IQ is 2,200.

The communicators: When reporting statistical results, the media leaves out important information all the time, which is often due to space limitations and fast deadlines. However, part of it is a result of the current, fast-paced society that feeds itself on sound bites. The best example is survey results, where they often leave out the size of the sample. You can’t calculate margin of error without it.

The consumers: The general public also plays a role in the leave-things-out mindset. People hear a news story and instantly believe it’s true, ignoring any chance for error or bias in the results. You need to make a decision about what car to buy, and you ask your neighbors and friends rather than examine the research and the meticulous, comprehensive ratings that have resulted. Everyone neglects to ask questions as much as he should, at one time or another, which indirectly feeds the entire problem.

In the chain of statistical information, the producers (researchers) need to be comprehensive and forthcoming about the process they conducted and the results they got. The communicators of that information (the media) need to critically evaluate the accuracy of the information they’re getting and report it fairly. The consumers of statistical information (the rest of us) need to stop taking results for granted and to rely on credible sources of statistical studies and analyses to help make those important life decisions.

In the end, if a data set looks too good, it probably is. If the model fits too perfectly, be suspicious. If it fits exactly right, run and don’t look back! Sometimes what is left out speaks much louder than what is put in.

Going Nonparametric

In This Chapter

^ Seeing the need for nonparametric techniques

^ Distinguishing regular methods from nonparametric methods

^ Laying the groundwork: The basics of nonparametric statistics

Many researchers do analyses involving hypothesis tests, confidence intervals, chi-square tests, regression, and ANOVA. But nonparametric statistics doesn’t seem to gain the same popularity as the other methods. It’s more in the background, an unsung hero, if you will. However, nonparametric statistics is, in fact, a very important and very useful area of statistics because it gives you accurate results when other, more common methods fail.

In this chapter, you see the importance of nonparametric techniques and why they should have a prominent place in your data-analysis toolbox. You also discover some of the basic terms and techniques involved with nonparametric statistics.

Arguing for Nonparametric Statistics

Nonparametric statistics plays an important role in the world of data analysis. Nonparametric techniques can save the day when you can’t use other methods. The problem is that researchers often disregard, or don’t even know about, nonparametric techniques and don’t use them when they should. In that case, you never know what kind of results you get; what you do know is they could very well be wrong.

In the following sections, you see the advantages and the flexibility of using a nonparametric procedure. You also find out the downside is minimal, which makes it a win-win situation most of the time.

No need to fret if conditions aren’t met

Many of the techniques that you typically use to analyze data, including many shown in this book, have one very strong condition on the data that must be met in order to use them: The population(s) from which your data are collected must follow a normal distribution. These methods are called parametric methods.

There are a couple of ways to help you decide whether a population has a normal distribution, based on your sample:

You can graph the data, using a histogram, and see whether it appears to have a bell shape (a mound of data in the middle, trailing down on each side).

You can make a normal probability plot, which compares your data to that of a normal distribution, using an x-y graph (similar to the ones used when you graph a straight line). If the data do follow a normal distribution, your normal probability plot will show a straight line. If the data do not follow a normal distribution, the normal probability plot will not show a straight line; it may show a curve off to one side or the other, for example.

To make a histogram in Minitab, enter your data into a column. Go to Graph>Histogram, and click OK. Click on your variable in the left-hand box, and it appears in the Graph Variables box. Click OK, and you get a histogram.

To make a normal probability plot in Minitab, enter your data in a column. Go to Graph>Probability Plot and click OK. Click on your variable in the left-hand column, and it appears in the Graph Variables column. Click OK, and you see your normal probability plot.
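If you don’t have Minitab handy, the idea behind a normal probability plot can be sketched in Python using only the standard library. This is a rough illustration rather than a reproduction of Minitab’s plot: the plotting positions (i − 0.5)/n are one common convention and are an assumption here, the two data sets are hypothetical, and the function names are made up. The sketch pairs each sorted data value with the matching normal quantile and then measures how close those points come to a straight line.

```python
from statistics import NormalDist, mean


def normal_plot_points(data):
    """Pair each sorted value with the standard normal quantile at (i - 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


def straightness(points):
    """Correlation of the plotted points; values near 1 mean a nearly straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data sets, invented for illustration only.
bell_shaped = [4.1, 4.6, 4.8, 5.0, 5.0, 5.1, 5.3, 5.5, 5.9]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 9, 30]

print(f"Bell-shaped data:  {straightness(normal_plot_points(bell_shaped)):.3f}")
print(f"Right-skewed data: {straightness(normal_plot_points(right_skewed)):.3f}")
```

The bell-shaped data hug a straight line much more closely than the skewed data do. A single number like this is no substitute for looking at the plot itself, but it captures what your eye is checking for.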

When you find that the normal distribution condition is clearly not met, that’s where nonparametric methods come in. Nonparametric methods are those data-analysis techniques that don’t require the data to have a specific distribution. Nonparametric procedures may require one of the following two conditions (and these are only in certain situations):

The data come from a symmetric distribution (which looks the same on each side when you cut it down the middle).

The data from two populations come from the same type of distribution (they have the same general shape).

Note also that the normal distribution centers solely on the mean as its main statistic (for example, the Z-value for the hypothesis test for one population mean is calculated by taking the data value, subtracting the mean, and dividing by the standard deviation). So the condition that the population has a normal distribution automatically says you are working with the mean. However, many nonparametric procedures work with the median, which is a much more flexible statistic because the median isn’t affected by outliers or skewness as the mean is.

The median’s in the spotlight for a change

Many times, any particular statistics question at hand revolves around the center of a population, that is, the number that represents a typical value, or a central value, in the population. One of those measures of center is the mean. The population mean is the average value over the entire population, which is something that is typically not known (that’s why you take a sample). Many data analysts focus heavily on the population mean; they want to estimate it, test it, compare the means of two or more populations, or predict the mean value of a y variable given an x variable. However, the mean isn’t the only measure of the center of a population; you also have the good ol’ median.

You may recall that the median of a data set is the value that represents the exact middle, when you order the data from smallest to largest. For example, in the data set 1, 5, 4, 2, 3, you order the data to get 1, 2, 3, 4, 5 and find that the number in the middle is 3, the median. If the data set has an even number of values, for example, 2, 4, 6, 8, then you average the two middle numbers to get your median (5 in this case).
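The same two steps (sort, then take the middle) translate directly into a few lines of Python. The hand-written function below is only a sketch to show the procedure, and its name is made up; Python’s built-in statistics.median does the same job and is shown alongside as a check.

```python
from statistics import median


def median_by_hand(values):
    """Sort the data, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # odd count: one exact middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even count: average two


odd_set = [1, 5, 4, 2, 3]
even_set = [2, 4, 6, 8]

print(median_by_hand(odd_set), median(odd_set))
print(median_by_hand(even_set), median(even_set))
```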

As you may recall from your introductory statistics course, you can find the mean and the median and compare them to each other. You organize your data into a histogram, and you look at its shape. If the data set is symmetric, meaning it looks the same on either side when you draw a line down the middle, the mean and median are the same. Figure 16-1a shows an example of this situation. In this case, the mean and median are both 5.

If the histogram is skewed to the right, meaning that you have a lot of smaller values and a few larger values, the mean increases due to those few large values, but the median isn’t affected. In this case, the mean is larger than the median. Figure 16-1b shows an example of this situation. In this case, the mean is 4.5 and the median is 4.0.

When a data set is skewed left, you have many larger values that pile up, but only a few smaller values. In this case, the mean goes down because of the few small values, but the median still isn’t affected. In this case, the mean is lower than the median. Figure 16-1c pictures this case, with a 6.5 mean and a 7.0 median.

Figure 16-1: Symmetric and skewed histograms: (a) symmetric, (b) skewed right, (c) skewed left.

My point is that the median is important! It’s a measure of the center of a population, or a sample data set. The median competes with the mean and often wins. Researchers use nonparametric procedures when they want to estimate, test, or compare the median(s) of one or more populations. They also want to use the median in cases where their data are symmetric but don’t necessarily follow a normal distribution, or when they want to focus on a measure of center that’s not influenced by outliers (extreme values either above or below the mean) or skewness.

For example, if you look at house prices in your neighborhood, you may find a large number of houses within a certain relatively small price range, and then you have a few homes that cost a great deal more. If a real estate agent wants to sell a house and intends to justify a high price for it, she may report the mean price of homes in your neighborhood because the mean is affected by outliers. The mean is higher than the median in this case. But if the agent wants to help someone buy a house, she wants to look at the median of the house prices in the neighborhood, because the median isn’t affected by those few higher-priced homes and is lower than the mean.
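A quick Python sketch shows how far apart the two numbers can land. The prices below are hypothetical, invented only to mimic a neighborhood with many similar homes and a couple of very expensive ones.

```python
from statistics import mean, median

# Hypothetical house prices in thousands of dollars (made up for illustration).
prices = [180, 185, 190, 195, 200, 205, 210, 650, 900]

typical_mean = mean(prices)      # pulled upward by the two expensive homes
typical_median = median(prices)  # sits in the middle of the ordered prices

print(f"Mean price:   {typical_mean:.1f} thousand")
print(f"Median price: {typical_median:.1f} thousand")
```

With these made-up numbers the mean is well over 300 thousand while the median is 200 thousand, so the seller’s agent and the buyer’s agent can each quote a "typical" price and both be telling the truth.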

Now suppose you want to come up with a number that describes the typical house price in your entire county. Should you use the mean or the median? You gathered techniques in your introductory statistics class for estimating the mean of a population (see Chapter 3 for a quick review), but you probably didn’t hear about how to come up with a confidence interval for the median of a population. Oh sure, you can take a random sample and calculate the median of that sample. But you need a margin of error to go with it. And I’ll tell you something: The formula for the margin of error for the mean doesn’t work for the margin of error associated with the median. (See Chapter 17 for the margin of error for the median.)

So, what’s the catch?

You may be wondering, what’s the catch if I use a nonparametric technique? A downside must be around here somewhere. Well, many researchers believe that nonparametric techniques water down statistical results; for example, say you find an actual difference between two population means, and the populations really do have a normal distribution. A parametric technique, the hypothesis test for two means, would likely detect this difference (if the sample size was large enough).

The question is, if you use a nonparametric technique (which doesn’t need the populations to be normal), do you risk the chance of not finding the difference? The answer is maybe, but the risk isn’t as big as you think. More often than not, nonparametric procedures are only slightly less efficient than parametric procedures when the normality condition is met (meaning they don’t work quite as well at detecting a significant result or at estimating a value, but this difference in efficiency is small). But the big payoff occurs when the normal distribution conditions aren’t met. Parametric techniques can lead to the wrong conclusion, and corresponding nonparametric techniques can lead to a correct answer. Many researchers don’t know this, so spread the word!

The bottom line: Always check for normality first. If you’re very confident that the normality condition is met, go ahead and use parametric procedures because they are more precise. If you have any doubt about the normality condition, use nonparametric procedures. Even if the normality condition is met, nonparametric procedures are only a little less precise than parametric procedures. If the normality condition isn’t met, nonparametrics provide appropriate and justifiable results where parametric procedures may not.

Getting the Basics of Nonparametric Statistics

Because you may not have run into nonparametric statistics during your intro to stats class, figuring out some of the basics needs to be your first step toward using nonparametric techniques. In this section, you get used to some of the terminology and major concepts involved in nonparametric statistics. These terms and concepts are commonly used in Chapters 17 through 20 of this book (and hopefully in your intermediate stats course).

Sign

The sign is a value of 0 or +1 that’s assigned to each number in the data set. The sign for a value in the data set represents whether that data value is larger or smaller than some specified number. The value of +1 is given if the data value is greater than the specified number, and the value of 0 is given if the data value is less than or equal to the specified number. For example, suppose your data set is 10, 12, 13, 15, 20, and your specified number for comparison is 16. Because 10, 12, 13, and 15 are all less than 16, they each receive a sign of 0. Because 20 is greater than 16, it receives a sign of +1.

Several uses of the sign statistic appear in nonparametric statistics. You can use signs to test to see if the median of a population equals some specified value. Or you can use signs to analyze data from a matched-pairs experiment (where subjects are matched up according to some variable and a treatment is applied and compared). You can also use signs in combination with other nonparametric statistics. For example, you can combine signs with ranks to develop statistics for comparing the medians of two populations. (Ranks are discussed in the next section and are used in a hypothesis test for two population medians in Chapter 18.)

In the following sections, you see exactly how the sign statistic is used to test the median of a population and analyze data in a matched pairs experiment.

Testing the median

You can use signs to test whether the median of a population is equal to some value m. You do this by conducting a hypothesis test based on signs. You have Ho: Median = m versus Ha: Median ≠ m (or, you can use a > or < sign in Ha also). Your test statistic is the sum of the signs for all the data. If this sum is significantly greater or significantly smaller than what is expected if Ho were true, you reject Ho. Exactly how large or how small the sum of the signs must be to reject Ho is given by the sign test (Chapter 17).

Suppose you’re testing whether the median of a population is equal to 5. That is, you’re testing Ho: Median = 5 versus Ha: Median ≠ 5. You collect the following data: 4, 4, 3, 3, 2, 6, 4, 3, 3, 5, 7, 5. Ordering the data, you get 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 7. Now you find the sign for each value in the data set, determined by whether the value is greater than 5. The sign of the first data value, 2, is 0, because it’s below 5. Each of the 3s receives a sign of 0, as do the three 4s and the two 5s, because none of them is greater than 5. Only the numbers 6 and 7 receive a sign of +1, being the only values in the data set that are greater than 5 (the number of interest for the median).

By summing the signs, you’re in essence counting the number of values in the data set that are greater than the given quantity in Ho. For example, the total of all the signs of the ordered data values is 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 = 2, and you can see that the total number of data values above 5 (the number of interest for the median) is 2. The fact that the total of the signs (2) is much less than half the sample size gives you some evidence that the median is probably not 5 here, because the median represents the middle of the population. If the median were truly 5 in the population, your sample should yield about 6 values below it and 6 values above.
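To see the counting at work, here is a minimal Python sketch that sums the signs for this data set and compares the total to half the sample size. The variable names are illustrative, and this is only the test statistic, not the full sign test from Chapter 17.

```python
data = [4, 4, 3, 3, 2, 6, 4, 3, 3, 5, 7, 5]
hypothesized_median = 5

# Sign is +1 when a value is above the hypothesized median, otherwise 0.
signs = [1 if v > hypothesized_median else 0 for v in data]
sign_sum = sum(signs)

# If Ho were true, about half the values should fall above the median.
expected_if_true = len(data) / 2

print(sign_sum, expected_if_true)  # 2 6.0
```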

Doing a matched-pairs experiment

You can use signs in a matched-pairs experiment (where you use the same subject twice or pair subjects on some important variables). For example, you can use signs to test whether or not a certain treatment resulted in an improvement in patients, compared to a control. In the cases where the sign statistic is used, improvement is measured not by the mean of the differences in the responses for treatment versus control (as in a paired t-test), but by the median of the differences in the responses.

Suppose you’re testing a new antihistamine for allergy patients. You take a sample of 100 patients and have each patient assess the severity of his allergy symptoms before and after taking the medication on a scale from 1 (best) to 10 (worst). (Of course, you do a controlled experiment where some of the patients get a placebo to adjust for the fact that some people may perceive their symptoms going away just because they took something, anything.)

In this study, you’re not interested in what level their symptoms are at, but in how many patients had a lower level of symptoms after taking the medicine. So you take the symptom level before the experiment minus the symptom level after the experiment. If that difference is positive, the medicine appears to have helped, and you give that person a sign of +1 (in other words, count them as a success). If the difference is zero, the medicine had no effect, and you give that person a sign of 0. Remember, though, that the difference could be negative, indicating that the symptoms before were lower than the symptoms after; in other words, the medicine made their symptoms worse. This scenario results in a sign of 0 as well.
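The before-minus-after rule can be sketched as follows. The five pairs of symptom scores are invented purely for illustration (the book gives no patient-level data), and the names are hypothetical.

```python
# Hypothetical symptom scores (1 = best, 10 = worst) for five patients.
before = [7, 5, 8, 6, 4]
after = [4, 5, 9, 3, 2]

# Difference = before minus after; a positive difference means the symptoms improved.
differences = [b - a for b, a in zip(before, after)]

# Sign is +1 for an improvement, 0 for no change or for worse symptoms.
signs = [1 if d > 0 else 0 for d in differences]

print(differences)  # [3, 0, -1, 3, 2]
print(signs)        # [1, 0, 0, 1, 1]
print(sum(signs))   # 3 patients counted as successes
```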

After you’ve found the sign for each value or pair in the data set, you’re ready to analyze it by using the sign test or the signed rank test (see Chapter 17).

Rank

Ranks are a nice way to use important information from a data set without using the actual values of the data themselves. Rank comes into play in nonparametric statistics when you’re not interested in what the values of the data are, but where they stand, compared to some supposed value for the median or to the ranks of values in another data set from another population. (You can see ranks in action in Chapter 18.)

The rank of a value in a data set is the number that represents its place in the ordering, from smallest to largest, within the data set. For example, if your data set is 1, 10, 4, 2, 1,000, you can assign the ranks in the following way: 1 gets the rank of one (because it’s the smallest), 2 gets the rank of two, 4 gets the rank of three (being the third smallest number in the ordered data set), 10 gets the rank of four, and 1,000 gets the rank of five (being the largest).

Now suppose your data set is 1, 2, 20, 20, 1,000. How would the ranks be assigned? You know that 1 would get the rank of one (being the smallest), 2 would get the rank of two, and 1,000 would get the rank of five (being the largest). But what about the two 20s in this data set? Should the first 20 get a rank of three and the second 20 get the rank of four? That order doesn’t seem to make sense, because you can’t distinguish between the two 20s.

When two values in a data set are the same, you take the average of the two ranks the values need to fill and assign each tied value that average rank. If you have a tie between three numbers, you have three ranks, so take the sum of the ranks divided by three.

In this case, because both 20s are vying for the ranks of three and four, assign each of them the rank of 3.5, the average of the two ranks they must share. I show the final ranking for the data set 1, 2, 20, 20, 1,000 in Table 16-1.

Table 16-1   Ranks of the Values in the Data Set 1, 2, 20, 20, 1,000

Data Value    Rank Assigned
1             1
2             2
20            3.5
20            3.5
1,000         5

The lowest a rank can be is one, and the highest a rank can be is n, where n is the number of values in the data set. If you have a negative value in a data set, for example, if your data set is -1, -2, -3, you still assign the ranks one through three to those data values. Never assign negative ranks to negative data. (By the way, when you order the data set -1, -2, -3, you get -3, -2, -1, so -3 gets the rank of one, -2 gets the rank of two, and -1 gets the rank of three.)
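The ranking rules, including average ranks for ties and positive ranks for negative data, can be sketched in Python. The helper name `average_ranks` is an assumption made for this example.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest; tied values share the average rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # lowest rank the tied group occupies
        count = ordered.count(v)       # how many values are tied at v
        ranks.append(first + (count - 1) / 2)  # average of the ranks the group fills
    return ranks

print(average_ranks([1, 2, 20, 20, 1000]))  # [1.0, 2.0, 3.5, 3.5, 5.0]
print(average_ranks([-1, -2, -3]))          # [3.0, 2.0, 1.0]
```

The ranks come back in the same order as the input, so the first result matches Table 16-1 row by row.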

Signed rank

A signed rank combines the idea of the sign and the rank of a value in a data set, with a small twist. The sign indicates whether that number is greater than, less than, or equal to a specified value. The rank indicates where that number falls in the ordering of the data set from smallest to largest.

To calculate the signed rank for each value in the data set, follow these steps:

1. Assign a sign of +1 or 0 to each value in the data set, according to whether it’s greater than some value specified in the problem.

If it’s greater than the specified value, give it a sign of +1; if it’s less than or equal to the specified value, give it a sign of 0.

2. Rank the original data from smallest to largest, according to their absolute values.

Statisticians call these values the absolute ranks.

3. Multiply the sign times the absolute rank to get the signed rank for each value in the data set.

The absolute value of any number is the positive version of that number. The notation for absolute value is | |, where the number goes between those lines. For example, |-2| = 2 and |+2| = 2. Remember that |0| = 0.

One scenario in which you can use signed ranks is an experiment where a response variable is compared for a treatment group versus a control group. You can test for difference due to a treatment by collecting the data in pairs, either both from the same person (pretest versus post-test) or from two individuals that are matched up to be as similar as possible.

For example, suppose you compare four patients regarding their weight loss on a diet program. You’re really wondering whether the overall change in weight is less than zero for the population. Two factors are important:

Whether or not the person lost weight

How the person’s weight change measures up, compared to everyone else in the data set

You measure the person’s weight before the program (the pretest) as well as his weight after the program (the post-test). The change is the important facet of the data you’re interested in, so you apply the signs to the changes in weight. You give the change a sign of +1 if the person lost weight (constituting a success for the program) and a sign of 0 if the person stayed the same or gained weight (thus not contributing to the success of the program). You convert all the changes in weight loss to their absolute values, and then you rank the absolute values (in other words, you’ve found the absolute ranks of the changes in weight). The signed rank is the product of the sign and the absolute rank. After determining the signed rank, you can really compare the effectiveness of the program. Large signed ranks indicate a big weight loss; small signed ranks don’t.

For example, weight changes of -20, -10, +1, and +5 have signs of +1, +1, 0, and 0. The absolute values of the weight changes are 20, 10, 1, and 5. Their absolute ranks, respectively, are 4, 3, 1, and 2. The signed ranks are 4 * 1 = 4, 3 * 1 = 3, 1 * 0 = 0, and 2 * 0 = 0.
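Here is a small Python sketch of the three steps applied to the weight-change example. The function name is hypothetical, and the ranking shortcut assumes no tied absolute values, which holds for this data.

```python
def signed_ranks(changes):
    """Signed ranks for weight changes, where a negative change (weight loss) is a success."""
    # Step 1: sign is +1 for weight loss, 0 for no change or weight gain.
    signs = [1 if c < 0 else 0 for c in changes]
    # Step 2: rank the absolute values from smallest to largest (assumes no ties).
    abs_values = [abs(c) for c in changes]
    ordered = sorted(abs_values)
    abs_ranks = [ordered.index(a) + 1 for a in abs_values]
    # Step 3: multiply each sign by its absolute rank.
    return [s * r for s, r in zip(signs, abs_ranks)]

print(signed_ranks([-20, -10, 1, 5]))  # [4, 3, 0, 0]
```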

Rank sum

A rank sum is just what it sounds like: the sum of all the ranks. You typically use rank sums in situations when you’re comparing two or more populations to see whether one has a central location that’s higher than the other. (In other words, if you looked at the populations in terms of their histograms, one would be shifted to the right of the other on the number line.)

Here’s a way in which researchers use rank sums: Suppose you’re looking at quiz scores for two classes, and they don’t have a normal distribution, so you want to use nonparametric techniques to compare them. The total possible points on this quiz is 30. You collect random samples of five quiz scores from each of the classes. Suppose the sample data from class number one is 22, 23, 20, 25, 26, and the sample data from class number two is 23, 30, 27, 28, 25. The twist here is to combine all the data into one big data set, rank all the values, and sum the ranks for the first sample and then the second sample. Then compare the two rank sums. If one rank sum is higher, this outcome may indicate that a particular class did better on the quiz.

In the quiz example, the ordered data for the combined classes is 20, 22, 23, 23, 25, 25, 26, 27, 28, 30. Their ranks, respectively, are 1, 2, 3.5, 3.5, 5.5, 5.5, 7, 8, 9, and 10. The ranks from the first class are 1 (associated with the score 20); 2 (22); 3.5 (23); 5.5 (25); and 7 (26). The rank sum for the first class is 1 + 2 + 3.5 + 5.5 + 7 = 19, which is quite a bit lower than the rank sum for the second class (3.5 + 5.5 + 8 + 9 + 10 = 36). This result tells you that the second class did better on the quiz than the first class, for this sample.
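The combine, rank, and sum procedure for the quiz scores might look like this in Python. The helper `average_ranks` is an illustrative name; it gives tied scores the average of the ranks they share.

```python
def average_ranks(values):
    """Rank from smallest to largest, giving tied values the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

class_one = [22, 23, 20, 25, 26]
class_two = [23, 30, 27, 28, 25]

# Rank all ten scores together, then split the ranks back out by class.
combined_ranks = average_ranks(class_one + class_two)
rank_sum_one = sum(combined_ranks[:len(class_one)])
rank_sum_two = sum(combined_ranks[len(class_one):])

print(rank_sum_one, rank_sum_two)  # 19.0 36.0
```

As a quick check, the two rank sums add up to 55, which is 1 + 2 + ... + 10.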

In Chapter 18, you can see how to use a rank sum test to see whether the shapes of two population distributions are the same, meaning the values they take on and how often those values occur in each population. In Chapter 19, you can find even more on rank sums and also discover how to conduct Kruskal-Wallis tests.

Note that taking the mean of each data set and comparing them by using a two-sample t-test would be wrong in the quiz example because the quiz scores admittedly don’t have a normal distribution. Indeed, if the quiz were easy, you’d get many high scores and few low ones, and the population would be skewed left. On the other hand, if the quiz were hard, you’d get many low scores and few high ones, and the population would be skewed right (don’t think too much about that scenario). In either case, you need a nonparametric procedure. See Chapter 18 for more on the nonparametric equivalent of the t-test.

In This Chapter

^ Relating the formulas and procedures for one-way ANOVA and regression

^ Making the connection between these two seemingly unrelated procedures

You’re motoring on in your intermediate stat course, working your way through regression (where you estimate y using one or more x variables; see Chapter 4). Then you hit a new topic, ANOVA, which stands for analysis of variance and compares the means of several populations (see Chapter 9). That seems to be no problem. But wait a minute; now your professor starts talking about how ANOVA is related to regression, and suddenly everything starts to spin out of control. How do you reconcile two techniques that appear to be as different as apples and oranges? That’s what this chapter is all about.

Think of this chapter as your bridge across the gap that lies between regression and ANOVA, allowing you to walk smoothly across, answering any questions that a professor may throw into your path. You don’t apply these two techniques in this chapter (you can find that information in Chapters 4 and 9). The goal of this chapter is to determine and describe the relationship between regression and ANOVA so they don’t look quite so much like an apple and an orange.

Seeing Regression through the Eyes of Variation

Every statistical model tries to explain why the different outcomes (y) are what they are. It tries to figure out what factors or explanatory variables (x) can help explain that variability in those y’s. In this section, you start with the y-values by themselves and see how their variability plays a central role in the regression model. This is the first step toward applying ANOVA (the analysis of variance) to the regression model.

Verifying Variability in the y’s and looking at x to explain it

No matter what y variable you’re interested in predicting, you will always have variability in those y-values. If you want to predict the length of a fish, you may notice that fish have many different lengths (indicating a great deal of variability). Even if you put all the fish of the same age and species together, you still have some variability in their lengths (it will be less than before, but still there nonetheless). The first step to understanding the basic ideas of regression and ANOVA is to understand that variability in the y’s is to be expected, and your job is to try to figure out what can explain most of it. This section deals with seeing and explaining variability in the y-values.

Seeing the variability in Internet use

Both regression and ANOVA work to get a handle on explaining the variability in the y variable using an x variable. After you collect your data, you can find the standard deviation in the y variable to get a sense of how much the data varies within the sample. From there, you collect data on an x variable and see how much it contributes to explaining that variability.

Suppose you notice that people spend different amounts of time on the Internet, and you want to explore why that may be. You start by taking a small sample of 20 people and record how many hours per month they spend on the Internet. The results (in hours) are 20, 20, 22, 39, 40, 19, 20, 32, 33, 29, 24, 26, 30, 46, 37, 26, 45, 15, 24, and 31. The first thing you notice about this data is the large amount of variability in it. The standard deviation (roughly, the typical distance from the data values to their mean) of this data set is 8.93, which is quite large given the size of the numbers in the data set.
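If you want to check that figure outside Minitab, a short Python sketch using the standard library’s `statistics` module reproduces it. Python is my choice of tool here, not the book’s.

```python
import statistics

hours = [20, 20, 22, 39, 40, 19, 20, 32, 33, 29,
         24, 26, 30, 46, 37, 26, 45, 15, 24, 31]

mean_hours = statistics.mean(hours)
sd_hours = statistics.stdev(hours)  # sample standard deviation (divides by n - 1)

print(round(mean_hours, 1), round(sd_hours, 2))  # 28.9 8.93
```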

Finding an "x-planation" for Internet use

So you figure out that the y-values (such as amount of time someone uses the Internet from the preceding section) have a great deal of variability in them. What can help explain this? Part of the variability is due to chance. But you suspect some variable is out there (call it x) that has some connection to the y variable, and that variable can help you make more sense out of this seemingly wide range of y-values.

For example, if you record the calories for five types of candy bars as 100, 200, 300, 400, and 500, you would say "Wow, that’s a lot of variation in calories; I wonder why that is?" Then you notice that the weights of the candy bars are 1, 2, 3, 4, and 5 ounces, respectively. This relationship can be expressed as y = 100x, where y equals calories and x equals weight.

Now you can look at what before was a bunch of variability in the y-values and say, "Hey, that’s not just random variability; the differing y-values can be explained by the weight of candy bar (x)." You can now use x in a nice regression model to estimate y. Notice that you’re talking about splitting the total variability in the y’s into the part due to x and the part due to chance (error). That’s ANOVA language! Hey, perhaps regression and ANOVA are related after all...

To continue with the Internet use example, suppose you have a brainstorm that number of years of education could possibly be related to Internet use. In this case, the explanatory variable (input variable, x) is years of education, and you want to use it to try to estimate y, the number of hours on the Internet in a month. You take a larger random sample of 250 Internet users and ask them how many years of education they had (so n = 250). You can check out the first ten observations from your data set containing the (x, y) pairs in Table 12-1. If a significant connection of some sort exists between the x-values and the y-values, then you can say that x is helping to explain some of the variability in the y’s. If it explains enough variability, you can place x into a simple regression model and use it to estimate y.

Table 12-1   First Ten Observations from the Education and Internet Use Example

Years of Education    Hours on Internet (for One Month)
15                    41
15                    32
11                    33
10                    42
10                    28
10                    21
10                    17
10                    14
9                     18
9                     14

Getting results with regression

After you have a possible x variable picked, you collect pairs of data (x, y) on a random sample of individuals from the population, and you look for a possible linear relationship between them. To do this, use Minitab to make a scatterplot of the data and calculate the correlation (r). If the data appear to follow a straight line (as shown on the scatterplot), you go ahead and perform a simple linear regression of the response variable y based on the x variable. The p-value of the x variable in the simple linear regression analysis tells you whether or not the x variable does a significant job in predicting y. Some of the details of getting the regression results are described below (for full information, see Chapter 4).

Looking at the small snippet of 10 out of the 250 person data set in Table 12-1, you can begin to see that you may have a pattern between education and Internet use. It looks like as education increases so does Internet use.
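As a rough check of that pattern, the sketch below computes the correlation and least-squares line by hand in Python for only the ten rows shown in Table 12-1. Because it uses ten observations rather than all 250, its numbers won’t match the Minitab output for the full data set; the variable names are illustrative.

```python
x = [15, 15, 11, 10, 10, 10, 10, 10, 9, 9]    # years of education
y = [41, 32, 33, 42, 28, 21, 17, 14, 18, 14]  # hours on Internet per month

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross-products and squared deviations around the means.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

slope = sxy / sxx                  # least-squares slope
intercept = y_bar - slope * x_bar  # least-squares y-intercept
r = sxy / (sxx * syy) ** 0.5       # Pearson correlation

print(round(slope, 2), round(intercept, 2), round(r, 2))
```

A positive slope and a positive r on these ten rows agree with the impression that Internet use tends to rise with education.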

To do a simple linear regression using Minitab, enter your data in two columns: the first column for your x variable and the second column for your y variable (as in Table 12-1). Go to Stat>Regression>Regression. Click on your y variable in the left-hand box; the y variable then appears in the Response box on the right-hand side. Click on your x variable in the left-hand box; the x variable then appears in the Predictor box on the right-hand side. Click OK, and your regression analysis is done. As part of every regression analysis, Minitab also provides you with the corresponding ANOVA results, found at the bottom of the output.

The simple linear regression output that Minitab gives you for the education and Internet example is in Figure 12-1. (Notice the ANOVA output at the bottom; you can see the connection in the upcoming section "Regression and ANOVA: A Meeting of the Models.")

Figure 12-1: Output for simple linear regression applied to education and Internet use data.

Regression Analysis: Internet versus Education

The regression equation is
Internet = -8.29 + 3.15 Education

Predictor      Coef    SE Coef       T        P
Constant     -8.290      2.665   -3.11    0.002
Education    3.1460     0.2387   13.18    0.000

S = 7.23134    R-Sq = 41.2%    R-Sq(adj) = 41.0%

Analysis of Variance

Source             DF         SS        MS         F        P
Regression          1     9085.6    9085.6    173.75    0.000
Residual Error    248    12968.5      52.3
Total             249    22054.0

Looking at Figure 12-1, you see that the p-value on the row marked Education is 0.000, which means the p-value is less than 0.001. Therefore the relationship between years of education and Internet use is statistically significant. A scatterplot of the data (not shown here) also indicates that the data appear to have a positive linear relationship. That means as you increase number of years of education, Internet use also tends to increase (on average).

Assessing the fit of the regression model

Before you go ahead and use a regression model to make predictions for y based on an x variable, you must first assess the fit of your model. One way to get a rough idea of how well your regression model fits is by using a scatterplot (a graph showing all the pairs of data plotted in the x-y plane). Use the scatterplot to see whether the data appears to fall in the pattern of a line. If the data appears to follow a straight-line pattern (or even something close to that; anything but a curve or a scattering of points that has no pattern at all), you calculate the correlation, r, to see how strong the linear relationship between x and y is (the closer r is to +1 or -1, the stronger the relationship; the closer r is to zero, the weaker the relationship). Minitab can do scatterplots and correlations for you; see Chapter 4 for more on simple linear regression, including making a scatterplot and finding the value of r.

If the data doesn’t have a significant correlation, stop the analysis; you can’t go further to find a line that fits a relationship that doesn’t exist.

Next you come to the more general way of assessing not only the fit of a simple linear regression model, but many other models too (for example: multiple, nonlinear, and logistic regression models in Chapters 5, 7, and 8, to name a few). In simple linear regression, the value of R², as indicated by Minitab and statisticians as a capital R (squared), is equal to the square of the Pearson correlation coefficient, r (indicated by Minitab and statisticians by a small r). In all other situations, R² provides a more general measure of model fit. (Note that r only measures the fit of a straight-line relationship between one x variable and one y variable; see Chapter 4.) Finally, R² adjusted modifies R² to account for the number of variables in the model. R² is what statisticians use to assess model fit (see Chapter 5 for more).

The value of R² adjusted for the model of using education to estimate Internet use (Figure 12-1) is equal to 41 percent. This value reflects the percentage of variability in Internet use that can be explained by a person’s years of education. This number isn’t great, but it’s not terrible either. Note that the square root of 41 percent is 0.64 for r itself, which in the case of linear regression indicates a moderate relationship.

This evidence gives you the green light to use the results of the regression analysis to estimate number of hours of Internet use in a month by using years of education. The regression equation as it appears in the top part of the Figure 12-1 output is Internet = -8.29 + 3.15 * Education. So if you have 16 years of education, for example, your estimated Internet use is -8.29 + 3.15 * 16 = 42.11, or about 42 hours per month (about 10.5 hours per week).
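The prediction step is plain arithmetic, sketched here in Python with the rounded coefficients from Figure 12-1. The function name is made up for this example.

```python
def predict_internet_hours(years_of_education):
    """Estimate monthly Internet hours from the fitted line in Figure 12-1."""
    intercept = -8.29  # rounded constant from the regression equation
    slope = 3.15       # rounded coefficient on Education
    return intercept + slope * years_of_education

estimate = predict_internet_hours(16)
print(round(estimate, 2))      # 42.11 hours per month
print(round(estimate / 4, 1))  # 10.5 hours per week, treating a month as 4 weeks
```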

But wait! Look again at Figure 12-1 and zoom in on the bottom part. I didn’t ask for anything special to get this info on the Minitab output, but you can see an ANOVA table there. That seems like a fish out of water, doesn’t it? But in the next section you see how an ANOVA table can describe regression results (albeit in a different way).

Regression and ANOVA: A Meeting of the Models

Okay, here it comes. You’ve already broken down the regression output into all its pieces and parts. The next step toward understanding the connection between regression and ANOVA is to apply the sums of squares from ANOVA to regression (something that is typically not done in a regression analysis). Before you start, think of this process as going to a 3-D movie, where you have to wear special glasses in order to see all the special effects!

In this section, you see the sums of squares in ANOVA applied to regression and how the degrees of freedom work out. You build an ANOVA table for regression and discover how the t-test for a regression coefficient is related to the F-test in ANOVA. I know you can hardly wait, so I won’t keep you in suspense any longer.

Comparing sums of squares

Sums of squares is a term you may remember from ANOVA (see Chapter 9), but it certainly isn’t a term you normally use when talking about regression (as in Chapter 4). Yet, both types of models can be broken down into sums of squares, and that similarity gets at the true connection between ANOVA and regression. In step-by-step terms, you first partition out the variability in the y variable by using formulas for sums of squares from ANOVA (sums of squares for total, treatment, and error). Then you find those same sums of squares for regression. This is the twist on the process, because you typically don’t find sums of squares for regression. You compare the two procedures through their sums of squares. This section shows you the details of how this comparison is done.

Partitioning Variability by using SSTO, SSE, and SST for ANOVA

ANOVA is all about partitioning the total variability in the y-values into sums of squares (see all the info you ever need on one-way ANOVA in Chapter 9). The key idea is that SSTO = SST + SSE, where SSTO is the total variability in the y-values; SST measures the variability explained by the model (also known as the treatment, or x variable in this case); and SSE measures the variability due to error (what’s left over after the model is fit).

The formulas for SSTO, SST, and SSE are $\sum (y_i - \bar{y})^2$, $\sum (\hat{y}_i - \bar{y})^2$, and $\sum (y_i - \hat{y}_i)^2$, respectively, where $\bar{y}$ is the mean of the y’s, $y_i$ is each observed value of y, and $\hat{y}_i$ is each predicted value of y from the ANOVA model. Use these formulas to calculate the sums of squares for ANOVA (Minitab does this for you when it performs ANOVA). Keep these values of SSTO, SST, and SSE. You will use them to compare to the results from regression.

Finding sums of squares for regression

In regression, you measure the deviations in the y-values by taking each $y_i$ minus the mean, $\bar{y}$. Square each result and add them all up, and you have SSTO. Next, take the residuals, which represent the difference between each $y_i$ and its estimated value from the model, $\hat{y}_i$. Square the residuals and add them up, and you get the formula for SSE.

Now that you have calculated SSTO and SSE, you need the bridge between them. That is, you need a formula that connects the variability in the y’s (SSTO) and the variability in the residuals after fitting the regression line (SSE). That bridge is SSR (equivalent to SST in ANOVA). In regression, $\hat{y}_i$ represents the predicted value of $y_i$ based on the regression model. These are the values on the regression line. To assess how much this regression line helps to predict the y-values, you compare it to the model you would get without any x variable in it.

Without any other information, the only thing you can do to predict y is look at the average, $\bar{y}$. So, SST compares the predicted value from the regression line to the predicted value from the flat line (the mean of the y’s) by subtracting them. The result is $(\hat{y}_i - \bar{y})$. Square each result and sum them all up, and you get the formula for SST.

Now for one last hoop to jump through (as if you haven’t had enough already). Instead of calling the sum of squares for the regression model SST as is done in ANOVA, statisticians call it SSR, for sum of squares regression. Consider SSR from regression to be equivalent to the SST from ANOVA. This matters because computer output lists the sum of squares for the regression model as SSR, not SST.

To summarize the sums of squares as they apply to regression, you have SSTO = SSR + SSE where

SSTO measures the variability in the observed y-values around their mean. Divided by n - 1, this value gives the variance of the y-values.

SSE represents the variability between the predicted values for y (the values on the line) and the observed y-values. SSE represents the variability left over after the line has been fit to the data.

SSR measures the variability in the predicted values for y (the values on the line) from the mean of y. SSR is the sum of squares due to the regression model (the line) itself.

Minitab calculates all the sums of squares for you as part of the regression analysis. You can see this calculation in the section "Bringing regression to the ANOVA table."
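To see SSTO = SSR + SSE hold numerically, here is a minimal Python sketch that fits a least-squares line to only the ten rows shown in Table 12-1 and computes the three sums of squares from their definitions. The numbers differ from Figure 12-1, which uses all 250 observations, and the variable names are illustrative.

```python
x = [15, 15, 11, 10, 10, 10, 10, 10, 9, 9]    # years of education
y = [41, 32, 33, 42, 28, 21, 17, 14, 18, 14]  # hours on Internet per month

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]  # predicted values on the line

ssto = sum((yi - y_bar) ** 2 for yi in y)              # total variability
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained by the line
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # left over (residuals)

print(round(ssto, 2), round(ssr + sse, 2))  # both 1028.0
```

The identity holds only for the least-squares line with an intercept; for an arbitrary line, the three pieces would not add up.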

Dividing up the degrees of freedom

In ANOVA, you test a model for the treatment (population) means by using an F-test, which is $F = \frac{\text{MST}}{\text{MSE}}$. To get MST (the mean sum of squares for treatment), you take SST (the sum of squares for treatment) and divide by its degrees of freedom. You get MSE the same way (that is, take SSE, the sum of squares for error, and divide by its degrees of freedom). The question now is, what do those degrees of freedom represent and how do they relate to regression? This section addresses that issue.

Degrees of freedom in ANOVA

In ANOVA, the degrees of freedom for SSTO is n - 1, which represents the sample size minus one. In the formula for SSTO, $\sum (y_i - \bar{y})^2$, you see there are n observed y-values minus one mean. That in a very general way is where the n - 1 comes from.

Note that if you divide SSTO by n - 1, you get $\frac{\sum (y_i - \bar{y})^2}{n - 1}$, the variance in the y-values. This calculation makes good sense because the variance also measures the total variability in the y-values.
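A quick Python check of this identity, using the 20 Internet-use values from earlier in the chapter: SSTO divided by n - 1 should equal the sample variance that the standard library’s `statistics.variance` reports.

```python
import statistics

hours = [20, 20, 22, 39, 40, 19, 20, 32, 33, 29,
         24, 26, 30, 46, 37, 26, 45, 15, 24, 31]

n = len(hours)
y_bar = sum(hours) / n
ssto = sum((y - y_bar) ** 2 for y in hours)  # total sum of squares

variance_from_ssto = ssto / (n - 1)
print(round(ssto, 1), round(variance_from_ssto, 2))  # 1515.8 79.78
print(round(statistics.variance(hours), 2))          # 79.78
```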

The degrees of freedom for SSE is n - k. In the formula for SSE, $\sum (y_i - \hat{y}_i)^2$, you see there are n observed y-values, and k is the number of treatments in the model. In regression, the number of coefficients in the model is k = 2 (the slope and the y-intercept). So you have degrees of freedom n - 2 associated with SSE when you’re doing regression.

Degrees of freedom in regression

The degrees of freedom for SST in ANOVA equals the number of treatments minus one. How does the degrees of freedom idea relate to regression? The number of treatments in regression is equivalent to the number of parameters in a model (a parameter being an unknown constant in the model that you’re trying to estimate).

When you test a model, you’re always comparing it to a different (simpler) model to see whether it fits the data better. In linear regression you compare your regression line, $y = b_0 + b_1 x$, to the horizontal line $y = \bar{y}$. This second, simpler model just uses the mean of y to predict y all the time, no matter what x is. In the regression line, you have two coefficients: one to estimate the parameter for the y-intercept ($b_0$) and one to estimate the parameter for slope ($b_1$) in the model. In the second, simpler model, you have only one parameter: the value of the mean. The degrees of freedom for SSR in simple linear regression is the difference in the number of parameters of the two models: 2 - 1 = 1.

Putting all this together, the degrees of freedom for regression must add up for the equation SSTO = SSR + SSE. The degrees of freedom corresponding to this equation are (n - 1) = (2 – 1) + (n - 2), which is true if you do the math. So the degrees of freedom for regression, using the ANOVA approach, all check out. Whew!

In Figure 12-1, you can see the degrees of freedom for each sum of squares listed under the DF column of the ANOVA part of the output. You see SSR has 2 - 1 = 1 degree of freedom, and SSE has 250 - 2 = 248 degrees of freedom (because n = 250 observations were in the data set and k = 2, and you find n - k to get degrees of freedom for SSE). The degrees of freedom for SSTO is 250 - 1 = 249.
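The bookkeeping can be written out as a tiny Python sketch, where n is the number of observations and k is the number of coefficients. The function name is hypothetical.

```python
def regression_degrees_of_freedom(n, k=2):
    """Degrees of freedom for SSR, SSE, and SSTO with n observations and k coefficients."""
    df_regression = k - 1
    df_error = n - k
    df_total = n - 1
    # The pieces must add up, just as SSTO = SSR + SSE does.
    assert df_total == df_regression + df_error
    return df_regression, df_error, df_total

print(regression_degrees_of_freedom(250))  # (1, 248, 249)
```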

Bringing regression to the ANOVA table

In ANOVA, you test your model Ho: All k population means are equal versus Ha: At least two population means are different by using an F-test. You build your F-test statistic by relating the sums of squares for treatment to the sum of squares for error. To do this, you divide SSE and SST by their degrees of freedom (n - k and k - 1, respectively, where n is the sample size and k is the number of treatments) to get the mean sums of squares for error (MSE) and mean sums of squares for treatment (MST). In general, you want MST to be large compared to MSE, which would indicate that the model fits well. The results of all these statistical gymnastics are summarized by Minitab in a table called (cleverly) the ANOVA table.

The ANOVA table shown in the bottom part of Figure 12-1 for the Internet-use data represents the ANOVA table you get from using the regression line as your model. Under the Source column, you may be used to seeing treatment, error, and total. For regression, the treatment is the regression line, so you see Regression instead of treatment. The error term in ANOVA is labeled Residual Error, because in regression, you measure error in terms of residuals. Finally, you see Total, which is the same the world around.

The SS column represents the sums of squares for the regression model. The three sums of squares listed in the SS column are SSR (for regression), SSE (for residuals), and SSTO (total). These sums of squares are calculated using the formulas from the previous section; the degrees of freedom, DF in the table, are also found by using the formulas from the previous section.

The MS column takes the value of SS "whatever" (you fill in the blank) and divides it by the respective degrees of freedom, just like ANOVA. For example, in Figure 12-1, SSE is 12,968.5, and the degrees of freedom is 248. Divide the first value by the second to get 52.29, or 52.3, which is listed in the ANOVA table for MSE.
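The pieces of the regression ANOVA table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the x and y values below are hypothetical toy data (not the Internet-use data from Figure 12-1), and the function name `regression_anova` is made up for this example.

```python
# Hypothetical toy data, used only to illustrate the bookkeeping.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]


def regression_anova(x, y):
    """Return the sums of squares, degrees of freedom, and mean squares
    for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope of the least-squares line
    b0 = y_bar - b1 * x_bar        # y-intercept
    fitted = [b0 + b1 * xi for xi in x]
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual error
    ssto = sum((yi - y_bar) ** 2 for yi in y)                # total
    df_regression, df_error, df_total = 1, n - 2, n - 1
    return {
        "SSR": ssr, "SSE": sse, "SSTO": ssto,
        "DF": (df_regression, df_error, df_total),
        "MSR": ssr / df_regression,
        "MSE": sse / df_error,
    }


table = regression_anova(x, y)

# The MSE arithmetic quoted for Figure 12-1: SSE divided by its DF.
mse_internet = 12968.5 / 248
print(round(mse_internet, 2))  # 52.29
```

The returned dictionary mirrors the SS, DF, and MS columns: each mean square is just its sum of squares divided by its degrees of freedom, and both the sums of squares and the degrees of freedom add up to their totals.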

The value of the F-statistic, using the ANOVA method, is F = MSR / MSE = 173.7 in the Internet example, which you can see in column five of the ANOVA part of Figure 12-1 (subject to rounding). The F-statistic’s p-value is calculated based on an F-distribution with 2 – 1 = 1 and 250 – 2 = 248 degrees of freedom, respectively. (In the Internet example, the p-value listed in the last column of the ANOVA table is 0.000, meaning the regression model fits.) But remember, in regression you don’t use an F-statistic and an F-test. You use a t-statistic and a t-test. What gives? The next section explains.

Relating the F- and t-statistics: The final frontier

In regression, one way of testing whether the best-fitting line is statistically significant is to test Ho: slope = 0 versus Ha: slope ≠ 0. To do this, you use a t-test (see Chapter 3). The slope is the heart and soul of the regression line, because it describes the main part of the relationship between x and y. If the slope of the line equals zero (you can’t reject Ho), you’re just left with y = b0, a horizontal line, and your model y = b0 + b1x isn’t doing anything for you.

In ANOVA, you test to see whether the model fits by testing Ho: The means of the populations are all equal, versus Ha: At least two of the population means aren’t equal. To do this you use an F-test (taking MST and dividing it by MSE; see Chapter 10).

The sets of hypotheses in regression and ANOVA seem totally different, but in essence, they’re both doing the same general thing: testing whether a certain model fits. In the regression case, the model you want to see fit is the straight line, and in the ANOVA case, the model of interest is a set of (normally distributed) populations with at least two different means (and the same variance). Here each population is labeled as a treatment by ANOVA.

But more than that, you can think of it this way: Suppose you took all the populations from the ANOVA and lined them up side by side on an x-y plane (see Figure 12-2). If the means of those distributions are all connected by a flat line (representing the mean of the y’s), then you would have no evidence against Ho in the F-test, so you can’t reject it; your model isn’t doing anything for you (it doesn’t fit). This idea is similar to the idea of fitting a flat horizontal line through the y-values in regression; a straight-line model with a nonzero slope doesn’t work in that case.

The big thing is that statisticians can prove (so you don’t have to) that an F-statistic is equivalent to the square of a t-statistic, and the F-distribution is equivalent to the square of a t-distribution when the SSR has df = 2 – 1 = 1. And when you have a simple linear regression model, the degrees of freedom is exactly one! (Note that F is always greater than or equal to zero, which is needed if you’re making it the square of something.) So there you have it! The t-statistic for testing the regression model is equivalent to an F-statistic for ANOVA when the ANOVA table is formed for the simple regression model.

Figure 12-2:

Connecting means of populations to the slope of a line.

Indeed (the stat professor’s way of saying "and this is the Really Cool part. . ."), if you look at the value of the t-statistic for testing the slope of the education variable in Figure 12-1, you see that it’s 13.18 (look at the row marked Education and the column marked T). Square that value, and you get 173.71. The F-statistic in the ANOVA table of Figure 12-1 is equal to 173.75. The F-statistic from ANOVA and the square of the t-statistic from regression are equal to each other in Figure 12-1, subject to a little round-off error done by Minitab on the output. (Just like magic! I still get chills just thinking about it.)
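The F = t² relationship is easy to check numerically. The sketch below is a minimal illustration, assuming hypothetical toy data (the x and y lists and the function name `slope_t_and_f` are made up here; they are not the Internet-use data). It computes the t-statistic for the slope and the ANOVA F-statistic separately and compares them.

```python
import math

# Hypothetical toy data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.8, 3.1, 3.9, 5.2, 5.8, 7.1, 7.7]


def slope_t_and_f(x, y):
    """Return (t, F) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    msr = ssr / 1          # regression has 1 degree of freedom
    mse = sse / (n - 2)    # error has n - 2 degrees of freedom
    t = b1 / math.sqrt(mse / sxx)   # slope divided by its standard error
    f = msr / mse
    return t, f


t_stat, f_stat = slope_t_and_f(x, y)
print(math.isclose(t_stat ** 2, f_stat, rel_tol=1e-9))  # True

# The same check with the rounded values quoted from Figure 12-1.
print(round(13.18 ** 2, 2))  # 173.71
```

With unrounded values the two numbers agree to floating-point precision; the small gap between 173.71 and 173.75 in the Minitab output comes from squaring a t-value that was already rounded to two decimals.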

15 May

Figure 18-1b shows that the median for Suzy (131 days on the market) is less than the median for Tommy (175 days). It may appear Suzy sells homes faster than Tommy. However, the results aren’t exactly clear-cut. A portion of the two boxplots (Figure 18-1a) overlap with each other. You may not be able to declare Suzy the clear winner as being the fastest real estate agent. You need a hypothesis test to make that final determination.


Testing the hypotheses

The null hypothesis for the real estate agent test (from previous sections) is Ho: η1 = η2, where η1 = median days on the market for the population of all Suzy’s homes sold in the last year, and η2 = median days on the market for the population of all Tommy’s homes sold in the last year. The alternative hypothesis is Ha: η1 ≠ η2.

After you looked at the data, you developed a hunch that if one of the agents sold homes faster, it was Suzy. However, before you saw the data, you had no preconceived notion as to who was faster. You must base your Ho and Ha on what your thoughts were before you looked at the data, not after. Setting up your hypotheses after you collect the data is unfair and unethical.

After you determine your Ho and Ha, the time has come to test your data. So, keep reading to figure out what this test looks like in a real-life example.

Combining and ranking

The first step in the data analysis is to combine all the data together and rank the days on the market from lowest (rank = 1) to highest. You can see the overall ranks for the combined data in Table 18-2.

In the case of ties, you give both of the values the average of the ranks they normally would have received. You can see in Table 18-2 that two values of 145 are in the data set. Because they represent the sixth and seventh numbers in the ordered data set, you give each of them the same rank of (6 + 7)/2 = 6.5.

Table 18-2 Ranks of Combined Data from the Real Estate Example

Suzy Sellfast    Overall Rank    Tommy Nowait    Overall Rank
48               1               109             4
97               2               145             6.5
103              3               160             9
117              5               165             10
145              6.5             185             11
151              8               250             13
220              12              251             14
300              15              350             16

Finding the test statistic

After you’ve ranked your data, you can determine which group is group one, so you can find your test statistic, T. Because the sample sizes are equal, let group one be Suzy, because her data is given first. Now sum the ranks from Suzy’s data set. The sum of Suzy’s ranks is 1 + 2 + 3 + 5 + 6.5 + 8 + 12 + 15 = 52.5; this value of T is your rank sum test statistic.
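The combining, ranking, and summing steps can be sketched in Python using the days-on-the-market values from Table 18-2. The helper name `average_ranks` is an assumption made for this illustration; tied values share the average of the ranks they would otherwise occupy, as described above.

```python
# Days on the market, taken from Table 18-2.
suzy = [48, 97, 103, 117, 145, 151, 220, 300]
tommy = [109, 145, 160, 165, 185, 250, 251, 350]


def average_ranks(values):
    """Map each distinct value to its rank (1 = smallest),
    giving tied values the average of the ranks they span."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # The tied block occupies 1-based positions i + 1 through j.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks


ranks = average_ranks(suzy + tommy)

# Rank sum test statistic: the sum of the ranks in group one (Suzy).
T = sum(ranks[days] for days in suzy)
print(ranks[145])  # 6.5
print(T)           # 52.5
```

As a sanity check, the two rank sums together must equal 1 + 2 + ... + 16 = 136, so Tommy’s ranks sum to 83.5.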

Determining whether you can reject Ho

Suppose you want to use α = 0.05 for this test; using this cutoff means that you use Table A-4 (see Appendix), because you have a two-sided test at level α = 0.05. Looking at Table A-4, you go to the column for n1 = 8 and the row for n2 = 8. You see TL = 49 and TU = 87. You reject Ho if T is outside this range; in other words, reject Ho if T < TL = 49 or if T > TU = 87. Your statistic T = 52.5 doesn’t fall outside this range; you don’t have enough evidence to reject Ho at the α = 0.05 level. So you can’t say that you see a difference in the median days on the market for Suzy and Tommy.

These results may seem very strange given the fact that the medians for the two data sets were so different: 131 days on the market for Suzy compared to 175 days on the market for Tommy. However you have two strikes against you in terms of being able to find a real difference here:

The sample sizes are quite small (only eight in each group). A small sample size makes it very hard to get enough evidence to reject Ho.

The standard deviations are both in the high 70s, which is quite large compared to the medians.

Both of these problems make it hard for the test to actually find anything through all the variability the data shows.

To conduct the rank sum test by using Minitab, click on Stat>Nonparametric>Mann-Whitney. Select your two samples and choose your alternative Ha as >, <, or ≠. The Confidence Level is equal to one minus your value of α. After you make all of these settings, click on OK.

Figure 18-2 shows the Minitab output when you conduct the rank sum test on the real estate data. To interpret the results in Figure 18-2, you must note that the Mann-Whitney test is just another name for the rank sum test. Also, Minitab writes ETA rather than η for the medians. The results at the bottom of the output say that the test for equal (versus nonequal) medians is significant at the level 0.1149, when adjusting for ties. This is your p-value adjusted for ties. (Note that if no ties are present in your data, you use the results just above that line. That gives you the p-value not adjusted for ties.)

Figure 18-2:

Using the rank sum test to figure out who sells homes faster.

Mann-Whitney Test and CI: Suzy, Tommy

         N   Median
Suzy     8   131.0
Tommy    8   175.0

Point estimate for ETA1-ETA2 is -49.0
95.9 Percent CI for ETA1-ETA2 is (-137.0, 36.0)
W = 52.5
Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.1152
The test is significant at 0.1149 (adjusted for ties)

To make your final conclusion, compare your p-value to your prespecified level of α (typically 0.05). If your α level is 0.1149 or larger, you reject Ho; otherwise, you can’t. In this case, because 0.1149 is greater than 0.05, you can’t reject Ho. That means you don’t have enough evidence to say the population medians for days on the market for Suzy’s versus Tommy’s houses are different based on this data. These results confirm your conclusions from the previous section.

The Minitab output in Figure 18-2 also provides a confidence interval for the difference in the medians between the two populations, based on the data from these two samples. The difference in the sample medians (Suzy – Tommy) is 131.0 – 175.0 = -44.0. (Minitab’s own point estimate in Figure 18-2, -49.0, is computed differently, so it doesn’t match this simple difference.) Adding and subtracting the margin of error (these calculations are beyond the scope of this book), Minitab finds the confidence interval for the difference in medians (Suzy – Tommy) is (-137.0, 36.0). The difference in the population medians could be anywhere from -137.0 to 36.0. Because 0, the value in Ho, is in this interval, you can’t reject Ho in this case. So again, you can’t say that the medians are different, based on this (limited) data set.

Rank sum tests can be used to compare two groups of judges of a competition, to see whether there is a difference in their scores. For example, in the Olympic ice-skating events, the gender of the judges is sometimes suspected to play a role in the scores they give to certain skaters. Suppose you have a men’s ice-skating competition and you have ten judges: five males and five females. You want to know whether male and female judges score the competitors in the same way, so you do a rank sum test to compare their median scores. Your hypotheses are Ho: male and female judges have the same median score versus Ha: they have different median scores. For your sample, you let each judge score the same individual. You rank their scores in order from lowest to highest and label M for a male judge and F for a female judge. Your results are the following: F M M M M F F F F M. The value of the test statistic T is the sum of the ranks for group one (say the males), which gives you T = 2 + 3 + 4 + 5 + 10 = 24. Now compare that to the critical values in Table A-4 (Appendix), where both sample sizes equal five, and you get TL = 18 and TU = 37. Because your test statistic, T = 24, is inside this interval, you fail to reject Ho: judging is the same for male and female judges. You just don’t have enough evidence to say that they differ.
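The judges example boils down to a rank sum plus a two-sided cutoff check, which can be sketched as follows. The function name `rank_sum_decision` is hypothetical; the critical values TL = 18 and TU = 37 are the ones quoted above from Table A-4 and are simply passed in, not computed.

```python
def rank_sum_decision(t, t_lower, t_upper):
    """Two-sided rank sum decision: reject Ho only if T falls
    outside the interval [t_lower, t_upper]."""
    if t < t_lower or t > t_upper:
        return "reject Ho"
    return "fail to reject Ho"


# Judges ordered from lowest score to highest, as given in the text.
order = "F M M M M F F F F M".split()

# T is the sum of the ranks held by group one (the male judges).
T_judges = sum(rank for rank, label in enumerate(order, start=1) if label == "M")
print(T_judges)                             # 24
print(rank_sum_decision(T_judges, 18, 37))  # fail to reject Ho

# The real estate example uses the same rule with TL = 49 and TU = 87.
print(rank_sum_decision(52.5, 49, 87))      # fail to reject Ho
```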

  • Author: Ankar

Regression


In This Chapter

^ Knowing when logistic regression is appropriate

^ Building logistic regression models for yes or no data

^ Checking model conditions and making the right conclusions

Everyone (even yours truly) tries to make predictions about whether or not a certain event is going to happen. For example, what’s the chance it’s going to rain this weekend? What are our team’s chances of winning our next game? What is the chance that I’ll have complications during this surgery? These predictions are often based on probability, the long-term percentage of time an event is expected to happen. In the end, you want to estimate p, the probability of an event occurring. In this chapter, you see how to build and test models for p based on a set of explanatory (x) variables. This technique is called logistic regression.

Setting Up the Logistic Regression Model

Yes or no data that comes from a random sample has a binomial distribution with probability of success (the event occurring) equal to p. In the binomial problems you saw in intro stats, you had a sample of n trials, you had yes or no data, and you had a probability of success on each trial, denoted by p. In your intro stat course, for any binomial problem the value of p was somehow given to be a certain value, but in intermediate stats, you operate under the much more realistic scenario that it’s not. In fact, because p isn’t known, your job is to estimate what it is and use a model to do that.

To estimate p, the chance of an event occurring, you need data that comes in the form of yes or no, indicating whether or not the event occurred for each individual in the data set. Now because yes or no data don’t have a normal distribution, a condition needed for other types of regression, you need a new type of regression model to do this job: logistic regression. Keep reading this section to find out more about this model.

Defining a logistic regression model

A logistic regression model ultimately gives you an estimate for p, the probability that a particular outcome will occur in a yes or no situation (for example, the chance that it will rain versus not). The estimate is based on information from one or more explanatory variables; you can call them x1, x2, x3, . . . xk. (For example, x1 = humidity, x2 = barometric pressure, x3 = cloud cover, . . . and xk = wind speed.) Note: In this chapter, I present only the case where you use one explanatory variable. You can extend the ideas in exactly the same way as you can extend the simple linear regression model (Chapter 4) to a multiple regression model (Chapter 5).

Using an S-curve to estimate probabilities

In a simple linear regression model, the general form of a straight line is y = β0 + β1x. In the case of estimating p, the linear regression model is the straight line p = β0 + β1x. However, it doesn’t make sense to use a straight line to estimate the probability of an event occurring based on another variable, for the following reasons:

^ The estimated values of p can never be outside of [0, 1], which goes against the idea of a straight line (a straight line continues on in both directions).

^ It doesn’t make sense to force the values of p to increase in a linear way based on x. For example, an event may occur very frequently with a range of large values of x and very frequently with a range of small values of x, with very little chance of the event happening in an area in between. This type of model would have a U-shape, rather than a straight-line shape.

To come up with a more appropriate model for p, statisticians created a new function of p whose graph is called an S-curve. The S-curve is a function that involves p, but it also involves e (the base of the natural logarithm) as well as a ratio of two functions. The values of the S-curve always fit between 0 and 1 and allow the probability, p, to change from low to high or high to low, according to a curve that is shaped like an S. The general form of the logistic regression model based on an S-curve is

$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$

Interpreting the coefficients of the logistic regression model

The sign on the parameter β1 tells you the direction of the S-curve. If β1 is positive, the S-curve goes from low to high (see Figure 8-1a); if β1 is negative, the S-curve goes from high to low (Figure 8-1b).

Figure 8-1: Two basic types of S-curves: (a) β1 > 0 and (b) β1 < 0. Each panel plots p (from 0.0 to 1.0) against x.

The magnitude of β1 (indicated by its absolute value) tells you how much curvature is in the model. High values indicate a steep curvature and low values indicate slow curvature. The parameter β0 just shifts the S-curve to the proper location to fit your data. It shows you the cutoff point where x-values change from high to low probability and vice versa.
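To see how the two parameters shape the S-curve, here is a minimal Python sketch. The function name `s_curve` and the parameter values (-5 and 0.5) are hypothetical choices for illustration; they don’t come from any data set in this chapter.

```python
import math


def s_curve(x, beta0, beta1):
    """Logistic S-curve: e^(b0 + b1*x) / (1 + e^(b0 + b1*x)),
    written in the algebraically equivalent form 1 / (1 + e^-(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


xs = list(range(0, 21))

# Hypothetical parameters: a positive slope gives a low-to-high curve.
rising = [s_curve(x, -5.0, 0.5) for x in xs]

# Flipping the signs of both parameters gives the high-to-low mirror image.
falling = [s_curve(x, 5.0, -0.5) for x in xs]

# The curve crosses p = 0.5 where beta0 + beta1*x = 0, that is, x = -beta0/beta1.
cutoff = -(-5.0) / 0.5
print(cutoff, s_curve(cutoff, -5.0, 0.5))  # 10.0 0.5
```

Every value in both lists stays strictly between 0 and 1, which is exactly the property a straight line can’t offer. Changing the hypothetical 0.5 to a larger number makes the switch from low to high happen over a narrower range of x.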

Estimating the chance a movie will be a hit by using logistic regression

Often, the best way to figure something out is to see it in action. In this section, I give you an example of a situation where you can use a logistic regression model to estimate a probability. (I expand on this example later in this chapter; for now, I’m just setting up a scenario for logistic regression.)

Suppose movie marketers want to estimate the chance that someone will enjoy a certain family movie, and you believe age may have something to do with it. Translating this research question into x’s and y’s, the response variable (y) is whether or not a person will enjoy the movie, and the explanatory variable (x) is the person’s age. You want to estimate p, the chance of someone enjoying the movie. You collect data on a random sample of 40 people, shown in Table 8-1. Based on your data, it appears that younger people enjoyed the movie more than older people, and that at a certain age, the trend switches from liking the movie to disliking it; so, you can build a logistic regression model to estimate p.

Table 8-1 Movie Enjoyment (Yes or No Data) Based on Age

Age    Enjoyed the Movie    Total Number Sampled
10     3                    3
15     4                    4
16     3                    3
18     2                    3
20     2                    3
25     2                    4
30     2                    4
35     1                    5
40     1                    6
45     0                    3
50     0                    2

General Steps for Logistic Regression

The basic idea of any model-fitting process is to look at all possible models you can have under the general format and find the one that fits your data best. The general form of the best-fitting logistic regression model is

$\hat{p} = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$

where p̂ is the estimate of p, b0 is the estimate of β0, and b1 is the estimate of β1 (from the previous section). The only values you have a choice about to form your particular model are the values of b0 and b1. These values are the ones you’re trying to estimate through the logistic regression analysis.

To find the best-fitting logistic regression model for your data, complete the following steps:

1. Run a logistic regression analysis on the data you collected (see the section "Running the analysis in Minitab" for these instructions.)

2. Find the coefficients of the constant and x, where x is the name of your explanatory variable.

These coefficients are b0 and b1, the estimates of β0 and β1 in the logistic regression model.

3. Plug the coefficients from step two into the logistic regression model:

$\hat{p} = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$

This equation is your best-fitting logistic regression model for the data. Its graph is an S-curve (for more on the S-curve, see the section "Using an S-curve to estimate probabilities" earlier in this chapter).

In the sections that follow, you see how to ask Minitab to do the above steps for you. You also see how to interpret the resulting computer output, find the equation of the best-fitting logistic regression model, and use that model to make predictions (being ever mindful that all conditions are met).

Running the analysis in Minitab

Using Minitab, here’s how to perform a logistic regression (other statistical software packages are similar):

1. Input your data in the spreadsheet as a table that lists each value of the x variable in column one, the number of yeses for that value of x in column two, and the total number of trials at that x-value in column three.

These last two columns represent the outcome of the response variable Y. (For an example of how to enter your data, see Table 8-1 based on the movie-age data.)

2. Go to Stat>Regression>Binary Logistic Regression.

3. Beside the Success option, select your variable name from column two, and beside Trial, select your variable name for column three.

4. Under Model, select your variable name from column one, because that’s the column containing the explanatory (x) variable in your model.

5. Click OK, and you get your logistic regression output.

When you fit a logistic regression model to your data, the computer output is composed of two major portions:

The model-building portion: In this part of the output, you can find the coefficients b0 and b1 (I describe coefficients in the section "Finding the coefficients and making the model").

The model-fitting portion: You can see the results of a Chi-square goodness-of-fit test (see Chapter 15) as well as the percentage of concordant and discordant pairs in this section of the output. (Each pair matches one individual for whom the event occurred with one for whom it didn’t. A concordant pair means the model gives the higher estimated probability to the individual for whom the event occurred, so the model agrees with the data. A discordant pair is one where the model gives the higher estimated probability to the individual for whom the event didn’t occur.)

In the case of the movie and age data, the model-building part of the Minitab output is shown in Figure 8-2. The model-fitting part of the Minitab output from the logistic regression analysis is in Figure 8-4. In the following sections, you see how to use this output to build the best-fitting logistic regression model for your data and to check the model’s fit.

Figure 8-2: The model-building part of the movie and age data’s logistic regression output.

Logistic Regression Table

                                                    Odds     95% CI
Predictor   Coef        SE Coef     Z       P       Ratio    Lower   Upper
Constant    4.86539     1.43434     3.39    0.001
Age         -0.175745   0.0499620   -3.52   0.000   0.84     0.76    0.93

Finding the coefficients and making the model

After you have Minitab run a logistic regression analysis on your data, you can find the coefficients b0 and b1 and put them together to form the best-fitting logistic regression model for your data.

Figure 8-2 shows part of the Minitab output for the movie enjoyment and age data. I call this portion of the output the model-building part of the output. (I discuss the remaining output in the section "Checking the fit of the model.") The first column of numbers is labeled Coef, which stands for the coefficients in the model. The first coefficient, b0, is labeled Constant. The second coefficient is in the row labeled by your explanatory variable, x. (In the movie and age data, the explanatory variable is age. This age coefficient represents the value of b1 in the model.)

According to the Minitab output in Figure 8-2, the value of b0 is 4.87 and the value of b1 is -0.18. After you’ve determined the coefficients b0 and b1 from the Minitab output to find the best-fitting S-curve for your data, you put these two coefficients into the general logistic regression model:

$\hat{p} = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$

For the movie and age data, you get

$\hat{p} = \frac{e^{4.87 - 0.18x}}{1 + e^{4.87 - 0.18x}}$

which is the best-fitting logistic regression model for this data set.
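As a cross-check on the coefficients read from Figure 8-2, the model can be fit directly to the grouped data in Table 8-1. The sketch below uses Newton-Raphson maximum likelihood, a standard way to fit logistic regression; the text doesn’t say which algorithm Minitab uses, so treat this as an illustrative substitute. The function name `fit_logistic` and its defaults are assumptions of this example.

```python
import math

# Grouped movie data from Table 8-1.
ages = [10, 15, 16, 18, 20, 25, 30, 35, 40, 45, 50]
yes = [3, 4, 3, 2, 2, 2, 2, 1, 1, 0, 0]
total = [3, 4, 3, 3, 3, 4, 4, 5, 6, 3, 2]


def fit_logistic(x, successes, trials, max_iter=50, tol=1e-10):
    """Fit p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) to grouped
    yes/no data by Newton-Raphson on the binomial log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix entries
        for xi, yi, ni in zip(x, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = yi - ni * p
            w = ni * p * (1.0 - p)
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1


b0, b1 = fit_logistic(ages, yes, total)
print(round(b0, 2), round(b1, 2))  # 4.87 -0.18
```

The fitted values agree with the Constant and Age rows of Figure 8-2 (4.86539 and -0.175745) up to rounding.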

The graph of the best-fitting logistic regression model for the movie and age data is shown in Figure 8-3. Notice it has an S-shaped curve to it. Note that the graph’s a downward-sloping S-curve, because higher probabilities of liking the movie are affiliated with lower ages and lower probabilities are affiliated with higher ages. The movie marketers now have the answer to their question. This movie has a higher chance of being well liked by kids (and the younger, the better) and a lower chance of being well liked by adults (and the older they are, the lower the chance of liking the movie).

The point where the probability changes from high to low is between ages 25 and 30. That means that the tide of probability of liking the movie appears to turn from higher to lower in that age range. Using calculus terms, this point is called the inflection point of the S-curve, which is the point where the graph changes from concave up to concave down, or vice versa.

Figure 8-3: The best-fitting S-curve for the movie and age data.

Estimating p

You’ve determined the best-fitting logistic regression model for your data, obtained the values of b0 and b1 from the logistic regression analysis, and know the precise S-curve that fits your data best (check out the previous sections). You’re now ready to estimate p and make predictions about the probability that the event of interest will happen, given the value of the explanatory variable x.

To estimate p for a particular value of x, plug that value of x into your equation (the best-fitting logistic regression model) and simplify it by using your algebra skills. The number you get is the estimated chance of the event occurring for that value of x, and it should be a number between 0 and 1, being a probability and all.

Continuing with the movie and age example from the preceding sections, suppose you want to predict whether a child of age 15 would enjoy the movie. To estimate p, plug 15 in for x in the logistic regression model $\hat{p} = \frac{e^{4.87 - 0.18x}}{1 + e^{4.87 - 0.18x}}$ to get

$\hat{p} = \frac{e^{4.87 - 0.18 \cdot 15}}{1 + e^{4.87 - 0.18 \cdot 15}} = \frac{e^{2.17}}{1 + e^{2.17}} = \frac{8.76}{9.76} = 0.90$

That answer means you’ve found a 90 percent chance that a 15-year-old child will like the movie. You can see in Figure 8-3 that when x is 15, p̂ is approximately 0.90. On the other hand, if the person is 50 years old, the chance he will like this movie is $\hat{p} = \frac{e^{4.87 - 0.18 \cdot 50}}{1 + e^{4.87 - 0.18 \cdot 50}}$, or 0.02 (shown in Figure 8-3 for x = 50), which is only a 2 percent chance.
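The plug-in arithmetic for ages 15 and 50 can be reproduced with a short Python function. The name `estimate_p` is hypothetical; the coefficients 4.87 and -0.18 are the rounded values quoted from Figure 8-2.

```python
import math


def estimate_p(age, b0=4.87, b1=-0.18):
    """Estimated probability of enjoying the movie at a given age,
    using p-hat = e^(b0 + b1*age) / (1 + e^(b0 + b1*age))."""
    odds = math.exp(b0 + b1 * age)
    return odds / (1.0 + odds)


p_15 = estimate_p(15)
p_50 = estimate_p(50)
print(round(p_15, 2))  # 0.9
print(round(p_50, 2))  # 0.02
```

The age at which the estimate crosses 0.50 is where 4.87 - 0.18x = 0, which works out to roughly age 27, consistent with the turning point between ages 25 and 30 described for Figure 8-3.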

The results you get from a logistic regression analysis, as with any other data analysis, are all subject to the model fitting appropriately. The following section deals with that.

Checking the fit of the model

To determine whether or not your logistic regression model fits, follow these steps:

1. Locate the p-value of the goodness-of-fit test (found in the Goodness-of-Fit portion of the computer output; see Figure 8-4 for an example); if the p-value is larger than 0.05, conclude that your model fits, and if the p-value is less than 0.05, conclude that your model doesn’t fit.

2. Find the p-value for the b1 coefficient (it’s listed under P in the row for your column one [explanatory] variable); if the p-value is less than 0.05, the x variable is statistically significant in the model, so it should be included.

If the p-value is greater than or equal to 0.05, the x variable isn’t statistically significant and shouldn’t be included in the model.

3. Look later in the output at the percentage of concordant pairs to determine how well the model fits; the higher the percentage, the better the model fits.

That percentage pertains to the number of times that the data and the model actually agree with each other.


The conclusion in step one based on the p-value may seem backwards to you, but here’s what’s happening: Chi-square goodness-of-fit tests measure the overall difference between what you expect to see via your model versus what you actually observe in your data. (Chapter 15 gives you the lowdown on Chi-square tests.) The null hypothesis (Ho) for this test says you have a difference of zero between what you observed and what you expected from the model; that is, your model fits. The alternative hypothesis, denoted Ha, says that the model doesn’t fit. If you get a small p-value (under 0.05), reject Ho and conclude the model doesn’t fit. If you get a larger p-value (above 0.05), you can stay with your model.

Failure to reject Ho here (having a large p-value) only means that you can’t say your model doesn’t fit the population from which the sample came. It doesn’t necessarily mean the model fits with 100 percent certainty. Your data could be unrepresentative of the population just by chance.
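To make the observed-versus-expected idea concrete, the sketch below computes the Pearson chi-square statistic for the movie data, using the counts from Table 8-1 and the unrounded coefficients from Figure 8-2. It is an illustration only: the helper name `pearson_chi_square` is made up, and the p-value step is left out because Python’s standard library has no chi-square distribution function.

```python
import math

# Grouped movie data from Table 8-1.
ages = [10, 15, 16, 18, 20, 25, 30, 35, 40, 45, 50]
yes = [3, 4, 3, 2, 2, 2, 2, 1, 1, 0, 0]
total = [3, 4, 3, 3, 3, 4, 4, 5, 6, 3, 2]

# Coefficients as reported in Figure 8-2.
B0, B1 = 4.86539, -0.175745


def pearson_chi_square(x, successes, trials, b0, b1):
    """Sum of (observed - expected)^2 / variance over the age groups,
    where expected = n*p and variance = n*p*(1 - p) under the model."""
    chi2 = 0.0
    for xi, yi, ni in zip(x, successes, trials):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        expected = ni * p
        variance = ni * p * (1.0 - p)
        chi2 += (yi - expected) ** 2 / variance
    return chi2


chi2 = pearson_chi_square(ages, yes, total, B0, B1)
df = len(ages) - 2   # 11 age groups minus 2 estimated coefficients
print(round(chi2, 2), df)  # 2.83 9
```

A statistic this small relative to its 9 degrees of freedom is what produces the large p-value reported for the Pearson test in the output: the observed counts sit close to what the model expects.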

Figure 8-4:

The model-fitting part of the movie and age data’s logistic regression output.

Goodness-of-Fit Test

Method            Chi-Square   DF   P
Pearson           2.83474      9    0.970
Deviance          3.63590      9    0.934
Hosmer-Lemeshow   2.75232      6    0.839

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs        Number   Percent
Concordant   349      87.3
Discordant   30       7.5
Ties         21       5.3
Total        400      100.0

Summary Measures
Somers’ D                0.80
Goodman-Kruskal Gamma    0.84
Kendall’s Tau-a          0.41

Using Figure 8-4 to complete the first step of checking the model’s fit, you can see many different goodness-of-fit tests. The particulars of each of these tests are beyond the scope of this book; however, in this case (as with most cases), each test has only slightly different numerical results and the same conclusions. All the p-values in column 4 of Figure 8-4 are over 0.80, which is much higher than the 0.05 you need to reject the model. After looking at the p-values, the model appears to fit this data.

For step two, you look at the significance of the x variable, age. In Figure 8-2, you can see the coefficient for age, -0.18, and farther along in its row, you can see that the Z-value is -3.52; this Z-value is the test statistic for testing Ho: β1 = 0 versus Ha: β1 ≠ 0. The p-value is listed as 0.000, which means it’s smaller than 0.001 (a highly significant result). So you know that the coefficient in front of x, also known as β1, is statistically significant (not equal to zero), and you should include x (age) in the model.
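The Z-value and p-value in the Age row of Figure 8-2 follow directly from the coefficient and its standard error. The sketch below assumes the usual large-sample approach, where Z is the coefficient divided by its standard error and the two-sided p-value comes from the standard normal distribution; the function name `z_test` is hypothetical.

```python
import math


def z_test(coef, se):
    """Return (z, two-sided p-value) for testing whether a coefficient is zero,
    using the standard normal distribution."""
    z = coef / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * P(Z > |z|)
    return z, p_value


# Coef and SE Coef for the Age row of Figure 8-2.
z_age, p_age = z_test(-0.175745, 0.0499620)
print(round(z_age, 2))   # -3.52
print(p_age < 0.001)     # True

# The Odds Ratio column is e raised to the coefficient.
odds_ratio = math.exp(-0.175745)
print(round(odds_ratio, 2))  # 0.84
```

An odds ratio of 0.84 says that, under this model, each additional year of age multiplies the odds of enjoying the movie by about 0.84.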

To complete step three of the fit-checking process, look at the percentage of concordant pairs reported in Figure 8-4. This value shows the percentage of pairs for which the data actually agreed with the model (87.3). To get this result, Minitab pairs each individual for whom the event occurred with each individual for whom it didn’t, and checks whether the model gives the higher estimated probability to the one for whom it occurred. When you make a prediction for a single individual, remember that the logistic regression model is for p, the probability of the event occurring, so if p is estimated to be > 0.50 for some value of x, your best guess is that the event will occur (versus not occurring). If the estimated value of p is < 0.50 for a particular x-value, your best guess is that it won’t occur.

For the movie and age data, the percentage of concordant pairs (that is, the percentage of pairs for which the model and the data agree) is 87.3 percent, which is quite high. The percentage of concordant pairs was obtained by taking the number of concordant pairs and dividing by the total number of pairs. I’d start getting excited if the percentage of concordant pairs got over 75 percent; the higher, the better.
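The pair counts in Figure 8-4 can be reproduced from Table 8-1. The sketch below assumes the standard definition of these measures: every individual who enjoyed the movie is paired with every individual who didn’t (20 × 20 = 400 pairs), and a pair is concordant when the model gives the higher estimated probability to the one who enjoyed it. The variable names are made up for this illustration.

```python
import math

# Grouped movie data from Table 8-1.
ages = [10, 15, 16, 18, 20, 25, 30, 35, 40, 45, 50]
yes = [3, 4, 3, 2, 2, 2, 2, 1, 1, 0, 0]
total = [3, 4, 3, 3, 3, 4, 4, 5, 6, 3, 2]
no = [n - y for n, y in zip(total, yes)]

# Coefficients as reported in Figure 8-2.
B0, B1 = 4.86539, -0.175745


def p_hat(age):
    """Estimated probability of enjoying the movie at this age."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * age)))


concordant = discordant = ties = 0
for age_yes, n_yes in zip(ages, yes):
    for age_no, n_no in zip(ages, no):
        pairs = n_yes * n_no   # number of (yes, no) pairs at these two ages
        if p_hat(age_yes) > p_hat(age_no):
            concordant += pairs
        elif p_hat(age_yes) < p_hat(age_no):
            discordant += pairs
        else:
            ties += pairs

total_pairs = concordant + discordant + ties
print(concordant, discordant, ties, total_pairs)  # 349 30 21 400

# Somers' D, as listed under Summary Measures in Figure 8-4.
somers_d = (concordant - discordant) / total_pairs
print(round(somers_d, 2))  # 0.8
```

Ties happen only when both individuals in a pair are the same age, because the model then gives them identical estimated probabilities.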

Figure 8-5 shows the logistic regression model for the movie and age data, with the actual values of the observed data added as circles. Much of the time, the model made the right decision; probabilities above 0.50 are associated with more circles at the value of 1, and probabilities below 0.50 are associated with more circles at the value of 0. It’s the outcomes that have p near 0.50 that are hard to predict, because the results can go either way.

Figure 8-5: Actual observed values (0 and 1) compared to the model. (Vertical axis: estimated probability, 0.0 to 1.0; horizontal axis: age, 10 to 50.)

All of this evidence helps confirm that your model fits your data well. You can go ahead and make predictions based on this model for the next individual that comes up, whose outcome you don’t know. (See the section "Estimating p" earlier in this chapter.)

In This Chapter

^ Comparing more than two population medians with the Kruskal-Wallis test

^ Determining which populations are different by using the Wilcoxon rank sum test

Statisticians who are in the nonparametrics business make it their job to find a nonparametric equivalent (one that doesn’t depend on the normal distribution) for every parametric procedure. And in the case of comparing more than two populations, these stats superheroes didn’t let us down. In this chapter, you see how the Kruskal-Wallis test works as a nonparametric procedure for comparing more than two populations. If Kruskal-Wallis tells you at least two populations differ, you also figure out how to use the Wilcoxon rank sum test to determine which populations are different.

Doing the Kruskal-Wallis Test to Compare More than Two Populations

The Kruskal-Wallis test compares the medians of several (more than two) populations to see whether or not they are different. The basic idea of Kruskal-Wallis is to collect a sample from each population, rank all the combined data from smallest to largest, and then look for a pattern in how those ranks are distributed among the various samples. For example, if one sample gets all the low ranks and another sample gets all the high ranks, perhaps their population medians are different. Or if all the samples have an equal mix of all the ranks, perhaps the medians of the populations are all deemed to be the same. In this section, you see exactly how the Kruskal-Wallis test is conducted using ranks and sums and all that good stuff, and you see it applied to an example comparing airline ratings.

Suppose your boss flies a lot, and she wants you to determine which of three airlines gets the best ratings from customers. You know that ratings involve data that is just not normal (pun intended), so you opt to use the Kruskal-Wallis test. You take three random samples of nine people each from three different airlines. You ask each person to rate his satisfaction with the one airline for which you chose that person to rate. Each person uses a scale from 1 (the worst) to 4 (the best). You can see the data from your samples in Table 19-1.

Table 19-1 Customer Ratings of Three Airlines

Airline A Rating    Airline B Rating    Airline C Rating
4                   2                   2
3                   3                   3
4                   3                   3
4                   3                   2
3                   4                   2
3                   4                   1
2                   3                   3
3                   4                   2
4                   3                   2

In looking at the data in Table 19-1, it appears that airlines A and B have better ratings than airline C. However, the data has a lot of variability in it, so you have to conduct a hypothesis test before you can make any general conclusions beyond this data set.

You may be thinking of using ANOVA to analyze this data (the test that compares the means of several populations and is found in Chapter 9). But the data from each airline is ratings from 1 to 4, and this blows the strongest condition of ANOVA: the data from each population must follow a normal distribution. (A normal distribution is continuous, meaning it takes on all real numbers in a certain range. Data that are whole numbers like 1, 2, 3, and 4 don’t fall under this category.)

But don’t sweat; a nonparametric alternative fits the bill. The Kruskal-Wallis test compares the medians of several (more than two) populations to see whether they are all the same or not. In other words, it’s like ANOVA, except it’s done with medians not means.

In this section, you discover how to check the conditions of the Kruskal-Wallis test, set it up, and carry it out step by step.

Checking the conditions

Following are all of the conditions of the Kruskal-Wallis test that must be met:

The random samples taken from each population are independent. (This means matched-pairs data like in Chapter 17 are out of this picture.)

All the populations have the same distribution. (That is, their shapes are the same as seen on a histogram.)

The variances of the populations are the same. That means the amount of spread in the population values is the same from one population to the next.

Note that these conditions mention shape and spread, but they don’t mention the center of the distributions. That’s what the test is trying to determine, whether the populations are centered at the same place.

In nonparametrics, you often see the word location in reference to a population distribution rather than the center, although the two words mean about the same thing. Location indicates where the distribution is sitting on the number line. If you have two bell-shaped curves with the same variance, and one has mean 10 and the other has mean 15, the second distribution is located five units to the right of the first. In other words, its location is a five-unit shift to the right of the first distribution. In nonparametrics, where you don’t have bell-shaped distributions, you typically use the median as a measure of location (center) of a distribution. So throughout this discussion, you could use the word median instead of location (although location leaves it a bit more open).

Regarding the airline survey, you know that the samples are independent, because you didn’t use the same person to rate more than one airline. The other two conditions have to do with the distributions the samples came from; each population must have the same shape and the same spread. You can examine both conditions by looking at boxplots of the data (see Figure 19-1) and descriptive statistics, such as the median, standard deviation, and the rest of the summary statistics making up the boxplots (see Figure 19-2).

The boxplots in Figure 19-1 all have the same shape, and their standard deviations, shown in Figure 19-2, are very close. All of this evidence taken together allows you to go ahead with the Kruskal-Wallis test. (Now looking at the overlap in the boxplots for airlines A and B, in Figure 19-1, you can also make an early prediction that airlines A and B have similar ratings; whether C is different enough from A and B is impossible to say without running the hypothesis test.)

Figure 19-1:

Boxplots comparing the ratings of three airlines.

Figure 19-2:

Descriptive statistics comparing the ratings of three airlines.

Descriptive Statistics: Rating

Variable  Airline  StDev  Minimum  Q1     Median  Q3     Maximum
Rating    A        0.707  2.000    3.000  3.000   4.000  4.000
          B        0.667  2.000    3.000  3.000   4.000  4.000
          C        0.667  1.000    2.000  2.000   3.000  3.000
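If you don't have Minitab handy, the medians and standard deviations in Figure 19-2 can be checked with a short Python sketch. It assumes the ratings are typed in from Table 19-1 and uses only the standard library's `statistics` module; the dictionary names are illustrative choices, not anything from Minitab.

```python
from statistics import median, stdev

# Ratings transcribed from Table 19-1, one list per airline.
ratings = {
    "A": [4, 3, 4, 4, 3, 3, 2, 3, 4],
    "B": [2, 3, 3, 3, 4, 4, 3, 4, 3],
    "C": [2, 3, 3, 2, 2, 1, 3, 2, 2],
}

# Sample median and sample standard deviation (n - 1 in the denominator).
summary = {
    airline: {"median": median(values), "stdev": round(stdev(values), 3)}
    for airline, values in ratings.items()
}

for airline, stats in summary.items():
    print(airline, stats)
```

The three standard deviations come out close to one another, which is the evidence for the equal-spread condition.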

Either a boxplot or a histogram can tell you about the shape and spread of a distribution (as well as the center). The boxplot is a common type of graph to use for nonparametric procedures because it displays the median (the nonparametric statistic of choice) rather than the mean. A histogram is at its best showing the shape of the data; it doesn’t directly tell you where the center is, so you just have to eyeball it. For the airline data, go with the boxplot rather than the histogram.

To make boxplots of each sample of data show up side by side on one graph (called side-by-side boxplots, cleverly) in Minitab, click on Graph>Box Plots and select the Multiple Y’s Simple version. In the left-hand box, click on each of the column names for your data sets. They each appear in the Graph Variables window on the right. Click OK and you get a set of boxplots that are side by side, all on the same graph using the same scale (slick, huh?).

Setting up the test

The Kruskal-Wallis test assesses Ho: All k populations have the same location versus Ha: The locations of at least two of the k populations are different. (Here, k is the number of populations you’re comparing.)

In Ho, you see that all the populations have the same location (which means they all sit on top of each other on the number line and are in essence the same population). Ha is looking for the opposite situation in this case. However, the opposite of "the locations are all equal" isn’t "the locations are all different." The opposite is that at least two of them are different. Failure to recognize this difference will lead you to believe all the populations differ when, in reality, there may only be two that differ, and the rest are all the same. That’s why you see Ha stated the way it is in the Kruskal-Wallis test. (The same idea holds for comparing means using ANOVA; see Chapter 9.)

For the airline satisfaction example (see Table 19-1), your setup looks like this: Ho: The satisfaction ratings of all three airlines have the same median versus Ha: The median satisfaction ratings of at least two airlines are different.

Conducting the test step by step

After you’ve determined your hypotheses, and checked the conditions, you must carry out the test. Here are the steps for conducting the Kruskal-Wallis test, using the airline example to show how each step works:

1. Rank all the numbers in the entire data set from smallest to largest (using all samples combined); in the case of ties, use the average of the ranks that the values would have normally been given.

For an example of a tie, say that on a scale from 1 to 4, the observations 1, 1, 1 would normally have gotten ranks 1, 2, 3 if they were different, but because they’re equal, give each one the average of 1, 2, 3, which is $\frac{1 + 2 + 3}{3} = 2$. Figure 19-3 shows the results for ranking and summing the data in the airline example.

In Figure 19-3, you can see how to rank the ties. For example, you have only one 1, which is given rank 1. Then you have seven 2s, which normally would have gotten ranks 2, 3, 4, 5, 6, 7, and 8. Because the 2s are all equal, you give each of them the average of all these ranks, which is $\frac{2 + 3 + 4 + 5 + 6 + 7 + 8}{7} = \frac{35}{7} = 5$. Similarly, you see twelve 3s, whose ranks would be 9 through 20. Because they’re all equal, give them each a rank equal to $\frac{9 + 10 + \cdots + 20}{12} = 14.5$. Finally, you see seven 4s, each with rank 24, which is the average of their would-be ranks, ranging from 21 to 27.

Figure 19-3:

Rankings and rank sum for the airline example.

Airline A           Airline B           Airline C
Rating   Rank       Rating   Rank       Rating   Rank
4        24         2        5          2        5
3        14.5       3        14.5       3        14.5
4        24         3        14.5       3        14.5
4        24         3        14.5       2        5
3        14.5       4        24         2        5
3        14.5       4        24         1        1
2        5          3        14.5       3        14.5
3        14.5       4        24         2        5
4        24         3        14.5       2        5
T1 = 159            T2 = 149.5          T3 = 69.5

2. Total the ranks for each of the samples; call those totals T1, T2, . . ., Tk, where k is the number of populations.

The totals of the ranks in each column of Figure 19-3 for the airline data are T1 = 159, T2 = 149.5, and T3 = 69.5. In the steps that follow, you use these rank totals in the Kruskal-Wallis test statistic (denoted KW). (Note T1 and T2 are close to equal, but T3 is much lower, giving the idea that airline C may be the odd man out.)

3. Calculate the Kruskal-Wallis test statistic, $KW = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{T_j^2}{n_j} - 3(n+1)$,

where n is the total number of observations (all sample sizes combined) and $n_j$ is the size of sample j. Continuing with the airline example, the Kruskal-Wallis test statistic is $KW = \frac{12}{27(27+1)} \left( \frac{159^2}{9} + \frac{149.5^2}{9} + \frac{69.5^2}{9} \right) - 3(27+1)$, which equals 0.015873 * 5,829.056 – 3(28) = 8.52.

4. Find the p-value.

You find the p-value for your KW test statistic by comparing it to the Chi-square distribution with k - 1 degrees of freedom (Table A-3 in the Appendix). For the airline example, you look at the Chi-square table and find the row with 3 – 1 = 2 degrees of freedom. Then look at where your test statistic (8.52) falls in that row. Because 8.52 lies between 7.38 and 9.21 (shown on the table in row two), the p-value for 8.52 lies between 0.025 and 0.010 (shown in their respective column headings).

5. Make your conclusion about whether you can reject Ho by examining the p-value.

You can reject Ho: All populations have the same location, in favor of Ha: At least two populations have differing locations, if the p-value associated with KW is < α, where α is 0.05 (or your prespecified α level). Otherwise, you must fail to reject Ho.

Following the airline example, because the p-value is between 0.010 and 0.025, which are both less than α = 0.05, you can reject Ho. You conclude that the ratings of at least two of the three airlines are different.
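The five steps above can be sketched in Python using only the standard library. This is a minimal sketch under a few assumptions: the ratings are typed in from Table 19-1, the helper `average_ranks` is a hypothetical name introduced here, and the p-value uses the fact that the upper-tail area of a Chi-square distribution with exactly 2 degrees of freedom is exp(-x/2). That shortcut doesn't apply to other degrees of freedom, where you'd go back to the Chi-square table.

```python
import math

# Ratings transcribed from Table 19-1, one list per airline.
ratings = {
    "A": [4, 3, 4, 4, 3, 3, 2, 3, 4],
    "B": [2, 3, 3, 3, 4, 4, 3, 4, 3],
    "C": [2, 3, 3, 2, 2, 1, 3, 2, 2],
}


def average_ranks(values):
    """Return {value: rank}, where tied values share the average of their ranks."""
    ordered = sorted(values)
    ranks = {}
    position = 0  # number of observations already ranked
    while position < len(ordered):
        value = ordered[position]
        count = ordered.count(value)
        # The copies of `value` would occupy ranks position+1 through position+count.
        ranks[value] = position + (count + 1) / 2
        position += count
    return ranks


# Step 1: rank all 27 ratings together.
combined = [r for values in ratings.values() for r in values]
rank_of = average_ranks(combined)

# Step 2: total the ranks within each sample (T1, T2, T3).
totals = {name: sum(rank_of[r] for r in values) for name, values in ratings.items()}

# Step 3: KW = 12 / (n(n + 1)) * sum(T_j^2 / n_j) - 3(n + 1).
n = len(combined)
weighted = sum(totals[name] ** 2 / len(ratings[name]) for name in ratings)
kw = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)

# Step 4: upper-tail Chi-square area; exp(-x / 2) holds only for 2 degrees of freedom.
p_value = math.exp(-kw / 2)

# Step 5: compare the p-value with alpha.
alpha = 0.05
reject_ho = p_value < alpha

print(totals)
print(round(kw, 2), round(p_value, 3), reject_ho)
```

The rank totals agree with Figure 19-3, and the p-value lands between the 0.010 and 0.025 bounds read from the table.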

To conduct the Kruskal-Wallis test by using Minitab, enter your data in two columns: the first column holds the actual data values, and the second column indicates which population each value came from (for example, 1, 2, 3). Then click on Stat>Nonparametrics>Kruskal-Wallis. In the left-hand box, click on column one; it appears on the right side as your Response variable. Then click on column two in the left-hand box. This column appears on the right side as the Factor variable. Click OK, and the KW test is done. The main results of the KW test are shown in the last two lines of the Minitab output.

The results of the Minitab data analysis of the airline data are shown in Figure 19-4. On the second-to-last line of Figure 19-4, you can see the KW test statistic for the airline example is 8.52, which matches the one you found by hand (whew!). The exact p-value from Minitab is 0.014.

Figure 19-4: Comparing ratings of three airlines by using the Kruskal-Wallis test.

Kruskal-Wallis Test: Rating versus Airline

Kruskal-Wallis Test on Rating

Airline   N    Median   Ave Rank   Z
A         9    3.000    17.7       1.70
B         9    3.000    16.6       1.21
C         9    2.000    7.7        -2.91
Overall   27            14.0

H = 8.52   DF = 2   P = 0.014
H = 9.70   DF = 2   P = 0.008 (adjusted for ties)

However, quite a few ties are in this data set, and the formulas adjust a bit for that (in ways that go outside the scope of this book). Taking those ties into account, the computer gives you KW = 9.70 with a p-value of 0.008. The total evidence here says the same result loud and clear — reject Ho: The ratings for the three airlines have the same location. You conclude that the ratings of at least two of the airlines are different. (But which ones? The answer comes in the next section.)
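For the curious, one commonly used tie correction divides KW by 1 minus the sum of (t³ – t) over all groups of tied values, divided by (n³ – n), where t is the number of observations in a tied group. The chapter doesn't say which adjustment Minitab applies, so treating this as Minitab's formula is an assumption; the sketch below simply shows that it lands on the adjusted values in Figure 19-4. As before, the p-value shortcut exp(-x/2) is valid only for 2 degrees of freedom.

```python
import math
from collections import Counter

# Ratings transcribed from Table 19-1, one list per airline.
ratings = {
    "A": [4, 3, 4, 4, 3, 3, 2, 3, 4],
    "B": [2, 3, 3, 3, 4, 4, 3, 4, 3],
    "C": [2, 3, 3, 2, 2, 1, 3, 2, 2],
}

combined = [r for values in ratings.values() for r in values]
n = len(combined)

# Average ranks: each distinct rating gets the mean of the ranks its copies occupy.
counts = Counter(combined)
rank_of, start = {}, 0
for value in sorted(counts):
    rank_of[value] = start + (counts[value] + 1) / 2
    start += counts[value]

# Unadjusted KW statistic from the rank totals.
totals = {name: sum(rank_of[r] for r in values) for name, values in ratings.items()}
weighted = sum(totals[name] ** 2 / len(ratings[name]) for name in ratings)
kw = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)

# Tie correction: t is the size of each group of tied observations.
correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
kw_adjusted = kw / correction

# Upper-tail Chi-square area for 2 degrees of freedom.
p_adjusted = math.exp(-kw_adjusted / 2)

print(round(kw_adjusted, 2), round(p_adjusted, 3))
```

Because the correction factor is always at most 1, adjusting for ties can only make the statistic larger and the p-value smaller.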

Most people want life, from football to food portions, to be fair. And nothing appears to be more unfair than car insurance rates, right? You’ve heard the ads; one company claims to offer the lowest possible rates one day and a competing company makes the same claim the very next day. Who can you believe? You decide to grab the wheel and run your own test. You take a random sample of 20 different car and driver combinations (for example, a 40-year-old female with a Ford pickup, or a 78-year-old lady driving a Caddy) and you get the corresponding car insurance estimates from each company for each car and driver combo based on a six-month premium. Knowing that the distribution of prices for each company has no real reason to be normal (as in distribution), you go for the Kruskal-Wallis test of their medians. You rank all the premiums from smallest to largest, you sum the ranks that correspond to estimates from each company, and you compare them using the KW statistic. In the end, you might very well find that the companies’ prices don’t look that different after all, because the prices they talk about in their advertisements represent a selective sample of the population of all their prices, and your sample gets more at the heart of the pricing that is actually going on overall. The moral of the story is don’t listen to everything you hear about car insurance rates. Get a cross section of prices and do the Kruskal-Wallis. Your pocketbook will thank you for it.

Pinpointing the Differences: The Wilcoxon Rank Sum Test

Suppose you reject Ho in the Kruskal-Wallis test. That means you have enough evidence to conclude that at least two of the populations have different medians. But you don’t know which ones are different. When someone finds that a set of populations don’t all share the same median, the next question is very likely to be, "Well then, which ones are different?" To find out which populations are different after the Kruskal-Wallis test has rejected Ho, you can use the Wilcoxon rank sum test (also known as the Mann-Whitney test; refer to Chapter 18).

You can’t go looking for differences in specific pairs of populations until you’ve first established that the populations aren’t all the same (that is, Ho is rejected in the Kruskal-Wallis test). If you don’t make this check first, you can encounter a ton of problems, not the least of which is a much-increased chance of making the wrong decision.

In the following sections, you see how pairwise comparisons are conducted and interpreted in order to find out where the differences lie among the k population medians you’re studying.

Pairing off with pairwise comparisons

The rank sum test is a nonparametric test that compares two population locations (for example, their medians). When you have more than two populations, you conduct the rank sum test on every pair of populations in order to see whether differences exist. This procedure is called conducting pairwise comparisons or multiple comparisons. (See Chapter 10 for info on the parametric version of multiple comparisons.) For example, because you’re comparing three airlines in the airline satisfaction example (see Table 19-1), you have to run the rank sum test three times to compare airlines A and B, A and C, and B and C, respectively. So you need three pairwise comparisons to figure out which populations are different.

To determine how many pairs of comparisons you need if you’re given k populations, you use the formula $\frac{k(k-1)}{2}$. You have k populations to choose from first, and then k - 1 populations left to compare them with. Finally, you don’t care what the order is among the populations (as long as you keep track of them); so you divide by two because you have two ways to order any pair (for example, comparing A and B gives you the same results as comparing B and A). In the airlines example, you have k = 3 populations, so you should have $\frac{3(3-1)}{2} = \frac{6}{2} = 3$ pairs of populations to compare, which matches what was determined previously. (For more information and examples on how to count the number of ways to choose or order a group of items by using permutations and combinations, see another book I authored, Probability For Dummies [Wiley].)
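The counting argument is easy to confirm by listing the pairs outright. This short sketch uses the standard library's `itertools.combinations`; the airline labels are just placeholders for the three samples.

```python
from itertools import combinations

airlines = ["A", "B", "C"]
k = len(airlines)

# Formula: k choices for the first population, k - 1 for the second, halved because order doesn't matter.
n_pairs = k * (k - 1) // 2

# Enumerate the unordered pairs to check the formula.
pairs = list(combinations(airlines, 2))

print(n_pairs)
print(pairs)
```

With four populations the same formula gives 6 comparisons, and with five it gives 10, so the workload grows quickly.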

Carrying out comparison tests to see who’s different

The Wilcoxon rank sum test assesses Ho: The two populations have the same location versus Ha: The two populations have different locations. Here are the general steps for using the Wilcoxon rank sum test for making comparisons (for detailed step-by-step instructions for the Wilcoxon rank sum test see Chapter 18):

1. Check the conditions for the test by using descriptive statistics and histograms for the last two and proper sampling procedures for the first one:

• The two samples must be from independent populations

• The populations must have the same distribution (shape)

• The populations must have the same variance

2. Set up your Ho: Medians are equal versus Ha: Medians aren’t equal.

3. Combine all the data and rank the values from smallest to largest.

4. Add up all the ranks from the first sample (or the smaller sample if the sample sizes are not equal).

This result is your test statistic, T.

5. Compare T to the critical values in Table A-4 (Appendix) in the row and column corresponding to the two sample sizes.

If T is at or beyond the critical values (less than or equal to the lower one or greater than or equal to the upper one), reject Ho and conclude the two population medians are different. Otherwise, you can’t reject Ho.

6. Repeat Steps 1-5 on every pair of samples in the data set and draw conclusions.

Sort through all the results to see the overall picture of which pairs of populations have the same median and which ones don’t.
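Steps 3, 4, and 6 can be sketched in Python for the airline data. The sketch assumes the ratings are typed in from Table 19-1, and `rank_sum` is a hypothetical helper name introduced here. It stops at the test statistic T for each pair, because the critical values in Table A-4 aren't reproduced in this chapter.

```python
from collections import Counter
from itertools import combinations

# Ratings transcribed from Table 19-1, one list per airline.
ratings = {
    "A": [4, 3, 4, 4, 3, 3, 2, 3, 4],
    "B": [2, 3, 3, 3, 4, 4, 3, 4, 3],
    "C": [2, 3, 3, 2, 2, 1, 3, 2, 2],
}


def rank_sum(first, second):
    """Rank both samples together (ties share the average rank) and total the first sample's ranks."""
    counts = Counter(first + second)
    rank_of, start = {}, 0
    for value in sorted(counts):
        # Copies of `value` occupy ranks start+1 through start+counts[value].
        rank_of[value] = start + (counts[value] + 1) / 2
        start += counts[value]
    return sum(rank_of[v] for v in first)


# Repeat the ranking for every pair of airlines.
t_statistics = {
    (a, b): rank_sum(ratings[a], ratings[b]) for a, b in combinations(ratings, 2)
}

for pair, t in t_statistics.items():
    print(pair, t)
```

Note that the two samples are re-ranked from scratch for each pair; the ranks from the three-sample Kruskal-Wallis test aren't reused.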

To conduct the Wilcoxon rank sum test for pairwise comparisons in Minitab, refer to Chapter 18. Note that Minitab calls this test by its other name, the Mann-Whitney test.

You can see the Minitab results of the three Wilcoxon rank sum tests comparing airlines A and B, A and C, and B and C, respectively, in Figures 19-5a, 19-5b, and 19-5c.

Before you make any judgments about your hypotheses, you must analyze your data. Figure 19-5a compares the ratings of airlines A and B. The p-value (adjusted for ties) is 0.7325, which is much higher than the 0.05 you need to reject Ho. So you can’t conclude that airlines A and B have satisfaction ratings with different medians. Figure 19-5b shows that the p-value for comparing airlines A and C is 0.0078. Because this p-value is a lot smaller than the typical α level of 0.05, this is very convincing evidence that airlines A and C don’t have the same median ratings. Figure 19-5c also has a small p-value (0.0107), which gives evidence that airlines B and C have significantly different ratings.
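The tie-adjusted p-values quoted above can be approximated without Minitab. The sketch below uses a normal approximation to the rank sum with a continuity correction and a standard tie adjustment to the variance. Whether Minitab computes its p-values exactly this way is an assumption, so treat the code as an illustration that happens to land on the reported values; `rank_sum_p_value` is a hypothetical helper name.

```python
import math
from collections import Counter

# Ratings transcribed from Table 19-1, one list per airline.
ratings = {
    "A": [4, 3, 4, 4, 3, 3, 2, 3, 4],
    "B": [2, 3, 3, 3, 4, 4, 3, 4, 3],
    "C": [2, 3, 3, 2, 2, 1, 3, 2, 2],
}


def rank_sum_p_value(first, second):
    """Two-sided p-value for the rank sum of `first`, by normal approximation adjusted for ties."""
    n1, n2 = len(first), len(second)
    total = n1 + n2
    counts = Counter(first + second)
    rank_of, start = {}, 0
    for value in sorted(counts):
        rank_of[value] = start + (counts[value] + 1) / 2
        start += counts[value]
    w = sum(rank_of[v] for v in first)

    # Mean and tie-adjusted variance of the rank sum when Ho is true.
    mean_w = n1 * (total + 1) / 2
    tie_term = sum(t ** 3 - t for t in counts.values()) / (total * (total - 1))
    var_w = n1 * n2 / 12 * ((total + 1) - tie_term)

    # Continuity correction of 0.5, never letting the distance go negative.
    z = max(abs(w - mean_w) - 0.5, 0.0) / math.sqrt(var_w)
    # Two-sided normal tail area: 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


p_values = {
    ("A", "B"): rank_sum_p_value(ratings["A"], ratings["B"]),
    ("A", "C"): rank_sum_p_value(ratings["A"], ratings["C"]),
    ("B", "C"): rank_sum_p_value(ratings["B"], ratings["C"]),
}

for pair, p in p_values.items():
    print(pair, round(p, 4), "reject Ho" if p < 0.05 else "can't reject Ho")
```

Only the A versus B comparison fails to reject Ho, which is the same pattern the chapter reads off the Minitab output.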

Examining the medians to see how they’re different

Now that you know two or more populations have different medians, the next question to answer is how they are different: which one has the higher median, and which one has the lower median. In this section, you see how to take the results of your pairwise comparisons combined with some descriptive statistics to get your answers.

Figure 19-5: Wilcoxon rank sum tests comparing ratings of two airlines at a time.

a. Airlines A and B

Point estimate for ETA1-ETA2 is -0.000
95.8 Percent CI for ETA1-ETA2 is (-1.000,1.000)
W = 89.5
Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.7573
The test is significant at 0.7325 (adjusted for ties)

b. Airlines A and C

Mann-Whitney Test and CI: Airline A, Airline C
     N    Median
A    9    3.000
C    9    2.000
Point estimate for ETA1-ETA2 is 1.000
95.8 Percent CI for ETA1-ETA2 is (0.000,2.000)
W = 114.5
Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.0118
The test is significant at 0.0078 (adjusted for ties)

c. Airlines B and C

Mann-Whitney Test and CI: Airline B, Airline C
     N    Median
B    9    3.000
C    9    2.000
Point estimate for ETA1-ETA2 is 1.000
95.8 Percent CI for ETA1-ETA2 is (0.000,2.000)
W = 113.0
Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.0171
The test is significant at 0.0107 (adjusted for ties)

Rejecting Ho for a pairwise comparison means the two populations you examined have different medians. There are two ways to proceed from here to see how the medians differ:

You can look at side-by-side boxplots of all the samples and compare their medians (located at the line in the middle of each box).

You can calculate the median of each sample and see which ones are higher and which ones are lower (from the populations you have concluded are statistically different).

From the previous section, you see that the pairwise comparisons for the airline data conducted by Wilcoxon rank sum tests conclude that the ratings of airlines A and B aren’t found to be different, but both of them are found to be different from airline C.

But you can say even more; you can say how the differing airline compares to the others. Going back to Figure 19-2, you see the medians of both airlines A and B are 3.0, while the median of airline C is only 2.0. That difference means airlines A and B have similar ratings, but airline C has lower ratings than A and B.

The boxplots in Figure 19-1 confirm these results. By looking at these boxplots first, you may have had an idea that A and B were the same, but you didn’t know whether airline C was statistically significantly different from airlines A and B. And now you know it is.


In This Chapter

^ Reading and interpreting two-way tables

^ Figuring probabilities and checking for independence

^ Watching out for Simpson’s Paradox

Looking for relationships between two categorical (qualitative) variables is a very common goal for researchers. For example, many medical studies center on how some characteristic about a person either raises or lowers his chance of getting some disease. Marketers ask questions like, "Who is more likely to buy our product: males or females?" Sports stat freaks wonder about things like "Does winning the coin toss at the beginning of a football game increase your team’s chance of winning the game?"

To answer each of the above questions, you must first collect data (from a random sample) on the two categorical variables being compared; call them x and y. Then you organize that data into a table that contains columns and rows, showing how many individuals from the sample appear in each combination of x and y. Finally, you use the information in the table to conduct a hypothesis test (called the Chi-square test). Using the Chi-square test, you can determine whether you can see a relationship between x and y in the population from which the data was drawn. This last step needs the machinery from Chapter 14 to accomplish it. The goals of this chapter are to understand what it means for two qualitative variables (x and y) to be associated and to discover how to use percentages to determine whether a sample data set appears to show a relationship between x and y.

Suppose you’re collecting data on cell-phone users, and you want to find out whether more females use cell phones than males. A study of 508 randomly selected male cell-phone users and 508 randomly selected female cell-phone users conducted by a wireless company found that women tend to use their phones for personal calls more than men (big shocker). The survey showed that 427 of the women said they used their wireless phones primarily to talk with friends and family, while only 325 of the men admitted to doing so.

But you can’t stop there. You need to break down this information, calculate some percentages, and compare them to see how close they really are. Sample results vary from sample to sample, and differences can appear by chance.
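As a first pass at those percentages, here is a minimal sketch using the counts quoted from the survey. The variable names are illustrative only.

```python
# Counts quoted from the cell-phone survey: 508 men and 508 women were sampled.
group_size = 508
women_yes = 427  # women who use their phones primarily for personal calls
men_yes = 325    # men who said the same

pct_women = women_yes / group_size
pct_men = men_yes / group_size

print(f"Women: {pct_women:.1%}")
print(f"Men:   {pct_men:.1%}")
```

The gap of roughly 20 percentage points describes this sample only; deciding whether it reflects more than chance takes the hypothesis test the chapter points to.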

In this chapter, you find out how to organize data from qualitative variables (data based on categories rather than measurements) into a table format. This skill is especially useful when you’re trying to look for relationships between two qualitative variables, such as using a cell phone for personal calls (a yes or no category) and gender (male or female). You also summarize the data to answer your questions. And, finally, you get to figure out, once and for all, what’s going on with that Simpson’s Paradox thing.

Breaking Down a Two-Way Table

A two-way table is a table that contains rows and columns, which help you organize data from categorical (qualitative) variables in the following ways:

The rows represent the possible categories for one categorical variable, such as males and females.

The columns represent the possible categories for a second categorical variable, such as using your cell phone for personal calls, or not.

Here I review the basic ideas of organizing and filling in a two-way table.

Organizing data into a two-way table

To organize your data into a two-way table, first set up the rows and columns. Table 13-1 shows the setup for the cell-phone data (refer to the example I give at the beginning of the chapter).

Table 13-1 Two-Way Table Set Up for the Cell-Phone Data

            Personal Calls: Yes    Personal Calls: No
Males
Females

Notice that Table 13-1 has four empty cells inside of it (not counting the empty space in the upper-left corner). Because gender has two choices (male or female), and personal cell-phone use has two choices (yes or no), the resulting two-way table has 2 * 2 = 4 cells.

To figure out the number of cells in any two-way table, multiply the number of possible categories for the row variable by the number of possible categories for the column variable.

Fitting in the cell counts

After you set up the table with the appropriate number of rows and columns, you need to fill in the appropriate numbers in each of the cells of the two-way table. The number in each cell of a two-way table is called the cell count for that cell. The upper-left cell in the two-way table shown in Table 13-1 represents the number of males who use their cell phones for personal calls. With the information you have in the cell-phone problem, the cell count for this cell is 325. Because you know that 427 females use their cell phones for personal calls, this number goes into the lower-left cell.

Now, to figure out the numbers in the remaining two cells, you do a bit of subtraction. You know from the information given that the total number of male cell-phone users in the survey is 508. Each male either uses his cell phone for personal calls (falling into the yes group), or he doesn’t (falling into the no group). Because 325 males fall into the yes group, and you have 508 males total, 183 males (508 – 325 = 183) don’t use their cell phones for personal calls. This number is the cell count for the upper-right cell of the two-way table. Finally, because 508 females took the survey, and 427 of them use their cell phones for personal calls, you know that the rest of them (508 – 427 = 81) don’t. Therefore, 81 is the cell count for the lower-right cell of the table. Table 13-2 shows the completed table for the cell-phone user problem, with the four cell counts filled in.

Table 13-2 Completed Two-Way Table for the Cell-Phone Data

            Personal Calls: Yes    Personal Calls: No
Males       325                    183 (508 – 325)
Females     427                    81 (508 – 427)

Just to save you a little time, if you have the total number in a group and how many of those individuals fall into one of the categories of the two-way table, you can determine the number falling into the remaining category by subtracting the total number in the group minus the number in the given category. You can complete this process for each remaining group in the table.
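The subtraction shortcut, along with the rule for counting cells, can be sketched as follows. The nested dictionary is just one convenient way to hold a two-way table in Python; nothing about it comes from the survey itself.

```python
# Known information: the size of each group and how many in each group said yes.
group_totals = {"Males": 508, "Females": 508}
yes_counts = {"Males": 325, "Females": 427}

# Fill in each remaining cell by subtracting the known count from the group total.
table = {
    group: {"Yes": yes_counts[group], "No": group_totals[group] - yes_counts[group]}
    for group in group_totals
}

# Number of cells = (row categories) x (column categories).
n_rows = len(table)
n_columns = len(table["Males"])
n_cells = n_rows * n_columns

print(table)
print(n_cells)
```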

Making marginal totals

One of the most important aspects of a two-way table is to have easy access to all the pertinent totals. Because every two-way table is made up of rows and columns, you can imagine that the totals for each row and the totals for each column are important. Also, the grand total is important to know.

If you take a single row and add up all the cell counts in the cells of that row, you get what is called a marginal row total for that row. Where does this marginal row total go on the table? You guessed it: out in the margin at the end of that row. You can find the marginal row totals for every row in the table and put them into the margins at the end of the rows. This group of marginal row totals for each row represents what statisticians call the marginal distribution for the row variable. The marginal row totals should add up to the grand total, which is the total number of individuals in the study. (The individuals may be people, cities, dogs, companies, and so on, depending on the scenario of the problem at hand.)

Similarly, if you take a single column and add up all the cell counts in the cells of that column, you get the marginal column total for that column. This number goes in the margin at the bottom of the column. Follow this pattern for each column in the table, and you have the marginal distribution for the column variable. Again, the sum of all the marginal column totals equals the grand total. The grand total is always located in the lower-right corner of the two-way table.

The marginal row totals, marginal column totals, and the grand total for the cell-phone example are shown in Table 13-3.

Table 13-3 Marginal and Grand Totals for the Cell Phone Data

                         Personal Calls: Yes    Personal Calls: No    Marginal Row Totals
Males                    325                    183 (508 – 325)       508
Females                  427                    81 (508 – 427)        508
Marginal Column Totals   752                    264                   1,016 (Grand Total)

The marginal row totals add the cell counts in each row; yet the marginal row totals show up as a column in the two-way table. This phenomenon occurs because when summing the cell counts in a row, you put the result in the margin at the end of the row, and when you do this for each row, you’re stacking the row totals into a column. Similarly, the marginal column totals add the cell counts in each column; yet they show up as a row in the two-way table. Don’t let this be a source of confusion when you’re trying to navigate or set up a two-way table. It’s always a good idea to label your totals as marginal row, marginal column, or grand total to help keep it clear.
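The totals in Table 13-3 can be reproduced with a few sums. This sketch assumes the four cell counts from Table 13-2 are stored in a nested dictionary, an illustrative layout chosen here.

```python
# Cell counts from Table 13-2: rows are gender, columns are personal calls yes/no.
table = {
    "Males": {"Yes": 325, "No": 183},
    "Females": {"Yes": 427, "No": 81},
}

# Marginal row totals: add across each row.
row_totals = {group: sum(cells.values()) for group, cells in table.items()}

# Marginal column totals: add down each column.
column_totals = {
    answer: sum(table[group][answer] for group in table) for answer in ["Yes", "No"]
}

# The grand total can be reached from either margin.
grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```

Checking that both margins add up to the same grand total is a quick way to catch a miscopied cell count.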

Breaking Down the Probabilities

A percentage, when applied to a two-way table, represents the portion of the individuals in the sample falling into a certain group. This idea can be expanded to a probability, which gives the chance that an individual person selected from this group falls into a certain category.

A two-way table gives you the opportunity to find many different kinds of probabilities to help you find the answers to different questions about your data or to look at the data another way. In this section, I cover the three most important types of probabilities found in a two-way table: marginal probabilities, joint probabilities, and conditional probabilities. (If you need more info on these terms, check out Probability For Dummies [Wiley].)

When you find probabilities based on a sample, as you do in this chapter, you have to realize that those probabilities pertain to that sample only. They do not transfer automatically to the population being studied. For example, if you take a random sample of 1,000 adults and find that 55 percent of them watch reality TV, this study doesn’t mean that 55 percent of all adults in the entire population watch reality TV. (The media makes this mistake every day.) You need to take into account the fact that sample results vary. In Chapters 14 and 15, you do just that. But this chapter zeros in on summarizing the information in your sample, which is the first step toward that end (but not the last step in terms of making conclusions about your corresponding population).

Marginal probabilities

A marginal probability makes a probability out of the marginal total, for either the rows or the columns. A marginal probability represents the proportion of the entire group that belongs in that single row or column category. Each marginal probability represents only one category for only one variable; it doesn’t consider the other variable at all. In the cell-phone example, you have four possible marginal probabilities (refer to Table 13-3):

Marginal probability of female (508/1,016 = 0.50). That means 50 percent of all the cell-phone users in this sample were females.

Marginal probability of male (508/1,016 = 0.50). That means 50 percent of all the cell-phone users in this sample were males.

Marginal probability of using a cell phone for personal calls (752/1,016 = 0.74). Therefore, 74 percent of all cell-phone users in this sample make personal calls with their cell phones.

Marginal probability of not using a cell phone for personal calls (264/1,016 = 0.26). In other words, 26 percent of all the cell-phone users in this sample don’t make personal calls with their cell phones.

Statisticians use shorthand notation for all probabilities. If you let M = male, F = female, Yes = personal cell-phone use, and No = no personal cell-phone use, then each of the preceding marginal probabilities is written this way:

P(F) = 0.50

P(M) = 0.50

P(Yes) = 0.74

P(No) = 0.26

Notice that P(F) and P(M) add up to 1.00. This result is no coincidence, because these two categories make up the entire gender variable. Similarly, P(Yes) and P(No) sum up to 1.00 because those choices are the only two for the personal cell-phone use variable. Everyone has to be classified somewhere.
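The same four marginal probabilities can be sketched in a few lines of Python. The helper name marginal_probabilities is hypothetical (I made it up for this illustration); the totals come straight from Table 13-3.

```python
import math

# Marginal totals and grand total from Table 13-3.
row_totals = {"M": 508, "F": 508}
column_totals = {"Yes": 752, "No": 264}
grand_total = 1016

def marginal_probabilities(totals, grand_total):
    """Divide each marginal total by the grand total."""
    return {category: total / grand_total for category, total in totals.items()}

p_gender = marginal_probabilities(row_totals, grand_total)
p_calls = marginal_probabilities(column_totals, grand_total)

print({k: round(v, 2) for k, v in p_gender.items()})  # {'M': 0.5, 'F': 0.5}
print({k: round(v, 2) for k, v in p_calls.items()})   # {'Yes': 0.74, 'No': 0.26}

# Each variable's marginal probabilities account for everyone in the sample.
assert math.isclose(sum(p_gender.values()), 1.0)
assert math.isclose(sum(p_calls.values()), 1.0)
```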

Be advised that some probabilities aren’t useful in terms of discovering information about the population in general. For example, P(F) = 0.50 in the previous example because the researchers determined ahead of time that they wanted exactly 508 females and exactly 508 males. The fact that 50 percent of the sample is female and 50 percent of the sample is male doesn’t mean that in the entire population of cell-phone users 50 percent are males and 50 percent are females. The sample was just set up that way. If you want to study what proportion of cell-phone users are females and males, you need to take a combined sample instead of two separate ones, and see how many males and females appear in the combined sample.

Joint probabilities

A joint probability gives the probability of the intersection of two categories, one from the row variable and one from the column variable. It’s the probability that someone selected from the whole group has two particular characteristics at the same time. A joint probability is found by taking the cell count for those having both characteristics and dividing by the grand total. In other words, both characteristics happen jointly, or together.

The cell-phone example has four joint probabilities:

The probability that someone from the entire group is male and uses his cell phone for personal calls. This probability is 325/1,016 = 0.32, meaning that 32 percent of all the cell-phone users in this sample are males using their cell phones for personal calls.

The probability that someone from the entire group is male and doesn’t use his cell phone for personal calls is 183/1,016 = 0.18.

The probability that someone from the entire group is female and makes personal calls with her cell phone is 427/1,016 = 0.42.

The probability that someone from the entire group is female and doesn’t make personal calls with her cell phone is 81/1,016 = 0.08.

The notation for the joint probabilities previously listed is as follows, where ∩ represents the intersection of the two categories listed:

P(M ∩ Yes) = 0.32

P(M ∩ No) = 0.18

P(F ∩ Yes) = 0.42

P(F ∩ No) = 0.08

The sum of all the joint probabilities for any two-way table should be 1.00, unless you have a little round-off error, which makes it very close to, but not exactly, 1.00. The sum is 1.00, because everyone in the group is classified somewhere with respect to both variables. It’s like dividing the entire group into four parts and showing which proportion falls into each part.
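Here is a short Python sketch of the joint probabilities for the cell-phone data. Keying the dictionary by (gender, calls) pairs is just one convenient layout I chose; nothing about it is required.

```python
import math

# Cell counts from Table 13-3, keyed by (gender, personal calls).
counts = {
    ("M", "Yes"): 325,
    ("M", "No"): 183,
    ("F", "Yes"): 427,
    ("F", "No"): 81,
}
grand_total = sum(counts.values())

# A joint probability is a cell count divided by the grand total.
joint = {pair: n / grand_total for pair, n in counts.items()}

for (gender, calls), p in joint.items():
    print(f"P({gender} and {calls}) = {p:.2f}")
# P(M and Yes) = 0.32
# P(M and No) = 0.18
# P(F and Yes) = 0.42
# P(F and No) = 0.08

# Everyone lands in exactly one cell, so the four pieces make up the whole.
assert math.isclose(sum(joint.values()), 1.0)
```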

Conditional probabilities

A conditional probability is what you use if you want to compare subgroups in the sample. In other words, if you want to break down the table further, a conditional probability is what you use. Each row has a conditional probability for each cell within the row, and each column has a conditional probability for each cell within that column.

Note: Because conditional probability is one of the sticking points for a lot of students, I want to spend extra time on it. My goal in this section is for you to have a good understanding of what a conditional probability really means and how you can use it in the real world (something many statistics textbooks neglect to mention, I have to say).

Figuring conditional probabilities

Consider the cell-phone example in Table 13-3. Suppose you want to look at just the males who took the survey. The total number of males is 508. You can break this group down into two subgroups by using conditional probability. You can find the probability of using cell phones for personal calls (males only), and you can find the probability of not using cell phones for personal calls (males only). Similarly, you can break down the females by those females who use cell phones for personal calls and those females who don’t.

In each case, to find a conditional probability, you first look at a single row or column of the table that represents the known characteristic about the individuals. The marginal total for that row or column now represents your new grand total, because this group becomes your entire universe when you examine it. Then take each cell count from that row or column and divide it by that row or column’s marginal total.

In the cell-phone example, you have the following conditional probabilities when you break the table down by gender:

The conditional probability that a male uses a cell phone for personal calls is 325/508 = 0.64.

The conditional probability that a male doesn’t use a cell phone for personal calls is 183/508 = 0.36.

The conditional probability that a female uses a cell phone for personal calls is 427/508 = 0.84.

The conditional probability that a female doesn’t use a cell phone for personal calls is 81/508 = 0.16.

To interpret these results, you say that within this sample if you’re male, you’re more likely than not to use your cell phone for personal calls (64 percent compared to 36 percent). However, the percentage of personal-call makers is higher still among females (84 percent make personal calls versus 16 percent who don’t).

The conclusions you can make from two-way tables in this chapter must refer only to the sample, not the population it came from. Before going on to make general statements about the conditional probability within a population, you need to construct a confidence interval for a population proportion (which is equivalent to a probability). See Chapter 3 or your intro stats book for information on a hypothesis test for a population proportion.

Notice that for the males in the previous example, the two probabilities (0.64 and 0.36) add up to 1.00. This is no coincidence. The males have been broken down by cell-phone use for personal calls, and because everyone in the study is a cell-phone user, each male has to be classified in one group or the other. Similarly, the two probabilities for the females sum to 1.00.
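To see the "new grand total" idea in action, here is a minimal Python sketch that conditions on gender. The function name conditional_given_row is my own invention for this example.

```python
# Cell counts from Table 13-3 (rows: gender, columns: personal calls).
counts = {
    "M": {"Yes": 325, "No": 183},
    "F": {"Yes": 427, "No": 81},
}

def conditional_given_row(counts, row):
    """Treat one row as the whole universe: divide each cell by the row total."""
    row_total = sum(counts[row].values())
    return {col: n / row_total for col, n in counts[row].items()}

for gender in counts:
    probs = conditional_given_row(counts, gender)
    print(gender, {col: round(p, 2) for col, p in probs.items()})
# M {'Yes': 0.64, 'No': 0.36}
# F {'Yes': 0.84, 'No': 0.16}
```

Because each row is divided by its own total, the probabilities within a row always sum to 1.00.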

Notation for conditional probabilities

Conditional probabilities are denoted by a straight up-and-down line that lists and separates the event that is known to have happened (what’s given) and the event for which you want to find the probability. You can write the notation like this: P(XX | XX). You place the given event to the right of the line and the event for which you want to find the probability to the left of the line. For example, suppose you know someone is female (F) and you want to find out the chance she is a Democrat (D). In this case, you’re looking for P(D | F). On the other hand, say you know a person is a Democrat and you want the probability that person is female; then you’re looking for P(F | D).

The straight up-and-down line in the conditional probability notation isn’t a division sign; the line is just a line separating events A and B. Also, be careful of the order in which you place A and B into the conditional probability notation. In general, P(A | B) ≠ P(B | A).

Following is the notation used for the conditional probabilities in the cell-phone example:

P(Yes | M) = 0.64. You can say it this way: "The probability of Yes given Male is 0.64."

P(No | M) = 0.36. In human terms, say "The probability of No given Male is 0.36."

P(Yes | F) = 0.84. Say this one with gusto: "The probability of Yes given Female is 0.84."

P(No | F) = 0.16. You translate this notation by saying "The probability of No given Female is 0.16."

You can see that P(Yes | M) + P(No | M) = 1.00 because you’re breaking all males into two groups: those using cell phones for personal calls (Yes) and those not (No). Notice, however, that P(Yes | M) + P(Yes | F) doesn’t sum to 1.00. In the first case, you’re looking only at the males, and in the second case, only at the females.

Comparing two groups with conditional probabilities

One of the most common questions regarding two categorical (qualitative) variables is this: Are they related? To answer this question, you use conditional probabilities. You set up and find the conditional probabilities you need to see whether two variables are related.

To compare the conditional probabilities, take one variable and find the conditional probabilities based on the other variable. Do this for each category of the first variable. Compare those conditional probabilities (you can even graph them for the two groups) and see whether they’re different or the same. (If the conditional probabilities are the same for each group, the variables aren’t related in the sample. If they’re different, the variables are related in the sample.) To be able to generalize the results, you need to use the sample results to draw a conclusion about the overall population involved by doing a Chi-square test (see Chapter 14).

Revisiting the cell-phone example from the previous section, you can ask specifically: Is personal use related to gender? You know that you want to compare cell-phone use for males and females to find out whether use is related to gender. However, it’s very difficult to compare cell counts — for example, 325 males use their phones for personal calls, compared to 427 females. In fact, it’s impossible to compare these numbers without using some total for perspective. Three hundred twenty-five out of what?

You have no way of comparing the cell counts in two groups without creating percentages (dividing each cell count by the appropriate total). Percentages give you a means of comparing two numbers on equal terms. For example, suppose you give a one-question opinion survey (yes, no, no opinion) to a random sample of 1,099 people; 465 respondents said yes, 357 said no, and 277 had no opinion. To truly interpret this information, you’re probably in your head trying to compare these numbers to each other. That’s what percentages do for you. Showing the percentage in each group in a side-by-side fashion gives you a relative comparison of the groups with each other.

But first, you need to bring conditional probabilities into the mix. In the cell-phone example, if you want the percentage of females who use their cell phones for personal calls, you take 427 divided by the total number of females (508) to get 84 percent. Similarly, to get the percentage of males who use their cell phones for personal calls, take the cell count (325) and divide it by that row total for males (508), which gives you 64 percent. This percentage is the conditional probability of using a cell phone for personal calls, given the person is male.

Now you’re ready to compare the males and females by using conditional probabilities. Take the percentage of females who use their cell phones for personal calls and compare it to the percentage of males who use their cell phones for personal calls. By finding these conditional probabilities, you can easily compare the two groups and say that in this sample at least, a higher percentage of females (84 percent) than males (64 percent) use their cell phones for personal calls.

Using graphs to display conditional probabilities

One way to highlight conditional probabilities as a tool for comparing two groups is to use graphs such as a pie chart comparing the results of the other variable for each group or a bar chart comparing the results of the other variable for each group.

Figures 13-1a and 13-1b use two pie charts to compare males and females on cell-phone use. Figure 13-1a shows cell-phone use for only the males; this pie chart shows the conditional distribution of use for (given) males. Figure 13-1b shows the conditional distribution of cell-phone use for (given) females. A comparison of Figures 13-1a and 13-1b shows the slices for cell-phone use aren’t equal (or even close) for males compared to females. That result means that gender and cell-phone use for personal calls are dependent in this sample.

You may be wondering how close the two pie charts need to look (in terms of how close the slice amounts are for one pie compared to the other) in order to say the variables are independent. This question isn’t one you can answer completely until you conduct a hypothesis test for the proportions themselves (see the Chi-square test in Chapter 14). For now, with respect to your sample data, if the difference in the appearance of the slices for the two graphs is enough that you would write a newspaper article about it, then I’d go for dependence. Otherwise, conclude independence.

You can also make a bar chart to show the same idea. (For more info on pie charts and bar charts, see Statistics For Dummies [written by me and published by Wiley] or your intro stats textbook.)

Another way you can make comparisons is to break down the two-way table by the column variable. (You don’t always have to use the row variable for comparisons.) In the cell-phone example (Table 13-3), you can compare the group of personal-call makers to the group of no-personal-call makers and see what percentage in each group is male and female. This type of comparison puts a different spin on the information, because you’re comparing the behaviors to each other, in terms of gender.

With this new breakdown of the two-way table, you get the following:

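The conditional probability of being male, given you use your cell phone for personal calls, is P(M | Yes) = 325/752 = 0.43. Note: The denominator is 752, the total number of people who make personal calls with their cell phones.

The conditional probability of being female, given you use your cell phone for personal calls, is P(F | Yes) = 427/752 = 0.57.

The conditional probability of being male, given you don’t use your cell phone for personal calls, is P(M | No) = 183/264 = 0.69.

The conditional probability of being female, given you don’t use your cell phone for personal calls, is P(F | No) = 81/264 = 0.31.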

Figure 13-1: Pie charts comparing male versus female personal cell-phone use.

Again, the first two probabilities add up to 1.00, because you’re breaking down the personal-call makers according to gender (male or female), and the last two probabilities sum to 1.00, because you’re breaking down the non-personal-call makers by gender (male or female).

The overall conclusions are similar to those found in the previous section, but the specific percentages and the interpretation are different. Interpreting the data this way, if you use your cell phone for personal calls, you’re more likely to be female than male (57 percent compared to 43 percent). And if you don’t use your cell phone to make personal calls, you’re more likely to be male (69 percent versus 31 percent).
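Breaking the table down by column works the same way as breaking it down by row; only the denominator changes. The following Python sketch shows one way to do it, with conditional_given_column being a hypothetical helper name of my own.

```python
# Cell counts from Table 13-3 (rows: gender, columns: personal calls).
counts = {
    "M": {"Yes": 325, "No": 183},
    "F": {"Yes": 427, "No": 81},
}

def conditional_given_column(counts, column):
    """Treat one column as the whole universe: divide each cell by the column total."""
    column_total = sum(row[column] for row in counts.values())
    return {gender: row[column] / column_total for gender, row in counts.items()}

for calls in ("Yes", "No"):
    probs = conditional_given_column(counts, calls)
    print(calls, {gender: round(p, 2) for gender, p in probs.items()})
# Yes {'M': 0.43, 'F': 0.57}
# No {'M': 0.69, 'F': 0.31}
```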

To get the correct answer for any probability in a two-way table, here’s the trick: Always be sure to identify the group that is being examined. What is the probability "out of"? In the cell phone example (refer to Table 13-3):

If you want the percentage of all users who are males using their phones for personal calls, then you take the cell count 325, and divide by 1,016, the grand total.

If you want the percentage of males who are using their cell phones for personal calls, you take 325 divided by 508, the total number of males.

If you want the percentage of personal-call makers who are male, you take 325 divided by 752 (the total number of people who make personal calls with their cell phones).

In each of these three cases, the numerator is the same, but the denominators are different, leading you to very different answers. Deciding which number to divide by is a very common source of confusion for people, and this trick can really help give you an edge on keeping it straight.
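The "out of what?" trick is easy to see side by side. In this small Python sketch (variable names are mine), the numerator never changes; only the group you divide by does.

```python
# One numerator from Table 13-3: males who make personal calls.
males_yes = 325

# Three different "out of" groups.
grand_total = 1016   # all cell-phone users in the sample
male_total = 508     # all males
yes_total = 752      # all personal-call makers

joint = males_yes / grand_total       # P(M and Yes): out of everyone
given_male = males_yes / male_total   # P(Yes | M): out of males
given_yes = males_yes / yes_total     # P(M | Yes): out of personal-call makers

print(round(joint, 2), round(given_male, 2), round(given_yes, 2))
# 0.32 0.64 0.43
```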

Trying to be Independent

Independence is a big deal in statistics. The term generally means that two items have outcomes whose probabilities don’t affect each other. The items could be events A and B, variables X and Y, or survey results from two people selected at random from a population, and so on. If the outcomes of the two items do affect each other, statisticians call those two items dependent (or not independent). In this section, you check for and interpret independence of two categories of qualitative variables in a sample, and you check for and interpret independence of two qualitative variables in a sample.

Checking for independence between two categories

Statistics instructors often have students check to see whether two categories (one from a qualitative variable X and the other from a qualitative variable Y) are independent. I prefer to just compare the two groups and talk about how similar or different the percentages are, broken down by another variable. However, to cover all the bases and make sure you can answer this very popular question, here’s the official definition of independence, straight from the statistician’s mouth: Two categories are independent if their joint probability equals the product of their marginal probabilities. The only caveat here is that neither of the categories can be completely empty.

For example, if being female is independent of being a Democrat, then P(F ∩ D) = P(F) * P(D), where D = Democrat and F = Female. So, to show that two categories are independent, find the joint probability and compare it to the product of the two marginal probabilities. If you get the same answer both times, the categories are independent. If not, then the categories are not independent, but rather, they are dependent.

You may be wondering: Don’t all probabilities work this way, where the joint probability equals the product of the marginals? No, they don’t. For example, if you draw a card from a standard 52-card deck, you get a red card with probability 1/2. You draw a black card with probability 1/2. The chance, though, of drawing both a black and red card with one draw is 0, while the product of the probabilities for black times red comes out to 1/2 * 1/2 = 1/4.

Now, if you look at a red card that is a two, the joint probability of a red two, which is 2/52 = 1/26, equals the probability of a red card (1/2) times the probability of a two, which is 4/52 (because 1/2 * 4/52 = 1/26).
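The card arithmetic can be checked exactly with Python’s fractions module, which avoids any round-off. This is only a sketch of the product rule applied to the two card examples; the variable names are mine.

```python
from fractions import Fraction

# A standard 52-card deck: 26 red, 26 black, four twos (two of them red).
p_red = Fraction(26, 52)
p_black = Fraction(26, 52)
p_two = Fraction(4, 52)

# One card can't be both red and black, so the joint probability is 0.
p_red_and_black = Fraction(0, 52)
print(p_red_and_black == p_red * p_black)  # False, so red and black are dependent

# Two of the 52 cards are red twos.
p_red_and_two = Fraction(2, 52)
print(p_red_and_two == p_red * p_two)      # True, so red and two are independent
```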

Another way to check for independence is to compare the conditional probability to the marginal probability. Specifically, if you want to check whether being female is independent of being Democrat, check either of the following two situations (they’ll both work if the variables are independent):

Is P(F | D) = P(F)? That is, if you know someone is a Democrat, does that leave the chance that they are female unchanged? If the two probabilities are equal, F and D are independent. If not, F and D are dependent.

Is P(D | F) = P(D)? This question is asking whether being female leaves your chances of being a Democrat unchanged. If the two probabilities are equal, D and F are independent. If not, D and F are dependent.

Is knowing that you’re in one category going to change the probability of being in another category? If so, the two categories aren’t independent. If it doesn’t affect the probability, then the two categories are independent.

Checking for independence between two variables

The discussion in the previous section focuses on checking if two specific categories are independent in a sample. If you want to extend this idea to showing that two entire categorical variables are independent, you must check the independence conditions for every combination of categories in those variables. All of them must work, or independence is lost. The first case where dependence is found between two categories means that the two variables are dependent. If you find that the first case shows independence, you must continue checking all the combinations before declaring independence.

Suppose a doctor’s office wants to know whether calling patients to confirm their appointments is related to whether they actually show up. The variables are X = called the patient (called or didn’t call) and Y = patient showed up for their appointment (showed or didn’t show). Here are the four conditions that need to hold before you declare independence:

1. P(showed) = P(showed | called)

2. P(showed) = P(showed | didn’t call)

3. P(didn’t show) = P(didn’t show | called)

4. P(didn’t show) = P(didn’t show | didn’t call)

If any one of these conditions isn’t met, you stop there and declare the two variables to be dependent in the sample. If (and only if) all the conditions are met, you declare the two variables independent in the sample.

You can see the results of a sample of 100 randomly selected patients in Table 13-4.

Table 13-4

Confirmation Calls Related to Showing Up for the Appointment

                 Called    Didn’t Call    Row Totals
Showed           57        33             90
Didn’t Show      3         7              10
Column Totals    60        40             100

Checking the conditions for independence, you can start at the first condition and check to see whether P(showed) = P(showed | called). From the last column of Table 13-4, you can see that P(showed) is equal to 90/100 = 0.90, or 90 percent. Next, you can find P(showed | called) by looking at the first column of Table 13-4. This probability is 57/60 = 0.95, or 95 percent. Because these two probabilities aren’t equal (although they’re close), you say that showing up and calling first are dependent. In other words, people come a little more often when you call them first. (To determine whether these sample results carry through to the population, which also takes care of the question of how close the probabilities need to be in order to conclude independence, see Chapter 14.)
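Here is one way to sketch the all-conditions check in Python. The function independent_in_sample is a hypothetical helper I wrote for this illustration, and it assumes no column is completely empty. It uses exact fractions, so "equal" means exactly equal in the sample; it says nothing about the population. The second table is made-up data (same margins as Table 13-4) included only to show what an independent sample would look like.

```python
from fractions import Fraction

def independent_in_sample(counts):
    """Return True only if P(row) == P(row | column) for every row/column pair."""
    rows = list(counts)
    cols = list(next(iter(counts.values())))
    col_totals = {c: sum(counts[r][c] for r in rows) for c in cols}
    grand_total = sum(col_totals.values())
    for r in rows:
        marginal = Fraction(sum(counts[r].values()), grand_total)
        for c in cols:
            conditional = Fraction(counts[r][c], col_totals[c])
            if conditional != marginal:
                return False  # one failed condition is enough for dependence
    return True

# Table 13-4: confirmation calls versus showing up.
appointments = {
    "Showed": {"Called": 57, "Didn't call": 33},
    "Didn't show": {"Called": 3, "Didn't call": 7},
}

# Made-up counts with the same margins, where calling makes no difference.
no_effect = {
    "Showed": {"Called": 54, "Didn't call": 36},
    "Didn't show": {"Called": 6, "Didn't call": 4},
}

print(independent_in_sample(appointments))  # False
print(independent_in_sample(no_effect))     # True
```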

Demystifying Simpson’s Paradox

Simpson’s Paradox is a phenomenon where results appear to be in direct contradiction to one another, which can make even the best student’s heart race. This situation can go unnoticed unless three variables (or more) are examined, in which case you organize the results into a three-way table, with columns within columns or rows within rows.

Simpson’s Paradox is a favorite among statistics instructors (because it’s so mystical and magical — and the numbers get so gooey and complex) but Simpson’s Paradox is a nonfavorite among many students, mainly because of the following two reasons (in my opinion):

Due to the way Simpson’s Paradox is presented in most statistics courses, you can easily get buried in the details and have no hope of seeing the big picture: Simpson’s Paradox presents a big problem in terms of interpreting data, and you need to understand it fully in order to avoid it.

Most textbooks do a good job of showing you examples of Simpson’s Paradox, but they do a not-so-good job of explaining why it occurs (some even neglect to explain the why part at all).

My goals in this section are for you to know what Simpson’s Paradox is, to be able to understand and explain why and how it happens, and to know how to be watchful for it. This is a tall order, I know, but stick with me.

Experiencing Simpson’s Paradox

Simpson’s Paradox was described in 1951 by a British statistician named E. H. Simpson. He realized that if you analyze some data sets one way, by breaking them down by two variables only, you can get one result, but when you break the data down further by a third variable, the results switch direction. That’s why his result is called Simpson’s Paradox, a paradox being an apparent contradiction in results.

In the following sections, you can see Simpson’s Paradox play out in an example and all the details in between.

Simpson’s Paradox in action: Video games and the gender gap

Suppose I am interested in finding out who is better at playing video games, men or women. I watch males and females choose and play a variety of video games, and each time someone plays a video game, I record whether he or she wins or loses. Suppose I record the results of 200 video games, as seen in Table 13-5. (Note that the females played 120 games, and the males played 80 games.)

Table 13-5

Video Games Won and Lost for Males versus Females

All Games                 Won    Lost    Marginal Row Totals
Males                     44     36      80
Females                   84     36      120
Marginal Column Totals    128    72      200 (Grand Total)

Looking at Table 13-5, you see the proportion of males who won their video games, P(Won | Male), is 44/80 = 0.55. The proportion of females who won their video games, P(Won | Female), is 84/120 = 0.70. So overall, the females won more of their video games than the males did. Does this finding mean that women are better than men at video games in general in the sample?

Not so fast, my friend. Notice that the people in the study were allowed to choose the video games they played. This factor blows the study wide open. Suppose females and males choose different types of video games: Can this affect the results? The answer may be yes. Considering other variables that could be related to the results but weren’t included in the original study (or at least not in the original data analysis) is important. These additional variables that cloud the results are called confounding variables.

Factoring in difficulty level

Many people may expect the video game results from the previous section to be turned around, that men are better at playing video games than women. According to the research, men spend more time playing video games, on average, and are by far the primary purchasers of video games, compared to women. So what explains the eyebrow-raising results in this study? Is there another possible explanation? Is important information missing that is relevant to this case?

One of the variables that wasn’t considered when I made Table 13-5 was the difficulty level of the video game being played. Suppose I go back and include the difficulty level of the chosen game each time, along with each result (won or lost). Level one indicates easy video games, comparable to the level of Ms. Pac Man (games that are my speed), and level two means more challenging video games (like war games or sophisticated strategy games).

Table 13-6 represents the results with this new information added on difficulty level of games played. You have three variables now: level of difficulty (one or two); gender (male or female); and outcome (won or lost). Statisticians therefore call Table 13-6 a three-way table.

Table 13-6

A Three-Way Table for Gender, Game Level, and Game Outcome

           Level-One Games    Level-Two Games
           Won     Lost       Won     Lost
Males      9       1          35      35
Females    72      18         12      18

Note in Table 13-6 that the number of level-one video games chosen was 9 + 1 + 72 + 18 = 100, and the number of level-two video games chosen was 35 + 35 + 12 + 18 = 100. But now you need to look at who chose which level of game. The next section probes this very issue.

Comparing success rates with conditional probabilities

To compare the success rates for males versus females using Table 13-6, you can figure out the appropriate conditional probabilities, first for level-one games and then for level-two games.

For level-one games (only), the conditional probability of winning given male is P(Won | Male) = 9/10 = 0.90. So for the level-one games, males won 90 percent of the games they played. For level-one games, the percentage of games won by the females is P(Won | Female) = 72/90 = 0.80, or 80 percent. These results mean that at level one, the males did 10 percentage points better than the females at winning their games. But this result appears to contradict the results found in Table 13-5. (Just wait; the contradictions don’t end here!)

Now figure the conditional probabilities for the level-two video games won. For the men, the percentage of males winning level-two games was 35/70 = 0.50, or 50 percent. For the ladies, the percentage of women winning level-two games was 12/30 = 0.40, or 40 percent. Once again, the males outdid the females!

Step back and think about this scenario for a minute. Table 13-5 shows that females won a higher percentage of the video games they played overall. But Table 13-6 shows that males won more of the level-one games and that males won more of the level-two games. What’s going on? No need to check your math. No mistakes were made, and no tricks were pulled. This inconsistency in results happens in real life from time to time in situations where an important third variable is left out of a study, a situation aptly named Simpson’s Paradox. (See why it’s called a paradox?)
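If you’d like to watch the reversal happen with your own eyes, here is a minimal Python sketch using the counts from Table 13-6. The data layout and names are my own; the numbers are the ones in the table.

```python
# (won, lost) counts from Table 13-6, by gender and game level.
results = {
    "Males":   {"Level one": (9, 1),   "Level two": (35, 35)},
    "Females": {"Level one": (72, 18), "Level two": (12, 18)},
}

def win_rate(won, lost):
    """Proportion of games won out of games played."""
    return won / (won + lost)

overall = {}
by_level = {}
for gender, levels in results.items():
    total_won = sum(won for won, _ in levels.values())
    total_lost = sum(lost for _, lost in levels.values())
    overall[gender] = round(win_rate(total_won, total_lost), 2)
    by_level[gender] = {lvl: round(win_rate(*wl), 2) for lvl, wl in levels.items()}

print(overall)   # {'Males': 0.55, 'Females': 0.7}
print(by_level)  # {'Males': {'Level one': 0.9, 'Level two': 0.5}, 'Females': {'Level one': 0.8, 'Level two': 0.4}}
```

Females come out ahead overall, yet males come out ahead within each level. Collapsing the two levels gives back the counts in Table 13-5.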

Asking why: Simpson’s Paradox

Confounding variables are the underlying cause of Simpson’s Paradox. (A confounding variable is a third variable that’s related to each of the other two variables and can affect the results if not accounted for.)

In the video game example, when you look at the video game outcomes (won or lost) broken down by gender only (Table 13-5), females won a higher percentage of their overall games than males (70 percent overall winning percentage for females compared to 55 percent overall winning percentage for males). Yet, when you split up the results by the level of the video game (level one or level two; see Table 13-6), the results reverse themselves, and you see that males did better than females on the level-one games (90 percent to 80 percent), and males also did better on the level-two games (50 percent versus 40 percent).

To see why this seemingly impossible result happens, take a look at the marginal row probabilities versus the marginal row totals in Table 13-6 (for the level-one games). The percentage of times a male won when he played an easy video game was 90 percent. However, males chose level-one video games only 10 times (out of 80 total games played by men; that’s only 12.5 percent).

To break this idea down further, the males’ non-stellar performance on the challenging video games (50 percent — but still better than the females) coupled with the fact that the males chose challenging video games 70 out of 80 = 87.5 percent of the time really brought down that overall winning percentage (55 percent). And even though the men did really well on the level-one video games, they didn’t play many of them (compared to the females), so their high winning percentage on level-one video games (90 percent) didn’t count much toward their overall winning percentage.

Meanwhile, in Table 13-6, you see that females chose level-one video games 90 times (out of 120). Even though the females only won 72 out of the 90 games (80 percent, a lower percentage than the males), they chose to play many more of the level-one games, boosting their overall winning percentage.

Now the opposite situation happens when you look at the level-two video games in Table 13-6. The males chose the harder video games 70 times (out of 80), while the females only chose the harder ones 30 times out of 120. The males did better than the females on level-two video games (winning 50 percent of them versus 40 percent for the females). However, level-two video games are harder to win than level-one video games. This factor means that the males’ winning percentage on level-two video games, being only 50 percent, doesn’t contribute much to their overall winning percentage. However, the low winning percentage for females on level-two video games doesn’t hurt them much, because they didn’t play many level-two video games.

The bottom line is that the occurrence or non-occurrence of Simpson’s Paradox is a matter of weights. In the overall totals from Table 13-5, the males don’t look as good as the females. But when you add in the difficulty of the games (shown in Table 13-6), you see that most of the males’ wins came from harder games (which have a lower winning percentage). The females played many more of the easier games on average, and easy games have a higher chance of winning no matter who plays them. So it all boils down to this: Which games did the males choose to play, and which games did the females choose to play? The males chose harder games, which contributed in a negative way to their overall winning percentage and made the females look better than they actually were.
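If you want to see the weights at work, here is a minimal Python sketch. The win counts are reconstructed from the percentages quoted in this section (for example, 90 percent of 10 level-one games is 9 wins), so treat them as an assumption rather than a copy of Tables 13-5 and 13-6. The names `games` and `overall_rate` are made up for illustration.

```python
# (games won, games played) per level, reconstructed from the quoted percentages.
games = {
    "males":   {"level one": (9, 10),  "level two": (35, 70)},
    "females": {"level one": (72, 90), "level two": (12, 30)},
}

def overall_rate(levels):
    """Overall winning rate as a weighted average of the per-level rates."""
    total_played = sum(played for _, played in levels.values())
    rate = 0.0
    for won, played in levels.values():
        weight = played / total_played     # share of this group's games at this level
        rate += weight * (won / played)    # weight times the per-level winning rate
    return rate

for group, levels in games.items():
    for level, (won, played) in levels.items():
        print(f"{group:8s} {level}: {won}/{played} = {won / played:.0%}")
    print(f"{group:8s} overall: {overall_rate(levels):.0%}")
```

The males win at a higher rate within each level, yet their overall rate is lower because 70 of their 80 games carry the 50 percent level-two rate.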

Level of game wasn’t included in the original summary, Table 13-5, but it should have been included because it’s a variable that affected the results. Level of game, in this case, was the confounding variable.

Keeping one eye open for Simpson’s Paradox

Simpson’s Paradox shows you the importance of including data about possible confounding variables when attempting to look at relationships between qualitative variables.

In the video game example I use in previous sections, level of difficulty of the game was a confounding variable; more men chose to play the more difficult games, which are harder to win, thereby lowering their overall success rate.

You can avoid Simpson’s Paradox by making sure that obvious confounding variables are included in a study; that way, when you look at the data you get the relationships right the first time, and no room exists for misconstruing the results. And as with all other statistical results, if it looks too good to be true, or too simple to be correct, it probably is! Beware of anyone who tries to oversimplify a result. While three-way tables are more difficult to examine, they are often worth using.

In This Chapter

^ Testing for independence in the population (not just the sample)

^ Using the Chi-square distribution

^ Discovering the connection between the Z-test and the Chi-square test

You’ve seen these hasty judgments before: people who collect one sample of data and try to use it to make conclusions about the whole population. When it comes to two qualitative variables (where the data fall into categories and don’t represent measurements), the problem seems to be even more widespread.

For example, a TV news show finds that out of 1,000 presidential voters, 200 females are voting Republican, 300 females are voting Democrat, 300 males are voting Republican, and 200 males are voting Democrat. The news anchor shows the data and then states that 30 percent (300/1,000) of all voters are females voting Democrat (and so on for the other counts). This conclusion is misleading. It is true that in this sample of 1,000 voters, 30 percent of them are females voting Democrat. However, this result doesn’t automatically mean that 30 percent of the entire population of voters are females voting Democrat. Results change from sample to sample.

People often understand that they can expect sample results to change, yet they don’t seem to realize that some conclusions come out differently due to even small changes in the sample results. For example, if you ask ten people about their views on an issue, you may get six people in favor (the majority) and four against. But the next time you take a sample of ten people, the results may reverse, and you’ll have four people in favor and six people against (the majority). This inconsistency is especially prone to happening if the sample size is small.
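To put a number on how easily a small sample can flip, here is a rough Python sketch. It assumes, purely for illustration, that 60 percent of the whole population is in favor and that people answer independently; neither assumption comes from the text. The function name `prob_at_most` is made up.

```python
from math import comb

def prob_at_most(k, n, p):
    """P(X <= k), where X counts 'in favor' answers among n people and
    each person is independently in favor with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

true_support = 0.60   # assumed population value, for illustration only

for n in (10, 100):
    # The sample "reverses" when fewer than half of the n people are in favor.
    reversal = prob_at_most((n - 1) // 2, n, true_support)
    print(f"n = {n:3d}: chance that fewer than half are in favor = {reversal:.4f}")
```

Under these assumptions the reversal is far from rare with ten people and becomes much less likely with a hundred.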

In this chapter, you see how to move beyond just summarizing the sample results from a two-way table (discussed in Chapter 13) to using those results in a hypothesis test to make conclusions about an entire population. This process requires a new probability distribution called the Chi-square distribution, which you get very familiar with in this chapter. You also find out how to answer a very popular question among researchers: Are these two categorical (qualitative) variables independent (not related to each other) in the entire population?

A Hypothesis Test for Independence

A recent survey conducted by American Demographics asked men and women about the color of their next house. The results showed that 36 percent of the men wanted to paint their houses white, and 25 percent of the women wanted to paint their houses white. Table 14-1 illustrates the results from a sample of 1,000 people (500 men and 500 women).

Table 14-1

Gender and House-Paint Preference: Observed Cell Counts

                         White Paint   Nonwhite Paint   Marginal Row Totals
Men                      180           320              500
Women                    125           375              500
Marginal Column Totals   305           695              1,000 (Grand Total)

The marginal row totals represent the total number in each row; the marginal column totals represent the total number in each column (see Chapter 13 for more information on row and column marginal totals). Notice that of the males, the percentage who want to paint their houses white is 180/500 = 0.36, or 36 percent, as stated previously. And the percentage of females who want to paint their houses white is 125/500 = 0.25, or 25 percent. (Both of these percentages represent conditional probabilities as explained in Chapter 13.)
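If you like to check this kind of arithmetic by machine, the conditional proportions are just each cell divided by its marginal row total. Here is a small Python sketch using the counts from Table 14-1; the names `observed` and `row_proportions` are illustrative only.

```python
# Observed cell counts from Table 14-1: rows are gender, columns are paint choice.
observed = {
    "men":   {"white": 180, "nonwhite": 320},
    "women": {"white": 125, "nonwhite": 375},
}

def row_proportions(table):
    """Conditional proportions: each cell divided by its marginal row total."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: count / row_total for col, count in cells.items()}
    return result

props = row_proportions(observed)
for row, cells in props.items():
    print(row, {col: round(p, 2) for col, p in cells.items()})
```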

The American Demographics report concluded from this data that ". . . men and women agree on exterior house paint colors; the main exception being the top male choice, white (36 percent would paint their next house white versus 25 percent of women)." This type of conclusion is commonly formed, but it’s an overgeneralization of the results at this point. You know that in this sample, more men wanted to paint their houses white than women, but is 180 really that different from 125, with a sample size of 1,000 people whose results will vary the next time you do the survey? How do you know these results carry over to the population of all men and women? That question can’t be answered without a formal statistical procedure called a hypothesis test (see Chapter 3 for the basics on hypothesis tests).

To show that men and women in the population differ according to favorite house color, first note that you have two qualitative variables — gender (male or female) and paint color (white or nonwhite). What you really want to know is whether these two variables are related to each other or not. If they are related, then favorite paint color depends on gender, which means these two variables are dependent. If they aren’t related, then favorite paint color doesn’t depend on gender, and the two variables are independent.

To test whether two qualitative variables are independent, you need a Chi-square test. The steps for the Chi-square test are the following, with full details supplied in the next sections (note that Minitab can conduct this test for you also, from step three on down):

1. Collect your data and summarize it in a two-way table.

These numbers represent the observed cell counts. (For more on two-way tables, see Chapter 13.)

2. Set up your null hypothesis, Ho: Variables are independent; and the alternative hypothesis, Ha: Variables are dependent.

3. Calculate the expected cell counts under the assumption of independence.

The expected cell count for a cell is the row total times the column total divided by the grand total.

4. Check the conditions of the Chi-square test before proceeding; each expected cell count must be greater than or equal to five.

5. Figure the Chi-square test statistic.

This statistic finds the observed cell count minus the expected cell count, squares the difference, and divides it by the expected cell count. Do these steps for each cell and then add them all up.

6. Look up your test statistic on the Chi-square table (Table A-3 in the Appendix) and find the p-value (or one that’s close).

7. If your p-value is less than your prespecified cutoff (the α level), usually 0.05, reject Ho and conclude dependence of the two variables.

If your p-value is greater than the α level, fail to reject Ho; the variables can’t be deemed dependent.


To conduct a Chi-square test in Minitab, enter your data in the spreadsheet exactly as it appears in your two-way table (see Chapter 13 for setting up a two-way table for qualitative data). Go to Stat>Tables>Chi-Square Test. Click on the two variable names in the left-hand box corresponding to your column variables in the spreadsheet. They appear in the Columns Contained in the Table box. Then click on OK.

Collecting and organizing the data

The first step toward any data analysis is collecting your data. In the case of two categorical (qualitative) variables, you collect data on the two variables at the same time for each person. In the house-color example from the previous section, you note each person’s gender, and then ask each person his or her preference for exterior house color. Keeping the data together in pairs (for example: male, white paint; female, nonwhite paint), you then organize it into a two-way table where the rows represent the categories of one qualitative variable (for example, males and females for gender), and the columns represent the categories of the other qualitative variable (for example, white paint and nonwhite paint).

The data for the house-paint example is organized in Table 14-1. You can see by looking at the grand total in the lower-right-hand corner of the table that 1,000 people participated in the survey; you see by the row totals that the 1,000 people were comprised of 500 men and 500 women. The connection between the two pieces of information collected is kept by organizing the data into one two-way table versus two individual tables, one for gender and one for house-paint preference. That way, you can look at the relationship between the two variables. (For the full details on organizing and interpreting the results from a two-way table, see Chapter 13.)

Determining the hypotheses

Every hypothesis test (whether it be a Chi-square test or some other test) has two hypotheses:

A null hypothesis, which you have to believe unless someone shows you otherwise. The notation for this hypothesis is Ho.

An alternative hypothesis, which you want to conclude in the event that you can’t support the null hypothesis anymore. The notation for this hypothesis is Ha.

For a full discussion of hypothesis testing, see my other book Statistics For Dummies (Wiley) or your intro stats textbook. For a quick review, see Chapter 3 of this book.

In the case where you’re testing for the independence of two qualitative variables, the null hypothesis says that no relationship exists between them. In other words, they’re independent. The alternative hypothesis says that the two variables are related, or dependent.

For the paint color example from the previous section, you write Ho: gender and paint color are independent versus Ha: gender and paint color are dependent. You have now completed step two of the Chi-square test.

Figuring expected cell counts

When you’ve collected your data and set up your two-way table (for example, see Table 14-1), you already know what the observed values are for each cell in the table. Now you need something to compare them to. You’re now ready for step three of the Chi-square test: finding expected cell counts. The null hypothesis says that the two variables X and Y are independent. That’s the same as saying X and Y have no relationship. Assuming independence, you can determine which numbers should be in each cell of the table by using a formula for what is called the expected cell counts. (Each individual square in a two-way table is called a cell, and the number that falls into each cell is called the cell count; see Chapter 13 for more information.)

Standing alone: Independent data

In general, independence means that you can find no major difference in the way the rows look, as you move down a column. That is, the proportion of the data falling into each column across the row is about the same for each row. So to find the expected cell counts for any two-way table, take the row total times the column total divided by the grand total, and do this process for each cell in the table.

Table 14-2 shows an example of independent data from a two-way table. Suppose that in this case the table represents data collected from men and women regarding whether they agree with a certain policy (yes or no). The proportion of all men who said yes is 10/60 = 0.17, or 17 percent. When you look at the same percentage for the women, you get the same number, 0.17. For both males and females, you get 50/60 = 0.83, or 83 percent, for the No group. Because males and females voted exactly the same way, these variables are likely going to be independent in the population as well as the sample.

Table 14-2

Gender and Opinion: Observed Cell Counts = Expected Cell Counts (Independent)

                         Yes   No    Marginal Row Totals
Men                      10    50    60
Women                    10    50    60
Marginal Column Totals   20    100   120 (Grand Total)

To get the expected cell count for the upper-left cell in Table 14-2, take 60 (row one total) times 20 (column one total) divided by 120 (grand total) = 10. For the next cell in the first row, you multiply 60 by 100/120 = 50. The same results occur in row two, because the numbers are all the same as in row one. Because Table 14-2 represents two independent variables, you get the same expected cell counts for each row.

Under independence, you can find no difference between what you observed and what you expected.

The expected cell-count formula can actually make sense if you look at it the right way. That is, if the two variables are independent, the proportion of the data falling into each column across the row is about the same for each row. So to find the expected cell count for any cell, you take the row total for the row that cell is in, and you multiply that total by the proportion of the table that falls into the column that cell is in (that is, the column total divided by the grand total).
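Here is a minimal Python sketch of that formula, applied to the counts in Table 14-2. The function name `expected_counts` and the list-of-rows layout are choices made for this illustration, not anything Minitab or the text prescribes.

```python
def expected_counts(observed):
    """Expected cell counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Row total times column total, divided by the grand total, for every cell.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

table_14_2 = [[10, 50],   # men:   yes, no
              [10, 50]]   # women: yes, no

expected = expected_counts(table_14_2)
print(expected)
print(expected == table_14_2)   # observed and expected agree for this table
```

For this table the expected counts come out equal to the observed counts, which is what perfect independence in the sample looks like.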

Tying the knot: Dependent data

If two variables are dependent, then the value of one variable affects the value of the other variable. For example, suppose you believe women chew gum more than men. Then gender and gum chewing would be dependent, because if you knew someone’s gender, that would change the probability of them being a gum chewer. Dependent variables affect each other’s probabilities. In the end, the cell counts you actually observe from variables that are dependent won’t match what you expected the cell counts to look like under Ho: The variables are independent. Big differences between observed and expected cell counts mean that the variables are dependent.

Table 14-3 shows some data that is dependent because the relationship isn’t the same for each row. More men in the sample said no to gum chewing (35/60 = 58 percent) than women in this sample (25/60 = 42 percent). However, this may not hold for all men and women in the population.

Table 14-3

Gum Chewing: Observed Cell Counts

                         Yes   No    Marginal Row Totals
Men                      25    35    60
Women                    35    25    60
Marginal Column Totals   60    60    120 (Grand Total)

Making conclusions about the population based on the sample (observed) data in a two-way table is taking too big of a leap. You need to conduct a Chi-square test in order to broaden your conclusions to the entire population. Ignoring the fact that sample results vary is where the media, and even some researchers, can get into trouble. Stopping with the sample results only and going merrily on your way can lead to conclusions that others can’t confirm when they take new samples.

To check whether a two-way table is dependent, you first find the expected cell counts by taking the row total times the column total divided by the grand total and do this for each cell in the table. For Table 14-3, the expected cell count for the males who chew gum is 60 * 60/120 = 30. The expected cell count for the males who don’t chew gum is 60 * 60/120 = 30. For the females who chew gum, you take 60 * 60/120 = 30, and the same for females who don’t chew gum. If gender and gum chewing are independent, you should expect to observe 30 in each cell (on average).

Next you compare the expected cell counts to the actual observed cell counts by looking at their differences (see Table 14-3 for the observed cell counts and Table 14-4 for the expected cell counts for the gum chewing example). You can see by Table 14-3 that the observed cell counts are 25, 35, 35, and 25. The expected cell count is 30 for each cell, as you can see in Table 14-4. The differences between the observed and expected cell counts are 25 – 30 = -5; 35 – 30 = 5; 35 – 30 = 5; and 25 – 30 = -5. These differences appear to be small with the naked eye, which may indicate gum chewing preference knows no gender. However, until you do a Chi-square test for independence (Chapter 15), you can never really know for sure.

Table 14-4

Gum Chewing: Expected Cell Counts

                         Yes                  No                   Marginal Row Totals
Men                      60 * (60/120) = 30   60 * (60/120) = 30   60
Women                    60 * (60/120) = 30   60 * (60/120) = 30   60
Marginal Column Totals   60                   60                   120 (Grand Total)
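The comparison between the observed counts in Table 14-3 and the expected counts in Table 14-4 can be sketched in a few lines of Python. The variable names are illustrative; only the four observed counts come from the text.

```python
observed = [[25, 35],   # men:   yes, no
            [35, 25]]   # women: yes, no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Observed minus expected, cell by cell.
differences = [[o - e for o, e in zip(obs_row, exp_row)]
               for obs_row, exp_row in zip(observed, expected)]

print("expected:   ", expected)
print("differences:", differences)
```

Notice that the differences cancel out within every row and every column. That’s one reason the test statistic later in this chapter squares them before adding them up.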

Checking the conditions for the test

The time has come for step four of the Chi-square test: checking conditions. The Chi-square test has one main condition that must be met in order to test for independence on a two-way table: The expected count for each cell must be at least five, that is, greater than or equal to five. Expected cell counts that fall below five aren’t reliable in terms of the variability that can take place. This problem is similar to trying to predict the outcome of only five flips of a coin — almost anything can happen. But if you flip the coin more times, you have a better idea of what you can expect to flip.

If you’re analyzing data and you find that your data set doesn’t meet the expected cell count of at least five for one or more cells, you can combine some of your rows and/or columns. This combination makes your table smaller, but it increases the cell counts for the cells that you do have, and that helps.
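As a rough sketch of this check, the Python below tests the condition and then combines the last two columns of a table that fails it. The survey counts are hypothetical (they don’t come from this book), and the function names are made up for the example.

```python
def expected_counts(observed):
    """Expected cell counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def meets_condition(observed, minimum=5):
    """True when every expected cell count is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(observed) for e in row)

def combine_last_two_columns(observed):
    """Merge the last two columns of the table into a single column."""
    return [row[:-2] + [row[-2] + row[-1]] for row in observed]

# Hypothetical counts: three answer categories, the last of which is rare.
survey = [[20, 25, 3],
          [22, 26, 4]]

print(meets_condition(survey))              # the rare column fails the check
combined = combine_last_two_columns(survey)
print(combined, meets_condition(combined))
```

Which categories you combine is a judgment call; the merged category should still make sense for the question you’re asking.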

Calculating the Chi-square test statistic

Every hypothesis test uses data to make the decision about whether or not to reject Ho in favor of Ha. In every hypothesis test, you take information from the data and put it together into a test statistic. The test statistic, in general, finds the distance between your observed results (your data) and the results you expect if Ho were true. If that difference is large, then you reject Ho in favor of Ha. If that difference is small, you fail to reject Ho. (For more information on test statistics, see another book I wrote, Statistics For Dummies [Wiley], or your intro stats book.)

In the case of testing for independence in a two-way table, you use a hypothesis test based on the Chi-square test statistic. In the following sections, you can see the steps for calculating and interpreting the Chi-square test statistic, which is step five of the Chi-square test.

Working out the formula

A major component of the Chi-square test statistic is the expected cell count for each cell in the table. The formula for finding the expected cell count, $e_{ij}$, for the cell in row $i$, column $j$ is

$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}.$$

Note that the values of $i$ and $j$ vary for each cell in the table. In a two-way table, the upper-left cell of the table is in row one, column one. The cell in the upper-right corner is in row one, column two. The cell in the lower-left corner is in row two, column one, and the lower-right-hand cell is in row two, column two.

The formula for the Chi-square test statistic is

$$\chi^2 = \sum_{i}\sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}},$$

where $o_{ij}$ is the observed cell count for the cell in row $i$, column $j$, and $e_{ij}$ is the expected cell count for the cell in row $i$, column $j$.

When you calculate the expected cell count for some cells, you typically get a number that has some digits after the decimal point (in other words, the number isn’t a whole number). Don’t round this number off, despite the temptation to do so. This expected cell count is actually an overall-average expected value, and you can keep the count as it is, with decimal included.

Here are the major steps in how the Chi-square test statistic is calculated (Minitab does these steps for you as well):

1. Subtract the expected cell count from the observed cell count for the upper-left-hand cell in the table.

2. Square the result from step one to make the number positive.

3. Divide the result from step two by the expected cell count.

4. Repeat this process for all the cells in the table and add up all the results.

The final sum that you get is your Chi-square test statistic.

The reason you divide by the expected cell count in the Chi-square test statistic is to account for cell-count sizes. If you expect a big cell count, say 100, and are off by only 5 for the observed count of that cell, that difference shouldn’t count as much as if you expected a small cell count (like 10) and the observed cell count was off by 5. Dividing by the expected cell count puts a more fair weight on the differences that go into the Chi-square test statistic.
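The four steps above can be sketched in Python as follows, using the house-paint counts from Table 14-1. This is an illustration of the arithmetic, not a replacement for your statistical software; the function name `chi_square_statistic` is made up.

```python
def chi_square_statistic(observed):
    """Return the Chi-square statistic and each cell's contribution to it."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    contributions = []
    for i, row in enumerate(observed):
        contribution_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count, not rounded
            contribution_row.append((o - e) ** 2 / e)        # steps one through three
        contributions.append(contribution_row)

    statistic = sum(sum(row) for row in contributions)       # step four
    return statistic, contributions

house_paint = [[180, 320],   # men:   white, nonwhite
               [125, 375]]   # women: white, nonwhite

statistic, contributions = chi_square_statistic(house_paint)
print(round(statistic, 2))
for row in contributions:
    print([round(c, 3) for c in row])
```

The same squared difference counts for more in a cell with a small expected count than in a cell with a large one, because each term is divided by its own expected count.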

To perform a Chi-square test in Minitab, enter the raw data (the data on each person) in two columns. The first column is the values of your first variable in your data set. (For example, if your first variable is gender, go down the column entering the gender of each person.) Then enter your second variable in the second column, using the same row to represent each person in the data set. (If your second variable is paint preference, for example, enter each person’s house-paint preference in column two, keeping the data from each person together in each row.) Go to Stat>Tables>Cross-tabulation and Chi-square. (But don’t stop here: Keep reading.)

On the left-hand side, click on the variable that you wish to be in the rows of your two-way table (you may click on the first variable if you wish). Click Select, and the variable name appears in the row variable portion of the table on the right. Then go to the column variable blank on the right-hand side and click on it. You will be asked to choose your column variable. Go to the left-hand side and click on the name of your second variable. Click Select. Then click on the Chi-square button and choose Chi-square analysis by checking the box. If you want the expected cell counts included, check that box also. Then click OK, and OK.

The Chi-square test statistic can never be negative, because it’s built on sums of squares of differences in the numerator and expected cell counts in the denominator (which are always positive).

The Minitab output for the Chi-square analysis for the house-paint example (from Table 14-1) is shown in Figure 14-1. You can pick out quite a few numbers from the output in Figure 14-1 that are especially important. First, you see three numbers listed in each cell. The first (top) number is the observed cell count for that cell; this matches the observed cell count for each cell shown in Table 14-1. (Notice the marginal row and column totals of Figure 14-1 also match those from Table 14-1.)

The second number in each cell of Figure 14-1 is the expected cell count for that cell; you find it by taking the row total times the column total divided by the grand total (see the section "Figuring the expected cell counts"). For example, the expected cell count for the upper-left cell (males who prefer white house paint) is 500 * 305/1,000 = 152.50.

The third number in each cell of Figure 14-1 is that part of the Chi-square test statistic that comes from that cell. (See steps one through three of the previous section, "Working out the formula.") The sum of the third numbers in each cell equals the value of the Chi-square statistic listed in the last line of the output. (For the house-paint example, the Chi-square test statistic is 14.27.)

Interpreting the Chi-square test statistic is step six of the Chi-square test; you work through that process in the next section.

Chi-Square Test: Gender, House-Paint Preference

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

         White Paint   Nonwhite Paint   Total
M        180           320              500
         152.50        347.50
         4.959         2.176

F        125           375              500
         152.50        347.50
         4.959         2.176

Total    305           695              1000

Chi-Sq = 14.271, DF = 1, P-Value = 0.000

Figure 14-1: Minitab output for the house-paint data.

Finding your results on the Chi-square table

The only way to be able to make an assessment about your Chi-square test statistic is to compare it to all the possible Chi-square test statistics you would get if you had a two-way table with the same row and column totals, yet you distributed the numbers in the cells in every way possible. (You can do that in your sleep, right?) Some resulting tables give large Chi-square test statistics, and some give small Chi-square test statistics.

Putting all these Chi-square test statistics together gives you what’s called a Chi-square distribution. You find your particular test statistic on that distribution (step six of the Chi-square test), and see where it stands compared to the rest. If your test statistic is large enough that it appears way out on the right tail of the Chi-square distribution (boldly going where no test statistic has gone before), you reject Ho. If the test statistic isn’t that far out, then you can’t reject Ho.

In the next sections, you find out more about the Chi-square distribution and how it behaves, so you can make a decision about the independence of your two variables based on your Chi-square statistic.


Determining degrees of freedom

Each type of two-way table has its own Chi-square distribution, depending on the number of rows and columns it has, and each Chi-square distribution is identified by its degrees of freedom. In general, a two-way table with r rows and c columns uses a Chi-square distribution with (r – 1) * (c – 1) degrees of freedom. A two-way table with two rows and two columns uses a Chi-square distribution with one degree of freedom. Notice that 1 = (2 – 1) * (2 – 1). A two-way table with three rows and two columns uses a Chi-square distribution with (3 – 1) * (2 – 1) = 2 degrees of freedom.

Understanding why degrees of freedom are calculated this way is likely to be beyond the scope of your statistics class. But if you really want to know, the degrees of freedom represent the number of cells in the table that are flexible, or "free," given all the marginal row and column totals. For example, suppose that a two-way table with two rows and two columns has all row and column totals equal to 100 and the upper-left cell is 70. Then the upper-right cell must be 100 (row total) – 70 = 30. Because the column one total is 100, and the upper-left cell count is 70, the lower-left cell count must be 100 – 70 = 30. Similarly, the lower-right cell count must be 100 – 30 = 70.

So you have only one free cell in a two-by-two table after you have the marginal totals set up. That’s why a two-by-two table has 1 degree of freedom. In general, you always lose one row and one column because of knowing the marginal totals, because these last row and column values can be calculated through subtraction. That’s where the formula (r - 1) * (c - 1) comes from. (That’s more than you wanted to know, isn’t it?)
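To see the "one free cell" idea in action, here is a short Python sketch. It computes the degrees of freedom and then rebuilds a whole two-by-two table from its marginal totals plus the upper-left cell. The function names are made up for this illustration.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a two-way table: (r - 1) * (c - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def fill_two_by_two(upper_left, row_totals, col_totals):
    """Recover a whole 2x2 table from its margins and the one free cell."""
    upper_right = row_totals[0] - upper_left   # rest of row one
    lower_left = col_totals[0] - upper_left    # rest of column one
    lower_right = row_totals[1] - lower_left   # rest of row two
    return [[upper_left, upper_right], [lower_left, lower_right]]

print(degrees_of_freedom(2, 2), degrees_of_freedom(3, 2))
print(fill_two_by_two(70, row_totals=(100, 100), col_totals=(100, 100)))
```

Once the upper-left cell is chosen, subtraction pins down every other cell, so only one cell was ever free to vary.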

Discovering how Chi-square distributions behave

Figure 14-2 shows pictures of Chi-square distributions with one, two, four, six, eight, and ten degrees of freedom, respectively. Here are some important points about Chi-square distributions:

For one degree of freedom, the distribution looks like a hyperbola (see Figure 14-2, top left); for more than one degree of freedom, it looks like a mound that has a long right tail (see Figure 14-2, lower right).

All the values are greater than or equal to zero.

The shape is always skewed to the right (tail going off to the right).

As the number of degrees of freedom increases, the mean (the overall average) increases (moves to the right) and the variance increases (resulting in more spread).

No matter what the degrees of freedom are, the height of the Chi-square distribution (known as the density) approaches zero for increasingly larger Chi-square values. That means that larger and larger Chi-square values are less and less likely to happen.
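If you’d like to poke at these curves yourself, the Python sketch below evaluates the standard Chi-square density formula at a few points. The formula isn’t given in this chapter, so it’s brought in here as background; so are the standard facts that the mean equals the degrees of freedom and the variance equals twice the degrees of freedom.

```python
from math import exp, gamma

def chi_square_density(x, df):
    """Height of the Chi-square curve with `df` degrees of freedom at x > 0."""
    k = df / 2
    return x ** (k - 1) * exp(-x / 2) / (2 ** k * gamma(k))

for df in (1, 2, 4, 6, 8, 10):
    heights = [chi_square_density(x, df) for x in (1, 5, 10, 20, 40)]
    print(f"DF = {df:2d}: mean = {df:2d}, variance = {2 * df:2d}, heights:",
          " ".join(f"{h:.5f}" for h in heights))
```

Reading across any row of the output, the heights shrink toward zero as the Chi-square value grows, which is the long right tail.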

Figure 14-2: Chi-square distributions with 1, 2, 4, 6, 8, and 10 degrees of freedom (moving from upper left to lower right).

Using the Chi-square table

After you find your Chi-square test statistic and its degrees of freedom, you want to determine how large your statistic is, relative to its corresponding distribution. (You’re now venturing into step seven of the Chi-square test.) If you think about it graphically, you want to find the probability of being beyond (getting a larger number than) your test statistic. If that probability is small, your Chi-square test statistic is something unusual — it’s out there — and you can reject Ho. You then conclude that your two variables are not independent (they are related somehow).

In case you’re following along at home, the Chi-square test statistic for the independent data from Table 14-2 is zero, because the observed cell counts are equal to the expected cell counts for each cell, and their differences are always equal to zero. (This result never happens in real life!) This scenario represents a perfectly independent situation and results in the smallest possible value of a Chi-square test statistic.

If the probability of being to the right of your Chi-square test statistic (on a graph) isn’t small enough, you don’t have enough evidence to reject Ho. You then stick with Ho; you can’t reject it. You can’t conclude that your two variables are dependent (related).

How small of a probability do you need to reject Ho? For most hypothesis tests, statisticians generally use 0.05 as the cutoff. (For more information on cutoff values, also known as α levels, flip to Chapter 3, or check out my other book Statistics For Dummies [Wiley].)

Your job now is to find the probability of being beyond your Chi-square test statistic on the corresponding Chi-square distribution with (r – 1) * (c – 1) degrees of freedom. Each Chi-square distribution is different, and because the number of possible degrees of freedom is infinite, showing every single value of every Chi-square distribution isn’t possible. In Table A-3 (in the Appendix in the back of this book), you see some of the most important values on each Chi-square distribution with degrees of freedom from 1 to 50.

To use the Chi-square table (Table A-3 in the Appendix), you find the row that represents your degrees of freedom (abbreviated Df). Move across that row until you reach the value that is closest to your Chi-square test statistic, without going over. (It’s like a game show, when you’re trying to win the showcase by guessing the price.) Then go to the top of the column you’re in. That number represents the area to the right (above) of the Chi-square test statistic you saw in the table. The area above your particular Chi-square test statistic is less than or equal to this number. This result is the approximate P-value of your Chi-square test.

Using the house-paint example (see Figure 14-1), the Chi-square test statistic was 14.27. You have (2 – 1) * (2 – 1) = 1 degree of freedom. On Table A-3 (in the Appendix), you go to the row for Df = 1, and go across to the number closest to 14.27 (without going over). That number is 7.88, in the last column. (This number is much less than 14.27, but it’s the biggest number on the table for that row.) The number at the top of that column is 0.005.
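If you’d like to see the "closest without going over" rule spelled out, here is a minimal Python sketch. The row of critical values for 1 degree of freedom uses standard published Chi-square values; the exact columns printed in Table A-3 may differ, so treat `DF1_ROW` and the function name `approximate_p_value` as illustrative assumptions.

```python
# Upper-tail areas and critical values for the Chi-square distribution
# with 1 degree of freedom. These are standard published values; the
# columns that appear in Table A-3 are an assumption here.
DF1_ROW = [(0.10, 2.71), (0.05, 3.84), (0.025, 5.02), (0.01, 6.63), (0.005, 7.88)]


def approximate_p_value(statistic, row):
    """Return the upper bound on the p-value read from one table row.

    Walk across the row and keep the area for the largest critical
    value that does not exceed the statistic. Return None when the
    statistic is smaller than every tabled value, meaning the p-value
    is larger than the first column's area.
    """
    bound = None
    for area, critical_value in row:  # ordered from largest area to smallest
        if critical_value <= statistic:
            bound = area
    return bound


# House-paint example: 14.27 is past the last column, so p-value <= 0.005.
print(approximate_p_value(14.27, DF1_ROW))
```

The function only ever gives a bound, never the exact p-value, which is exactly the limitation of reading the table by hand.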

Drawing your conclusions

You have two alternative ways to draw conclusions from the Chi-square test statistic. You can look up your test statistic on the Chi-square table (located in Table A-3 in the Appendix) and see the probability of being greater than that. This method is known as approximating the p-value. (The p-value of a test statistic is the probability of being at or beyond your test statistic on the distribution to which the test statistic is being compared; in this case, the Chi-square distribution.) Or you can have the computer calculate the exact p-value for your test. (For more on p-values and α levels, see my other book Statistics For Dummies. For a quick review on these topics, see Chapter 3 of this book.)

Before you do anything though, set your α, the cutoff probability for your p-value, in advance. If your p-value is less than your α level, reject Ho. If it is more, you can’t reject Ho.

Approximating the p-value from the table

For the house-paint example (see Figure 14-1), the Chi-square test statistic was 14.27 with 1 df (degree of freedom). The closest number in row one of Table A-3 (in the Appendix), without going over, is 7.88 (in the last column). The number at the top of that column is 0.005. This number is less than your typical α level of 0.05, so you reject Ho. You know that your p-value is less than 0.005 because your test statistic was more than 7.88. In other words, if 7.88 is the minimum evidence you need to reject Ho, you have more evidence than that with a value of 14.27. More evidence against Ho means a smaller p-value. However, because Table A-3 only gives a few values for each Chi-square distribution, the best you can say using this table is that your p-value for this test is less than 0.005.

Here’s the big news: Because your p-value is less than 0.05, you can conclude based on this data that gender and house-paint color are likely to be related in the population (dependent), like the Demographics Survey said (located at the beginning of this chapter). Only now, you have a formal statistical analysis that says this result found in the sample is also likely to occur in the entire population. This statement is much stronger!

If your data shows you can reject Ho, you only know at that point that the two variables have some relationship. The Chi-square test statistic doesn’t tell you what that relationship is. In order to explore the relationship between the two variables, you find the conditional probabilities in your two-way table (see Chapter 13). You can use those results to give you some ideas as to what may be happening in the population. For example, in the house-paint data (because paint preference is related to gender), you can examine the relationship further by first finding the percentage of men who prefer white houses, which comes out to 180/500 = 0.36, or 36 percent, calculated from Table 14-1. Now compare this result to the percentage of women who prefer white houses: 125/500 = 0.25, or 25 percent. You can now conclude that in this population (not just the sample), men prefer white houses more than women do.
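As a rough illustration of finding conditional proportions within each row of a two-way table, here is a short Python sketch. It assumes the counts are 180 of 500 men and 125 of 500 women preferring white houses; the non-white counts are not quoted in this section and are filled in only so that each row totals 500. The function name `conditional_proportions` is made up for this example.

```python
# Two-way table of counts: rows are genders, columns are paint preferences.
# The white counts match the percentages in the text (36% and 25%); the
# non-white counts are an assumption so that each row totals 500.
table = {
    "men": {"white": 180, "non-white": 320},
    "women": {"white": 125, "non-white": 375},
}


def conditional_proportions(row):
    """Turn one row of counts into proportions that sum to 1."""
    total = sum(row.values())
    return {category: count / total for category, count in row.items()}


for gender, row in table.items():
    proportions = conditional_proportions(row)
    print(f"{gender}: {proportions['white']:.0%} prefer white houses")
```

Comparing the two printed percentages side by side is what tells you the direction of the relationship, which the Chi-square statistic alone can’t do.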

Extracting the p-value from computer output

After Minitab calculates the test statistic for you, it reports the exact p-value for your hypothesis test. The p-value measures how likely you would be to get results at least as extreme as yours just by chance if Ho were true. It tells you how much strength you have against Ho. If the p-value is 0.001, for example, you have much more strength against Ho than if the p-value is, say, 0.10.

Looking at the Minitab output for the house-paint data in Figure 14-1, the p-value is reported to be 0.000. This means that the p-value is smaller than 0.001; for example, it may be 0.0009. That’s a very small p-value! (Minitab only reports results to three decimal places, which is typical of many statistical software packages.)

The Chi-square test for the gum-chewing data from Table 14-3 results in a p-value of 0.068. This is what statisticians call a marginal result, because it’s just on the other side of 0.05. (The test statistic turned out to be only 3.33, and that didn’t seem to be very large.) This p-value is larger than the typical α of 0.05, but not a lot larger. Technically speaking, you can’t reject Ho at level α = 0.05. In practical terms, even though gum chewing and gender seem to be dependent in the sample, you can’t say that you can expect to find this relationship in the population.
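If you don’t have Minitab handy, the exact p-value for a Chi-square statistic with 1 degree of freedom can be sketched in plain Python. This shortcut works only for 1 df, where the statistic behaves like a squared standard normal value; for other degrees of freedom you would need a statistics library. The function name `chi_square_p_value_1df` is made up for this example.

```python
import math


def chi_square_p_value_1df(statistic):
    """Upper-tail probability for a Chi-square statistic with 1 df.

    With 1 df, the statistic is the square of a standard normal value,
    so the area to its right equals erfc(sqrt(statistic / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2))


for label, statistic in [("house paint", 14.27), ("gum chewing", 3.33)]:
    p_value = chi_square_p_value_1df(statistic)
    print(f"{label}: statistic = {statistic}, p-value = {p_value:.4f}")
```

The house-paint p-value comes out well below 0.001, which is why software that prints three decimal places shows it as 0.000, and the gum-chewing p-value rounds to 0.068.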

I’ve seen situations where people who get a result that isn’t quite what they want (like a p-value of 0.068) do some tweaking to get what they want. What they do is change their α level from 0.05 to 0.10 after the fact. This change makes the p-value less than the α level, and they feel they can reject Ho and say that a relationship exists. But what’s wrong with this? They changed the α after they looked at the data, which isn’t allowed. That’s like changing your bet in blackjack after you find out what the dealer’s cards look like. (Tempting, but a serious no-no.) Always be wary of large α levels, and make sure that you always choose your α before collecting any data, and stick to it. The good news is that when p-values are reported, anyone reading them can draw their own conclusion; no cut-and-dried rejection and acceptance region is set in stone. But setting an α level once, then changing it after the fact to get a better conclusion is never good!

Comparing Two Tests for Comparing Two Proportions

You can use the Chi-square test to check whether two population proportions are equal (for example, is the proportion of female cell-phone users the same as the proportion of male cell-phone users?). Now you may be thinking, "But wait a minute, don’t statisticians already have a test for two proportions? I seem to remember it from my intro stats course. . . I’m thinking. . . yeah, it’s the Z-test for two proportions. What’s that test got to do with a Chi-square test?" In this section, you answer that question, and use both methods to investigate a possible gender gap in cell-phone use.

Getting reacquainted with the Z-test for two population proportions

The way that most people figure out how to test the equality of two population proportions is to use a Z-test for two population proportions (where you collect a random sample from each of the two populations, find and subtract their two sample proportions, and divide by their pooled standard error; see your intro stats book for details on this particular test). This test is possible to do as long as the sample sizes from the two populations are large — at least five successes and five failures in each sample.

The null hypothesis for the Z-test for two population proportions is Ho: p1 = p2, where p1 is the proportion of the first population that falls into the category of interest and p2 is the proportion of the second population that falls into the category of interest. And as always, the alternative hypothesis Ha is one of the following choices: p1 is not equal to, greater than, or less than p2.

Suppose you want to compare the proportion of cell-phone users for men versus women. You let p1 be the proportion of all males who own a cell phone and p2 be the proportion of all females who own a cell phone. You collect data, find the sample proportions from each group, $\hat{p}_1$ and $\hat{p}_2$, take their difference, and make a Z-statistic out of it using the formula

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$. Here, $x_1$ and $x_2$ are the number of individuals from samples one and two, respectively, with the desired characteristic; $n_1$ and $n_2$ are the two sample sizes.

Suppose that you collect data on 100 men and 100 women and find 45 male cell-phone owners and 55 female cell-phone owners. This means that the first sample proportion equals 45/100 = 0.45, and the second equals 55/100 = 0.55. Your samples have at least five successes (having the desired characteristic; in this case, cell-phone ownership) and five failures (not having the desired characteristic). So you go ahead and compute the Z-statistic for comparing the two population proportions (males versus females). Based on this data, it is -1.41, as shown on the last line of the Minitab output in Figure 14-3.

Figure 14-3: Minitab output comparing proportion of male and female cell-phone owners.

Test Cell Phone for Two Proportions

Sample    X     N    Sample p
M        45   100    0.450000
F        55   100    0.550000

Difference = p (1) - p (2)
Estimate for difference: -0.1
95% CI for difference: (-0.237896, 0.0378957)
Test for difference = 0 (vs not = 0): Z = -1.41  P-Value = 0.157

The p-value for the test statistic of Z = -1.41 is 0.157 (calculated by Minitab, or approximately by doubling the area below the Z-value of -1.41 on a Z-table, because the test is two-tailed; see your intro stats text for one of those tables). This p-value (0.157) is greater than the typical α level (prespecified cutoff) of 0.05, so you can’t reject Ho. You can’t say that the two population proportions aren’t equal. That is, you stick with the conclusion that the proportion of cell-phone owners for males is no different than for females. Even though the sample seemed to have evidence for a difference (after all, 45 percent isn’t equal to 55 percent), you don’t have enough evidence in the data to say that this same difference carries over to the population. So you can’t lay claim to a gender gap in cell-phone use, at least with this sample.
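To check the Minitab numbers by hand, here is a minimal Python sketch of the Z-test for two proportions using the pooled standard error. It assumes a two-tailed alternative (not equal to), and the function name `two_proportion_z_test` is made up for this example.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed Z-test for the equality of two population proportions.

    x1, x2 are the counts with the desired characteristic in each sample;
    n1, n2 are the sample sizes. Returns (Z-statistic, p-value).
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # overall sample proportion
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / std_error
    # Two-tailed p-value: area beyond |z| in both tails of the Z-distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(45, 100, 55, 100)
print(f"Z = {z:.2f}, p-value = {p_value:.3f}")  # Z = -1.41, p-value = 0.157
```

For a one-sided alternative you would use only one tail, so the p-value calculation in this sketch would need to change.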

Equating Chi-square tests and Z-tests for a two-by-two table

Here’s the key to relating the Z-test to a Chi-square test for independence. If you use the Z-test to see whether the proportion of male cell-phone owners is equal to the proportion of female cell-phone owners, you’re really looking at whether you can expect the same proportion of cell-phone owners despite gender (after you take the sample sizes into account). And that means you are testing whether gender (male or female) is independent of cell-phone ownership (yes or no).

If the proportion of female cell-phone owners equals the proportion of male cell-phone owners, then the proportion of cell-phone owners is the same regardless of gender, so gender and cell-phone ownership are independent. On the other hand, if you find the proportion of male cell-phone owners to be unequal to the proportion of female cell-phone owners, then you can say that cell-phone use differs by gender, so gender and cell-phone ownership are dependent.

Therefore, the Z-test for two proportions and the Chi-square test for independence in a two-by-two table (one with two rows and two columns) are equivalent if the sample sizes from the two populations are large enough; that is, when the number of successes and the number of failures in each cell of the two samples is at least five.

With the cell-phone data from the previous section, you have 45 males using cell phones (out of 100 males) and 55 females using cell phones (out of 100 females). The Minitab output for the Chi-square test for independence (complete with observed and expected cell counts, degrees of freedom, test statistic, and p-value) is shown in Figure 14-4. The p-value for this test is 0.157, which is greater than the typical α level (prespecified cutoff) of 0.05, so you can’t reject Ho.

Because the Chi-square test for independence and the Z-test are equivalent when you have a two-by-two table, the p-value from the Chi-square test for independence is identical to the p-value from the Z-test for two proportions. If you compare the p-values from Figures 14-3 and 14-4, you can see that for yourself.

Figure 14-4: Minitab output testing independence of gender and cell-phone ownership.

Chi-Square Test: Gender, Cell Phone

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             Y        N    Total
M           45       55      100
         50.00    50.00
         0.500    0.500

F           55       45      100
         50.00    50.00
         0.500    0.500

Total      100      100      200

Chi-Sq = 2.000, DF = 1, P-Value = 0.157

Also, note that if you take the Z-test statistic for this example (from Figure 14-3), which is -1.41, and square it, you get 1.99, which matches the Chi-square test statistic of 2.000 for the same data (last line of Figure 14-4) once you allow for rounding in the Z-value. It is always the case that the square of the Z-test statistic (when testing for the equality of two proportions) is equal to the corresponding Chi-square test statistic for independence.
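You can confirm the link between the two statistics with a few lines of Python. This is only a sketch: it computes the Chi-square statistic from the observed counts in the two-by-two table, computes the pooled Z-statistic from the same data, and compares the square of Z to the Chi-square value. The function name `chi_square_statistic` is made up for this example.

```python
import math


def chi_square_statistic(table):
    """Chi-square test statistic for independence in a two-way table.

    table is a list of rows of observed counts. Each expected count is
    (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: males, females. Columns: owns a cell phone, doesn't own one.
observed = [[45, 55], [55, 45]]
chi_sq = chi_square_statistic(observed)

# Pooled Z-statistic for the same data (45 of 100 males, 55 of 100 females).
pooled = (45 + 55) / (100 + 100)
z = (45 / 100 - 55 / 100) / math.sqrt(pooled * (1 - pooled) * (1 / 100 + 1 / 100))

print(f"Chi-square = {chi_sq:.3f}, Z squared = {z ** 2:.3f}")
```

Using the unrounded Z-value, the two numbers agree to within floating-point error; the small gap you see by hand comes only from rounding Z to -1.41 before squaring.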

Researchers are doing a great deal of study of the effects of cell-phone use while driving. One study published in the New England Journal of Medicine observed and recorded data in 1997 on 699 drivers who had cell phones and were involved in motor vehicle collisions resulting in substantial property damage but no personal injury. Each person’s cell-phone calls on the day of the collision and during the previous week were analyzed through the use of detailed billing records. A total of 26,798 cell-phone calls were made during the 14-month study period.

One conclusion the researchers made was that ". . . the risk of a collision when using a cell phone is four times higher than the risk of a collision when a cell phone was not being used." They basically conducted a test to see whether cell-phone use and having a collision are independent, and when they found out they were not, they were able to examine the relationship further using appropriate ratios. In particular, they found that the risk of a collision is four times higher for those drivers using cell phones than for those who aren’t.

Researchers also found out that the relative risk was similar for drivers who differed in personal characteristics, such as age and driving experience. (This finding means that they conducted similar tests to see whether the results were the same for drivers of different age groups and drivers of different levels of experience, and the results always came out about the same. Therefore, the age and experience of the driver didn’t change how cell-phone use was related to the collision outcome.)

The research also shows that ". . . calls made close to the time of the collision were found to be particularly hazardous (p < 0.001). Hands-free cell phones offered no safety advantage over hand-held units (p-value not significant) . . ." Note: The items in parentheses show the typical way that researchers report their results, using p-values. The p in both sets of parentheses represents the p-value of each test.

In the first case, the p-value is very tiny, less than 0.001, indicating strong evidence for a relationship between collisions and cell-phone use at the time. The second p-value in parentheses was stated to be insignificant, meaning that it was larger than 0.05, the usual α level people use. This second result indicates that whether or not the drivers used hands-free equipment didn’t affect the chances of a collision happening. That is, the proportions of collisions using hands-free cell phones versus using regular cell phones were found to be statistically the same (any difference could’ve easily occurred by chance under independence). Whether you use a regular or hands-free cell phone, may this study be a lesson to everyone!

The Chi-square test and Z-test are equivalent only if the table is a two-by-two table (two rows and two columns) and if the Z-test is two tailed (the alternative hypothesis is that the two proportions aren’t equal, instead of using Ha: one proportion is greater than or less than the other). If the Z-test is not two tailed, a Chi-square test isn’t appropriate. If the two-way table has more than two rows or columns, use the Chi-square test for independence (because you no longer have only two proportions if you have many categories, so the Z-test isn’t applicable).