Pointing Out Correlations with Spearman’s Rank

May 15
In This Chapter

Understanding correlation from a nonparametric point of view

Finding and interpreting Spearman’s rank correlation

Data analysts commonly look for and try to quantify relationships between two variables, X and Y. Depending on the type of data you’re dealing with in X and Y, there are different procedures to use for quantifying their relationship.

When the X and Y variables are quantitative (that is, their possible outcomes are measurements or counts), the correlation coefficient (also known as Pearson’s correlation coefficient) measures the strength and direction of their linear relationship. (See Chapter 4 for all the info on Pearson’s correlation coefficient, denoted by r.) If X and Y are both categorical variables (their possible outcomes are categories that have no numerical meaning; for example, male and female), you use Chi-square procedures and conditional probabilities to look for and describe their relationship. All of that machinery is laid out in Chapters 13 and 14.

Then there is a third type of variable, called ordinal variables (their values fall into categories, but the possible values can be placed into an order and given a numerical value that has some meaning; for example, grades on a scale of A = 4, B = 3, C = 2, D = 1, and E = 0, or a student’s evaluation of a teacher on a scale from best [5] to worst [1]). To look for a relationship between two ordinal variables like these, use Spearman’s rank correlation; it’s the nonparametric counterpart to Pearson’s correlation coefficient (Chapter 4). In this chapter, you see why ordinal variables don’t meet Pearson’s conditions, and you see how to use and interpret Spearman’s rank correlation to correctly quantify and interpret the relationship between two ordinal variables.

Pickin’ On Pearson and His Precious Conditions

Pearson’s correlation coefficient is the most common correlation measure out there, and many data analysts think it’s the only one out there. Trouble is, Pearson’s correlation has certain conditions that must be met before using it. If those conditions are not met, Spearman’s correlation is waiting in the wings. In this section, you see the conditions for Pearson’s correlation and how they are easy pickin’s for Spearman’s rank correlation.

The Pearson correlation coefficient r (the correlation) is a number that measures the direction and strength of the linear relationship between two variables X and Y. (For more info on the correlation, see Chapter 4.)

Several conditions have to be met for ol’ Pearson:

1. The variables X and Y must have a linear relationship (as shown on a scatterplot; see Chapter 4).

2. Both variables X and Y must be numerical (or quantitative). That is, they must represent measurements with no restriction on their level of precision. For example, numbers with many places after the decimal point (such as 12.322 or 0.219) must be possible.

3. The Y values must have a normal distribution for each X, with the same variance at each X.

One of the most common instances where Pearson’s conditions aren’t met is when the two variables are ordinal. Ordinal data comes in categories that can be assigned numerical values that make sense. However, ordinal variables typically offer only a few categories, to keep things simple. This means there aren’t enough distinct numerical values to build a linear regression model for two ordinal variables like you can with two quantitative variables, so Pearson’s conditions aren’t met. It also makes condition three impossible: a variable that takes only a handful of values can’t have a normal distribution.

Similarly, if you have a gender variable with categories male and female, you can assign the numbers 1 and 2 to each gender, but those numbers have no numerical meaning. Gender isn’t an ordinal variable; rather, it’s a categorical variable (a variable that places individuals into categories only). Categorical variables, such as gender, also don’t lend themselves to linear relationships, so they don’t meet Pearson’s conditions either. (To explore relationships between categorical variables, see Chapter 14.)

Some people are lucky enough to have a statistic actually named after them. Typically, the person who came up with the statistic in the first place, recognizing a need for it and coming up with a solution, gets the honor. If the new statistic gets picked up and used by others, it eventually takes on the name of its inventor.

Spearman’s rank correlation is named after its inventor, Charles Edward Spearman, who lived from 1863 to 1945. He was an English psychologist who studied experimental psychology and worked in the area of human intelligence. He was a professor for many years at University College London. Spearman closely followed the work of Francis Galton, who originally developed the concept of correlation. Spearman developed his rank correlation in 1904.

Pearson’s correlation coefficient was developed several years prior, in 1893, by Karl Pearson, one of Spearman’s colleagues at University College London and another follower of Galton. Pearson and Spearman didn’t get along. Pearson had an especially strong and volatile personality and, in fact, had problems getting along with quite a few people. Such is the way of some of the more brilliant people of the 19th century.

Scoring with Spearman’s Rank Correlation

Spearman’s rank correlation doesn’t require the relationship between the variables X and Y to be linear, nor does it require the variables to be numerical. You use Spearman’s rank when the variables are ordinal and/or quantitative. Rather than examining a linear relationship between X and Y, Spearman’s rank correlation tests whether two ordinal and/or quantitative variables are dependent (in other words, related to each other).

Note: Spearman’s rank applies only to data that can be put in order (ordinal or quantitative). To test whether two categorical (and non-ordinal) variables are independent, you use a Chi-square test; see Chapter 14.

Spearman’s rank correlation is the same as Pearson’s correlation except that it’s calculated based on the ranks of the X variable and the ranks of the Y variable rather than their actual values. You interpret the value of Spearman’s rank correlation, rs, the same way you interpret Pearson’s correlation, r (see Chapter 4). The values of rs can go between -1 and +1. The higher the magnitude of rs (in the positive or negative direction), the stronger the relationship between X and Y. If X and Y are independent, rs will be close to zero. The reverse doesn’t hold, however: if rs is close to zero, you can’t say whether or not X and Y are independent; you know only that the values of one variable don’t tend to rise or fall consistently as the other increases.
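To see what working with ranks buys you, here is a minimal Python sketch. It is not from the chapter: the data are made up, and the helper names pearson and simple_ranks are hypothetical. It computes Pearson’s correlation on the raw values of a curved but always-increasing relationship, then again on the ranks, which amounts to Spearman’s rank correlation when there are no ties.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson's correlation of two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simple_ranks(values):
    """Ranks from 1 = lowest to n = highest; assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


# Made-up data: y always increases with x, but along a curve.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r = pearson(x, y)                                  # raw values
rs = pearson(simple_ranks(x), simple_ranks(y))     # ranks only

print("Pearson r on raw values:", round(r, 3))
print("Correlation of the ranks:", round(rs, 3))
```

On this made-up data the rank-based value is 1 because the order of the y-values matches the order of the x-values exactly, while the raw-value correlation falls short of 1 because the pattern isn’t a straight line.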

In this section, you see how to calculate and interpret Spearman’s rank correlation and apply it to an example.

Figuring Spearman’s rank correlation

The notation for Spearman’s rank correlation is rs, where s stands for Spearman. To find rs, you do the steps listed in this section. Minitab does the work for you in steps two through six, although some professors may ask you to do the work by hand (not me, of course).

1. Collect the data in the form of pairs of values X and Y.

2. Rank the data from the X variable from 1 = lowest to n = highest, where n is the number of pairs of data in the data set. (This gives you a new set of data for the X variable called the ranks of the x-values.)

If any of the values appear more than once, Minitab assigns each tied value the average of the ranks they would normally be given if they were not tied.

3. Complete step two with the data from the Y variable. (This gives you a new data set called the ranks of the y-values.)

4. Find the standard deviation of the ranks of the x-values, using the usual formula for standard deviation, $s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$; call it sx. In a similar manner, find the standard deviation of the ranks of the y-values using $s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$; call it sy.

(Note that n is the sample size, $\bar{x}$ is the mean of the ranks of the x-values, and $\bar{y}$ is the mean of the ranks of the y-values.)

5. Find the covariance of the x-y values, using the formula $\text{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$; call it sxy.

The covariance of x and y is a measure of the total deviation of the x and y values from the point $(\bar{x}, \bar{y})$.

6. Calculate the value of Spearman’s rank correlation by using the formula $r_s = \frac{s_{xy}}{s_x s_y}$.

Notice that the formula for Spearman’s rank correlation is just the same as the formula for Pearson’s correlation coefficient, except the data Spearman uses for his correlation formula is the ranks of X and the ranks of Y, rather than the original x- and y-values as used by Pearson. So Spearman just cares about the order of the values of the X’s and the Y’s, not their actual values.
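If you want to follow steps two through six without Minitab, they can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names average_ranks and spearman and the small data set are made up, and the code uses nothing beyond Python’s standard library.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Steps 2 and 3: rank from 1 = lowest; tied values share the average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1        # lowest rank this value would occupy
        count = ordered.count(v)            # number of times the value appears
        rank_of[v] = first + (count - 1) / 2  # average of first .. first+count-1
    return [rank_of[v] for v in values]


def spearman(xs, ys):
    """Steps 2 through 6 for paired lists xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    sx, sy = stdev(rx), stdev(ry)           # step 4: standard deviations of ranks
    mx, my = mean(rx), mean(ry)
    # Step 5: covariance of the ranks, dividing by n - 1.
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry)) / (len(rx) - 1)
    return sxy / (sx * sy)                  # step 6


# Hypothetical data with one tie in the y-values.
xs = [10, 20, 30, 40, 50]
ys = [1, 2, 2, 3, 5]
print(average_ranks(ys))
print(round(spearman(xs, ys), 3))
```

The two tied y-values would have received ranks 2 and 3, so each gets 2.5, the same averaging rule described in step two.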

To calculate Spearman’s rank correlation straightaway by using Minitab, rank the x-values, rank the y-values, and then find the correlation of the ranks. That is, go to Data>Rank and click on the X variable to get the X ranks. Then do the same thing to get the Y ranks. Now go to Stat>Basic Statistics>Correlation, click on the two columns representing ranks, and click OK.

Watching Spearman at Work: Relating aptitude to performance

Knowing the process of how to calculate Spearman’s rank correlation is one thing, but if you can apply it to real-world situations, you’ll be the golden child of the statistics world (or at least your intermediate stats class). So, try to put yourself in this section’s scenario to get the full effect of Spearman’s rank correlation.

You’re a statistics professor, and you give exams every now and then (it’s a dirty job, but someone’s got to do it). After looking at students’ final grades over the years (yes, you’re an old professor, or at least in your mid-forties), you notice that students who do well in your class tend to have a better aptitude (background ability) for math and statistics. You want to check out this theory, so you give students a math and statistics aptitude test on the first day of the course; you want to compare students’ aptitude test scores with their final grades at the end of the course.

Now for the specifics. Your variables are X = aptitude test score (using a 100-point pretest on the first day of the course) and Y = final grade, on a scale from 1 to 5 where 1 = F (failed the course); 2 = D (passed); 3 = C (average); 4 = B (above average); and 5 = A (excellent). The Y variable, final grade, is an ordinal variable, and the X variable, aptitude, is a numerical variable. You want to find out whether there’s a relationship between X and Y. You collect data on a random sample of 20 students; the data are shown in Table 20-1. This is step one of the process of calculating Spearman’s rank correlation (from the steps listed in the previous section).

Table 20-1

Aptitude Test Scores and Final Grades in Statistics

Student    Aptitude    Final Grade
1          59          3
2          47          2
3          58          4
4          66          3
5          77          2
6          57          4
7          62          3
8          68          3
9          69          5
10         36          1
11         48          3
12         65          3
13         51          2
14         61          3
15         40          3
16         67          4
17         60          2
18         56          3
19         76          3
20         71          5

Using Minitab for the aptitudes and final grades example, you get a correlation of 0.379. The following discussion walks you through steps two through six as you do this correlation yourself. This is likely what you’ll be asked to do on an exam.

Steps two and three of finding Spearman’s rank correlation are to rank the aptitude test scores (x) from lowest (1) to highest; then rank the final grades (y) from lowest (1) to highest. Note that the final grades have several ties, so you use average ranks. For example, in column three of Table 20-1 you see a single 1, which gets rank 1. Then you see four 2s. Their ranks, had they not been tied, would be 2, 3, 4, and 5. The average of these four ranks is $\frac{2 + 3 + 4 + 5}{4} = \frac{14}{4} = 3.5$. Each of the 2s in column three, therefore, receives rank 3.5.
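The same averaging can be checked for every final grade at once. The following Python sketch is an illustration, not Minitab’s own procedure; the variable names are made up, and the grades are copied from Table 20-1. It counts how many students earned each grade, works out the ranks that group would occupy if there were no ties, and averages them.

```python
from collections import Counter

# Final grades for students 1 through 20, copied from Table 20-1.
grades = [3, 2, 4, 3, 2, 4, 3, 3, 5, 1, 3, 3, 2, 3, 3, 4, 2, 3, 3, 5]

counts = Counter(grades)     # how many students earned each grade
rank_of = {}
next_rank = 1                # first rank still unassigned
for grade in sorted(counts):
    k = counts[grade]
    untied = range(next_rank, next_rank + k)   # ranks the group would occupy
    rank_of[grade] = sum(untied) / k           # average rank shared by the group
    next_rank += k

print(rank_of)
```

The four 2s share rank 3.5, the ten 3s share rank 10.5, the three 4s share rank 17, and the two 5s share rank 19.5.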

Table 20-2 shows the original data, the ranks of the aptitude scores (x), and the ranks of the final grades (y) as calculated by Minitab.

Table 20-2

Aptitude Test Scores, Final Exam Grades, and Rank

Student    Aptitude    Rank of Aptitude    Final Grade    Rank of Final Grade
1          59          9                   3              10.5
2          47          3                   2              3.5
3          58          8                   4              17.0
4          66          14                  3              10.5
5          77          20                  2              3.5
6          57          7                   4              17.0
7          62          12                  3              10.5
8          68          16                  3              10.5
9          69          17                  5              19.5
10         36          1                   1              1.0
11         48          4                   3              10.5
12         65          13                  3              10.5
13         51          5                   2              3.5
14         61          11                  3              10.5
15         40          2                   3              10.5
16         67          15                  4              17.0
17         60          10                  2              3.5
18         56          6                   3              10.5
19         76          19                  3              10.5
20         71          18                  5              19.5

For step four of the process of finding Spearman’s rank correlation, you have Minitab calculate the standard deviation of the aptitude test score ranks (the Rank of Aptitude column of Table 20-2) and the standard deviation of the final grade ranks (the Rank of Final Grade column of Table 20-2). In step five, you have Minitab calculate the covariance of the ranks of aptitude test scores and final grade ranks. These statistics are shown in Figure 20-1.

Figure 20-1: Standard deviations and covariance of ranks of aptitude (x) and final grade (y).

For the sixth and final step of finding Spearman’s rank correlation, calculate rs by taking the covariance of the ranks of X and Y, divided by the standard deviation of the ranks of X (sx) times the standard deviation of the ranks of Y (sy).

You get $r_s = \frac{12.34}{5.92 \times 5.50} = 0.379$. This matches the value for Spearman’s correlation that was found by Minitab straightaway.
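The whole calculation can be reproduced outside Minitab. The Python sketch below is an illustration under stated assumptions: the variable and function names are made up, the data are copied from Table 20-1, and only the standard library is used. It prints the standard deviations of the ranks, their covariance, and the resulting rank correlation.

```python
from statistics import mean, stdev

# Data for students 1 through 20, copied from Table 20-1.
aptitude = [59, 47, 58, 66, 77, 57, 62, 68, 69, 36,
            48, 65, 51, 61, 40, 67, 60, 56, 76, 71]
grade = [3, 2, 4, 3, 2, 4, 3, 3, 5, 1,
         3, 3, 2, 3, 3, 4, 2, 3, 3, 5]


def average_ranks(values):
    """Rank from 1 = lowest; tied values share the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


rx, ry = average_ranks(aptitude), average_ranks(grade)   # steps 2 and 3
sx, sy = stdev(rx), stdev(ry)                            # step 4
mx, my = mean(rx), mean(ry)
sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry)) / (len(rx) - 1)  # step 5
rs = sxy / (sx * sy)                                     # step 6

print(f"sx = {sx:.2f}, sy = {sy:.2f}, sxy = {sxy:.2f}, rs = {rs:.3f}")
```

Rounded to the same number of places, the standard deviations come out to 5.92 and 5.50 and the correlation to 0.379, in line with the values quoted in the text.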

This correlation of 0.379 is fairly low, indicating a weak relationship between aptitude scores before the course and final grades at the end of the course. The moral of the story? If you aren’t the sharpest tack in the bunch, you can still hope, and if you come in on top, you may not go out the same way. Still, there’s something to be said for working hard during the course (buying Intermediate Statistics For Dummies certainly doesn’t hurt!).
