The *Pearson Product-Moment Correlation Coefficient* (r), or correlation coefficient for short, is a measure of the degree of linear relationship between two variables.

The computation of the correlation coefficient is most easily accomplished with the aid of a statistical calculator or computer program; hand computation using the definitional formula is demonstrated later in the chapter.

The correlation coefficient may take on any value from minus one to plus one.

The sign of the correlation coefficient (+ or -) defines the direction of the relationship. A positive correlation means that as values of X increase, values of Y tend to increase; a negative correlation means that as X increases, Y tends to decrease.

Taking the absolute value of the correlation coefficient measures the strength of the relationship: a correlation of r = -0.80 indicates a relationship just as strong as one of r = 0.80, but in the opposite direction.

The correlation coefficient may be understood by various means, each of which will now be examined in turn.

Scatter plots can best illustrate how the correlation coefficient changes as the linear relationship between the two variables is altered. When r=0.0 the points scatter widely about the plot, the majority falling roughly in the shape of a circle. As the linear relationship increases, the circle becomes more and more elliptical, until at r=1.00 or r=-1.00 all the points fall on a straight line.
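The tightening of the point cloud can also be explored numerically. The sketch below (standard-library Python; the generating construction and all data are illustrative, not from the text) draws samples at several target correlations and computes the sample r for each:

```python
import math
import random

def sample_with_correlation(rho, n, seed=0):
    """Generate n (x, y) pairs whose population correlation is rho.

    Construction: y = rho*x + sqrt(1 - rho**2)*e, with x and e independent
    standard normal variables, gives corr(x, y) = rho in the population.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        e = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * e
        pairs.append((x, y))
    return pairs

def pearson_r(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

for rho in (0.0, 0.5, 0.9):
    r = pearson_r(sample_with_correlation(rho, 2000))
    print(f"target rho = {rho:4.1f}   sample r = {r:5.2f}")
```

With 2000 points per sample, the computed r values land close to each target, mirroring the progression from circular scatter to a near-perfect line.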

A number of scatter plots and their associated correlation coefficients are presented so that the reader may develop a feel for the relationship between the pattern of points and the size of r.

The correlation coefficient is the slope (b) of the regression line when both the X and Y variables have been converted to z-scores.

This interpretation of the correlation coefficient is perhaps best illustrated with a numerical example. The raw score values of the X and Y variables are presented in the first two columns of the following table. The last two columns are the X and Y columns transformed using the z-score transformation.

X | Y | z_{X} | z_{Y}
---|---|---|---
12 | 33 | -1.07 | -0.61
15 | 31 | -0.70 | -1.38
19 | 35 | -0.20 | 0.15
25 | 37 | 0.55 | 0.92
32 | 37 | 1.42 | 0.92
Mean = 20.60 | 34.60 | 0.00 | 0.00
SD = 8.02 | 2.61 | 1.00 | 1.00

That is, the mean is subtracted from each raw score in the X and Y columns and then the result is divided by the sample standard deviation, as these formulas show:

z_{X} = (X - M_{X}) / s_{X}          z_{Y} = (Y - M_{Y}) / s_{Y}

where M_{X} and M_{Y} are the means and s_{X} and s_{Y} are the sample standard deviations of the two variables.
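As a check on the arithmetic, the transformation can be carried out in a few lines of Python (a sketch using only the standard library; `statistics.stdev` computes the sample standard deviation, matching the formulas above):

```python
from statistics import mean, stdev

X = [12, 15, 19, 25, 32]
Y = [33, 31, 35, 37, 37]

def z_scores(values):
    """Subtract the mean from each score, then divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zX, zY = z_scores(X), z_scores(Y)
print([round(z, 2) for z in zX])  # reproduces the z_X column of the table
print([round(z, 2) for z in zY])  # reproduces the z_Y column
```

The rounded results reproduce the z-score columns of the table, and the transformed columns have a mean of 0.0 and a standard deviation of 1.0, as the table's bottom rows indicate.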

There are two points to be made with these numbers: first, the z-score columns each have a mean of 0.0 and a standard deviation of 1.0; second, the correlation coefficient is unaffected by the transformation.

Computing the correlation coefficient first with the raw scores X and Y yields r=0.85. Computing the correlation coefficient with z_{X} and z_{Y} yields the same value, r=0.85. Since the z-score transformation is a special case of a linear transformation, the correlation will also be identical for X and z_{Y} or Y and z_{X}. What this means essentially is that changing the scale of either the X or the Y variable will not change the size of the correlation coefficient, as long as the transformation is linear.

The fact that the correlation coefficient is the slope of the regression line when both X and Y have been converted to z-scores can be demonstrated by computing the regression parameters predicting z_{X} from z_{Y} or z_{Y} from z_{X}. In either case the slope of the regression line equals the correlation coefficient and the intercept equals zero.
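This claim is easy to verify numerically. The sketch below (standard-library Python, using the example data from the table above) fits the least-squares slope on the z-scored variables and compares it with r; it also confirms that a linear change of scale in X leaves r untouched:

```python
from math import sqrt
from statistics import mean, stdev

X = [12, 15, 19, 25, 32]
Y = [33, 31, 35, 37, 37]

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

def slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def z(values):
    """z-score transformation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

print(round(pearson_r(X, Y), 2))    # 0.85
print(round(slope(z(X), z(Y)), 2))  # 0.85 -- the slope on z-scores is r
print(round(pearson_r([2 * x + 10 for x in X], Y), 2))  # 0.85 -- unchanged
```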

The squared correlation coefficient (r^{2}) is the proportion of variance in Y that can be accounted for by knowing X. Conversely, it is the proportion of variance in X that can be accounted for by knowing Y.

One of the most important properties of variance is that it may be partitioned into separate, additive parts.

If one knows the sex of an individual, one knows something about that person's shoe size, because the shoe sizes of males are on the average somewhat larger than those of females. The variance of shoe size may therefore be partitioned into a portion associated with sex and a portion that is not.

Rather than having just two levels the X variable will usually have many levels. The preceding argument may be extended to encompass this situation. It can be shown that the total variance is the sum of the variance that can be predicted from X and the error variance, the variance that cannot be predicted:

σ^{2}_{TOTAL} = σ^{2}_{PREDICTED} + σ^{2}_{ERROR}

The correlation coefficient squared is equal to the ratio of predicted variance to total variance:

r^{2} = σ^{2}_{PREDICTED} / σ^{2}_{TOTAL}

This formula may be rewritten in terms of the error variance (rather than the predicted variance):

r^{2} = (σ^{2}_{TOTAL} - σ^{2}_{ERROR}) / σ^{2}_{TOTAL}

The error variance, σ^{2}_{ERROR}, is estimated by the standard error of estimate squared, s^{2}_{Y.X}, discussed in the previous chapter. The total variance (σ^{2}_{TOTAL}) is simply the variance of Y, s^{2}_{Y}. The formula now becomes:

r^{2} = (s^{2}_{Y} - s^{2}_{Y.X}) / s^{2}_{Y}

Solving for s_{Y.X}, and adding a correction factor for degrees of freedom, yields the following formula for the standard error of estimate:

s_{Y.X} = s_{Y} * sqrt( (1 - r^{2}) * (N - 1) / (N - 2) )

This captures the essential relationship between the correlation coefficient, the variance of Y, and the standard error of estimate. As the standard error of estimate becomes larger relative to the total variance, the correlation coefficient becomes smaller. Thus the correlation coefficient is a function of both the standard error of estimate and the total variance of Y. The standard error of estimate is an absolute measure of the amount of error in prediction, while the correlation coefficient squared is a relative measure, relative to the total variance.
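These relationships can be verified with a short computation. The sketch below (standard-library Python, reusing the example data) computes the standard error of estimate directly from the regression residuals and then recovers the same value from r and the standard deviation of Y:

```python
from math import sqrt
from statistics import mean, stdev

X = [12, 15, 19, 25, 32]
Y = [33, 31, 35, 37, 37]
n = len(X)

# least-squares regression of Y on X
mx, my = mean(X), mean(Y)
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)
r = sxy / sqrt(sxx * syy)
b = sxy / sxx
a = my - b * mx

# standard error of estimate, computed directly from the residuals
residuals = [y - (a + b * x) for x, y in zip(X, Y)]
s_yx = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# ...and recovered from r and the standard deviation of Y
s_yx_from_r = stdev(Y) * sqrt((1 - r ** 2) * (n - 1) / (n - 2))

print(round(s_yx, 3), round(s_yx_from_r, 3))  # identical values
```

The two routes agree, confirming that the correlation coefficient, the total variance of Y, and the standard error of estimate are tied together exactly as the formula states.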

The easiest method of computing a correlation coefficient is to use a statistical calculator or computer program. Barring that, the correlation coefficient may be computed using the following formula:

r = Σ ( z_{X} * z_{Y} ) / (N - 1)

The example data in the following table are used to demonstrate computation using this formula. In practice the correlation coefficient is rarely computed by hand; the example is provided purely as an illustration of the definitional formula.

X | Y | z_{X} | z_{Y} | z_{X} * z_{Y}
---|---|---|---|---
12 | 33 | -1.07 | -0.61 | 0.65
15 | 31 | -0.70 | -1.38 | 0.97
19 | 35 | -0.20 | 0.15 | -0.03
25 | 37 | 0.55 | 0.92 | 0.51
32 | 37 | 1.42 | 0.92 | 1.31
 | | | SUM = | 3.40

Plugging the numbers into the formula, it resolves as follows:

r = Σ ( z_{X} * z_{Y} ) / (N - 1) = 3.40 / (5 - 1) = 0.85
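The same arithmetic in a few lines of standard-library Python (a sketch of the definitional formula, not how correlations are computed in practice):

```python
from statistics import mean, stdev

X = [12, 15, 19, 25, 32]
Y = [33, 31, 35, 37, 37]
n = len(X)

def z(values):
    """z-score transformation using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# definitional formula: sum the z-score cross-products, divide by N - 1
r = sum(zx * zy for zx, zy in zip(z(X), z(Y))) / (n - 1)
print(round(r, 2))  # 0.85
```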

A convenient way of summarizing a large number of correlation coefficients is to put them in a single table, called a *correlation matrix*. A correlation matrix is a table of all possible correlation coefficients between a set of variables. For example, suppose a questionnaire of the following form (Reed, 1983) was given to forty Psychology 121 students.

age - What is your age? _____

know - Number of correct answers out of 10 possible on a geography quiz which consisted of correctly locating 10 states on a state map of the United States.

statvis - How many states have you visited? _____

comair - Have you ever flown on a commercial airliner? (No = 0, Yes = 1)

sex - What is your sex? (1=Male, 2=Female)

The responses produced the following data matrix (only the first seven of the forty subjects are illustrated):

Since there are five questions on the example questionnaire, there are 5 x 5 = 25 possible correlation coefficients, as the empty correlation matrix below illustrates:

 | age | know | statvis | comair | sex
---|---|---|---|---|---
age | | | | |
know | | | | |
statvis | | | | |
comair | | | | |
sex | | | | |

One would not need to calculate all twenty-five coefficients, however. The correlation of any variable with itself is necessarily 1.00, so the diagonal entries need not be computed. Furthermore, the correlation of X with Y is identical to the correlation of Y with X, so only the entries below (or above) the diagonal, (5 x 4) / 2 = 10 in this case, need be calculated.
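A sketch of the idea in Python follows. The seven-subject data below are hypothetical stand-ins (the original data matrix is not reproduced here), but the structure is the point: with five variables, only the 10 coefficients below the diagonal need to be computed.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# hypothetical responses for seven subjects (NOT the questionnaire data)
data = {
    "age":     [21, 19, 24, 19, 23, 20, 22],
    "know":    [5, 4, 7, 3, 8, 5, 6],
    "statvis": [12, 8, 30, 9, 48, 14, 20],
    "comair":  [1, 0, 1, 0, 1, 1, 1],
    "sex":     [1, 2, 1, 2, 1, 2, 2],
}

# one coefficient per unordered pair of variables: 5 * 4 / 2 = 10 in all
for a, b in combinations(data, 2):
    print(f"r({a}, {b}) = {pearson_r(data[a], data[b]):+.2f}")
```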

To calculate a correlation matrix using a statistical package, select the variables that are to be included in the matrix. In this case all variables will be included, and optional means and standard deviations will be requested as well, as shown in the following figure.

The results of the preceding are:

Interpretation of the data analysis might proceed as follows. The table of means and standard deviations is examined first to obtain an overall picture of the responses; the correlation matrix is then examined one coefficient at a time.

Age was positively correlated with number of states visited (r=.22) and flying on a commercial airplane (r=.19), with older students more likely both to have visited more states and to have flown, although the relationships were not very strong. The greater the number of states visited, the more states the student was likely to correctly identify on the map, although again the relationship was weak (r=.28). Note that one of the students who said he had visited 48 of the 50 states could identify only 5 of 10 on the map.

Finally, sex of the participant was slightly correlated with both age (r=.17), indicating that females were slightly older than males, and number of states visited (r=-.16), indicating that females visited fewer states than males. These conclusions are possible because of the sign of the correlation coefficient and the way the sex variable was coded (1=male, 2=female): when the correlation with sex is positive, females have, on average, more of whatever is being measured on the other variable; the opposite holds when the correlation is negative.

Correct interpretation of a correlation coefficient requires the assumption that both variables, X and Y, meet the interval property of their respective measurement scales.

As discussed in Chapter 4, variables measured on a nominal categorical scale do not possess the interval property, so correlation coefficients computed with such variables are difficult or impossible to interpret.

An exception to the preceding rule occurs when the nominal categorical scale is dichotomous, that is, has exactly two levels, such as the sex variable in the example. Correlation coefficients computed with dichotomous variables remain meaningful and interpretable.
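The point can be illustrated with hypothetical numbers (the shoe-size data below are invented for the illustration). Because recoding a dichotomy is a linear transformation, the choice of codes affects at most the sign of r:

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# hypothetical data: sex coded 1=male, 2=female, with shoe sizes
sex  = [1, 1, 1, 1, 2, 2, 2, 2]
shoe = [11, 10, 12, 10, 8, 7, 9, 8]

r = pearson_r(sex, shoe)
print(round(r, 2))  # -0.87: females (coded higher) have smaller sizes

# recoding 1/2 as 0/1 is a linear transformation, so r is unchanged
print(round(pearson_r([s - 1 for s in sex], shoe), 2))  # -0.87
# reversing the codes merely flips the sign
print(round(pearson_r([3 - s for s in sex], shoe), 2))  # 0.87
```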

An outlier is a score that falls well outside the range of the rest of the points on the scatter plot.

An outlier that falls near where the regression line would normally fall would necessarily increase the size of the correlation coefficient, as seen below:

An outlier that falls some distance away from the original regression line would decrease the size of the correlation coefficient, as seen below:

The effect of the outliers in these examples is somewhat muted because the sample size is fairly large (N=100). The smaller the sample size, the greater the effect of the outlier; with a large enough sample, a single outlier has little or no effect on the size of the correlation coefficient.
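The interplay between sample size and outlier influence can be simulated. The sketch below (standard-library Python; the data are random and uncorrelated by construction) adds a single extreme point at (10, 10) to samples of increasing size and watches its effect on r shrink:

```python
import random
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

def r_with_outlier(n, seed=42):
    """r for n uncorrelated points plus one extreme point at (10, 10)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rng.gauss(0, 1) for _ in range(n)]
    return pearson_r(xs + [10.0], ys + [10.0])

for n in (10, 100, 1000):
    print(f"N = {n:4d}: r with one outlier = {r_with_outlier(n):+.2f}")
```

With N = 10 the single outlier drives r close to 1.0 even though the underlying data are uncorrelated; by N = 1000 its influence has nearly vanished.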

When a researcher encounters an outlier, a decision must be made whether to include it in the data set. It may be that the respondent was deliberately misreporting or misunderstood the question, in which case the outlier may reasonably be excluded; it may also be a rare but legitimate observation, in which case it should generally be retained.

No discussion of correlation would be complete without a discussion of causation. It is possible for two variables to be related (correlated) without one variable causing the other.

For example, suppose there exists a high correlation between the number of Popsicles sold and the number of drowning deaths on any day of the year. Does that mean that one should not eat Popsicles before one swims? Not necessarily. Both of the above variables are related to a common variable, the heat of the day: the hotter the day, the more Popsicles are sold and the more people swim, and hence the more drowning deaths occur. The third, common variable produces the correlation between the other two.
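This mechanism is easy to simulate. In the sketch below (standard-library Python; all numbers are invented for the illustration), neither outcome depends on the other, yet both are driven by temperature, and a sizable correlation between them appears anyway:

```python
import random
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

rng = random.Random(7)
# one value per day of a hypothetical year
temperature = [rng.uniform(0, 35) for _ in range(365)]
# each outcome depends on temperature plus its own independent noise
popsicles = [5.0 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.1 * t + rng.gauss(0, 1) for t in temperature]

# the two outcomes never reference each other, yet they correlate
print(round(pearson_r(popsicles, drownings), 2))
```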

Much of the early evidence that cigarette smoking causes cancer was correlational. It was only when the correlational evidence was combined with experimental and physiological evidence that the causal link became widely accepted.

Sociologists are very much concerned with the question of correlation and causation because much of their data is correlational. Sociologists have developed a branch of correlational analysis, called path analysis, precisely to determine causation from correlations (Blalock, 1971). Before a correlation may imply causation, certain requirements must be met. These requirements include: (1) the causal variable must temporally precede the variable it causes, and (2) certain relationships among the correlation coefficients of the variables involved must hold.

If a high correlation were found between the age of the teacher and the performance of the students, it would not necessarily mean that older teachers cause students to perform differently; a common variable, such as years of teaching experience, might underlie both.

A simple correlation may be interpreted in a number of different ways: as a measure of the degree and direction of the linear relationship between two variables, as the slope of the regression line when both variables have been converted to z-scores, and, when squared, as the proportion of variance in one variable accounted for by the other.

A number of factors that might affect the size of the correlation coefficient were identified, including missing parts of the distribution, outliers, and common variables. Finally, the relationship between correlation and causation was discussed.