The purpose of

The prediction of Y is accomplished by the following equation:

$Y'_{i} = b_{0} + b_{1}X_{1i} + b_{2}X_{2i} + \dots + b_{k}X_{ki}$

The "b" values are called

in the same manner as in simple linear regression. In this case there are K independent or _{0}) term.
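As a minimal sketch of the prediction equation, the following Python/NumPy fragment estimates the "b" values by least squares and computes Y' for each subject. The choice of two predictors (Spirit and Finish) and of Income7 as the predicted variable is an assumption made only for illustration, and only the subjects with a complete row in the data table presented in this chapter are used, so the weights will not match any SPSS output discussed here.

```python
import numpy as np

# Complete rows of the example table only (subjects 1, 2, 4, 6-12, 14, 15, 17-19).
spirit = np.array([30, 39, 60, 39, 51, 35, 23, 29, 61, 58, 40, 27, 36, 43, 54], dtype=float)
finish = np.array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1], dtype=float)
income7 = np.array([26, 15, 73, 38, 45, 16, 64, 19, 56, 70, 44, 25, 39, 6, 75], dtype=float)

# Design matrix: the leading column of ones estimates the constant term b0.
X = np.column_stack([np.ones(len(income7)), spirit, finish])

# Least-squares estimates of b0, b1, b2.
b, *_ = np.linalg.lstsq(X, income7, rcond=None)

# Y'_i = b0 + b1*X_1i + b2*X_2i for every subject.
predicted = X @ b
residuals = income7 - predicted

print("weights (b0, b1, b2):", np.round(b, 3))
print("first three predictions:", np.round(predicted[:3], 1))
```

With K predictors the design matrix has K + 1 columns, one for each predictor plus the column of ones for the constant.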

The data used to illustrate the inner workings of multiple regression will be generated from the "Example Student." The data are presented below:

Subject | Age | Gender | Married | IncomeC | HealthC | ChildC | LifeSatC | SES | Smoke | Spirit | Finish | LifeSat7 | Income7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 16 | 0 | 0 | 0 | 38 | 0 | 17 | 17 | 1 | 30 | 1 | 22 | 26 |
2 | 28 | 1 | 0 | 0 | 38 | 0 | 16 | 21 | 1 | 39 | 1 | 20 | 15 |
3 | 16 | 1 | 16 | 52 | 1 | 39 | 40 | 0 | 30 | 1 | 42 | 88 | |
4 | 23 | 1 | 0 | 6 | 51 | 0 | 22 | 31 | 0 | 60 | 1 | 48 | 73 |
5 | 18 | 0 | 1 | 7 | 52 | 0 | 25 | 38 | 0 | 32 | 0 | 14 | |
6 | 30 | 0 | 1 | 25 | 43 | 2 | 53 | 36 | 1 | 39 | 0 | 33 | 38 |
7 | 19 | 0 | 1 | 19 | 55 | 0 | 28 | 41 | 0 | 51 | 1 | 33 | 45 |
8 | 19 | 1 | 0 | 0 | 52 | 2 | 17 | 52 | 0 | 35 | 1 | 21 | 16 |
9 | 34 | 0 | 0 | 29 | 60 | 2 | 20 | 56 | 0 | 23 | 1 | 26 | 64 |
10 | 16 | 1 | 0 | 0 | 53 | 0 | 21 | 27 | 0 | 29 | 0 | 37 | 19 |
11 | 25 | 1 | 0 | 3 | 39 | 0 | 18 | 34 | 1 | 61 | 1 | 40 | 56 |
12 | 16 | 1 | 1 | 1 | 42 | 0 | 31 | 29 | 1 | 58 | 1 | 35 | 70 |
13 | 16 | 0 | 0 | 43 | 0 | 15 | 28 | 1 | 39 | 1 | 32 | 71 | |
14 | 16 | 0 | 1 | 18 | 54 | 1 | 34 | 38 | 0 | 40 | 0 | 37 | 44 |
15 | 16 | 1 | 0 | 0 | 52 | 0 | 20 | 38 | 0 | 27 | 1 | 35 | 25 |
16 | 32 | 1 | 1 | 26 | 54 | 1 | 39 | 37 | 0 | 30 | 47 | 38 | |
17 | 19 | 0 | 0 | 0 | 46 | 0 | 17 | 25 | 0 | 36 | 1 | 26 | 39 |
18 | 17 | 1 | 1 | 10 | 55 | 2 | 48 | 53 | 0 | 43 | 0 | 42 | 6 |
19 | 24 | 0 | 0 | 17 | 52 | 0 | 16 | 36 | 0 | 54 | 1 | 38 | 75 |
20 | 26 | 1 | 1 | 57 | 1 | 39 | 41 | 0 | 32 | 1 | 42 | 67 |

- Age
- Gender (0=Male, 1=Female)
- Married (0=No, 1=Yes)
- IncomeC: Income in College (in thousands)
- HealthC: Score on Health Inventory in College
- ChildC: Number of Children while in College
- LifeSatC: Score on Life Satisfaction Inventory in College
- SES: Socio Economic Status of Parents
- Smoker (0=No, 1=Yes)
- SpiritC: Score on Spirituality Inventory in College
- Finish: Finish the program in college (0=No, 1=Yes)
- LifeSat7: Score on Life Satisfaction Inventory seven years after College
- Income7: Income seven years after College (in thousands)

The major interest of this study is the prediction of

After doing a

The

The best and only significant (

Income seven years after college was best predicted by knowing whether the student finished the college program or not (.499). Other variables that predicted income included the measure of spirituality (.340) and income in college (.282).
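The correlations of individual predictors with the dependent measure can be sketched in a few lines of plain Python. The `pearson` helper below is written for this illustration, and the data are restricted to the subjects with a complete row in the data table, so the coefficients it prints should not be expected to reproduce the values quoted above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Complete rows of the example table only (subjects 1, 2, 4, 6-12, 14, 15, 17-19).
finish = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
spirit = [30, 39, 60, 39, 51, 35, 23, 29, 61, 58, 40, 27, 36, 43, 54]
income_c = [0, 0, 6, 25, 19, 0, 29, 0, 3, 1, 18, 0, 0, 10, 17]
income7 = [26, 15, 73, 38, 45, 16, 64, 19, 56, 70, 44, 25, 39, 6, 75]

predictors = {"Finish": finish, "Spirit": spirit, "IncomeC": income_c}
correlations = {name: pearson(values, income7) for name, values in predictors.items()}

# List the predictors from strongest to weakest absolute correlation.
for name, r in sorted(correlations.items(), key=lambda item: -abs(item[1])):
    print(f"{name:8s} r = {r:+.3f}")
```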

The matrix of correlations of all predictor variables is presented below.

The other boundary in multiple regression is called the

Note that the ^{2} means that all subsets of predictor variables will have a value of multiple R that is smaller than .976. Note also that these variables in combination do not significantly (Sig. F Change = .094) predict life satisfaction seven years after college.

The middle table ^{2} change in the previous table. Note that the "Sig. F Change" in the preceding table is the same as the "Sig." value in the "ANOVA" table. This table was more useful in previous incarnation of multiple regression analysis (see

The full model is not

The "Sig." column on the "Coefficients" table presents the statistical significance of that variable given all the other variables have been entered into the model. Note that no variables are statistically significant in this table. The variable "Married" comes close (Sig. = .055), but close doesn't count in significance testing.

Previously it was found that the

Partial output for the full model predicting the other dependent measure, income seven years after college, is presented below.

The results are similar to the prediction on life satisfaction, with an unadjusted multiple R of .905, giving an upper limit to the combined predictive power of all the predictor variables.

After the boundaries of the regression analysis have been established, the area between the extremes may be examined to get an idea of the interaction between the independent variables with respect to prediction. There are different schools of thought about how this should be accomplished. One school, hierarchical regression, argues that theory should drive the statistical model and that the decision of what and when terms enter the regression model should be determined by theoretical concerns. A second school of thought, stepwise regression, argues that the data can speak for themselves and allows the procedure to select predictor variables to enter the regression equation.

^{2} is calculated. A hypothesis test is then done to determine whether the change in R^{2} is significantly different from zero.

Using the example data, suppose a researcher wishes to examine the prediction of life satisfaction seven years after college in several stages. In the first stage, he/she enters demographic variables over which the individual has little or no control: age, gender, and socio-economic status of parents. In the second block, variables are entered over which the individual has at least some control, such as smoking, having children, being married, etc. The third block consists of the two attitudinal variables, life satisfaction and spirituality. This is accomplished in ^{2} change box is selected as a "Statistics" option.

The first table is a table of what variables were entered or removed at the different stages. The second table is a summary of the results of the different models.

The largest change in R^{2} was from model 1 to model 2, with an R^{2} change of .708 from .102 to .810. This change was not significant, however, nor were the R^{2} changes associated with either of the other two models. The final model has the same multiple R as the full model presented in an earlier section.
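The R^{2} change at each stage, and the F statistic used to test it, can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the SPSS run: the membership of each block is a guess (the description of the second block ends in "etc."), only subjects with a complete row in the data table are used, and the helper names `r_squared` and `f_change` are invented for this example. The F statistic is the usual one for nested models, (change in R^{2} / m) divided by ((1 - R^{2} of the larger model) / (n - k - 1)), where m variables are added to give k predictors in all.

```python
import numpy as np

def r_squared(predictors, y):
    """R^2 from a least-squares fit of y on the predictor columns plus a constant."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

def f_change(r2_old, r2_new, n, k_new, m):
    """F statistic and degrees of freedom for adding m predictors (k_new in total)."""
    df2 = n - k_new - 1
    return ((r2_new - r2_old) / m) / ((1.0 - r2_new) / df2), (m, df2)

def col(values):
    return np.array(values, dtype=float)

# Complete rows of the example table only (subjects 1, 2, 4, 6-12, 14, 15, 17-19).
age = col([16, 28, 23, 30, 19, 19, 34, 16, 25, 16, 16, 16, 19, 17, 24])
gender = col([0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0])
ses = col([17, 21, 31, 36, 41, 52, 56, 27, 34, 29, 38, 38, 25, 53, 36])
smoke = col([1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0])
child_c = col([0, 0, 0, 2, 0, 2, 2, 0, 0, 0, 1, 0, 0, 2, 0])
married = col([0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
lifesat_c = col([17, 16, 22, 53, 28, 17, 20, 21, 18, 31, 34, 20, 17, 48, 16])
spirit = col([30, 39, 60, 39, 51, 35, 23, 29, 61, 58, 40, 27, 36, 43, 54])
lifesat7 = col([22, 20, 48, 33, 33, 21, 26, 37, 40, 35, 37, 35, 26, 42, 38])

# Assumed blocks: demographic, partly controllable, attitudinal.
blocks = [[age, gender, ses], [smoke, child_c, married], [lifesat_c, spirit]]

entered, r2_old, results = [], 0.0, []
for block in blocks:
    entered = entered + block
    r2_new = r_squared(entered, lifesat7)
    f, df = f_change(r2_old, r2_new, len(lifesat7), len(entered), len(block))
    results.append((r2_new, r2_new - r2_old, f, df))
    r2_old = r2_new

for stage, (r2, change, f, df) in enumerate(results, start=1):
    print(f"model {stage}: R2 = {r2:.3f}, R2 change = {change:.3f}, F{df} = {f:.2f}")
```

Because each model contains all the variables of the one before it, R^{2} can only stay the same or increase from stage to stage; the F test asks whether the increase is larger than chance would produce.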

The third table presents the

Note how the values of the regression weights and significance levels change as a function of when they have been entered into the model and what other variables are present.

The fifth table presents information about variables *not* in the regression equation at any particular stage, called

The value of "Beta In" is the size of the ^{2} change significance level that the variable would enter the regression equation. In this case, it can be seen that individually both INCOMEC and SPIRITC would significantly enter the regression model in the second stage. The "^{2} if that variable were entered into the equation by itself at the next stage.

As described in the help files of SPSS, the "Collinearity Statistics Tolerance" is "calculated as 1 minus R squared for an independent variable when it is predicted by the other independent variables already included in the analysis." This statistic may be interpreted as "A variable with very low
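The quoted definition of tolerance translates directly into code: regress one independent variable on the others already in the analysis and take 1 minus the resulting R squared. The sketch below assumes, purely for illustration, that SPIRITC and FINISH are already in the model and that HEALTHC is the candidate; the `tolerance` function is a hypothetical helper, not an SPSS routine, and only subjects with a complete row in the data table are used.

```python
import numpy as np

def tolerance(target, others):
    """1 - R^2 when `target` is predicted from the `others` predictors plus a constant."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    ss_residual = np.sum((target - X @ b) ** 2)
    ss_total = np.sum((target - target.mean()) ** 2)
    return ss_residual / ss_total  # equals 1 - R^2

# Complete rows of the example table only (subjects 1, 2, 4, 6-12, 14, 15, 17-19).
spirit = np.array([30, 39, 60, 39, 51, 35, 23, 29, 61, 58, 40, 27, 36, 43, 54], dtype=float)
finish = np.array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1], dtype=float)
health_c = np.array([38, 38, 51, 43, 55, 52, 60, 53, 39, 42, 54, 52, 46, 55, 52], dtype=float)

tol_health = tolerance(health_c, [spirit, finish])

# A variable that is an exact linear combination of those already entered
# carries no new information, so its tolerance is essentially zero.
tol_redundant = tolerance(spirit + 2.0 * finish, [spirit, finish])

print(f"tolerance of HEALTHC: {tol_health:.3f}")
print(f"tolerance of a redundant variable: {tol_redundant:.2e}")
```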

At any stage, rather than entering all the variables as a block, ^{2} increase, given the variables already entered into the model. To do a step-up regression using SPSS, enter all the variables in the first block and select "Method" as "Forward."

The results of the step-up regression can be better understood if the

Note that the correlation coefficients have changed from the original table and that the highest correlation is with SPIRITC with a value of .587. The SPIRITC variable, then, would enter the step-up regression in the first step. The

The "Model Summary" table shows that two variables, SPIRITC and FINISH, are entered into the prediction model with a

The "Coefficients" table is presented next.

The final table presents information about variables not in the regression equation.

At the conclusion of the first model, both FINISH and HEALTHC would significantly (p<.05) enter the regression equation at the next step. Since FINISH had a larger partial correlation in absolute value (-.653) than HEALTHC (.544), it was entered into the equation at the next step. When FINISH was entered into the equation in model 2, HEALTHC would no longer significantly enter the regression model.
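The logic of a step-up procedure can be sketched as a loop that, at each step, enters the remaining variable giving the largest increase in R^{2} and stops when no candidate passes an entry criterion. Entering the variable with the largest R^{2} increase is equivalent to entering the one with the largest absolute partial correlation. The sketch below is an illustration under assumptions, not the SPSS algorithm: `forward_select` and its `f_to_enter` cutoff are hypothetical, the candidate set is limited to four variables, life satisfaction seven years after college is assumed to be the dependent measure, and only subjects with a complete row in the data table are used.

```python
import numpy as np

def r_squared(columns, y):
    """R^2 from a least-squares fit of y on the given columns plus a constant."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_select(candidates, y, f_to_enter=4.0):
    """Enter variables one at a time by largest R^2 increase.

    f_to_enter is a hypothetical cutoff on the F statistic of the R^2 change.
    """
    n = len(y)
    entered, r2_path, r2_old = [], [], 0.0
    remaining = dict(candidates)
    while remaining:
        # R^2 that each remaining variable would give if it were entered next.
        trial = {name: r_squared([candidates[e] for e in entered] + [column], y)
                 for name, column in remaining.items()}
        best = max(trial, key=trial.get)
        df2 = n - (len(entered) + 1) - 1
        f = (trial[best] - r2_old) / ((1.0 - trial[best]) / df2)
        if f < f_to_enter:
            break
        entered.append(best)
        r2_old = trial[best]
        r2_path.append(r2_old)
        del remaining[best]
    return entered, r2_path

# Complete rows of the example table only (subjects 1, 2, 4, 6-12, 14, 15, 17-19).
candidates = {
    "SPIRITC": np.array([30, 39, 60, 39, 51, 35, 23, 29, 61, 58, 40, 27, 36, 43, 54], dtype=float),
    "FINISH": np.array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1], dtype=float),
    "HEALTHC": np.array([38, 38, 51, 43, 55, 52, 60, 53, 39, 42, 54, 52, 46, 55, 52], dtype=float),
    "INCOMEC": np.array([0, 0, 6, 25, 19, 0, 29, 0, 3, 1, 18, 0, 0, 10, 17], dtype=float),
}
lifesat7 = np.array([22, 20, 48, 33, 33, 21, 26, 37, 40, 35, 37, 35, 26, 42, 38], dtype=float)

entered, r2_path = forward_select(candidates, lifesat7)
print("entered in order:", entered)
print("R2 after each step:", [round(float(v), 3) for v in r2_path])
```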

By starting with a full model and eliminating variables that do not significantly enter the regression equation, a

The Model Summary table is presented below.

As the table above illustrates, this method starts with the ^{2} of .978. The variable HEALTHC is eliminated at the first step because it has the lowest

As before, the table of

Note that none of these variables were significant in the final model.
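A step-down procedure can be sketched in the same spirit: start with every variable in the model and repeatedly drop the one whose removal costs the least R^{2}, stopping when every remaining variable passes a retention criterion. As before this is only an illustration under assumptions: `backward_eliminate` and its `f_to_remove` cutoff are hypothetical and are not the SPSS defaults, the candidate set is limited to four variables, life satisfaction seven years after college is assumed to be the dependent measure, and only subjects with a complete row in the data table are used.

```python
import numpy as np

def r_squared(columns, y):
    """R^2 from a least-squares fit of y on the given columns plus a constant."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

def backward_eliminate(candidates, y, f_to_remove=4.0):
    """Start from the full model and drop the weakest variable until all pass the cutoff."""
    n = len(y)
    kept = list(candidates)
    while kept:
        r2_full = r_squared([candidates[k] for k in kept], y)
        df2 = n - len(kept) - 1
        # F statistic for the R^2 lost by dropping each variable in turn.
        f_drop = {}
        for name in kept:
            rest = [candidates[k] for k in kept if k != name]
            r2_rest = r_squared(rest, y) if rest else 0.0
            f_drop[name] = (r2_full - r2_rest) / ((1.0 - r2_full) / df2)
        weakest = min(f_drop, key=f_drop.get)
        if f_drop[weakest] >= f_to_remove:
            break
        kept.remove(weakest)
    return kept

# Complete rows of the example table only (subjects 1, 2, 4, 6-12, 14, 15, 17-19).
candidates = {
    "SPIRITC": np.array([30, 39, 60, 39, 51, 35, 23, 29, 61, 58, 40, 27, 36, 43, 54], dtype=float),
    "FINISH": np.array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1], dtype=float),
    "HEALTHC": np.array([38, 38, 51, 43, 55, 52, 60, 53, 39, 42, 54, 52, 46, 55, 52], dtype=float),
    "INCOMEC": np.array([0, 0, 6, 25, 19, 0, 29, 0, 3, 1, 18, 0, 0, 10, 17], dtype=float),
}
lifesat7 = np.array([22, 20, 48, 33, 33, 21, 26, 37, 40, 35, 37, 35, 26, 42, 38], dtype=float)

kept = backward_eliminate(candidates, lifesat7)
print("variables remaining in the model:", kept)
```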

Stepwise procedures allow the data to drive the theory. Some statisticians (I would have to include myself among them) object to the mindless application of statistical procedures to multivariate data.

There is no guarantee that the Forward and Backward procedures would agree on the same model if the options were set to different values so that the same number of variables were entered into the model. At some point a variable may no longer contribute to the regression model because of other variables in the model, even if it did contribute at an earlier point in time. For that reason SPSS provides methods of "STEPWISE" and "REMOVE" which test at each stage to see if a variable still belongs in the model. These methods could be considered a combination of

The manner in which *for the existing set of data*. If a statistician wishes to predict a different set of data, the regression weights are no longer optimal. There will be substantial shrinkage in R^{2} if the weights estimated on one set of data are used on a second set of data. The amount of shrinkage can be estimated using a

In cross-validation, regression weights are estimated using one set of data and are tested on a second set of data. If the regression weights estimated on the first set of data predict the second set of data, the weights are said to be cross-validated.
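A minimal sketch of the idea follows. Regression weights are estimated on one part of the data and then applied, unchanged, to the rest, and the squared correlation between predicted and actual scores is compared across the two parts. The split into the first eight and last seven complete cases, the choice of Spirit and Finish as predictors of Income7, and the `squared_correlation` helper are all assumptions made for illustration; with samples this small the result is unstable and says nothing about the analyses reported in this chapter.

```python
import numpy as np

# Complete rows of the example table only (subjects 1, 2, 4, 6-12, 14, 15, 17-19).
spirit = np.array([30, 39, 60, 39, 51, 35, 23, 29, 61, 58, 40, 27, 36, 43, 54], dtype=float)
finish = np.array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1], dtype=float)
income7 = np.array([26, 15, 73, 38, 45, 16, 64, 19, 56, 70, 44, 25, 39, 6, 75], dtype=float)

X = np.column_stack([np.ones(len(income7)), spirit, finish])

# Hypothetical split: the first 8 cases estimate the weights, the last 7 test them.
X_est, y_est = X[:8], income7[:8]
X_val, y_val = X[8:], income7[8:]

# Weights are estimated on the first set of data only.
b, *_ = np.linalg.lstsq(X_est, y_est, rcond=None)

def squared_correlation(predicted, actual):
    """Squared Pearson correlation between predicted and actual scores."""
    return np.corrcoef(predicted, actual)[0, 1] ** 2

r2_estimation = squared_correlation(X_est @ b, y_est)
r2_validation = squared_correlation(X_val @ b, y_val)  # same weights, new cases
shrinkage = r2_estimation - r2_validation

print(f"R2 in estimation sample: {r2_estimation:.3f}")
print(f"R2 in validation sample: {r2_validation:.3f}")
print(f"shrinkage:               {shrinkage:.3f}")
```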

Suppose an industrial/organizational psychologist wished to predict job success using four different test scores. The psychologist could collect the four test scores from a randomly selected group of job applicants. After everyone in the selected group has been hired, regardless of their scores on the tests, a measure of success on the job is taken. Success on the job is then predicted from the four test scores using a multiple regression procedure. Stepwise procedures may be used to eliminate tests that predict similar variance in job success. In any case, the psychologist is now ready to predict job success from the test scores for a new set of job applicants.

Not so fast! Careful application of multiple regression methods requires that the regression weights be cross-validated on a different set of job applicants. Another random sample of job applicants is taken. Each applicant is given the test battery and then hired, again regardless of the scores they made on the tests. After some time on the job a measure of job success is taken. Job success is then predicted by using the regression weights found using the first set of job applicants. If the new data are successfully predicted using the old regression weights, the regression procedure is said to be cross-validated. It is expected that the accuracy of prediction will not be as good for the second set of data. This is because the regression procedure is subject to variances in data from sample to sample, called "^{2}.

The above procedure is an idealized use of multiple regression. In many real-life applications of the procedure, random samples of job applicants are not feasible. There may be considerable pressure from administration to select on the basis of the test battery for the first sample, let alone the second sample needed for cross-validation. In either case the multiple regression procedure is compromised. In most cases application of regression procedures to a selected rather than a random sample will result in poorer predictions. All this must be kept in mind when evaluating research on prediction models.

Multiple regression provides a powerful method to analyze multivariate data. Considerable caution, however, must be observed when interpreting the results of a multiple regression analysis. Personal recommendations include a theory that drives the selection of variables and cross-validation of the results of the analysis.