Two additional regression topics, hypothesis testing and dichotomous variables, are necessary to understand how linear regression is used in practice and the relationship between linear regression and other multivariate methods. After these topics are presented, they will be applied to demonstrate how a t-test is a special case of hypothesis testing in linear regression and how linear regression can be used to predict membership in dichotomous groups.
The purpose of hypothesis testing in regression is to make rational decisions about the effect of adding additional information in terms of the independent or X variable on the accuracy of prediction. The basic idea is to sequentially compare the accuracy of prediction of a more complex regression model, that is, a model with a greater number of independent or predictor variables, with a subset of the model. Because the more complex model contains both the information available in the subset and additional information, it will always provide a prediction of the dependent variable that is equal to or better than the subset. The critical question is whether the gain in predictive accuracy is large enough to attribute it to something other than chance or random effects.
The application of hypothesis testing to simple linear regression may seem like laying the foundation for a skyscraper in the middle of Kansas. The reader must believe that the time spent understanding these concepts will prove valuable in the future. A city will appear in the middle of a cornfield.
To review, the goal of simple linear regression is to predict a single dependent variable, usually labeled Y, from a single independent variable, usually labeled X. The predicted value of Y is labeled Y' and is a linear transformation of the X variable. This transformation is described by the following formula:
Y'i = b0 + b1 * Xi
Note that the notational scheme is somewhat different than that presented in the introductory text where:
Y'i = a + b * Xi
The reason for the change is the need to keep a consistent notational system when more variables are added to the regression equation. The intercept of the regression equation is now called b0 instead of "a" and the slope is b1 instead of "b".
The values for b0 and b1 are found such that the sum of the squared deviations between the observed and predicted Y's is a minimum. This criterion, a measure of the error in prediction, is expressed in mathematical notation as:

Σ(Yi - Y'i)²
The reader is referred to the chapter on Simple Linear Regression in Introduction to Statistics: Concepts, Models, and Applications for a more thorough review.
Even though the regression equation, Y'i = b0 + b1 * Xi, appears relatively simple, it can be broken down into simpler models which are subsets of the final model. These subsets include the baseline case of no model, Y'i = 0, the model where each value of Y is predicted by a single number, Y'i = b0, and the final model, Y'i = b0 + b1 * Xi. At each stage in building sequential models, the value of adding complexity to the regression model in terms of additional predictive power of the dependent measure may be assessed using hypothesis testing procedures. This allows the statistician to decide which terms and independent variables belong in the final model.
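The sequence of models can be sketched numerically. The following is a minimal illustration using hypothetical data (not the widget data from the text); it shows that each richer model yields a sum of squared errors no larger than that of its subset.

```python
# Sequential regression models compared on hypothetical data:
# each more complex model fits at least as well as its subset.
import numpy as np

y = np.array([12.0, 15.0, 9.0, 20.0, 14.0])   # hypothetical Y values
x = np.array([3.0, 5.0, 2.0, 8.0, 4.0])       # hypothetical X values

# Model 0: no model, Y'i = 0
ss_no_model = np.sum((y - 0.0) ** 2)

# Model 1: a single constant, Y'i = b0 (the mean of Y minimizes this)
ss_mean = np.sum((y - y.mean()) ** 2)

# Model 2: the full model, Y'i = b0 + b1*Xi (least-squares fit)
b1, b0 = np.polyfit(x, y, 1)
ss_full = np.sum((y - (b0 + b1 * x)) ** 2)

print(ss_no_model, ss_mean, ss_full)  # a non-increasing sequence
```

The hypothesis-testing question developed below is whether each drop in the sum of squared errors is large enough to attribute to something other than chance.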
The data used in illustrating hypothesis testing in regression will be taken from the example in the introductory text involving predicting number of widgets from score on a form board test.
In this case each value of Y is predicted by the model Y'i = 0. Since Yi - Y'i = Yi - 0 = Yi, the value of the sum of squared deviations is equal to the sum of the Yi², or

Σ(Yi - 0)² = ΣYi²

where Y'i = 0.
This value is a baseline to which sequential models may be compared. Calculating the sum of squared errors for the example data results in the following values:
In this model, each value of Yi is predicted by a single number and the model takes the form Y'i = b0. It can be proven that the value of b0 which minimizes

Σ(Yi - b0)²

is the mean of Y, or Ȳ. In this case the sum of squared deviations around the predicted value of Y is the sum of squared deviations around the mean,

Σ(Yi - Ȳ)²

where Y'i = b0 = Ȳ.
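The claim that the mean minimizes the sum of squared deviations can be checked numerically. This is a sketch with hypothetical Y values: a dense grid of candidate constants is searched, and the winner sits at the mean.

```python
# Numeric check (hypothetical data) that the constant b0 minimizing
# sum((Yi - b0)**2) is the mean of Y.
import numpy as np

y = np.array([12.0, 15.0, 9.0, 20.0, 14.0])  # hypothetical Y values

# Evaluate the sum of squared deviations at many candidate constants.
candidates = np.linspace(y.min(), y.max(), 2001)
sse = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)
best = candidates[np.argmin(sse)]

print(best, y.mean())  # the grid minimizer lands at (about) the mean
```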
The calculation of this value is demonstrated in the following table.
|Yi||Yi - Ȳ||(Yi - Ȳ)²||Yi²|
Note that the Sum of Squares from the no model case, 2907, has been partitioned into two parts, that for which the single value model can account, 2553.8, and that which it cannot, 353.2. The two parts add up to the whole, that is 2553.8 + 353.2 = 2907. This will always be the case, although it will not be proven here. The interested reader is directed to Draper and Smith (1981).
The results of the preceding analysis are often summarized in an ANOVA source table as presented below:
|Source||Sum of Squares||df||Mean Square||F||sig.|
The degrees of freedom (df) for each row is the number of values from which it is calculated. For example, the total sum of squares is based on N=5 values. In a like manner, the Sum of Squares for b0 is based on a single value, the mean. The Error Sum of Squares is based on N-1 df because one df is lost for the estimation of b0. The df for each of the terms must add up to the Total df.
The Mean Square for each row is calculated by dividing the Sum of Squares by the degrees of freedom for each row. For example, the Mean Square for the Error Source is 353.2 / 4 = 88.3. The F ratio for the b0 source is calculated by dividing the Mean Square for b0 by the Mean Square for Error (2553.8 / 88.3 = 28.92).
In order to determine whether the b0 term is statistically significant, the exact significance level for the F ratio is obtained from the probability calculator as shown in the following figure.
In this case the exact significance level for an F ratio of 28.92 on an F distribution with 1 and 4 degrees of freedom is .00577. Because the exact significance level is less than alpha (.05), the b0 term is statistically significant, that is, it significantly increases the predictive power of the model. Some theory is necessary to justify the use of the F distribution to test for effects and the interested reader is again directed to Draper and Smith (1981).
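The source-table arithmetic for the b0 term can be reproduced from the values given above. In this sketch, scipy's F survival function stands in for the probability calculator.

```python
# Recomputing the b0 row of the ANOVA source table:
# SS(b0) = 2553.8 on 1 df, SS(error) = 353.2 on 4 df.
from scipy.stats import f

ms_b0 = 2553.8 / 1
ms_error = 353.2 / 4           # = 88.3
F_ratio = ms_b0 / ms_error     # ≈ 28.92
p = f.sf(F_ratio, 1, 4)        # exact significance level, ≈ .0058

print(F_ratio, p)
```

Because the exact significance level comes out below .05, the same decision is reached as with the probability calculator.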
The purpose of the preceding exercise is to lay a foundation and develop an understanding of how hypothesis testing works in simple linear regression. In most statistical packages, the statistical test for b0 is not automatically done; one must optionally request a regression analysis without the intercept. The reason for its exclusion is that it is usually not very interesting, as the means of most measurements in the social sciences are clearly different from zero.
In a similar manner, another model with a single term, Y'i = b1Xi, could be compared with the no model case to see if it adds significant predictive accuracy. Again, this is not a standard analysis, and the statistical package user must optionally request that the regression be done without a constant term.
A much more common and interesting approach to testing hypotheses in simple linear regression is to examine the effect of adding the b1Xi term to the model after the b0 term has been entered. The general procedure is similar to the preceding case. The predicted values of the full model are compared to the observed values of Y to find a measure of error. The predicted values of the full model are compared to the predicted values of the subset of the full model to find the increase in predictive power. The increase in predictive power is divided by error variance to find a ratio to test for additional predictive power.
In this case the predictive power of the full model, Y'i = b0 + b1*Xi, is compared to a subset of the model with only the b0 term included, Y'i = b0 = Ȳ. When both b0 and b1 are included in the model, the example data yield the model Y'i = 40.01 - .9565 * Xi, where b0 = 40.01 and b1 = -.9565. A comparison of the predictions of the two models for the example data is presented in the following table.
|Form-Board Test||Widgets/hr||Y'i=b0+b1X||Error||Squared Error||Additional Predictive Power|
|Xi||Yi||Y'i||Yi - Y'i||(Yi - Y'i)²||Y'i - Ȳ||(Y'i - Ȳ)²|
Columns (1) and (2) contain the raw data, while column (3) contains the predicted value of Y for the full model, column (4) contains the residuals, and column (5) contains the squared residuals. Up to this point the table recreates the table in the introductory text. Column (6) contains the difference between the predicted value of Y for the full model and the predicted value of Y for the subset of the full model. The reader should note that if the value of b1 was equal to zero, then the predictions of the two different models would be identical and this column would contain zeros. As the size of b1 increases, the difference between the two predictions becomes larger. The values in column (7) indicate the increase in predictive power of the full model over the subset.
The sum of column (5) is called the error sum of squares. The sum of column (7) is the sum of squares regression and is a measure of the increase in predictive power of the full model, Y'i = b0 + b1*Xi, over the partial model, Y'i = b0. This is often shortened to the sum of squares of b1 given b0 or in mathematical notation b1|b0. It should be noted that the error sum of squares and the sum of squares of b1 given b0 add up to the error sum of squares for the partial model (54.185 + 298.981 = 353.166), within rounding error. This will always be the case, although the proof is beyond the scope of this text. The interested reader is again directed to Draper and Smith (1981).
The variability that cannot be predicted by the partial model (subset) is partitioned into two parts, that which can be predicted by the addition of terms to the model and that which cannot. Everything adds up, which is nice, and the theoretical distribution of the resulting ratios can be mathematically determined, which is even nicer, because it allows a hypothesis test to be done.
All the above is neatly summarized in an ANOVA source table. The source table for the example data is presented below.
|Source||Sum of Squares||df||Mean Square||F|
There is one degree of freedom used when the value of b1 is found. The mean squares are found as before by dividing the sum of squares by the degrees of freedom. The observed F ratio for the b1|b0 term is found by dividing its mean square by the error mean square. The observed F ratio is then entered in the probability calculator to find an exact significance level. In this case the exact significance level is .02679 for an F distribution with 1 and 3 degrees of freedom. Because the exact significance level is less than alpha (.05), the addition of the b1 term to the model is statistically significant.
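The same arithmetic applies to the b1|b0 term, using the sums of squares reported above: SS(b1|b0) = 298.981 on 1 df and SS(error) = 54.185 on 3 df. A minimal sketch:

```python
# Recomputing the b1|b0 row of the ANOVA source table.
from scipy.stats import f

ms_b1_given_b0 = 298.981 / 1
ms_error = 54.185 / 3
F_ratio = ms_b1_given_b0 / ms_error   # ≈ 16.55
p = f.sf(F_ratio, 1, 3)               # ≈ .027, significant at alpha = .05

print(F_ratio, p)
```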
A much simpler test of the increase in predictive power can be done when the statistical package optionally allows for tests of R² change. The multiple correlation coefficient, or R, is the correlation coefficient between the observed and predicted values of Y. The value of R² is simply the value of R squared. The more terms added to the model, the larger the value of R². A significance test may be done to test whether the increase in R² was significant. The results of this significance test will be identical to the results of the procedure described above.
All of the above and more is done when a statistical package is used to compute a simple linear regression. The default output for the SPSS regression analysis includes a source table as follows:
The results of this table are within rounding error of those computed by hand.
A dichotomous variable may take on one of two values. For example, gender is a dichotomous variable because it may take one of only two values, male or female. Dichotomous variables are a special case of nominal categorical variables with two or more levels. An understanding of the special case will lead to a later understanding of the more general case.
As discussed in the introductory text, the interval property of measurement assumes that an equal change in the number system reflects the same difference in the real world. In order for the results of a linear regression to be meaningfully interpreted, the interval property of the scale must be assumed to be reasonably close to being correct. In some cases this assumption is clearly unwarranted as in the example of religious preference: 1=Protestant, 2=Catholic, 3=Jewish, and 4=Other. Because of the clear violation of the assumptions of the model, direct use of religious preference as either an independent or dependent variable in simple linear regression would be clearly inappropriate.
A dichotomous variable has only a single interval between its two levels. The interval property is thus assumed to be satisfied and therefore dichotomous variables may be used as variables in simple linear regression. When the independent variable is dichotomous and hypothesis testing is done, the results of the analysis will be identical to a t-test. Thus, the t-test is a special case of simple linear regression when the independent variable is dichotomous. When the dependent variable is dichotomous, the analysis is a special case of discriminant function analysis.
A close examination of simple linear regression when the independent variable is dichotomous proves insightful. These data may be represented both graphically and statistically, and the various representations are closely related.
The dichotomous independent variable may be coded using any two numbers without affecting the statistical procedure. For example gender could be coded as "-145.5 = male" and "136.43 = female". If one wishes to meaningfully interpret the output, however, some coding systems prove much easier than others. For various reasons, I prefer the use of "0" and "1" as numbers representing the two levels of the variable. In addition, if the positive level is assigned the number "1", interpretation is made easier. For example, on a "yes-no" item, code "no=0" and "yes=1" and on a "true-false" item, code "false=0" and "true=1". In coding gender, the preference is strictly up to the person doing the coding. In later chapters when an interaction term is computed, a coding of -1 and 1 for the two levels of a dichotomous variable will be used.
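The claim that any two numbers will do can be verified directly. This sketch uses hypothetical data: the correlation and significance level are unchanged between "0/1" coding and a deliberately eccentric coding, while the slope and intercept differ.

```python
# Coding invariance for a dichotomous predictor (hypothetical data):
# the correlation does not depend on which two numbers are used,
# but the slope and intercept do.
import numpy as np
from scipy.stats import linregress

group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([12., 14., 11., 13., 16., 18., 15., 17.])

x_01 = group.astype(float)                       # "0" and "1" coding
x_odd = np.where(group == 0, -145.5, 136.43)     # eccentric coding

r1 = linregress(x_01, y)
r2 = linregress(x_odd, y)

print(r1.rvalue, r2.rvalue)   # identical correlations
print(r1.slope, r2.slope)     # very different slopes
```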
Example data from the interactive exercise demonstrating dichotomous independent variables is presented below.
This data may be represented in a number of different ways. The first is as a scatter plot, which is presented below.
Because the X variable has only two levels, many of the points share the same space on the graph. It can be observed, however, that the slope of the regression line is positive. The spacing of the points at the two values of X gives some idea of the variability within each level, but may be deceiving because the duplication of points remains unknown.
A more informative representation is to draw two overlapping relative frequency polygons, as presented below.
In this case, the relative frequency of the Y values is drawn in blue and green when X equals 0 and 1, respectively. This representation provides information about duplication of points that is unavailable in the scatter plot. This representation can be difficult to interpret if the polygons are sawtoothed. It can be easily observed that the values for the dependent variable are generally higher when X=1 than when X=0.
Means and standard deviations may be computed for the Y values for each of the two levels of the X variable. A means table for the example data is presented below.
This representation is neat, concise, and informative. It can be seen that the mean of Y is 12.75 when X=0 and 16.75 when X=1. The standard deviations of the two subsets are approximately equal.
The data may also be represented as a correlation coefficient and regression equation as presented below.
Note that the correlation coefficient and slope of the regression line are both positive, as predicted from the scatter plot. A number of interesting relationships between these statistics and the table of means may be observed. Note that the intercept of the regression line is the mean of Y (12.75) when X=0. This will occur whenever one value of the dichotomous independent variable is coded as 0. Note also that the slope of the regression line is the difference between the two means (16.75 - 12.75 = 4). Finally, note that the standard error of estimate, 1.44, is the mean of the standard deviations of Y when X=0 and X=1, (1.39+1.49)/2 = 1.44. This will occur whenever there are equal numbers of scores in each group. Otherwise the standard error of estimate is a weighted average of the two standard deviations.
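These relationships can be checked numerically. The sketch below uses hypothetical 0/1-coded data rather than the example data from the text: the fitted intercept equals the mean of Y when X=0, and the slope equals the difference between the two group means.

```python
# Regression on a 0/1-coded predictor (hypothetical data):
# intercept = mean of the X=0 group, slope = difference of means.
import numpy as np

x = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
y = np.array([12., 14., 11., 13., 16., 18., 15., 17.])

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
mean0 = y[x == 0].mean()
mean1 = y[x == 1].mean()

print(b0, mean0)           # intercept equals mean of the X = 0 group
print(b1, mean1 - mean0)   # slope equals the difference of the means
```

This happens because, with a dichotomous predictor, the least-squares line must pass through both group means.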
An interactive exercise has been provided to allow the student to explore the relationship between these different representations of dichotomous independent variables. By adjusting the scroll bars,
different types of data may be generated. The "Effect Size" scroll bar corresponds to the slope of the regression line. Because different random data are generated each time the "New Data" button is pressed, the two will seldom be identical. Over the long run, however, the expected value of the slope should equal the effect size. The "Error" scroll bar controls the size of the standard deviations of the Y variable within each group. Changing this value will vary the size of the standard error of estimate, correlation coefficient, and standard deviations.
When the independent variable is dichotomous, simple linear regression basically predicts the value of Y using the mean of the Y for each level of the X variable. The correlation coefficient reflects both the direction and strength of this prediction. A positive correlation indicates that the mean of the Y variable for the larger value of X will be greater than that for the smaller value of X. A negative correlation indicates the opposite, namely that the mean of the Y variable will be larger for the smaller value of X. The absolute value of the correlation coefficient measures the difference between the two means relative to their standard deviations. The bigger the difference of the means relative to their standard deviations, the larger the absolute value of the correlation coefficient.
All this is really easier than it sounds. Students should run the interactive exercise until they feel that they have a solid understanding of the procedure and then test themselves using the testing exercise.
The case where the Y variable is dichotomous and the X variable is interval shares many similarities with the dichotomous independent variable case. For example, the data files appear similar
and the scatter plot appears as if rotated ninety degrees.
The relationships between the table of means and the slope and intercept of the regression line are not apparent in this case.
The goal of this analysis is to predict membership in one of two groups as a function of score on another variable. An interactive exercise has been provided to allow the student to explore simple linear regression when the dependent measure is dichotomous.
The topics of hypothesis testing in regression and dichotomous independent variables can be combined to show how the t-test is a special case of linear regression. The test of whether the dichotomous independent variable adds significant additional predictive power in a linear regression will yield conclusions identical to those of a conventional t-test.
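This equivalence can be demonstrated in a few lines. The sketch below uses hypothetical data (not the homework data): regressing Y on a 0/1 group code yields the same significance level as scipy's equal-variance independent-samples t-test, and the regression F equals t².

```python
# The t-test as a special case of regression (hypothetical data).
import numpy as np
from scipy.stats import linregress, ttest_ind

x = np.array([0., 0., 0., 0., 0., 1., 1., 1., 1.])   # unequal n is fine
y = np.array([12., 14., 11., 13., 15., 16., 18., 15., 17.])

reg = linregress(x, y)                     # regression on the group code
t, p = ttest_ind(y[x == 1], y[x == 0])     # equal-variance t-test

print(reg.pvalue, p)   # identical significance levels
print(t ** 2)          # equals the regression F ratio (slope/stderr)**2
```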
The following table presents data and answers from the homework assignment that has a dichotomous independent variable in a regression model.
The first window of the SPSS data matrix is presented below.
The ANOVA source table for the REGRESSION procedure is presented below.
A similar ANOVA source table is produced using the COMPARE MEANS procedure and requesting the optional ANOVA analysis. It is readily observed that they are identical.
Comparing the computational difficulty of the regression approach with that presented in the chapter on ANOVA in the introductory text, the student might object that the regression approach is much more time-consuming to compute and gives much less insight into the reasoning behind the analysis. Since computational simplicity and insight are the basic criteria for including an analysis in this statistics text, why is it included?
First of all, computational difficulty is not an issue after the homework assignments have been completed. All computation will be done with computers.
Secondly, because the regression procedure provides greatly increased flexibility in analysis when multiple independent measures are included, the regression approach to testing hypotheses about means is preferred to the more traditional analyses of ANOVA and t-tests.
Thirdly, the computational formulas provided in the chapter on ANOVA in the introductory text were simplified in order to provide a cleaner view of the underlying theory. The formulas in the introductory text assumed that there were an equal number of subjects in each group, an assumption that is hardly ever true in practice. The computational formulas for the Mean Square terms must be modified in order to weight each group by the sample size of that group. The changes in the computational formulas are similar to the changes in computing the standard error of the difference between the means under the assumption of equal n's and unequal n's in the Nested t-test.
The Mean Square within Groups is no longer a simple mean of group variances, but becomes a weighted mean of group variances. If the sample sizes, means, and variances of the two groups are represented by n1, n2, X̄1, X̄2, s1², and s2², respectively, then the computational formula for the Mean Square Within becomes:

MSwithin = [(n1 - 1)s1² + (n2 - 1)s2²] / (n1 + n2 - 2)
Substituting numbers from the example data for the terms in this equation yields:
In a similar manner, the equation for the Mean Square Between Groups must be modified in order to deal with unequal n in the two groups. With X̄G representing the grand mean of all the scores, the revised formula for MSbetween for two groups (one degree of freedom) is:

MSbetween = n1(X̄1 - X̄G)² + n2(X̄2 - X̄G)²
Again substituting numbers for variables, the Mean Square Between Groups for the example data is:
These results are within rounding error of those generated by the statistical package.
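The unequal-n formulas can be checked against a packaged one-way ANOVA. This sketch uses hypothetical data with n1 = 5 and n2 = 4; the weighted Mean Squares reproduce the F ratio from scipy's f_oneway.

```python
# Unequal-n Mean Squares checked against scipy's one-way ANOVA
# (hypothetical two-group data).
import numpy as np
from scipy.stats import f_oneway

g1 = np.array([12., 14., 11., 13., 15.])   # n1 = 5
g2 = np.array([16., 18., 15., 17.])        # n2 = 4

n1, n2 = len(g1), len(g2)
ms_within = ((n1 - 1) * g1.var(ddof=1) +
             (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
grand = np.concatenate([g1, g2]).mean()
ms_between = (n1 * (g1.mean() - grand) ** 2 +
              n2 * (g2.mean() - grand) ** 2)      # df = 1 for two groups
F = ms_between / ms_within

print(F, f_oneway(g1, g2).statistic)   # the same F ratio
```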
Comparing these results with the computed results of the homework exercise yields the following.
Again it is observed that the two ANOVA source tables are identical. In addition, as described in the introductory text, the t-test and ANOVA results are related by the function F = t². If the reader is still not convinced of the similarity of the tests, the results of the INDEPENDENT SAMPLES t-TEST are presented below.
Because the regression procedure provides greatly increased flexibility in analysis when multiple independent measures are included, the regression approach to testing hypotheses about means is preferred to the more traditional analyses of ANOVA and t-tests.
This chapter started with a discussion and demonstration of hypothesis testing in linear regression. Hypothesis testing in linear regression answers the question of whether the addition of weighted variables to a regression equation increases the predictive power of the model enough to attribute the difference to something other than chance. In this section you saw how to find the F ratio in an ANOVA table and test the appropriate hypotheses. In later chapters the same underlying procedure will be used, except that the change in predictive ability will be measured by R² change and only the exact significance level of the ANOVA will be reported.
The second section of this chapter used the hypothesis testing procedure in linear regression discussed in the first section and applied it to dichotomous variables. This section then demonstrated that the hypothesis testing procedure in linear regression was functionally equivalent to the t-test and ANOVA procedures to test the equality of means.