<?xml version='1.0'?>
<?xml:stylesheet type="text/xsl" href="MultiBook.xsl" ?>
<chapter>
<number>1</number>
<author>David W. Stockburger</author>
<title>Additional Regression Topics</title>
<modified>8/31/2000</modified>
<URL>mlt02.xml</URL>
<section>
<P>
Two additional regression topics, <index>hypothesis testing</index> and <index>dichotomous variables</index>, are necessary to understand how linear regression is used in practice and the relationship between linear regression and other <index>multivariate methods</index>. After these topics are presented, they will be applied to demonstrate how a t-test is a special case of hypothesis testing in <index>linear regression</index> and how linear regression can be used to predict membership in dichotomous groups.
</P> 
<definition word="dichotomous variable">a variable with only two possible levels.</definition>
</section>
<section>
<TestItem type="MC">
			<question>The critical question in testing hypotheses in linear regression</question>
			<answer type="correct">Is the gain in predictive accuracy large enough to attribute it to something other than chance or random effects. </answer>
			<answer>Is the full model a better predictor of the dependent variable than a subset of the independent variables.</answer>
			<answer>Is the full model a better predictor of the independent variable than a subset of the dependent variables.</answer>
			<answer>Is there a positive correlation signifying a causal relationship between the dependent and independent measures.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>9/01/2000</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>
<h3>Hypothesis Testing in Regression</h3>
The purpose of <index>hypothesis testing in regression</index> is to make rational decisions about the effect of adding additional information in terms of the independent or X variable on the accuracy of prediction. The basic idea is to sequentially compare the accuracy of prediction of a more complex regression model, that is, a model with a greater number of independent or predictor variables, with a subset of the model. Because the more complex model contains both the information available in the subset and additional information, it will <EN>always</EN> provide a prediction of the dependent variable that is equal to or better than the subset. The critical question is whether the gain in predictive accuracy is large enough to attribute it to something other than chance or <index>random effects</index>. </P>
		<TestItem type="MC">
			<question>Adding additional independent variables to a regression equation</question>
			<answer type="correct">will never decrease the predictive power of the model.</answer>
			<answer>will make the regression line straighter.</answer>
			<answer>will cause the standard error of estimate to increase or remain the same.</answer>
			<answer>will cause the regression model to become less complex.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/03/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
		<TestItem type="MC">
			<question>Statistical significance in testing whether additional variables should be added to a regression model indicates that</question>
			<answer type="correct">the gain in predictive accuracy is large enough to attribute it to something other than chance.</answer>
			<answer>the variables increase the predictive power of the model.</answer>
			<answer>the correlation between the dependent and independent variables is different than zero.</answer>
			<answer>the standard error of estimate is unlikely to be due to chance.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>The application of hypothesis testing to simple linear regression may seem like laying the foundation for a skyscraper in the middle of Kansas. The reader must believe that the time spent understanding these concepts will prove valuable in the future. A city will appear in the middle of a cornfield.</P>
</section>
<section>
<P><h4>Simple Linear Regression</h4></P>
		<TestItem type="MC">
			<question>In simple linear regression, the goal is to</question>
			<answer type="correct">predict a single dependent variable from a single independent variable.</answer>
			<answer>find statistical significance, no matter what the state of the real world.</answer>
			<answer>find a constant that best describes the dependent variable.</answer>
			<answer>build a skyscraper in a field in the middle of Kansas.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/03/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>To review, the goal of <index>simple linear regression</index> is to predict a single dependent variable, usually labeled Y, from a single independent variable, usually labeled X. The predicted value of Y is labeled Y' and is a linear transformation of the X variable. This transformation is described by the following formula:</P>
<P><math type="equation">Y'<SUB>i </SUB>= b<SUB>0</SUB> + b<SUB>1</SUB> * X<SUB>i</SUB></math></P>
<P>Note that the notational scheme is somewhat different from that presented in the introductory text, where:</P>
<P><math type="equation">Y'<SUB>i</SUB> = a + b * X<SUB>i</SUB></math></P>
<P>The reason for the change is the need to keep a consistent notational system when more variables are added to the regression equation. The intercept of the regression equation is now called b<SUB>0</SUB> instead of "a" and the slope is b<SUB>1</SUB> instead of "b".</P>
		<TestItem type="MC">
			<question>The slope of the regression line in multiple regression is represented by</question>
			<answer type="correct">b<SUB>1</SUB></answer>
			<answer>b<SUB>0</SUB></answer>
			<answer>a</answer>
			<answer>b</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>9/01/2000</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
		<TestItem type="MC">
			<question>The intercept of the regression line in multiple regression is represented by</question>
			<answer>b<SUB>1</SUB></answer>
			<answer type="correct">b<SUB>0</SUB></answer>
			<answer>a</answer>
			<answer>b</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>9/01/2000</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>The values for b<SUB>0</SUB> and b<SUB>1</SUB> are found such that the <index>sum of the squared deviations</index> of the observed Y's from the predicted Y's is a <index>minimum</index>. This sum is a measure of the error in prediction and is expressed in mathematical notation as follows:</P>
		<TestItem type="MC">
			<question>The sum of Y minus Y prime squared is also called</question>
			<answer type="correct">sum of squared error.</answer>
			<answer>sum of residuals.</answer>
			<answer>sum of squares predicted.</answer>
			<answer>squared degrees of freedom.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P><math type="equation">
	<figure>
		<description>Sum of Y minus Y prime squared.  This is called the sum of squares, sum of squared error, or sum of squared residuals.</description>
		<url>images/Mlt023.gif</url>
		<width>83</width>
		<height>49</height>
		<align></align>
		<caption>Sum of Squared Error</caption>
		<alt>Sum of Squared Error</alt>
	</figure>
</math>
</P>
<P>The reader is referred to the chapter on <link HREF="/introbook/sbk16.xml">Simple Linear Regression</link> in <I>Introduction to Statistics: Concepts, Models, and Applications</I> for a more thorough review.</P>
</section>
<section>
<P><h4>Sequential Regression Models</h4></P>
<P>Even though the regression equation, Y'<SUB>i</SUB> = b<SUB>0</SUB> + b<SUB>1</SUB> * X<SUB>i</SUB>, appears relatively simple, it can be broken down into simpler models which are subsets of the final model. These subsets include the baseline case of no model, Y'<SUB>i</SUB> = 0, the model where each value of Y is predicted by a single number, Y'<SUB>i</SUB> = b<SUB>0</SUB>, and the final model, Y'<SUB>i</SUB> = b<SUB>0</SUB> + b<SUB>1</SUB> * X<SUB>i</SUB>. At each stage in building sequential models, the value of adding complexity to the regression model, in terms of additional power to predict the dependent measure, may be assessed using hypothesis testing procedures. This allows the statistician to decide which terms and independent variables belong in the final model.</P>
		<TestItem type="MC">
			<question>In hypothesis testing in regression, when there is no model the predicted value of Y is </question>
			<answer type="correct">0</answer>
			<answer>the mean of Y.</answer>
			<answer>the standard deviation of Y.</answer>
			<answer>the intercept plus the slope times the value of X.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>The data used in illustrating hypothesis testing in regression will be taken from the example in the introductory text involving predicting number of <index>widgets</index> from score on a <index>form board test</index>.</P>
<P>
<table cellPadding="3" cellSpacing="" summary = "" title=" Raw Data for Testing Hypotheses in Simple Linear Regression ">
  <tcaption>Raw Data for Testing Hypotheses in Simple Linear Regression</tcaption>
  <TR>
    <TH scope = "col" colspan = "1" abbr = "" >Form-Board Test</TH>
    <TH scope = "col" colspan = "1" abbr = "" >Widgets/hr</TH>
  </TR><TR>
    <TH scope = "col" colspan = "1" abbr = "X sub i" >X<SUB>i</SUB></TH>
    <TH scope = "col" colspan = "1" abbr = "Y sub i" >Y<SUB>i</SUB></TH>
  </TR> 
<TR><TD>13</TD><TD>23</TD></TR>
<TR><TD>20</TD><TD>18</TD></TR>
<TR><TD>10</TD><TD>35</TD></TR>
<TR><TD>33</TD><TD>10</TD></TR>
<TR><TD>15</TD><TD>27</TD></TR>
</table>
</P>
</section>
<section>
<P><h4>No Model</h4></P>
<P>In this case each value of Y is predicted by the model

 Y'<SUB>i</SUB> = 0. Since Y<SUB>i</SUB> - Y'<SUB>i</SUB> = Y<SUB>i</SUB> - 0 = Y<SUB>i</SUB>, the value of the sum of squared deviations is equal to the sum of Y<SUB>i</SUB><SUP>2</SUP> or</P>
<P><math type="equation">
	<figure>
		<description></description>
		<url>images/Mlt024.gif</url>
		<width>134</width>
		<height>51</height>
		<align></align>
		<caption></caption>
		<alt></alt>
	</figure>
    where Y' = 0. </math>
</P>
<P>This value is a <index>baseline</index> to which <index>sequential models</index> may be compared. Calculating the <index>sum of squared errors</index> for the example data results in the following values:</P>
<P>
<table cellPadding="" cellSpacing="" summary = "" title=" Error Sum of Squares when there is No Model in Simple Linear Regression ">
  <tcaption>Error Sum of Squares when there is No Model in Simple Linear Regression</tcaption>
  <TR>
    <TH scope = "col" colspan = "" abbr = "" > Widgets/hr </TH>
    <TH scope = "col" colspan = "" abbr = "" > No Model </TH>
  </TR><TR>
    <TH scope = "col" colspan = "" abbr = "Y" > Y<SUB>i</SUB></TH>
    <TH scope = "col" colspan = "" abbr = "Y squared" > Y<SUB>i</SUB><SUP>2</SUP></TH>
  </TR> 
<TR><TD>23</TD><TD>529</TD></TR>
<TR><TD>18</TD><TD>324</TD></TR>
<TR><TD>35</TD><TD>1225</TD></TR>
<TR><TD>10</TD><TD>100</TD></TR>
<TR><TD>27</TD><TD>729</TD></TR>
<TR><TD>SUM</TD><TD>2907</TD></TR>
</table>
</P>
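<P>As a rough check on the table above, the following Python sketch computes the sum of squared errors for any set of predictions and applies it to the no-model case. The names sum_squared_error, widgets, and no_model_predictions are illustrative assumptions and do not come from the original text.</P>

```python
# Widgets/hr (Y) values from the example data.
widgets = [23, 18, 35, 10, 27]


def sum_squared_error(observed, predicted):
    """Return the sum of squared differences between observed and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))


# With no model every prediction is zero, so the error is the sum of Y squared.
no_model_predictions = [0] * len(widgets)
ss_no_model = sum_squared_error(widgets, no_model_predictions)
print(ss_no_model)  # 2907
```

<P>The same function can be reused for later models by passing a different list of predicted values.</P>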
</section>
<section>
		<TestItem type="MC">
			<question>The value that minimizes the sum of squared deviations around a constant prediction of Y is</question>
			<answer type="correct">the mean of Y.</answer>
			<answer>the mean of X.</answer>
			<answer>zero.</answer>
			<answer>the cross-products of X and Y</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>9/11/2000</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P><h4>A Single Value Model</h4></P>
<P>In this model, each value of Y<SUB>i</SUB> is predicted by a single number and the model takes the form, Y'<SUB>i</SUB> = b<SUB>0</SUB>. It can be proven that the value of b<SUB>0</SUB> which minimizes</P>
<P><math type="equation">
	<figure>
		<description></description>
		<url>images/Mlt021.gif</url>
		<width>96</width>
		<height>59</height>
		<align></align>
		<caption></caption>
		<alt>the sum of Y minus Y prime squared</alt>
	</figure></math>

</P>
<P>is the mean of Y, or 
	<figure>
		<description> The mean of Y minimizes the sum of squared deviations when Y predicted = c.</description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption> </caption>
		<alt></alt>
	</figure>
. In this case the sum of squared deviations around the predicted value of Y is the sum of squared deviations around the mean.</P>
<P>	<figure>
		<description>The sum of squared deviations around the predicted value of Y is the sum of squared deviations around the mean.</description>
		<url>images/Mlt026.gif</url>
		<width>169</width>
		<height>52</height>
		<align></align>
		<caption></caption>
		<alt> The sum of squared deviations around the mean.</alt>
	</figure>
   where Y'<SUB>i </SUB>= b<SUB>0</SUB>= 	
<figure>
		<description></description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption></caption>
		<alt></alt>
	</figure>
.</P>
<P>The calculation of this value is demonstrated in the following table.</P>
<P>
<table cellPadding="" cellSpacing="6" summary = "" title=" Calculating the Sum of Squares for the Simple Model ">
  <tcaption>Calculating the Sum of Squares for the Simple Model</tcaption>
  <TR>
    <TH scope = "col" colspan = "" abbr = "" > Y<SUB>i</SUB></TH>
    <TH scope = "col" colspan = "" abbr = "" >
	<figure>
		<description>The mean of Y</description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption></caption>
		<alt>Y bar</alt>
	</figure>
    </TH>
    <TH scope = "col" colspan = "" abbr = "" > Y<SUB>i</SUB> -
	<figure>
		<description>The mean of Y</description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption></caption>
		<alt>Y bar</alt>
	</figure> 
    </TH>
    <TH scope = "col" colspan = "" abbr = "" > (Y<SUB>i</SUB> -
	<figure>
		<description>The mean of Y</description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption></caption>
		<alt>Y bar</alt>
	</figure>)<SUP>2</SUP> 
    </TH>
    <TH scope = "col" colspan = "" abbr = "" > 
	<figure>
		<description>The mean of Y</description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption></caption>
		<alt>Y bar</alt>
	</figure><SUP>2</SUP> 
    </TH>
</TR>
<TR><TD>23</TD><TD>22.6</TD><TD>0.4</TD><TD>.16</TD><TD>510.76</TD></TR>
<TR><TD>18</TD><TD>22.6</TD><TD>-4.6</TD><TD>21.16</TD><TD>510.76</TD></TR>
<TR><TD>35</TD><TD>22.6</TD><TD>12.4</TD><TD>153.76</TD><TD>510.76</TD></TR>
<TR><TD>10</TD><TD>22.6</TD><TD>-12.6</TD><TD>158.76</TD><TD>510.76</TD></TR>
<TR><TD>27</TD><TD>22.6</TD><TD>4.4</TD><TD>19.36</TD><TD>510.76</TD></TR>
<TR><TD>SUM</TD><TD></TD><TD>0.0</TD><TD>353.2</TD><TD>2553.8</TD></TR>
</table>
</P>
		<TestItem type="MC">
			<question>This component can be partitioned into two parts, that which the model can account for and that which it cannot, which sum to the total.</question>
			<answer type="correct">Sum of Squares</answer>
			<answer>Mean Square</answer>
			<answer>F ratio</answer>
			<answer>multiple R</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>Note that the Sum of Squares from the no model case, 2907, has been <index>partitioned</index> into two parts, that for which the single value model can account, 2553.8, and that which it cannot, 353.2. The two parts add up to the whole, that is 2553.8 + 353.2 = 2907. This will always be the case, although it will not be proven here. The interested reader is directed to Draper and Smith (1981)
		<ReferenceSource type="book">
			<title></title>
			<ISBN></ISBN>
			<edition></edition>
			<price type="hard/soft/web"></price>
			<refs></refs>
			<URL></URL>
			<pubdate type="year">1981</pubdate>
			<author vCard=""><first></first><last>Draper</last></author>
			<location></location>
			<publisher vCard=""></publisher>
			<booknote date=""></booknote>
			<quote page=""></quote>
		</ReferenceSource>
. </P>
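<P>The partition can be checked numerically for the example data. The following Python sketch is only an illustration under the assumption that the single value model uses the mean of Y as b0; the variable names are hypothetical.</P>

```python
# Widgets/hr (Y) values from the example data.
widgets = [23, 18, 35, 10, 27]
n = len(widgets)

# The single value model predicts every Y with b0, the mean of Y.
b0 = sum(widgets) / n

# Sum of squares the single value model cannot account for (error).
ss_error = sum((y - b0) ** 2 for y in widgets)

# Sum of squares the single value model accounts for: N times the squared mean.
ss_b0 = n * b0 ** 2

# Baseline sum of squares from the no-model case.
ss_total = sum(y ** 2 for y in widgets)

print(round(b0, 1), round(ss_b0, 1), round(ss_error, 1), ss_total)
# 22.6 2553.8 353.2 2907
```

<P>The two parts, ss_b0 and ss_error, sum to ss_total up to floating-point rounding.</P>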
<P>The results of the preceding analysis are often summarized in an <index>ANOVA source table</index> as presented below:</P>
<P>
<table cellPadding="" cellSpacing="3" summary = "" title="ANOVA source table">
  <tcaption>ANOVA Source Table</tcaption>
  <TR>
    <TH scope = "col" colspan = "" abbr = "" > Source </TH>
    <TH scope = "col" colspan = "" abbr = "" > Sum of Squares </TH>
    <TH scope = "col" colspan = "" abbr = "" > df </TH>
    <TH scope = "col" colspan = "" abbr = "" > Mean Square </TH>
    <TH scope = "col" colspan = "" abbr = "" > F </TH>
    <TH scope = "col" colspan = "" abbr = "" >sig.</TH>
 </TR> 
<TR><TD>b<SUB>0</SUB></TD><TD>2553.8</TD><TD>1</TD><TD>2553.8</TD><TD>28.92</TD><TD>.006</TD></TR>
<TR><TD>Error</TD><TD>353.2</TD><TD>4</TD><TD>88.3</TD><TD></TD><TD></TD></TR>
<TR><TD>Total</TD><TD>2907</TD><TD>5</TD><TD></TD><TD></TD><TD></TD></TR>
</table>
</P>
<P>The <index>degrees of freedom</index> (<index>df</index>) for each row are the number of values on which that row's sum of squares is based. For example, the Total Sum of Squares is based on N = 5 values. In like manner, the Sum of Squares for b<SUB>0</SUB> is based on a single value, the mean. The Error Sum of Squares is based on N - 1 = 4 df because one df is lost in estimating b<SUB>0</SUB>. The df for the terms must add up to the Total df (1 + 4 = 5).</P>
		<TestItem type="MC">
			<question>Mean squares in an ANOVA source table for a regression analysis are</question>
			<answer type="correct">sum of squares divided by degrees of freedom.</answer>
			<answer>average squared deviations.</answer>
			<answer>proportion of variance accounted for.</answer>
			<answer>the number of values that are free to vary.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
		<TestItem type="MC">
			<question>The F ratio in an ANOVA source table is calculated by</question>
			<answer type="correct">finding the ratio of two Mean Square values.</answer>
			<answer>dividing the Mean Square by the degrees of freedom.</answer>
			<answer>dividing the sum of squared predicted by the sum of squares error.</answer>
			<answer>finding the proportion of variance accounted for.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>The <index>Mean Square</index> for each row is calculated by dividing the Sum of Squares by the degrees of freedom for that row. For example, the Mean Square for the Error Source is 353.2 / 4 = 88.3. The <index>F ratio</index> for the b<SUB>0</SUB> source is calculated by dividing the Mean Square for b<SUB>0</SUB> by the <index>Mean Square for Error</index> (2553.8 / 88.3 = 28.92).</P>
<definition word="mean square">a measure of variability, calculated by dividing the sum of squares by the degrees of freedom.</definition>
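<P>The arithmetic behind the Mean Square and F columns can be sketched in a few lines of Python. The function name f_ratio and its argument names are assumptions made for illustration; the numbers are taken from the ANOVA source table above.</P>

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """Return (mean square for the effect, mean square for error, F ratio)."""
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return ms_effect, ms_error, ms_effect / ms_error


# Sums of squares and degrees of freedom for the b0 term and its error.
ms_b0, ms_error, f_b0 = f_ratio(2553.8, 1, 353.2, 4)
print(round(ms_error, 1), round(f_b0, 2))  # 88.3 28.92
```

<P>Any other row of a source table can be checked the same way by supplying its sum of squares and degrees of freedom together with those of the matching error term.</P>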
<P>In order to determine whether the b<SUB>0</SUB> term is statistically significant, the exact significance level for the F ratio is obtained from the probability calculator as shown in the following figure.
</P>
<P>
	<figure>
		<description> The Probability Calculator is used to illustrate an F distribution with 1 and 4 degrees of freedom.  A blue arrow labeled 1 points to the F distribution in the list of distributions.  A series of arrows labeled 2 point to degrees of freedom one and two and a score.  A value of 1 is entered in df1, 4 in df2, and 28.92 as the score. A third blue arrow points to a small box on the screen with an arrow pointing to the right.  A value of .00577 appears in the text box labeled probability. The graph shows a very positively skewed distribution with none of the rightmost area shaded red.</description>
		<url> images/Mlt0229.gif </url>
		<width>507</width>
		<height>291</height>
		<align></align>
		<caption>Finding the exact significance level for an F ratio.</caption>
		<alt> Finding the exact significance level for an F ratio </alt>
	</figure>
</P>
<ProbabilityCalculator/>
		<TestItem type="MC">
			<question>With 1 and 19 degrees of freedom, an F ratio of 4.307 would</question>
			<answer type="correct">not be statistically significant.</answer>
			<answer>be statistically significant when alpha equaled .05 but not when alpha equaled .01.</answer>
			<answer>be statistically significant when alpha equaled .01.</answer>
			<answer>be the result of a calculation error.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem> 
<P>
In this case the exact significance level for an F ratio of 28.92 on an F distribution with 1 and 4 <index>degrees of freedom</index> is .00577. Because the exact significance level is less than alpha (.05), the b<SUB>0</SUB> term is statistically significant, that is, it significantly increases the predictive power of the model. Some theory is necessary to justify the use of the F distribution to test for effects and the interested reader is again directed to 
<ref>Draper and Smith (1981)</ref>.</P>
<P>The purpose of the preceding exercise is to lay a foundation and develop an understanding of how hypothesis testing works in simple linear regression. In most statistical packages, the statistical test for b<SUB>0</SUB> is not done automatically, and a regression analysis without the <index>intercept</index> must be explicitly requested. The test is usually excluded because it is not very interesting: most measurements in the social sciences have means that differ from zero.</P>
		<TestItem type="MC">
			<question>Significance testing of the intercept in simple linear regression</question>
			<answer type="correct">is usually not very interesting in the social sciences.</answer>
			<answer>is the analysis of choice to decide about the reality of a relationship between the independent and dependent variables.</answer>
			<answer>cannot be done.</answer>
			<answer>results in a distorted picture of reality with unreasonable underlying assumptions and should be done only with extreme caution.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>In a similar manner, another model with a single term, Y'<SUB>i</SUB> = b<SUB>1</SUB>X<SUB>i</SUB>, could be compared with the no model case to see whether it adds significant predictive accuracy. Again, this is not a standard analysis, and the statistical package user must explicitly request that the regression be done without a constant term.</P>
</section>
<section>
<P><h4>A Model Adding b<SUB>1</SUB> After b<SUB>0</SUB></h4></P>
<P>A much more common and interesting approach to testing hypotheses in simple linear regression is to examine the effect of adding the b<SUB>1</SUB>X<SUB>i</SUB> term to the model after the b<SUB>0</SUB> term has been entered. The general procedure is similar to the preceding case. The predicted values of the full model are compared to the observed values of Y to find a measure of error. The predicted values of the full model are compared to the predicted values of the subset of the full model to find the increase in predictive power. The increase in predictive power is divided by error variance to find a ratio to test for additional predictive power.</P>
		<TestItem type="MC">
			<question>Significance testing of the slope given the intercept has already entered the model in simple linear regression</question>
			<answer type="correct">is the analysis of choice to decide about the reality of a relationship between the independent and dependent variables.</answer>
			<answer>is usually not very interesting in the social sciences.</answer>
			<answer>cannot be performed.</answer>
			<answer>results in a distorted picture of reality with unreasonable underlying assumptions and should be done only with extreme caution.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem> 
<P>In this case the predictive power of the full model, Y'<SUB>i </SUB>= b<SUB>0</SUB> + b<SUB>1</SUB>*X<SUB>i</SUB>, is compared to a subset of the model with only the b<SUB>0</SUB> term included, Y'<SUB>i </SUB>= b<SUB>0</SUB>=	<figure>
		<description></description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption></caption>
		<alt>Y bar</alt>
	</figure>
. When both b<SUB>0</SUB> and b<SUB>1</SUB> are included in the model, the example data yields the model Y'<SUB>i</SUB> = 40.01 - .9565 * X<SUB>i</SUB>, where b<SUB>0</SUB> = 40.01 and b<SUB>1</SUB> = -.9565. A comparison of the predictions of the two models in the example data is presented in the following table.</P>
<P>
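<table cellPadding="" cellSpacing="" summary = "" title=" Comparing the Predictions of the Full Model and the Single Value Model ">
  <tcaption>Comparing the Predictions of the Full Model and the Single Value Model</tcaption>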
  <TR>
    <TH scope = "col" colspan = "" abbr = "" > Form-Board Test </TH>
    <TH scope = "col" colspan = "" abbr = "" > Widgets/hr </TH>
    <TH scope = "col" colspan = "" abbr = "" > Y'<SUB>i</SUB>=b<SUB>0</SUB>+b<SUB>1</SUB>X </TH>
    <TH scope = "col" colspan = "" abbr = "" > Error </TH>
    <TH scope = "col" colspan = "" abbr = "" > Squared Error </TH>
    <TH scope = "col" colspan = "" abbr = "" >
	<figure>
		<description></description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption></caption>
		<alt>Y bar</alt>
	</figure>
</TH>
    <TH scope = "col" colspan = "" abbr = "" > Additional Predictive Power </TH>
  </TR> 
<TR><TD>1</TD><TD>2</TD><TD>3</TD><TD>4</TD><TD>5</TD><TD>6</TD><TD>7</TD></TR>

<TR>
<TD>
X<SUB>i</SUB></TD>
<TD>
Y<SUB>i</SUB></TD>
<TD>
Y'<SUB>i</SUB></TD>
<TD>
Y<SUB>i</SUB> - Y'<SUB>i</SUB></TD>
<TD>
(Y<SUB>i</SUB> - Y'<SUB>i</SUB>)<SUP>2</SUP></TD>
<TD>
Y'<SUB>i</SUB> -
	<figure>
		<description></description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption></caption>
		<alt>Y bar</alt>
	</figure>
</TD>
<TD>
(Y'<SUB>i</SUB><SUP> </SUP>-
	<figure>
		<description></description>
		<url>images/Mlt025.gif</url>
		<width>15</width>
		<height>16</height>
		<align></align>
		<caption></caption>
		<alt>Y bar</alt>
	</figure>
)<SUP>2</SUP></TD>
</TR>
<TR><TD>13</TD><TD>23</TD><TD>27.576</TD><TD>-4.576</TD><TD>20.940</TD><TD>4.976</TD><TD>24.761</TD></TR>
<TR><TD>20</TD><TD>18</TD><TD>20.880</TD><TD>-2.880</TD><TD>8.294</TD><TD>-1.720</TD><TD>2.958</TD></TR>
<TR><TD>10</TD><TD>35</TD><TD>30.445</TD><TD>4.555</TD><TD>20.748</TD><TD>7.845</TD><TD>61.544</TD></TR>
<TR><TD>33</TD><TD>10</TD><TD>8.446</TD><TD>1.554</TD><TD>2.415</TD><TD>-14.154</TD><TD>200.336</TD></TR>
<TR><TD>15</TD><TD>27</TD><TD>25.663</TD><TD>1.337</TD><TD>1.788</TD><TD>3.063</TD><TD>9.382</TD></TR>
<TR><TD>SUM</TD><TD></TD><TD></TD><TD></TD><TD>54.185</TD><TD></TD><TD>298.981</TD></TR>
</table>
</P>
<P>Columns (1) and (2) contain the raw data, while column (3) contains the predicted value of Y for the <index>full model</index>, column (4) contains the <index>residuals</index>, and column (5) contains the squared residuals. Up to this point the table recreates the table in the introductory text. Column (6) contains the difference between the predicted value of Y for the full model and the predicted value of Y for the subset of the full model. The reader should note that if the value of b<SUB>1 </SUB>was equal to zero, then the predictions of the two different models would be identical and this column would contain zeros. As the size of b<SUB>1</SUB> increases, the difference between the two predictions becomes larger. The values in column (7) indicate the increase in predictive power of the full model over the subset.</P>
<P>The sum of column (5) is called the error sum of squares. The sum of column (7) is the sum of squares regression and is a measure of the <index>increase in predictive power</index> of the full model, Y'<SUB>i </SUB>= b<SUB>0</SUB> + b<SUB>1</SUB>*X<SUB>i</SUB>, over the partial model, Y'<SUB>i </SUB>= b<SUB>0</SUB>. This is often shortened to the <index>sum of squares</index> of b<SUB>1</SUB> given b<SUB>0</SUB> or in mathematical notation b<SUB>1</SUB>|b<SUB>0</SUB>. It should be noted that the error sum of squares and the sum of squares of b<SUB>1</SUB> given b<SUB>0 </SUB>add up to the error sum of squares for the partial model (54.185 + 298.981 = 353.166), within <index>rounding error</index>. This will always be the case, although the proof is beyond the scope of this text. The interested reader is again directed to <ref>Draper and Smith (1981)</ref>.</P>
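<P>The quantities in the preceding table can be reproduced with a short Python sketch. It uses the usual least-squares formulas for the slope and intercept, and the variable names are assumptions made for illustration. Because the sketch keeps the unrounded coefficients, its sums (about 54.18 and 299.02) differ slightly from the table values of 54.185 and 298.981, which were computed from the rounded coefficients 40.01 and -.9565.</P>

```python
# Example data: form-board test scores (X) and widgets per hour (Y).
x = [13, 20, 10, 33, 15]
y = [23, 18, 35, 10, 27]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept for Y' = b0 + b1 * X.
sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_xx = sum((xi - mean_x) ** 2 for xi in x)
b1 = sum_xy / sum_xx
b0 = mean_y - b1 * mean_x

predicted = [b0 + b1 * xi for xi in x]

# Column (5): squared residuals of the full model.
ss_error_full = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
# Column (7): squared differences between full-model and mean-only predictions.
ss_b1_given_b0 = sum((p - mean_y) ** 2 for p in predicted)
# Error sum of squares for the partial model Y' = b0.
ss_error_partial = sum((yi - mean_y) ** 2 for yi in y)

print(round(b0, 2), round(b1, 4))
print(round(ss_error_full, 2), round(ss_b1_given_b0, 2), round(ss_error_partial, 1))
```

<P>With unrounded coefficients, ss_error_full plus ss_b1_given_b0 equals ss_error_partial up to floating-point error, which illustrates the partition described above.</P>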
		<TestItem type="MC">
			<question>In hypothesis testing in linear regression, if the slope of the regression line is equal to zero</question>
			<answer type="correct">the sum of square regression will also be zero.</answer>
			<answer>the sum of square error will equal the sum of square regression.</answer>
			<answer>the degrees of freedom will be equal to zero.</answer>
			<answer>the multiple correlation coefficient will be close to one.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>The variability that cannot be predicted by the <index>partial model</index> (subset) is partitioned into two parts, that which can be predicted by the addition of terms to the model and that which cannot. Everything adds up, which is nice, and the theoretical distribution of the resulting ratios can be mathematically determined, which is even nicer, because it allows a <index>hypothesis test</index> to be done.</P>
<P>All the above is neatly summarized in an <index>ANOVA source table</index>. The source table for the example data is presented below.</P>
<P>
<table cellPadding="" cellSpacing="" summary = "" title=" ANOVA Source Table for the Full Model ">
  <tcaption>ANOVA Source Table for the Full Model</tcaption>
  <TR>
    <TH scope = "col" colspan = "" abbr = "" > Source </TH>
    <TH scope = "col" colspan = "" abbr = "" > Sum of Squares </TH>
    <TH scope = "col" colspan = "" abbr = "" > df </TH>
    <TH scope = "col" colspan = "" abbr = "" > Mean Square </TH>
    <TH scope = "col" colspan = "" abbr = "" > F </TH>
  </TR> 
<TR><TD>b<SUB>0</SUB></TD><TD>2553.8</TD><TD>1</TD><TD>2553.8</TD><TD>28.92</TD></TR>
<TR><TD>Error b<SUB>0</SUB></TD><TD>353.2</TD><TD>4</TD><TD>88.3</TD><TD></TD></TR>
<TR><TD>b<SUB>1</SUB>|b<SUB>0</SUB></TD><TD>298.981</TD><TD>1</TD><TD>298.981</TD><TD>16.55</TD></TR>
<TR><TD>Error b<SUB>1</SUB>|b<SUB>0</SUB></TD><TD>54.185</TD><TD>3</TD><TD>18.062</TD><TD></TD></TR>
<TR><TD>Total</TD><TD>2907</TD><TD>5</TD><TD></TD><TD></TD></TR>
</table>
</P>
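<P>The mean squares and F ratios in the source table follow directly from the sums of squares and degrees of freedom. A short sketch of that arithmetic, using the values in the table (the dictionary layout and names are only one possible arrangement):</P>

```python
# Sums of squares and degrees of freedom taken from the source table.
rows = {
    "b0": (2553.8, 1),
    "Error b0": (353.2, 4),
    "b1|b0": (298.981, 1),
    "Error b1|b0": (54.185, 3),
}

# Mean square = sum of squares / degrees of freedom.
mean_square = {name: ss / df for name, (ss, df) in rows.items()}

# Each F ratio divides a term's mean square by its own error mean square.
f_b0 = mean_square["b0"] / mean_square["Error b0"]
f_b1 = mean_square["b1|b0"] / mean_square["Error b1|b0"]

print(round(f_b0, 2), round(f_b1, 2))  # 28.92 16.55
```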
		<TestItem type="MC">
			<question>In hypothesis testing in linear regression, one degree of freedom is lost for</question>
			<answer type="correct">every term in the regression model.</answer>
			<answer>every pair of score values.</answer>
			<answer>every computed sum of squares value.</answer>
			<answer>the mean of the independent variable.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>There is one <index>degree of freedom</index> used when the value of b<SUB>1</SUB> is found. The mean squares are found as before by dividing the sum of squares by the degrees of freedom. The observed <index>F ratio</index> for the b<SUB>1</SUB>|b<SUB>0 </SUB>term is found by dividing its mean square by the error mean square. The observed F ratio is then entered in the probability calculator to find an exact significance level. In this case the exact significance level is .02679 for an F distribution with 1 and 3 degrees of freedom.  Because the exact significance level is less than alpha (.05), the addition of the b<SUB>1</SUB> term to the model is statistically significant.</P>
<P>
	<figure>
		<description> The Probability Calculator is used to illustrate an F distribution with 1 and 3 degrees of freedom.  A blue arrow labeled 1 points to the F distribution in the list of distributions.  A series of arrows labeled 2 point to degrees of freedom one and two and a score.  A value of 1 is entered in df1, 3 in df2, and 16.55 as the score. A third blue arrow points to a small box on the screen with an arrow pointing to the right.  A value of .02679 appears in the text box labeled probability. The graph shows a very positively skewed distribution with none of the rightmost area shaded red.</description>
		<url> images/Mlt0230.gif </url>
		<width>501</width>
		<height>293</height>
		<align></align>
		<caption>Finding the exact significance level for an F ratio.</caption>
		<alt> Finding the exact significance level for an F ratio </alt>
	</figure>
</P>
<ProbabilityCalculator/>
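<P>For readers without the probability calculator at hand, the exact significance level can be approximated in a few lines. The sketch below relies on the fact that an F ratio with 1 and k degrees of freedom is the square of a t with k degrees of freedom, and it uses a closed form of the t distribution that holds only for 3 degrees of freedom. The function name is hypothetical and the code is not a general-purpose F calculator.</P>

```python
import math

def f_tail_probability_df_1_3(f_ratio):
    """Upper-tail probability of an F ratio with 1 and 3 degrees of freedom.

    An F with 1 and k degrees of freedom is the square of a t with k degrees
    of freedom, so the F tail area equals the two-tailed t area. The closed
    form below is the cumulative t distribution for 3 degrees of freedom only.
    """
    t = math.sqrt(f_ratio)
    u = t / math.sqrt(3.0)
    cdf = 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi
    return 2.0 * (1.0 - cdf)

ms_regression = 298.981 / 1   # mean square for b1|b0
ms_error = 54.185 / 3         # error mean square
f_ratio = ms_regression / ms_error
p_value = f_tail_probability_df_1_3(f_ratio)

print(round(f_ratio, 2), round(p_value, 4))
```

<P>With the values from the source table, the sketch gives an F ratio of about 16.55 and a probability close to the .02679 reported above. For other degrees of freedom a statistical package or the probability calculator should be used.</P>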
<P>A much simpler test of the increase in predictive power can be done when the statistical package optionally allows for tests of R<SUP>2</SUP> change.  The <index>multiple correlation coefficient</index>, or <index>R</index>, is the <index>correlation coefficient</index> between the observed and predicted values of Y.  The value of R<SUP>2</SUP> is simply the square of R.  Adding terms to the model can never decrease the value of R<SUP>2</SUP>; it will stay the same or become larger.  A significance test may be done to test whether the increase in R<SUP>2</SUP> was significant.  The results of this significance test will be identical to the results of the procedure described above.</P>
<definition word="R squared change">the difference in the multiple R squared between a full model and a partial model.</definition>
<definition word="multiple R">the correlation coefficient between the observed and predicted Y values.</definition>
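<P>The equivalence of the R<SUP>2</SUP> change test and the sum of squares procedure can be illustrated with the sums of squares from the example. The sketch below assumes the usual form of the F ratio for R<SUP>2</SUP> change (change in R<SUP>2</SUP> per added term, divided by unexplained proportion per error degree of freedom); the variable names are illustrative.</P>

```python
ss_error_partial = 353.166   # error sum of squares for Y' = b0
ss_error_full = 54.185       # error sum of squares for Y' = b0 + b1*X

# R squared for each model, relative to the partial model's error.
r2_partial = 0.0             # the partial model predicts the mean for everyone
r2_full = 1.0 - ss_error_full / ss_error_partial
r2_change = r2_full - r2_partial

df_change = 1                # one term (b1) added
df_error = 3                 # error degrees of freedom for the full model

f_change = (r2_change / df_change) / ((1.0 - r2_full) / df_error)
print(round(r2_full, 3), round(f_change, 2))
```

<P>The resulting F ratio agrees, within rounding error, with the value of 16.55 found in the source table.</P>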
		<TestItem type="MC">
			<question>In a multiple regression model with a single independent variable, the value of the multiple R will be</question>
			<answer type="correct">the absolute value of the correlation coefficient.</answer>
			<answer>the correlation coefficient squared.</answer>
			<answer>mean squares due to regression divided by mean squares error.</answer>
			<answer>larger the greater the degrees of freedom.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
		<TestItem type="MC">
			<question>The multiple correlation coefficient (R)</question>
			<answer type="correct">is the correlation between the predicted and observed Y values.</answer>
			<answer>is the correlation between X and Y, minus a correction term.</answer>
			<answer>is the average of all the correlations between the independent and dependent variables.</answer>
			<answer>is the mean squares due to regression divided by the mean squares error.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>All of the above and more is done when a statistical package is used to compute a simple linear regression. The default output for the SPSS regression analysis includes a source table as follows:</P>
<P>	<figure>
		<description>The default ANOVA source table from the SPSS linear regression output is shown. It has three rows of entries, regression, residual, and total. It contains, within rounding error, the same values as the ANOVA table computed by hand.</description>
		<url>images/Mlt027.gif</url>
		<width>500</width>
		<height>174</height>
		<align></align>
		<caption>Default ANOVA source table with SPSS linear regression program.</caption>
		<alt> Default ANOVA source table with SPSS linear regression program </alt>
	</figure>
</P>
<P>The results of this table are within <index>rounding error</index> of those computed by hand.</P>
	<TestItem type="MC">
		<question>A linear prediction model that contains both the terms of a simpler model and additional terms ____than the simpler model. </question>
		<answer type="correct">will be more complex</answer>
		<answer type="incorrect">will predict less variance</answer>
		<answer type="incorrect">will be more significant</answer>
		<answer type="incorrect">more than one of the above</answer>
		<difficulty></difficulty>
		<discriminability></discriminability>
		<author>David Stockburger</author>
		<date>03/05/2001</date>
		<concept>Additional Regression Topics</concept>
	</TestItem>
	<TestItem type="MC">
		<question>The value of b<sub>0</sub> in the prediction model Y'=b<sub>0</sub> that minimizes the value of E(Y-Y')<sup>2</sup> is </question>
		<answer type="incorrect">0</answer>
		<answer type="incorrect">Mean of X</answer>
		<answer type="correct">Mean of Y</answer>
		<answer type="incorrect">S<sub>y.x</sub></answer>
		<difficulty></difficulty>
		<discriminability></discriminability>
		<author>David Stockburger</author>
		<date>03/05/2001</date>
		<concept>Additional Regression Topics</concept>
	</TestItem>
	<TestItem type="MC">
		<question>What is the value of the Sum of Squares due to Regression (b<sub>1</sub>|b<sub>0</sub>)?</question>
		<figure>
			<description>A table with four columns, labeled Subject, Y prime, Y minus Y prime squared and Y minus Y bar squared.  At the bottom of the table is a row labeled Sum with the values 108, 53, and 130, respectively.</description>
			<url>Regress1.gif</url>
			<width>244</width>
			<height>104</height>
			<align></align>
			<caption></caption>
			<alt>Table of sums.</alt>
		</figure>
		<answer type="correct">77</answer>
		<answer type="incorrect">130</answer>
		<answer type="incorrect">53</answer>
		<answer type="incorrect">11664</answer>
		<difficulty></difficulty>
		<discriminability></discriminability>
		<author>David Stockburger</author>
		<date>03/05/2001</date>
		<concept>Additional Regression Topics</concept>
	</TestItem>
	<TestItem type="MC">
		<question>What is the value of Mean Square Residual? </question>
		<figure>
			<description>A table with four columns, labeled Subject, Y prime, Y minus Y prime squared and Y minus Y bar squared.  At the bottom of the table is a row labeled Sum with the values 108, 53, and 130, respectively.</description>
			<url>Regress1.gif</url>
			<width>244</width>
			<height>104</height>
			<align></align>
			<caption></caption>
			<alt>Table of sums.</alt>
		</figure>
		<answer type="incorrect">43.33</answer>
		<answer type="incorrect">77</answer>
		<answer type="incorrect">2809</answer>
		<answer type="correct">26.5</answer>
		<difficulty></difficulty>
		<discriminability></discriminability>
		<author>David Stockburger</author>
		<date>03/05/2001</date>
		<concept>Additional Regression Topics</concept>
	</TestItem>
</section>
<section>
<P><h3>Dichotomous Variables in Regression</h3></P>
<P>A <index>dichotomous variable</index> may take on one of two values. For example, gender is a dichotomous variable because it may take one of only two values, male or female. Dichotomous variables are a special case of nominal categorical variables with two or more levels. An understanding of the special case will lead to a later understanding of the more general case.</P>
		<TestItem type="MC">
			<question>Dichotomous variables</question>
			<answer type="correct">may be treated as interval variables.</answer>
			<answer>have two or more levels.</answer>
			<answer>may be assumed to have a rational zero.</answer>
			<answer>are not appropriate for use in a linear regression analysis.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
	<TestItem type="MC">
		<question>What will be the value of b<sub>0</sub> in Y'=b<sub>0</sub>+b<sub>1</sub>X, given the following table of means?</question>
		<figure>
			<description>A table of means of two groups, labeled X=0 and X=1, each with an N of 8.  The means for the two groups are 16.75 and 12.75 and the standard deviations are 1.4 and 1.6.</description>
			<url>Regress2.gif</url>
			<width>242</width>
			<height>48</height>
			<align></align>
			<caption></caption>
			<alt></alt>
		</figure>
		<answer type="incorrect">-4.00</answer>
		<answer type="incorrect">12.75</answer>
		<answer type="correct">16.75</answer>
		<answer type="incorrect">1.4</answer>
		<difficulty></difficulty>
		<discriminability></discriminability>
		<author>David Stockburger</author>
		<date>03/05/2001</date>
		<concept>Additional Regression Topics</concept>
	</TestItem>
	<TestItem type="MC">
		<question>What will be the value of b<sub>1</sub> in Y'=b<sub>0</sub>+b<sub>1</sub>X, given the following table of means?</question>
		<figure>
			<description>A table of means of two groups, labeled X=0 and X=1, each with an N of 8.  The means for the two groups are 16.75 and 12.75 and the standard deviations are 1.4 and 1.6.</description>
			<url>Regress2.gif</url>
			<width>242</width>
			<height>48</height>
			<align></align>
			<caption></caption>
			<alt></alt>
		</figure>
		<answer type="correct">-4.0</answer>
		<answer type="incorrect">12.75</answer>
		<answer type="incorrect">.2</answer>
		<answer type="incorrect">16.75</answer>
		<difficulty></difficulty>
		<discriminability></discriminability>
		<author>David Stockburger</author>
		<date>03/05/2001</date>
		<concept>Additional Regression Topics</concept>
	</TestItem>
	<TestItem type="MC">
		<question>What will be the value of s<sub>y.x</sub>, given the following table of means? </question>
		<answer type="incorrect">-4.0</answer>
		<answer type="incorrect">16.75</answer>
		<answer type="incorrect">.2</answer>
		<answer type="correct">1.5</answer>
		<difficulty></difficulty>
		<discriminability></discriminability>
		<author>David Stockburger</author>
		<date>03/05/2001</date>
		<concept>Additional Regression Topics</concept>
	</TestItem>
	<TestItem type="MC">
		<question>In the interactive exercise demonstrating dichotomous variables in regression, what will increasing the "Effect Size" change in the resulting regression equation?</question>
		<answer type="correct">the absolute value of b1</answer>
		<answer type="incorrect">the absolute value of the correlation coefficient</answer>
		<answer type="incorrect">the standard error of estimate</answer>
		<answer type="incorrect">effect</answer>
		<difficulty></difficulty>
		<discriminability></discriminability>
		<author>David Stockburger</author>
		<date>03/05/2001</date>
		<concept>Additional Regression Topics</concept>
	</TestItem>
	<TestItem type="MC">
		<question>The most flexible technique in hypothesis testing is </question>
		<answer type="incorrect">t-tests</answer>
		<answer type="incorrect">discriminant function analysis</answer>
		<answer type="incorrect">ANOVA for means</answer>
		<answer type="correct">Regression</answer>
		<difficulty></difficulty>
		<discriminability></discriminability>
		<author>David Stockburger</author>
		<date>03/05/2001</date>
		<concept>Additional Regression Topics</concept>
	</TestItem>
<P>As discussed in the introductory text, the <index>interval property</index> of measurement assumes that an equal change in the number system reflects the same difference in the real world. In order for the results of a linear regression to be meaningfully interpreted, the interval property of the scale must be assumed to be reasonably close to being correct. In some cases this assumption is clearly unwarranted as in the example of religious preference: 1=Protestant, 2=Catholic, 3=Jewish, and 4=Other. Because of the clear violation of the assumptions of the model, direct use of <index>religious preference</index> as either an independent or dependent variable in simple linear regression would be clearly inappropriate.</P>
<P>A dichotomous variable has only a single interval between its two levels. The interval property is thus assumed to be satisfied and therefore dichotomous variables may be used as variables in simple linear regression. When the independent variable is dichotomous and hypothesis testing is done, the results of the analysis will be identical to a t-test. Thus, the t-test is a special case of simple linear regression when the independent variable is dichotomous. When the dependent variable is dichotomous, the analysis is a special case of discriminant function analysis. </P>
		<TestItem type="MC">
			<question>With a dichotomous independent variable and an interval dependent variable, which of the following hypothesis tests would result in a different exact significance level from the others?</question>
			<answer type="correct">chi-square</answer>
			<answer>t-test</answer>
			<answer>ANOVA of the means.</answer>
			<answer>ANOVA of the slope of the regression line given the intercept</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
</section>
<section>
<P><h3>Dichotomous Independent Variables</h3></P>
<P>A close examination of simple linear regression when the independent variable is dichotomous proves insightful. These data may be represented in both graphs and statistics and the representations share relationships in common.</P>
		<TestItem type="MC">
			<question>The preferred coding of dichotomous variables is</question>
			<answer type="correct">0 and 1</answer>
			<answer>-145.5 and 136.43</answer>
			<answer>1 and 2</answer>
			<answer>1 and 100</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>The dichotomous independent variable may be coded using any two numbers without affecting the statistical procedure. For example, gender could be coded as "-145.5 = male" and "136.43 = female". If one wishes to meaningfully interpret the output, however, some coding systems prove much easier than others. For various reasons, I prefer the use of "0" and "1" as numbers representing the two levels of the variable. In addition, if the positive level is assigned the number "1", interpretation is made easier. For example, on a "yes-no" item, code "no=0" and "yes=1", and on a "true-false" item, code "false=0" and "true=1". In coding gender, the preference is strictly up to the person doing the coding. In later chapters when an interaction term is computed, a coding of -1 and 1 for the two levels of a dichotomous variable will be used.</P>
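<P>The claim that the choice of codes does not affect the statistical procedure can be illustrated with a small sketch. The scores below are hypothetical, and the helper function is written only for this illustration. The correlation coefficient is the same under both codings, while the slope changes with the distance between the two codes, which is why the 0 and 1 coding is easier to interpret.</P>

```python
import math

def correlation_and_slope(x, y):
    """Return the Pearson correlation and least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy), ss_xy / ss_xx

# Hypothetical scores for two groups of four.
group = ["male", "male", "male", "male", "female", "female", "female", "female"]
y = [12.0, 14.0, 11.0, 13.0, 17.0, 16.0, 18.0, 15.0]

code_01 = {"male": 0.0, "female": 1.0}
code_odd = {"male": -145.5, "female": 136.43}

r_01, slope_01 = correlation_and_slope([code_01[g] for g in group], y)
r_odd, slope_odd = correlation_and_slope([code_odd[g] for g in group], y)

print(round(r_01, 4), round(r_odd, 4))          # same correlation
print(round(slope_01, 4), round(slope_odd, 4))  # different slopes
```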
<P>Example data from the interactive exercise demonstrating dichotomous independent variables is presented below.</P>
<P>	<figure>
		<description>Example data is presented, titled dichotomous data in regression. Two rows of data are shown, the first, labeled X, has only zeros or ones as data. The second, labeled Y, has values between 11 and 19.</description>
		<url>images/Mlt028.gif</url>
		<width>454</width>
		<height>57</height>
		<align></align>
		<caption></caption>
		<alt>Dichotomous Data in Regression</alt>
	</figure>
</P>
<P>This data may be represented in a number of different ways. The first is as a scatter plot, which is presented below.</P>
<P>	<figure>
		<description>A scatter plot of a dichotomous independent variable and an interval dependent variable is shown. The X-axis has two values labeled, 0 and 1. All the points on the Y-axis appear at the two points labeled 0 and 1 on the X-axis. A regression line passes through the two sets of points at the value of the mean of Y for a given value of X.</description>
		<url>images/Mlt029.gif</url>
		<width>260</width>
		<height>149</height>
		<align></align>
		<caption>A scatter plot of a dichotomous independent variable and an interval dependent variable.</caption>
		<alt> A scatter plot of a dichotomous independent variable and an interval dependent variable </alt>
	</figure>
</P>
<P>Because the X variable has only two levels, many of the points share the same space on the graph. It can be observed, however, that the relationship between X and Y is positive, as is the slope of the <index>regression line</index>. The spacing of the points at the two values of X gives some idea of the variability within each level, but may be deceiving because duplication of points remains unknown.</P>
<P>A more informative representation is to draw two overlapping relative frequency polygons, as presented below.</P>
<P>
	<figure>
		<description> Two overlapping relative frequency polygons illustrating the relationship between a dichotomous independent variable and an interval dependent variable are shown. Blue and green lines delimit the two graphs, with the blue relative frequency polygon appearing to the left of the green relative frequency polygon.</description>
		<url>images/Mlt0210.gif</url>
		<width>253</width>
		<height>143</height>
		<align></align>
		<caption>Two overlapping relative frequency polygons.</caption>
		<alt> Two overlapping relative frequency polygons </alt>
	</figure>
</P>
<P>In this case, the <index>relative frequency</index> of the Y values is drawn in blue and green when X equals 0 and 1, respectively. This representation provides information about duplication of points that is unavailable in the scatter plot. This representation can be difficult to interpret if the polygons are <index>sawtoothed</index>. It can be easily observed that the values for the dependent variable are generally higher when X=1 than when X=0.</P>
<P>Means and standard deviations may be computed for the Y values for each of the two levels of the X variable. A means table for the example data is presented below.</P>
<P>	<figure>
		<description>A table of means and standard deviations is shown when the independent measure is dichotomous and the dependent measure is interval. Two rows, labeled X=0 and X=1, have sample sizes, means, and standard deviations of 8, 12.75, and 1.39 and 8, 16.75, and 1.49, respectively.</description>
		<url>images/Mlt0211.gif</url>
		<width>192</width>
		<height>48</height>
		<align></align>
		<caption>Table of means and standard deviations.</caption>
		<alt> Table of means and standard deviations </alt>
	</figure>
</P>
<P>This representation is neat, concise, and informative. It can be seen that the mean of Y is 12.75 when X=0 and 16.75 when X=1. The standard deviations of the two subsets are approximately equal.</P>
<P>The data may also be represented as a <index>correlation coefficient</index> and regression equation as presented below.</P>
<P>	<figure>
		<description>A correlation coefficient of .83, a regression line where y prime equals 12.75 plus four times X, and a standard error equaling 1.44 is shown, describing the relationships in the example data presented earlier.</description>
		<url>images/Mlt0212.gif</url>
		<width>370</width>
		<height>34</height>
		<align></align>
		<caption>Correlation coefficient, regression line, and standard error of estimate.</caption>
		<alt> Correlation coefficient, regression line, and standard error of estimate.</alt>
	</figure>
</P>
		<TestItem type="MC">
			<question>The intercept of the regression line when the X variable is dichotomous and coded 0 and 1 is</question>
			<answer type="correct">the mean of Y when X=0.</answer>
			<answer>the mean of Y when X=1.</answer>
			<answer>the mean of Y.</answer>
			<answer>the mean of X.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
		<TestItem type="MC">
			<question> The slope of the regression line when the X variable is dichotomous and coded 0 and 1 is </question>
			<answer type="correct">the difference between the mean of Y when X=1 and the mean of Y when X=0.</answer>
			<answer>the mean of Y.</answer>
			<answer>the mean of Y when X=0.</answer>
			<answer>the standard deviation of Y.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>Note that the correlation coefficient and slope of the regression line are both positive, as predicted from the <index>scatter plot</index>. A number of interesting relationships between these statistics and the table of means may be observed. Note that the <index>intercept of the regression line</index> is the mean of Y (12.75) when X=0. This will occur whenever one value of the dichotomous independent variable is coded as 0. Note also that the <index>slope of the regression line</index> is the difference between the two means (16.75 - 12.75 = 4). Finally, note that the <index>standard error of estimate</index>, 1.44, is the mean of the standard deviations of Y when X=0 and X=1, (1.39+1.49)/2 = 1.44. This will be true, at least to a close approximation, whenever there are equal numbers of scores in each group and the two standard deviations are similar in size, because the standard error of estimate is the square root of the pooled within-group variance. Otherwise the standard error of estimate behaves like a weighted average of the two standard deviations.</P>
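<P>The relationships between the table of means and the regression statistics can be checked with a short sketch. The scores below are hypothetical (the example data appear only in the figure), chosen so that the group means are 12.75 and 16.75. The standard error of estimate is computed here as the square root of the error sum of squares divided by N - 2, which is an assumption about how the figure's value was obtained.</P>

```python
import math
import statistics

# Hypothetical scores, four per group, with X coded 0 and 1.
y_when_x0 = [11.0, 12.0, 13.0, 15.0]
y_when_x1 = [14.0, 16.0, 18.0, 19.0]

x = [0.0] * len(y_when_x0) + [1.0] * len(y_when_x1)
y = y_when_x0 + y_when_x1
n = len(y)

mean_x = sum(x) / n
mean_y = sum(y) / n
ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
ss_xx = sum((a - mean_x) ** 2 for a in x)
slope = ss_xy / ss_xx
intercept = mean_y - slope * mean_x

# Standard error of estimate: root of error sum of squares over N - 2.
ss_error = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
std_error = math.sqrt(ss_error / (n - 2))

mean0 = statistics.mean(y_when_x0)
mean1 = statistics.mean(y_when_x1)
sd0 = statistics.stdev(y_when_x0)
sd1 = statistics.stdev(y_when_x1)

print(intercept, mean0)            # intercept matches the mean of Y at X=0
print(slope, mean1 - mean0)        # slope matches the difference in means
print(std_error, (sd0 + sd1) / 2)  # close to the mean of the two SDs
```

<P>In this sketch the intercept and slope reproduce the group mean and the difference between means exactly, while the standard error of estimate is close to, but not exactly, the mean of the two standard deviations because the two groups differ somewhat in variability.</P>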
<P>An interactive exercise has been provided to allow the student to explore the relationship between these different representations of dichotomous independent variables. By adjusting the scroll bars,</P>
		<TestItem type="MC">
			<question> The correlation between two variables when the X variable is dichotomous and coded 0 and 1 will be negative when</question>
			<answer type="correct">the mean of Y when X=1 is less than the mean of Y when X=0.</answer>
			<answer>the mean of Y is less than 0.</answer>
			<answer>the mean of Y when X=0 is negative.</answer>
			<answer>the standard deviation of Y is greater than the standard deviation of X.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>	<figure>
		<description> The control bar for the interactive exercise to explore the relationship between different representations of dichotomous independent variables is shown. It has two scroll bars and text boxes, labeled effect size and error, respectively. Both text boxes show the number 5.</description>
		<url>images/Mlt0213.gif</url>
		<width>436</width>
		<height>26</height>
		<align></align>
		<caption>The control bar for the interactive exercise to explore the relationship between different representations of dichotomous independent variables.</caption>
		<alt>The control bar for the interactive exercise to explore the relationship between different representations of dichotomous independent variables</alt>
	</figure>
</P>
<P>different types of data may be generated. The "Effect Size" scroll bar corresponds to the slope of the regression line. Because different random data are generated each time the "New Data" button is pressed, the two will seldom be identical. Over the long run, however, the expected value of the slope should equal the effect size. The "Error" scroll bar controls the size of the standard deviations of the Y variable within each group. Changing this value will vary the size of the standard error of estimate, correlation coefficient, and standard deviations.</P>
<P>When the <index>independent variable</index> is <index>dichotomous</index>, simple linear regression basically predicts the value of Y using the mean of the Y for each level of the X variable. The <index>correlation coefficient</index> reflects both the direction and strength of this prediction. A positive correlation indicates that the mean of the Y variable for the larger value of X will be greater than that for the smaller value of X. A negative correlation indicates the opposite, namely that the mean of the Y variable will be larger for the smaller value of X. The absolute value of the correlation coefficient measures the difference between the two means relative to their standard deviations. The bigger the difference of the means relative to their standard deviations, the larger the absolute value of the correlation coefficient.</P>
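<P>The dependence of the correlation coefficient on the difference between the means relative to the standard deviations can be illustrated without random data. The sketch below uses hypothetical within-group deviations and two hypothetical parameters, effect and spread, which loosely mirror the "Effect Size" and "Error" controls of the interactive exercise but are not taken from it.</P>

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical within-group deviations, reused for both groups so that only
# the difference between the group means changes.
deviations = [-2.0, -1.0, 1.0, 2.0]
x = [0.0] * 4 + [1.0] * 4

def r_for_effect(effect, spread=1.0):
    """Correlation when group X=1 is shifted by `effect` and the
    within-group deviations are multiplied by `spread`."""
    y0 = [10.0 + spread * d for d in deviations]
    y1 = [10.0 + effect + spread * d for d in deviations]
    return correlation(x, y0 + y1)

# A negative effect gives a negative correlation; larger effects give
# larger absolute correlations.
for effect in (-4.0, 1.0, 2.0, 4.0):
    print(effect, round(r_for_effect(effect), 3))

# Doubling both the effect and the spread leaves the correlation unchanged.
print(round(r_for_effect(4.0, spread=2.0), 3), round(r_for_effect(2.0), 3))
```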
		<TestItem type="MC">
			<question> The correlation between two variables when the X variable is dichotomous and coded 0 and 1 will be larger when</question>
			<answer>the mean of Y when X=1 is less than the mean of Y when X=0.</answer>
			<answer>the mean of Y is less than 0.</answer>
			<answer>the standard deviation of Y is large.</answer>
			<answer type="correct">the difference between the means is large relative to their standard deviations.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>

<P>All this is really easier than it sounds. Students should run the interactive exercise until they feel that they have a solid understanding of the procedure and then test themselves using the testing exercise.
</P>
<activity URL="exercises/mlt02a.htm">
	<description> Exploring Dichotomous Independent Variables </description>
</activity>
</section>
<section>
<P><h3>Dichotomous Dependent Variables</h3></P>
<P>The case where the Y variable is dichotomous and the X variable is interval shares many similarities with the dichotomous independent variable case. For example, the data files appear similar</P>
<P>	<figure>
		<description> Example data is presented, titled dichotomous data in regression. Two rows of data are shown, the first, labeled X, has values between 11 and 19. The second, labeled Y, has only zeros or ones as data.</description>
		<url>images/Mlt0214.gif</url>
		<width>455</width>
		<height>54</height>
		<align></align>
		<caption></caption>
		<alt>Dichotomous data in regression</alt>
	</figure>
</P>
<P>and the scatter plot is rotated ninety degrees.</P>
<P>	<figure>
		<description> A scatter plot of a dichotomous dependent variable and an interval independent variable is shown. The Y-axis has two values labeled, 0 and 1. All the points on the X-axis appear at the two points labeled 0 and 1 on the Y-axis. A regression line passes through the two sets of points at the value of the mean of X for a given value of Y.</description>
		<url>images/Mlt0215.gif</url>
		<width>252</width>
		<height>142</height>
		<align></align>
		<caption>Scatter plot of a dichotomous dependent variable and interval independent variable.</caption>
		<alt> Scatter plot of a dichotomous dependent variable and interval independent variable.</alt>
	</figure>
</P>
<P>The relationships between the <index>table of means</index> and the slope and intercept of the regression line are not apparent in this case.</P>
<P>	<figure>
		<description> A table of means and standard deviations is shown when the dependent measure is dichotomous and the independent measure is interval. Two rows, labeled Y=0 and Y=1, have sample sizes, means, and standard deviations of 9, 13.22, and 2.63 and 7, 10.42, and 2.63, respectively. In addition, the correlation coefficient of -.49, regression line where y prime equals 1.47 plus -.09 times X, and a standard error equaling 0.46 is shown, describing the relationships in the example data.</description>
		<url>images/Mlt0216.gif</url>
		<width>366</width>
		<height>78</height>
		<align></align>
		<caption>Regression statistics, means, and standard deviations when the dependent variable is dichotomous.</caption>
		<alt> Regression statistics, means, and standard deviations when the dependent variable is dichotomous </alt>
	</figure>
</P>
<P>The goal of this analysis is to predict membership in one of two groups as a function of score on another variable. An interactive exercise has been provided to allow the student to explore simple linear regression when the dependent measure is dichotomous.</P>
<activity URL="exercises/mlt02b.htm">
	<description>Exploring Dichotomous Dependent Variables</description>
</activity>
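<P>Although the slope and intercept cannot be read directly from the table of means in this case, they can still be recovered from it. The sketch below rebuilds the regression statistics from the sample sizes, means, and standard deviations reported in the figure, assuming that the standard deviations were computed with N - 1 in the denominator and that Y is coded 0 and 1.</P>

```python
import math

# Summary statistics from the table: X within each level of Y.
n0, mean0, sd0 = 9, 13.22, 2.63   # Y = 0
n1, mean1, sd1 = 7, 10.42, 2.63   # Y = 1

n = n0 + n1
mean_x = (n0 * mean0 + n1 * mean1) / n
mean_y = n1 / n                   # proportion of cases with Y = 1

# Sums of squares and cross products rebuilt from the group summaries.
ss_xx = ((n0 - 1) * sd0 ** 2 + (n1 - 1) * sd1 ** 2
         + n0 * (mean0 - mean_x) ** 2 + n1 * (mean1 - mean_x) ** 2)
ss_yy = n1 - n * mean_y ** 2
ss_xy = n1 * mean1 - n * mean_x * mean_y

slope = ss_xy / ss_xx
intercept = mean_y - slope * mean_x
std_error = math.sqrt((ss_yy - slope * ss_xy) / (n - 2))

print(round(intercept, 2), round(slope, 2), round(std_error, 2))
```

<P>Under these assumptions the sketch reproduces the intercept (1.47), slope (-.09), and standard error of estimate (0.46) shown in the figure. The slope is negative because the mean of X is larger when Y=0 than when Y=1.</P>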
		<TestItem type="MC">
			<question> The slope of the regression line when the Y variable is dichotomous and coded 0 and 1 is </question>
			<answer>the difference between the mean of Y when X=1 and the mean of Y when X=0.</answer>
			<answer>the mean of X.</answer>
			<answer>positive when the mean of X at Y=0 is larger than the mean of X when Y=1.</answer>
			<answer type="correct">negative when the mean of X at Y=0 is larger than the mean of X when Y=1.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
</section>
<section>
<P><h4>The t-test as a Special Case of Regression</h4></P>
<P>The topics of hypothesis testing in regression and dichotomous independent variables can be combined to show how the <index>t-test</index> is a special case of linear regression. The test of whether the dichotomous independent variable adds significant additional predictive power in a linear regression will yield conclusions identical to the conclusions of a conventional t-test.</P>
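<P>The equivalence can be illustrated numerically. The sketch below uses hypothetical scores (not the homework data) and compares a conventional independent-groups t-test, computed with a pooled variance, against the F ratio for b<SUB>1</SUB> given b<SUB>0</SUB> from a regression of Y on X coded 0 and 1.</P>

```python
import math
import statistics

# Hypothetical scores for two groups, with X coded 0 and 1.
y0 = [11.0, 12.0, 13.0, 15.0, 14.0]
y1 = [15.0, 16.0, 18.0, 19.0, 17.0]
n0, n1 = len(y0), len(y1)

# Conventional independent-groups t-test with a pooled variance.
pooled_var = ((n0 - 1) * statistics.variance(y0)
              + (n1 - 1) * statistics.variance(y1)) / (n0 + n1 - 2)
t = (statistics.mean(y1) - statistics.mean(y0)) / math.sqrt(
    pooled_var * (1 / n0 + 1 / n1))

# Regression of Y on X: F ratio for b1 given b0.
x = [0.0] * n0 + [1.0] * n1
y = y0 + y1
n = n0 + n1
mean_x, mean_y = sum(x) / n, sum(y) / n
b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
      / sum((a - mean_x) ** 2 for a in x))
b0 = mean_y - b1 * mean_x
ss_error = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
ss_regression = sum((b - mean_y) ** 2 for b in y) - ss_error
f_ratio = (ss_regression / 1) / (ss_error / (n - 2))

# The squared t statistic equals the regression F ratio.
print(round(t ** 2, 4), round(f_ratio, 4))
```

<P>Because the squared t statistic equals the F ratio and both tests have the same error degrees of freedom (N - 2), the exact significance levels are the same.</P>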
<P>The following table presents data and answers from the homework assignment that has a dichotomous independent variable in a regression model.</P>
<P>	<figure>
		<description>A partial view of the first seven pairs of scores in an example regression assignment data when the independent measure is dichotomous and coded as 0 and 1 is shown. It contains nine columns and seven rows of data. The columns include X, Y, Y squared, Y minus Y bar, Y minus Y bar squared, Y prime, Y minus Y prime, and Y minus Y prime squared.</description>
		<url>images/Mlt0217.gif</url>
		<width>535</width>
		<height>261</height>
		<align></align>
		<caption>Example regression assignment data when the independent measure is dichotomous and coded as 0 and 1.</caption>
		<alt> Example regression assignment data when the independent measure is dichotomous and coded as 0 and 1</alt>
	</figure>
</P>
<P>The first window of the SPSS data matrix is presented below.</P>
<P>	<figure>
		<description>The first ten scores of the example data are shown in the SPSS data editor.</description>
		<url>images/Mlt0218.gif</url>
		<width>90</width>
		<height>311</height>
		<align></align>
		<caption>Example assignment data in the SPSS data editor.</caption>
		<alt> Example assignment data in the SPSS data editor </alt>
	</figure>
</P>
<P>The <index>ANOVA source table</index> for the REGRESSION procedure is presented below.</P>
<P>	<figure>
		<description> The ANOVA source table for the Regression procedure using the example data is shown. The value for the F ratio is 8.425 with an exact significance level of .009.</description>
		<url>images/Mlt0220.gif</url>
		<width>500</width>
		<height>174</height>
		<align></align>
		<caption>The ANOVA source table for the Regression procedure using the example data.</caption>
		<alt> The ANOVA source table for the Regression procedure using the example data </alt>
	</figure>
</P>
		<TestItem type="MC">
			<question>The values for the F ratios when the dependent measure is dichotomous using the Regression and Compare means procedure will be</question>
			<answer type="correct">identical.</answer>
			<answer>different depending upon how the dependent measure is coded.</answer>
			<answer>interval multiples of each other.</answer>
			<answer>based on different numbers of degrees of freedom for the error term.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem> 
<P>A similar ANOVA source table is produced using the COMPARE MEANS procedure and requesting the optional ANOVA analysis. It is readily observed that they are identical.</P>
<P>	<figure>
		<description>The ANOVA source table for the Compare Means procedure using the example data is shown. The value for the F ratio is 8.425 with an exact significance level of .009.</description>
		<url>images/Mlt0219.gif</url>
		<width>597</width>
		<height>147</height>
		<align></align>
		<caption> The ANOVA source table for the Compare Means procedure using the example data.</caption>
		<alt> The ANOVA source table for the Compare Means procedure using the example data </alt>
	</figure>
</P>
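<P>The agreement between the two source tables can be checked by hand. The following Python sketch uses hypothetical data (not the homework data, which appear only in the figures above) and computes the F ratio twice: once by partitioning the sum of squares around a fitted regression line, and once by partitioning it between and within groups. All variable names are assumptions made for this illustration.</P>

```python
from statistics import mean

# Hypothetical data (an assumption, not the homework data):
# x is group membership coded 0 or 1, y is the dependent measure.
x = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y = [12, 15, 11, 18, 14, 20, 17, 25, 22, 19, 24, 21]
n = len(y)
x_bar, y_bar = mean(x), mean(y)

# Regression approach: fit Y' = a + bX, then partition the sum of squares.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * xi for xi in x]
ss_regression = sum((p - y_bar) ** 2 for p in predicted)
ss_residual = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
f_regression = (ss_regression / 1) / (ss_residual / (n - 2))

# Compare-means approach: partition the sum of squares by group.
groups = {g: [yi for xi, yi in zip(x, y) if xi == g] for g in (0, 1)}
ss_between = sum(len(s) * (mean(s) - y_bar) ** 2 for s in groups.values())
ss_within = sum((yi - mean(s)) ** 2 for s in groups.values() for yi in s)
f_anova = (ss_between / 1) / (ss_within / (n - 2))

print(f"F from regression:    {f_regression:.3f}")
print(f"F from compare means: {f_anova:.3f}")
```

<P>The two F ratios agree because, with X coded 0 and 1, the predicted score for every member of a group is that group's mean. The intercept is the mean of the group coded 0, and the slope is the difference between the two group means.</P>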
<P>Comparing the computational difficulty of the regression approach with that of the approach presented in the chapter on ANOVA in the introductory text, the student might object that the regression approach is much more time consuming to compute and gives much less insight into the reasoning behind the analysis. Since ease of computation and insight are the basic criteria for including an analysis in this statistics text, why is the regression approach included?</P>
<P>First of all, <index>computational difficulty</index> is not an issue after the homework assignments have been completed. All computation will be done with computers.</P>
		<TestItem type="MC">
			<question>The advantage of using the Regression approach compared to the traditional t-test approach is</question>
			<answer type="correct">greater flexibility in analysis.</answer>
			<answer>simplicity of computation.</answer>
			<answer>greater power.</answer>
			<answer>fewer assumptions about the data are required.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>Secondly, the regression procedure provides greatly increased flexibility in analysis. When multiple independent measures are included, the regression approach to testing hypotheses about means is preferred to the more traditional analyses of ANOVA and t-tests.</P>
<P>Thirdly, the computational formulas provided in the chapter on <index>ANOVA</index> in the introductory text were simplified in order to provide a cleaner view of the underlying theory. The formulas in the introductory text assumed that there were an equal number of subjects in each group, an assumption that is hardly ever true in practice. The computational formulas for the Mean Square terms must be modified in order to weight each group by the sample size of that group. The changes in the computational formulas are similar to the changes in computing the standard error of the difference between the means under the assumption of equal n's and unequal n's in the Nested t-test.</P>
		<TestItem type="MC">
			<question>When there are unequal numbers of scores in each group, the Mean Square Within Groups is</question>
			<answer type="correct">a weighted mean of the within group variances.</answer>
			<answer>the mean of the within group variances.</answer>
			<answer>a weighted variance of the group means.</answer>
			<answer>the Mean Square Between Groups divided by the Mean Square Within Groups.</answer>
			<difficulty></difficulty>
			<discriminability></discriminability>
			<author>David Stockburger</author>
			<date>03/05/2001</date>
			<concept>Additional Regression Topics</concept>
		</TestItem>
<P>The Mean Square within Groups is no longer a mean of group variances, but becomes a weighted mean of group variances. If the sample sizes, means, and variances of the two groups are represented by n<SUB>1</SUB>, n<SUB>2</SUB>, X<SUB>1</SUB>, X<SUB>2</SUB>, s<SUB>1</SUB><SUP>2</SUP>, and s<SUB>2</SUB><SUP>2</SUP>, respectively, then the computational formula for the <index>Mean Square Within</index> becomes:</P>
<P>	<figure>
		<description>The computational formula for Mean Squares Within is shown when there are an unequal number of subjects in each group. The formula states that the mean square within is equal to a numerator of the number in group one minus one times the variance of group one plus the number in group two minus one times the variance of group two and a denominator of the number in group one plus the number in group two minus two.</description>
		<url>images/Mlt0225.gif</url>
		<width>281</width>
		<height>77</height>
		<align></align>
		<caption> Computational formula for Mean Squares Within.</caption>
		<alt> Computational formula for Mean Squares Within </alt>
	</figure>
</P>
<P>Substituting numbers from the example data for the terms in this equation yields:</P>
<P>	<figure>
		<description>The computational formula for Mean Squares Within is shown when there are an unequal number of subjects in each group. The formula states that the mean square within is equal to a numerator of the number in group one minus one times the variance of group one plus the number in group two minus one times the variance of group two and a denominator of the number in group one plus the number in group two minus two. Numbers replace the symbols in the second equation, with the resulting equation yielding a value of 59.51.</description>
		<url>images/Mlt0226.gif</url>
		<width>340</width>
		<height>137</height>
		<align></align>
		<caption>Computational formula for Mean Squares Within.</caption>
		<alt> Computational formula for Mean Squares Within </alt>
	</figure>
</P>
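<P>The weighted formula for the Mean Square Within can be sketched in a few lines of Python. The function name mean_square_within and the two groups of scores are hypothetical; the values are not those of the example data, which are available only in the figures.</P>

```python
from statistics import variance


def mean_square_within(group_1, group_2):
    """Weighted mean of the two group variances, each weighted by n - 1."""
    n_1, n_2 = len(group_1), len(group_2)
    s2_1, s2_2 = variance(group_1), variance(group_2)  # sample variances
    return ((n_1 - 1) * s2_1 + (n_2 - 1) * s2_2) / (n_1 + n_2 - 2)


# Hypothetical groups with unequal n (an assumption, not the example data).
group_1 = [12, 15, 11, 18, 14]
group_2 = [20, 17, 25, 22, 19, 24, 21]

ms_within = mean_square_within(group_1, group_2)
unweighted = (variance(group_1) + variance(group_2)) / 2

print(f"weighted MS within:       {ms_within:.4f}")
print(f"simple mean of variances: {unweighted:.4f}")
```

<P>When the two groups have the same number of scores, the weights are equal and the function returns the simple mean of the two variances, which is the simplified formula of the introductory text.</P>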
<P>In a similar manner, the equation for the <index>Mean Square Between</index> Groups must be modified in order to deal with unequal n in the two groups. The revised formula for MS<SUB>between</SUB> is</P>
<P>	<figure>
		<description>A revised formula for the mean squares between is shown for computation when the sample sizes of the two groups are no longer equal. The mean square between is equal to the sample size of group one times the squared difference between the group mean and the grand mean plus the sample size of group two times the squared difference between the group mean and the grand mean.</description>
		<url>images/Mlt0227.gif</url>
		<width>346</width>
		<height>46</height>
		<align></align>
		<caption>Computational formula for Mean Squares Between.</caption>
		<alt>Computational formula for Mean Squares Between</alt>
	</figure>
</P>
<P>Again substituting numbers for variables, the Mean Square Between Groups for the example data is:</P>
<P>	<figure>
		<description>A revised formula for the mean squares between is shown for computation when the sample sizes of the two groups are no longer equal. The mean square between is equal to the sample size of group one times the squared difference between the group mean and the grand mean plus the sample size of group two times the squared difference between the group mean and the grand mean. Numbers are substituted in the second equation with the resulting mean square between being a value of 500.68.</description>
		<url>images/Mlt0228.gif</url>
		<width>455</width>
		<height>161</height>
		<align></align>
		<caption>Computational formula for Mean Squares Between.</caption>
		<alt>Computational formula for Mean Squares Between</alt>
	</figure>
</P>
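<P>A minimal Python sketch of the revised Mean Square Between follows. The function name mean_square_between and the scores are hypothetical and do not reproduce the example data. The sketch assumes that the grand mean is the mean of all scores taken together, which with unequal n is not the same as the simple average of the two group means.</P>

```python
from statistics import mean


def mean_square_between(group_1, group_2):
    """Sum of n times (group mean - grand mean) squared, divided by df = 1."""
    grand_mean = mean(group_1 + group_2)  # mean of all scores together
    ss_between = sum(
        len(g) * (mean(g) - grand_mean) ** 2 for g in (group_1, group_2)
    )
    return ss_between / 1  # two groups, so one degree of freedom


# Hypothetical groups with unequal n (an assumption, not the example data).
group_1 = [12, 15, 11, 18, 14]
group_2 = [20, 17, 25, 22, 19, 24, 21]

ms_between = mean_square_between(group_1, group_2)
grand_mean = mean(group_1 + group_2)
average_of_means = (mean(group_1) + mean(group_2)) / 2

print(f"MS between:               {ms_between:.4f}")
print(f"grand mean of all scores: {grand_mean:.4f}")
print(f"average of group means:   {average_of_means:.4f}")
```

<P>Because the larger group contributes more scores, the grand mean is pulled toward that group's mean, and each squared deviation is weighted by the size of its group.</P>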
<P>These results are within <index>rounding error</index> of those generated by the statistical package.</P>
<P>Comparing these results with the computed results of the homework exercise yields the following.</P>
<P>	<figure>
		<description>The ANOVA table from the example homework assignment is shown. The value for the mean square between is 501.11 and the value for the mean square within is 59.48.</description>
		<url>images/Mlt0221.gif</url>
		<width>352</width>
		<height>207</height>
		<align></align>
		<caption>ANOVA table from the example homework assignment.</caption>
		<alt> ANOVA table from the example homework assignment </alt>
	</figure>
</P>
<P>Again it is observed that the two ANOVA source tables are identical. In addition, as described in the introductory text, the <index>t-test</index> and <index>ANOVA</index> results are related by the function F=t<SUP>2</SUP>. If the reader is still not convinced of the similarity of the tests, the results of the INDEPENDENT SAMPLES t-TEST are presented below.</P>
<P>	<figure>
		<description> SPSS output for Independent Samples t-test procedure is shown for the example data. The exact significance level for this output is .009.</description>
		<url>images/Mlt0223.gif</url>
		<width>506</width>
		<height>233</height>
		<align></align>
		<caption>SPSS output for Independent Samples t-test procedure.</caption>
		<alt>SPSS output for Independent Samples t-test procedure</alt>
	</figure>
</P>
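<P>The relationship F=t<SUP>2</SUP> can also be verified numerically. The Python sketch below uses hypothetical scores rather than the example data and assumes the pooled-variance (equal variances assumed) form of the independent samples t-test. The variable names are illustrative only.</P>

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical groups with unequal n (an assumption, not the example data).
group_1 = [12, 15, 11, 18, 14]
group_2 = [20, 17, 25, 22, 19, 24, 21]
n_1, n_2 = len(group_1), len(group_2)

# Pooled variance, which is the same quantity as the Mean Square Within.
ms_within = ((n_1 - 1) * variance(group_1)
             + (n_2 - 1) * variance(group_2)) / (n_1 + n_2 - 2)

# Independent samples t-test using the pooled variance.
standard_error = sqrt(ms_within * (1 / n_1 + 1 / n_2))
t = (mean(group_1) - mean(group_2)) / standard_error

# F ratio from the Mean Square Between and the Mean Square Within.
grand_mean = mean(group_1 + group_2)
ms_between = (n_1 * (mean(group_1) - grand_mean) ** 2
              + n_2 * (mean(group_2) - grand_mean) ** 2)
f_ratio = ms_between / ms_within

print(f"t = {t:.4f}, t squared = {t ** 2:.4f}, F = {f_ratio:.4f}")
```

<P>The sign of t depends only on which group is entered first. Squaring removes the sign, so the two tests lead to the same decision about the equality of the means.</P>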
<P>Because the regression procedure provides greatly increased flexibility in analysis when multiple independent measures are included, the regression approach to testing hypotheses about means is preferred to the more traditional analyses of ANOVA and t-tests.</P>
</section>
<section>
<P><h3>Summary</h3></P>
<P>
This chapter started with a discussion and demonstration of hypothesis testing in linear regression. Hypothesis testing in linear regression answers the question of whether the addition of weighted variables to a regression equation increases the predictive power of the model enough to attribute the difference to something other than chance. In this section you saw how to find the F ratio in an ANOVA table and test the appropriate hypotheses. In later chapters the same underlying procedure will be used, except that the change in predictive ability will be measured by R<SUP>2</SUP> change and only the exact significance level of the ANOVA will be reported.</P>
<P>
The second section of this chapter applied the hypothesis testing procedure discussed in the first section to dichotomous variables. It then demonstrated that the hypothesis testing procedure in linear regression is functionally equivalent to the t-test and ANOVA procedures for testing the equality of means.
</P>
</section>
</chapter>

