A statistic is an algebraic expression combining scores into a single number. Statistics serve two functions: they estimate parameters in population models and they describe the data. The statistics discussed in this chapter will be used both as estimates of the population parameters μ and σ and as measures of central tendency and variability. There are a large number of possible statistics, but some are more useful than others.
Central tendency is a typical or representative score. If the mayor is asked to provide a single value which best describes the income level of the city, he or she would answer with a measure of central tendency. There are many possible measures of central tendency; the three major ones, the mode, the median, and the mean, will be discussed in this chapter.
The mode, symbolized by Mo, is the most frequently occurring score value. If the scores for a given sample distribution are:
then the mode would be 39 because a score of 39 occurs 3 times, more than any other score. The mode may be seen on a frequency distribution as the score value which corresponds to the highest point. For example, the following is a frequency polygon of the data presented above:
A distribution may have more than one mode if two score values occur the same number of times, more often than any other score. For example, if the earlier score distribution were modified as follows:
then there would be two modes, 32 and 39. Such distributions are called bimodal. The frequency polygon of a bimodal distribution is presented below.
In an extreme case there may be no unique mode, as in the case of a rectangular distribution.
The mode is not sensitive to extreme scores. Suppose the original distribution was modified by changing the last number, 45, to 55 as follows:
The mode would still be 39.
In any case, the mode is a quick and dirty measure of central tendency: quick, because it is easily and rapidly computed; dirty, because it is not very useful, giving relatively little information about the distribution.
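The mode computation can be sketched in a few lines of Python. The chapter's original score list is not reproduced here, so the sample below is a hypothetical distribution in which 39 occurs three times, more often than any other score; the same function also handles the bimodal case by returning every most frequent value.

```python
from collections import Counter

def mode(scores):
    """Return all most frequently occurring score values (handles bimodal cases)."""
    counts = Counter(scores)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

# Hypothetical sample: 39 occurs three times, more than any other score.
sample = [32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]
print(mode(sample))          # [39]
print(mode(sample + [32]))   # [32, 39] -- adding another 32 makes it bimodal
```

Returning a list rather than a single value mirrors the text's point that a distribution may have one mode, two modes, or no unique mode at all.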
The median, symbolized by Md, is the score value which cuts the distribution in half, such that half the scores fall above the median and half fall below it. Computation of the median is relatively straightforward. The first step is to rank order the scores from lowest to highest. The procedure branches at the next step: one way if there is an odd number of scores in the sample distribution, another if there is an even number of scores.
If there is an odd number of scores as in the distribution below:
then the median is simply the middle number. In the case above the median would be the number 38, because there are 15 scores altogether, with 7 scores smaller and 7 larger.
If there is an even number of scores, as in the distribution below:
then the median is the midpoint between the two middle scores, in this case the value 38.5. It was found by adding the two middle scores together and dividing by two: (38 + 39)/2 = 38.5. If the two middle scores are the same value, then the median is that value.
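The two-branch procedure can be sketched directly in Python. The chapter's 15-score and 16-score distributions are not reproduced here, so the short hypothetical lists below stand in for them.

```python
def median(scores):
    """Rank order the scores, then branch on odd vs. even N."""
    ranked = sorted(scores)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]                      # odd N: the middle score
    return (ranked[mid - 1] + ranked[mid]) / 2  # even N: midpoint of the middle two

print(median([35, 36, 37, 38, 39]))        # odd N -> 37
print(median([36, 37, 38, 39, 40, 41]))    # even N -> (38 + 39) / 2 = 38.5
```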
In the above system, no account is taken of whether scores are duplicated around the median. Some systems apply a slight correction for grouped data, but because the correction is small and data are generally not grouped for computation on calculators or computers, it is not presented here.
Like the mode, the median is not affected by extreme scores, as the following distribution of scores indicates:
The median is still the value of 38.5. The median is not as quick and dirty as the mode, but generally it is not the preferred measure of central tendency.
The mean, symbolized by X̄ (read "X-bar"), is the sum of the scores divided by the number of scores. The following formula both defines the mean and describes the procedure for finding it:

X̄ = ΣX / N

where ΣX is the sum of the scores and N is the number of scores. Application of this formula to the following data
yields the following results:
Use of means as a way of describing a set of scores is fairly common; batting average, bowling average, grade point average, and average points scored per game are all means. Note the use of the word "average" in all of the above terms. In most cases when the term "average" is used, it refers to the mean, although not necessarily. When a politician uses the term "average income", for example, he or she may be referring to the mean, median, or mode.
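The formula translates directly into code. The sketch below applies it to the five scores (8, 8, 9, 12, 13) that the chapter uses later in the variance example:

```python
def mean(scores):
    """X-bar = (sum of the scores) / (number of scores)."""
    return sum(scores) / len(scores)

scores = [8, 8, 9, 12, 13]
print(mean(scores))   # 50 / 5 = 10.0
```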
The mean is sensitive to extreme scores. For example, the mean of the following data is 39.0, somewhat larger than the preceding example.
In most cases the mean is the preferred measure of central tendency, both as a description of the data and as an estimate of the parameter. In order for the mean to be meaningful, however, the interval property of measurement must be accepted. When this property is obviously violated, it is inappropriate and misleading to compute a mean, as is the case when the data are clearly nominal categorical. An example would be political party preference, where 1 = Republican, 2 = Democrat, and 3 = Independent. The special case of dichotomous nominal categorical variables, however, does allow meaningful interpretation of means. For example, if only two levels of political party preference were allowed, 1 = Republican and 2 = Democrat, then the mean of this variable could be interpreted. In such cases it is preferable to code one level of the variable with a 0 and the other with a 1, so that the mean is the proportion of the second level in the sample. For example, if gender were coded with 0 = Male and 1 = Female, then the mean of this variable would be the proportion of females in the sample.
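The 0/1 coding trick is easy to verify with a small hypothetical sample (the values below are made up for illustration):

```python
# Gender coded 0 = Male, 1 = Female: the mean is the proportion of females.
gender = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # hypothetical sample of 10 people
print(sum(gender) / len(gender))           # 0.7 -> 70% of the sample is female
```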
As is commonly known, Kiwi-birds are native to New Zealand. They are born exactly one foot tall and grow in one foot intervals. That is, one moment they are one foot tall and the next they are two feet tall. They are also very rare. An investigator goes to New Zealand and finds four birds. The mean of the four birds is 4, the median is 3, and the mode is 2. What are the heights of the four birds?
Hint - examine the constraints of the mode first, the median second, and the mean last.
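For readers who want to check their answer, a brute-force search over whole-foot heights confirms that the puzzle's constraints pin down exactly one set of birds. The search below uses Python's statistics module; multimode(...) == [2] enforces that 2 is the single most frequent height.

```python
from itertools import combinations_with_replacement
from statistics import mean, median, multimode

solutions = []
# Four birds, whole-foot heights; the mean of 4 caps any height at 16 feet.
for birds in combinations_with_replacement(range(1, 17), 4):
    if (mean(birds) == 4 and median(birds) == 3
            and multimode(birds) == [2]):
        solutions.append(birds)

print(solutions)   # the search turns up exactly one set of heights
```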
Skewness refers to the asymmetry of the distribution, such that a symmetrical distribution exhibits no skewness. In a symmetrical distribution the mean, median, and mode all fall at the same point, as in the following distribution.
An exception to this is the case of a bimodal symmetrical distribution. In this case the mean and the median fall at the same point, while the two modes correspond to the two highest points of the distribution. An example follows:
A positively skewed distribution is asymmetrical and points in the positive direction. If a test was very difficult and almost everyone in the class did very poorly on it, the resulting distribution would most likely be positively skewed.
In the case of a positively skewed distribution, the mode is smaller than the median, which is smaller than the mean. This relationship exists because the mode is the point on the x-axis corresponding to the highest point of the distribution, that is, the score with the greatest frequency. The median is the point on the x-axis that cuts the distribution in half, such that 50% of the area falls on each side.
The mean is the balance point of the distribution. Because points further away from the balance point change the center of balance, the mean is pulled in the direction the distribution is skewed. For example, if the distribution is positively skewed, the mean would be pulled in the direction of the skewness, or be pulled toward larger numbers.
One way to remember the order of the mean, median, and mode in a skewed distribution is to remember that the mean is pulled in the direction of the extreme scores. In a positively skewed distribution, the extreme scores are larger, thus the mean is larger than the median.
A negatively skewed distribution is asymmetrical and points in the negative direction, such as would result with a very easy test. On an easy test, almost all students would perform well and only a few would do poorly.
The order of the measures of central tendency would be the opposite of the positively skewed distribution, with the mean being smaller than the median, which is smaller than the mode.
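The mode < median < mean ordering for positive skew is easy to demonstrate with a small hypothetical data set whose tail points toward larger numbers:

```python
from statistics import mean, median, mode

# Hypothetical positively skewed scores: a long tail of larger values.
skewed = [2, 2, 2, 3, 3, 4, 5, 15]
print(mode(skewed), median(skewed), mean(skewed))   # 2 < 3.0 < 4.5
```

The single extreme score, 15, pulls the mean well above the median, while the mode stays at the peak of the distribution.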
Variability refers to the spread or dispersion of scores. A distribution of scores is said to be highly variable if the scores differ widely from one another. Three statistics will be discussed which measure variability: the range, the variance, and the standard deviation. The latter two are very closely related and will be discussed in the same section.
The range is the largest score minus the smallest score. It is a quick and dirty measure of variability, although when a test is given back to students they very often wish to know the range of scores. Because the range is greatly affected by extreme scores, it may give a distorted picture of the scores. The following two distributions have the same range, 13, yet appear to differ greatly in the amount of variability.
For this reason, among others, the range is not the most important measure of variability.
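The point can be illustrated with two hypothetical distributions (the chapter's original pair is not reproduced here). Both have a range of 13, yet their standard deviations, introduced in the next section, differ noticeably:

```python
from statistics import stdev

# Two hypothetical distributions with the same range but different spread.
spread_out = [1, 3, 5, 7, 9, 12, 14]
bunched_up = [1, 8, 8, 8, 8, 8, 14]

for scores in (spread_out, bunched_up):
    print(max(scores) - min(scores), round(stdev(scores), 2))
```

Both print a range of 13, but the first distribution's standard deviation is clearly larger, which the range alone cannot reveal.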
The variance, symbolized by "s²", is a measure of variability. The standard deviation, symbolized by "s", is the positive square root of the variance. It is easier to define the variance with an algebraic expression than with words, thus the following formula:

s² = Σ(X - X̄)² / (N - 1)
Note that the variance would be the average squared deviation around the mean if the expression were divided by N rather than N-1. It is divided by N-1, called the degrees of freedom (df), for theoretical reasons. If the mean is known, as it must be to compute the numerator of the expression, then only N-1 scores are free to vary; that is, if the mean and N-1 of the scores are known, it is possible to figure out the Nth score. One needs only recall the Kiwi-bird problem to be convinced that this is true: when the heights of three of the four birds were known, it was possible to find the height of the fourth bird given the mean. Similar logic works no matter what the size of N.
The formula for the variance presented above is a definitional formula: it defines what the variance means. The variance may be computed from this formula, but in practice this is rarely done; it is done here to better convey what the formula means. The computation is performed in a number of steps, presented below:
|Step One -||Find the mean of the scores.|
|Step Two -||Subtract the mean from every score.|
|Step Three -||Square the results of step two.|
|Step Four -||Sum the results of step three.|
|Step Five -||Divide the results of step four by N-1.|
|Step Six -||Take the square root of step five.|
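The six steps can be sketched as a short Python function, with one line per step:

```python
from math import sqrt

def sample_variance_and_sd(scores):
    n = len(scores)
    x_bar = sum(scores) / n                        # Step One: find the mean
    deviations = [x - x_bar for x in scores]       # Step Two: subtract the mean
    squared = [d ** 2 for d in deviations]         # Step Three: square the results
    sum_of_squares = sum(squared)                  # Step Four: sum the squares
    variance = sum_of_squares / (n - 1)            # Step Five: divide by N-1
    return variance, sqrt(variance)                # Step Six: take the square root

print(sample_variance_and_sd([8, 8, 9, 12, 13]))   # (5.5, 2.345...)
```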
The result at step five is the sample variance, at step six, the sample standard deviation. In the following example there are five scores: 8, 8, 9, 12, and 13.
|X||X - X̄||(X - X̄)²|
|8||-2||4|
|8||-2||4|
|9||-1||1|
|12||2||4|
|13||3||9|
|ΣX = 50||Σ(X - X̄) = 0||Σ(X - X̄)² = 22|
The first step is to compute the mean by summing the first column (50) and dividing by the number of scores (5), giving a mean of 10. The second step is to subtract the mean from each score, as shown in the second column above. The first entry, -2, is the score, 8, minus the mean, 10. The third column contains the values of the second column squared; for example, the first entry, -2, squared is 4. Summing the third column gives the sum of squared deviations from the mean, sometimes referred to as the sum of squares. This value is the numerator of the formula for the sample variance and is divided by the number of scores minus 1 to find the sample variance. In this case the sum of squares (22) is divided by 4 to give a result of 5.5. The sample standard deviation is the square root of the sample variance, or in this case, 2.345.
|Step One -||Find the mean of the scores.||X̄ = 50 / 5 = 10|
|Step Two -||Subtract the mean from every score.||The second column above|
|Step Three -||Square the results of step two.||The third column above|
|Step Four -||Sum the results of step three.||22|
|Step Five -||Divide the results of step four by N-1.||s² = 22 / 4 = 5.5|
|Step Six -||Take the square root of step five.||s = 2.345|
Note that the sum of the second column, the deviations from the mean, is zero. This must be the case if the calculations up to that point have been performed correctly.
The standard deviation measures variability in units of measurement, while the variance does so in units of measurement squared. For example, if one measured height in inches, then the standard deviation would be in inches, while the variance would be in inches squared. For this reason, the standard deviation is usually the preferred measure when describing the variability of distributions. The variance, however, has some unique properties that make it very useful in theoretical constructs. These properties will be discussed later on in this text.
Calculations may be checked by using the statistical functions of a statistical calculator. This is the way the variance and standard deviation are usually computed in practice. The calculator has the definitional formula for the variance automatically programmed internally, and all that is necessary for its use are the following steps:
|Step One -||Select the statistics mode.|
|Step Two -||Clear the statistical registers.|
|Step Three -||Enter the data.|
|Step Four -||Make sure the correct number of scores have been entered.|
|Step Five -||Hit the key that displays the mean.|
|Step Six -||Hit the key that displays the standard deviation.|
Note that when using the calculator the standard deviation is found before the variance, while the opposite is the case using the definitional formula. The results using the calculator and the definitional formula should agree, within rounding error. In this case rounding error, or imprecision that results from not using the entire number (e.g., 3.33 rather than 3.333333...), should be less than one or two hundredths in value.
More often than not, statistics are computed using a computer package such as SPSS. It may seem initially like a lot more time and trouble to use the computer to do such simple calculations, but the student will most likely appreciate the savings in time and effort at a later time.
The first step is to enter the data into a form the computer can recognize. A data file with the example numbers is illustrated below:
Any number of statistical commands in SPSS would result in the computation of simple measures of central tendency and variability. The example below illustrates the use of the FREQUENCIES command.
Clicking on Analyze/Descriptive Statistics/Frequencies on the Data Screen in SPSS results in the Frequencies command form illustrated below.
The variable "X" is clicked from the left hand box to the right hand box and the "Statistics" button on the form has been clicked. On the optional screen that appears when the "Statistics" button was clicked, the form illustrates the selection of four statistics under "Central Tendency" and three under "Dispersion". The results of the above procedure appear when the "OK" button is clicked and are presented below:
An analysis, called a breakdown, gives the means and standard deviations of a variable for each level of another variable. The means and standard deviations may then be compared to see if they are different. If the means are different, then the groups as a whole differ from each other. If the standard deviations differ, it means that the scores in the group with the smaller standard deviation are more similar to each other than the scores in the group with the larger standard deviation. Going back to the example of shoe sizes, the raw data appeared as follows:
|Shoe Size||Shoe Width||Sex|
The corresponding data file in SPSS would appear as follows:
It is possible to compare the shoe sizes of males and females by first finding the mean and standard deviation of males only and then for females only. In order to do this the original shoe sizes would be partitioned into two sets, one of males and one of females, as follows:
The means and standard deviations of the males and females are organized into a table as follows:
It can be seen that the males had larger shoe sizes as evidenced by the larger mean. It can also be seen that males also had somewhat greater variability as evidenced by their larger standard deviation. In addition, the variability WITHIN GROUPS (males and females separately) was considerably less than the TOTAL variability (both sexes combined).
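A breakdown amounts to partitioning the scores by the grouping variable and computing a mean and standard deviation within each partition. The chapter's original shoe-size data are not reproduced here, so the values below are hypothetical, chosen so that, as in the text, the males have the larger mean and the larger standard deviation:

```python
from statistics import mean, stdev

# Hypothetical (sex, shoe size) pairs standing in for the chapter's raw data.
shoes = [("M", 10.5), ("M", 11.0), ("M", 9.5), ("M", 12.0),
         ("F", 7.0), ("F", 8.0), ("F", 7.5), ("F", 8.5)]

for sex in ("M", "F"):
    sizes = [size for s, size in shoes if s == sex]   # partition by sex
    print(sex, len(sizes), round(mean(sizes), 2), round(stdev(sizes), 2))
```

Each printed line corresponds to one row of the breakdown table: the group label, N, the group mean, and the group standard deviation.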
The analysis described above may be done using SPSS using the MEANS command. The following figures illustrate the selection and output of the MEANS command in SPSS. The means command is accessed in SPSS by Analyze/Compare Means/Means as follows:
The form for a means analysis should appear as follows:
In this case "shoesize" is selected as the dependent variable and "sex" as the independent variable. Means and standard deviations will be found for each variable in the dependent list for each level of each variable in the independent list. The grouping variables will always appear in the independent list and the measured variables in the dependent list. An easy way to remember is that the dependent variables "depend upon" the independent variables. In this case shoe size depends upon sex.
The results of this analysis appear when the "OK" button is clicked and are presented in the following figure.
A similar kind of breakdown could be performed for shoe size broken down by shoe width, which would produce the following table:
|Shoe Width||N||Mean||Standard Deviation|
A breakdown is a very powerful tool in examining the relationships between variables. It can express a great deal of information in a relatively small space.
Statistics serve to estimate model parameters and describe the data. Two categories of statistics were described in this chapter: measures of central tendency and measures of variability. In the former category were the mean, median, and mode. In the latter were the range, standard deviation, and variance. Measures of central tendency describe a typical or representative score, while measures of variability describe the spread or dispersion of scores. Both definitional computational procedures and procedures for obtaining the statistics from a calculator and from SPSS were presented.