Introductory Statistics: Concepts, Models, and Applications
David W. Stockburger
A statistic is an algebraic expression combining scores into a single number. Statistics serve two functions: they estimate parameters in population models and they describe the data. The statistics discussed in this chapter will be used both as estimates of μ and δ and as measures of central tendency and variability. There are a large number of possible statistics, but some are more useful than others.
Central tendency is a typical or representative score. If the mayor is asked to provide a single value which best describes the income level of the city, he or she would answer with a measure of central tendency. The three measures of central tendency that will be discussed this semester are the mode, median, and mean.
The mode, symbolized by M_{o}, is the most frequently occurring score value. If the scores for a given sample distribution are:
32 
32 
35 
36 
37 
38 
38 
39 
39 
39 
40 
40 
42 
45 
then the mode would be 39 because a score of 39 occurs 3 times, more than any other score. The mode may be seen on a frequency distribution as the score value which corresponds to the highest point. For example, the following is a frequency polygon of the data presented above:
A distribution may have more than one mode if the two most frequently occurring scores occur the same number of times. For example, if the earlier score distribution were modified as follows:
32 
32 
32 
36 
37 
38 
38 
39 
39 
39 
40 
40 
42 
45 
then there would be two modes, 32 and 39. Such distributions are called bimodal. The frequency polygon of a bimodal distribution is presented below.
In an extreme case there may be no unique mode, as in the case of a rectangular distribution.
The mode is not sensitive to extreme scores. Suppose the original distribution was modified by changing the last number, 45, to 55 as follows:
32 
32 
35 
36 
37 
38 
38 
39 
39 
39 
40 
40 
42 
55 
The mode would still be 39.
In any case, the mode is a quick and dirty measure of central tendency. Quick, because it is easily and quickly computed. Dirty because it is not very useful; that is, it does not give much information about the distribution.
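The three mode examples above can be checked with a short Python sketch. It assumes Python 3.8 or later, where the standard library provides statistics.multimode; the variable names are illustrative and do not come from the text.

```python
from statistics import multimode

original = [32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]
bimodal = [32, 32, 32, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]
extreme = [32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 55]

# multimode returns every value tied for the highest frequency,
# in the order each value first appears in the data.
print(multimode(original))  # [39]
print(multimode(bimodal))   # [32, 39]
print(multimode(extreme))   # [39]
```

Changing the largest score from 45 to 55 leaves the result unchanged, which is the sense in which the mode is not sensitive to extreme scores.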
The median, symbolized by M_{d}, is the score value which cuts the distribution in half, such that half the scores fall above the median and half fall below it. Computation of the median is relatively straightforward. The first step is to rank order the scores from lowest to highest. The procedure branches at the next step: one way if there is an odd number of scores in the sample distribution, another if there is an even number of scores.
If there is an odd number of scores as in the distribution below:
32 
32 
35 
36 
36 
37 
38 

38 

39 
39 
39 
40 
40 
45 
46 
then the median is simply the middle number. In the case above the median would be the number 38, because there are 15 scores all together with 7 scores smaller and 7 larger.
If there is an even number of scores, as in the distribution below:
32 
35 
36 
36 
37 
38 

38 
39 

39 
39 
40 
40 
42 
45 
then the median is the midpoint between the two middle scores: in this case the value 38.5. It was found by adding the two middle scores together and dividing by two (38 + 39)/2 = 38.5. If the two middle scores are the same value then the median is that value.
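The two-branch procedure described above might be sketched in Python as follows. The function name median and the variable names are illustrative choices, and the sketch makes no correction for grouped data.

```python
def median(scores):
    """Middle score for odd N, midpoint of the two middle scores for even N."""
    ordered = sorted(scores)   # rank order the scores, lowest to highest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of scores: the single middle score.
        return ordered[middle]
    # Even number of scores: average the two middle scores.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_scores = [32, 32, 35, 36, 36, 37, 38, 38, 39, 39, 39, 40, 40, 45, 46]
even_scores = [32, 35, 36, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]

print(median(odd_scores))   # 38
print(median(even_scores))  # 38.5
```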
In the above system, no account is taken of whether there is a duplication of scores around the median. In some systems a slight correction is performed to correct for grouped data, but since the correction is slight and the data are generally not grouped for computation in calculators or computers, it is not presented here.
The median, like the mode, is not affected by extreme scores, as the following distribution of scores indicates:
32 
35 
36 
36 
37 
38 

38 
39 

39 
39 
40 
40 
42 
55 
The median is still the value of 38.5. The median is not as quick and dirty as the mode, but generally it is not the preferred measure of central tendency.
The mean, symbolized by \bar{X}, is the sum of the scores divided by the number of scores. The following formula both defines and describes the procedure for finding the mean:
\bar{X} = \frac{\Sigma X}{N}
where \Sigma X is the sum of the scores and N is the number of scores. Application of this formula to the following data
32 
35 
36 
37 
38 
38 
39 
39 
39 
40 
40 
42 
45 
yields the following results:
\bar{X} = 500 / 13 = 38.46
Use of means as a way of describing a set of scores is fairly common; batting average, bowling average, grade point average, and average points scored per game are all means. Note the use of the word "average" in all of the above terms. In most cases when the term "average" is used, it refers to the mean, although not necessarily. When a politician uses the term "average income", for example, he or she may be referring to the mean, median, or mode.
The mean is sensitive to extreme scores. For example, the mean of the following data, in which the largest score has been changed from 45 to 55, is 39.2, somewhat larger than in the preceding example.
32 
35 
36 
37 
38 
38 
39 
39 
39 
40 
40 
42 
55 
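A short Python sketch can make the comparison concrete by computing the mean and the median of the two data sets side by side. It uses the standard library statistics module; the variable names are illustrative.

```python
from statistics import mean, median

scores = [32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]
with_extreme = scores[:-1] + [55]   # replace the largest score, 45, with 55

# The mean moves when the extreme score changes; the median does not.
print(round(mean(scores), 2), median(scores))              # 38.46 39
print(round(mean(with_extreme), 2), median(with_extreme))  # 39.23 39
```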
In most cases the mean is the preferred measure of central tendency, both as a description of the data and as an estimate of the parameter. In order for the mean to be meaningful, however, the acceptance of the interval property of measurement is necessary. When this property is obviously violated, it is inappropriate and misleading to compute a mean. Such is the case, for example, when the data are clearly nominal categorical. An example would be political party preference where 1 = Republican, 2 = Democrat, and 3 = Independent. The special case of dichotomous nominal categorical variables allows meaningful interpretation of means. For example, if only two levels of political party preference were allowed, 1 = Republican and 2 = Democrat, then the mean of this variable could be interpreted. In such cases it is preferred to code one level of the variable with a 0 and the other level with a 1 such that the mean is the proportion of the second level in the sample. For example, if gender were coded with 0 = Males and 1 = Females, then the mean of this variable would be the proportion of females in the sample.
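The 0/1 coding can be illustrated with a small Python sketch. The eight codes below are a hypothetical sample invented for the illustration, not data from the text.

```python
# Hypothetical sample coded as in the text: 0 = male, 1 = female.
gender = [0, 1, 1, 0, 1, 1, 0, 1]

mean_gender = sum(gender) / len(gender)
proportion_female = gender.count(1) / len(gender)

# With 0/1 coding the sum of the codes is the number of 1s, so the
# mean and the proportion of the second level are the same number.
print(mean_gender, proportion_female)  # 0.625 0.625
```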
As is commonly known, KIWI-birds are native to New Zealand. They are born exactly one foot tall and grow in one-foot intervals. That is, one moment they are one foot tall and the next they are two feet tall. They are also very rare. An investigator goes to New Zealand and finds four birds. The mean height of the four birds is 4 feet, the median is 3, and the mode is 2. What are the heights of the four birds?
Hint: examine the constraints of the mode first, the median second, and the mean last.
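After working the problem by hand, an answer can be checked with a brute-force search. The Python sketch below is one possible check, assuming whole-foot heights and a single mode of 2; it stores every set of four heights that satisfies all three constraints in a list named solutions (an illustrative name) and prints only how many there are.

```python
from itertools import combinations_with_replacement
from statistics import median, multimode

solutions = []
# A mean of 4 over four birds forces the heights to sum to 16, so no
# bird can be taller than 13 feet (the other three are at least 1 foot).
for heights in combinations_with_replacement(range(1, 14), 4):
    if (sum(heights) == 16
            and median(heights) == 3
            and multimode(heights) == [2]):
        solutions.append(heights)

print(len(solutions))  # 1
```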
Skewness refers to the asymmetry of the distribution, such that a symmetrical distribution exhibits no skewness. In a symmetrical distribution the mean, median, and mode all fall at the same point, as in the following distribution.
An exception to this is the case of a bimodal symmetrical distribution. In this case the mean and the median fall at the same point, while the two modes correspond to the two highest points of the distribution. An example follows:
A positively skewed distribution is asymmetrical and points in the positive direction. If a test was very difficult and almost everyone in the class did very poorly on it, the resulting distribution would most likely be positively skewed.
In the case of a positively skewed distribution, the mode is smaller than the median, which is smaller than the mean. This relationship exists because the mode is the point on the x-axis corresponding to the highest point of the distribution, that is, the score with the greatest frequency. The median is the point on the x-axis that cuts the distribution in half, such that 50% of the area falls on each side.
The mean is the balance point of the distribution. Because points further away from the balance point change the center of balance, the mean is pulled in the direction the distribution is skewed. For example, if the distribution is positively skewed, the mean would be pulled in the direction of the skewness, or be pulled toward larger numbers.
One way to remember the order of the mean, median, and mode in a skewed distribution is to remember that the mean is pulled in the direction of the extreme scores. In a positively skewed distribution, the extreme scores are larger, thus the mean is larger than the median.
A negatively skewed distribution is asymmetrical and points in the negative direction, such as would result with a very easy test. On an easy test, almost all students would perform well and only a few would do poorly.
The order of the measures of central tendency would be the opposite of the positively skewed distribution, with the mean being smaller than the median, which is smaller than the mode.
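The ordering of the three measures can be seen in a small Python sketch. Both score lists are hypothetical and chosen only to be clearly skewed; the second is the mirror image of the first.

```python
from statistics import mean, median, mode

# Hypothetical scores on a hard test: most are low, a few are high.
positive_skew = [1, 2, 2, 2, 3, 3, 4, 5, 9, 12]
# Mirror image (13 - score): most are high, a few are low.
negative_skew = [13 - x for x in positive_skew]

# Positive skew: mode < median < mean.
print(mode(positive_skew), median(positive_skew), mean(positive_skew))  # 2 3.0 4.3
# Negative skew: mean < median < mode.
print(mode(negative_skew), median(negative_skew), mean(negative_skew))  # 11 10.0 8.7
```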
Variability refers to the spread or dispersion of scores. A distribution of scores is said to be highly variable if the scores differ widely from one another. Three statistics will be discussed which measure variability: the range, the variance, and the standard deviation. The latter two are very closely related and will be discussed in the same section.
The range is the largest score minus the smallest score. It is a quick and dirty measure of variability, although when a test is given back to students they very often wish to know the range of scores. Because the range is greatly affected by extreme scores, it may give a distorted picture of the scores. The following two distributions have the same range, 13, yet appear to differ greatly in the amount of variability.
Distribution 1 
32 
35 
36 
36 
37 
38 
40 
42 
42 
43 
43 
45 
Distribution 2 
32 
32 
33 
33 
33 
34 
34 
34 
34 
34 
35 
45 
For this reason, among others, the range is not the most important measure of variability.
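A brief Python sketch of the range for the two distributions above follows. The helper name score_range is an illustrative choice.

```python
distribution_1 = [32, 35, 36, 36, 37, 38, 40, 42, 42, 43, 43, 45]
distribution_2 = [32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 45]

def score_range(scores):
    """Largest score minus smallest score."""
    return max(scores) - min(scores)

print(score_range(distribution_1))       # 13
print(score_range(distribution_2))       # 13
# Dropping the single extreme score (45) from Distribution 2 shows how
# much of its range that one score accounts for.
print(score_range(distribution_2[:-1]))  # 3
```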
The variance, symbolized by "s^{2}", is a measure of variability. The standard deviation, symbolized by "s", is the positive square root of the variance. It is easier to define the variance with an algebraic expression than with words, thus the following formula:
s^{2} = \frac{\Sigma (X - \bar{X})^{2}}{N - 1}
Note that the variance could almost be the average squared deviation around the mean if the expression were divided by N rather than N - 1. It is divided by N - 1, called the degrees of freedom (df), for theoretical reasons. If the mean is known, as it must be to compute the numerator of the expression, then only N - 1 scores are free to vary. That is, if the mean and N - 1 scores are known, then it is possible to figure out the Nth score. One needs only recall the KIWI-bird problem to convince oneself that this is in fact true.
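The claim that the last score is determined once the mean and the other scores are known can be illustrated with a few lines of Python. The five scores are used here only as an illustration, and the variable names are not from the text.

```python
scores = [8, 8, 9, 12, 13]   # illustrative sample
n = len(scores)
mean = sum(scores) / n

known = scores[:-1]                 # any N - 1 of the scores
# The mean times N is the total, so the remaining score is the total
# minus the sum of the known scores.
recovered = mean * n - sum(known)
print(recovered)  # 13.0
```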
The formula for the variance presented above is a definitional formula; that is, it defines what the variance means. The variance may be computed from this formula, but in practice this is rarely done. It is done here to better describe what the formula means. The computation is performed in a number of steps, which are presented below:
Step One: Find the mean of the scores.
Step Two: Subtract the mean from every score.
Step Three: Square the results of step two.
Step Four: Sum the results of step three.
Step Five: Divide the results of step four by N - 1.
Step Six: Take the square root of step five.
The result at step five is the sample variance, at step six, the sample standard deviation.
X        X - \bar{X}    (X - \bar{X})^{2}
8        -2             4
8        -2             4
9        -1             1
12       2              4
13       3              9
Σ = 50   Σ = 0          Σ = 22
Step One: Find the mean of the scores. \bar{X} = 50 / 5 = 10
Step Two: Subtract the mean from every score. (The second column above.)
Step Three: Square the results of step two. (The third column above.)
Step Four: Sum the results of step three. 22
Step Five: Divide the results of step four by N - 1. s^{2} = 22 / 4 = 5.5
Step Six: Take the square root of step five. s = 2.345
Note that the sum of the second column, the deviations from the mean, is zero. This must be the case if the calculations are performed correctly up to that point.
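The six steps can also be traced in a short Python sketch that follows the definitional formula directly. The variable names are illustrative.

```python
import math

scores = [8, 8, 9, 12, 13]
n = len(scores)

mean = sum(scores) / n                     # step one
deviations = [x - mean for x in scores]    # step two
squared = [d ** 2 for d in deviations]     # step three
sum_of_squares = sum(squared)              # step four
variance = sum_of_squares / (n - 1)        # step five
std_dev = math.sqrt(variance)              # step six

# The deviations sum to zero, which serves as a check on steps one and two.
print(mean, sum(deviations), sum_of_squares)  # 10.0 0.0 22.0
print(variance, round(std_dev, 3))            # 5.5 2.345
```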
The standard deviation measures variability in units of measurement, while the variance does so in units of measurement squared. For example, if one measured height in inches, then the standard deviation would be in inches, while the variance would be in inches squared. For this reason, the standard deviation is usually the preferred measure when describing the variability of distributions. The variance, however, has some unique properties that make it very useful later on in the course.
Calculations may be checked by using the statistical functions of a statistical calculator. This is the way the variance and standard deviation are usually computed in practice. The calculator has the definitional formula for the variance automatically programmed internally, and all that is necessary for its use are the following steps:
Step One: Select the statistics mode.
Step Two: Clear the statistical registers.
Step Three: Enter the data.
Step Four: Make sure the correct number of scores have been entered.
Step Five: Hit the key that displays the mean.
Step Six: Hit the key that displays the standard deviation.
Note that when using the calculator the standard deviation is found before the variance, while the opposite is the case using the definitional formula. The results using the calculator and the definitional formula should agree, within rounding error.
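The same kind of check can be made in software. As one possibility, not a procedure described in the text, the Python standard library statistics module provides stdev and variance functions that divide by N - 1, so their results should agree with the hand computation.

```python
from statistics import mean, stdev, variance

scores = [8, 8, 9, 12, 13]

print(len(scores))              # 5, the number of scores entered
print(mean(scores))             # 10
print(round(stdev(scores), 3))  # 2.345
print(variance(scores))         # 5.5
```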
More often than not, statistics are computed using a computer package such as SPSS. It may seem initially like a lot more time and trouble to use the computer to do such simple calculations, but the student will most likely appreciate the savings in time and effort later.
The first step is to enter the data into a form the computer can recognize. A data file with the example numbers is illustrated below:
Any number of statistical commands in SPSS would result in the computation of simple measures of central tendency and variability. The example below illustrates the use of the FREQUENCIES command.
The results of the above procedure are presented below:
An analysis, called a breakdown, gives the means and standard deviations of a variable for each level of another variable. The means and standard deviations may then be compared to see if they are different. Going back to the example of shoe sizes, the raw data appeared as follows:
Shoe Size   Shoe Width   Sex
10.5        B            M
6.0         B            F
9.5         D            M
8.5         A            F
7.0         B            F
10.5        C            M
7.0         C            F
8.5         D            M
6.5         B            F
9.5         C            M
7.0         B            F
7.5         B            F
9.0         D            M
6.5         A            F
7.5         B            F
The corresponding data file in SPSS would appear as follows:
It is possible to compare the shoe sizes of males and females by first finding the mean and standard deviation of males only and then for females only. In order to do this the original shoe sizes would be partitioned into two sets, one of males and one of females, as follows:
Males   Females
10.5    6.0
9.5     8.5
10.5    7.0
8.5     7.0
9.5     6.5
9.0     7.0
        7.5
        6.5
        7.5
The means and standard deviations of the males and females are organized into a table as follows:
Sex       N    Mean   Standard Deviation
Males     6    9.58   0.80
Females   9    7.06   0.73
Total     15   8.07   1.47
It can be seen that the males had larger shoe sizes, as evidenced by the larger mean. Males also had somewhat greater variability, as evidenced by their larger standard deviation. In addition, the variability WITHIN GROUPS (males and females separately) was considerably less than the TOTAL variability (both sexes combined).
The analysis described above may be done in SPSS with the MEANS command. The following illustrates the selection and output of the MEANS command in SPSS.
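A breakdown of this kind might be sketched in plain Python as follows. This is an illustration rather than the SPSS procedure; the function name breakdown and its key_index parameter are hypothetical. The fourth shoe size is entered as 8.5, the value that appears in the partitioned list of female scores.

```python
from collections import defaultdict
from statistics import mean, stdev

# (shoe size, shoe width, sex) for the 15 cases. The fourth size is
# entered as 8.5, the value used in the partitioned female list; this
# is an assumption about the intended raw score.
cases = [
    (10.5, "B", "M"), (6.0, "B", "F"), (9.5, "D", "M"),
    (8.5, "A", "F"), (7.0, "B", "F"), (10.5, "C", "M"),
    (7.0, "C", "F"), (8.5, "D", "M"), (6.5, "B", "F"),
    (9.5, "C", "M"), (7.0, "B", "F"), (7.5, "B", "F"),
    (9.0, "D", "M"), (6.5, "A", "F"), (7.5, "B", "F"),
]

def breakdown(cases, key_index):
    """Return {level: (N, mean, standard deviation)} of shoe size per group."""
    groups = defaultdict(list)
    for case in cases:
        # Collect the shoe sizes under each level of the grouping variable.
        groups[case[key_index]].append(case[0])
    return {
        level: (len(sizes), round(mean(sizes), 2), round(stdev(sizes), 2))
        for level, sizes in groups.items()
    }

by_sex = breakdown(cases, key_index=2)
print(by_sex["M"])  # (6, 9.58, 0.8)
print(by_sex["F"])  # (9, 7.06, 0.73)
```

Calling breakdown(cases, key_index=1) groups the same shoe sizes by shoe width instead of by sex.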
A similar kind of breakdown could be performed for shoe size broken down by shoe width, which would produce the following table:
Shoe Width   N    Mean   Standard Deviation
A            2    7.5    1.41
B            7    7.43   1.46
C            3    9.0    1.80
D            3    9.0    0.50
Total        15   8.07   1.47
A breakdown is a very powerful tool in examining the relationships between variables. It can express a great deal of information in a relatively small space. Tables of means such as the one presented above are central to understanding Analysis of Variance (ANOVA).
Statistics serve to estimate model parameters and describe the data. Two categories of statistics were described in this chapter: measures of central tendency and measures of variability. In the former category were the mean, median, and mode. In the latter were the range, standard deviation, and variance. Measures of central tendency describe a typical or representative score, while measures of variability describe the spread or dispersion of scores. Both definitional examples of computational procedures and procedures for obtaining the statistics from a calculator were presented.