Introductory Statistics: Concepts, Models, and Applications
David W. Stockburger
A model of a frequency distribution is an algebraic expression describing the relative frequency (height of the curve) for every possible score. The questions that sometimes come to the mind of the student are "What is the advantage of this level of abstraction? Why is all this necessary?" The answers to these questions may be found in the following.
The belief that a few algebraic expressions can adequately model a large number of very different real world phenomena underlies much statistical thought. The fact that these models often work justifies this philosophical approach.
In almost all cases in the social sciences, it is not feasible to collect data on the entire population in which the researcher is interested. For instance, the individual opening a shoe store might want to know the shoe sizes of all persons living within a certain distance from the store, but not be able to collect all the data. If data from a subset or sample of the population of interest is used rather than the entire population, then repeating the data collection procedure would most likely result in a different set of numbers. A model of the distribution is used to give some consistency to the results.
For example, suppose that the distribution of shoe sizes collected from a sample of fifteen individuals resulted in the following relative frequency polygon.
Because there are no individuals in the sample who wear size eight shoes, does that mean that the store owner should not stock the shelves with any size eight shoes? If a different sample were taken, would an individual who wore a size eight likely be included? Because the answer to the first question is no and the answer to the second is yes, some method of ordering shoes other than directly from the sample distribution must be used.
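The point about repeated samples can be checked with a small simulation. The following is a minimal sketch, not the author's procedure: the population proportions in WEIGHTS are hypothetical values chosen only so that size eight makes up 12% of the population, and the function names are illustrative.

```python
import random

# Hypothetical population of women's shoe sizes (assumed for illustration only).
SIZES = [7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0]
WEIGHTS = [4, 8, 12, 16, 20, 16, 12, 8, 4]  # percentages, sum to 100; size 8.0 is 12%

def draw_sample(n, rng):
    """Draw n shoe sizes at random from the hypothetical population."""
    return rng.choices(SIZES, weights=WEIGHTS, k=n)

def fraction_of_samples_missing(size, n, trials, rng):
    """Estimate how often a sample of n people contains nobody wearing `size`."""
    missing = sum(1 for _ in range(trials) if size not in draw_sample(n, rng))
    return missing / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
sample_a = draw_sample(15, rng)
sample_b = draw_sample(15, rng)
missing_rate = fraction_of_samples_missing(8.0, 15, 2000, rng)

print("first sample: ", sorted(sample_a))
print("second sample:", sorted(sample_b))
print("share of samples with no size eight:", missing_rate)
```

Under these assumed proportions, two samples of fifteen will generally differ, and a noticeable share of samples contain no size eight at all even though the size is present in the population.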
The alternative is a model of the frequency distribution, sometimes called a probability model or probability distribution. For example, suppose that the frequency polygon of shoe size for women actually looked like the following:
If this were the case, the proportion (.12) or percentage (12%) of size eight shoes could be computed by finding the relative area under the curve between the real limits for a size eight shoe (7.75 to 8.25). The relative area is called probability.
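The area between two real limits can be computed directly once a model is chosen. The sketch below assumes a normal curve as the model; the mean of 8.5 and standard deviation of 1.5 are hypothetical values, not parameters taken from the text's figure, so the result will only be in the neighborhood of the .12 quoted above.

```python
import math

def normal_cdf(x, mean, sd):
    """Area under a normal curve to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def interval_probability(lower, upper, mean, sd):
    """Relative area under the curve between two real limits."""
    return normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)

# Hypothetical parameters for the shoe-size model (assumed, not from the text).
MEAN, SD = 8.5, 1.5

p_size_8 = interval_probability(7.75, 8.25, MEAN, SD)
print(f"P(7.75 <= size <= 8.25) = {p_size_8:.3f}")
```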
The probability model attempts to capture the essential structure of the real world by asking what the world might look like if an infinite number of scores were obtained and each score was measured infinitely precisely. Nothing in the real world is exactly distributed as a probability model. However, a probability model often describes the world well enough to be useful in making decisions.
The statistician has at his or her disposal a number of probability models to describe the world. Different models are selected for practical or theoretical reasons. Some examples of probability models follow.
The uniform distribution is shaped like a rectangle, where each score is equally likely. An example is presented below.
If the uniform distribution were used to model shoe size, it would mean that between the two extremes, each shoe size would be equally likely. If the store owner were ordering shoes, it would mean that an equal number of each shoe size would be ordered. In most cases this would be a very poor model of the real world, because at the end of the year a large number of very large and very small shoes would remain on the shelves and the middle sizes would be sold out.
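To see how the choice of model changes an order, the sketch below splits a single order across sizes under a uniform model and under a bell-shaped (normal) model. Everything numeric here is an assumption made for illustration: the range of sizes stocked, the mean of 8.5, the standard deviation of 1.0, and the order of 180 pairs.

```python
import math

# Hypothetical range of sizes stocked, in half sizes from 6.5 to 10.5.
SIZES = [6.5 + 0.5 * i for i in range(9)]

def normal_cdf(x, mean, sd):
    """Area under a normal curve to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def uniform_shares(sizes):
    """Uniform model: every size gets the same share of the order."""
    return {s: 1.0 / len(sizes) for s in sizes}

def normal_shares(sizes, mean, sd):
    """Normal model: each size's share is the area between its real limits."""
    raw = {s: normal_cdf(s + 0.25, mean, sd) - normal_cdf(s - 0.25, mean, sd)
           for s in sizes}
    total = sum(raw.values())  # rescale so the stocked sizes sum to one
    return {s: p / total for s, p in raw.items()}

ORDER = 180  # hypothetical number of pairs to order
u_shares = uniform_shares(SIZES)
n_shares = normal_shares(SIZES, mean=8.5, sd=1.0)  # assumed parameters

for s in SIZES:
    print(f"size {s:4.1f}: uniform {ORDER * u_shares[s]:5.1f}  "
          f"normal {ORDER * n_shares[s]:5.1f}")
```

Under these assumptions the uniform model orders the same number of every size, while the normal model shifts stock toward the middle sizes and away from the extremes.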
The negative exponential distribution is often used to model real world events which are relatively rare, such as the occurrence of earthquakes. The negative exponential distribution is presented below:
Although it is not really a standard distribution, a triangular distribution could be created as follows:
It may be useful for describing some real world phenomena, but it is not clear which ones.
The normal curve is one of a large number of possible distributions. It is very important in the social sciences and will be described in detail in the next chapter. An example of a normal curve was presented earlier as a model of shoe size.
Almost all of the useful models contain parameters. Recall from the chapter on models that parameters are variables within the model that must be set before the model is completely specified. Changing the parameters of a probability model changes the shape of the curve. The use of parameters allows a single general purpose model to describe a wide variety of real world phenomena.
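The role of parameters can be made concrete by writing the models as functions. The text does not give the algebraic expressions, so the sketch below uses the usual textbook forms of the uniform, negative exponential, and normal curves; the parameter names (low, high, rate, mean, sd) and the values tried are illustrative choices.

```python
import math

def uniform_pdf(x, low, high):
    """Height of the uniform curve; parameters are the two extremes."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def exponential_pdf(x, rate):
    """Height of the negative exponential curve; the parameter is the rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def normal_pdf(x, mean, sd):
    """Height of the normal curve; parameters are the mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Same general-purpose model, three different settings of one parameter.
for sd in (0.5, 1.0, 2.0):
    heights = [round(normal_pdf(x, 8.5, sd), 3) for x in (7.5, 8.5, 9.5)]
    print(f"sd = {sd}: heights at 7.5, 8.5, 9.5 -> {heights}")
```

A smaller standard deviation gives a taller, narrower curve and a larger one gives a lower, wider curve, which is the sense in which changing a parameter changes the shape.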
In order for an algebraic expression to qualify as a legitimate model of a distribution, the total area under the curve must be equal to one. This property is necessary for the same reason that the sum of the relative frequencies in a sample frequency table is equal to one. This property must hold true no matter what values are selected for the parameters of the model.
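The area requirement can be checked numerically. The following sketch approximates the area under the normal and negative exponential curves for several parameter settings using a simple midpoint rule; the parameter values, the integration limits, and the helper name area_under are assumptions made for the example.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def exponential_pdf(x, rate):
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def area_under(pdf, lower, upper, steps=100_000):
    """Approximate the area under pdf between lower and upper (midpoint rule)."""
    width = (upper - lower) / steps
    return sum(pdf(lower + (i + 0.5) * width) for i in range(steps)) * width

# Limits are set far enough out that the area left outside them is negligible.
normal_areas = []
for mean, sd in [(0.0, 1.0), (8.5, 1.5), (100.0, 15.0)]:
    area = area_under(lambda x: normal_pdf(x, mean, sd),
                      mean - 10 * sd, mean + 10 * sd)
    normal_areas.append(area)
    print(f"normal(mean={mean}, sd={sd}): area = {area:.6f}")

exponential_areas = []
for rate in (0.5, 1.0, 4.0):
    area = area_under(lambda x: exponential_pdf(x, rate), 0.0, 50.0 / rate)
    exponential_areas.append(area)
    print(f"exponential(rate={rate}): area = {area:.6f}")
```

Whatever parameter values are tried, the computed area stays at one to within the accuracy of the approximation.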
Although probability is a common term in the natural language, meaning likelihood or chance of occurrence, statisticians define it much more precisely. The probability of an event is the theoretical relative frequency of the event in a model of the population.
The models that have been discussed up to this point assume continuous measurement. That is, every score on the continuum of scores is possible, or there are an infinite number of scores. In this case, no single score can have a nonzero relative frequency, because if each of the infinitely many possible scores did, the total area would necessarily be greater than one. For that reason probability is defined over a range of scores rather than a single score. Thus a shoe size of exactly 8.00 would not have a specific probability associated with it, although the interval of shoe sizes between 7.75 and 8.25 would.
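One way to see why a single score carries no probability is to shrink the interval around it and watch the area shrink with it. The sketch below assumes a normal model of shoe size with a hypothetical mean of 8.5 and standard deviation of 1.5; neither value comes from the text.

```python
import math

def normal_cdf(x, mean, sd):
    """Area under a normal curve to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

MEAN, SD = 8.5, 1.5  # hypothetical parameters (assumed, not from the text)

# Intervals centered on 8.00 that get narrower and narrower.
half_widths = [0.25, 0.025, 0.0025, 0.0]
probabilities = [
    normal_cdf(8.0 + h, MEAN, SD) - normal_cdf(8.0 - h, MEAN, SD)
    for h in half_widths
]

for h, p in zip(half_widths, probabilities):
    print(f"interval {8.0 - h:.4f} to {8.0 + h:.4f}: probability = {p:.6f}")
```

The first interval is the one between the real limits 7.75 and 8.25. As the interval narrows toward the single point 8.00, the area falls toward zero, and at zero width it is exactly zero.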