
| Table of Contents | Objectives |
|---|---|
| Range | A poor measure of scatter. |
| A Solution Beginning | An attempt to better measure scatter. |
| The Solution -- Variance | Variance solves a mathematical problem. |
| Standard Deviation | The square root of variance. |
| Calculating Variance and Standard Deviations | Learn to calculate these two statistics. |
| Coefficient of Variation | Fixing the Standard Deviation. |
| Normal Curves | Learn about the standard deviation's meaning in normal distributions. |
| Skew Revisited | Formulas for Skew. |
| Additional Information | Discover interesting Web Links |
| Computer Project 10 | Using Statlets to calculate measures of variation. |
| Computer Project 11 | Another measure of variation calculation. |
| Percentile Ranks | Position measures that lead to the Interquartile Range. |
| Computer Project 12 | Interpreting Q-plots and P-plots. |
| Questions/Test | Take the End of Chapter Test |
| Report | Send a Chapter Report to your Instructor |

We can now calculate ranges as our measures of variability. However, as noted above, ranges are poor measures of scatter because their values depend on only two scores from a distribution: the highest and the lowest. A better measure of variability would take into account all of the individual scores in a distribution. The figure on the left dramatically illustrates this problem. Which distribution has the most scatter? Do the ranges for these two distributions differ?
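
The weakness of the range can be sketched in a few lines of Python. The two data sets below are hypothetical (they are not the ones in the figure); they were chosen so that both run from 1 to 9 while the scores in between are arranged very differently.

```python
def value_range(scores):
    """Range: the largest score minus the smallest score."""
    return max(scores) - min(scores)

# Hypothetical data, invented for illustration only.
clustered = [1, 5, 5, 5, 5, 5, 9]    # most scores sit at the center
spread_out = [1, 2, 3, 5, 7, 8, 9]   # scores are spread across the interval

print(value_range(clustered))   # 8
print(value_range(spread_out))  # 8
```

Both sets report the same range even though the second is clearly more scattered, because only the two extreme scores enter the calculation.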
The first attempt to get around this zero-sum property (deviations from the mean always sum to zero) was to use absolute values. However, as the figure on the right illustrates, this doesn't work. Two distributions with the same scatter produce different sums of absolute deviations when one of them simply contains more values.
Simply summing the absolute deviations is not enough. The scatter in the two distributions shown in the figure above on the right is the same; Set 2 just has more elements. Mathematicians therefore decided to take the average of these absolute deviations. This descriptive statistic (the formula is shown on the left) is called the mean deviation (MD). If we calculate MD, we properly get the same value for both distributions.
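
A minimal sketch of the mean deviation follows. The two sets are hypothetical stand-ins for Set 1 and Set 2 in the figure: the second simply repeats every score of the first, so the scatter is the same but the number of elements doubles.

```python
def sum_abs_deviations(scores):
    """Sum of the absolute deviations of each score from the mean."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores)

def mean_deviation(scores):
    """Mean deviation (MD): the average absolute deviation from the mean."""
    return sum_abs_deviations(scores) / len(scores)

# Hypothetical data, invented for illustration only.
set_1 = [2, 4, 6]
set_2 = [2, 2, 4, 4, 6, 6]   # same scatter, twice as many elements

print(sum_abs_deviations(set_1), sum_abs_deviations(set_2))  # the sums differ
print(mean_deviation(set_1), mean_deviation(set_2))          # the MDs agree
```

The raw sums are 4 and 8, but dividing by the number of scores gives the same MD (4/3) for both sets.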

There are two different formulas for calculating variance. One is used for populations and the other is used to calculate a sample statistic. Remember that sample statistics are used to estimate unknown population parameters: you calculate a sample variance in order to estimate an unknown population variance. Plugging identical values into the two formulas gives different answers because the formulas are slightly different. The formula for population variance, shown on the left, is straightforward. It is used when the distribution contains the entire population of interest.
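
In standard notation, the population formula is usually written as below. This is a reconstruction offered for reference, on the assumption that the figure uses the conventional symbols: σ² for the population variance, μ for the population mean, and N for the population size.

$$
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}
$$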

The same thing would happen to sample variance estimates if we did not correct them. On average, a sample will have a bit less scatter than the population. Statisticians found that subtracting one from the denominator of the variance equation slightly increases the result, so that the average of all the sample variances taken from a population accurately estimates the population variance. By subtracting one case from the denominator, the statistic becomes an unbiased estimator of the population parameter. To keep the population and sample formulas distinct, the sample formula (shown on the left) substitutes statistics for parameters and uses a lowercase "n" for the size of the sample.
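
The effect of the correction can be checked exhaustively on a small case. The sketch below assumes a hypothetical four-score population and lists every possible sample of size 2 drawn with replacement; it then averages the sample variances computed both ways. It uses `pvariance` (divide by the number of scores) and `variance` (divide by n - 1) from Python's standard `statistics` module.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical population, invented for illustration only.
population = [2, 4, 6, 8]
sigma_sq = pvariance(population)   # population variance, divides by N

# Every possible sample of size 2 drawn with replacement (16 samples).
samples = list(product(population, repeat=2))

# Average sample variance WITHOUT the correction (divide by n).
avg_uncorrected = mean(pvariance(s) for s in samples)

# Average sample variance WITH the correction (divide by n - 1).
avg_corrected = mean(variance(s) for s in samples)

print(sigma_sq, avg_uncorrected, avg_corrected)
```

For this population the variance is 5. The uncorrected sample variances average 2.5, which is too small, while the corrected ones average exactly 5.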
The figure on the left shows two populations that have different variability. The first set has small scatter, while the second set has much larger scatter. Note that both means are equal to 9. When you calculate the variance for the first distribution you get a relatively small number (0.8571), while the variance in the second population is relatively large (53.1429).
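
The figure's actual scores are not reproduced in the text, so the sketch below uses two hypothetical seven-score populations that were constructed to have a mean of 9 and to give the same variances quoted above. They show the arithmetic only and should not be read as the figure's data.

```python
from statistics import mean, pvariance

# Hypothetical populations, constructed to match the quoted results.
small_scatter = [8, 8, 8, 9, 10, 10, 10]
large_scatter = [-2, 2, 5, 9, 13, 16, 20]

for scores in (small_scatter, large_scatter):
    # pvariance divides the sum of squared deviations by N.
    print(mean(scores), round(pvariance(scores), 4))
```

The first set has a sum of squared deviations of 6, so its variance is 6/7, or about 0.8571. The second has a sum of 372, giving 372/7, or about 53.1429.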

The equations for variance shown above are what statisticians refer to as "Think about it" formulas. They make sense when you simply look at them: people understand what the equations are doing, and it is clear that they measure scatter. However, they are horrid for use in hand calculators. (As an aside, they are rarely used in computer programs either.) The figure on the right presents equations that are quite useful in hand calculators, but are quite poor when used in computer programs. I don't think that on the surface they make much sense. However, using your calculator, you can arrive at an answer without ever stopping to write down an intermediate result for any step. These formulas are known as the calculator formulas in many textbooks.
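
The calculator formula for a population is commonly written as (ΣX² - (ΣX)²/N) / N; the sketch below assumes that form, since the figure is not reproduced here. It compares that form with the "Think about it" formula on a small hypothetical data set.

```python
def variance_think_about_it(scores):
    """Population variance: average squared deviation from the mean."""
    n = len(scores)
    mu = sum(scores) / n
    return sum((x - mu) ** 2 for x in scores) / n

def variance_calculator(scores):
    """Population variance from running totals of X and X squared."""
    n = len(scores)
    sum_x = sum(scores)                     # running total of the scores
    sum_x_sq = sum(x * x for x in scores)   # running total of squared scores
    return (sum_x_sq - sum_x ** 2 / n) / n

# Hypothetical data, invented for illustration only.
scores = [3, 5, 7, 9, 11]

print(variance_think_about_it(scores))  # 8.0
print(variance_calculator(scores))      # 8.0
```

The calculator version needs only two running totals, which is why it suits a keypad. One commonly cited reason it is a poor choice in programs is that it subtracts two large, nearly equal numbers, which can lose precision in floating-point arithmetic.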

There is one more variance formula, shown on the left, which can be used if the data constitute a population and all the values are dichotomous. Dichotomous data consist of only two values (0, 1). This type of data is often found in the social sciences. Zero might indicate failure on an examination item while one indicates passing, or 0 might be male while 1 is female. If you have population data that are dichotomous, this equation can be used to calculate variance, where p is the proportion of passes, correct responses, or 1s, and q is the proportion of failures, incorrect responses, or 0s. Remember that q must equal 1 - p.
The figure on the left demonstrates the calculation of variance using dichotomous data. Dichotomous data is also called binary data.
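
As a quick check, the sketch below assumes the dichotomous formula is the usual one, variance = p × q, and compares it with the ordinary population variance on hypothetical pass/fail item scores.

```python
from statistics import pvariance

# Hypothetical item scores: 1 = pass, 0 = fail (invented for illustration).
item_scores = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

p = sum(item_scores) / len(item_scores)   # proportion of 1s
q = 1 - p                                 # proportion of 0s

variance_pq = p * q                        # dichotomous shortcut
variance_full = pvariance(item_scores)     # ordinary population formula

print(p, q, variance_pq, variance_full)
```

With seven passes out of ten, p is 0.7 and q is 0.3, and both routes give a variance of 0.21 (up to floating-point rounding).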
The standard deviation is simply the square root of variance. Standard deviations return the variability measure back to the original score units instead of squared score units. The equations on the left provide you with the "Think about it" and calculator equations for standard deviations.
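
The relationship can be confirmed with Python's standard `statistics` module, where `pstdev` and `stdev` are the population and sample standard deviations. The scores are hypothetical.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical scores, invented for illustration only.
scores = [3, 5, 7, 9, 11]

pop_var = pvariance(scores)    # in squared score units
pop_sd = pstdev(scores)        # back in the original score units

sample_var = variance(scores)  # divides by n - 1
sample_sd = stdev(scores)

print(pop_var, pop_sd)
print(sample_var, sample_sd)
print(math.isclose(pop_sd, math.sqrt(pop_var)))        # True
print(math.isclose(sample_sd, math.sqrt(sample_var)))  # True
```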

The standard deviation has one small problem if it is used to compare the scatter of one distribution to another. It is quite common for the size of the standard deviation to be proportional to the size of the mean. That is, given the same amount of scatter, we would expect standard deviations to be larger if the mean of the data was 20,000 instead of 20. Although this is not always true, it is frequently true. To remove the effect of the overall size of the variable values, the coefficient of variation is calculated. The coefficient of variation is simply found by taking the standard deviation and dividing by the mean. The advantage of using the coefficient of variation to express scatter is that coefficients of variation are comparable across data sets with dramatically different means.
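
A minimal sketch of the comparison follows. The two hypothetical data sets differ only in scale (the second is the first multiplied by 1,000), and the population standard deviation is used here as an assumption; the same idea applies with the sample standard deviation.

```python
from statistics import mean, pstdev

def coefficient_of_variation(scores):
    """Standard deviation divided by the mean."""
    return pstdev(scores) / mean(scores)

# Hypothetical data, invented for illustration only.
small_values = [18, 19, 20, 21, 22]                 # mean 20
large_values = [18000, 19000, 20000, 21000, 22000]  # mean 20,000

print(pstdev(small_values), pstdev(large_values))   # differ by a factor of 1,000
print(coefficient_of_variation(small_values))
print(coefficient_of_variation(large_values))
```

The standard deviations are very different in size, but the two coefficients of variation agree (about 0.07), which is what makes them comparable across data sets.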
If you have a variable that is normally distributed (and many variables are), then standard deviations are important because they allow the calculation of intervals within which known percentages of scores fall. Approximately 68% of the scores in a normal distribution lie within ± 1 standard deviation of the mean. Approximately 95% of the scores in a normal distribution lie within ± 2 standard deviations of the mean. The figure shown on the left illustrates this property of normal distributions.
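
These percentages can be checked with `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later). The sketch uses the standard normal curve, with mean 0 and standard deviation 1, so the cut points are expressed directly in standard deviation units.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of scores within 1 and within 2 standard deviations of the mean.
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(round(within_1_sd, 4))  # 0.6827
print(round(within_2_sd, 4))  # 0.9545
```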
In Chapter 4 we discussed when one would choose to report a mean, median, or mode as the measure of central tendency. We stated that if the distribution is unimodal and seriously skewed, the median should be reported. At that point, we said that a distribution could be considered seriously skewed if the value for standardized skewness is outside ±2. You have now come to a point where at least three formulas for measuring skew can be given. The first formula (not presented because of its simplicity) is taken from Richard P. Runyon and Audrey Haber's text Fundamentals of Behavioral Statistics (7th Ed.), published in 1991 by McGraw Hill. The authors state that when the mean is higher than the median, the distribution of scores is positively skewed. Conversely, when the sample mean minus the median is a negative value, the scores are negatively skewed. However, these indices of the direction of skew tell us little about the amount of skew. Karl Pearson, whom many consider the founder of modern statistics, proposed the coefficient of skew (sk), shown in the figure on the left, where sk is the coefficient of skew, Mdn is the sample median, the sample mean is indicated by x-bar, and s is the standard deviation of the distribution.
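
Pearson's coefficient is commonly written as sk = 3(x-bar - Mdn) / s. The sketch below assumes that form, since the figure is not reproduced here, and applies it to hypothetical scores with one large value pulling the mean above the median.

```python
from statistics import mean, median, stdev

def pearson_skew(scores):
    """Coefficient of skew, assuming the form 3 * (mean - median) / s."""
    return 3 * (mean(scores) - median(scores)) / stdev(scores)

# Hypothetical scores with a long right tail (invented for illustration).
scores = [2, 3, 3, 4, 4, 5, 14]

print(mean(scores), median(scores))  # the mean (5) exceeds the median (4)
print(pearson_skew(scores))          # positive, about 0.73
```

The sign of the result gives the direction of skew, as in the first formula, while its size expresses the amount of skew in standard deviation units.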
The third formula, shown on the right, is frequently reported in statistics texts. The deviation of each value from the mean is raised to the third power. The sum of these cubed deviations is then divided by a term based on the variable's standard deviation (commonly the number of scores times the cubed standard deviation).
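
One common version of this third formula divides the sum of cubed deviations by n times the cubed standard deviation. The sketch below assumes that version and the population standard deviation; the formula in the figure may differ in its denominator (for example, by using n - 1).

```python
from statistics import mean, pstdev

def moment_skew(scores):
    """Sum of cubed deviations divided by n * s**3 (one common form)."""
    n = len(scores)
    m = mean(scores)
    s = pstdev(scores)
    return sum((x - m) ** 3 for x in scores) / (n * s ** 3)

# Hypothetical data, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [2, 3, 3, 4, 4, 5, 14]

print(moment_skew(symmetric))     # 0 for a symmetric distribution
print(moment_skew(right_tailed))  # positive for a long right tail
```

Cubing keeps the sign of each deviation, so large deviations on one side of the mean dominate the sum and set the sign of the result.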

