production date 2/18/00

Graphing Data

Table of Contents and Objectives
Frequency Distributions: Review frequency distributions.
Frequency Polygon: Learn to draw one.
Grouped Frequency Polygon: Use when there are many different values.
Jittered Plots: Learn how data are displayed using a jittered plot.
Time Sequence Plot: Learn to interpret this plot.
Histograms: Histograms can lie about shape.
Stem-and-Leaf: Learn to produce these plots.
Box Plots: Learn about displaying outliers.
Statlets Examples: Use Statlets to draw all these graphs.
Distribution Shapes: Learn about different shapes in distributions.
Proper Graph Design: Keep it simple.
Computer Project 5: Using Statlets to graph a variable.
Business Graphics: Barcharts and piecharts.
Project 6: Creating barcharts and piecharts.
Graphing Probability Distributions: Graph 24 different distributions.
Additional Information: Discover interesting Web links.
Questions/Test: Take the end-of-chapter test.
Report: Send a chapter report to your instructor.


This chapter is concerned with techniques for graphing a single variable. Graphing more than one variable is discussed in chapters where multiple variables are simultaneously analyzed.

After data are prepared for a computer analysis, one of the first things researchers do is summarize it. If a picture is worth a thousand words, a graph is worth a thousand calculations. Often the human eye can detect subtleties within a graph that no amount of statistical calculation can disclose. If constructed correctly, graphs summarize and greatly speed our understanding of data. If constructed incorrectly, graphs can be made to tell quite convincing lies.

Frequency Distributions   

x f
10 1
9 2
8 4
7 8
6 4
5 2
4 1
In the next section we learn that a quick method of hand sketching a data set is to draw a frequency polygon. A frequency polygon is a closed line graph of a frequency distribution. First, we review frequency distributions, where the column headed with the letter x (or the name of the variable) lists the values of the variable of interest. The column headed by the letter f (or the word number) gives the number of times each value occurs in the data, that is, the frequency of each individual score value.

If we have the following data set {10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4} and we turn it into a frequency distribution, it would look like the figure on the left.
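The tallying step can also be sketched in a few lines of Python. This is only an illustration of the counting described above, not part of Statlets; the variable names are chosen here for convenience.

```python
from collections import Counter

data = [10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4]

# Count how many times each score value occurs.
counts = Counter(data)

# List (x, f) pairs from the largest score down, matching the table layout.
frequency_distribution = sorted(counts.items(), reverse=True)

print("x  f")
for x, f in frequency_distribution:
    print(f"{x:<2} {f}")
```

The frequencies sum to 22, the number of scores in the data set.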

A complete discussion of frequency distributions is found in Chapter 2.


Frequency Polygon   

A frequency polygon is a closed line graph of a frequency distribution. To construct the frequency polygon, first mark the x-axis using the score values (found conveniently under the x column in the frequency distribution) and the y-axis using the frequencies found under the f column. Then plot one point for each score value at the height of its frequency, and connect the points with straight lines. The completed frequency polygon for the above frequency distribution should look like the figure on the left.
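For readers who prefer to see the construction as coordinates, here is a minimal sketch that lists the points a frequency polygon would connect. It assumes the polygon is closed by dropping to a frequency of zero one score unit below the smallest value and one unit above the largest; that closing rule and the function name polygon_vertices are assumptions made for this illustration.

```python
def polygon_vertices(frequency_distribution):
    """Return the (x, f) points of a closed frequency polygon, left to right.

    Assumption: the polygon touches the x-axis one score unit beyond
    each end of the observed values.
    """
    points = sorted(frequency_distribution)
    lowest, highest = points[0][0], points[-1][0]
    return [(lowest - 1, 0)] + points + [(highest + 1, 0)]


distribution = [(10, 1), (9, 2), (8, 4), (7, 8), (6, 4), (5, 2), (4, 1)]
vertices = polygon_vertices(distribution)

for x, f in vertices:
    print(x, f)
```

Connecting these points in order with straight lines gives the polygon; the highest point is at the score of 7.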

Sample Data Polygons

Notice that the data, in this case, form a symmetrical graph with the score of 7 the most frequent. Also note that the graph is closed (it touches the x-axis). An open or closed graph coding scheme is used by many statisticians to indicate whether the data are from a sample or a population. Sample data are real in the sense that all their values are known. Line graphs formed from sample data are closed.

x y
10 6
9 4
8 2
7 2
6 6
Using the drawing grid directly below, and referring to the frequency distribution on the left, roughly sketch what you think the sample frequency polygon should look like. Frequency values are shown on the left, while score values are displayed on the bottom. This applet only allows you to draw straight lines by pressing the mouse button and dragging from one point to another. You will need to draw your polygon using 10 or fewer lines (10 is the maximum number of lines the applet allows). Think before you begin. You cannot erase lines once you have started drawing. When you have finished your sketch, you may view a correct drawing for comparison.



Population Data Polygons

Population data are usually not known. When we sample large populations, we don't know all the population values, just the values contained in our sample. Graphs of populations and theoretical distributions like the normal distribution are therefore open - they do not touch the x-axis. The figure on the left illustrates the normal distribution. Notice that the ends of the graph do not touch the x-axis. The equation for the normal distribution is overlaid on the graph.

Frequency polygons are not typically drawn by statistical programs, so they are not seen as often today as they were in the past. Another difficulty with frequency polygons is that if there are many different values (these are graphed on the x-axis), the figure becomes much wider than is practical. Statisticians developed grouped frequency polygons to deal with this situation.

Grouped Frequency Polygons   

In a grouped frequency polygon, several different variable values are grouped into one dot, which is placed at their midpoint. Since grouped frequency polygons are rarely seen today, we will not discuss them further.

Jittered Plots   



Another popular graph type for a single variable is called a jittered plot. Jittered plots graph each variable value along the x-axis and place a symbol above the variable value each time the variable value occurs. The symbols are randomly jittered (moved) on the y-axis so that symbols do not overlap. Above is a jittered plot produced by Statlets for the following data: {10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4}.

Notice that above the value of 4 on the x axis, you find a single square representing the single value of 4 found in the data. Above the value of 5 are two squares representing the two fives in the data. Do you see the eight squares above the value of 7? The entire data set is represented in this plot.
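The idea of jittering can be sketched as follows. This is not how Statlets implements its plot; it is a minimal illustration that assumes each symbol keeps its true x value and receives a uniformly random height. The function name jitter_points and its height and seed parameters are hypothetical.

```python
import random


def jitter_points(data, height=1.0, seed=0):
    """Pair every data value with a random y position between 0 and height.

    The x coordinate is the data value itself; only y is jittered.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [(x, rng.uniform(0, height)) for x in data]


data = [10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4]
points = jitter_points(data)

# Count the symbols that would appear above the value 7.
symbols_above_7 = sum(1 for x, _ in points if x == 7)
print(symbols_above_7)
```

Because the jitter is random, symbols can still land close together; plotting software typically uses small symbols to keep them distinguishable.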

You can reproduce this jittered plot using Statlets by either clicking on the Statlets link in the navigation bar, or using this single applet that will produce an additional window in your browser. Remember, to return to this page, simply close the newly created page.


Time Sequence Plots   

Year f
1965 42.3
1974 37.2
1979 33.5
1983 32.2
1985 30.0
1987 28.7
1988 27.9
1990 25.4
1991 25.4
1992 26.4
1993 25.0
Sometimes the frequency with which a variable value occurs is very important, and it is also associated with a second variable that is sequential. Sequential variables are often a measure of time. Investigating how the frequency of a variable is associated with changes in the sequential variable is where time series plots (a.k.a. time sequence plots) shine. The percentage of people aged 18 and older who smoke tobacco is shown in the table on the left. The time-sequence plot produced using these data is shown below.

What does this graph indicate? Are you impressed with the decrease in smoking?
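One way to put numbers behind these questions is to compute the change between successive observations. The sketch below uses the values from the smoking table; note that the years are not evenly spaced, so each change is divided by the number of years it spans. The function name changes_per_year is chosen for this illustration only.

```python
smoking = [
    (1965, 42.3), (1974, 37.2), (1979, 33.5), (1983, 32.2),
    (1985, 30.0), (1987, 28.7), (1988, 27.9), (1990, 25.4),
    (1991, 25.4), (1992, 26.4), (1993, 25.0),
]


def changes_per_year(series):
    """Average change in percentage points per year between observations."""
    rates = []
    for (year0, value0), (year1, value1) in zip(series, series[1:]):
        rates.append((year0, year1, (value1 - value0) / (year1 - year0)))
    return rates


rates = changes_per_year(smoking)
total_change = smoking[-1][1] - smoking[0][1]

print(f"Total change 1965 to 1993: {total_change:.1f} percentage points")
for year0, year1, rate in rates:
    print(f"{year0}-{year1}: {rate:+.2f} points per year")
```

The overall change is a drop of about 17.3 percentage points, but the yearly rates show the decline was not steady: there was no change from 1990 to 1991 and an increase from 1991 to 1992.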



You can reproduce this time sequence plot using Statlets by either clicking on the Statlets link in the navigation bar, or using this single applet that will produce an additional window in your browser.

The Centers for Disease Control and Prevention (CDC) produces many different reports concerning health issues in America. You may want to look at some of these reports and see how they use time sequence plots. You may also want to try your hand at creating more time sequence plots. More data concerning tobacco usage are provided for this purpose.


Histograms   

Histograms are like frequency polygons except the entire score (or group of scores if there are many) is represented by a bar instead of a point. The lines at either side of the vertical bar are drawn at the upper and lower limits of the score interval and touch each other.

To study histograms we look at several made using a common statistical software program. First start with the data used to construct the frequency polygon and jittered plot {10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4}.
When the histogram is constructed using typical statistical software, it looks like the figure on the left.

Note that this histogram looks quite similar to the frequency polygon produced earlier. It is symmetrical and the middle bar (representing the score of 7) has the most frequencies. Like the frequency polygon, the actual number of frequencies is found on the right y-axis where it is labeled with the word count. You can see that for the score of 7 there are 8 counts or frequencies. For scores of 4 and 10 there is just a single frequency.

You will also note that with this software package, not every bar is labeled with a value on the x-axis. When a bar is labeled, the number is attached to the left side of the bar, where the line on the left edge of the bar descends below the x-axis. Not every bar is labeled because the numbers tend to crowd each other. However, it is easy to determine that the first bar represents the number 4 (it is marked), and that the second bar must represent the number 5 because the third bar is clearly marked with the number 6.

The left y-axis is labeled as the proportion per bar. This axis simply calculates the proportion of frequencies found in each bar. Proportions are calculated by taking the number of frequencies in that particular bar and dividing that number by the total number of frequencies in the data set. For example, for the number 7 there are 8 frequencies. Because there are 22 numbers in the data set, the proportion per bar for 7 is equal to 8/22 = 0.3636.
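The proportion-per-bar calculation can be checked with a short Python sketch. It is offered only as an illustration of the arithmetic described above; the variable names are chosen here.

```python
from collections import Counter

data = [10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4]

counts = Counter(data)
n = len(data)  # total number of frequencies in the data set

# Proportion per bar = frequency of the value / total number of frequencies.
proportions = {x: counts[x] / n for x in sorted(counts)}

for x, p in proportions.items():
    print(f"{x:>2}: {counts[x]} / {n} = {p:.4f}")
```

The proportions across all bars sum to 1, which is why the two y-axes of the histogram show the same shape on different scales.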

In the figure shown on the left, the frequency polygon and the histogram are superimposed. Note their similarities. The histogram included the proportion per bar y-axis on the left. You can also see that in this case, the x-axis was labeled differently by the computer software. The histogram labeling was explained earlier. In the frequency polygon each dot's value was individually labeled directly under the dot on the x-axis. Therefore, the numbers on the x-axis from the polygon and histogram do not lie directly on top of one another.

The histogram shown directly on the left was produced with Statlets using the same data as the histograms produced above using the default values set by Statlets. Notice how the graph does not have the same shape as the histograms drawn above.

You are cautioned that histograms that have more than a single value represented in a bar do not always display the shape of a distribution correctly. Histogram bars are also known as bins. If the bars represent more than one value, they are said to have a bin width equal to the range of values they represent. For example, if the values of 4, 5, and 6 were all represented by a single bar in a histogram, it would be said to have a bin width of three. To illustrate that histograms can look quite different depending on the bin width selected by the software, manipulate the bin widths on a JAVA histogram program created by Dr. West from the University of South Carolina.

The figure on the left is another histogram produced by Statlets using the identical data set as above, but with different drawing options. After clicking the options button, the number of bins was changed to 7, and the lower limit was set to 4.0 (the value of the smallest number), and the upper limit was set to 10.0 (the value of the largest number). By choosing these values, each bar was forced to illustrate a single value.
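The effect of bin choices can be sketched in Python. The function below is a simplified stand-in for what a statistics package does, not the Statlets algorithm: it assumes equal-width bins between a lower and upper limit, with every data value inside those limits. The second call, which uses three bins of width three between 4 and 13, is a hypothetical setting chosen only to show how the shape changes.

```python
def bin_counts(data, lower, upper, n_bins):
    """Count values in n_bins equal-width bins between lower and upper.

    Assumes every value lies within [lower, upper]; a value equal to
    upper is placed in the last bin.
    """
    counts = [0] * n_bins
    for x in data:
        index = int((x - lower) * n_bins / (upper - lower))
        index = min(index, n_bins - 1)
        counts[index] += 1
    return counts


data = [10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4]

# Seven bins from 4.0 to 10.0: each bar holds a single score value.
single_value_bins = bin_counts(data, 4.0, 10.0, 7)

# Three bins of width three (4-6, 7-9, 10-12): hypothetical wide bins.
wide_bins = bin_counts(data, 4.0, 13.0, 3)

print(single_value_bins)
print(wide_bins)
```

With single-value bins the counts are symmetrical around 7. With the wide bins the same data produce counts of 7, 14, and 1, which no longer look symmetrical.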

In general, most software packages will construct a histogram containing between seven and twenty bars. Fewer than seven bars, and you have probably summarized the data too much. With more than 20 bars in a histogram, the data is not summarized enough.

Stem-and-Leaf Graphs   

Stem-and-Leaf displays are like histograms turned on their sides, except that all or part of the values for the variable appear in the graph. In a stem-and-leaf diagram, each number is broken into two parts. These two parts are called the stem and the leaf. In a simple example, the first digit of the number is the stem and the leaf is the second digit.

The figure below shows a stem-and-leaf display created by a typical statistical package for a data set with 46 cases, whose values range from 1 to 100. Some of the text has been changed to green for instructional purposes.



         Stem and leaf plot of variable:    SCORE    , N =    46


Minimum is:        1.000
Lower hinge is:       11.000
Median is:        21.500
Upper hinge is:       77.000
Maximum is:      100.000

               0   1234567789
               1 H 0123345556789
               2 M 46
               3   3
               4   5
               5   6
               6   55567
               7 H 279
               8   026777
               9   023
              10   0
A stem-and-leaf display



The first stem-and-leaf row appears after the initial descriptive statistics, and is shown using a green font. This row, which might have an equivalent bar in a histogram, has a stem of 0 and leaves of 1,2,3,4,5,6,7,7,8,9. It is representing the numbers from 1 - 9. The next stem in the next lower row is 1. The leaves associated with that stem are 0, 1,2,3,3,4,5,5,5,6,7,8,9. If you place the stem ahead of the leaves you get the numbers this bar is representing: 10, 11, 12, 13, 13, 14, 15, 15, 15, 16, 17, 18, and 19.
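The splitting of each number into a stem and a leaf can be sketched in Python. The list of scores below is read back from the leaves in the display above, so it is a reconstruction rather than the original data file. The sketch assumes non-negative whole numbers with the last digit as the leaf; the function name stem_and_leaf is chosen for this illustration.

```python
def stem_and_leaf(values):
    """Return (stem, leaves) rows; the last digit of each value is its leaf."""
    rows = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows.setdefault(stem, []).append(str(leaf))
    # Include every stem between the smallest and largest, even if empty.
    stems = range(min(rows), max(rows) + 1)
    return [(stem, "".join(rows.get(stem, []))) for stem in stems]


# Scores reconstructed from the stem-and-leaf display (N = 46).
scores = [
    1, 2, 3, 4, 5, 6, 7, 7, 8, 9,
    10, 11, 12, 13, 13, 14, 15, 15, 15, 16, 17, 18, 19,
    24, 26, 33, 45, 56,
    65, 65, 65, 66, 67,
    72, 77, 79,
    80, 82, 86, 87, 87, 87,
    90, 92, 93,
    100,
]

display = stem_and_leaf(scores)
for stem, leaves in display:
    print(f"{stem:>3}   {leaves}")
```

Printed in a fixed-width font, the rows form the same bars as the display above, without the H and M markers.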

Notice that because each leaf is one digit long, and because each digit is drawn on the screen using a font in which every character is the same width, bars are formed whose length represents the number of frequencies in each row. The great advantage of stem-and-leaf displays, at least with simple data sets, is that all the data values are represented. When looking at a histogram, data values are lost. The figure below illustrates the same data set using a histogram.

Notice how the histogram "looks like" the stem-and-leaf rows if the histogram were turned on its side.

Many statistical packages' stem-and-leaf displays supply more pieces of information. Minimums (the smallest number) and maximums (the largest number) are often displayed. Lower and upper hinges are also frequently calculated and displayed. The lower hinge corresponds to the number at the 25th percentile (25% of the frequencies are below this number). The upper hinge corresponds to the number at the 75th percentile (75% of the frequencies are below this number). The median is also calculated. Medians are middle scores in terms of frequencies. Thus 50% of the frequencies in a data set are below the median. These values are displayed above the actual stem-and-leaf graph, and the letters H and M are placed in the space between the stem and leaf in the graph to indicate within which row the hinges and the median may be found.
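The summary values printed above the display can be reproduced with a short sketch. It assumes the package uses Tukey's depth rule for hinges (median depth = (N + 1) / 2, hinge depth = (whole part of the median depth + 1) / 2); the text does not state which rule the software uses, but this one reproduces the printed values. The scores are reconstructed from the leaves in the display, and the function names are chosen for this illustration.

```python
import math


def value_at_depth(ordered, depth):
    """Value at a 1-based depth; a depth ending in .5 averages two neighbors."""
    low = ordered[math.floor(depth) - 1]
    high = ordered[math.ceil(depth) - 1]
    return (low + high) / 2


def five_number_summary(values):
    """Minimum, hinges, median, and maximum using Tukey's depth rule."""
    ordered = sorted(values)
    n = len(ordered)
    median_depth = (n + 1) / 2
    hinge_depth = (math.floor(median_depth) + 1) / 2
    from_top = ordered[::-1]  # count the upper hinge down from the maximum
    return {
        "minimum": ordered[0],
        "lower_hinge": value_at_depth(ordered, hinge_depth),
        "median": value_at_depth(ordered, median_depth),
        "upper_hinge": value_at_depth(from_top, hinge_depth),
        "maximum": ordered[-1],
    }


# Scores reconstructed from the stem-and-leaf display (N = 46).
scores = [
    1, 2, 3, 4, 5, 6, 7, 7, 8, 9,
    10, 11, 12, 13, 13, 14, 15, 15, 15, 16, 17, 18, 19,
    24, 26, 33, 45, 56,
    65, 65, 65, 66, 67,
    72, 77, 79,
    80, 82, 86, 87, 87, 87,
    90, 92, 93,
    100,
]

summary = five_number_summary(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Other percentile rules can give slightly different hinge values for the same data, which is one reason to check the manual of the package in use.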

The figure on the left was created using Statlets, using the data used previously to produce the Statlets histogram. Notice that the stem-and-leaf plot looks like the second histogram. One of the major advantages Statlets provides is a full interpretation of the output.


Box Plots    

Box plots, also called box-and-whiskers plots, display the distribution of single variables. The figure on the left shows the box plot for the last data set.

In a box plot, the variable's median is marked by a single vertical line. In the box plot using the last data set, we know that the median is 21.5. We showed this value using the stem-and-leaf procedure. The vertical line representing the median can be seen in the left part of the box. In the box plot you have to estimate it from the graph. The upper and lower hinges mark the ends of the box. From this graph we see that the lower hinge is close to 10 and the upper hinge is just below 80. Since we have calculated those values and shown them in the stem-and-leaf display we know the lower hinge is exactly 11 and the upper hinge is 77.

The length of the box is called the Hspread. The Hspread is equivalent to the interquartile range; that is, fifty percent of the values fall within the box.

The whiskers show how far the data spread away from the hinges. Most software packages only allow whiskers to extend a maximum distance of 1.5 Hspreads from the hinges. If the data values do not spread all the way to 1.5 Hspreads beyond the hinges, the whiskers do not extend that far. In this example the Hspread is 77 - 11 = 66, so the boundaries lie 1.5 x 66 = 99 units beyond the hinges, at -88 and 176. Note that the lower whisker only extends to the number 1 and the upper whisker extends to the number 100. All of the data in this example are within the 1.5 Hspread boundary.

If there are data values beyond 1.5 Hspreads from the hinge, but less than three Hspreads, they are typically marked with an asterisk and called outside values or outside points. Data values more than three Hspreads from the hinges are frequently marked with circles and called far outside values. You should certainly consult the manual for any software package you are using to determine how these outside and far outside values are displayed. The Statlets manual describes its symbol use for box plots fully. The box plot on the left illustrates a data set with a single outside value and two far outside values. The Statlets-produced box plot directly below illustrates a data set with four outside values and a single far outside value. Outside and far outside values are also known as outliers. Their values lie outside most of the other values in the data set. You will learn later that the detection of outliers is very important. Box plots are a visual method of detecting them. Outliers can greatly influence the calculation of certain statistics.
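The outside and far outside rules can be sketched as a small classifier. The hinges of 11 and 77 come from the stem-and-leaf display; the test values of 200, 300, and -100 are hypothetical and are not in the data set. The function name classify and its labels are chosen for this illustration, and individual packages may treat values exactly on a boundary differently.

```python
def classify(value, lower_hinge, upper_hinge):
    """Label a value as inside, outside, or far outside the hinges.

    Distances are measured from the nearer hinge in units of the Hspread.
    """
    hspread = upper_hinge - lower_hinge
    if value < lower_hinge:
        distance = lower_hinge - value
    elif value > upper_hinge:
        distance = value - upper_hinge
    else:
        distance = 0
    if distance > 3 * hspread:
        return "far outside"
    if distance > 1.5 * hspread:
        return "outside"
    return "inside"


lower_hinge, upper_hinge = 11, 77  # from the stem-and-leaf display

# 1 and 100 are the real minimum and maximum; the rest are hypothetical.
for value in [1, 100, 200, 300, -100]:
    print(value, classify(value, lower_hinge, upper_hinge))
```

With these hinges, a value must lie above 176 or below -88 to be an outside value, and above 275 or below -187 to be a far outside value.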

Notice also that in the Statlets produced box plot, the median is shown by the vertical line inside the box, and the mean is shown using a red cross.


Statlets Examples   

This single Statlets link will allow you to create all of the graph types discussed above. Follow the directions at this link, and produce all of the possible graphs.


Distribution Shapes    

One of the advantages of graphs is that they let us visually determine the shape of data. However, we must be quite careful when looking at graphs and inferring data shape. Histograms, which are one of the most popular graph forms, are, as we learned above, surprisingly poor at letting us see the shape of data.

There are three important shape characteristics that well-drawn graphs can, under certain circumstances, quickly indicate. These shape characteristics are modality, skew, and kurtosis.

Modality

Modality refers to the number of humps or modes in a distribution. Normal distributions have a single hump. Look at the following three figures that illustrate unimodal, bimodal and multimodal distributions.




We have used histograms to illustrate modality because if each bar represents a single number, as these histograms do, then histograms can clearly show modality. More often, the bars in histograms contain several different values, and, if so, the histogram does a poor job of indicating modality. There is simply no graphical substitute for using a frequency distribution to judge modality.

Skew

Another shape characteristic, skew, indicates whether the distribution is symmetrical. Perfectly symmetrical distributions, like the normal distribution, have a skew equal to zero. If the data set has few small values and many values that are larger, the skew of the distribution is most likely negative. If exactly the opposite occurs, then the data has a positively skewed shape. This animation cycles through a series of pictures that begin by displaying a distribution that is approximately normal. Then, as data values change, the data becomes increasingly positively skewed. The figures below clearly illustrate symmetrical, negatively skewed and positively skewed distributions. Look at the tails of the distributions, and derive a rule for identifying the skew of the distribution.
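The sign of the skew can also be checked numerically. The sketch below uses the simple moment-based coefficient (the third central moment divided by the 1.5 power of the second); statistical packages, including Statlets, may apply a sample-size adjustment, so their reported values can differ slightly. The right-tailed list is hypothetical data made up for this illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no sample-size adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# The symmetrical data set used throughout this chapter.
symmetric = [10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4]

# Hypothetical data: many small values and one large value (long right tail).
right_tailed = [1, 1, 1, 2, 2, 3, 10]
left_tailed = [-v for v in right_tailed]  # mirror image, long left tail

print(skewness(symmetric))
print(skewness(right_tailed))
print(skewness(left_tailed))
```

The symmetrical data give a skew of zero, the data with a long right tail give a positive skew, and the mirror image gives a negative skew of the same size.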







Return to the Statlets Examples page and click the Stats tab. Note what the interpretation says about the values of standardized skewness and standardized kurtosis, which can be used to determine whether the sample comes from a normal distribution. Values of these statistics outside the range of -2 to +2 indicate significant departures from normality, which would tend to invalidate any statistical test regarding the standard deviation. Standardized skewness and kurtosis are not influenced by the size of the variable values and can be compared across distributions.

Kurtosis

Kurtosis indicates how fast a distribution comes to a peak, and how thick the tails of the distribution are in comparison to a normal distribution. Both peakedness and tailedness are components of kurtosis. Perfectly normal distributions have kurtosis values of zero and are called mesokurtic. Distributions that rise to a peak faster than normal curves, and also have thicker tails, are called leptokurtic and have kurtosis values greater than zero. Distributions that rise more slowly than the normal curve, and that have thinner tails, are called platykurtic and have kurtosis values less than zero.
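A kurtosis value on this scale, where the normal distribution scores zero, can be sketched with the moment-based formula (the fourth central moment divided by the square of the second, minus 3). Packages such as Statlets may apply a sample-size adjustment, so their values can differ. The peaked and flat lists are hypothetical data made up for this illustration.

```python
def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores zero."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


# The symmetrical data set used throughout this chapter.
chapter_data = [10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4]

# Hypothetical data: a sharp peak with two distant tail values.
peaked_heavy_tails = [0] * 10 + [-5, 5]

# Hypothetical data: flat, with every value equally frequent.
flat = [1, 2, 3, 4, 5]

print(excess_kurtosis(chapter_data))
print(excess_kurtosis(peaked_heavy_tails))
print(excess_kurtosis(flat))
```

The chapter's data set scores close to zero, the peaked data with distant tail values score above zero (leptokurtic), and the flat data score below zero (platykurtic).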

The following figure illustrates distributions that are leptokurtic and platykurtic. As you will notice from this figure, the density lines of both the leptokurtic and platykurtic distributions cross the normal distribution density line twice on each side of the mean.



Regular and standardized kurtosis, however, are difficult to interpret in distributions that are not symmetrical. If you would like to learn more about the interpretation of kurtosis, see the Additional Information section in this Chapter.


Proper Graphic Design    

Graphs should be designed to present information as accurately as possible. The information contained in a graph should be concise. As everyone knows the eye is easily fooled, and improperly constructed graphs can easily tell lies. The most frequently cited rule to make graphs as truthful as possible is to make them as simple as possible.

Many excellent newspapers like USA Today are sources of poor graphs from a statistical point of view. Newspapers love graphical data displays. Of course, to sell papers, they make their graphs as visually interesting as possible, and use color extensively; seemingly without much thought as to what the colors connote to readers. As an example, look at this bar chart concerning "What Smokers Believe" that appeared in the June 3, 1997 edition of USA Today. You won't find a statistics software package that would produce a bar chart with this much visual appeal. However, the statistical package's graphs would be subject to much less misinterpretation.

After reading about proper graphic design, you might want to view USA Today and other newspapers with an eye to finding example graphs that violate the rules pertaining to proper graphical design. A very good starting place is the first link in the Graphics Resources section that follows. Also, Chapter 2, "Cognitive science and graphic design," in the SYSTAT graphics manual provides an excellent text-based starting point. The correct bibliographic citation for the SYSTAT manual was: SYSTAT: Graphics, Version 5.2 Edition. Evanston, IL: SYSTAT, Inc., 1992. Since that time, SYSTAT was purchased by SPSS. A list of SPSS publications is available.


Computer Project 5   

Graphing a Single Variable

To do this fifth computer project, you need to first read the directions. Next, look at the questions and possible answers in the project report.
To view the project report, you must be able to establish an active internet connection.

The project report will appear on a secondary page. After reading that secondary page, do not close it. Simply move the report page so that you can see this page (click and drag the window's title bar to expose this primary page). Click this page to activate it, and start Statlets by clicking the Statlets button on the Navigation panel. After completing the project, and if instructed to do so by your instructor, click the project report page to activate it, and answer the questions. After clicking the submit button, close both the report window and the Statlets windows.


Business Graphics   

The graphs we have discussed in this chapter are those most frequently used in social science statistics. Barcharts and piecharts are frequently used in business statistics.

Barcharts

Barcharts like the one shown on the left are often confused with histograms. While these two graph types superficially look quite similar, they are dramatically different. A barchart is the name given to a graph in which a nominal variable is graphed on the x-axis.

Barcharts are used more often in business statistics than in the social sciences. Note how the bars in a barchart do not touch each other. The length of each bar is usually scaled to reflect either the number of times a specific value occurs or the percentage of the time that the value occurs.

Piecharts

Piecharts are like barcharts, except that the sections of the pie are scaled to reflect the percentage of occurrences of a specific variable value.
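The scaling behind both chart types can be sketched with a nominal variable. The category labels and counts below are hypothetical, made up only to show the arithmetic: bar length follows the count (or percentage), and each pie section takes the same percentage of the 360 degrees in a circle.

```python
from collections import Counter

# Hypothetical nominal data: ten employees and their departments.
departments = ["Accounting"] * 5 + ["Finance"] * 3 + ["Marketing"] * 2

counts = Counter(departments)
n = len(departments)

# Percentage of occurrences for each category (bar length or pie share).
percentages = {name: 100 * count / n for name, count in counts.items()}

# Angle of each pie section in degrees.
angles = {name: 360 * count / n for name, count in counts.items()}

for name in counts:
    bar = "#" * counts[name]  # a crude text bar scaled to the count
    print(f"{name:<11} {bar:<6} {percentages[name]:5.1f}%  {angles[name]:6.1f} degrees")
```

Because the categories are nominal, their order on the axis carries no meaning, which is one reason the bars in a barchart are drawn with gaps between them.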


Computer Project 6   

Creating Barcharts and Piecharts

To do this sixth computer project, you need to first read the directions. Next, look at the questions and possible answers in the project report.
To view the project report, you must be able to establish an active internet connection.

The project report will appear on a secondary page. After reading that secondary page, do not close it. Simply move the report page so that you can see this page (click and drag the window's title bar to expose this primary page). Click this page to activate it, and start Statlets by clicking the Statlets button on the Navigation panel. After completing the project, and if instructed to do so by your instructor, click the project report page to activate it, and answer the questions. After clicking the submit button, close both the report window and the Statlets windows.


Graphing Selected Probability Distributions   

By using the Plot/Probability Distributions menus as shown on the left, Statlets can plot 24 different probability functions. The most useful of those distributions in an introductory statistics class are the Binomial, Chi-square, F, Normal, and Student's t distributions.

The figure below shows the input tab for all 24 distributions. Read the user manual pages for this procedure at this link. These pages will open in a new browser window. To return to this page, simply close the new browser window. The most useful tabs for introductory students include the PDF tab, which simply displays a graph of the probability density function (in other words, what the distribution looks like); the Tail Areas tab, which calculates the probabilities of up to five values along the distribution; and the Critical Values tab, which finds the value in the function given a probability. We will return to this procedure when we discuss each of the distributions later in the text.



Additional Information   

Web Resources

You may be interested in visiting the following links that provide additional learning opportunities concerning proper graphical design.
Jon Cryer's posted references from the Edstat listserv.


Graphical Data Analysis: A statistics class with an electronic text from Montana State University

 


Questions/Test   

This link allows you to take a computer scored end-of-chapter test. If your instructor requests to see the results of this examination, you can either copy and e-mail or print the feedback you will receive immediately after taking the test.

Report   

Please send a report indicating your understanding of this chapter to your instructor. You will need to know both your and your instructor's e-mail addresses.