production date 2/5/00

Nonparametric Analysis

Table of Contents and Objectives
One-Variable Chi-Square: How to calculate the goodness-of-fit test.
Statlets one-variable chi-square: Using Statlets to produce a frequency table for use in the goodness-of-fit test.
Computer Problem 33: Calculating the goodness-of-fit test.
Two-Variable Chi-Square: Understanding crosstabulation procedures.
Computer Problem 34: Using Statlets to conduct a two-way chi-squared test.
Tests for Ordinal Data: Other nonparametric statistical tests.
Additional Information: Where to learn more about nonparametric tests.
Questions/Test: Take the end-of-chapter test.
Report: Send a chapter report to your instructor.


Until now, we have discussed and learned how to calculate parametric statistics. In parametric statistics, the symbols used when writing the null and alternative hypotheses are population parameters. These parameters completely specify the location and shape of the population distribution, which in turn ensures the shape of the sampling distribution.

Nonparametric statistics are often termed distribution-free tests. They do not assume that a population distribution must be specified by parameters. Consequently, these procedures do not use population parameters in their null and alternative hypotheses. While parametric statistics require the dependent variable to be measured at the interval level, nonparametric statistics are used when the dependent variable is measured at the nominal or ordinal level. Nominal, ordinal, and interval levels were discussed in Chapter 2. A complete discussion of nonparametric statistics is beyond the scope of this book. Additional information on nonparametric statistics can be found in S. Siegel (1956) and L.A. Marascuilo and M. McSweeney (1977). The full references to these texts are found in the Additional Information section later in this chapter.

This chapter discusses the one-variable chi-square (goodness-of-fit) test and the two-variable chi-square for nominal data. For ordinal-level dependent variables, where differences in medians rather than means are typically being evaluated, the Sign Test, the Mann-Whitney U Test, the Wilcoxon Matched Pairs Signed Rank Test, and the Kruskal-Wallis One-Way Analysis of Variance are briefly discussed.


One-Variable Chi-Square   

The one-way or one-variable chi-square test is also known as the goodness-of-fit test. This test compares a set of observed frequencies (O) with a set of expected frequencies (E). The observed frequencies for each group are the number of sample cases in each group. The expected frequencies are the number of cases you expect each group to have given the sample size. Expected frequencies are determined in one of two ways. First, you might have a theoretical reason to expect a certain percentage of the frequencies to be in a particular category. For example, genetic theories for dominant and recessive traits lead researchers to conclude that if the theory is correct, a certain percentage of offspring should have the dominant trait while the remainder would possess the recessive trait. Second, in the absence of a theoretical expectation, you expect cases to fall into the categories at random. Under random assignment, each category is expected to have the same number, and therefore the same percentage, of cases. If the observed frequencies are distributed differently from the expected frequencies, the population distribution across the groups is presumed to differ from the expected distribution.
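The two ways of arriving at expected frequencies can be sketched in a few lines of Python. This is only an illustration: the function name, the 3:1 ratio, and the 200 offspring are assumptions made up for the example and are not taken from the chapter.

```python
def expected_frequencies(n_cases, proportions=None, n_categories=None):
    """Return the expected frequency (E) for each category.

    proportions: theoretical share of each category (must sum to 1).
    If omitted, cases are spread evenly over n_categories.
    """
    if proportions is None:
        if not n_categories:
            raise ValueError("give either proportions or n_categories")
        proportions = [1 / n_categories] * n_categories
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return [n_cases * p for p in proportions]


# Theory-driven case: a hypothetical 3:1 dominant-to-recessive ratio, 200 offspring.
print(expected_frequencies(200, proportions=[0.75, 0.25]))  # [150.0, 50.0]

# No theory: 80 cases spread evenly over four categories.
print(expected_frequencies(80, n_categories=4))  # [20.0, 20.0, 20.0, 20.0]
```

Either way, the expected frequencies always add up to the sample size.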

Statlets, like many statistical packages, doesn't directly compute the one-sample chi-square. However, the Tabulation procedure constructs a table that can be used to easily calculate the goodness-of-fit test, and also constructs several useful graphics for interpreting goodness-of-fit problems. Remember from Chapter 2 that there is a bug (or was it a feature?) with the Tabulation procedure in Statlets. To review this problem, see the Feature or Bug section in Chapter 2. Before continuing, read the user manual for the Tabulation procedure.


Statlets one-variable chi-square   

Assume you are a market researcher trying to decide which of four packages created for your client's breakfast product is best. You present the four different packages to 80 consumers and ask each of them to choose the packaging they prefer. You code each subject with a One if they choose package #1, a Two if they choose package #2, and so on. If there are no differences in preference for the four packages, you assume that each package would receive an equal number of choices: twenty people would choose package #1 as best, twenty would choose package #2, twenty would choose package #3, and twenty would choose package #4. To compare the observed frequencies (O) with the expected frequencies (E), you calculate a chi-square statistic with the assistance of Statlets' Tabulation procedure.

Directly below is the sample data collected. Again, remember that the Summarize/Tabulation procedure has a small bug: the first time a variable value is encountered, it is not counted. To turn this bug into a feature, the first four values are entered as labels. In this data set, the first four values are not actual data values, but simply values in the order that we want the table created. We have indicated that they are special by including the second variable value "Label" beside each of them. Notice that in the first row, only a single variable is defined (Package). When a second variable occurs, it will be placed in the second column by Statlets.

Package
One Label
Two Label
Three Label
Four Label
Three
Two
One
Four
Four
Four
Four
Three
Two
Two
Four
Four
Four
Four
Two
Two
Four
Four
Three
Three
Two
Four
Four
Two
Two
Three
Four
Four
Two
Two
Four
Four
One
Three
Four
Four
Four
Two
Two
Two
Three
Three
One
Two
Two
Two
Two
Two
Two
Two
One
Three
Two
Two
Two
Two
Two
One
Two
Two
Two
Two
Two
One
Two
Two
Two
One
Two
Two
One
Two
Two
One
One
Three
Two
Two
Four
Two


When the copy and paste procedure is used to enter this data into Statlets, the Data window looks like the figure directly below. Notice that, as was explained above, the first four entries are clearly noted as labels. In other Statlets procedures, these values would be counted, but in the Summarize/Tabulation procedure, the first entry of each different value is ignored.


Next select the Summarize/Tabulation procedure using the menu choices. The Input tab should be completed to look like the following figure.

Notice that the second variable was not given a name in the first row, and so neither its name nor the default name of Col_# appears in the Input tab.

Clicking the Table tab produces the following correct frequency counts.
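For readers who want to check the frequency counts outside Statlets, the tally can be sketched in Python with the standard library. The names `RESPONSES`, `CATEGORIES`, and `frequency_table` are made up for this illustration. The four "Label" rows are left out because they are not data; they exist only to work around the Statlets counting behavior.

```python
from collections import Counter

# The 80 responses from the Package column, in the order listed above
# (the four "Label" rows are omitted because they are not data).
RESPONSES = """
Three Two One Four Four Four Four Three Two Two
Four Four Four Four Two Two Four Four Three Three
Two Four Four Two Two Three Four Four Two Two
Four Four One Three Four Four Four Two Two Two
Three Three One Two Two Two Two Two Two Two
One Three Two Two Two Two Two One Two Two
Two Two Two One Two Two Two One Two Two
One Two Two One One Three Two Two Four Two
""".split()

CATEGORIES = ["One", "Two", "Three", "Four"]  # fixed display order


def frequency_table(values, categories):
    """Count how often each category occurs, reporting zero for absent ones."""
    counts = Counter(values)
    return {category: counts[category] for category in categories}


table = frequency_table(RESPONSES, CATEGORIES)
for category, count in table.items():
    print(f"{category:<6}{count:>4}")
print(f"{'Total':<6}{sum(table.values()):>4}")
```

Listing the categories explicitly plays the same role as the label rows: it fixes the order in which the table is printed.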


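To calculate the chi-square statistic, use the following formula, where O is the observed frequency and E is the expected frequency for each cell, and the sum is taken over all cells:

chi-square = Σ (O - E)² / E

In this problem, the expected frequency (E) for each cell is equal to 20.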
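The calculation can also be sketched in Python as a cross-check on the hand arithmetic. The function name is an assumption made for this illustration, and the observed counts are the ones obtained by tallying the 80 Package responses listed earlier.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [10, 40, 10, 20]  # packages One..Four, tallied from the listed responses
expected = [20, 20, 20, 20]  # 80 consumers spread evenly over four packages

# Contribution of each cell, then the total.
print([(o - e) ** 2 / e for o, e in zip(observed, expected)])
print(chi_square_statistic(observed, expected))
```

Printing the per-cell contributions makes it easy to see which category drives the result.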

The observed frequencies are those given in the Table tab output; a direct tally of the 80 responses listed above gives 10, 40, 10, and 20 for packages One through Four. For the first cell, O = 10, E = 20, and the value for the first cell = (10 - 20)²/20 = 5. For the second cell, O = 40, E = 20, and the value for the second cell = 20. For the third cell, O = 10, E = 20, and the value for the third cell = 5. Finally, for the fourth cell, O = 20, E = 20, and the value for the fourth cell = 0. Adding together the values for the four cells produces a calculated chi-square value equal to 30.

To check whether this chi-square value is statistically significant, compare the calculated value against the critical value given by the Plot/Probability Distributions procedure. After using these menu choices, select the chi-square distribution, and in the PDF tab Options button set the degrees of freedom to the number of categories minus 1. In this case, the degrees of freedom are equal to three, as shown in the PDF tab and Options button directly below.


To find the critical chi-square value, select the Critical Values tab, and if you have set the alpha level at 0.05, make sure that 0.95 is set in Statlets using the Options button as shown below. Notice that the critical value displayed is equal to 7.8146. Since our calculated value of 30 is larger than the critical value, we reject the null hypothesis that the observed frequencies are equal to the expected frequencies. We conclude that the observed frequencies are significantly different from the expected frequencies.
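The critical value can also be approximated without Statlets. The sketch below uses only the Python standard library: it evaluates the chi-square cumulative distribution with a series expansion and then searches for the 0.95 point by bisection. The function names are made up for this illustration, and the series is intended for the modest values met in textbook problems rather than as a general-purpose routine.

```python
import math


def chi_square_cdf(x, df):
    """P(X <= x) for a chi-square variable, using the series for the
    regularized lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi_square_critical_value(df, alpha=0.05):
    """Value that cuts off the upper `alpha` tail, found by bisection."""
    target = 1.0 - alpha
    lo, hi = 0.0, float(df)
    while chi_square_cdf(hi, df) < target:  # widen until the target is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi_square_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


critical = chi_square_critical_value(df=3, alpha=0.05)
print(f"critical value (df = 3, alpha = 0.05): {critical:.4f}")
```

The printed value should agree with the Statlets display to about three decimal places; a difference in the last digit can come from rounding. Any calculated chi-square larger than this value leads to rejecting the null hypothesis at the 0.05 level.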


The bar and pie charts for this problem, which immediately follow, illustrate that the frequencies for the four packages differ dramatically, with package 2 being by far the most favored.






Computer Problem 33   

Use the Summarize/Tabulation procedure to produce a frequency table for the following data. The data come from an investigation in which a consumer-research organization asked thirty men on a college's intramural soccer team to evaluate the effectiveness of three soccer training films. After viewing all three films (One, Two, Three), each man indicated the film he preferred. The data immediately follow:
Preference
Three
Three
Three
Three
One
One
Three
Three
Three
Two
Two
Three
Three
Three
Three
One
Two
Two
One
One
Three
One
Three
One
Three
Three
Two
Two
Three
One


If your instructor requests, submit the project 33 report.


Two-Variable Chi-Square   

Another name for the two-way chi-square is the chi-square test of independence. Because the goodness-of-fit test is often not discussed in textbooks, and often not calculated in statistical packages, the two-way chi-square test is frequently referred to simply as the chi-square test. When the dependent variable is measured at the categorical level and the researcher has an experimental situation with two independent variables, each with at least two categories, the two-way chi-square test is appropriate. Before proceeding, read the user manual for Crosstabulation, the procedure in Statlets that conducts the two-way chi-square test. Crosstabulation also constructs several graphics that aid in understanding two-way chi-square problems.

You are a researcher working for the Department of Defense interested in determining whether there is a relationship between the social class of military volunteers and the branch of service for which they volunteer. If a relationship exists, you believe you can save taxpayer dollars without lowering the number of military volunteers by targeting advertisements to join the military to media events viewed more often by members of the specific social class who have the highest volunteer rates for that specific branch of the service. You collect data for a random sample of 100 recruits. The two variables are Choice (Army, Navy, Air_Force and Marines) and Class (Upper, Middle, and Lower). Because the data set is so large, it can be found on this separate page.

After using the copy and paste procedure to enter the data into Statlets, choose the Summarize/Crosstabulation procedure using those menu choices. Looking at the Input tab below, you can see that Choice has been selected to create the rows of the crosstabulation table while Class has been selected to form the columns of that table.


Clicking the Table tab creates the crosstabulation frequency table. Notice that for Army there were 17 volunteers from the Lower class, 12 from the Middle class, and only 1 from the Upper class. Because the total number of data values is 100, the percentages shown directly below the cell counts are numerically identical to the counts.


The Chi-squared tab calculates and displays the chi-squared value using the same formula as used for the goodness-of-fit test above. Two things differ. First, the expected frequency for each cell is computed as the row total times the column total, divided by the overall total. Second, the degrees of freedom are calculated by subtracting one from the number of rows and from the number of columns and multiplying those values together. In this instance the number of rows = 4 and the number of columns = 3, so the degrees of freedom are (4-1)*(3-1) = 6, as indicated in the Chi-squared output below. The chi-squared value of 50.27 is statistically significant, indicating that there is a relationship between Choice and Class.
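The two-way calculation can be sketched in Python as follows. Because the military data set is on a separate page, the table used here is a small made-up one, and the function name is an assumption for this illustration. Expected frequencies are taken as row total times column total divided by the grand total, which is the usual rule for a test of independence.

```python
def chi_square_independence(table):
    """Return (chi_square, degrees_of_freedom) for a table of counts
    given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df


# Hypothetical 2 x 3 table of counts (not the military data).
example = [
    [10, 20, 30],
    [30, 20, 10],
]
print(chi_square_independence(example))  # (20.0, 2)
```

In the made-up table every expected frequency is 20, so the statistic is (100 + 0 + 100 + 100 + 0 + 100) / 20 = 20 with (2-1)*(3-1) = 2 degrees of freedom.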


The Stats tab calculates other measures of association. These other measures of association indicate the strength of the association. While a complete description of each measure of association is beyond the scope of this book, a brief summary of each follows the output. Definitions for each of the associated statistics can also be found in the Glossary.


Cramer's V and the contingency coefficient quantify the degree of association tested by the chi-square. Indeed, both formulas are based directly on the chi-square value. Cramer's V can take values from 0 to +1. A large value for Cramer's V signifies that a strong association exists between the variables.

The contingency coefficient is also derived from the chi-square value. Its values range from zero to some upper limit. The upper limit of the contingency coefficient depends upon the table size, so contingency coefficients should only be compared between tables of the same size. The larger the contingency coefficient, the stronger the relationship between the variables.
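Both measures can be computed directly from a chi-square value. The sketch below uses the standard textbook formulas, V = sqrt(chi-square / (n * (k - 1))) and C = sqrt(chi-square / (chi-square + n)), where n is the number of cases and k is the smaller of the number of rows and columns. The function names are made up for this illustration, and the results are plug-in calculations from the values reported in this section, not a transcription of the Stats tab output.

```python
import math


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V: chi-square scaled to run from 0 to 1."""
    return math.sqrt(chi_square / (n * (min(n_rows, n_cols) - 1)))


def contingency_coefficient(chi_square, n):
    """Pearson's contingency coefficient C."""
    return math.sqrt(chi_square / (chi_square + n))


def max_contingency_coefficient(n_rows, n_cols):
    """Upper limit of C for a table of the given size."""
    k = min(n_rows, n_cols)
    return math.sqrt((k - 1) / k)


# Values reported in this section: chi-square = 50.27, n = 100, 4 rows, 3 columns.
print(round(cramers_v(50.27, 100, 4, 3), 3))
print(round(contingency_coefficient(50.27, 100), 3))
print(round(max_contingency_coefficient(4, 3), 3))
```

The last function shows why contingency coefficients should only be compared between tables of the same size: the ceiling itself changes with the table dimensions.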

The conditional gamma, Kendall's tau-b and tau-c, and Somers' D are appropriate when both categorical variables are ordinal (ranked). Tau-b, tau-c, gamma, and Somers' D all use information about the ordering of the variables by considering every pair of values in the data set. The measures primarily differ from one another in how ties in the rankings are treated.
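The idea of considering every pair can be made concrete with the ordinary Goodman-Kruskal gamma, which ignores tied pairs altogether: gamma = (C - D) / (C + D), where C is the number of concordant pairs and D the number of discordant pairs. The sketch below is an illustration only. The function names and the 2 x 2 table are made up, and this is the plain gamma, not necessarily the exact quantity Statlets reports as conditional gamma.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in a table whose rows and
    columns are both listed from lowest to highest category."""
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # rows ranked higher than row i
                for m in range(n_cols):
                    if m > j:                   # higher on both variables
                        concordant += table[i][j] * table[k][m]
                    elif m < j:                 # higher on one, lower on the other
                        discordant += table[i][j] * table[k][m]
    return concordant, discordant


def goodman_kruskal_gamma(table):
    """(C - D) / (C + D); undefined when there are no untied pairs."""
    c, d = concordant_discordant(table)
    return (c - d) / (c + d)


# Hypothetical table: both variables ordered low -> high.
ranked = [
    [20, 5],
    [5, 20],
]
print(concordant_discordant(ranked))  # (400, 25)
print(round(goodman_kruskal_gamma(ranked), 3))
```

Tau-b, tau-c, and Somers' D start from the same C and D counts but use different denominators that account for tied pairs.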

Lambda and the uncertainty coefficient are measures of association when the variables are measured at the categorical level. Lambda measures the percent of improvement in your ability to predict a value of the dependent variable once you know the value of the independent variable. Obviously, there are two lambdas calculated depending on which of the two variables is considered the dependent variable.

The uncertainty coefficient is similar to the lambda coefficient. It measures the proportion by which uncertainty is reduced in the dependent variable once you know the value of the independent variable. Both lambdas and the uncertainty coefficients have values that range from 0 (no improvement) to 1 when perfect predictions are possible.
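Both measures can be sketched in Python using their usual definitions, here with the row variable treated as the dependent variable. Lambda compares prediction errors with and without knowledge of the column category. The uncertainty coefficient does the same with entropy in place of error counts. The function names and the table are made up for this illustration.

```python
import math


def lambda_rows_given_columns(table):
    """Goodman-Kruskal lambda for predicting the row variable
    once the column category is known."""
    n = sum(sum(row) for row in table)
    largest_row_total = max(sum(row) for row in table)
    best_per_column = sum(max(col) for col in zip(*table))
    return (best_per_column - largest_row_total) / (n - largest_row_total)


def _entropy(counts):
    """Shannon entropy (natural log) of a set of counts."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)


def uncertainty_rows_given_columns(table):
    """Proportion of the row variable's entropy removed by knowing the column."""
    n = sum(sum(row) for row in table)
    h_rows = _entropy([sum(row) for row in table])
    h_rows_given_cols = sum(sum(col) / n * _entropy(col) for col in zip(*table))
    return (h_rows - h_rows_given_cols) / h_rows


# Hypothetical 2 x 2 table of counts.
example = [
    [30, 10],
    [10, 30],
]
print(lambda_rows_given_columns(example))  # 0.5
print(round(uncertainty_rows_given_columns(example), 3))
```

Swapping the roles of rows and columns (for example by transposing the table) gives the second lambda and the second uncertainty coefficient.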

Clicking the Barchart tab produces the following graphic from our data. Notice that for the Army most of the recruits came from the Lower class, while for the Air Force, most came from the Upper class designation.


Clicking the Mosaic Chart tab creates a graph where the length of each bar is the same, but the bars are divided and color coded according to the percentage of each with respect to the variable creating the columns in the crosstabulation table. Here again, one sees that most of the recruits for the Army are from the Lower class, while for the Air Force, they are from the Upper class.


Finally the Skychart tab produces a three-dimensional bar chart of the counts for each row category. You can also click on the rotation button marked with arrows around an X just below the Interpret button. The arrows rotate the graph in the direction each arrow points. While the human eye is very good at detecting relationships, it sometimes needs the proper perspective to see those relationships. The Skychart for the Military Data is shown below.



Computer Problem 34   

Use the Summarize/Crosstabulation procedure to produce a frequency table and calculate the chi-squared statistic for the following data. The data come from an investigation in which, during the 1960s, a major university wanted to know whether there was a relationship between students' CLASS standing (freshman, sophomore, junior, senior) and their political affiliation (AFFIL). The data directly follow:

CLASS AFFIL
freshman democrat
sophomore democrat
junior republican
senior republican
freshman democrat
sophomore democrat
junior republican
senior republican
freshman democrat
sophomore democrat
junior republican
senior republican
freshman democrat
sophomore democrat
junior republican
senior republican
freshman republican
sophomore republican
junior democrat
senior democrat
freshman democrat
sophomore democrat
junior republican
senior republican
freshman republican
sophomore republican
junior democrat
senior democrat
freshman democrat
sophomore democrat
junior republican
senior republican
freshman republican
sophomore republican
junior democrat
senior democrat
freshman democrat
sophomore democrat
junior republican
senior republican
freshman republican
sophomore republican
junior democrat
senior democrat




If your instructor requests, submit the project 34 report.


Tests for Ordinal Data   

In Statlets, other nonparametric tests are included with the parametric procedures. For example, in Chapter 9 we studied the one-sample t test where we learned procedures to determine whether a sample mean was different from a hypothesized mean. To review that material, see this section in Chapter 9.

You were asked to read the user manual for the Analyze/One Sample/One Variable Analysis procedure which was one of two procedures discussed that conducted the one-sample t test. The Rank Test tab was discussed in the user manual, but neglected in the chapter. It is the Rank Test tab that conducts the nonparametric tests that are equivalent to the one-sample t test. Instead of testing whether a sample mean is different from a hypothesized mean, the Rank Test tab provides statistics that test to see if a sample median is significantly different from a hypothesized median. You are encouraged to review the Rank Test section of the user manual again. As you will note from the user manual, both a Sign and a Signed Rank test are performed. For further information on this test see the Additional Information section below.
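The logic of the Sign test is simple enough to sketch in Python. The version below is a two-sided exact test based on the binomial distribution with p = 0.5. Observations equal to the hypothesized median are dropped, which is a common convention assumed here rather than taken from the Statlets manual. The function name and the scores are made up for this illustration.

```python
from math import comb


def sign_test(values, hypothesized_median):
    """Two-sided exact sign test. Returns (n_above, n_below, p_value)."""
    n_above = sum(1 for v in values if v > hypothesized_median)
    n_below = sum(1 for v in values if v < hypothesized_median)
    n = n_above + n_below              # ties with the median are dropped
    k = min(n_above, n_below)
    # Probability of k or fewer of the rarer sign when both are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_above, n_below, min(1.0, 2 * tail)


# Hypothetical scores tested against a hypothesized median of 50.
scores = [52, 55, 61, 48, 57, 63, 58, 50, 66, 59]
print(sign_test(scores, 50))  # (8, 1, 0.0390625)
```

Eight scores fall above 50, one falls below, and one equals 50 and is dropped, so the p-value is 2 * (1 + 9) / 512.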

The Mann-Whitney or Wilcoxon test is the nonparametric equivalent of the independent t test that we studied in Chapter 10. Again, these nonparametric equivalent tests were discussed in the user manual that you have previously read. To review the Mann-Whitney or Wilcoxon test, read the W Test section in the user manual.
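As a rough illustration of what the test computes, the Mann-Whitney U statistic can be obtained by ranking the pooled observations and summing the ranks of one sample. The sketch below stops at the statistic and does not produce a p-value. The function names and the two samples are made up for this illustration.

```python
def midrank(value, pooled):
    """Rank of `value` within `pooled`; tied values share the average rank."""
    below = sum(1 for x in pooled if x < value)
    equal = sum(1 for x in pooled if x == value)
    return below + (equal + 1) / 2


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b); the smaller of the two is usually referred to a table."""
    pooled = list(sample_a) + list(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(midrank(x, pooled) for x in sample_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, n_a * n_b - u_a


# Hypothetical samples with complete separation.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # (0.0, 9.0)
```

U_a counts the pairs in which an observation from the first sample exceeds one from the second, with ties counted as one half, so complete separation gives 0 on one side and n_a * n_b on the other.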

For paired data, the Rank Test tab in Statlets' Analyze/Two Sample/Paired Samples procedure provides the Sign and Signed Rank tests, the nonparametric equivalents of the dependent t test. Again, these nonparametric equivalent tests were discussed in the user manual. You can reread the Rank Test section of the manual for paired samples to gain an understanding of these nonparametric tests.

Finally, the nonparametric equivalent of the one-way ANOVA is the Kruskal-Wallis test. The Kruskal-Wallis test compares group medians, testing the hypotheses:
Ho: all group medians are equal
H1: not all group medians are equal
Again, information concerning this nonparametric test is found in the Kruskal-Wallis section of the user manual.
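The Kruskal-Wallis statistic can be sketched in Python from its usual formula, H = 12 / (N(N + 1)) * Σ (R² / n) - 3(N + 1), where R is the rank sum and n the size of each group, and N is the total number of observations. This version applies no correction for ties, and the function name and the data are made up for this illustration.

```python
def kruskal_wallis_h(*groups):
    """Return (H, degrees_of_freedom) for two or more independent groups.
    Tied values receive average ranks; no tie correction is applied."""
    pooled = [x for group in groups for x in group]
    n_total = len(pooled)

    def midrank(value):
        below = sum(1 for x in pooled if x < value)
        equal = sum(1 for x in pooled if x == value)
        return below + (equal + 1) / 2

    weighted = 0.0
    for group in groups:
        rank_sum = sum(midrank(x) for x in group)
        weighted += rank_sum ** 2 / len(group)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return h, len(groups) - 1


# Hypothetical data: three groups with no overlap at all.
h, df = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h, 2), df)  # 7.2 2
```

For reasonably large groups, H is compared with a chi-square critical value with (number of groups - 1) degrees of freedom; for very small groups, exact tables are more appropriate.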


Additional Information   

The following two textbooks are excellent references to nonparametric procedures.
Marascuilo, L. A., and McSweeney, M. (1977). Nonparametric and distribution-free methods for the social sciences. Belmont, CA: Wadsworth Publishing.

Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.

By this point, you might also try your own skills at finding relevant information on the web by searching for nonparametric statistics in general or for any of the specific techniques we have discussed.


Questions/Test    

This link allows you to take a computer scored end-of-chapter test. If your instructor requests to see the results of this examination, you can either copy and e-mail or print the feedback you will receive immediately after taking the test.

Report    

Please send a report indicating your understanding of this chapter to your instructor. You will need to know both your and your instructor's e-mail addresses.