production date 2/5/00

Correlation and Regression


Table of Contents Objectives
Correlation Values How to evaluate the value of a correlation
Scatterplots How to interpret graphs of correlations
Statlets Plot Using Statlets to produce a scatterplot.
Pearson Correlation Theoretical formulas for the Pearson Correlation.
Hypothesis Testing Determining if the relationship is significant.
A Statlets Problem Using Statlets to conduct a two-way ANOVA.
Hand Calculation Hand calculating the correlation coefficient.
Calculation Applet A simple applet to calculate small correlation answers.
Simple Lines Understanding simple lines as a prerequisite to regression lines.
Understanding Regression Understanding regression lines.
Outliers Outliers = strange values, and greatly change regression output.
Statlet's Regression Analysis Using Statlets to conduct a correlation/regression analysis
Additional Information Discover John Tukey
Questions/Test Take the End of Chapter Test
Report Send a Chapter Report to your Instructor


The basic purpose behind correlation is to find out if two variables are related to one another. If the variables are related, regression then allows the use of the relationship in the prediction of one variable given a score on the other variable.

For example, you might be planning on going on to graduate school and earning your master's or doctoral degree. Many graduate level programs require that you submit scores from the Graduate Record Examination (GRE). After you submit your application for admission to the graduate program of your choice, the admission team will look at your application materials and estimate how well they expect you will do in the program if you are admitted. This prediction may occur informally (just a sort of best guess), or formally, where correlation and regression procedures are used. Using formal procedures, the admission committee calculates the correlation coefficient using the GRE scores submitted by students attending the program, and the grade point averages (GPA) earned by those students. This correlation is used as both a descriptive and inferential statistic. As a descriptive statistic, the correlation informs the admissions committee about the relationship between GRE scores and GPAs. As an inferential statistic, the correlation coefficient can be used to make a decision about whether there is a statistically significant relationship between GRE scores and GPAs. If the relationship is significant, a regression equation can be generated and used by the committee. They will then be able to take your submitted GRE score, and predict your GPA before admitting you. Of course, if your predicted GPA is near the top of the scale, you should expect your acceptance letter very soon.

In this chapter we learn (1) how to calculate a correlation coefficient, (2) how to evaluate its direction and strength, (3) how to test it for statistical significance, and finally (4) how to construct and use a regression equation.


Correlation Values

A correlation coefficient is a number that ranges from -1.0 up through 0 to a maximum value of +1.0. The correlation indicates how closely the relative positions of two or more variables agree with one another. Or stated another way, the correlation indicates the correspondence, or lack of correspondence between the relative positions of two or more variables.

Direction & Magnitude

Correlation coefficients indicate both the direction of the relationship and its magnitude. If a correlation is negative, it indicates that the high values on the first variable are related to low values on the second variable, and low values on the first variable go with high values on the second. If the correlation is positive, then low values on the first variable go with low values on the second variable, and high values on the first variable, in general, go with high values on the second variable. Of course, this direction is given by the sign (either + or -) of the calculated correlation. Magnitudes are measured by comparing the pattern the data make with a straight line. If the pattern perfectly matches a straight line, then the magnitude of the correlation is either +1.0 or -1.0 depending on the direction. If the data do not fit a straight-line pattern at all, then the magnitude of the correlation is zero (0), indicating that there is no linear relationship between the two variables in question.


Scatterplots

Scatterplots graphically show these relationships. When you use a good graphics package to draw scattergrams, the x and y-axes should be approximately the same length. If not, the picture starts to tell lies about the data. The figures below show various scattergrams and their approximate correlations.



High positive correlation -- Correlation = .75



High negative correlation -- Correlation = -.75



Low correlation -- Correlation = 0.0



Perfect positive correlation -- Correlation = 1.0


Statlets Plot

This Statlet's link describes a problem faced by all school psychologists: "Is there a relationship between intelligence as measured by traditional IQ tests, and academic achievement?" If there is a relationship, then IQ can be used to predict academic achievement. Psychologists have found that there is a positive relationship between these two variables. They use equations to predict a child's achievement score after they have given that youngster an intelligence test. They also give the child an achievement test. Now they have a predicted achievement score given IQ, and an observed achievement score from the actual given test. Psychologists then compare these two scores (expected vs. observed), and can determine if a child is underachieving. The link above demonstrates the procedures for producing the scatterplot in this and similar situations. You should fully understand how to produce and derive meaning from scatterplots.


Pearson Correlation

There is one major correlation coefficient, called the Pearson product-moment correlation, named after Karl Pearson. Its symbol, if calculated using sample data, is r (which stands for regression). The Pearson correlation is a measure of straight-line association between two variables. Remember that correlation measures the congruence of relative position between two variables. Z scores are the best measures of relative position. Pearson correlations are based on the concept of the product of the z scores.

If the z scores for the x and y variables are both positive or both negative, their product will be positive. If exactly one member of the x, y pair is negative, their product will be negative. The Pearson correlation is simply the average of the cross-products of the z scores in a bivariate data set. The equations below give the population parameter and sample statistic formulas.

$$\rho_{xy} = \frac{\sum z_x z_y}{N}$$ Equation 14.1 Population correlation rho

$$r_{xy} = \frac{\sum z_x z_y}{n - 1}$$ Equation 14.2 Sample statistic

These equations simply show that you convert each subject's score on both the x and y variables to z scores, multiply the z scores together, add them up, and either divide by the population size or the sample size minus one to calculate the correlation.
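The z score procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own software: the function name pearson_r_from_z is hypothetical, and the data are the eight pairs of observation counts from the Hand Calculation section later in this chapter.

```python
import math

def pearson_r_from_z(x, y):
    """Sample Pearson r: sum of z-score cross-products divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (divide by n - 1).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    # Convert each score to a z score, multiply the pairs, and add them up.
    cross_products = [((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
                      for a, b in zip(x, y)]
    return sum(cross_products) / (n - 1)

teacher_1 = [1, 2, 4, 1, 4, 2, 3, 4]
teacher_2 = [2, 0, 5, 2, 4, 1, 2, 0]
r = pearson_r_from_z(teacher_1, teacher_2)
print(round(r, 3))  # 0.371
```

Dividing by N instead of n - 1 (and using population standard deviations) would give the population version in the same way.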

The equations above are both conceptual formulas. The equation directly below provides the calculation formula for the sample correlation. While it looks more difficult, it is actually easier to use with calculators.

Calculation formula for rxy
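One common calculation form is written out here as an assumption. It is reconstructed from the quantities the Hand Calculation section relies on (the sums of x and y, the sum of their cross products, and the two sample standard deviations), so the original figure may have arranged the terms differently.

```latex
r_{xy} = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{(n - 1)\, s_x\, s_y}
```

This agrees with Equation 14.2 because the sum of the products of the deviations from the two means equals the sum of xy minus (sum of x)(sum of y)/n.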

You can generate the calculation formula for rho by simply substituting the population standard deviations for the sample standard deviations in the formula above.


Hypothesis Testing

Is It Significant

While correlations can be used to describe relationships between variables, a common use is to test the null hypothesis that there is no relationship between the two variables. If there is no relationship between the variables, then in the population the correlation (rho) would equal zero. Derivation of the sampling distribution is straightforward. Assume you have a population that has a zero correlation between two variables. If you repeatedly sample from the population using the same sample size, you can create a sampling distribution of sample correlations. The mean of this sampling distribution of sample correlations will be zero. Critical values for the sample correlation (rcv) depend on whether you are conducting a directional or nondirectional test, and on the degrees of freedom, which are given by n-2. As the sample size and degrees of freedom increase, the sample correlation becomes a better and better estimate of the population parameter (rho = 0), so sample correlations far from zero become increasingly unlikely when the null hypothesis is true, and the critical values shrink accordingly. rcv values can be found in the applet shown at this simple rcv link, or in the critical values table in this environment.
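For readers without the applet or table at hand, a critical value of r can be derived from a critical value of t, because testing rho = 0 is equivalent to a t test with n - 2 degrees of freedom. The sketch below assumes that standard identity, which the chapter itself does not spell out here. The function names are hypothetical, and the critical t of 2.447 is the usual two-tailed .05 table value for 6 degrees of freedom.

```python
import math

def t_from_r(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def critical_r(t_critical, df):
    """Convert a critical t value into the matching critical value of r."""
    return t_critical / math.sqrt(t_critical ** 2 + df)

n = 8            # sample size, as in the eight-child example in this chapter
df = n - 2
t_cv = 2.447     # two-tailed .05 critical t for df = 6 (standard t table)
r_cv = critical_r(t_cv, df)
print(round(r_cv, 3))  # 0.707
```

A sample correlation whose absolute value exceeds r_cv would lead to rejecting the null hypothesis in a nondirectional test at the .05 level.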

Is It Important

Beyond finding that two variables significantly correlate, and rejecting the null hypothesis, it is often important to talk about the importance of the found correlation. To discuss "importance" calculate the coefficient of determination. The coefficient of determination is simply found by squaring the correlation coefficient. If IQ and reading scores in children correlate .6 with each other, then the coefficient of determination (r²) would equal .36. Multiplying by 100 produces a percentage. This is the percentage of variance shared by the two variables. In this example, IQ and reading would be said to share 36% of their variance. There are no rules for determining whether the percentage of variance shared is important beyond what researchers know from the literature. For example, if a new intelligence test shared 20% of its variance with reading, it wouldn't be considered too important because other intelligence tests are already known to share more. However, if we could discover a measure that shared 20% of its variance with suicide attempts in adolescents we would think it quite important. Suicide is a major cause of death in adolescents, and no measure has correlated well with attempting suicide.

Hand Calculation

Teacher_1 Teacher_2
1 2
2 0
4 5
1 2
4 4
2 1
3 2
4 0
Two student teachers observe eight children in a classroom situation where children are supposed to be silently working at their desks. Each student teacher independently records the number of times the child speaks without getting permission from the teacher to do so. Is there a significant correlation between the two student teachers? The data are shown in the table above. Using the calculation formulas shown above, solve the problem.

Begin by calculating the standard deviations of both variables using the formula presented in Chapter 5 and reproduced directly below
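A standard computational form of the sample standard deviation is given here. It is assumed, not confirmed, to match the Chapter 5 formula, but it does reproduce the standard deviations quoted in this section.

```latex
s_x = \sqrt{\frac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1}}
```

The same form applies to y, with the sums of y and of y squared in place of those for x.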


In calculating these standard deviations, you will have calculated all of the components for the correlation coefficient except for the sum of the cross products of x and y. For Teacher 1, the sum of x = 21, and the sum of x² = 67. The mean for Teacher 1 = 2.625. With 8 subjects, the standard deviation for Teacher 1 = 1.302. Likewise for Teacher 2, the sum of these values (now referred to with the letter y) = 16. The sum of y² = 54. The standard deviation of y = 1.772. The sum of the cross products = 48. Finally, substituting these values into the correlation equation yields a value of 0.371.
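The hand calculation can be checked with a short Python sketch that builds the same sums and then combines them. The arrangement of the final formula is an assumption, using one standard calculation form: the cross-product sum corrected for the means, divided by (n - 1) times the two sample standard deviations.

```python
import math

teacher_1 = [1, 2, 4, 1, 4, 2, 3, 4]   # x
teacher_2 = [2, 0, 5, 2, 4, 1, 2, 0]   # y
n = len(teacher_1)

sum_x, sum_y = sum(teacher_1), sum(teacher_2)
sum_x2 = sum(v * v for v in teacher_1)
sum_y2 = sum(v * v for v in teacher_2)
sum_xy = sum(a * b for a, b in zip(teacher_1, teacher_2))

# Sample standard deviations computed from the sums.
sd_x = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
sd_y = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))

# Correlation from the cross-product sum and the two standard deviations.
r = (sum_xy - sum_x * sum_y / n) / ((n - 1) * sd_x * sd_y)

print(sum_x, sum_x2, sum_y, sum_y2, sum_xy)  # 21 67 16 54 48
print(round(sd_x, 3), round(r, 3))           # 1.302 0.371
```

With n - 2 = 6 degrees of freedom, a standard table gives a two-tailed .05 critical value of about .707, so a correlation of 0.371 would not be judged significant.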

You will be able to check this work and other small problems using the simple calculation applet directly below. This data set is reproduced in that browser window.


Calculation Applet

Enter the data shown above, which are also reproduced on the simple r Calculation Applet page, into the applet appearing on that page to check your hand calculation. Are the answers the same?

The Statlets applet found in the Statlet's Regression Analysis section can be used to calculate correlations for larger problems as well as conducting a regression analysis.


Simple Lines

Two points __________________. You probably didn't have any difficulty filling in the words "make a line" in the preceding sentence. As everyone learns in elementary school, two points define a line. We can use any two points to completely describe the single line through these two points with a simple formula. What you may not remember is that the line through any two points can be defined with the formula Y = bX + c. In this equation Y is the value on the Y-axis, b is the slope of the line, and c is the constant or intercept. The slope of the line is simply the change the line makes between the two points on the Y-axis divided by the change made between the two points on the X-axis. This change in value is often abbreviated with a symbol that looks like a small triangle (Δ), called delta. Thus, the slope is often defined as delta y divided by delta x.

The constant is defined as the point where the line crosses the Y-axis when X = 0. If you know the slope of the line, and a single point, it is easy to find the constant. You simply take the x value at the point and subtract it from zero. This gives you the change the line must make from this point to get the x value to zero. Now you multiply that value by the slope. This newly calculated value is the amount the line must change on the Y-axis as x moves from the value at the point you are considering to zero. If you add this number to the y value at this point, you will have the constant.
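The slope and constant calculations described above can be sketched as a small Python function. The name line_through and the two example points are illustrative only and do not come from the applet.

```python
def line_through(p1, p2):
    """Return (slope b, constant c) of the line Y = bX + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    b = (y2 - y1) / (x2 - x1)   # delta y divided by delta x
    # Moving from x1 to X = 0 changes x by (0 - x1), so y changes by b times that.
    c = y1 + b * (0 - x1)
    return b, c

b, c = line_through((1, 3), (3, 7))   # illustrative points, not from the text
print(b, c)  # 2.0 1.0
```

Here the line rises 4 units while moving 2 units to the right, so the slope is 2, and stepping back one unit from x = 1 to x = 0 lowers y from 3 to 1, which is the constant.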

The Two Points applet is designed to allow you to draw two points on a grid. The applet gives you the x and y values of the two points, and constructs the line between the two points. Also, the changes along the X and Y-axes are displayed and calculated along with the slope and constant. Use this applet to gain understanding of two point line problems before continuing on to the next section.

Understanding Regression

The applets shown at either the Understanding Regression link or the Understanding Regression with Residuals link are designed for you to investigate concepts surrounding regression lines. The directions, and suggestions for using these applets are found directly on their pages. However, you should be familiar with the formulas and the calculation of simple lines given two points before proceeding with this applet.

A regression line is not drawn through each x,y pair. It is calculated so that it is the single best line representing all the data values that are scattered in a swarm like those shown in the scatterplots above. Regression lines are derived so that the vertical distance between every value and the regression line (this distance is called a residual and is displayed in the Understanding Regression with Residuals link), when squared and summed across all the values, is the smallest possible value. Thus, the values on the Y-axis for the regression line are not the observed values but expected (predicted) values. To differentiate real from expected values, statisticians put what they call a hat (^) above the expected variable. Thus, the simple regression line formula includes a Y-hat as shown below.

The slope of the regression line is given by the following formula.

and the constant is calculated using this formula:
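The standard least-squares forms of the regression line, its slope, and its constant are written out here as an assumption, since the original figures may have used an algebraically equivalent arrangement. The symbols follow the Y = bX + c notation of the Simple Lines section, with r the sample correlation and s the sample standard deviations.

```latex
\hat{Y} = bX + c, \qquad
b = r_{xy}\,\frac{s_y}{s_x}, \qquad
c = \bar{Y} - b\,\bar{X}
```

The constant follows from the fact that a least-squares line always passes through the point defined by the two means.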

With these formulas, you should proceed to the Understanding Regression or the Understanding Regression with Residuals link before working with the Statlet's Regression Analysis applet.


Outliers

Outlier is the name given to a value that is very different from the others in a data set. In working the applets shown at either the Understanding Regression link or the Understanding Regression with Residuals link in the section above, you should have learned that outliers have a large effect on the correlation coefficient and the regression equation.
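The size of this effect can be illustrated with the student teacher data from the Hand Calculation section. The sketch below adds one hypothetical ninth child, invented purely for illustration, for whom one teacher counts 15 interruptions and the other counts none.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation computed from deviation sums."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

teacher_1 = [1, 2, 4, 1, 4, 2, 3, 4]
teacher_2 = [2, 0, 5, 2, 4, 1, 2, 0]
r_original = pearson_r(teacher_1, teacher_2)

# Hypothetical outlier: a ninth child scored 15 by Teacher 1 and 0 by Teacher 2.
r_with_outlier = pearson_r(teacher_1 + [15], teacher_2 + [0])

print(round(r_original, 2), round(r_with_outlier, 2))  # 0.37 -0.26
```

A single extreme pair is enough to change the sign of the correlation in a sample this small.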


Statlet's Regression Analysis

In this section you learn how to use the Statlet's program to solve correlation and regression problems. Before proceeding, read the user manual for the Model/Regression/Simple Regression procedure.

The following scores are from the Quantitative (Quant) and Verbal portions of the Graduate Record Examination. The question is whether there is a significant relationship (correlation) between the Verbal and Quantitative scores. If there is, then the researcher would like to construct the best prediction equation (the equation for the regression line) using Verbal scores to predict Quantitative scores.

Quant Verbal
660 640
490 530
560 520
510 500
670 580
580 540
620 600
610 595
352 370
480 500
522 580
576 495
666 658
680 710
456 500
580 579
390 410
615 520
590 495
580 580

First copy and paste the data above into Statlets. Next, choose the Model/Regression Analysis/Simple Regression procedure using Statlets' menu choices. The input tab should be completed as shown in the following figure.

The Summary tab output tests the correlation coefficient (the t value for the slope does this) and calculates the estimates for the regression equation. As shown in the figure below, the correlation coefficient is equal to 0.8566. The t value for the slope is equal to 7.04444 with a p value less than 0.05, indicating that the slope, and thus the correlation coefficient, is significantly different from zero. The equation for the regression line is given under the Estimate column. The regression equation is: the expected value for Quant = .977935*Verbal + 26.2779.

Along with these values, the Summary tab calculates the square of the correlation coefficient, indicating that 73.38% of the variance is accounted for using Verbal scores to predict Quantitative scores. To produce an approximate 95% confidence interval around the expected score, researchers would add 1.96 standard errors of estimate to, and subtract the same amount from, the predicted values.

The Fitted Model Plot tab output creates a scatter plot of the correlation. As the interpretation of this plot clearly indicates, the output shows the results of fitting a Linear model to describe the relationship between Quant and Verbal. The equation of the fitted model is
Quant = 26.2779 + 0.977935*Verbal
The inner bounds (blue lines) show 95.0% confidence limits for the mean of many new values of Quant at given values of Verbal. The outer bounds (red lines) show 95.0% confidence limits for a single new value of Quant at given values of Verbal.

Clicking the ANOVA tab conducts an analysis of variance testing the significance of the correlation. The F that is produced (49.6242) is simply the square of the t value produced for the slope in the Summary tab output. These two significance tests are looking at the same question (Is the correlation significantly different from zero?), and thus have identical p values, and are really the same statistic. An F with a single degree of freedom in the numerator is simply the square of the equivalent t test.
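The Summary and ANOVA figures can be cross-checked outside Statlets with a short Python sketch. It assumes ordinary least squares with Verbal as the predictor, which is what the reported equation implies. The variable names are illustrative, and the results should land close to the values quoted above.

```python
import math

quant  = [660, 490, 560, 510, 670, 580, 620, 610, 352, 480,
          522, 576, 666, 680, 456, 580, 390, 615, 590, 580]
verbal = [640, 530, 520, 500, 580, 540, 600, 595, 370, 500,
          580, 495, 658, 710, 500, 579, 410, 520, 495, 580]
n = len(quant)

mean_v, mean_q = sum(verbal) / n, sum(quant) / n
ss_v = sum((v - mean_v) ** 2 for v in verbal)
ss_q = sum((q - mean_q) ** 2 for q in quant)
sp = sum((v - mean_v) * (q - mean_q) for v, q in zip(verbal, quant))

slope = sp / ss_v                        # Verbal is the predictor
intercept = mean_q - slope * mean_v
r = sp / math.sqrt(ss_v * ss_q)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t for the slope, df = n - 2
f = t ** 2                               # F with 1 numerator degree of freedom

print(round(slope, 4), round(intercept, 2), round(r, 4))
print(round(t, 2), round(f, 2))
```

Because F is computed here as the square of t, the sketch simply restates the identity described above rather than running a separate analysis of variance.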

Clicking the R-Plots tab gives the output shown below. The interpretation indicates that this plot displays the Studentized residuals versus values of Verbal. Any non-random pattern could indicate that the selected model does not adequately describe the observed data. In addition, any values outside the range of -3 to +3 could well be outliers. In this case, the pattern of residuals is random.


Because there are no Studentized residuals outside the range of -3 to +3, the Residuals tab does not indicate that any of the values are outliers.

Finally, clicking the Predictions tab uses the regression equation and the standard error of estimate to predict values of the dependent variable for predictor values entered through the Options button. For this example, the Options button input is shown below.

Values for Verbal scores were set at 400, 350, 410 and 580.

The Predictions tab output shown below indicates that if a new subject has a Verbal score of 580 their expected Quant score given the regression equation would be 593.48, and 95% of the time their score would be between 489.11 and 697.85.
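The predicted score and its limits for a Verbal score of 580 can be approximated with the sketch below. It assumes the usual prediction-interval formula for a single new observation and a two-tailed .05 critical t of 2.101 for 18 degrees of freedom, taken from a standard table. Whether Statlets computes its limits in exactly this way is an assumption, and the function name is hypothetical.

```python
import math

quant  = [660, 490, 560, 510, 670, 580, 620, 610, 352, 480,
          522, 576, 666, 680, 456, 580, 390, 615, 590, 580]
verbal = [640, 530, 520, 500, 580, 540, 600, 595, 370, 500,
          580, 495, 658, 710, 500, 579, 410, 520, 495, 580]
n = len(quant)

mean_v, mean_q = sum(verbal) / n, sum(quant) / n
ss_v = sum((v - mean_v) ** 2 for v in verbal)
sp = sum((v - mean_v) * (q - mean_q) for v, q in zip(verbal, quant))
slope = sp / ss_v
intercept = mean_q - slope * mean_v

# Standard error of estimate: residual standard deviation with n - 2 df.
residuals = [q - (intercept + slope * v) for v, q in zip(verbal, quant)]
se_estimate = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

def predict_with_limits(new_verbal, t_critical=2.101):
    """Predicted Quant and 95% limits for one new subject (df = 18)."""
    predicted = intercept + slope * new_verbal
    se_new = se_estimate * math.sqrt(
        1 + 1 / n + (new_verbal - mean_v) ** 2 / ss_v)
    return (predicted,
            predicted - t_critical * se_new,
            predicted + t_critical * se_new)

predicted, lower, upper = predict_with_limits(580)
print(round(predicted, 2), round(lower, 2), round(upper, 2))
```

The limits use a t value rather than 1.96 and widen slightly as the Verbal score moves away from its mean, which is why they are a little broader than a simple plus or minus 1.96 standard errors of estimate.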


The Models tab is useful for fitting other models to the data, but these procedures are not discussed in this text.


Computer Problem 31   

Each spring, many of the public schools in Nebraska conduct what is called kindergarten roundup. Parents with children who will be kindergartners the following fall are asked to attend the roundup. The future students are given speech and hearing screening tests as well as academic screening tests. The academic screening tests are used to alert parents as to whether their children are ready to profit from academic experiences. The following data set contains children's screening scores (Screen), and the same children's achievement scores (Achiev) at the end of their actual kindergarten year. Use the Model/Regression Analysis/Simple Regression procedure to determine if Screen is significantly correlated with Achiev. If your instructor requests, submit the project 31 report.
Screen Achiev
1 46
3 52
3 41
3 92
5 59
6.5 48
6.5 66
8 82
9 41
10 60


Computer Problem 32   

Use the Model/Regression Analysis/Simple Regression procedure to determine the regression equation for the following data where First is used to predict Second. Assuming the correlation is significant, what is the predicted value for Second if the subject earned a score of 80 on First?
First Second
72 55
57 46
50 30
71 73
75 76
75 68
62 50
63 53
76 83
85 90
93 82
64 60
40 34
55 60
77 81

If your instructor requests, submit the project 32 report.



History of Correlation and Regression

The following link provides an interesting history for the correlation coefficient.

http://www.amstat.org/publications/jse/v9n3/stanton.html


Questions/Test    

This link allows you to take a computer scored end-of-chapter test. If your instructor requests to see the results of this examination, you can either copy and e-mail or print the feedback you will receive immediately after taking the test.

Report    

Please send a report indicating your understanding of this chapter to your instructor. You will need to know both your and your instructor's e-mail addresses.