So far, we've talked about measuring a single variable in a population. If the variable is quantitative, we draw a picture of the results, and then measure the center and spread.
Now, we're going to talk about measuring two quantitative variables. Specifically, we'll measure two things from each individual, resulting in paired measurements.
We begin as we did before—with a picture. Since the measurements are paired (like coordinates!), we'll plot them in two dimensions, resulting in a scatterplot (the book calls it a scatter diagram). Don't forget to scale and label the axes!
The question that we will ask is, "Do these points make a line?" or, "How close to a line are these points?" This is called correlation, and for now, there are three possible answers: none (not like a line at all), moderate (a real fuzzy line), or high (awfully close to a line).
Example:
[1.] Does the size of a home determine its price? Here are data from a sample of nine homes in Phoenix, Arizona. The size is in hundreds of square feet; the price is in thousands of dollars.
size, x | price, y
26 | 259
27 | 274
33 | 294
29 | 296
29 | 325
34 | 380
30 | 457
40 | 523
22 | 215
Let's make a scatterplot of the data. Since the x variable ranges from 22 to 40, I need the x-axis to range from about 20 to 50. Since the y variable ranges from 215 to 523, I need the y-axis to range from about 200 to 600. Now, start plotting coordinate pairs—first at (26, 259); then at (27, 274); next at (33, 294); etc. Hopefully, you get something like this:

This has moderate correlation—it looks like a fuzzy line.
Extra! Non-linear Relationships: You should also be able to recognize other types of relationships—quadratic and exponential, in particular.
If the data follow a pattern like this, with a flat part on one side, then the variables have an exponential relationship.

If the data follow a pattern like this (or half of this—without a flat part), then the variables have a quadratic relationship.

For a particular set of data, there may be many lines that could reasonably model the data. There is one, however, that is used more than any other—the least squares regression line. This is often referred to as the "line of best fit," although I do not like that term.
With any luck, your calculator will find the equation of the least squares regression line for you. I'm not going to bother trying to explain the formulas; if you are in the position that you must calculate these things by hand, then see me personally for assistance!
When you write the equation, it is important to emphasize that it produces predictions for the y variable based on values of the x variable. Thus, when we write the equation, we'll put a "hat" on y to indicate that it is being predicted (not observed). Look for this in the next example.
Once you have the equation, then you can predict a value of y by plugging in a value for x. You should not plug in a value of x that isn't between the maximum and minimum x-values presented in the data—that's called extrapolation, and it's a bad thing. The opposite, interpolation, is when your x-value is between the minimum and maximum—and that's fine.
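The interpolation check can be written out as a small Python sketch. Python is an assumption here (these notes use a calculator), and the function name is_interpolation is made up for illustration. It only asks whether a new x-value lies between the smallest and largest x-values in the data, using the home sizes from Example 1.

```python
def is_interpolation(x_new, xs):
    """Return True when x_new lies within the observed range of x-values."""
    return min(xs) <= x_new <= max(xs)


# Home sizes from Example 1, in hundreds of square feet.
sizes = [26, 27, 33, 29, 29, 34, 30, 40, 22]

# 35 is between the smallest (22) and largest (40) observed sizes.
print(is_interpolation(35, sizes))

# 45 is larger than every observed size, so predicting there is extrapolation.
print(is_interpolation(45, sizes))
```

A size of 35 is safe to plug into a regression equation built from these homes; a size of 45 is not.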
We had a nice rule of thumb for outliers in the one variable case—alas, we do not for the two variable case. You just have to look at the picture. Points that are very far away in the x-direction (horizontal) will have a greater impact on the equation of the least squares regression line. Fortunately, we won't have to deal with the problems that these points bring.
Example:
[2.] Is there a relationship between body fat and age? Some researchers measured the age and the percentage of body fat for a sample of 18 adults. Here are the data:
age, x | fat, y
23 | 9.5
23 | 27.9
27 | 7.8
27 | 17.8
39 | 31.4
41 | 25.9
45 | 27.4
49 | 25.2
50 | 31.1
53 | 34.7
53 | 42.0
54 | 29.1
56 | 32.5
57 | 30.3
58 | 33.0
58 | 33.8
60 | 41.1
61 | 34.5
Let's take a look with a scatterplot.
This appears to have a moderate linear correlation.
My calculator tells me that the least squares regression equation is ŷ = 3.2209 + 0.5480x. Here's the line graphed on top of the plot:

Now let's predict the percentage of body fat for a 35-year-old. Plug in x = 35!
ŷ = 3.2209 + 0.5480(35) ≈ 22.4. This model predicts that a 35-year-old will have about 22.4% body fat.
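If you want to check the calculator, here is a minimal Python sketch that computes the least squares line for the age and fat data and then predicts at x = 35. Python and the function name least_squares are assumptions made for illustration. The slope and intercept come from the standard textbook formulas (slope = Sxy / Sxx, intercept = ȳ minus slope times x̄), which these notes do not otherwise cover.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Age and percentage of body fat for the 18 adults in Example 2.
ages = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
       34.7, 42.0, 29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]

intercept, slope = least_squares(ages, fat)

# The "hat" value: a prediction for y at x = 35, not an observation.
y_hat = intercept + slope * 35

print(f"y-hat = {intercept:.4f} + {slope:.4f}x")
print(f"predicted body fat at x = 35: {y_hat:.1f}%")
```

Since 35 lies between the youngest (23) and oldest (61) ages in the sample, this prediction is interpolation.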
With a single variable, we started with a graph, then calculated some numbers. Finding the regression line is part of that, but now we need a number to describe the correlation. We'll calculate the Coefficient of Correlation, which measures the strength and direction of a linear relationship. The symbol for this is r.
Again, we don't want to be calculating this by hand, so I won't go over the formulas. What you should know is that r can take values from -1 to 1, inclusive. A value of 1 indicates a perfect line with positive slope, -1 indicates a perfect line with negative slope, and values near zero indicate no linear relationship. The farther r is from zero, the stronger the linear relationship between the variables. Also, r has no units; in particular, it is not a percentage!
If the data do not appear linear when graphed, then do not use r; it just doesn't apply!
Example:
[3.] The correlation for the home data is r = 0.8287 (moderate to strong linear relationship with positive slope). The correlation for the age and fat data is r = 0.7921 (moderate linear relationship with positive slope).
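Both values can be checked with a short Python sketch. Python and the function name correlation are assumptions made for illustration. The formula used is the standard one, r = Sxy / sqrt(Sxx · Syy), which these notes leave to the calculator.

```python
from math import sqrt


def correlation(xs, ys):
    """Return the coefficient of correlation r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Example 1: home size (hundreds of square feet) and price (thousands of dollars).
sizes = [26, 27, 33, 29, 29, 34, 30, 40, 22]
prices = [259, 274, 294, 296, 325, 380, 457, 523, 215]

# Example 2: age and percentage of body fat.
ages = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
       34.7, 42.0, 29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]

r_homes = correlation(sizes, prices)
r_fat = correlation(ages, fat)

print(f"homes: r = {r_homes:.4f}")
print(f"age and fat: r = {r_fat:.4f}")
```

The units cancel in the ratio, which is why r has no units.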
There are no rules for calling certain values of r strong, moderate or weak…a value of 0.01 would probably never be called strong, and a value of 0.99 would probably never be called weak. It all depends on the context…
Page last validated 2010-08-15