Well, now that we can describe the linear relationship between two quantitative variables, it’s time to conduct inference.
There are several parameters in linear regression. The big idea is that there is a real linear relationship, coupled with random variation, that produces the observed data.
$y = \alpha + \beta x + \varepsilon$, where α is the y-intercept of the true linear relationship, β is the true slope of the linear relationship, and ε represents random variation. This can also be written $\mu_y = \alpha + \beta x$, which indicates that for any particular value of x ($x_i$), the linear relationship produces the mean response $\mu_y$ (for that value of x).
There are two parameters of interest: the intercept and the slope. Of these, the slope is more interesting. The slope is about change, and much of mathematics (Calculus in particular) is concerned with change.
Before we conduct inference, there are conditions that must be met (see the Python sketch after this list for one way to check them in software).
[1] There must be evidence of a linear relationship. We check this with our initial scatterplot, r and $r^2$, checking the fit of the line, and the residual plot.
[2] Observations must be independent; for us, the best evidence of this will be a random sample.
[3] The variation about the line (the variation in the set of responses for each particular value of x) must be constant. We check this with the residual plot.
[4] The response variable (for any value of x) must have a normal distribution centered on the line. We check this with a histogram or normal probability plot.
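If you are checking these conditions in software instead of on the calculator, the plots are quick to produce. Here is a minimal Python sketch (assuming numpy and matplotlib are installed; the data values are placeholders, not from any example below):

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; substitute your own explanatory (x) and response (y) values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

b, a = np.polyfit(x, y, 1)      # least squares slope (b) and intercept (a)
resid = y - (a + b * x)         # residual = observed - predicted

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(x, y)           # [1] does the relationship look linear?
axes[0].set_title("Scatterplot")
axes[1].scatter(x, resid)       # [3] is the spread about the line constant?
axes[1].axhline(0, color="gray")
axes[1].set_title("Residual plot")
axes[2].hist(resid)             # [4] do the residuals look roughly normal?
axes[2].set_title("Histogram of residuals")
plt.show()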
The formula is the same as most test statistics that we have used: $t = \frac{\text{statistic} - \text{parameter}}{\text{standard error of the statistic}}$. Specifically, $t = \frac{b - \beta}{SE_b}$, where b is the sample slope of the least squares regression line and $SE_b$ is the standard error of the slope.
$SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$, and $s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$. s is the standard deviation about the least squares regression line. You'll notice that it looks a lot like the standard deviation for the response variable; there are two differences: the denominator ($n - 2$ instead of $n - 1$), and the fact that we're subtracting the predicted response, not the overall mean response. As was the case with sample standard deviation, we'll let the calculator find the value of $SE_b$ for us (or we'll read it from computer output).
You might have guessed that $n - 2$ is the degrees of freedom in this test!
Our null hypothesis must be that there is no relationship; thus, it will always be β = 0. In light of this, we often write this statistic this way: $t = \frac{b}{SE_b}$.
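To make the formulas concrete, here is a short Python sketch (not something you would do on the AP exam, and assuming numpy and scipy are available) that computes b, s, $SE_b$, the t statistic, and the two-sided P-value exactly as defined above:

import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """t test for H0: beta = 0, following the formulas above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # sample slope
    a = y.mean() - b * x.mean()                         # sample intercept
    resid = y - (a + b * x)                             # observed - predicted
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))           # std. dev. about the line
    se_b = s / np.sqrt(sxx)                             # standard error of the slope
    t = b / se_b                                        # test statistic
    p = 2 * stats.t.sf(abs(t), df=n - 2)                # two-sided P-value, n - 2 df
    return b, se_b, t, p

In practice (and on the exam), you will read these values from calculator or computer output rather than compute them by hand.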
[1.] Here are some data on body fat and age. The data show the age of the subject (18 randomly selected subjects) and the percentage of body fat for that subject.
Table 1–Body Fat and Age

Age | %Fat
23 | 9.5
23 | 27.9
27 | 7.8
27 | 17.8
39 | 31.4
41 | 25.9
45 | 27.4
49 | 25.2
50 | 31.1
53 | 34.7
53 | 42.0
54 | 29.1
56 | 32.5
57 | 30.3
58 | 33.0
58 | 33.8
60 | 41.1
61 | 34.5
Let’s see if there is any evidence of a useful linear relationship between these variables—specifically, so that we could predict body fat percentage from age.
H0: β = 0 (no useful relationship between fat and age)
Ha: β ≠ 0 (a useful relationship between fat and age)
This calls for a t test for slope. There are many requirements that must be met in order to conduct this test. We’ve been told that these data represent a random sample. Let’s check for linearity in the relationship.
Figure 1–Scatterplot for Example 1
Looks pretty linear, with a fairly strong positive association. $r = 0.7921$, which confirms our observations of a fairly strong linear relationship; and $r^2 = 0.6274$, which means that 62.74% of the variation in percentage of body fat can be explained by the least squares regression of body fat on age. That's not bad…let's look at the fit.
Figure 2–Least Squares Line for Example 1
Looks pretty good; the gap from 30 to 40 shouldn’t be a problem…on to the residuals.
Figure 3–Residual Plot for Example 1
Looks pretty good—no obvious patterns. How about normality of the residuals?
Figure 4–Histogram of Residuals for Example 1
That'll do; not perfect, but not clearly skewed. We should be able to continue with the test.
I’ll choose a 5% level of significance.
$t = \frac{b - 0}{SE_b}$; the sample slope is $b = 0.5480$, and the standard error of the slope is…uh-oh. How are we going to calculate that? The calculator just gives s; it sure would be a pain to have to use that formula to find $SE_b$.
Well, let's use a trick. There is one other place where we've used $SE_b$: in the test statistic! Since $t = \frac{b}{SE_b}$, and since we can get the calculator to give us t and b (from the test), we can do this: $SE_b = \frac{b}{t}$.
THIS IS A TRICK! NEVER write this formula where someone (e.g., an AP grader) might see it! If you were really doing this out in the world, you’d be using software, and you’d have everything you needed.
So (using t and b from the calculator's test output) we get $SE_b = \frac{0.5480}{5.1910} = 0.1056$. Plug in!
$t = \frac{b}{SE_b} = \frac{0.5480}{0.1056} \approx 5.1910$. With 16 degrees of freedom, $2P(b > 0.5480) = 2P(t > 5.1910) \approx 0$.
If there is no linear relationship between fat and age (that is, if the true slope is zero), then I can expect to get a sample slope of 0.5480 or greater (or −0.5480 or lower) in almost no samples. This happens too rarely to attribute to chance at the 5% level; it is significant, and I reject the null hypothesis.
It appears that there is a useful linear relationship between age and % body fat.
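If you have software handy, you can skip the trick entirely; for instance, Python's scipy reports the slope's standard error directly. A quick check of Example 1 (a sketch, assuming scipy is installed):

from scipy import stats

age = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1, 34.7, 42.0, 29.1,
       32.5, 30.3, 33.0, 33.8, 41.1, 34.5]

res = stats.linregress(age, fat)              # least squares regression of fat on age
t = res.slope / res.stderr                    # t statistic for H0: beta = 0
print(res.slope, res.stderr, t, res.pvalue)   # about 0.548, 0.106, 5.19, 0.0001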
What conditions must be met for a confidence interval for the slope? The same! Hoo-ray!
The same basic formula still applies: statistic ± (critical value)(measure of variation). In particular, $b \pm t^* \cdot SE_b$, where b is the slope of the least squares regression line, $t^*$ is the upper critical value from the $t(n-2)$ distribution ($df = n - 2$), and $SE_b$ is the standard error of the slope.
[2.] Here are some data from a random sample of baseball teams. The data show the team’s batting average and the total number of runs scored for the season.
Table 2–Batting Average and Runs Scored
Avg. | Runs
0.294 | 968
0.278 | 938
0.278 | 925
0.270 | 887
0.274 | 825
0.271 | 810
0.263 | 807
0.257 | 798
0.267 | 793
0.265 | 792
0.256 | 764
0.254 | 752
0.246 | 740
0.266 | 738
0.262 | 731
0.251 | 708
Let’s estimate the slope of the true linear relationship (where batting average helps to predict runs) between these variables with 99% confidence.
Well, before we begin plugging in numbers, we should check the requirements.
First up: a scatterplot of the data.
Figure 5–Scatterplot for Example 2
Looks pretty linear, with strong positive association. There seems to be a gap between 0.28 and 0.29.
$r = 0.8655$, which supports our earlier observations. $r^2 = 0.7491$, which indicates that 74.91% of the variation in Total Runs can be explained by the least squares regression of Total Runs on Team Average. Let's check the fit of the regression line.
Figure 6–Least Squares Line for Example 2
The fit seems acceptable. Let’s check the residuals.
Figure 7–Residual Plot for Example 2
Figure 8–Histogram of Residuals for Example 2
Hmmm—I’ll look at the Normal Probability Plot, too.
Figure 9–Normal Probability Plot for Example 2
Well, that’s not too bad. I think that we can safely continue.
99% confidence and 14 degrees of freedom gives a t* of 2.9768. The sample slope is 5709.2, and the standard error of the slope—using our trick again—is 883.1.
$b \pm t^* \cdot SE_b = 5709.2 \pm (2.9768)(883.1)$.
I am 99% confident that the true slope of the relationship between total runs and batting average is between 3080.393 and 8338.093 (the units are really weird: runs per unit of batting average!).
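Again, software sidesteps the trick. Here is a sketch of the same interval in Python (assuming scipy is installed):

from scipy import stats

avg = [0.294, 0.278, 0.278, 0.270, 0.274, 0.271, 0.263, 0.257,
       0.267, 0.265, 0.256, 0.254, 0.246, 0.266, 0.262, 0.251]
runs = [968, 938, 925, 887, 825, 810, 807, 798,
        793, 792, 764, 752, 740, 738, 731, 708]

res = stats.linregress(avg, runs)              # regression of runs on average
t_star = stats.t.ppf(0.995, df=len(avg) - 2)   # 99% confidence: 0.5% in each tail
margin = t_star * res.stderr                   # (critical value)(standard error)
print(res.slope - margin, res.slope + margin)  # about (3080, 8338)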
Often, you will be required to read standard computer output in order to obtain values for these items. Fortunately, almost all computer output looks alike. Here are some data, followed by several examples of computer output. Since you know how to get the calculator to give you what you need to know, you should be able to determine where those items are located in these examples…
The data relate the mass of a plant (g) with the quantity of volatile compounds (hundreds of nanograms) emitted by each plant.
mass | volatiles
57 | 8.0
85 | 22.0
57 | 10.5
65 | 22.5
52 | 12.0
67 | 11.5
62 | 7.5
80 | 13.0
77 | 16.5
53 | 21.0
68 | 12.0
Here is the output from a freeware program called R:
Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   3.5237     10.2995    0.342      0.74
mass          0.1628      0.1547    1.053      0.32
Residual standard error: 5.418 on 9 degrees of freedom
Multiple R-Squared: 0.1096, Adjusted R-squared: 0.01067
F-statistic: 1.108 on 1 and 9 DF, p-value: 0.32
Here is the output from Microsoft Excel:
SUMMARY OUTPUT
Regression Statistics
Multiple R         0.331066792
R Square           0.109605221
Adjusted R Square  0.010672467
Standard Error     5.417706998
Observations       11

ANOVA
            df  SS           MS           F            Significance F
Regression   1  32.51787616  32.51787616  1.107875977  0.319982024
Residual     9  264.163942   29.35154911
Total       10  296.6818182

           Coefficients  Standard Error  t Stat       P-value
Intercept  3.523687722   10.29948909     0.342122575  0.740112225
mass       0.162848458   0.154717015     1.052556876  0.319982024
And now, output from Data Desk:
Dependent variable is: volatiles
No Selector
R squared = 11.0% R squared (adjusted) = 1.1%
s = 5.418 with 11 - 2 = 9 degrees of freedom
Source      Sum of Squares  df  Mean Square  F-ratio
Regression  32.5179          1  32.5179      1.11
Residual    264.164          9  29.3515

Variable  Coefficient  s.e. of Coeff  t-ratio  prob
Constant  3.52369      10.3           0.342    0.7401
mass      0.162848     0.1547         1.05     0.3200
And finally, output from Statcrunch.com (which looks a lot like Minitab):
Simple linear regression results:
Dependent Variable: volatiles
Independent Variable: mass
Sample size: 11
Correlation coefficient: 0.3311
Estimate of sigma: 5.417707
Parameter  Estimate    Std. Err.   DF  T-Stat      P-Value
Intercept  3.5236878   10.299489    9  0.34212258  0.7401
mass       0.16284846  0.15471701   9  1.0525569   0.32
You should see some similarities there…
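If you use Python, the statsmodels package prints a table much like these. Here is a sketch for the plant data above (assuming statsmodels is installed):

import statsmodels.api as sm

mass = [57, 85, 57, 65, 52, 67, 62, 80, 77, 53, 68]
volatiles = [8.0, 22.0, 10.5, 22.5, 12.0, 11.5, 7.5, 13.0, 16.5, 21.0, 12.0]

X = sm.add_constant(mass)           # prepend a column of 1s for the intercept
model = sm.OLS(volatiles, X).fit()  # ordinary least squares fit
print(model.summary())              # coefficients, std. errors, t, P, R-squared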