The Big Idea: Having now looked at a picture of the data, it is time to crunch some numbers. The pictures are subjective; different people will interpret them differently. Numbers are much more—but not completely—objective. With numbers, we will now measure two important features of a data set—center and spread.
Mode: The mode of a data set is the datum (value) that occurs most often. There is no special symbol for this measure.
The problem with this measure is that it doesn't always exist! If every value occurs equally often (for instance, if all the data are different), then there is no mode. If several data are tied for occurring most often, then there are many modes! It is for these reasons that the mode is not terribly useful.
Example:
[1.] Here is the number of tornadoes in the U.S. for each month of 1998:
47 | 72 | 72 | 182 | 308 | 373 | 82 | 60 | 104 | 86 | 25 | 6
Let's find the mode.
Since 72 occurs twice, and every other datum occurs exactly once, the mode is 72.
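The counting in this example can be sketched in Python. This is only an illustration: the helper name `modes` is made up here, and treating "every value occurs exactly once" as "no mode" is an assumption about how to read the discussion above.

```python
from collections import Counter

tornadoes = [47, 72, 72, 182, 308, 373, 82, 60, 104, 86, 25, 6]

def modes(data):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumption: no value repeats, so we report no mode
    return sorted(value for value, count in counts.items() if count == highest)

print(modes(tornadoes))  # [72]
```

Returning a list rather than a single number leaves room for the "many modes" case.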
Median: A value below which 50% of the data lie. Also: the halfway point in the data; the value that splits the data set in half. There are a few symbols for this measure: $M$ and $\tilde{x}$ seem to be the most popular. Some people don't use a symbol.
This always exists, but not everyone calculates it the same way. We'll do it like the book does it—list the data in order, smallest to largest. Find the position that splits the data into two equal parts. If that position lies on a number, then that's your median. If the position lies between two data, then find the number halfway between them, and take that as the median.
Example:
[2.] Let's find the median for the tornado data.
First, order the numbers.
6, 25, 47, 60, 72, 72, 82, 86, 104, 182, 308, 373
Now—find the middle (marked with *)!
6, 25, 47, 60, 72, 72,* 82, 86, 104, 182, 308, 373
Since the middle doesn't lie on a number, we take the two closest—72 and 82—and find the number halfway between them. That would be 77—and that's the median.
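Here is a small Python sketch of the same procedure (sort, find the middle, average the two middle values when the middle falls between them). The function name `median` is just a label chosen for this illustration.

```python
tornadoes = [47, 72, 72, 182, 308, 373, 82, 60, 104, 86, 25, 6]

def median(data):
    """Middle value of the sorted data; halfway between the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # the middle position lies on a number
    # the middle position lies between two data: take the halfway point
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median(tornadoes))  # 77.0
```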
Mean: Add the data, and divide by the number of data. The standard symbol for this is $\bar{x}$.
Most people probably call this "the average," though that would be technically incorrect. When we say "the mean," what we technically want is "the arithmetic mean." Naturally, you know how to do this. Perhaps you can even get your calculator to do it for you…
The formula for the mean is usually given like this: $\bar{x} = \frac{\sum x}{n}$. The big thing is a capital Greek letter sigma ($\Sigma$), which stands for "sum" (add things up). The small $n$ stands for the number of data in the set. So, this formula says "add up all the data, and divide by the number of data." More formulas like this are coming up soon!
Example:
[3.] The mean of the tornado data is $\bar{x} = \frac{47 + 72 + \cdots + 6}{12} = \frac{1417}{12}$, which comes out to 118.0833.
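If you would rather let a computer do the adding, the calculation might look like this in Python. It is only a sketch of "add up the data and divide by the number of data"; the variable names are arbitrary.

```python
tornadoes = [47, 72, 72, 182, 308, 373, 82, 60, 104, 86, 25, 6]

total = sum(tornadoes)   # add up all the data
n = len(tornadoes)       # the number of data
mean = total / n

print(total, n)          # 1417 12
print(round(mean, 4))    # 118.0833
```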
We should be careful here. When we calculate these measures, we typically are calculating them on samples—so we really ought to call them the "sample mode," "sample median," and "sample mean." In the case of the mean, the distinction is very important, because later on, we will use (never calculate!) the "population mean," or the mean of all the data (not just a sample).
Range: The difference between the largest datum and the smallest datum. Take the highest number and subtract the lowest number. There is no special symbol for this.
Example:
[4.] The range of the tornado data is 373 - 6 = 367.
Standard Deviation: This is sort of a mean deviation—but not exactly.
First of all, a deviation is the difference between a datum and the mean. This is, actually, another measure of variation. However, we want to look at all the deviations. I'm about to go into the reason that the formula looks like it does—if you don't care, don't watch!
Perhaps you might think to take the average of the deviations—but this won't work. Taking the mean of a bunch of differences from the mean will always give you zero—the problem is that some deviations are negative, and some are positive.
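You can check this claim on the tornado data. The sketch below computes each deviation from the mean and adds them up; because of floating-point rounding, the total may print as a tiny number rather than exactly zero.

```python
tornadoes = [47, 72, 72, 182, 308, 373, 82, 60, 104, 86, 25, 6]

mean = sum(tornadoes) / len(tornadoes)

# One deviation per datum: negative below the mean, positive above it.
deviations = [x - mean for x in tornadoes]

# The negatives and positives cancel, so the total is zero up to rounding error.
print(sum(deviations))
```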
There are two ways to take care of this. You'd probably think to take the absolute value—and this will work; but it won't give you the standard deviation. The reason lies in calculus—we won't go there.
The second way is to square everything, and then take the square root. This may seem a roundabout way to do it, but again, the reasons lie in calculus.
So, the mean of the deviations, with something to handle the negatives, would look like this: $\sqrt{\frac{\sum (x - \bar{x})^2}{n}}$. This is really close; there's just one (complicated) hitch. This measure is biased: it doesn't do what statisticians need it to do. I won't try to explain why (at least, not much), but in order for this measure (well, a measure related to this, actually) to be unbiased, we have to change the denominator from $n$ to $n - 1$.
Finally, then, the formula for the sample standard deviation is $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$.
You do not want to calculate this by hand.
Example:
[5.] The standard deviation of the tornado data is 113.445.
Variance: The variance of a data set is the square of the standard deviation. Easy!
Example:
[6.] The variance of the tornado data is 12869.72.
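Since nobody wants to do this by hand, here is a rough Python sketch of both measures. It assumes the usual sample formula, with $n - 1$ in the denominator; the function names are made up for this illustration.

```python
import math

tornadoes = [47, 72, 72, 182, 308, 373, 82, 60, 104, 86, 25, 6]

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed sample formula)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_std_dev(data):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(data))

print(round(sample_variance(tornadoes), 2))  # 12869.72
print(round(sample_std_dev(tornadoes), 1))   # 113.4
```

Python's standard `statistics` module offers `statistics.variance` and `statistics.stdev`, which use the same $n - 1$ denominator.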
Again, we should be careful. We are looking at the "sample range", "sample standard deviation," and "sample variance." There will come a time when the difference between sample measures and population measures will be very important.
Center and Spread are important concepts in statistics—together with Shape, they form the Big Three: the three things you should always do to describe a data set. In particular, statisticians almost always want to know the mean, the standard deviation, and the shape. Unfortunately, there are times when the mean and standard deviation aren't as useful. For those times, we need something else.
Percentile: The kth Percentile of a data set is a value below which k% of the data lie. You find a percentile much like you found the median: sort the data into ascending order, then find the point that divides the set into the desired regions.
For example—let's say that I want to find the 90th percentile of the tornado data. To do that, I first list the numbers in order. Then, I find the point where 90% of the data are below that point, and 10% are above. This is a bit harder than the median—for that, you only had to find the halfway point! Finding a 90% point is more difficult.
Let's try anyway. There are 12 data in the set, so each datum is $\frac{1}{12}$, or about 8.3%, of the set. So, after the second datum, we've looked at about 16.7% of the set; after the third, we've looked at 25%. How far have we got to go to get to 90%?
Six would be 50%; nine would be 75%; ten would be about 83.3%; eleven would be about 91.7%. So the 90th percentile is somewhere between the tenth and eleventh data.
Exactly what to do now depends on who you ask—so we won't bother.
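The counting argument can still be sketched in Python, even without settling on a final value. The code below only finds which two positions the 90% point falls between; it does not pick a percentile, and it is not meant as a general-purpose percentile routine (a k that lands exactly on a position would need separate handling).

```python
tornadoes = [47, 72, 72, 182, 308, 373, 82, 60, 104, 86, 25, 6]

ordered = sorted(tornadoes)
n = len(ordered)
k = 90  # the percentile we are after

# Percent of the data set covered after the 1st, 2nd, ..., nth datum.
covered = [100 * position / n for position in range(1, n + 1)]

# Last position that is still short of k%, and the position right after it.
lower = max(p for p in range(1, n + 1) if covered[p - 1] < k)
upper = lower + 1

print(lower, upper)                            # 10 11
print(ordered[lower - 1], ordered[upper - 1])  # 182 308
```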
Decile: The kth Decile of a data set is a value, below which (10k)% of the data lie. The first decile is the 10th percentile; the second decile is the 20th percentile; etc.
Quartile: The kth Quartile of a data set is a value, below which (25k)% of the data lie. The first quartile is the 25th percentile; the second quartile is the 50th percentile; etc. These are easier to calculate than percentiles, so we will.
Note that the 50th percentile is the same as the median. Also note that the 25th percentile will be the median of the lower half of the data set. Cutting things in half is easy; so finding the first (and third) quartiles will be easy, since they only involve cutting things in half.
In particular, the first quartile, Q1, is the median of the first half of the data. To find it, first find the median. Now look at all the numbers to the left of the median; the middle position of those numbers is the first quartile. Similarly, the third quartile, Q3, is the median of the upper half of the data.
Example:
[7.] For the tornado data, we've already found the median. Here are the numbers that are below (to the left of) the median.
6, 25, 47, 60, 72, 72
Locate the middle of these numbers!
6, 25, 47,* 60, 72, 72
The middle position isn't on a number, so we find the number halfway between 47 and 60—which is 53.5. Thus, Q1 = 53.5.
Similarly, Q3 = 143.
Interquartile Range (IQR): The difference between Q3 and Q1 (Q3 - Q1). This is another measure of spread.
Example:
[8.] The IQR of the tornado data is 143 - 53.5 = 89.5.
Five Number Summary: For any data set, the Five Number Summary includes the Minimum, Q1, Median, Q3, and the Maximum. This is a convenient summary of a data set.
Example:
[9.] For the tornado data, the five number summary is 6, 53.5, 77, 143, 373.
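The quartile, IQR, and five number summary calculations can be sketched together in Python by reusing the "median of each half" idea. The function names are invented for this illustration. One assumption to flag: when the number of data is odd, this sketch keeps the median in both halves; other sources drop it, and the two choices can give slightly different quartiles. With 12 data, the question doesn't arise.

```python
tornadoes = [47, 72, 72, 182, 308, 373, 82, 60, 104, 86, 25, 6]

def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    n = len(ordered)
    lower_half = ordered[: (n + 1) // 2]  # assumption: odd n keeps the median here...
    upper_half = ordered[n // 2 :]        # ...and here as well
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

minimum, q1, med, q3, maximum = five_number_summary(tornadoes)
print(minimum, q1, med, q3, maximum)  # 6 53.5 77.0 143.0 373
print(q3 - q1)                        # 89.5
```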
Boxplot: A graphic display of the Five Number Summary. Also known as a Box-and-Whiskers Plot.
Take an axis (vertical or horizontal; it doesn't matter). Make marks along the axis at each value from the five number summary. Connect Q1 and Q3 to make a box. Connect the box to the other marks.
Example:
[10.] For the tornado data—here are the marks at the five numbers.

Now, I've connected Q1 and Q3.

Finally, the complete boxplot:

Extra! Side-by-Side Boxplots:
Just as you can make two histograms, or two stemplots, you can also make two boxplots.
Example:
[11.] Here are some findings about the length of stays in the hospital for a sample of men:
2 | 3 | 2 | 24 | 2 | 7 | 7 | 10 | 18 | 4
10 | 19 | 1 | 11 | 1 | 1 | 3 | 3 | 9 | 4
13 | 3 | 8 | 14 | 17 | 23 | 13 | 6 | 6 | 12
5 | 1 | 1 | 6 | 2 | 9 | 1 | 15 | 12
Here are the lengths of stays for a sample of women:
14 | 12 | 21 | 4 | 18 | 4 | 7 | 3 | 6
7 | 1 | 4 | 4 | 12 | 9 | 7 | 6 | 2
15 | 3 | 1 | 3 | 5 | 10 | 2 | 5 | 14
1 | 7 | 5 | 5 | 1 | 7 | 15 | 9
Let's compare them with side-by-side boxplots. First, the five number summaries.
| | Min | Q1 | Median | Q3 | Max |
| Male | 1 | 2.5 | 6 | 12 | 24 |
| Female | 1 | 3.5 | 6 | 9.5 | 21 |
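These summaries can be reproduced with a short Python sketch. Both samples have an odd number of data, so the convention matters here: the code below assumes the median is kept in both halves when finding Q1 and Q3, which is the choice that matches the values in the table. The function names are made up for this illustration.

```python
men = [2, 3, 2, 24, 2, 7, 7, 10, 18, 4, 10, 19, 1, 11, 1, 1, 3, 3, 9, 4,
       13, 3, 8, 14, 17, 23, 13, 6, 6, 12, 5, 1, 1, 6, 2, 9, 1, 15, 12]
women = [14, 12, 21, 4, 18, 4, 7, 3, 6, 7, 1, 4, 4, 12, 9, 7, 6, 2,
         15, 3, 1, 3, 5, 10, 2, 5, 14, 1, 7, 5, 5, 1, 7, 15, 9]

def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum), keeping the median in both halves."""
    ordered = sorted(data)
    n = len(ordered)
    return (ordered[0], median(ordered[: (n + 1) // 2]), median(ordered),
            median(ordered[n // 2 :]), ordered[-1])

for label, sample in (("Male", men), ("Female", women)):
    print(label, len(sample), five_number_summary(sample))
```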
Now, the plots.

Outliers: An outlier is a datum that does not seem to belong with the other data. "One of these things is not like the other; one of these things just doesn't belong." That's it—there is no numeric definition of what constitutes an outlier. An outlier is some value that just doesn't seem to fit—it isn't like the others.
That being said, there are some guidelines as to what makes a datum an outlier. The most popular is from the same man who invented boxplots.
Tukey's Rule of Thumb for Outliers: Any datum lower than Q1 - 1.5·IQR, or higher than Q3 + 1.5·IQR, is probably an outlier.
Example:
[12.] With the Hospital stay data above, let's check for outliers.
For the males: IQR = 12 - 2.5 = 9.5; 1.5·IQR = 14.25. So anything lower than 2.5 - 14.25 = -11.75, or higher than 12 + 14.25 = 26.25, is probably an outlier. There aren't any data in those regions, so the men have no outliers.
For the women: IQR = 9.5 - 3.5 = 6; 1.5·IQR = 9. So anything lower than 3.5 - 9 = -5.5, or higher than 9.5 + 9 = 18.5, is probably an outlier. There is one value that falls in this area: 21. So it looks like the women have an outlier.
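Tukey's rule of thumb is easy to sketch in Python. The code below takes Q1 and Q3 straight from the five number summaries above rather than recomputing them; the function names `tukey_fences` and `outliers` are made up for this illustration.

```python
men = [2, 3, 2, 24, 2, 7, 7, 10, 18, 4, 10, 19, 1, 11, 1, 1, 3, 3, 9, 4,
       13, 3, 8, 14, 17, 23, 13, 6, 6, 12, 5, 1, 1, 6, 2, 9, 1, 15, 12]
women = [14, 12, 21, 4, 18, 4, 7, 3, 6, 7, 1, 4, 4, 12, 9, 7, 6, 2,
         15, 3, 1, 3, 5, 10, 2, 5, 14, 1, 7, 5, 5, 1, 7, 15, 9]

def tukey_fences(q1, q3):
    """Return the cutoffs (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(data, q1, q3):
    """Data lying below the lower cutoff or above the upper cutoff."""
    low, high = tukey_fences(q1, q3)
    return sorted(x for x in data if x < low or x > high)

print(tukey_fences(2.5, 12), outliers(men, 2.5, 12))       # (-11.75, 26.25) []
print(tukey_fences(3.5, 9.5), outliers(women, 3.5, 9.5))   # (-5.5, 18.5) [21]
```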
Extra! Modified Boxplots: Sometimes, people will create boxplots using this rule of thumb for outliers. Rather than extending the "whiskers" all the way out to the minimum and maximum, they'll run them out to the smallest/largest non-outlier. Then, they mark the outliers separately. Here are the hospital stay boxplots again, with outliers marked in this fashion.

OK, I said that this was another way to look at the Big Three—what I didn't say is why.
If the shape of the data is skewed, or if there are outliers, then the mean and standard deviation aren't the best choices for measuring center and spread. In that case, the median and the interquartile range do a better job.
Example:
[13.] Here are the graduation rates—percentage of adult residents holding a high school diploma—for all 50 states and DC.
90 | 79 | 80 | 85 | 88 | 83 | 86 | 87 | 79 | 87 | 90
84 | 91 | 85 | 86 | 89 | 89 | 85 | 80 | 86 | 83
85 | 82 | 83 | 85 | 78 | 88 | 86 | 82 | 77 | 92
84 | 77 | 84 | 89 | 79 | 89 | 84 | 81 | 78 | 76
82 | 80 | 84 | 77 | 87 | 84 | 81 | 84 | 89 | 88
Here are all the measures for these data:
Mode: 84 (occurs 7 times)
Median: 84
Mean: 84.06
Range: 16
Standard Deviation: 4.1105
Variance: 16.8965
Minimum: 76
Q1 : 81
Q3 : 87
Maximum: 92
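Several of these measures can be double-checked with Python's standard `statistics` module. This is just a sketch: it covers the mode, median, mean, range, standard deviation, and variance, and leaves the quartiles aside because software packages differ in how they compute them.

```python
import statistics

rates = [90, 79, 80, 85, 88, 83, 86, 87, 79, 87, 90,
         84, 91, 85, 86, 89, 89, 85, 80, 86, 83,
         85, 82, 83, 85, 78, 88, 86, 82, 77, 92,
         84, 77, 84, 89, 79, 89, 84, 81, 78, 76,
         82, 80, 84, 77, 87, 84, 81, 84, 89, 88]

print(len(rates))                             # 51
print(statistics.multimode(rates))            # [84]
print(rates.count(84))                        # 7
print(statistics.median(rates))               # 84
print(round(statistics.mean(rates), 2))       # 84.06
print(max(rates) - min(rates))                # 16
print(round(statistics.stdev(rates), 4))      # 4.1105 (sample, n - 1)
print(round(statistics.variance(rates), 4))   # 16.8965 (sample, n - 1)
```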
And here's the boxplot:

Page last validated 2010-08-15