SUMMARY STATISTICS


INTRODUCTION

Once a data set has been collected, we will usually have too many numbers to make sense of them all. To help understand the data we use descriptive statistics to summarize and simplify the data set. The values we use to summarize and describe data sets are also the values that statistical tests work with when trying to decide if populations are different from one another (i.e., all tests are based on a certain statistic, not the individual values).

Summary values for data sets that are samples are called statistics.

Summary values for data sets that are populations are called parameters.

Typically, since our data sets are usually samples, we use sample statistics to estimate the values of the population parameters. Some of these are intuitive and easy to calculate by hand or with a calculator while others generally require the use of a computer. Although technically we could talk about "descriptive statistics" or "descriptive parameters", since most real-world work is done with samples, the terms "summary statistics" or "descriptive statistics" are used when describing the calculations.

Descriptive statistics come in three types: location, variation, and shape.

LOCATION STATISTICS

Location statistics are used to identify where on the number line the values are located. You can think of these as describing the typical value or average value. If we are interested in whether a certain set of values is larger or smaller than another, then we would be comparing a specific location statistic in the two groups.

The most commonly used location statistics for a data set are the mid-range, mode, median, and mean.

Mid-range

The mid-range is the point mid-way between the largest and smallest value in the data set. You calculate it by adding the minimum and maximum values and then dividing by two.

$$ \text{mid-range} = {{minimum + maximum} \over {2}} $$

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9 }, the mid-range would be (2+9)/2 = 11/2 = 5.5.

This value is not very precise since it only uses the largest and smallest values and is therefore prone to being thrown off by outliers. In a more technical sense, it only uses a very small portion of the information about the data to estimate the location and is consequently inaccurate. Although the mid-range is easy to calculate, it's rarely used in practice.
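To make the calculation concrete, here is a minimal Python sketch (an editorial addition, not part of the original page):

    # Mid-range: halfway between the smallest and largest value.
    def midrange(values):
        return (min(values) + max(values)) / 2

    print(midrange([2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9]))  # 5.5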

Mode

This is the value that occurs most frequently in the data set. A data set can have more than one mode or it can have none.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9 }, the mode would be 5.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 5, 5, 6, 7, 7, 7, 9 }, the modes would be 5 and 7.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 6, 7, 8, 9 }, the mode is undefined since no value is more common than the others.

This doesn't really tell us where the data is centered unless the data has a frequency peak at the center (which is true for some, but not all, data sets). For certain purposes the mode is useful, however - for example, it tells us which value we are most likely to get if we pick a random value from the data set. There are few statistical tests based on the mode, though, so it is mainly used for description rather than inference.
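The examples above can be reproduced with a short Python sketch (an editorial addition; it follows the convention that the mode is undefined when no value repeats):

    from collections import Counter

    # Mode(s): the most frequent value(s); there may be one, several, or none.
    def modes(values):
        counts = Counter(values)
        top = max(counts.values())
        if top == 1:
            return []  # no value repeats, so the mode is undefined
        return [v for v, c in counts.items() if c == top]

    print(modes([2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9]))  # [5]
    print(modes([2, 3, 4, 5, 5, 5, 6, 7, 7, 7, 9]))  # [5, 7]
    print(modes([2, 3, 4, 5, 6, 7, 8, 9]))           # []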

Median

This is the value that is in the middle of the sample data (50% below, 50% above). To calculate the median, you first order the data set from smallest to largest. If there are an odd number of values in the data set, then the median is usually defined as the one value in the middle. If there are an even number of values, then the median is usually defined as the mean of the middle two observations.

For example, if we have a data set consisting of the values { 2, 3, 5, 8, 9 }, the median would be 5.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 8, 9 }, the median would be (4+5)/2 = 4.5.

The technique described above is the simplest method, but it doesn't take the shape of the distribution into account, so there are alternative methods you may sometimes see used.

The median usually gives us a good idea of where the data is centered unless the data has a strongly asymmetric shape. It is also not prone to being thrown off by outliers since it focuses on the center of the distribution.
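Here is the simple method as a Python sketch (an editorial addition, not part of the original page):

    # Median: sort, then take the middle value (odd n)
    # or the mean of the two middle values (even n).
    def median(values):
        s = sorted(values)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

    print(median([2, 3, 5, 8, 9]))     # 5
    print(median([2, 3, 4, 5, 8, 9]))  # 4.5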

The idea of dividing the data set into sections (e.g., top and bottom halves) in order to summarize it is extended by describing a data set with quartiles, quintiles or percentiles.

Mean

The mean, more precisely called the arithmetic mean, is the arithmetic average of the values. It is calculated by adding up all the values and dividing by the sample size.

$$ \text{mean} = {{\sum_{i=1}^{n}x_i} \over {n}} $$

For example, if we have a data set consisting of the values { 2, 3, 5, 8, 9 }, the mean would be (2+3+5+8+9)/5 = 27/5 = 5.4.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 8, 9 }, the mean would be (2+3+4+5+8+9)/6 = 31/6 = 5.167.

This is what most people learn as "the average", but technically mid-ranges, modes, and medians are averages too. There is also a value called the "geometric mean" that is used in some specialized circumstances, so it's good to be precise.

The "average"

As noted above, the term average is imprecise because it is often used for either the mean or the median, and these can be quite different if the distribution is asymmetric. If you are ever presented with a value called "the average" you should determine whether it refers to the mean or the median. Similarly, when describing the average you should be precise in your vocabulary and specify whether you are presenting the mean or the median. Not defining the average and then using the one that best supports your argument is a common way to manipulate an audience.

For example, if you want to make it look like people are doing well economically you would want to present the arithmetic mean as the average, since it is pulled upward by millionaires and billionaires and will therefore be a very big number. But if you want to describe the income that most people personally experience, then the median may be more representative of the average income for normal people.

Beware of this.
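A small Python sketch makes the manipulation concrete (the income figures are made up purely for illustration):

    # Hypothetical incomes: one extreme earner drags the mean far above
    # the median, so "the average" depends on which statistic you report.
    incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 5_000_000]
    mean = sum(incomes) / len(incomes)  # 866,666.67 - sounds prosperous
    s = sorted(incomes)
    median = (s[2] + s[3]) / 2          # 42,500 - closer to most people's experience
    print(mean, median)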

VARIATION STATISTICS

Variation statistics are used to describe how spread out the values are. Are the values clumped in one tight area or are they spread out on the number line? If we randomly choose values will they tend to be similar or highly variable? If we are interested in whether a certain set of values is more variable than another, then we would be comparing a specific variation statistic in the two groups.

The most commonly used variation statistics for a data set are the range, inter-quartile range (IQR), sum of squares, variance, standard deviation, and coefficient of variation.

Range

The range is the distance between the smallest and largest values on the number line. This is calculated by subtracting the minimum value from the maximum value in the data set.

$$ \text{Range} = {maximum - minimum} $$

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9 }, the range would be 9-2 = 7.

This value is not very precise since it only uses the largest and smallest values and is therefore prone to being thrown off by outliers. In a more technical sense, it only uses a very small portion of the information about the data to estimate the variation and is consequently inaccurate. Although the range is easy to calculate, it's rarely used in practice.

Inter-quartile range

The inter-quartile range is the distance between the first and third quartiles (see the Quartiles page for an explanation of what these are) and describes the width of the region containing the middle 50% of the data values. This statistic represents the variation in the center of the sample quite well and is not influenced by outliers.

You calculate it by subtracting the value of Q1 from the value of Q3.

$$ \text{Interquartile range} = {Q3 - Q1} $$

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 6, 7, 8, 9 }, the IQR would be 7.5 - 3.5 = 4.

This value is fairly reliable and useful to describe data sets, but there aren't really any statistical tests based on it.
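The range and IQR can be computed with a short Python sketch (an editorial addition; it uses the quartile convention from the example above, where the halves exclude the middle value when n is odd):

    # Range and inter-quartile range.
    def quartiles(values):
        s = sorted(values)
        n = len(s)
        half = n // 2
        lower, upper = s[:half], s[n - half:]

        def med(x):
            m = len(x) // 2
            return x[m] if len(x) % 2 == 1 else (x[m - 1] + x[m]) / 2

        return med(lower), med(upper)

    data = [2, 3, 4, 5, 6, 7, 8, 9]
    q1, q3 = quartiles(data)
    print(max(data) - min(data))  # range: 7
    print(q3 - q1)                # IQR: 7.5 - 3.5 = 4.0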

Sum of squares

The sum of squares is a measure of the total amount of variation based on calculating the sum of all the squared differences between each value and the mean.

$$ \text{SS} = {{\sum_{i=1}^{n}(x_i - \bar{x})^2} } $$

For example, if we have a data set consisting of the values { 5, 10, 15 }, the sum of squares would be (5-10)² + (10-10)² + (15-10)² = 25+0+25 = 50.

For example, if we have a data set consisting of the values { 2, 3, 5, 6 }, the sum of squares would be (2-4)² + (3-4)² + (5-4)² + (6-4)² = 4+1+1+4 = 10.

This value is not very precise since it is prone to being greatly inflated by the presence of any outliers. It also gets larger as more values are added, even if the spread of the data stays the same, so it does not necessarily represent the typical variation; that's what the variance (see below) is for.
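As a check on the examples above, a minimal Python sketch (an editorial addition):

    # Sum of squares: total squared deviation from the mean.
    def sum_of_squares(values):
        mean = sum(values) / len(values)
        return sum((x - mean) ** 2 for x in values)

    print(sum_of_squares([5, 10, 15]))   # 50.0
    print(sum_of_squares([2, 3, 5, 6]))  # 10.0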

Variance

The variance is a measure of the total amount of variation based on calculating the sum of all the squared differences between each value and the mean, divided by the size of the data set. This is equivalent to the sum of squares from above divided by the data set size (i.e., n).

For a population of data this would be:

$$ \text{Population variance} = {{\sum_{i=1}^{n}(x_i - \mu)^2} \over {n}} $$ $$ \text{Population variance} = {SS \over {n}} $$

A modification to the calculation must be made when we try to estimate the variance of a population from a sample. If we use the equation above it ends up underestimating the true population variance, but if we divide the sum of squares of the sample by the sample size minus one instead of the sample size directly we get an estimate that isn't biased.

$$ \text{Sample variance} = {{\sum_{i=1}^{n}(x_i - \bar{x})^2} \over {n-1}} $$ $$ \text{Sample variance} = {SS \over {n-1}} $$

For example, if we have a population consisting of the values { 5, 10, 15 }, the population variance would be ((5-10)² + (10-10)² + (15-10)²)/3 = (25+0+25)/3 = 50/3 = 16.667.

For example, if we have a sample consisting of the values { 5, 10, 15 }, the sample variance would be ((5-10)² + (10-10)² + (15-10)²)/2 = (25+0+25)/2 = 50/2 = 25, which is our best estimate of the population variance for the population these values came from.

For example, if we have a population consisting of the values { 2, 3, 5, 6 }, the population variance would be ((2-4)² + (3-4)² + (5-4)² + (6-4)²)/4 = (4+1+1+4)/4 = 10/4 = 2.5.

For example, if we have a sample consisting of the values { 2, 3, 5, 6 }, the sample variance would be ((2-4)² + (3-4)² + (5-4)² + (6-4)²)/3 = (4+1+1+4)/3 = 10/3 = 3.333, which is our best estimate of the population variance for the population these values came from.

This value is not very precise since it is prone to being greatly inflated by the presence of any outliers. However, due to certain mathematical properties it has, the variance is the basis for the most commonly used statistical tests and it is therefore used widely.
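The two denominators can be handled with a single flag in a Python sketch (an editorial addition):

    # Variance: SS / n for a population, SS / (n - 1) for a sample;
    # the n - 1 denominator gives an unbiased estimate of the population variance.
    def variance(values, sample=True):
        n = len(values)
        mean = sum(values) / n
        ss = sum((x - mean) ** 2 for x in values)
        return ss / (n - 1 if sample else n)

    print(variance([5, 10, 15], sample=False))   # 16.666... (population)
    print(variance([5, 10, 15]))                 # 25.0      (sample)
    print(variance([2, 3, 5, 6], sample=False))  # 2.5       (population)
    print(variance([2, 3, 5, 6]))                # 3.333...  (sample)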

Standard deviation

The standard deviation is a measure of the total amount of variation based on calculating the sum of all the squared differences between each value and the mean, divided by the size of the data set - then taking the square root. This is the square root of the variance (see above).

$$ \text{ For a population: } \sigma = \sqrt {{{\sum_{i=1}^{n}(x_i - \mu)^2} } \over {n} } $$
or
$$ \text{ For a sample: } s = \sqrt {{{\sum_{i=1}^{n}(x_i - \bar{x})^2} } \over {n-1} } $$

The same modification to this calculation must be made when we try to estimate the standard deviation of a population from a sample as we saw in the case of the variance calculation. This is because the standard deviation is based entirely on the variance. We have to use the right variance value to get the right standard deviation value.

For example, if we have a population consisting of the values { 5, 10, 15 }, the population standard deviation would be √(((5-10)² + (10-10)² + (15-10)²)/3) = √((25+0+25)/3) = √(50/3) = √16.667 = 4.082.

For example, if we have a sample consisting of the values { 5, 10, 15 }, the sample standard deviation would be √(((5-10)² + (10-10)² + (15-10)²)/2) = √((25+0+25)/2) = √(50/2) = √25 = 5, which is our best estimate of the population standard deviation for the population these values came from.

For example, if we have a population consisting of the values { 2, 3, 5, 6 }, the population standard deviation would be √(((2-4)² + (3-4)² + (5-4)² + (6-4)²)/4) = √((4+1+1+4)/4) = √(10/4) = √2.5 = 1.581.

For example, if we have a sample consisting of the values { 2, 3, 5, 6 }, the sample standard deviation would be √(((2-4)² + (3-4)² + (5-4)² + (6-4)²)/3) = √((4+1+1+4)/3) = √(10/3) = √3.333 = 1.826, which is our best estimate of the population standard deviation for the population these values came from.

This value is not very precise since it is prone to being greatly inflated by the presence of any outliers. However, if we are fairly sure the shape of the distribution is normal (i.e., a bell curve), then approximately 68% of the data lies within one standard deviation of the mean and approximately 95% lies within two standard deviations of the mean. This can be a very useful thing to know, so the standard deviation is widely used to describe variation. Statistical tests don't directly work on the standard deviation however; they usually use the variance.
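And the matching Python sketch (an editorial addition):

    import math

    # Standard deviation: the square root of the matching variance.
    def std_dev(values, sample=True):
        n = len(values)
        mean = sum(values) / n
        ss = sum((x - mean) ** 2 for x in values)
        return math.sqrt(ss / (n - 1 if sample else n))

    print(std_dev([5, 10, 15], sample=False))   # 4.082... (population)
    print(std_dev([5, 10, 15]))                 # 5.0      (sample)
    print(std_dev([2, 3, 5, 6], sample=False))  # 1.581... (population)
    print(std_dev([2, 3, 5, 6]))                # 1.825... (sample)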

Coefficient of variation

The coefficient of variation is a measure of the variation relative to the mean and is calculated by dividing the standard deviation by the mean and multiplying by 100 to get a percentage.

$$ \text{ For a population: } CV = {{\sqrt {{{\sum_{i=1}^{n}(x_i - \mu)^2} } \over {n} } } \over \mu} \times 100 $$ $$ \text{ For a population: } CV = {{\text{population standard deviation} } \over \mu} \times 100 $$
or
$$ \text{ For a sample: } CV = {{\sqrt {{{\sum_{i=1}^{n}(x_i - \bar{x})^2} } \over {n-1} } } \over \bar{x}} \times 100 $$ $$ \text{ For a sample: } CV = {{\text{sample standard deviation} } \over \bar{x}} \times 100 $$

If we have a data set with a standard deviation of 10 and a mean of 30 the coefficient of variation would be 10/30 x 100 = 33.333%.

If we have a data set with a standard deviation of 10 and a mean of 60 the coefficient of variation would be 10/60 x 100 = 16.667%.

If we have a data set with a standard deviation of 5 and a mean of 30 the coefficient of variation would be 5/30 x 100 = 16.667%.

If we have a data set with a standard deviation of 5 and a mean of 60 the coefficient of variation would be 5/60 x 100 = 8.333%.

You can see how the variation and mean combine to give the relative value. Although the values in the second data set are twice as spread out as the third, since the mean is twice as big, the relative variation is equivalent. This value is used for comparing the relative variation in data sets that may have different means because often variation increases with the mean even when the relative variation is the same.

Just like the variance it's based on, this value is not very precise since it is prone to being greatly inflated by the presence of any outliers. The standard deviation used must also match the scenario - i.e., whether the data is a sample or a population. Lastly, statistical tests don't directly work on the coefficient of variation; this is purely a descriptive statistic for comparing different samples or populations.
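As a Python sketch (an editorial addition):

    import math

    # Coefficient of variation: standard deviation relative to the mean, in percent.
    def coeff_of_variation(values, sample=True):
        n = len(values)
        mean = sum(values) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1 if sample else n))
        return sd / mean * 100

    # the sample sd of { 5, 10, 15 } is 5 and the mean is 10, so CV = 50%:
    print(coeff_of_variation([5, 10, 15]))  # 50.0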

SHAPE STATISTICS

Shape statistics are used to describe how asymmetric the values are and how their frequency distribution compares to a reference distribution called the normal distribution. Is there a long tail of rare values in one direction that are smaller or larger than the rest? Is the distribution pointier or more flattened out than the bell-shaped normal distribution?

The most commonly used shape statistics for a data set are the skewness, kurtosis, and excess kurtosis.

Comparing the shapes of different distributions is somewhat rare in statistics, but ensuring that a distribution we are working with has the shape that a certain technique requires is very common. Being able to recognize when a distribution has the shape we expect is therefore important.

Skewness

The skewness is a measure of asymmetry and is based on calculating the sum of all the cubed differences between each value and the mean, divided by the size of the data set. This initial value includes information about spread, so we divide it by the standard deviation cubed to remove the influence of this variation; what remains indicates the asymmetry alone.

For a population of data this would be:

$$ \text{Population skewness} = { {{\sum_{i=1}^{n}(x_i - \mu)^3} \over {n}} \over {\sigma^3} } $$

A modification to the calculation must be made when we try to estimate the skewness of a population from a sample. There are two alternative equations used for sample data (in the second, σ is the standard deviation calculated from the sample using the population formula).

$$ \text{Sample skewness} = { {{\sum_{i=1}^{n}(x_i - \bar{x})^3} \over {n}} \over {s^3} } $$ $$ \text{Sample skewness} = { { \sqrt { {n(n-1)}} \over {n-2} } {{{\sum_{i=1}^{n}(x_i - \bar{x})^3} \over {n}} \over {\sigma^3} } } $$

The skewness of a normal distribution (or any symmetric distribution, really) is zero, which is what we usually want.
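Both the population form and the two sample estimators can be written as a Python sketch (an editorial addition; the example data set { 1, 3, 4, 5, 7, 7, 9, 12 } is the one worked through in the videos below):

    import math

    # Skewness: population form plus the two sample estimators above.
    def skewness(values, sample=False, adjusted=False):
        n = len(values)
        mean = sum(values) / n
        m2 = sum((x - mean) ** 2 for x in values) / n
        m3 = sum((x - mean) ** 3 for x in values) / n
        g1 = m3 / m2 ** 1.5                 # population skewness
        if not sample:
            return g1
        if adjusted:                        # second sample equation
            return math.sqrt(n * (n - 1)) / (n - 2) * g1
        s = math.sqrt(m2 * n / (n - 1))     # sample standard deviation
        return m3 / s ** 3                  # first sample equation

    data = [1, 3, 4, 5, 7, 7, 9, 12]
    print(skewness(data))                              # ~0.298
    print(skewness(data, sample=True))                 # ~0.244
    print(skewness(data, sample=True, adjusted=True))  # ~0.371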

Kurtosis and excess kurtosis

The kurtosis is a measure of "peakedness", how much pointier than the normal distribution the distribution is. More precisely, it compares the thickness of the tails to those of the normal distribution, so it's really a "tailedness" measurement, but that's not as catchy so people use the "peakedness" descriptor. It is based on calculating the sum of all the fourth powers of the differences between each value and the mean, divided by the size of the data set. This initial value includes information about spread, so we divide it by the standard deviation to the fourth power to remove the influence of this variation; what remains indicates the peakedness alone. Larger values indicate distributions that are pointier and smaller values indicate distributions that are flatter.

For a population of data this would be:

$$ \text{Population kurtosis} = { {{\sum_{i=1}^{n}(x_i - \mu)^4} \over {n}} \over {\sigma^4} } $$

The kurtosis of the normal distribution ends up being 3, so we usually calculate the excess kurtosis so that instead of comparing our value to 3, we look at whether it's positive or negative.

$$ \text{Population excess kurtosis} = {{ {{\sum_{i=1}^{n}(x_i - \mu)^4} \over {n}} \over {\sigma^4}} - 3 } $$

A modification to the calculation must be made when we try to estimate the excess kurtosis of a population from a sample.

$$ \text{Sample excess kurtosis} = { { {(n+1)n} \over {(n-1)(n-2)} } { {{\sum_{i=1}^{n}(x_i - \bar{x})^4} \over {n}} \over {s^4}} - 3{{ {(n-1)^2}} \over {(n-2)(n-3)} } } $$

The excess kurtosis of a normal distribution is zero which is what we usually want.
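A matching Python sketch (an editorial addition; the sample branch transcribes the equation above, and note that other references use slightly different unbiased estimators):

    # Excess kurtosis: population form plus the sample equation above.
    def excess_kurtosis(values, sample=False):
        n = len(values)
        mean = sum(values) / n
        m2 = sum((x - mean) ** 2 for x in values) / n
        m4 = sum((x - mean) ** 4 for x in values) / n
        if not sample:
            return m4 / m2 ** 2 - 3
        s4 = (m2 * n / (n - 1)) ** 2        # sample variance squared, i.e. s^4
        return ((n + 1) * n / ((n - 1) * (n - 2)) * (m4 / s4)
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

    data = [1, 3, 4, 5, 7, 7, 9, 12]
    print(excess_kurtosis(data))               # ~ -0.726 (population)
    print(excess_kurtosis(data, sample=True))  # ~ -1.916 (sample)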






LINK TO SUMMARY SLIDE FOR VIDEO 1:


StatsExamples-summary-statistics-intro.pdf

TRANSCRIPT OF VIDEO 1:


Slide 1

Let's look at the type of values that are used when doing statistics.

Slide 2

When we're conducting a study we usually care about the parameters of a population, that's all the individuals or items of interest, but we can't measure them all because it's not feasible or practical.
We therefore usually measure the values for a subset and calculate statistics of a sample, and our sample statistics are therefore estimates of the population parameters. As indicated in the diagram here, where we want to know about the entire population, we take a sample, calculate statistics, and then use those statistics to estimate the population parameters.

Slide 3

So we then need a way to summarize the sample or population because populations and samples usually have too many values to make sense out of every single one of them. And we usually want to know the basic properties anyway, not every single value. For example, if we wanted to see whether a disease raises the diastolic pressure of the individuals who have it, do we really need to look at all the numbers individually or does the average accomplish the goal?
In the box shown we can see sample data sets of blood pressure values for baseline individuals and then for individuals with the disease. And if we looked at them data value by data value, first of all it would be too overwhelming. And second, if we looked at just the first couple of values or the last couple of values, we would actually see that the first set of values seems to be larger than the second. Only when we calculate an average for all the values together do we see that the second set has an average or typical value that is larger than the first set. So we usually want to calculate basic properties like this for a sample or for a population, so what basic properties do we calculate?

Slide 4

What are the basic properties of data sets that we typically want to know?
Well, first we want to know about the location, what is the typical value or average value? Second we want to know about the spread of the data, how variable are the data values? And third, we sometimes want to know how the distribution of values compares to others that have similar locations and spreads.

Slide 5

First, let's look at statistics of location. So here we're asking the question: what is the typical value or the average? The first important thing is to realize there is no single "THE average"; there are four different averages that can be calculated.
The first is the mean which is the sum of all the values divided by the number of values. The second is the median which is the value that's in the center of our data set with half on each side. The third is the mid-range which is halfway between the smallest and largest value. And the fourth is the mode which is the most frequent or common single value. You can have more than one mode if two or more values are equally common.
The first two are commonly used and it's always important to know which one someone is talking about when they're talking about the average. And the last two are rarely used so we won't focus any more on them.

Slide 6

The mean and median are the most common statistics of location and we'll take a look at them here with this data set consisting of a two, an eight, an eleven, a twelve, a thirteen, and a fourteen.
For the mean, we'll sum all the values and divide by six because that's how many there are.
For the median we would look at the particular value halfway through the data set, so that would be between the eleven and the twelve. So the mean uses the exact positions and you can think of it as calculating something like a center of mass where values like that two, that are far out to one side, will pull the mean in their direction. The median just divides the data into regions, upper and lower, and doesn't really take into account where they are located relative to each other within those regions.
And both provide a location that the values are ranged around as you can see in the figure.

Slide 7

Here's the same kind of figure, but now I have a more asymmetric data set with a two, a couple of threes, a four, a seven, and a twenty. And now you can see how the median and the mean are becoming a bit more different because the mean is being pulled upwards by that large value, the 20.
And this figure illustrates why we must be aware of the average that someone is using. In this case, this value of the 20 is bringing the mean upwards, away from the median. And the mean no longer represents where most of the values are because of that extreme value.
For this data set if someone wanted to make the average value seem small, they would use the median, but if they wanted to make the average value seem large, they would use the mean. So you always have to beware of which average someone is talking about when they're describing a data set where you can't see all the values, they're just telling you the average.

Slide 8

An important thing to think about when we think about statistics is what the properties of those statistics are. And a major property is how robust, that is consistent or resistant to the effects of randomness, are the values that we're calculating.
So one way to think about this is, what happens to the statistic in the presence or absence of outliers (which are rare extreme values like we saw in the previous slide)?
The mean, as we saw, is not robust. The median is robust, because it's only looking at the middle it's by definition ignoring the extreme values. The mid-range is not robust, because in fact the only values being used to calculate it are the smallest and the largest single values. And then the mode typically is robust because it's based on most common values, so outliers would not show up.
Another way to think about how robust something is, is how consistent is it when calculated from repeated samples from the same population?
In this case the mean and the median are both robust. Now the mean is robust because it's using the maximum amount of information, it's using the location of every single value. The mid-range is not robust, it's only using two values. And the mode in this case is also not robust. In particular, if the distribution has a lot of equally common values, the mode might be all over the place in different samples.

Slide 9

Now for statistics of spread, also called variability or dispersion. And now we're asking the question how variable are the values?
So we'll look at six different spreads.
The first is the range, which is just the distance from the smallest to the largest.
Second is the interquartile range, which is the width of the middle fifty percent of the values.
And then we'll look at four values that are all based on the sum of squares, which is the sum of squared differences from the mean. The variance is based on the sum of squares and you can think of it as the mean of the squared differences. The standard deviation is going to be the square root of the variance and the coefficient of variation is the standard deviation relative to that mean.

Slide 10

First, the range and interquartile range.
The range is just the difference or the distance between the smallest value and the largest value. In this case it would be fourteen minus two, is a range of twelve. And the range is not robust because it's only using two of the values.
The IQR, which represents the middle 50 percent, in this case you can see it's encapsulating the middle three out of the six values, is more robust because it uses quartiles and it's related to the median in a way we'll get into in just a moment.

Slide 11

Here are the same statistics of spread, the range and interquartile range, for a different data set, our data set that has that large twenty.
You can see now how the range is greatly inflated, now it's eighteen, whereas the IQR is still fairly small because the middle 50 percent of the data is fairly close together.

Slide 12

As mentioned, the IQR uses quartiles to determine its value. So what the quartiles do is divide the data into four equal regions.
And the median is actually one of these quartiles, that's dividing the data into a top half and a bottom half.
And then we have Q1, which is the first quartile, and that's going to divide the bottom half into two equal sized chunks. You can also think of this as being placed at the 25 percent spot.
Q2 is our median, placed at the 50 percent spot.
And our third quartile is placed at the 75 percent spot, and that's dividing the top half the data into two halves.
And the IQR is going to be the value of Q3 minus Q1.
In this example, with eight data values, you can see the median is in between the fourth and fifth value. And then in the bottom four values, Q1 is in between the second and third. And in the top half, the Q3 is in between the second and third out of those four values.
The quartiles are fairly straightforward to figure out if you have an even number of values like this.

Slide 13

If the data set has an odd number of values, it's a little bit more complicated.
So the median can then be the individual single value that's in the middle. So for example, with this data set of seven values, the median is the fourth out of seven. Now Q1 will be the median of the bottom half, and we'll include the median in that bottom half. So for those four values Q1 is in between the second and third, that is the four and the seven. And for the top half of the data, Q3 is in between the twelve and the fifteen. And then again, the IQR would be Q3 minus Q1.

Slide 14.

So to recap how to calculate quartiles using a simple method, the method's going to ignore the shape of the distribution and just focus on the values.
Step 1 we'll arrange the values in order from smallest to largest.
Step 2 depends on whether n, the number of values, is odd or even. If n is odd, the median will be the center value, and we'll include this value in both halves during step 3, our calculation of Q1 and Q3.
If there's an even number of values, the median is the middle of the two middle values. Because this median is not an actual data value, we would not include it when we're doing step three.
And then in step three we're going to repeat either step 2a or 2b, depending on whether the top half and bottom half of the data set have an even or odd number of values.
And then Q1 would be the median of the smallest half of the values and Q3 would be the median of the largest half of the values.

Slide 15

So it turns out that other methods exist because the definition of the quartiles is actually somewhat arbitrary; there's no good theoretical foundation for it. If the exact quartile values are important, make sure you know which method is being used.
So there's an alternative number one, which is same as the previous method described, but we would not include the median when we're calculating Q1 and Q3 when we have data sets with an odd number of values.
And then a second alternative, we would calculate Q1 and Q3 using a weighted average of the data points if the data set has an odd number of values.
So if the data set can be represented by 4n+1 values, so that would be 5 values, 9 values, 13 values etc. Q1 would be 25 percent of the nth value and 75 percent of the (n+1)th value and Q3 would be 75 percent of the (3n+1)th value and 25 percent of the (3n+2)th value.
If the data set has 4n+3 values, so that would be 7 or 11 or 15 values, etc., Q1 would be seventy-five percent of the (n+1)th value and twenty-five percent of the (n+2)th value, and Q3 would be 25 percent of the (3n+2)th value plus 75 percent of the (3n+3)th value.
This is obviously a more complicated way to do it, so most people would go with the two methods described previously. However this method does take the shape of the distribution into account a bit more. If you're using a computer software program to calculate your quartiles you'll want to make sure you understand which method is being used.
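[Editor's note: the weighted-average scheme described in this slide can be sketched in Python as follows; this is an editorial addition, not part of the video.]

    # Weighted-average quartiles for data sets with 4n+1 or 4n+3 values.
    def weighted_quartiles(values):
        s = sorted(values)
        N = len(s)
        if N % 4 == 1:                                      # N = 4n + 1
            n = (N - 1) // 4
            q1 = 0.25 * s[n - 1] + 0.75 * s[n]              # nth and (n+1)th values
            q3 = 0.75 * s[3 * n] + 0.25 * s[3 * n + 1]      # (3n+1)th and (3n+2)th
        elif N % 4 == 3:                                    # N = 4n + 3
            n = (N - 3) // 4
            q1 = 0.75 * s[n] + 0.25 * s[n + 1]              # (n+1)th and (n+2)th
            q3 = 0.25 * s[3 * n + 1] + 0.75 * s[3 * n + 2]  # (3n+2)th and (3n+3)th
        else:
            raise ValueError("even number of values: use the half-median method")
        return q1, q3

    print(weighted_quartiles([1, 3, 4, 5, 6, 7, 7, 9, 12]))  # (3.75, 7.5)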

Slide 16

Now for the other four statistics of spread, the sum of squares, the variance, the standard deviation, and the coefficient of variation.
So medians and quartiles are used for describing data sets, but these are the values that are used in statistical tests due to mathematical properties of sums of squares and variances.

Slide 17

First up the sum of squares, which is the sum of the squared differences from the mean.
So if we look at our data set here with six values, if we calculate the mean we get an eleven. So then our sum of squares is going to be looking at each of those six values and calculating the sum of the squared difference between each value and that mean.
So that first value would be 6 minus 11 which is negative 5 squared, second value would be 8 minus 11 which is negative 3 squared, and so on. And then we would add them all up to get 60.

Slide 18

Looking at a slightly different data set that has the same mean, but we can see is more spread out, our sum of squares would be the 2 minus 11 which is negative 9 squared, plus the 6 minus 11 which is negative 5 squared, and so on. And we will get a sum of squares of 200, which is much larger than the 60 we got for the previous data set.

Slide 19

Here we can see a data set where most of the values are much closer together, but one is much larger than the others. And if we calculate the sum of squares we'll get a 235, which is even more than the one in the previous slide.
And this illustrates one of the unfortunate properties of the sum of squares is that it is very prone to being inflated by the presence or absence of outliers. Single values that are much larger or smaller than the others can have a large impact on the sum of squares.

Slide 20

So to recap, the sum of squares is the sum of the squared differences from the mean.
This is the foundation of the variance, standard deviation, and coefficient of variation that we'll talk about in a second.
It's extremely prone to being inflated by outliers, which is unfortunate. However the variance, the next statistic we'll talk about, has a useful property: for independent variables, the variance of their sum is the same as the sum of their variances.
This mathematical property is so useful that it causes us to use sum of squares and variances despite this problem it has with outliers.

Slide 21

Okay, now for the variance. And this is represented by the Greek sigma symbol, or s, with a square.
The variance is always positive because it's based on a sum of squares. It's not robust to outliers since it's going to be based on the sum of squares.
If we have a population of data, we represent the variance with sigma squared and it's the sum of squares divided by n where n is the number of data values.
However if we have a sample, we represent the variance with s squared and we have the sum of squares divided by n minus 1. We use this second equation for a sample because the estimate of a population variance using a sample often underestimates the true population variance. Therefore the different denominator is used to create what's called an unbiased estimate of the population variance.
Remember that all sample statistics are estimates of population parameters and so we'll never expect them to always be exactly correct, but we don't want there to be any bias where they consistently underestimate or overestimate. And this adjustment to the variance formula prevents it from consistently underestimating the population variance.

Slide 22

Now for the standard deviation, represented with the Greek symbol sigma if we're looking at a population or s if we're looking at a sample. It has the same properties as the variance because it's just the square root of the variance. And the standard deviation now has the same units as the original data, unlike the variance. If you think back, because everything was squared, any units that our data values had would have become squared, which actually makes the variance hard to interpret. With the standard deviation however, the units are back to those of the original data set and it's much easier to interpret.
For this reason, the standard deviation is used as a descriptive statistic much more than the variance.

Slide 23

Our last measurement of spread is the coefficient of variation, often abbreviated CV. It has the same properties as the variance because it's based on the variance.
For a population we're going to take the population standard deviation divide by the mean multiply by 100 to get a percentage.
For the sample we'll take the sample standard deviation divide by the mean multiply by 100 to get a percentage.
And this puts the variation into context. How variable is the data compared to the typical value? It's used as a descriptive statistic, though it's not particularly common, and it's rarely used in statistical tests. It's used enough however that you should be familiar with it.

Slide 24

Before we move on to the discussion of the shape statistics there's something you should be aware of.
There are two different equations that can be used to calculate the sum of squares as shown in the top right of this slide.
There's the summation of the squared differences, which is fairly straightforward. And then there's that second equation, which is the sum of the squared values minus the square of the sum of the values divided by the sample size.
Both of them will give you the exact same number with a particular data set, as you can see in the example shown on the slide here for a data set of three, four, six, and seven which has a mean of five.
The first equation is much easier to understand and visualize what it represents, whereas the second seems more complicated. However if you're doing calculations by hand using a calculator, the second equation requires less button pushing, has less rounding error, and will actually give you a more accurate value. Because people used calculators for such a long time before we had computers, that second equation was used quite a lot, and is therefore still written in a lot of textbooks.
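[Editor's note: the equivalence is easy to verify in Python; an editorial addition using the slide's data set { 3, 4, 6, 7 }.]

    # Both sum-of-squares formulas give the same answer.
    data = [3, 4, 6, 7]
    n = len(data)
    mean = sum(data) / n                                  # 5.0
    ss_definitional = sum((x - mean) ** 2 for x in data)  # sum of squared differences
    ss_computational = sum(x * x for x in data) - sum(data) ** 2 / n
    print(ss_definitional, ss_computational)              # 10.0 10.0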

Slide 25

Now for the statistics of shape and what we'll be doing is comparing the shapes we see for our data to the normal distribution, i.e the bell curve, which is a baseline shape for comparison.
So there are two, or three depending on how you think about it, different shape statistics.
There's the skewness, which is going to measure asymmetry, and the kurtosis, which is going to measure peakedness (what it really measures is the thickness of the tails, however), and then a value called the excess kurtosis which is just the kurtosis minus 3.

Slide 26

The skewness is calculated by the equation there which is the sum of the cubed differences between each data value and the mean divided by the number of data values. And then we have to scale that for the variation by dividing the whole thing by the standard deviation to the third. And what this does is it measures asymmetry.
If we have a perfectly symmetric distribution, like shown in the middle, we'll get a skewness of zero. If there's a tail that moves off to the left with a small number of values that are much smaller than what most of the values are, we'll get a negative skewness. And if we have a tail that goes off to the right, where there's a small number of values much larger than the typical value, we'll get a positive skewness.

Slide 27

The kurtosis, which measures the peakedness or thickness of the tails, is a very similar equation, but now we have the fourth power instead of the third power.
If you calculate the kurtosis of the normal distribution as shown in the middle, you'll get a kurtosis of three. So if we subtract three that would give us an excess kurtosis of zero. We would call that distribution mesokurtic. Meso for like middle.
If you have a peaked, kind of pointier distribution with skinny tails, the kurtosis would end up being larger than three, excess kurtosis would be positive, and we would call that leptokurtic.
Whereas on the left, if the distribution is kind of fatter with tails that have more values in them, we would get a kurtosis that's less than three, which would be a negative excess kurtosis, and we would call that platykurtic. Platy meaning flat.
The exact values of kurtosis are often not used directly. We're really just comparing them to the three for kurtosis, or the zero for excess kurtosis, to figure out whether our distribution looks like the normal distribution, that is mesokurtic, or whether it looks like it's leptokurtic or platykurtic.

Slide 28

The equations for skewness and kurtosis on the previous slides were for a population of data. Estimates of the population skewness or excess kurtosis from sample data can be biased and so the equations shown in this slide are better. For the skewness we have two options for equations to use; for the excess kurtosis we have one to use. And these will allow our sample data to estimate the population values in a way that is not consistently too large or consistently too small.

Slide 29

What's the application of the skewness and kurtosis? The skewness and kurtosis are rarely studied for their own sake, they are usually calculated to see if the distribution is normal, which is typically what we want because most statistical tests require normal distributions for our data sets.
So what we usually really want is for the skewness to be zero and the excess kurtosis to be zero, so that we can be confident our statistical tests will work.

Slide 30

So here's kind of a summary of our four different location statistics, our six different spread statistics, and our two different shape statistics.
The mean, median, and standard deviation are often used when describing data. The range, IQR, and coefficient of variation are also sometimes used, but not as much.
And then for doing statistical testing, most statistical tests work with the means if they're interested in location, and variances if they're interested in spreads.
And then the skewness and kurtosis are often prerequisites for doing our tests.

Zoom out.

So there you have it, that's an overview of the basic statistics that are calculated for a population of data or, much more commonly, a sample of data that we're trying to use to estimate the values for a population of data.
There's a companion video to this one linked to in just a second that goes through a step-by-step example using a data set and calculates all the values shown.

End screen.

Click at the top to subscribe to the channel. On the left is a link to the playlist for stats examples and on the right is the video I mentioned that walks through calculations of all the statistics discussed in this video.


LINK TO SUMMARY SLIDES FOR VIDEO 2:


StatsExamples-summary-statistics-examples.pdf

TRANSCRIPT OF VIDEO 2:



Slide 1

Let's look at some step-by-step examples of calculating summary statistics for the location, spread, and shape of a data set.

Slide 2

When we're estimating data values we usually care about the parameters of a population, that is, all the individuals of interest, but we usually can't measure them all, so more often we measure the values for a subset and calculate the statistics of a sample to estimate those population parameters.

Slide 3

And what are the statistics that we calculate? They're the statistics of location, spread, and shape, and you can see there are four location statistics, six spread statistics, and two shape statistics. And we'll actually use slightly different equations for some of those because the equation can vary depending on whether it's a sample or a population that we're measuring. See the companion video linked below and at the end of this video for more detail about what these values represent and why we use these different equations for samples and populations.

Slide 4

Okay, let's look at our data set here. This will be the number of birds in eight different trees. So we've measured eight trees, counted the number of birds, and gotten the values of one, four, five, seven, three, nine, seven, and twelve. The best order to calculate our location statistics will be the median, the mode, the mid-range, and then the mean.

Slide 5

Our first step will be to reorder all the values from smallest to largest. So we're going to go to our original data set and reorder those values.

Slide 6

Now that we have an ordered data set we can calculate the median, mode, and mid-range quite easily.
The median will be the value in the middle. This data set has an even number of values, n equals eight, so we find the middle two values and take their mean. So that's the five and the seven, five plus seven equals twelve, divided by two gives us a median of six.
For the mode, since the data is ordered we can look for the longest string of consecutive identical values. In this data set there's only one value that occurs more than once, and that's the seven which occurs twice, so that's our mode.
For the mid-range this is halfway between the smallest and largest value, the mean of those two values. Since we have these ordered we can look on the far left, the one is the smallest, and on the far right, the twelve is the largest. One plus twelve divided by two is six and a half.
So our median is six, our mode is seven, and our mid-range is six and a half.

Slide 7

Our final location statistic is the mean and that's the sum of all the values divided by the number of values, which is n equals eight.
So the mean will be one, plus three, plus four, plus five, plus seven, plus seven, plus nine, plus 12 is 48, divided by eight gives us six.
So now we can look at our four location statistics. The median is six, the mid-range is six and a half, the mode is seven, and the mean is six, and we can see that they're all similar, which suggests there's no extreme asymmetry or outliers. And if we look at this figure of the values on the number line, we can see that that's the case.
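[Editor's note: these four values can be checked with Python's standard library (statistics.multimode needs Python 3.8+); an editorial addition, not part of the video.]

    import statistics

    birds = [1, 3, 4, 5, 7, 7, 9, 12]
    print(statistics.median(birds))       # 6.0
    print(statistics.multimode(birds))    # [7]
    print((min(birds) + max(birds)) / 2)  # 6.5
    print(statistics.mean(birds))         # 6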

Slide 8

Now for our spread statistics and again we'll use our ordered data set because it makes a couple of these easy to calculate.
First our range. The range is the difference between the largest and the smallest so in this case it's going to be 12 minus 1 is 11. That tells us how far apart the minimum and maximum values are.
Then the interquartile range which is the difference between Q1 and Q3, which is the first quartile and the third quartile. So before we can calculate that we have to calculate the quartiles first.
And the way we do this is in three steps.
First we divide the data set into a bottom and top half.
Then, we compute the medians for each of those halves, that will be Q1 and Q3.
Then, we compute the difference between the value of Q3 and the value of Q1.

Slide 9

So step one is to divide our data set into bottom and top halves.
Once we've done that, we can compute the medians for each half. So in the bottom half we have the values of one, three, four, and five. Q1, which is the median of the bottom half, will be the mean of three and four, three plus four is seven, divided by two, three and a half. Q3 will be the median of the top half, so that's going to be 7 plus 9 is 16, divided by 2, is 8 for Q3.

Slide 10

The IQR is going to be the difference between Q3 and Q1 so that's going to be 8 minus 3.5 is 4.5.
And if we look at a figure of our data here we can see that the IQR is the width of the middle 50 percent of the data. So we can see that region of width 4 and a half contains 4 out of our 8 values.

Slide 11

Before we proceed with calculating the rest of our spread statistics let's look a little bit more detail about calculating quartiles.
If the data set has an odd number of values, the middle value is the median and there are actually three different options for calculating the quartiles. It's a bit more complicated than if we have an even number of values as in the previous example.
Option one, we would include the median in both the bottom and top half of the data when we calculate Q1 and Q3.
Option two, we do not include the median in either the bottom or the top half of the data.
And then option three, we would calculate Q1 and Q3 using a weighted average of the data points in the top and the bottom half.

Slide 12

Now for our example here, now the bottom half of the data includes the one, three, four, five, and six so Q1 is the median of that set of data.
So that's going to be the four the value in the middle.
The top half of the data has the six, seven, seven, nine, and twelve so Q3 will be the middle value there, the median, which is a 7. So now our IQR would be 7 minus 4 is 3.

Slide 13

Option 2 for calculating quartiles when we have an odd number of values would be to not include the median in either the bottom or top half. So we have our ordered data set and then we divide it into two halves but we do not include that six in either half.
Now the bottom half is a one, three, four, and five. We compute the median of that which is going to be three plus four, divided by two, seven divided by two, is three and a half.
In the top half we have the seven, seven, nine, and twelve so we calculate Q3 as seven plus nine is 16, divided by two, is eight.
And the IQR is then eight minus three and a half, gives us four and a half.

Slide 14

Option three for calculating our quartiles with an odd number of values involves using a weighted average of the data points.
And the first step is to see whether the equation four n plus one, or four n plus three, is more appropriate to match the size of our data set. For the example we're looking at here, with 9 you can see that 4 times 2 plus 1 can equal 9, but there's no whole number where 4 times that number plus 3 would equal 9. So we'll be using the first set of equations for 4n plus 1 values.
For 4n plus 1 values, Q1 will be 0.25 times the nth value plus 0.75 times the (n+1)th value. So n was equal to 2, so that's going to be 0.25 times the second value which is the 3, plus 0.75 times the third value which is that 4, and that will give us Q1.
And then Q3 will be 0.75 times the (3n+1)th value, plus 0.25 times the (3n+2)th value, so that would be 0.75 times the 7 and 0.25 times the 9, the 7th and 8th values in our data set.
So when we plug those values in, Q1 will be 0.25 times 3 plus 0.75 times 4 gives us 3.75. Q3 will be 0.75 times 7 plus 0.25 times 9 gives us 7.5. So then the IQR will be 7.5 minus 3.75 gives us 3.75.
These three different options give us three different values for the IQR so we should always be clear about which of these options we're using when we calculate the IQR and report it for our data set.

Slide 15

Returning back to the rest of our statistics for the spread of the data we'll now look at the sum of squares.
The sum of squares is the sum of the squared differences from the mean which we earlier calculated to be six. Sum of squares, you can see this summation here i equals one to eight, so from the first value to the eighth value for x sub i minus the mean squared. So we're looking at each of our values 1, 3, 4, 5, 7, 7, 9 and 12 subtracting 6, which is the mean, squaring, we're doing that separately for all eight of our values, and then adding those up. So we'll get 1 minus 6 squared, plus 3 minus 6 squared, plus 4 minus 6 squared, plus 5 minus 6 squared, plus 7 minus 6 squared, plus 7 minus 6 squared, plus 9 minus 6 squared, plus 12 minus 6 squared, that gives us those values you see there.
And then we sum those values up and we get a sum of squares of 86 for this data set.

Slide 16

The three other values of the spread are all easy to calculate once we have the sum of squares. The variance, standard deviation, and coefficient of variation all begin with the sum of squares.
The variance is just going to be the sum of squares divided by n if it's a population, or n minus 1 if it's a sample. And we'll use the symbols sigma squared or s squared to represent the population variance or the sample variance. So those values would be 86 divided by 8 to give us 10.75, or 86 divided by 7 to give us 12.286.
The standard deviation is the square root of the variance, so for the population sigma would be the square root of 10.75 gives us 3.279. s would be the square root of the sample variance, so that's the square root of 12.286 gives us 3.505.
And then the coefficient of variation is the standard deviation divided by the mean multiplied by 100 to make it into a percentage. So for the population that would be 3.279 divided by 6, that's the mean, times 100 gives us 54.65 percent. For the sample that would be 3.505, divided by 6, times 100, gives us 58.42 percent. These equations are all basically exactly the same, the only difference is we have to know whether we're dealing with a population or a sample to decide how to calculate the variance.
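[Editor's note: the variance and standard deviation values can be checked with Python's standard library; an editorial addition, not part of the video.]

    import statistics

    birds = [1, 3, 4, 5, 7, 7, 9, 12]
    print(statistics.pvariance(birds))  # 10.75     (population variance)
    print(statistics.variance(birds))   # 12.285... (sample variance)
    print(statistics.pstdev(birds))     # 3.278...  (population standard deviation)
    print(statistics.stdev(birds))      # 3.505...  (sample standard deviation)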

Slide 17

A useful property of the standard deviation is that for normal distributions approximately two-thirds (about 68 percent) of the values are within one standard deviation of the mean and approximately 95 percent of the values are within two standard deviations of the mean.
So if we go 3.279 or 3.505 above and below the 6, that region should hold about two-thirds of our data, and we can see that it does: it holds 6 out of the 8 values.
In this way the standard deviation is similar to the IQR in that it gives us a region within which we expect a certain percentage of our data.
Now remember this property of the standard deviation will only hold if our data is normally distributed. So one of the things we want to know is whether our data is skewed or non-normal which is what our next two statistics are for.

Slide 18

The final set of statistics we're interested in are ones that describe the shape of the distribution. We calculate a value called the skewness, which represents the asymmetry of our data values, and the kurtosis, which measures aspects of the shape in terms of the height of the peak relative to the height of the tails. The skewness and kurtosis are rarely studied for their own sake; they are usually just calculated to see if the distribution is normal. So we compare the skewness and the kurtosis values we get to the values for a normal distribution to see if they match.
And the equations on this slide are the ones used for a data set that is a population.

Slide 19

If the data set is a population, we use this equation for the skewness. We're going to look at the sum of every data value, minus the mean to the third power, divided by the number of data values and then we have to scale that for the overall variation, so we're going to divide that by the standard deviation to the third power. So you can see that we're plugging in the values here so n is equal to eight, our standard deviation is the 3.279, which is our standard deviation for the population. We can bring those values out in front of the sum, so we'll just have 1 divided by 8, divided by the standard deviation to the third, times the sum of every value minus the mean to the third power. I won't read off all these numbers, but you can pause the video here if you want to see exactly how they go in, but we would plug all those numbers in and eventually we would get 84 divided by 282.042 gives us a skewness of 0.298.

Slide 20
For the kurtosis we have almost exactly the same equation, but now it's the sum of each value minus the mean to the fourth power, divided by the number of values. And then again, in order to subtract out the spread, we're dividing that by the standard deviation to the fourth power.
So again we can pull those values out in front of the summation, 1 divided by 8 times the standard deviation to the 4th power, times the sum of each value minus the mean to the fourth power. Plug all those values in and we end up getting a kurtosis of 2.274.

Slide 21
If the dataset is a sample there are different equations to use. For the skewness there are two other equations that we can use. For the excess kurtosis there's a more complicated equation that we would have to use.
But we can use the sums of the cubes and the fourth powers that we calculated from the previous slides so the 84 and the 2102.

Slide 22

So if our data set is a sample and we're calculating the skewness we can use two different equations.
For the first equation, it's the same as the skewness for the population except we divide by the standard deviation for the sample to the third power. That gives us a skewness of 0.244.
For the second equation we have a coefficient in front of our population skewness equation, that's that square root of n times n minus 1 divided by n minus 2. When we plug all those numbers in, so square root of 8 times 7 all divided by 6, and then that's multiplied by this population skewness, we would get 0.371.

Slide 23

For the sample excess kurtosis we have that larger equation, but if you look in the middle part of it we have that summation of the fourth powers, divided by n, divided by the sample standard deviation to the fourth power. That's very similar to the population kurtosis just using the sample standard deviation instead of the population standard deviation.
And then we have a leading coefficient which is n plus 1 times n divided by n minus 1 times n minus 2. So that's just plugging in the number of values in our data set.
And then at the end we have 3 times n minus 1 squared, divided by n minus 2 times n minus 3. So again that's just plugging in the number of data values. So we just have to be careful, go slow, and plug things in. So 9 times 8, divided by 7 times 6, times our 2102 divided by eight, divided by the sample standard deviation to the fourth power, minus three times seven squared divided by six, divided by five, gives us negative 1.915.

Slide 24

Okay, to recap our skewness and kurtosis values: if the data set is a population we get a skewness of 0.298, a kurtosis of 2.274, and an excess kurtosis of negative 0.726. The excess kurtosis is just the regular kurtosis minus 3.
If our data set instead was a sample we would have two different skewnesses that we could calculate depending on which of the equations we used. Either the 0.244 or the 0.371. Neither one is absolutely correct they are two different estimators of the true population skewness. So we would probably want to calculate both of them and have a fairly good confidence that our population skewness is most likely somewhere between those two. And then the excess kurtosis value we get negative 1.915.
So given these values, the positive skewness indicates that the data set is right skewed, and the negative excess kurtosis indicates the data set is platykurtic.
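[Editor's note: the shape statistics in this recap can be verified with a short Python sketch; an editorial addition that uses the same equations as the video.]

    import math

    birds = [1, 3, 4, 5, 7, 7, 9, 12]
    n = len(birds)
    mean = sum(birds) / n
    m2 = sum((x - mean) ** 2 for x in birds) / n  # 10.75
    m3 = sum((x - mean) ** 3 for x in birds) / n  # 10.5   (84 / 8)
    m4 = sum((x - mean) ** 4 for x in birds) / n  # 262.75 (2102 / 8)
    sigma = math.sqrt(m2)                         # population standard deviation
    s = math.sqrt(m2 * n / (n - 1))               # sample standard deviation

    print(m3 / sigma ** 3)                                     # ~0.298  population skewness
    print(m3 / s ** 3)                                         # ~0.244  sample skewness, eq. 1
    print(math.sqrt(n * (n - 1)) / (n - 2) * m3 / sigma ** 3)  # ~0.371  sample skewness, eq. 2
    print(m4 / sigma ** 4 - 3)                                 # ~-0.726 population excess kurtosis
    print((n + 1) * n / ((n - 1) * (n - 2)) * m4 / s ** 4
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))            # ~-1.916 sample excess kurtosis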

Slide 25

Now let's take a look at all of our statistics in relation to a figure of our data.
So our four location statistics are all pretty similar, a six, a six, a six and a half, and a seven, and they're all representing the general location or the typical average value of the data set.
Then our spread values, the range, interquartile range, and sum of squares are the same whether this is a population or a sample, but the variances, standard deviations, coefficients of variation, skewnesses, and kurtosis values differ depending on whether this is a population or a sample.

Zoom out

Hopefully this video helps you put everything together so that you can calculate the basic location, spread, and shape statistics for any population or sample that you're interested in.

End screen

This video calculated a bunch of statistics but there's a companion video that describes more about what the statistics represent that's linked below on the right.




This information is intended for the greater good; please use statistics responsibly.