SUMMARY STATISTICS (INTRODUCTION)

INTRODUCTION

Once a data set has been collected, we will usually have too many numbers to make sense of them all. To help understand the data we use descriptive statistics to summarize and simplify the data set. The values we use to summarize and describe data sets are also the values that statistical tests work with when trying to decide whether populations are different from one another (i.e., all tests are based on a certain statistic, not the individual values).

Summary values for data sets that are samples are called statistics.

Summary values for data sets that are populations are called parameters.

Since our data sets are usually samples, we typically use sample statistics to estimate the values of the population parameters. Some of these are intuitive and easy to calculate by hand or with a calculator while others generally require the use of a computer. Although technically we could talk about "descriptive statistics" or "descriptive parameters", since most real-world work is done with samples, the terms "summary statistics" or "descriptive statistics" are used when describing the calculations.

Descriptive statistics come in three types: location, variation, and shape.

LOCATION STATISTICS

Location statistics are used to identify where on the number line the values are located. You can think of these as describing the typical or average value. If we are interested in whether a certain set of values is larger or smaller than another, then we would compare a specific location statistic in the two groups.

The most commonly used location statistics for a data set are the mid-range, mode, median, and mean.

Mid-range

The mid-range is the point mid-way between the largest and smallest value in the data set. You calculate it by adding the minimum and maximum values and then dividing by two.

$$ \text{mid-range} = {{\text{minimum} + \text{maximum}} \over {2}} $$

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9 }, the mid-range would be (2+9)/2 = 11/2 = 5.5.

This value is not very precise since it only uses the largest and smallest values and is therefore prone to being thrown off by outliers. In a more technical sense, it only uses a very small portion of the information about the data to estimate the location and is consequently inaccurate. Although the mid-range is easy to calculate, it's rarely used in practice.
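
As a quick sketch (in Python here and in the examples below, assuming the data is already in a list):

    data = [2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9]
    mid_range = (min(data) + max(data)) / 2
    print(mid_range)  # (2 + 9) / 2 = 5.5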

Mode

This is the value that occurs most frequently in the data set. A data set can have more than one mode or it can have none.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9 }, the mode would be 5.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 5, 5, 6, 7, 7, 7, 9 }, the modes would be 5 and 7.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 6, 7, 8, 9 }, the mode is undefined since no value is more common than the others.

This doesn't really tell us where the data is centered unless the data has a frequency peak at the center (which is true for some, but not all, data sets). For certain purposes the mode is useful, however - for example, it tells us which value we are most likely to get if we pick a random value from the data set. There are few statistical tests for modes, though, so the mode is mainly used for description rather than inference.
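
A small Python sketch that returns every mode (note that the built-in statistics.multimode would report every value as a mode when all values are unique, so a hand-rolled version matches the convention above more closely):

    from collections import Counter

    def modes(values):
        # Keep every value tied for the highest count.
        counts = Counter(values)
        top = max(counts.values())
        if top == 1:
            return []  # no value repeats, so the mode is undefined
        return [v for v, c in counts.items() if c == top]

    print(modes([2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9]))  # [5]
    print(modes([2, 3, 4, 5, 5, 5, 6, 7, 7, 7, 9]))  # [5, 7]
    print(modes([2, 3, 4, 5, 6, 7, 8, 9]))           # []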

Median

This is the value that is in the middle of the sample data (50% below, 50% above). To calculate the median, you first order the data set from smallest to largest. If there are an odd number of values in the data set, then the median is usually defined as the one value in the middle. If there are an even number of values, then the median is usually defined as the mean of the middle two observations.

For example, if we have a data set consisting of the values { 2, 3, 5, 8, 9 }, the median would be 5.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 8, 9 }, the median would be (4+5)/2 = 4.5.

The technique described above is the simplest method, but it doesn't take the shape of the distribution into account, so there are alternative methods you may sometimes see used.

The median usually gives us a good idea of where the data is centered unless the data has a weird asymmetric shape. It is also not prone to being thrown off by outliers since it focuses on the center of the distribution.

The idea of dividing the data set into sections (e.g., top and bottom halves) in order to summarize it is extended by describing a data set with quartiles, quintiles or percentiles.
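
A quick Python check of the two examples; the standard library's statistics.median applies exactly the rule described above (middle value for an odd count, mean of the middle two for an even count):

    import statistics

    print(statistics.median([2, 3, 5, 8, 9]))     # 5
    print(statistics.median([2, 3, 4, 5, 8, 9]))  # (4 + 5) / 2 = 4.5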

Mean

The mean, more precisely called the arithmetic mean, is the arithmetic average of the values. It is calculated by adding up all the values and dividing by the sample size.

$$ \text{mean} = {{\sum_{i=1}^{n}x_i} \over {n}} $$

For example, if we have a data set consisting of the values { 2, 3, 5, 8, 9 }, the mean would be (2+3+5+8+9)/5 = (27)/5 = 5.4.

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 8, 9 }, the mean would be (2+3+4+5+8+9)/6 = 31/6 = 5.167.

This is what most people learn as "the average", but technically mid-ranges, modes, and medians are averages too. There is also a value called a "geometric mean" that is used in some specialized circumstances so it's good to be precise.
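
In Python the arithmetic mean is just the sum divided by the count (statistics.mean gives the same result):

    data = [2, 3, 4, 5, 8, 9]
    mean = sum(data) / len(data)
    print(mean)  # 31 / 6 = 5.167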

The "average"

As noted above, the term average is imprecise because it is often used for either the mean or the median, and these can be quite different if the distribution is asymmetric. If you are ever presented with a value called "the average" you should determine whether it refers to the mean or the median. Similarly, when describing the average you should be precise in your vocabulary and specify whether you are presenting the mean or the median. Leaving the average undefined and then using whichever one best supports your argument is a common way to manipulate an audience.

For example, if you want to make it look like people are doing well economically you would want to present the arithmetic mean as the average, since it is pulled upward by the incomes of millionaires and billionaires and is therefore a very big number. But if you wanted to describe the income that most people personally experience, then the median would be more representative of the typical income.

Beware of this.

VARIATION STATISTICS

Variation statistics are used to describe how spread out the values are. Are the values clumped in one tight area or are they spread out on the number line? If we randomly choose values will they tend to be similar or highly variable? If we are interested in whether a certain set of values is more variable than another, then we would compare a specific variation statistic in the two groups.

The most commonly used variation statistics for a data set are the range, inter-quartile range (IQR), sum of squares, variance, standard deviation, and coefficient of variation.

Range

The range is the distance between the smallest and largest values on the number line. This is calculated by subtracting the minimum value from the maximum value in the data set.

$$ \text{Range} = {\text{maximum} - \text{minimum}} $$

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9 }, the range would be 9-2 = 7.

This value is not very precise since it only uses the largest and smallest values and is therefore prone to being thrown off by outliers. In a more technical sense, it only uses a very small portion of the information about the data to estimate the variation and is consequently inaccurate. Although the range is easy to calculate, it's rarely used in practice.
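
As a one-line Python sketch using the same data set:

    data = [2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9]
    print(max(data) - min(data))  # 9 - 2 = 7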

Inter-quartile range

The inter-quartile range describes how far apart the first and third quartiles are (see the Quartiles page for an explanation of what these are) and describes the width of the region containing the middle 50% of the data values. This statistic represents the variation in the center of the sample quite well and is not influenced by outliers.

You calculate it by subtracting the value of Q1 from the value of Q3.

$$ \text{Interquartile range} = {Q_3 - Q_1} $$

For example, if we have a data set consisting of the values { 2, 3, 4, 5, 6, 7, 8, 9 }, the IQR would be 7.5 - 3.5 = 4.

This value is fairly reliable and useful for describing data sets, but there aren't really any common statistical tests based on it.
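
A Python sketch for the even-sized example above, taking the median of each half (odd-sized data sets need the adjustment described on the Quartiles page and in the transcript below):

    import statistics

    data = sorted([2, 3, 4, 5, 6, 7, 8, 9])
    half = len(data) // 2
    q1 = statistics.median(data[:half])  # median of lower half = 3.5
    q3 = statistics.median(data[half:])  # median of upper half = 7.5
    print(q3 - q1)                       # IQR = 4.0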

Sum of squares

The sum of squares is a measure of the total amount of variation based on calculating the sum of all the squared differences between each value and the mean.

$$ \text{SS} = {{\sum_{i=1}^{n}(x_i - \bar{x})^2} } $$

For example, if we have a data set consisting of the values { 5, 10, 15 }, the sum of squares would be (5-10)² + (10-10)² + (15-10)² = 25+0+25 = 50.

For example, if we have a data set consisting of the values { 2, 3, 5, 6 }, the sum of squares would be (2-4)² + (3-4)² + (5-4)² + (6-4)² = 4+1+1+4 = 10.

This value is not very precise since it is prone to being greatly inflated by the presence of any outliers. It also gets larger as more values are added, even if the spread of the data is the same, and therefore does not necessarily represent the typical variation - that's what the variance (see below) is for.
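
A Python sketch of the definition, reproducing both examples:

    def sum_of_squares(values):
        # Total squared deviation from the mean.
        mean = sum(values) / len(values)
        return sum((x - mean) ** 2 for x in values)

    print(sum_of_squares([5, 10, 15]))   # 50.0
    print(sum_of_squares([2, 3, 5, 6]))  # 10.0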

Variance

The variance is a measure of the total amount of variation based on calculating the sum of all the squared differences between each value and the mean, divided by the size of the data set. This is equivalent to the sum of squares from above divided by the data set size (i.e., n).

For a population of data this would be:

$$ \text{Population variance} = {{\sum_{i=1}^{n}(x_i - \mu)^2} \over {n}} $$ $$ \text{Population variance} = {SS \over {n}} $$

A modification to the calculation must be made when we try to estimate the variance of a population from a sample. If we use the equation above it ends up underestimating the true population variance, but if we divide the sum of squares of the sample by the sample size minus one instead of the sample size directly we get an estimate that isn't biased.

$$ \text{Sample variance} = {{\sum_{i=1}^{n}(x_i - \bar{x})^2} \over {n-1}} $$ $$ \text{Sample variance} = {SS \over {n-1}} $$

For example, if we have a population consisting of the values { 5, 10, 15 }, the population variance would be ((5-10)² + (10-10)² + (15-10)²)/3 = (25+0+25)/3 = 50/3 = 16.667.

For example, if we have a sample consisting of the values { 5, 10, 15 }, the sample variance would be ((5-10)² + (10-10)² + (15-10)²)/2 = (25+0+25)/2 = 50/2 = 25, which is our best estimate of the population variance for the population these values came from.

For example, if we have a population consisting of the values { 2, 3, 5, 6 }, the population variance would be ((2-4)² + (3-4)² + (5-4)² + (6-4)²)/4 = (4+1+1+4)/4 = 10/4 = 2.5.

For example, if we have a sample consisting of the values { 2, 3, 5, 6 }, the sample variance would be ((2-4)² + (3-4)² + (5-4)² + (6-4)²)/3 = (4+1+1+4)/3 = 10/3 = 3.333, which is our best estimate of the population variance for the population these values came from.

This value is not very precise since it is prone to being greatly inflated by the presence of any outliers. However, due to certain mathematical properties it has, the variance is the basis for the most commonly used statistical tests and is therefore used widely.
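
Python's standard library distinguishes the two denominators; a quick check against the examples above:

    import statistics

    data = [5, 10, 15]
    print(statistics.pvariance(data))  # population variance: 50/3 = 16.667
    print(statistics.variance(data))   # sample variance: 50/2 = 25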

Standard deviation

The standard deviation is a measure of the total amount of variation based on calculating the sum of all the squared differences between each value and the mean, dividing by the size of the data set, and then taking the square root. In other words, it is the square root of the variance (see above).

$$ \text{ For a population: } \sigma = \sqrt {{{\sum_{i=1}^{n}(x_i - \mu)^2} } \over {n} } $$
or
$$ \text{ For a sample: } s = \sqrt {{{\sum_{i=1}^{n}(x_i - \bar{x})^2} } \over {n-1} } $$

The same modification must be made when we try to estimate the standard deviation of a population from a sample as we saw for the variance calculation. This is because the standard deviation is based entirely on the variance - we have to use the right variance value to get the right standard deviation value.

For example, if we have a population consisting of the values { 5, 10, 15 }, the population standard deviation would be √(((5-10)² + (10-10)² + (15-10)²)/3) = √((25+0+25)/3) = √(50/3) = √16.667 = 4.082.

For example, if we have a sample consisting of the values { 5, 10, 15 }, the sample standard deviation would be √(((5-10)² + (10-10)² + (15-10)²)/2) = √((25+0+25)/2) = √(50/2) = √25 = 5, which is our best estimate of the population standard deviation for the population these values came from.

For example, if we have a population consisting of the values { 2, 3, 5, 6 }, the population standard deviation would be √(((2-4)² + (3-4)² + (5-4)² + (6-4)²)/4) = √((4+1+1+4)/4) = √(10/4) = √2.5 = 1.581.

For example, if we have a sample consisting of the values { 2, 3, 5, 6 }, the sample standard deviation would be √(((2-4)² + (3-4)² + (5-4)² + (6-4)²)/3) = √((4+1+1+4)/3) = √(10/3) = √3.333 = 1.826, which is our best estimate of the population standard deviation for the population these values came from.

This value is not very precise since it is prone to being greatly inflated by the presence of any outliers. However, if we are fairly sure the shape of the distribution is normal (i.e., a bell curve), then approximately 68% of the data lies within one standard deviation of the mean and 95% lies within two standard deviations of the mean. This can be a very useful thing to know, so the standard deviation is widely used to describe variation. Statistical tests don't directly work on the standard deviation, however - they usually use the variance.
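
Again the standard library has both versions; a quick check against the { 2, 3, 5, 6 } examples:

    import statistics

    data = [2, 3, 5, 6]
    print(statistics.pstdev(data))  # population SD: √2.5 = 1.581
    print(statistics.stdev(data))   # sample SD: √(10/3) = 1.826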

Coefficient of variation

The coefficient of variation is a measure of the variation relative to the mean and is calculated by dividing the standard deviation by the mean and multiplying by 100 to get a percentage.

$$ \text{ For a population: } CV = {{\sqrt {{{\sum_{i=1}^{n}(x_i - \mu)^2} } \over {n} } } \over \mu} \times 100 $$ $$ \text{ For a population: } CV = {{\text{population standard deviation} } \over \mu} \times 100 $$
or
$$ \text{ For a sample: } CV = {{\sqrt {{{\sum_{i=1}^{n}(x_i - \bar{x})^2} } \over {n-1} } } \over \bar{x}} \times 100 $$ $$ \text{ For a sample: } CV = {{\text{sample standard deviation} } \over \bar{x}} \times 100 $$

If we have a data set with a standard deviation of 10 and a mean of 30 the coefficient of variation would be 10/30 x 100 = 33.333%.

If we have a data set with a standard deviation of 10 and a mean of 60 the coefficient of variation would be 10/60 x 100 = 16.667%.

If we have a data set with a standard deviation of 5 and a mean of 30 the coefficient of variation would be 5/30 x 100 = 16.667%.

If we have a data set with a standard deviation of 5 and a mean of 60 the coefficient of variation would be 5/60 x 100 = 8.333%.

You can see how the variation and mean combine to give the relative value. Although the values in the second data set are twice as spread out as the third, since the mean is twice as big, the relative variation is equivalent. This value is used for comparing the relative variation in data sets that may have different means because often variation increases with the mean even when the relative variation is the same.

Just like the variance it's based on, this value is not very precise since it is prone to being greatly inflated by the presence of any outliers. The standard deviation used must also match the scenario - i.e., whether the data is a sample or a population. Lastly, statistical tests don't directly work on the coefficient of variation; it is purely a descriptive statistic for comparing different samples or populations.
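
A minimal Python sketch for sample data (for a population, swap in statistics.pstdev):

    import statistics

    def cv_percent(sample):
        # Sample standard deviation relative to the mean, as a percentage.
        return statistics.stdev(sample) / statistics.mean(sample) * 100

If you already have the summary values in hand, it is just the division from the examples above, e.g. 10 / 30 x 100 = 33.333%.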

SHAPE STATISTICS

Shape statistics are used to describe how asymmetric the values are and how their frequency distribution compares to a reference distribution called the normal distribution. Is there a long tail of rare values in one direction that are smaller or larger than the rest? Is the distribution pointier or more flattened out than the bell-shaped normal distribution?

The most commonly used shape statistics for a data set are the skewness, kurtosis, and excess kurtosis.

Comparing the shapes of different distributions is somewhat rare in statistics, but ensuring that a distribution we are working with has the shape that a certain technique requires is very common. Being able to recognize when a distribution has the shape we expect is therefore important.

Skewness

The skewness is a measure of asymmetry and is based on calculating the sum of all the cubed differences between each value and the mean, divided by the size of the data set. This initial value includes information about spread, so we divide it by the standard deviation cubed to remove the influence of that variation; what remains indicates the asymmetry alone.

For a population of data this would be:

$$ \text{Population skewness} = { {{\sum_{i=1}^{n}(x_i - \mu)^3} \over {n}} \over {\sigma^3} } $$

A modification to the calculation must be made when we try to estimate the skewness of a population from a sample. There are two alternative equations used for sample data.

$$ \text{Sample skewness} = { {{\sum_{i=1}^{n}(x_i - \bar{x})^3} \over {n}} \over {s^3} } $$ $$ \text{Sample skewness} = { { \sqrt { {n(n-1)}} \over {n-2} } {{{\sum_{i=1}^{n}(x_i - \bar{x})^3} \over {n}} \over {\sigma^3} } } $$

The skewness of a normal distribution (or of any symmetric distribution, really) is zero, which is usually what we want.
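
A sketch of the population formula in plain Python; the data set here is a made-up symmetric example chosen to show the zero result (libraries such as scipy offer the bias-corrected sample versions):

    def population_skewness(values):
        n = len(values)
        mu = sum(values) / n
        sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5
        third_moment = sum((x - mu) ** 3 for x in values) / n
        return third_moment / sigma ** 3

    print(population_skewness([2, 3, 5, 7, 8]))  # 0.0 (symmetric around 5)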

Kurtosis and excess kurtosis

The kurtosis is a measure of "peakedness" - how much pointier than the normal distribution the distribution is. More precisely, it compares the thickness of the tails to those of the normal distribution, so it's really a "tailedness" measurement, but that's not as catchy so people use the "peakedness" descriptor. It is based on calculating the sum of all the fourth powers of the differences between each value and the mean, divided by the size of the data set. This initial value includes information about spread, so we divide it by the standard deviation to the fourth power to remove the influence of that variation; what remains indicates the peakedness alone. Larger values indicate distributions that are pointier and smaller values indicate distributions that are flatter.

For a population of data this would be:

$$ \text{Population kurtosis} = { {{\sum_{i=1}^{n}(x_i - \mu)^4} \over {n}} \over {\sigma^4} } $$

The kurtosis of the normal distribution works out to be 3, so we usually calculate the excess kurtosis; instead of comparing our value to 3, we can then just look at whether it is positive or negative.

$$ \text{Population excess kurtosis} = {{ {{\sum_{i=1}^{n}(x_i - \mu)^4} \over {n}} \over {\sigma^4}} - 3 } $$

A modification to the calculation must be made when we try to estimate the excess kurtosis of a population from a sample.

$$ \text{Sample excess kurtosis} = { {{n(n+1)} \over {(n-1)(n-2)(n-3)}} {{\sum_{i=1}^{n}(x_i - \bar{x})^4} \over {s^4}} - 3{{(n-1)^2} \over {(n-2)(n-3)}} } $$

The excess kurtosis of a normal distribution is zero, which is what we usually want.
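
A sketch of the population excess kurtosis in plain Python; the flat, evenly spaced made-up data set should come out negative (platykurtic):

    def population_excess_kurtosis(values):
        n = len(values)
        mu = sum(values) / n
        var = sum((x - mu) ** 2 for x in values) / n
        fourth_moment = sum((x - mu) ** 4 for x in values) / n
        return fourth_moment / var ** 2 - 3

    print(population_excess_kurtosis([1, 2, 3, 4, 5]))  # -1.3 (flatter than normal)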



LINK TO SUMMARY SLIDE FOR VIDEO:


StatsExamples-summary-statistics-intro.pdf

TRANSCRIPT OF VIDEO:


Slide 1

Let's look at the type of values that are used when doing statistics.

Slide 2

When we're conducting a study we usually care about the parameters of a population, that's all the individuals or items of interest, but we can't measure them all because it's not feasible or practical.
We therefore usually measure the values for a subset and calculate statistics of a sample, and our sample statistics are therefore estimates of the population parameters. As indicated in the diagram here, where we want to know about the entire population, we take a sample, calculate statistics, and then use those statistics to estimate the population parameters.

Slide 3

So we then need a way to summarize the sample or population because populations and samples usually have too many values to make sense out of every single one of them. And we usually want to know the basic properties anyway, not every single value. For example, if we wanted to see whether a disease raised the diastolic pressure of individuals who have it, do we really need to look at all the numbers individually or does the average accomplish the goal?
In the box shown we can see sample data sets of blood pressure values for baseline and then for individuals with the disease. If we looked at them data value by data value, first of all it would be too overwhelming. And second, if we looked at just the first couple of values or the last couple of values, we would actually see that the first set of values seems to be larger than the second. Only when we calculate an average for all the values together do we see that the second set has an average or typical value that is larger than the first set. So we usually want to calculate basic properties like this for a sample or for a population - so what basic properties do we calculate?

Slide 4

What are the basic properties of data sets that we typically want to know?
Well, first we want to know about the location, what is the typical value or average value? Second we want to know about the spread of the data, how variable are the data values? And third, we sometimes want to know how the distribution of values compares to others that have similar locations and spreads.

Slide 5

First, let's look at statistics of location. Here we're asking the question: what is the typical value or the average? The first important thing to realize is that there is no single "THE average" - there are four different averages that can be calculated.
The first is the mean which is the sum of all the values divided by the number of values. The second is the median which is the value that's in the center of our data set with half on each side. The third is the mid-range which is halfway between the smallest and largest value. And the fourth is the mode which is the most frequent or common single value. You can have more than one mode if two or more values are equally common.
The first two are commonly used and it's always important to know which one someone is talking about when they're talking about the average. And the last two are rarely used so we won't focus any more on them.

Slide 6

The mean and median are the most common statistics of location and we'll take a look at them here with this data set consisting of a two, an eight, an eleven, a twelve, a thirteen, and a fourteen.
For the mean, we'll sum all the values and divide by six because that's how many there are.
For the median we would look at the particular value halfway through the data set, so that would be between the eleven and the twelve. So the mean uses the exact positions and you can think of it as calculating something like a center of mass where values like that two, that are far out to one side, will pull the mean in their direction. The median just divides the data into regions, upper and lower, and doesn't really take into account where they are located relative to each other within those regions.
And both provide a location that the values are ranged around as you can see in the figure.

Slide 7

Here's the same kind of figure, but now I have a more asymmetric data set with a two, a couple of threes, a four, a seven, and a twenty. And now you can see how the median and the mean are becoming a bit more different because the mean is being pulled upwards by that large value, the 20.
And this figure illustrates why we must be aware of the average that someone is using. In this case, this value of the 20 is bringing the mean upwards, away from the median. And the mean no longer represents where most of the values are because of that extreme value.
For this data set if someone wanted to make the average value seem small, they would use the median, but if they wanted to make the average value seem large, they would use the mean. So you always have to beware of which average someone is talking about when they're describing a data set where you can't see all the values, they're just telling you the average.

Slide 8

An important thing to think about when we think about statistics is what the properties of those statistics are. And a major property is how robust, that is consistent or resistant to the effects of randomness, are the values that we're calculating.
So one way to think about this is, what happens to the statistic in the presence or absence of outliers (which are rare extreme values like we saw in the previous slide)?
The mean, as we saw, is not robust. The median is robust, because it's only looking at the middle it's by definition ignoring the extreme values. The mid-range is not robust, because in fact the only values being used to calculate it are the smallest and the largest single values. And then the mode typically is robust because it's based on most common values, so outliers would not show up.
Another way to think about how robust something is, is how consistent is it when calculated from repeated samples from the same population?
In this case the mean and the median are both robust. Now the mean is robust because it's using the maximum amount of information, it's using the location of every single value. The mid-range is not robust, it's only using two values. And the mode in this case is also not robust. In particular, if the distribution has a lot of equally common values, the mode might be all over the place in different samples.

Slide 9

Now for statistics of spread, also called variability or dispersion. And now we're asking the question how variable are the values?
So we'll look at six different spreads.
The first is the range, which is just the distance from the smallest to the largest.
Second is the interquartile range, which is the width of the middle fifty percent of the values.
And then we'll look at four values that are all based on the sum of squares, which is the sum of squared differences from the mean. The variance is based on the sum of squares and you can think of it as roughly the mean of the squared differences. The standard deviation is the square root of the variance, and the coefficient of variation is the standard deviation relative to the mean.

Slide 10

First, the range and interquartile range.
The range is just the difference or the distance between the smallest value and the largest value. In this case it would be fourteen minus two, which is a range of twelve. And the range is not robust because it's only using two of the values.
The IQR, which represents the middle 50 percent - in this case you can see it's encapsulating the middle three out of the six values - is more robust because it uses quartiles, and it's related to the median in a way we'll get into in just a moment.

Slide 11

Here are the same statistics of spread, the range and interquartile range, for a different data set, our data set that has that large twenty.
You can see now how the range is greatly inflated, now it's eighteen, whereas the IQR is still fairly small because the middle 50 percent of the data is fairly close together.

Slide 12

As mentioned, the IQR uses quartiles to determine its value. So what the quartiles do is divide the data into four equal regions.
And the median is actually one of these quartiles, that's dividing the data into a top half and a bottom half.
And then we have Q1, which is the first quartile, and that's going to divide the bottom half into two equal sized chunks. You can also think of this as being placed at the 25 percent spot.
Q2 is our median, placed at the 50 percent spot.
And our third quartile is placed at the 75 percent spot, and that's dividing the top half the data into two halves.
And the IQR is going to be the value of Q3 minus Q1.
In this example, with eight data values, you can see the median is in between the fourth and fifth value. And then in the bottom four values, Q1 is in between the second and third. And in the top half, the Q3 is in between the second and third out of those four values.
The quartiles are fairly straightforward to figure out if you have an even number of values like this.

Slide 13

If the data set has an odd number of values, it's a little bit more complicated.
So the median can then be the individual single value that's in the middle. So for example, with this data set of seven values, the median is the fourth out of seven. Now Q1 will be the median of the bottom half, and we'll include the median in that bottom half. So for those four values Q1 is in between the second and third, that is the four and the seven. And for the top half of the data, Q3 is in between the twelve and the fifteen. And then again, the IQR would be Q3 minus Q1.

Slide 14.

So to recap how to calculate quartiles using a simple method, the method's going to ignore the shape of the distribution and just focus on the values.
Step 1 we'll arrange the values in order from smallest to largest.
Step 2 depends on whether n, the number of values, is odd or even. If n is odd, the median will be the center value, and we'll include this value in both halves during step 3, our calculation of Q1 and Q3.
If there's an even number of values, the median is the middle of the two middle values. Because this median is not an actual data value, we would not include it when we're doing step three.
And then in step three we're going to repeat either step 2a or 2b, depending on whether the top half and bottom half of the data set have an even or odd number of values.
And then Q1 would be the median of the smallest half of the values and Q3 would be the median of the largest half of the values.
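
A Python sketch of these three steps, using the article's eight-value example and a seven-value data set consistent with slide 13 (the transcript doesn't list those exact values, so the second data set here is an assumption):

    import statistics

    def simple_quartiles(values):
        xs = sorted(values)  # step 1: order the values
        n = len(xs)
        half = n // 2
        if n % 2 == 1:
            # step 2, odd n: keep the middle value in both halves
            lower, upper = xs[:half + 1], xs[half:]
        else:
            # step 2, even n: the median is not a data value, so split cleanly
            lower, upper = xs[:half], xs[half:]
        # step 3: Q1 and Q3 are the medians of the two halves
        return statistics.median(lower), statistics.median(xs), statistics.median(upper)

    print(simple_quartiles([2, 3, 4, 5, 6, 7, 8, 9]))  # (3.5, 5.5, 7.5)
    print(simple_quartiles([2, 4, 7, 8, 12, 15, 21]))  # (5.5, 8, 13.5)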

Slide 15

So it turns out that other methods exist because the exact definition of the quartiles is somewhat arbitrary - there's no strong theoretical foundation for one choice. If the exact quartile values are important, make sure you know which method is being used.
So there's an alternative number one, which is the same as the previous method described, except we would not include the median when we're calculating Q1 and Q3 for data sets with an odd number of values.
And then a second alternative, we would calculate Q1 and Q3 using a weighted average of the data points if the data set has an odd number of values.
So if the data set can be represented by 4n+1 values, so that would be 5 values, 9 values, 13 values etc. Q1 would be 25 percent of the nth value and 75 percent of the (n+1)th value and Q3 would be 75 percent of the (3n+1)th value and 25 percent of the (3n+2)th value.
If the data set has 4n+3 values, so that would be 7 or 11 or 15 values, etc., Q1 would be seventy-five percent of the (n+1)th value plus twenty-five percent of the (n+2)th value, and Q3 would be 25 percent of the (3n+2)th value plus 75 percent of the (3n+3)th value.
This is obviously a more complicated way to do it, so most people would go with the two methods described previously. However this method does take the shape of the distribution into account a bit more. If you're using a computer program to calculate your quartiles you'll want to make sure you understand which method is being used.

Slide 16

Now for the other four statistics of spread, the sum of squares, the variance, the standard deviation, and the coefficient of variation.
So medians and quartiles are used for describing data sets, but these are the values that are used in statistical tests due to mathematical properties of sums of squares and variances.

Slide 17

First up the sum of squares, which is the sum of the squared differences from the mean.
So if we look at our data set here with six values, if we calculate the mean we get an eleven. So then our sum of squares is going to be looking at each of those six values and calculating the sum of the squared difference between each value and that mean.
So that first value would be 6 minus 11 which is negative 5 squared, second value would be 8 minus 11 which is negative 3 squared, and so on. And then we would add them all up to get 60.

Slide 18

Looking at a slightly different data set that has the same mean, but we can see is more spread out, our sum of squares would be the 2 minus 11 which is negative 9 squared, plus the 6 minus 11 which is negative 5 squared, and so on. And we will get a sum of squares of 200, which is much larger than the 60 we got for the previous data set.

Slide 19

Here we can see a data set where most of the values are much closer together, but one is much larger than the others. And if we calculate the sum of squares we'll get a 235, which is even more than the one in the previous slide.
And this illustrates one of the unfortunate properties of the sum of squares: it is very prone to being inflated by the presence or absence of outliers. Single values that are much larger or smaller than the others can have a large impact on the sum of squares.

Slide 20

So to recap, the sum of squares is the sum of the squared differences from the mean.
This is the foundation of the variance, standard deviation, and coefficient of variation that we'll talk about in a second.
It's extremely prone to being inflated by outliers, which is unfortunate. However the variance, the next statistic we'll talk about, has a useful property: for two independent variables, the variance of their sum is the same as the sum of their variances.
This mathematical property is so useful that it causes us to use sums of squares and variances despite the problem they have with outliers.

Slide 21

Okay, now for the variance. This is represented by the Greek sigma symbol, or an s, with a square - that is, sigma squared or s squared.
The variance is always positive because it's based on a sum of squares. It's not robust to outliers since it's going to be based on the sum of squares.
If we have a population of data, we represent the variance with sigma squared and it's the sum of squares divided by n where n is the number of data values.
However if we have a sample, we represent the variance with s squared and we have the sum of squares divided by n minus 1. We use this second equation for a sample because the estimate of a population variance using a sample often underestimates the true population variance. Therefore the different denominator is used to create what's called an unbiased estimate of the population variance.
Remember that all sample statistics are estimates of population parameters and so we'll never expect them to always be exactly correct, but we don't want there to be any bias where they consistently underestimate or overestimate. And this adjustment to the variance formula prevents it from consistently underestimating the population variance.

Slide 22

Now for the standard deviation, represented with the Greek symbol sigma if we're looking at a population or s if we're looking at a sample. It has the same properties as the variance because it's just the square root of the variance. And the standard deviation has the same units as the original data, unlike the variance. Earlier, because everything was squared, any units our data values had became squared units, which makes the variance hard to interpret. With the standard deviation the units are back to those of the original data set and it's much easier to interpret.
For this reason, the standard deviation is used as a descriptive statistic much more than the variance.

Slide 23

Our last measurement of spread is the coefficient of variation, often abbreviated CV. It has the same properties as the variance because it's based on the variance.
For a population we're going to take the population standard deviation divide by the mean multiply by 100 to get a percentage.
For the sample we'll take the sample standard deviation divide by the mean multiply by 100 to get a percentage.
And this puts the variation into context: how variable is the data compared to the typical value? It's used as a descriptive statistic - it's not particularly common, and it's rarely used in statistical tests - but it's used enough that you should be familiar with it.

Slide 24

Before we move on to the discussion of the shape statistics there's something you should be aware of.
There are two different equations that can be used to calculate the sum of squares as shown in the top right of this slide.
There's the summation of the squared differences, which is fairly straightforward. And then there's that second equation, which is the sum of the squared values minus the square of the sum of the values divided by the sample size.
Both of them will give you the exact same number with a particular data set, as you can see in the example shown on the slide here for a data set of three, four, six, and seven which has a mean of five.
The first equation is much easier to understand and visualize what it represents, whereas the second seems more complicated. However if you're doing calculations using a calculator and doing it by hand, the second equation requires less button pushing, has less rounding error, and will actually give you a more accurate value. Because people used calculators before we had computers for such a long time, that second equation was used quite a lot, and is therefore still written in a lot of textbooks.
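
As a sketch, both equations in Python for the slide's data set, confirming they agree:

    data = [3, 4, 6, 7]
    n = len(data)
    mean = sum(data) / n  # 5

    ss_definition = sum((x - mean) ** 2 for x in data)           # 4+1+1+4 = 10
    ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n  # 110 - 400/4 = 10
    print(ss_definition, ss_shortcut)  # 10.0 10.0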

Slide 25

Now for the statistics of shape. What we'll be doing is comparing the shapes we see for our data to the normal distribution, i.e., the bell curve, which is a baseline shape for comparison.
So there are two, or three depending on how you think about it, different shape statistics.
There's the skewness, which measures asymmetry; the kurtosis, which measures peakedness (what it really measures is the thickness of the tails, however); and then a value called the excess kurtosis, which is just the kurtosis minus 3.

Slide 26

The skewness is calculated by the equation there, which is the sum of the cubed differences between each data value and the mean, divided by the number of data values. Then we have to scale that for the variation by dividing the whole thing by the standard deviation cubed. What this does is measure asymmetry.
If we have a perfectly symmetric distribution, like shown in the middle, we'll get a skewness of zero. If there's a tail that moves off to the left with a small number of values that are much smaller than what most of the values are, we'll get a negative skewness. And if we have a tail that goes off to the right, where there's a small number of values much larger than the typical value, we'll get a positive skewness.

Slide 27

The kurtosis, which measures the peakedness or thickness of the tails, is a very similar equation, but now we have the fourth power instead of the third power.
If you calculate the kurtosis of the normal distribution as shown in the middle, you'll get a kurtosis of three. So if we subtract three that would give us an excess kurtosis of zero. We would call that distribution mesokurtic. Meso for like middle.
If you have a peaked, kind of pointier distribution with skinny tails, the kurtosis would end up being larger than three, excess kurtosis would be positive, and we would call that leptokurtic.
Whereas on the left, if the distribution is kind of fatter with tails that have more values in them, we would get a kurtosis that's less than three, which would be a negative excess kurtosis, and we would call that platykurtic. Platy meaning flat.
The exact values of kurtosis are often not used directly. We're really just comparing them to the three for kurtosis, or the zero for excess kurtosis, to figure out whether our distribution looks like the normal distribution, that is mesokurtic, or whether it looks leptokurtic or platykurtic.

Slide 28

The equations for skewness and kurtosis on the previous slides were for a population of data. Estimates of the population skewness or excess kurtosis from sample data can be biased, so the equations shown in this slide are better. For the skewness we have two options for equations to use; for the excess kurtosis we have one. These allow our sample data to estimate the population values in a way that is not consistently too large or consistently too small.

Slide 29

What's the application of the skewness and kurtosis? They are rarely studied for their own sake; they are usually calculated to see whether the distribution is normal, which is typically what we want because most statistical tests require normal distributions for our data sets.
So what we usually really want is for the skewness to be zero and the excess kurtosis to be zero, so that we can be confident our statistical tests will work.

Slide 30

So here's a summary of our four different location statistics, our six different spread statistics, and our two different shape statistics.
The mean, median, and standard deviation are often used when describing data. The range, IQR, and coefficient of variation are also sometimes used, but not as much.
And then for doing statistical testing, most statistical tests work with the means if they're interested in location, and variances if they're interested in spreads.
And then the skewness and kurtosis are often prerequisites for doing our tests.

Zoom out.

So there you have it - that's an overview of the basic statistics that are calculated for a population of data or, much more commonly, for a sample of data that we're trying to use to estimate the values for a population.
There's a companion video to this one linked to in just a second that goes through a step-by-step example using a data set and calculates all the values shown.

End screen.

Click at the top to subscribe to the channel. On the left is a link to the StatsExamples playlist and on the right is the video I mentioned that walks through calculations of all the statistics discussed in this video.



This information is intended for the greater good; please use statistics responsibly.