SUMMARY STATISTICS (EXAMPLES)





Connect with StatsExamples here



LINK TO SUMMARY SLIDES FOR VIDEO:


StatsExamples-summary-statistics-examples.pdf

TRANSCRIPT OF VIDEO:



Slide 1

Let's look at some step-by-step examples of calculating summary statistics for the location, spread, and shape of a data set.

Slide 2

When we're estimating data values, we usually care about the parameters of a population, that is, all the individuals of interest. But we usually can't measure them all, so more often we measure the values for a subset and calculate the statistics of a sample to estimate those population parameters.

Slide 3

And what are the statistics that we calculate? They're the statistics of location, spread, and shape, and you can see there are four location statistics, six spread statistics, and two shape statistics. And we'll actually use slightly different equations for some of those because it can vary depending on whether it's a sample or a population that we're measuring. See the companion video linked below and at the end of this video for more detail about what these values represent and why we use these different equations for samples and populations.

Slide 4

Okay, let's look at our data set here. This will be the number of birds in eight different trees. So we've measured eight trees and counted the number of birds, and we've gotten the values of one, four, five, seven, three, nine, seven, and twelve. The best order to calculate our location statistics will be the median, the mode, the mid-range, and then the mean.

Slide 5

Our first step will be to reorder all the values from smallest to largest. So we're going to go to our original data set and reorder those values.

Slide 6

Now that we have an ordered data set we can calculate the median, mode, and mid-range quite easily.
The median will be the value in the middle. This data set has an even number of values, n equals eight, so we find the middle two values and take their mean. So that's the five and the seven, five plus seven equals twelve, divided by two gives us a median of six.
For the mode, since the data is ordered we can look for the longest string of consecutive identical values. In this data set there's only one value that occurs more than once, and that's the seven which occurs twice, so that's our mode.
For the mid-range, this is halfway between the smallest and largest value, the mean of those two values. Since we have these ordered we can look on the far left, the one is the smallest, on the far right the twelve is the largest. One plus twelve, divided by two, is six and a half.
So our median is six, our mode is seven, and our mid-range is six and a half.

Slide 7

Our final location statistic is the mean and that's the sum of all the values divided by the number of values, which is n equals eight.
So the mean will be one, plus three, plus four, plus five, plus seven, plus seven, plus nine, plus 12 is 48, divided by eight gives us six.
So now we can look at our four location statistics. The median is six, mid-range is six and a half, mode is seven, mean is 6 and we can see that they're all similar which suggests there's no extreme asymmetry or outliers. And if we look at this figure of the values on the number line, we can see that that's the case.
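
The four location statistics above can be double-checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the video.

```python
from statistics import mean, median, multimode

birds = [1, 4, 5, 7, 3, 9, 7, 12]   # birds counted in each of the eight trees
ordered = sorted(birds)             # reorder from smallest to largest

data_median = median(ordered)                 # mean of the two middle values when n is even
data_modes = multimode(ordered)               # list of the most frequent value(s)
mid_range = (ordered[0] + ordered[-1]) / 2    # halfway between the smallest and largest
data_mean = mean(ordered)                     # sum of the values divided by n

print(data_median, data_modes, mid_range, data_mean)
```

`multimode` returns a list because a data set can have more than one mode; here it holds only the seven.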

Slide 8

Now for our spread statistics and again we'll use our ordered data set because it makes a couple of these easy to calculate.
First our range. The range is the difference between the largest and the smallest so in this case it's going to be 12 minus 1 is 11. That tells us how far apart the minimum and maximum values are.
Then the interquartile range which is the difference between Q1 and Q3, which is the first quartile and the third quartile. So before we can calculate that we have to calculate the quartiles first.
And the way we do this is in three steps.
First we divide the data set into a bottom and top half.
Then, we compute the medians for each of those halves, that will be Q1 and Q3.
Then, we compute the difference between the value of Q3 and the value of Q1.

Slide 9

So step one is to divide our data set into bottom and top halves.
Once we've done that, we can compute the medians for each half. So in the bottom half we have the values of one, three, four, and five. Q1, which is the median of the bottom half, will be the mean of three and four: three plus four is seven, divided by two, is three and a half. Q3 will be the median of the top half, which has the seven, seven, nine, and twelve, so that's going to be 7 plus 9 is 16, divided by 2, is 8 for Q3.

Slide 10

The IQR is going to be the difference between Q3 and Q1 so that's going to be 8 minus 3.5 is 4.5.
And if we look at a figure of our data here we can see that the IQR is the width of the middle 50 percent of the data. So we can see that region of width 4 and a half contains 4 out of our 8 values.
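
The three steps for the interquartile range can be sketched in code as follows. This assumes the split-into-halves method described above for an even number of values; the names `q1`, `q3`, and `inside` are illustrative.

```python
from statistics import median

ordered = sorted([1, 4, 5, 7, 3, 9, 7, 12])
half = len(ordered) // 2           # n is even here, so both halves have four values

q1 = median(ordered[:half])        # median of the bottom half: 1, 3, 4, 5
q3 = median(ordered[half:])        # median of the top half: 7, 7, 9, 12
iqr = q3 - q1

# Count how many observations fall between Q1 and Q3
inside = sum(q1 <= x <= q3 for x in ordered)
print(q1, q3, iqr, inside)
```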

Slide 11

Before we proceed with calculating the rest of our spread statistics, let's look in a little more detail at calculating quartiles.
If the data set has an odd number of values, the middle value is the median and there are actually three different options for calculating the quartiles. It's a bit more complicated than if we have an even number of values as in the previous example.
Option one, we would include the median in both the bottom and top half of the data when we calculate Q1 and Q3.
Option two, we do not include the median in either the bottom or the top half of the data.
And then option three, we would calculate Q1 and Q3 using a weighted average of the data points in the top and the bottom half.

Slide 12

Now for option one in our example here, where the data set has nine values and a median of six, the bottom half of the data includes the one, three, four, five, and six, so Q1 is the median of that set of data.
So that's going to be the four, the value in the middle.
The top half of the data has the six, seven, seven, nine, and twelve, so Q3 will be the middle value there, the median, which is a 7. So now our IQR would be 7 minus 4 is 3.

Slide 13

Option 2 for calculating quartiles when we have an odd number of values would be to not include the median in either the bottom or top half. So we have our ordered data set and then we divide it into two halves but we do not include that six in either half.
Now the bottom half is a one, three, four, and five. We compute the median of that which is going to be three plus four, divided by two, seven divided by two, is three and a half.
In the top half we have the seven, seven, nine, and twelve so we calculate Q3 as seven plus nine is 16, divided by two, is eight.
And the IQR is then eight minus three and a half, gives us four and a half.

Slide 14

Option three for calculating our quartiles with an odd number of values involves using a weighted average of the data points.
And the first step is to see whether the equation four n plus one, or four n plus three, is more appropriate to match the size of our data set. For the example we're looking at here, with 9 values, you can see that 4 times 2 plus 1 can equal 9, but there's no whole number where 4 times that number plus 3 would equal 9. So we'll be using the first set of equations, for 4n plus 1 values.
For 4n plus 1 values, Q1 will be 0.25 times the nth value plus 0.75 times the (n plus 1)th value. Here n is equal to 2, so that's going to be 0.25 times the second value, which is the 3, plus 0.75 times the third value, which is that 4, and that will give us Q1.
And then Q3 will be 0.75 times the (3n plus 1)th value, plus 0.25 times the (3n plus 2)th value, so that would be 0.75 times the 7 and 0.25 times the 9, the 7th and 8th values in our data set.
So when we plug those values in, Q1 will be 0.25 times 3 plus 0.75 times 4 gives us 3.75. Q3 will be 0.75 times 7 plus 0.25 times 9 gives us 7.5. So then the IQR will be 7.5 minus 3.75 gives us 3.75.
These three different options give us three different values for the IQR so we should always be clear about which of these options we're using when we calculate the IQR and report it for our data set.
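
To compare the three options side by side, here is a minimal sketch for the nine-value data set used in this part of the transcript. The helper `nth` and the variable names are illustrative, and option three is written only for the 4n plus 1 case described above.

```python
from statistics import median

ordered = [1, 3, 4, 5, 6, 7, 7, 9, 12]   # nine values, already sorted
mid = len(ordered) // 2                  # index of the median (the 6)

# Option 1: the median is included in both halves
iqr_option1 = median(ordered[mid:]) - median(ordered[:mid + 1])

# Option 2: the median is included in neither half
iqr_option2 = median(ordered[mid + 1:]) - median(ordered[:mid])

# Option 3: weighted average for a data set with 4n + 1 values
n = (len(ordered) - 1) // 4              # n = 2 when there are nine values


def nth(k):
    """Return the k-th ordered value, counting from 1 as in the transcript."""
    return ordered[k - 1]


q1 = 0.25 * nth(n) + 0.75 * nth(n + 1)
q3 = 0.75 * nth(3 * n + 1) + 0.25 * nth(3 * n + 2)
iqr_option3 = q3 - q1

print(iqr_option1, iqr_option2, iqr_option3)
```

Statistical software typically picks one of several quartile conventions by default, so it is worth checking which one a given tool uses before comparing IQR values.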

Slide 15

Returning back to the rest of our statistics for the spread of the data we'll now look at the sum of squares.
The sum of squares is the sum of the squared differences from the mean which we earlier calculated to be six. Sum of squares, you can see this summation here i equals one to eight, so from the first value to the eighth value for x sub i minus the mean squared. So we're looking at each of our values 1, 3, 4, 5, 7, 7, 9 and 12 subtracting 6, which is the mean, squaring, we're doing that separately for all eight of our values, and then adding those up. So we'll get 1 minus 6 squared, plus 3 minus 6 squared, plus 4 minus 6 squared, plus 5 minus 6 squared, plus 7 minus 6 squared, plus 7 minus 6 squared, plus 9 minus 6 squared, plus 12 minus 6 squared, that gives us those values you see there.
And then we sum those values up and we get a sum of squares of 86 for this data set.
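
The sum of squares can be reproduced term by term. This is a small sketch with illustrative variable names.

```python
birds = [1, 3, 4, 5, 7, 7, 9, 12]
mean = sum(birds) / len(birds)                 # 48 / 8

# Squared difference from the mean for each of the eight values
squared_deviations = [(x - mean) ** 2 for x in birds]
sum_of_squares = sum(squared_deviations)

print(squared_deviations)
print(sum_of_squares)
```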

Slide 16

The three other values of the spread are all easy to calculate once we have the sum of squares. The variance, standard deviation, and coefficient of variation all begin with the sum of squares.
The variance is just going to be the sum of squares divided by n if it's a population, or n minus 1 if it's a sample. And we'll use the symbols sigma squared or s squared to represent the population variance or the sample variance. So those values would be 86 divided by 8 to give us 10.75, or 86 divided by 7 to give us 12.286.
The standard deviation is the square root of the variance, so for the population sigma would be the square root of 10.75 gives us 3.279. s would be the square root of the sample variance, so that's the square root of 12.286 gives us 3.505.
And then the coefficient of variation is the standard deviation divided by the mean multiplied by 100 to make it into a percentage. So for the population that would be 3.279 divided by 6, that's the mean, times 100 gives us 54.65 percent. For the sample that would be 3.505, divided by 6 times 100, gives us 58.42 percent. These equations are all basically exactly the same, the only difference is we have to know whether we're dealing with a population or a sample to decide how to calculate the variance.
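
The population and sample versions of these three statistics can be computed together. The sketch below leans on Python's standard `statistics` module, where the `p`-prefixed functions divide by n and the others divide by n minus 1; the variable names are illustrative.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

birds = [1, 3, 4, 5, 7, 7, 9, 12]
m = mean(birds)

pop_var = pvariance(birds)     # sum of squares / n
samp_var = variance(birds)     # sum of squares / (n - 1)
pop_sd = pstdev(birds)         # square root of the population variance
samp_sd = stdev(birds)         # square root of the sample variance
pop_cv = pop_sd / m * 100      # coefficient of variation, as a percentage
samp_cv = samp_sd / m * 100

print(round(pop_var, 3), round(pop_sd, 3), round(pop_cv, 2))
print(round(samp_var, 3), round(samp_sd, 3), round(samp_cv, 2))
```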

Slide 17

A useful property of the standard deviation is that for normal distributions approximately two-thirds (about 68 percent) of the values are within one standard deviation of the mean and approximately 95 percent of the values are within two standard deviations of the mean.
So if we go 3.279 or 3.505 above and below the 6, that region should hold about two-thirds of our data, and we can see that it does: it holds 6 out of the 8 values.
In this way the standard deviation is similar to the IQR in that it gives us a region within which we expect a certain percentage of our data.
Now remember this property of the standard deviation will only hold if our data is normally distributed. So one of the things we want to know is whether our data is skewed or non-normal which is what our next two statistics are for.
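
As a quick check of the claim about one standard deviation, the following sketch counts the values inside the interval for both versions of the standard deviation. The function `count_within` is a hypothetical helper written for this illustration.

```python
from statistics import mean, pstdev, stdev

birds = [1, 3, 4, 5, 7, 7, 9, 12]
m = mean(birds)


def count_within(values, center, sd, k=1):
    """Count the values lying within k standard deviations of the center."""
    return sum(center - k * sd <= x <= center + k * sd for x in values)


within_pop = count_within(birds, m, pstdev(birds))    # interval 6 +/- 3.279
within_samp = count_within(birds, m, stdev(birds))    # interval 6 +/- 3.505

print(within_pop, within_samp, len(birds))
```

With only eight values the proportion can only move in steps of one eighth, so a close match to the normal-distribution percentages should not be expected.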

Slide 18

The final set of statistics we're interested in are ones that describe the shape of the distribution. We calculate a value called the skewness, which represents the asymmetry of our data values, and the kurtosis, which measures aspects of the shape in terms of the height of the peak relative to the height of the tails. The skewness and kurtosis are rarely studied for their own sake; they are usually just calculated to see if the distribution is normal. So we compare the skewness and the kurtosis values we get to the values for a normal distribution to see if they match.
And the equations on this slide are the ones used for a data set that is a population.

Slide 19

If the data set is a population, we use this equation for the skewness. We're going to look at the sum of every data value, minus the mean to the third power, divided by the number of data values and then we have to scale that for the overall variation, so we're going to divide that by the standard deviation to the third power. So you can see that we're plugging in the values here so n is equal to eight, our standard deviation is the 3.279, which is our standard deviation for the population. We can bring those values out in front of the sum, so we'll just have 1 divided by 8, divided by the standard deviation to the third, times the sum of every value minus the mean to the third power. I won't read off all these numbers, but you can pause the video here if you want to see exactly how they go in, but we would plug all those numbers in and eventually we would get 84 divided by 282.042 gives us a skewness of 0.298.

Slide 20

For the kurtosis we have almost exactly the same equation, but now it's the sum of each value minus the mean to the fourth power, divided by the number of values. And then again, in order to subtract out the spread, we're dividing that by the standard deviation to the fourth power.
So again we can pull those values out in front of the summation, 1 divided by 8 times the standard deviation to the 4th power, times the sum of each value minus the mean to the fourth power. Plug all those values in and we end up getting a kurtosis of 2.229.
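
The population skewness and kurtosis described on these two slides can be evaluated directly from the eight values. This is a minimal sketch with illustrative variable names. Evaluated at full precision, the skewness agrees with the 0.298 above, while the kurtosis comes out closer to 2.27 than to the 2.229 quoted, so that figure may be worth rechecking against the slides.

```python
birds = [1, 3, 4, 5, 7, 7, 9, 12]
n = len(birds)
mean = sum(birds) / n

sum_cubes = sum((x - mean) ** 3 for x in birds)       # sum of cubed deviations
sum_fourths = sum((x - mean) ** 4 for x in birds)     # sum of fourth-power deviations
sigma = (sum((x - mean) ** 2 for x in birds) / n) ** 0.5   # population standard deviation

skewness = sum_cubes / (n * sigma ** 3)
kurtosis = sum_fourths / (n * sigma ** 4)

print(sum_cubes, sum_fourths)
print(round(skewness, 3), round(kurtosis, 3))
```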

Slide 21

If the dataset is a sample there are different equations to use. For the skewness there are two other equations that we can use. For the excess kurtosis there's a more complicated equation that we would have to use.
But we can use the sums of the cubes and the fourth powers that we calculated from the previous slides so the 84 and the 2102.

Slide 22

So if our data set is a sample and we're calculating the skewness we can use two different equations.
For the first equation, it's the same as the skewness for the population except we divide by the standard deviation for the sample to the third power. That gives us a skewness of 0.244.
For the second equation we have a coefficient in front of our population skewness equation, that's that square root of n times n minus 1 divided by n minus 2. When we plug all those numbers in, so square root of 8 times 7 all divided by 6, and then that's multiplied by this population skewness, we would get 0.371.
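
Both sample skewness estimators can be sketched as below, following the two equations as they are described above. The names `skew_first` and `skew_second` are illustrative.

```python
from math import sqrt

birds = [1, 3, 4, 5, 7, 7, 9, 12]
n = len(birds)
mean = sum(birds) / n

sum_squares = sum((x - mean) ** 2 for x in birds)
sum_cubes = sum((x - mean) ** 3 for x in birds)

sigma = sqrt(sum_squares / n)         # population standard deviation
s = sqrt(sum_squares / (n - 1))       # sample standard deviation

# First equation: population form, but scaled by the sample standard deviation
skew_first = sum_cubes / (n * s ** 3)

# Second equation: population skewness times sqrt(n(n - 1)) / (n - 2)
skew_population = sum_cubes / (n * sigma ** 3)
skew_second = sqrt(n * (n - 1)) / (n - 2) * skew_population

print(round(skew_first, 3), round(skew_second, 3))
```

Because the script carries full precision rather than rounded intermediate values, the last digit can differ slightly from the figures read out in the transcript.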

Slide 23

For the sample excess kurtosis we have that larger equation, but if you look in the middle part of it we have that summation of the fourth powers, divided by n, divided by the sample standard deviation to the fourth power. That's very similar to the population kurtosis just using the sample standard deviation instead of the population standard deviation.
And then we have a leading coefficient which is n plus 1 times n divided by n minus 1 times n minus 2. So that's just plugging in the number of values in our data set.
And then at the end we have 3 times n minus 1 squared, divided by n minus 2 times n minus 3. So again that's just plugging in the number of data values. So we just have to be careful and go slow and plug things in. So 9 times 8 divided by 7 times 6 our 2102, divided by eight, then sample standard deviation to the fourth power, minus three times seven squared, divided by six divided by five gives us negative 1.915.
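
The arithmetic just read out can be followed step by step in code. This sketch mirrors the three pieces exactly as they are described in the transcript (leading coefficient, middle term, trailing term); it is not taken from a library, and statistical packages may use a different sample estimator, so their results need not match. The variable names are illustrative.

```python
birds = [1, 3, 4, 5, 7, 7, 9, 12]
n = len(birds)
mean = sum(birds) / n

sum_fourths = sum((x - mean) ** 4 for x in birds)            # 2102
s_squared = sum((x - mean) ** 2 for x in birds) / (n - 1)    # sample variance, 86 / 7
s_fourth = s_squared ** 2                                     # sample standard deviation to the fourth power

leading = (n + 1) * n / ((n - 1) * (n - 2))          # 9 times 8, divided by 7 times 6
middle = (sum_fourths / n) / s_fourth                # 2102 divided by 8, divided by s to the fourth
trailing = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))    # 3 times 7 squared, divided by 6 times 5

sample_excess_kurtosis = leading * middle - trailing
print(round(sample_excess_kurtosis, 3))
```

At full precision the result is about negative 1.92, in line with the value above; the last digit depends on how much the sample standard deviation is rounded beforehand.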

Slide 24

Okay, to recap for our skewness and kurtosis values: if the data set is a population we get a skewness of 0.298, a kurtosis of 2.229, and an excess kurtosis of negative 0.771. The excess kurtosis is just the regular kurtosis minus 3.
If our data set instead was a sample, we would have two different skewnesses that we could calculate depending on which of the equations we used, either the 0.244 or the 0.371. Neither one is absolutely correct; they are two different estimators of the true population skewness. So we would probably want to calculate both of them and have a fairly good confidence that our population skewness is most likely somewhere between those two. And then for the excess kurtosis value we get negative 1.915.
So for these values, a positive skewness indicates that the data set is right skewed, and a negative excess kurtosis indicates the data set is platykurtic.

Slide 25

Now let's take a look at all of our statistics in relation to a figure of our data.
So our four location statistics are all pretty similar, a six, a six, a six and a half, and a seven, and they're all representing the general location or the typical average value of the data set.
Then our spread values, the range, interquartile range, and sum of squares, are the same whether or not this is a population or a sample, but then the variances, standard deviations, coefficients of variation, skewnesses, and kurtosis values differ depending on whether this is a population or a sample.

Zoom out

Hopefully this video can help you put everything together so that you can calculate the basic location, spread, and shape statistics for any population or sample that you're interested in.

End screen

This video calculated a bunch of statistics but there's a companion video that describes more about what the statistics represent that's linked below on the right.




This information is intended for the greater good; please use statistics responsibly.