CONFIDENCE INTERVALS FOR THE POPULATION MEAN (INTRODUCTION)



The text below is a transcript of the video.



Connect with StatsExamples here



LINK TO SUMMARY SLIDE FOR VIDEO:


StatsExamples-confidence-intervals-intro.pdf

TRANSCRIPT OF VIDEO:


Slide 1

Confidence intervals are an extremely important method that we use to estimate the mean of a population we're interested in. Let's take a look at what they represent and how to calculate them.

Slide 2

It is often the case in statistics that we want to know the mean of a population. We can't measure every individual in the population, so we have to take a random sample from that population and calculate the mean of that sample to estimate the mean of the population.
The sample mean is an estimate of the population mean, but sampling error makes it inexact. The mean of the sample will almost never be exactly the same as the mean of the population. So what can we say about the likely population mean, based on the sample mean?
We know it is probably close, but we also know there's some chance our sample could be very inaccurate. How do we estimate that inaccuracy and create a range where we think the population mean probably is?

Slide 3

Luckily, there is a mathematical result called the central limit theorem. This theorem states that the distribution of sample means from a population will be approximately normal when the samples are reasonably large, no matter what the population looks like, as long as the population has a finite variance.
So if we took a series of samples from the population illustrated at the top and looked at what the values of the means of those samples are, they would form a normal distribution centered around the population mean.
Most of the sample means would be close to the mean of the population, but a few would be further away. The central limit theorem tells us how many would be close, how many would be far, and in what proportions.
The width of this distribution of sample means is based on the variance in the population and the sample size of each sample. The more variance in the population, the wider the distribution of sample means. The more values in each sample, the more narrow the distribution of sample means.
If we look at this distribution of sample means, any individual sample mean comes from a normal distribution. And the most likely result is that it comes from the middle of the distribution which corresponds to the mean of the population.
Sometimes, the mean of a particular sample would not be close to the population mean.
Only rarely would the mean of a sample be far away from the population mean.
The central limit theorem allows us to calculate how close any particular sample mean is likely to be to the true population mean.
How close the sample mean is to the population mean is influenced by the variance in the population and the sample size.
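The behavior described on this slide can be checked with a small simulation. This is only a sketch under assumptions that are not in the video: the population is a made-up skewed one (exponential, with mean 1 and standard deviation 1), each sample has 30 values, and the function names are hypothetical.

```python
import random
import statistics

def simulate_sample_means(draw, sample_size, num_samples, seed=0):
    """Return the means of many random samples, each drawn with draw(rng)."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [draw(rng) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

# Hypothetical skewed population: exponential with mean 1 and sd 1.
def draw_exponential(rng):
    return rng.expovariate(1.0)

sample_size = 30
means = simulate_sample_means(draw_exponential, sample_size, num_samples=5000)

center = statistics.fmean(means)   # should sit near the population mean, 1
spread = statistics.stdev(means)   # should sit near 1 / sqrt(30)
# Fraction of sample means that fall within 1.96 spreads of the population mean.
within = sum(abs(m - 1.0) <= 1.96 * spread for m in means) / len(means)

print(f"mean of sample means:   {center:.3f}")
print(f"spread of sample means: {spread:.3f} (theory: {1 / sample_size ** 0.5:.3f})")
print(f"fraction within 1.96 spreads of the population mean: {within:.3f}")
```

Even though the population here is far from normal, the sample means should cluster around the population mean, with roughly 95% of them within 1.96 spreads of it.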

Slide 4

We can use the central limit theorem to estimate where the population mean probably is by thinking about the normal distribution of our sample means.
The distribution of those sample means is centered around the population mean, and the middle 95% of that distribution is what we would expect to see 95% of the time when we look at our samples.
The logic goes both ways. We can start with the mean of one sample and use our knowledge of the width of the distribution of sample means to create a window around that sample mean that will probably include the population mean.
As an example, if we identify the middle 95% of a normal distribution around our sample mean, an interval built this way will include the population mean for 95% of the samples we could take. We call this region a 95% confidence interval because it indicates where we are confident the population mean is.
Note that the width of that confidence interval will be based on the variance in the population and the number of values in the samples. The standard deviation of that distribution of sample means is called the standard error.

Slide 5

An important pair of terms to keep straight is the standard error and the standard deviation. These two values measure different things.
When we think about a distribution of values, like a population of values or a sample of values, we often describe them as normally distributed around a mean with some particular variance. The standard deviation describes the spread of the data values in that population or sample.
Our distribution of sample means is also normally distributed, but its variance is the variance of the population divided by the sample size. If we think about the standard deviation of that set of sample means, that would be the standard deviation of the population divided by the square root of the sample size. This value is called the standard error.
The standard error does not describe the spread of the data in the population or sample. It describes the spread of the sample means and, likewise, the spread of plausible values for the population mean.
The terms standard deviation and standard error sound very similar, but they measure completely different things. The first one measures the spread of data values in a population or sample; the second measures the spread of sample means taken from a population.
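To make the distinction concrete, here is a minimal sketch that computes both values for one sample. The data values are made up purely for illustration and the function name is hypothetical; the standard error is estimated as the sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics

def standard_error(values):
    """Estimated standard error of the mean: sample sd / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical sample, invented only to illustrate the two quantities.
sample = [12.1, 15.3, 9.8, 14.6, 17.2, 13.9, 16.4, 11.7]

sd = statistics.stdev(sample)   # spread of the individual data values
se = standard_error(sample)     # spread expected for means of samples this size

print(f"standard deviation: {sd:.2f}")
print(f"standard error:     {se:.2f}")
```

The standard error is always smaller than the standard deviation for samples of more than one value, and it shrinks as the sample size grows.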

Slide 6

So, how do we calculate the confidence interval for the population mean using the data from one of our samples?
First, we take a sample of N values.
Then we calculate the mean, variance, and standard deviation of that sample.
Then we get the middle region of the standard normal distribution corresponding to the degree of confidence desired. If we want to be 95% confident of where the population mean is, we want to know how many standard deviations wide the middle 95% of the standard normal distribution is. If we wanted to be 99% confident of where the population mean is, we would want the middle 99% of the standard normal distribution, in terms of standard deviations.
Standard deviations within our distribution of sample means are really standard errors. Therefore, the width of this region in the standard normal distribution tells us the width of the confidence interval in terms of the number of standard errors above and below the sample mean.
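In symbols, the recipe on this slide can be written as follows. The notation is an assumption, since the video gives no formulas: $\bar{x}$ is the sample mean, $\sigma$ is the standard deviation, $n$ is the sample size, and $z^{*}$ is the number of standard errors that bounds the chosen middle region of the standard normal distribution (about 1.96 for 95% confidence).

```latex
\[
\text{confidence interval} \;=\; \bar{x} \,\pm\, z^{*} \cdot \frac{\sigma}{\sqrt{n}}
\]
```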

Slide 7

For example, if we take a sample of 16 values from a population and they have a mean of 15, and we know the population standard deviation is 5, we can calculate our confidence interval.
Using the properties of a normal distribution, the middle 95% region of that normal distribution would be 1.96 standard errors above and below the sample mean.
The standard error is 5 divided by the square root of 16, which is 1.25, so 1.96 standard errors is 2.45. Plugging these numbers in gives us a result of 15 plus or minus 2.45, which gives us a range for our 95% confidence interval of 12.55 to 17.45.
Based on this we can say that we have 95% confidence that the population mean we took this sample from is between 12.55 and 17.45. Strictly speaking, the 95% describes the method: intervals built this way include the population mean for 95% of the samples we could take.
Note that the interval can miss in two directions. About 2.5% of the time the population mean will be larger than the upper limit, and about 2.5% of the time it will be smaller than the lower limit. That would occur when sampling error causes our sample mean to be smaller or larger than the true population mean just due to randomness.
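The arithmetic for this example can be sketched in a few lines. The function name is hypothetical; the only inputs are the numbers from the slide (a sample mean of 15, a population standard deviation of 5, and a sample size of 16).

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, population_sd, n, confidence=0.95):
    """Confidence interval for the mean when the population sd is known."""
    # Number of standard errors that bounds the middle `confidence` region
    # of the standard normal distribution (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = population_sd / math.sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(sample_mean=15, population_sd=5, n=16)
print(f"{low:.2f} to {high:.2f}")  # 12.55 to 17.45
```

Changing `confidence` to 0.99 widens the interval, because a larger middle region of the normal distribution needs more standard errors on each side.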

Slide 8

Unfortunately, in the real world we would never know the population variance if we didn't already know the population mean. And if we already knew the population mean, we wouldn't need to be doing statistics at all. We therefore estimate the population variance from the sample variance.
However, samples usually underestimate the actual variance of the population due to sampling error. Therefore, to adjust for this sampling error, we use the t distribution instead of the normal distribution. The t distribution is wider to account for the sampling error and the underestimate of the population variance that would result from it.
As shown in the figure here, the middle 95% of the t distribution we have to use will result in a slightly larger confidence interval than if we were able to use the normal distribution itself.

Slide 9

As illustrated here, the t distribution is wider to account for the sampling error, but as the sample size increases it narrows toward the normal distribution.
Because the imprecision due to the sampling error depends on the sample size, there is a different t distribution for each sample size. Each one is identified by its degrees of freedom, which is the sample size minus one.
In this figure we can see the normal distribution at the top, where the middle 95% of the distribution is within 1.96 standard errors of the mean.
The bottom figure is the t distribution for 6 degrees of freedom, which is a sample size of 7 values. In this situation the middle 95% of the distribution is within 2.447 standard errors of the mean.
The middle figure shows a larger sample size, 11 degrees of freedom, which is a sample size of 12, and we can see that now the middle 95% of that t distribution is 2.201 standard errors above and below the mean.
As the sample size increases, the t distribution narrows to eventually become the normal distribution for an infinite sample size.
In the same way that there are Z tables of areas for the normal distribution, there are t tables that can be used to calculate these areas.
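For readers who want to see where the table values come from, here is a rough sketch that finds them numerically using only the Python standard library. It is an illustration, not how printed tables are produced: it integrates the t density with Simpson's rule and searches for the cutoff by bisection, and all function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def area_from_zero(x, df, steps=2000):
    """Area under the t density between 0 and x, by Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return total * h / 3

def t_critical(df, confidence=0.95):
    """Distance from 0, in standard errors, bounding the middle `confidence` area."""
    target = confidence / 2      # area between 0 and the cutoff on one side
    low, high = 0.0, 1000.0
    for _ in range(50):          # bisection on the cutoff
        mid = (low + high) / 2
        if area_from_zero(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

for df in (6, 11, 15, 25, 1000):
    print(f"df = {df:4d}: middle 95% is within {t_critical(df):.3f} standard errors")
```

The results for 6 and 11 degrees of freedom should land close to the 2.447 and 2.201 quoted above, and the value for a very large number of degrees of freedom should approach the 1.96 of the normal distribution.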

Slide 10

Here we see a comparison of the most common types of Z table and t table. The Z table on the left shows the areas to the left of particular points on the X axis, which represent various Z values.
Z tables usually describe the area to the left which allows them to be versatile for a variety of different types of calculations.
The numbers in the table are areas.
In contrast, t tables usually describe the location on the X axis that corresponds to different areas for each column. The areas specified are usually the ones on the outside of the middle region.
The numbers in the table are distances from the center of the t distribution, in terms of standard errors.
The t tables do this because they are mainly used for determining confidence intervals, which we do by specifying the area outside of the middle region we're interested in.
Rather than have a separate t table for every different degree of freedom and all the detailed areas, t tables usually have one row for each degree of freedom and highlight a small number of areas in the columns.
The most common table looks like the one illustrated, which just shows the area to the right of a particular distance. Doubling that area gives the total area outside the middle region, and the middle region itself runs from that distance below 0 to that distance above 0.
An example will make this more clear.

Slide 11

Let's use the t distribution to figure out the width of the 95% confidence interval if we take a sample of 16 values and get a mean of 15 and a sample standard deviation of 5.
First, a sample size of 16 corresponds to a degrees of freedom value of 15.

We would then go to our t table and identify the row that corresponds to 15 degrees of freedom.
To have a region in the t distribution with 95% in the middle, that would mean an area of alpha equals 0.025 on each side. It's those alpha values that correspond to the columns in our t table.
We therefore go to the column corresponding to alpha equals 0.025.
We read down that column until we get to the row for 15 degrees of freedom, and the value is 2.131.
That tells us that if we have a t distribution with 15 degrees of freedom, in order to have a middle region with 95% of the area, and two and a half percent above and below that middle region, the region would need to go 2.131 standard errors above and below the mean.
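The table lookup described on this slide can be mimicked with a small dictionary. This is a sketch with a three-row excerpt of a standard one-tailed t table; the names `T_TABLE` and `t_multiplier` are hypothetical, and only the rows for 6, 11, and 15 degrees of freedom are included.

```python
# Excerpt of a one-tailed t table: rows are degrees of freedom, columns are
# the area (alpha) to the right of the listed value.
T_TABLE = {
    6:  {0.10: 1.440, 0.05: 1.943, 0.025: 2.447, 0.01: 3.143, 0.005: 3.707},
    11: {0.10: 1.363, 0.05: 1.796, 0.025: 2.201, 0.01: 2.718, 0.005: 3.106},
    15: {0.10: 1.341, 0.05: 1.753, 0.025: 2.131, 0.01: 2.602, 0.005: 2.947},
}

def t_multiplier(sample_size, confidence):
    """Look up how many standard errors bound the middle `confidence` region."""
    df = sample_size - 1                      # row: degrees of freedom
    alpha = round((1 - confidence) / 2, 4)    # column: area in ONE tail
    return T_TABLE[df][alpha]

print(t_multiplier(sample_size=16, confidence=0.95))  # 2.131
```

The key step is halving the area outside the middle region: a 95% interval leaves 5% outside, so the column to read is alpha equals 0.025.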

Slide 12

This figure illustrates how wide the regions would have to be for a variety of different alpha values, which correspond to different regions in the center of the distribution. And it's the center of the distribution that gives us our confidence interval.
To get our 95% confidence interval we have to go 2.131 standard errors above and below the sample mean.
If we wanted to be more confident and have something like a 99% confidence interval, we would have to go 2.947 standard errors above and below the mean.
If we were content to be less confident and use something like an 80% confidence interval, we would only have to go 1.341 standard errors above and below the mean.
By far the most commonly used degree of confidence for confidence intervals is 95%, but it is possible to calculate these others if you want to.

Slide 13

Back to our example where we took a sample of 16 values and obtained a sample mean of 15 and a sample standard deviation of 5.
The standard error is 5 divided by the square root of 16, which is 1.25. The 95% confidence interval would be the sample mean plus or minus 2.131 standard errors, which would give us 15 plus or minus 2.66, for a 95% confidence interval of 12.34 up to 17.66.
Based on our data we would be 95% confident that the true population mean is somewhere between 12.34 and 17.66. Intervals built this way miss the population mean only 5% of the time.
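The same calculation can be sketched in code. The function name is hypothetical, and the multiplier 2.131 is taken from the t table row for 15 degrees of freedom rather than computed.

```python
import math

def t_confidence_interval(sample_mean, sample_sd, n, t_multiplier):
    """Confidence interval for the mean using a multiplier read from a t table."""
    standard_error = sample_sd / math.sqrt(n)   # 5 / sqrt(16) = 1.25 here
    margin = t_multiplier * standard_error
    return sample_mean - margin, sample_mean + margin

low, high = t_confidence_interval(sample_mean=15, sample_sd=5, n=16,
                                  t_multiplier=2.131)
print(f"{low:.2f} to {high:.2f}")  # 12.34 to 17.66
```

Passing 2.947 or 1.341 as the multiplier instead gives the 99% or 80% interval for the same sample.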

Slide 14

Let's compare our two different confidence intervals, one using the normal distribution and one using the t distribution. For both of these we have a sample mean of 15, and the standard deviation value of either the population or sample is 5.
If we knew the population standard deviation and used the normal distribution to create our confidence interval, it's 1.96 standard errors above and below the sample mean.
If we have to estimate the population standard deviation from the sample standard deviation and use the t distribution to create our confidence interval, it's 2.131 standard errors above and below the sample mean.
In the real world the first situation is unrealistic, because how could we know the population standard deviation if we didn't know its mean? Therefore, the t distribution is what we should always use when calculating confidence intervals.
For example, if we had 25 degrees of freedom, the width of our 95% confidence interval from the t distribution would be 2.06 standard errors versus 1.96 from the normal distribution. This is a difference of about 5%, which means that if we used the normal distribution instead of the t distribution, our confidence interval would be about 5% too narrow.
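The size of that gap for different sample sizes can be sketched directly from the multipliers quoted in this transcript. The names below are hypothetical, and the dictionary holds only the four t values mentioned in the video.

```python
Z_95 = 1.96  # normal-distribution multiplier for a 95% interval

# 95% t multipliers quoted in the transcript, keyed by degrees of freedom.
T_95 = {6: 2.447, 11: 2.201, 15: 2.131, 25: 2.060}

def percent_wider(t_value, z_value=Z_95):
    """How much wider (in percent) the t-based interval is than the z-based one."""
    return (t_value / z_value - 1) * 100

for df, t_value in T_95.items():
    print(f"df = {df:2d}: t interval is {percent_wider(t_value):.1f}% wider")
```

The gap shrinks as the degrees of freedom grow, which is the narrowing of the t distribution toward the normal distribution described earlier.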

Slide 15

Keep in mind the purpose of all of this. The purpose of confidence intervals is to estimate the value of the population mean from a sample mean.
If we knew the population mean and standard deviation, we wouldn't need confidence intervals or statistics at all, we would have our answer.
If we have a sample mean and we know the population standard deviation, we could use the normal distribution to calculate our confidence interval, but this is an unrealistic situation.
In the real world, confidence intervals are used when we know the sample mean and sample standard deviation, in which case we use the t distribution to calculate the confidence intervals.
Since confidence intervals represent our confidence in the true value, you can think of them as real-world versions of significant figures. Lots of people learn about significant figures in science classes and how the number of digits that you report is an indication of how sure you are about your answer. Significant figures are really a shortcut, a casual way to represent the uncertainty in an answer we obtain. In the real world of science, it's confidence intervals that are used to indicate uncertainty, not significant figures.

Slide 16

So what are confidence intervals used for?
First, they're used in the way we've been talking about, as a descriptive statistic for our estimate of the population mean.
Second, the concept of confidence intervals is used as the basis for the t test, which is a test of whether two populations appear to have different means from one another. We can't just compare the means of our samples because sampling error would cause them to be different even if the populations had the exact same mean.
The way this test works is diagrammed conceptually here.
On the left is a situation where the confidence intervals for our two samples overlap which is what we would see if the means of the populations they were taken from were equal to each other.
On the right is the situation where the confidence intervals for our two samples don't overlap, which is what we would see if the means of the populations they were taken from were different from each other.
The details of t tests are slightly more complicated than what's shown here, and those details are described in another video on this channel and in this playlist if you're interested, but the fundamental concept of how that very common statistical test works comes from the idea of confidence intervals.

Zoom out

Confidence intervals are widely used in the reporting of data from experiments and scientific studies and they provide the foundation for one of the most common statistical tests.

End screen

Click to do all the usual YouTube things. You can also find a full collection of resources, articles, and links to statistical videos at the website shown on the bottom of the screen.





This information is intended for the greater good; please use statistics responsibly.