# POWER ANALYSIS

## Watch this space

Detailed text explanation coming soon. In the meantime, enjoy our video.

The text below is a transcript of the video.

# Connect with StatsExamples here

### TRANSCRIPT OF VIDEO:

Slide 1.

When we perform statistical tests, we hope they'll be able to detect genuine differences or patterns in the data, without being misled by noise.
The power of a test to identify these patterns is therefore important for understanding and interpreting our results.
So, what determines how powerful a statistical test is?

Slide 2.

First, a quick review of when we do statistical tests. We often use sample data to test hypotheses about population data.
As depicted in the diagram - we take a sample from the overall population and then use that data to make conclusions about the overall population.
But sampling error, also known as noise, means that sometimes our samples are misleading. If the mean of the population is 20, sometimes our samples will have a mean of 19.
This sampling error leads to statistical errors when we make our conclusions about the population. If our sample was 19 we would make an error if we conclude the population mean is 19.
Note that these errors are not mistakes, mistakes would be doing the math wrong - like having a sample with a mean of 20 and miscalculating its mean as 19.
Mistakes can be avoided by doing the math right, but errors cannot be avoided because samples are never guaranteed to be perfect representations of the population they came from.

Slide 3.

There are two basic types of errors we can make when we use sample data to test a hypothesis about a population.
In the diagram to the right we've shown a standard way of thinking about this. We have a null hypothesis about reality that is either true or false.
When we collect our data and perform a statistical test - we will conclude that the null hypothesis is true, and accept it - or decide it is false and reject it. Technically we don't accept a null hypothesis, we fail to reject it, but let's go with these terms for now.
► Obviously, if the null hypothesis is true and we accept it - that is, fail to reject it - this is correct.
► Obviously, if the null hypothesis is false and we reject it this is correct.
► However, if the null hypothesis is true and we reject it then we've made an error which we call a type I error.
► If the null hypothesis is false and we fail to reject it, that is accept it, then we've made an error which we call a type II error.
The StatsExamples channel and website have a whole other video that goes into this in more detail, check that out if you're not 100% sure of what these two distinct types of errors are and what they represent.
► When doing statistics, we typically focus on the value alpha, which is the probability of rejecting a true null hypothesis. Choosing the alpha value for our test is a standard practice done all the time.
► Less commonly, we can think about beta, which is the probability of failing to reject a false null hypothesis. This can be very important however because it tells us what our chance of seeing a real pattern in the data is.

Slide 4.

Let's think about what alpha and beta are in more detail.
► Alpha is the probability of rejecting a true null hypothesis.
This is the risk of deciding that there is a real pattern of difference if there isn't one. This would be like testing a useless drug in some patients and deciding it was useful when it isn't.
Reducing this risk can be done by specifying a smaller p value threshold for our statistical tests. Usually this is set at 5%, but if we want to be more certain, there are easy ways to set the threshold at 1% or lower.
► Beta is the probability of failing to reject a false null hypothesis.
This is the risk of not seeing a real pattern or difference when there is one. This would be like testing a useful drug in some patients and deciding it wasn't useful when it was.
Estimating and reducing this risk is more complicated because it depends on many factors.
If we think of beta as the risk of missing a real pattern, then the value 1 minus beta is the power of a statistical test. As this power gets larger, we are less likely to miss seeing things that are really there.
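The meaning of alpha and beta can be made concrete with a quick simulation. The sketch below is a hypothetical illustration in Python (not part of the video): it draws repeated samples of size 10 from normal populations with variance 8 and counts how often a two-sample T test rejects a true null hypothesis (estimating alpha) or misses a real difference of 3 between the means (estimating beta).

```python
import numpy as np

rng = np.random.default_rng(0)

def two_sample_t(x, y):
    """Homoscedastic two-sample t statistic for equal sample sizes."""
    n = len(x)
    sp2 = (x.var(ddof=1) + y.var(ddof=1)) / 2  # pooled variance
    return (x.mean() - y.mean()) / np.sqrt(sp2 * 2 / n)

n, sims = 10, 20000
t_crit = 2.101  # two-tailed critical value for 18 degrees of freedom

# Alpha: both samples come from the same population, so the null is true.
false_rejects = sum(
    abs(two_sample_t(rng.normal(20, np.sqrt(8), n),
                     rng.normal(20, np.sqrt(8), n))) > t_crit
    for _ in range(sims))

# Beta: the population means truly differ by 3, so the null is false.
misses = sum(
    abs(two_sample_t(rng.normal(20, np.sqrt(8), n),
                     rng.normal(23, np.sqrt(8), n))) <= t_crit
    for _ in range(sims))

print(f"estimated alpha = {false_rejects / sims:.3f}")
print(f"estimated beta  = {misses / sims:.3f} (power = {1 - misses / sims:.3f})")
```

The estimated alpha comes out near the nominal 5%, while beta is substantial even though the means genuinely differ, which is exactly the risk power analysis quantifies.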

Slide 5.

Looking at this again, we can see that we can design studies to minimize both types of error.
Type I error can be controlled because we have our choice of the alpha value threshold.
Type II error can be controlled because we can estimate the value of beta to determine the risk of type II error.
1 minus beta is the power of our test, which tells us what we can or can't see when we look for something.
Obviously, we want the smallest value of beta possible and the largest statistical power. So what factors are important for maximizing the power of a statistical test?

Slide 6.

In general, the power of a test will be higher when four things are true.
► The power of a test is higher when the pattern or difference we're looking for is stronger or larger. It's easier to detect big differences when we look for them.
Incidentally, this can lead to a statistical bias called the Beavis effect, in which the patterns and differences we are able to detect tend to be large in magnitude, while a variety of genuine, but smaller, patterns and differences go undetected.
► The power of a test is higher when the variance in our data is small.
Some of the variance is due to genuine differences between the values in the population, but some may be due to inaccuracies caused by our measurement technique.
We can't do anything about the genuine variation in the population, but we should always ensure that we are using the best methods of measurement to minimize the extra noise as much as possible.
► The power of a statistical test is higher when the sample size is large. This is because larger samples have less sampling error.
► And finally the power of a statistical test is higher when it's a parametric test as opposed to a nonparametric test.
Parametric tests use more of the information from the sample, but are more prone to violations of the mathematical assumptions that they require.
Nonparametric tests are more robust to weird data distributions, but gain that robustness at the expense of discarding some of the information in the sample.
Let's take a look at each of these four factors using some two-tailed, two-sample homoscedastic T tests comparing the means of two populations.
If you don't know this statistical test well, you can check out our video on the two-sample T test on this channel.

Slide 7.

Consider two populations, each with a variance of 8, from which we take two samples of size 10 for a two-sample T test.
The critical value for 18 degrees of freedom and an alpha value of 0.025 is 2.101.
► First, let's look at when the population means differ by 3.
► The T calculated equation for a two-sample homoscedastic T test is the one shown. We'll be using this same equation for all our examples.
► If the population means differ by three, then our T calculated value will most likely have a difference of three in the numerator, while in the denominator the pooled variance of 8 and the sample sizes of 10 go into the equation as shown.
► Calculating this out gives us a T calculated value of 2.372 which is larger than the critical value of 2.101 and we would correctly reject the null hypothesis of equal population means.
► Now let's look at when the population means differ by 2 instead.
► Plugging everything in gives us a value of 1.581, which is not larger than the critical value of 2.101, so we would fail to reject the null hypothesis of equal means even though they truly differ by two.
If our populations have variances of 8 and we take samples of size 10, we would likely detect a difference in the population means as big as three, but not one as small as two.
► A small magnitude of difference obscures our ability to detect real differences.
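The arithmetic on this slide can be checked in a few lines of Python; `t_calc` below is a hypothetical helper name for the expected T calculated value when the sample difference and pooled variance match the population values.

```python
import math

def t_calc(diff, pooled_var, n):
    """Hypothetical helper: expected T calculated value for a two-sample
    homoscedastic test with equal sample sizes, assuming the sample
    difference and variance match the population values."""
    return diff / math.sqrt(pooled_var * (1 / n + 1 / n))

t_crit = 2.101  # 18 degrees of freedom, alpha = 0.025 per tail
for diff in (3, 2):
    t = t_calc(diff, 8, 10)
    verdict = "reject" if t > t_crit else "fail to reject"
    print(f"difference of {diff}: t = {t:.3f} -> {verdict} the null hypothesis")
```

This reproduces the 2.372 and 1.581 values from the slide.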

Slide 8.

Now let's look at the effect of the variance.
Consider two populations with means that differ by 3 from which we take two samples of size 10.
Like before, for a two-sample T test the critical value for 18 degrees of freedom and an Alpha value of 0.025 will be 2.101
► First, let's look at when the population variance is 8; the most likely sample variance will also be 8.
► Plugging everything in gives us a difference of three in the numerator, and in the denominator the pooled variance will be the 8 and the sample sizes are 10.
► Calculating this out gives us the T calculated value of 2.372 which is larger than the 2.101 and we would correctly reject the null hypothesis of equal population means.
► Now let's look at when the population variance is 11 and the likely sample variance will be 11 as well.
► Plugging everything in gives us a value of 2.023, which is not larger than the critical value of 2.101, so we would fail to reject the null hypothesis of equal population means even though the means truly differ by three.
When we take samples of size 10 from populations whose means differ by 3, we would be likely to detect the difference when their variances are 8, but not when their variances are 11.
► Large sample variances, whether caused by large population variances or inaccurate measurement methods, obscure our ability to detect real differences.

Slide 9.

Now let's look at the effect of the sample size.
Consider two populations with means that differ by 3, each with a variance of 8, but now we take samples of size either 10 or 8.
Since the sample sizes are changing, we'll be using two different T critical values, one for each sample size.
► First, let's look at when our sample sizes are 10.
► Plugging everything in gives us a difference of three in the numerator, and in the denominator the pooled variance is 8 with the sample sizes of 10.
► Calculating this out gives us the T calculated value of 2.372 which is larger than 2.101, the critical value for 18 degrees of freedom and an alpha value of 0.025.
Based on this we would correctly reject the null hypothesis of equal population means.
► Now let's look at when our sample sizes are 8.
► Plugging everything in gives us a value of 2.121, which is not larger than the critical value of 2.145 used for 14 degrees of freedom with an alpha value of 0.025.
We would fail to reject the null hypothesis of equal population means even though they are actually different by three.
When we take samples from populations that have variances of 8 and means that differ by 3, we would be likely to detect the difference when our samples have 10 values each, but not if they have 8.
► Smaller sample sizes obscure our ability to detect real differences.
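The variance and sample-size comparisons from slides 8 and 9 can be reproduced the same way; this is an illustrative sketch, with `t_calc` again a hypothetical helper rather than anything from the video.

```python
import math

def t_calc(diff, pooled_var, n):
    """Hypothetical helper: expected T calculated value (equal sample sizes)."""
    return diff / math.sqrt(pooled_var * 2 / n)

# Slide 8: variance 8 vs variance 11 (difference 3, n = 10, critical value 2.101).
print(round(t_calc(3, 8, 10), 3), "vs", round(t_calc(3, 11, 10), 3))
# Slide 9: n = 10 (critical value 2.101) vs n = 8 (critical value 2.145).
print(round(t_calc(3, 8, 10), 3), "vs", round(t_calc(3, 8, 8), 3))
```

In each pair, only the first value clears its critical value, matching the slides.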

Slide 10.

For our 4th comparison we can't really write out the equations because it involves more than just the homoscedastic T test.
The comparisons are more complicated, but because nonparametric tests discard some of the information in the data, they have less power.
There are a couple of explicit general comparisons that have been made however.
The Mann-Whitney U test is an alternative to the unpaired T tests we've just been looking at. For large samples, the power of the Mann-Whitney test relative to the two-sample unpaired T test works out to 3 divided by pi, which means it's about 95% as powerful.
The Mann-Whitney U test is therefore almost as good as a T test for comparing means with unpaired data when we have large sample sizes.
The Wilcoxon signed-rank test is an alternative to the paired T test. For large samples, its power relative to the paired T test is likewise 3 divided by pi, about 95%, so it too is nearly as powerful as its parametric counterpart.
The simpler sign test, by contrast, has a large-sample power of only 2 divided by pi relative to the paired T test, about 64%, which makes it a poor substitute because it comes with a much higher risk of type II error.
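For readers who want to check a relative-power claim empirically, a simulation like the following sketch (assuming SciPy is installed; the sample size, effect size, and simulation count are arbitrary choices for illustration) estimates the power of the unpaired T test and the Mann-Whitney U test on the same normally distributed data.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)
n, sims, alpha = 30, 2000, 0.05
reject_t = reject_mw = 0
for _ in range(sims):
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(0.5, 1.0, n)  # true means differ by half a standard deviation
    reject_t += ttest_ind(x, y).pvalue < alpha
    reject_mw += mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha

power_t = reject_t / sims
power_mw = reject_mw / sims
print(f"t-test power       = {power_t:.2f}")
print(f"Mann-Whitney power = {power_mw:.2f}")
```

With normal data the Mann-Whitney power lands just slightly below the T test's, consistent with the roughly 95% large-sample efficiency described above.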

Zoom - slides 7-10.

These four factors determine the power of our statistical tests. The strength of the signal, the degree of noise, the amount of data, and which test we can use.
I should also note that the calculations we did using the T test were essentially looking at when the power of the test is 50%. For example, when the population variance is 8, our sample variances have a 50/50 chance of being larger or smaller than 8, but we assumed they would always be exactly 8.
Genuine and accurate power analysis for power values other than 50% is much more complicated and typically requires the use of computer software.
Nevertheless, these 4 factors will always be important.
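For power targets other than 50%, software such as the statsmodels package (assuming it is installed) does the noncentral-t calculation for us. The sketch below uses the slide 7 scenario, where a difference of 3 with a variance of 8 corresponds to a standardized effect size of 3 divided by the standard deviation, about 1.06.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 3 / math.sqrt(8)  # difference of 3, standard deviation sqrt(8)

# Per-group sample size needed for 80% power at alpha = 0.05 (two-sided).
n_needed = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(f"required n per group for 80% power: {n_needed:.1f}")

# Power actually achieved with the samples of size 10 used in the examples.
achieved = analysis.solve_power(effect_size=effect_size, nobs1=10, alpha=0.05)
print(f"power with n = 10 per group: {achieved:.2f}")
```

With samples of 10 per group, the power for this scenario comes out around 60%, not 80%, which is why planning calculations like this are done before collecting data.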

Slide 11.

Now that we've seen what determines power, when do we use it?
Using an alpha value of 0.05 for the risk of type I error is standard.
That's a low enough value that most people find it an acceptable risk.
It's a bit more complicated for the power.
We want as large a power as we can manage; there's no maximum power we would want, so the higher the better. Higher power costs money though.
We also obviously want some minimum power; if it's too low, why are we even wasting our time doing the study?
Planning for a power of 80% is common, but that's not as orthodox and as accepted by everyone as the 5% for the alpha value.
► When planning a study, calculations of power are common for two main reasons.
► First, we want to know - how big does the study need to be to detect a relevant pattern?
This is important because this may determine how many mice need to die or how much money we need to spend.
If the number of deaths or amount of money is unreasonable then we need to scrap the experiment and find some other way to answer our question.
Animal welfare, the desire to minimize animal deaths, is a major reason why funding agencies often require power analyses for proposals that involve using animals.
► Second, if we plan out an experiment and there are external limits on the size we can obtain, can we detect a relevant pattern with the data available?
When the data sets have fixed sizes, we should figure out if we can detect what we're looking for before spending a bunch of time and money doing a study.

Slide 12.

There is a second use of power analysis, we can also calculate the power of a test after the fact.
If we do a statistical test and fail to reject the null hypothesis, we can calculate what we could have detected based on our sample size and observed variance.
Imagine that we did a T test looking for the difference between the means of two populations.
► The amateur conclusion for a two-sample T test that fails to reject the null hypothesis is the phrase that students learn in their first statistics course.
"We fail to reject the null hypothesis and therefore conclude there's no difference in the population means."
► The professional conclusion for a two-sample T test that fails to reject the null hypothesis is more nuanced.
"We fail to reject the null hypothesis, which suggests that any difference between the population means is less than approximately blah blah blah."
The second statement admits to the risk of type II error and puts a value on the upper bound for what could have been missed given the power of our test.
Let's look at a couple of examples of this using our homoscedastic two-sample T test again.

Slide 13.

Consider a T test of two populations with a variance of 8.0 and two samples of size 8.
► For two samples of 8, the degrees of freedom value is 14, which gives us a T critical value of 2.145 for an alpha value of 0.025.
► We would use the same T test equation as before.
► We can plug in our values, substituting the term "d" to represent the difference that would be detectable.
► This T calculated value will give us the correct result, rejecting the null hypothesis, when "d" divided by 1.4142 is larger than 2.145.
► Cross multiplying tells us that this study can only detect differences when the value of "d" is larger than 3.033.
If the true difference is 3.033, our T test will return a significant result half the time.
But the other half of the time, because of sampling error, our sample data will produce a T calculated value slightly less than the critical value and we wouldn't reject our null hypothesis.

Slide 14.

Let's look at the same situation, but with a study twice as big.
Consider a T test of two populations with a variance of 8.0 and two samples of size 16.
► For two samples of 16, the degrees of freedom value is 30, which gives us a T critical value of 2.042 for an alpha value of 0.025.
► We would use the same T test equation as before.
► We can plug in our values, substituting the term "d" to represent the difference that would be detectable.
► This T calculated value will give us the correct result, rejecting the null hypothesis, when "d" divided by 1 is larger than 2.042.
► This study is able to detect differences when the value of "d" is larger than 2.042.
If the true difference is 2.042, our T test will return a significant result half the time.
But the other half of the time, because of sampling error, our sample data will produce a T calculated value slightly less than the critical value and we wouldn't reject our null hypothesis.
Note that this study was twice as large, but the detectable difference didn't get half as big - it's about two-thirds as big.
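The detectable-difference arithmetic from these two slides can be sketched in Python; `detectable_diff` is a hypothetical helper that rearranges the T test equation, using the critical values quoted above.

```python
import math

def detectable_diff(t_crit, var, n):
    """Hypothetical helper: smallest difference detectable at 50% power,
    from rearranging the T test equation: d = t_crit * sqrt(var * 2 / n)."""
    return t_crit * math.sqrt(var * 2 / n)

d8 = detectable_diff(2.145, 8.0, 8)    # critical value for 14 degrees of freedom
d16 = detectable_diff(2.042, 8.0, 16)  # critical value for 30 degrees of freedom
print(f"n = 8:  d = {d8:.3f}")
print(f"n = 16: d = {d16:.3f} (ratio = {d16 / d8:.2f})")
```

Doubling the sample size shrinks the detectable difference by a factor of about 0.67, not 0.5, matching the two-thirds observation above.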
Let's look at this in more detail.

Slide 15.

As we just saw, a larger sample size is always better, but there are diminishing returns.
That is, doubling the sample size doesn't halve the detectable difference.
The figure shown here shows the detectable difference on the Y-axis for two-sample homoscedastic T tests. On the X-axis is the sample size of each of the two samples. The three different colored lines show the detectable difference for three different standard deviation values: 2, 3, and 4.
We can see that for extremely small sample sizes, ones we shouldn’t be using anyway, the minimum detectable differences become huge.
► Let's focus on more realistic situations, when the sample sizes are at least 5 each.
The curve flattens out noticeably once the detectable difference drops to about half the standard deviation of each population.
► If we look at the required sample sizes to detect a difference in means of half the standard deviations, we can see that we would need a sample size of about 30.
► This gives rise to a common rule of thumb that with T tests we have excellent statistical power when we have 30 values in each sample. This would allow us to detect a difference in population means of only half a standard deviation in about half of our tests.
Again, these examples are all done for a power of 50%; if we want more power, the sample sizes would need to be bigger.
As I mentioned however, those calculations are more complicated, and we'd want to use computer software for those other cases.
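As one illustration of how those curves behave, the sketch below (assuming SciPy is installed) computes the 50%-power detectable difference in standard-deviation units for a few sample sizes.

```python
import math
from scipy.stats import t

# Detectable difference (at 50% power) in standard-deviation units:
# d / sd = t_crit * sqrt(2 / n) for a two-sample homoscedastic T test.
results = {n: t.ppf(0.975, df=2 * n - 2) * math.sqrt(2 / n)
           for n in (5, 10, 30, 100)}
for n, d_sd in results.items():
    print(f"n = {n:3d}: detectable difference = {d_sd:.2f} standard deviations")
```

At n = 30 per sample the detectable difference is close to half a standard deviation, the rule of thumb mentioned above, and further increases in n buy progressively smaller improvements.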

Slide 16.

So, to recap.
The power of a statistical test is 1 minus beta, where beta is the probability of a type II error.
The power increases when the pattern or difference is stronger or larger, the variance is smaller, the sample size is larger, and when the test is parametric instead of nonparametric.
Power is calculated in two situations.
First, we calculate power before an experiment to determine the design and cost needed to detect a relevant pattern or difference.
Second, we calculate power after an experiment to describe the range of patterns or differences that could have been detected.

Zoom out.

Understanding statistical power is useful for people doing statistical tests, but it's also important for those of us who read about studies that other people do.
Especially ones which conclude that certain patterns aren't seen or certain factors aren't important.
We need to think about whether the study design was powerful enough to see what they were looking for. If they conclude that a pattern isn't there, we need to ask whether their study had enough power to see it.
A high resolution PDF of this screen is available on the StatsExamples website. That site also has a ton of other, 100% free, resources to help you with learning and performing statistics.

End screen.

Click to subscribe and stay in touch with the channel and website.
Like and share to help others find this video and channel when they have questions about power analysis.


This information is intended for the greater good; please use statistics responsibly.