ONE FACTOR ANOVA (INTRODUCTION)
Watch this space: a detailed text explanation is coming soon. In the meantime, enjoy our video. The text below is a transcript of the video.
Connect with StatsExamples here
LINK TO SUMMARY SLIDE FOR VIDEO:
TRANSCRIPT OF VIDEO:
The one factor ANOVA is the most common technique to see if the means of more than two populations are different from each other. In this video we'll look at the concepts behind it and how it works. There's also a companion video on this channel that works through two step-by-step examples.
We often want to compare the averages, or means, of different populations. When we have two populations, we typically do a single t-test for the pair. This usually gives us the correct answer, but it does have a 5% chance of type I error - that is, telling us the populations have different means when they don't.
►When we have more than two populations, four for example, we would have to do six different t-tests, one for each pair. These would also usually give us the correct answer, but now the overall chance of at least one type I error rises to about 26%. This is now a more serious problem and we face a genuine risk of making the wrong conclusion.
►The problem I've illustrated is that multiple separate comparisons inflate the overall probability of type I error. We need a new test with 5% overall probability of type I error.
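The inflation just described is easy to check with a few lines of code. This is a minimal sketch that assumes the tests are independent, which is only approximately true for pairwise comparisons that share groups:

```python
# Familywise type I error rate for m independent tests at alpha = 0.05.
# Note: pairwise t-tests sharing groups are not fully independent, so
# this is an approximation of the inflation described above.
def familywise_error(m, alpha=0.05):
    """Probability of at least one type I error across m independent tests."""
    return 1 - (1 - alpha) ** m

print(round(familywise_error(1), 3))  # one t-test: 0.05
print(round(familywise_error(6), 3))  # six pairwise tests: 0.265
```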
As I mentioned, when we have just two populations we use the t-test to decide if the population means differ. When doing a t-test we compare the difference between the sample means to the combined standard error using the equation shown there.
The bigger the t calculated value, the more likely it is that the population means differ, because that indicates cases in which the difference between the sample means is large relative to how different we expect it to be based on the variation within the groups.
For more details about this idea and the t-test, watch our intro to the two sample t-test video.
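As a concrete sketch of the t statistic just described, here is a hand calculation with made-up data values; the pooled standard error shown is one common form of the "combined" standard error:

```python
import statistics

# Two hypothetical samples; all numbers here are made up for illustration.
a = [4.1, 5.0, 6.2, 5.5]
b = [6.8, 7.4, 8.1, 7.0]

mean_a, mean_b = statistics.mean(a), statistics.mean(b)
var_a, var_b = statistics.variance(a), statistics.variance(b)
na, nb = len(a), len(b)

# Pooled variance: each sample variance weighted by its degrees of freedom.
pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
combined_se = (pooled_var * (1 / na + 1 / nb)) ** 0.5

# Compare |t| to a t table at na + nb - 2 degrees of freedom.
t = (mean_a - mean_b) / combined_se
```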
With the ANOVA we use the same basic concept as the t-test, comparing the overall variation in the means of the groups to the variation we would expect if it was caused only by whatever causes the variation within the groups.
For the ANOVA we measure the variations we want to compare using variances, instead of the pure difference in means or standard errors, and then we compare these using an F test. The F calculated value divides the variance among the groups, the measure of the variation in the means, by the variance within the groups, the measure of the variation or noise within each group.
The bigger the F calculated value, the more likely it is that the population means differ because that indicates cases in which the variance of the sample means is larger than we expect it to be based on the variances within the groups.
One aspect of the ANOVA that makes it a bit different from some other statistical tests is that the conceptual hypotheses and the actual tested hypotheses differ somewhat.
Our conceptual null and alternative hypotheses are about the means of the populations - are they all equal or do at least two differ?
But the formal hypotheses, the actual thing we will test, are about the mean sums among (MSA) and mean sums within (MSW). The hypotheses ask whether the mean sums among is less than or equal to the mean sums within or whether the mean sums among is greater than the mean sums within.
Note that while the language of the ANOVA uses the term mean sums, the MSA and MSW are actually variances.
These hypotheses are connected - the decision to reject or fail to reject the formal null hypothesis is equivalent to rejecting or failing to reject the conceptual null hypothesis.
If we accept the null hypothesis, then we lack the evidence to decide that any of the population means differ from one another.
But if we reject the null hypothesis and decide that one or more of the population means differ, a second question then arises - which ones? We'll come back to this later.
We know we're going to do the ANOVA using an F test of the variances, but let's look at how these variances relate to each other in more detail so we can understand exactly what we'll be doing.
Mean sums are variances based on sums of squares, so before we calculate the mean sums, we need to calculate three sets of sums of squares.
The first sum of squares we will calculate is one using all the data regardless of which population the sample data value is from. We compare each value to the overall mean to calculate a sum of squares total, SST.
From the figure we can see that this overall variation will be caused by two things.
When the means of the groups are more variable, as shown for the examples on the left, the overall data values will tend to also be more variable. When the means of the groups are less variable, the overall data values will tend to be less variable.
When the values within each of the groups are more variable, as shown for the top examples, the overall data values will tend to be more variable. When the values within each of the groups are less variable like in the lower figures, then the overall data values will tend to be less variable.
The sum of squares total value for all our data will therefore be caused by both the differences between the means of the groups and the variation within them.
Focusing on the variation coming from the means of the groups we could calculate something called the sum of squares among, SSA. This is a sum of squares value that is based on the values of the group means, compared to the overall mean of everything.
Since each of these means is based on multiple values, we will be weighting this sum of squares value by the number of values in each of the groups.
Focusing on the total variation coming from the variation within the groups we could calculate something called the sum of squares within, SSW. This is a sum of squares value that is based on the values for each group, compared to the means for each group, then all added up afterwards.
Looking at these together we can see visually how the sum of squares total is due to both the sum of squares among and the sum of squares within.
►In fact, there is a mathematical proof that states that if we calculate the sums of squares the way I just described, the numerical value for the sum of squares total will be equal to the sum of the sum of squares among and sum of squares within.
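The identity is easy to verify numerically. Here is a minimal sketch using three small hypothetical groups; any data set will satisfy it:

```python
# Numerical check of the partition SST = SSA + SSW on made-up data.
groups = [[3, 5, 7], [6, 8, 10], [10, 12, 14]]

all_vals = [x for g in groups for x in g]
grand_mean = sum(all_vals) / len(all_vals)
group_means = [sum(g) / len(g) for g in groups]

# Total: every value compared to the grand mean.
SST = sum((x - grand_mean) ** 2 for x in all_vals)
# Among: each group mean compared to the grand mean, weighted by group size.
SSA = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
# Within: each value compared to its own group mean, summed over groups.
SSW = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, group_means))

print(round(SST, 6), round(SSA + SSW, 6))  # the two totals match
```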
This useful mathematical property is actually the main reason why so much of statistics is done using sums of squares and variances even when these values can have serious problems as descriptive statistics for the variation.
We have a sense of our three sums of squares values and their relationship, now what?
If all the variation was just caused by random noise, then the SSA and SSW would both contribute quite a bit to the sum of squares total. How do we determine whether the differences between the groups, the "Among" factor, are larger than we expect based on the "Within" factor?
►For this we will convert the sums of squares values into their mean sums, otherwise known as variances, using these equations. The mean sum is the sum of squares divided by the degrees of freedom value.
►Once we have these two variances, we can use an F test to compare them and see if they differ. We could do a two-tailed F test to see if the group means are either more different than we expect or more similar, but the second question isn't usually what we care about.
We therefore use a one-tailed F test, dividing MSA by MSW, to see whether we have evidence that the variance among the groups is significantly larger than the variance within the groups.
Now we just need the degrees of freedom values so we can do the F test.
Let's consider a data table such as the one shown to the right with a column for each of "k" groups and rows for the "n" individuals within each group.
►The degrees of freedom for the among groups variance will be the number of group means we used minus one, this is k-1.
►The degrees of freedom for the within groups variance is the number of values in each group minus one for each of the k groups. This can also be represented by capital N minus k where the capital N is the total number of data values in the entire data set.
►The degrees of freedom for the entire data set would be capital N minus one.
►The mean sums among is therefore SSA divided by k minus one.
►The mean sums within is therefore SSW divided by capital N minus k.
►The F calculated value for our one-tailed test is MSA divided by MSW. We get this and compare it to the critical values from our F table for alpha equals 0.05 to see if the variances are significantly different. We can use additional tables to get a more precise range for the p value of our test. A computer with statistical software can give us the exact p value if we have access to one.
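Putting the pieces together, a hand calculation of the F statistic might look like the sketch below, using a small hypothetical data set; the 5.14 critical value is the tabled F for alpha = 0.05 with 2 and 6 degrees of freedom:

```python
# One-factor ANOVA F calculation by hand on hypothetical data.
groups = [[3, 5, 7], [6, 8, 10], [10, 12, 14]]
k = len(groups)                      # number of groups
N = sum(len(g) for g in groups)      # total number of data values

all_vals = [x for g in groups for x in g]
grand_mean = sum(all_vals) / len(all_vals)
group_means = [sum(g) / len(g) for g in groups]

SSA = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
SSW = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, group_means))

MSA = SSA / (k - 1)   # df among = k - 1 = 2
MSW = SSW / (N - k)   # df within = N - k = 6
F = MSA / MSW

# F is 9.25, which exceeds the 5.14 critical value, so these group means
# would be judged significantly different at alpha = 0.05.
print(round(F, 2))
```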
You can watch our intro to the F-test video for more about this test if you're not familiar with it.
That's the idea, let's look at this again to see how we would actually use a data table directly.
How do we get SST, SSA, and SSW from a data table?
The top figure shows what we're looking for. The rest shows a diagram of the data table, with values for each group arranged in the red columns.
To get SST we calculate the sum of squares for all data values, comparing each value to the overall mean. The purple box around all the data represents this.
This measures the overall variation - how the data values differ depending on which groups they're in, combined with the noise within each group.
You can watch our intro to summary statistics video for more about calculating sums of squares.
To get SSA we first calculate the mean value for each group, these are the X bar values shown below the data table.
Then we calculate the sum of squares of these group mean values, comparing them to the overall mean, and multiplying these squares by the group sample size, n.
This measures the variation that is associated with the differences in the means of the groups. How much is the data spread out because the means of the groups are spread out?
To get SSW we calculate the sum of squares values separately for each of the k groups using the group means and sum them.
This measures how much of the variation comes from variation within each of the groups. How much noise is in the data?
To help keep everything organized and to present it clearly to others, data is usually presented in an ANOVA table.
The table has a column for the source of the variation, the degrees of freedom for each source, the sums of squares and mean sums values, the F calculated value and the p value that corresponds to it.
By convention, the sources are listed with among group variation first, followed by within group and then total on the third line.
The degrees of freedom values are the ones from before - k minus one for the among, capital N minus k for the within, and capital N minus one for the total.
The sums of squares values are entered in the next column.
The mean sums are then easy to calculate since the two values needed are on the same row already. The MSA is SSA divided by k minus one and the MSW is the SSW divided by the capital N minus k.
The F value is then just the MSA value divided by the MSW value.
Finally the p value from our F test indicates if the variances are different which corresponds to whether any of the means are significantly different.
If p is less than 0.05 then one or more means are significantly different from one or more other means.
If p is larger than 0.05 then we lack evidence to decide that any of the means are different from any of the others; we would typically conclude that they all appear to be equal.
Note that if we fail to reject the null hypothesis, we don't prove that the means are the same, we decide that they appear equal because we looked for evidence that they differed and didn't find it.
Published tables will often omit the sums of squares column since it is easily figured out from the degrees of freedom and the mean sums columns.
That's all there is to doing an ANOVA, calculating variances and doing an F test, but there are a few other things we need to be aware of.
The first thing to be aware of is that the ANOVA is a homoscedastic test. The ANOVA requires equal variances to prevent one unusually variable group from overwhelming the SSW value as shown in the figure.
For example, the gigantic variance of the red group would overwhelm the SSW term and render our F test nonsignificant even though it's obvious that it should tell us there's a difference between the smallest and largest groups.
Unequal variances can cause type II errors, that is, the failure to detect genuine differences between population means.
►A prerequisite for the ANOVA is therefore a test for equal variances such as the Fmax test. If the variances are equal, then we can do the ANOVA, but if not, then we can't. If that happens, we have to transform the data into a new data set that has equal variances or use a less powerful alternative test like the Kruskal-Wallis test.
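A rough version of this check is easy to compute. This sketch uses the Fmax statistic, the largest group variance divided by the smallest; the cutoff it is compared against would come from an Fmax table for the given number of groups and degrees of freedom, which is not reproduced here:

```python
import statistics

# Fmax check of the equal-variance assumption on hypothetical data.
groups = [[3, 5, 7], [6, 8, 10], [10, 12, 14]]
variances = [statistics.variance(g) for g in groups]

# Values near 1 indicate similar variances; large values are a warning sign.
fmax = max(variances) / min(variances)
print(fmax)
```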
Here's the formal procedure for the one factor ANOVA.
►First, we create our null and alternative hypotheses.
The null is that the means of the populations are the same, which is equivalent to the mean sums among being less than or equal to the mean sums within.
The alternative is that the means of the populations are not all the same, which is equivalent to the mean sums among being greater than the mean sums within. ►Then we calculate the three sums of squares values and use the SSA and SSW to calculate the mean sums values MSA and MSW.
►Then we perform a one-tailed F test on the MSA and MSW to test whether the mean sums among is larger than the mean sums within using the F statistic calculation shown.
►We then compare the F calculated value to various F critical values - the most important being the value for an alpha value of 0.05 since that is what we typically use to reject or fail to reject the null hypothesis of our F test.
►By using multiple tables or a computer we determine the precise p value for our test - that's the probability of seeing an F calculated value as large as we do if the null hypothesis is true.
►Then we decide to "reject the null hypothesis" or "fail to reject the null hypothesis" based on the p value.
If the p value is greater than or equal to 0.05 then we would fail to reject the null hypothesis and decide that we lack evidence to decide that any of the means are different. We would then generally say that they're all equal. Note that we don't prove this, we're just making an informed decision.
If the p value is less than 0.05 then we would reject the null hypothesis and therefore decide that we do have good evidence to decide that some of these means are different. Again, this isn't proof, it's an informed decision based on probability.
This procedure doesn't tell us everything that we want to know however - for one thing it doesn't tell us which means are different or not.
The ANOVA procedure just described is the first step.
If the null hypothesis is accepted, that is when p is not less than 0.05, then we conclude that none of the means are significantly different from any of the others. We are done unless we want to do a power analysis to estimate the maximum undetectable differences, but honestly this isn't done as often as it should be.
If the null hypothesis is rejected however, when p is less than 0.05, then one or more of the means are significantly different from one or more of the others. The ANOVA doesn't tell us which ones differ, just that one or more do, but obviously this is information that we are probably really interested in.
There are two options to determine which means differ.
The first option for determining which means differ is to use Bonferroni corrected t tests.
For this we go back to the original data sets and do all the possible pairwise t-tests, but with a smaller alpha value, one less than 0.05, as the threshold for significance.
This is called the Bonferroni correction for the critical alpha value and we usually use a corrected alpha value equal to 0.05 divided by the number of tests we have to perform.
►For example, if we're comparing three data sets then we would do the three pairwise t-tests, but only reject the null hypothesis of equal means for a pair when the p value is less than 0.05 divided by 3 which is 0.01666.
►If we're comparing four data sets then we would do the six pairwise t-tests, but only reject the null hypothesis of equal means for a pair when the p value is less than 0.05 divided by 6 which is 0.00833.
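The corrected threshold is just 0.05 divided by the number of pairs, which for k groups is k choose 2. A minimal helper:

```python
from math import comb

def bonferroni_alpha(k, alpha=0.05):
    """Per-test threshold when doing all k-choose-2 pairwise t-tests."""
    return alpha / comb(k, 2)

print(round(bonferroni_alpha(3), 5))  # 3 pairwise tests: 0.01667
print(round(bonferroni_alpha(4), 5))  # 6 pairwise tests: 0.00833
```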
The second option for determining which means differ is to use Tukey-Cramer comparison intervals.
For this we calculate a value called the MSD (minimum significant difference), also known as the HSD (honestly significant difference), and create intervals extending 1/2 MSD above and below each sample mean. Non-overlapping intervals indicate differing means.
The equation to get the MSD or HSD uses a value Q which varies based on the desired alpha value, the number of groups, and the degrees of freedom of the mean sums within. The StatsExamples website has several tables of Q values like the one shown for alpha equals 0.05.
This Q value is then multiplied by the square root of the mean sums within, divided by the sample size within each group, essentially the average standard error for the groups.
►For example, imagine that we had data which resulted in a minimum significant difference value of 6.
We would then go to each sample and put an interval of plus and minus 3, half of the six, around each mean.
If our sample means were 8, 10, and 15 we would get intervals of 5 to 11, 7 to 13, and 12 to 18.
We then compare these intervals and identify which ones overlap and which ones don't. If the intervals for two groups overlap then their population means are not significantly different, but if the intervals don't overlap then they are significantly different.
For our example we can see that the means of populations A and B aren't significantly different, the means of populations B and C aren't significantly different, but the means of populations A and C are significantly different.
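The interval comparison for this worked example can be sketched directly from the numbers above (MSD = 6; sample means 8, 10, and 15):

```python
# Tukey-Cramer comparison intervals: each mean plus or minus half the MSD.
msd = 6
means = {"A": 8, "B": 10, "C": 15}
intervals = {g: (m - msd / 2, m + msd / 2) for g, m in means.items()}

def overlaps(i, j):
    """True when two closed intervals share any points."""
    return i[0] <= j[1] and j[0] <= i[1]

# A vs B and B vs C overlap (not significantly different); A vs C does not.
print(overlaps(intervals["A"], intervals["B"]))  # True
print(overlaps(intervals["B"], intervals["C"]))  # True
print(overlaps(intervals["A"], intervals["C"]))  # False
```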
To recap. When we compare multiple groups we have a problem because the overall risk of type I error gets too high when doing multiple comparisons.
The solution is to perform an analysis of variance which is called an ANOVA.
Before we do that, we do have to keep in mind that the ANOVA is homoscedastic, so we have a prerequisite test for the equality of the population variances. Assuming the data passes that test we can do an ANOVA.
When we do the ANOVA, we have conceptual hypotheses about the means which we test with formal hypotheses about the mean sums among and within.
We do this test by calculating the sums of squares, mean sums, and performing a one-tailed F test to see if the MSA is significantly larger than the MSW.
We typically summarize the results of our test in an ANOVA table and use the p value to reject or fail to reject our null hypothesis. This tells us whether we have evidence that some of the population means differ or not.
If we reject the null hypothesis then we have two options to figure out which means differ - performing a series of Bonferroni corrected t-tests or calculating the minimum significant difference and comparing Tukey-Cramer comparison intervals.
The ANOVA is often extremely intimidating to students when they first encounter it, but I hope that this video shows that it's not a mysterious black box. Doing an ANOVA doesn't require a computer, you can do one by hand.
As always, a high-resolution PDF of this screen is available on the StatsExamples website, and check out our companion video that works through two step-by-step numerical examples.
To help others find this video, click like or subscribe.
This information is intended for the greater good; please use statistics responsibly.