CHI-SQUARED (INTRODUCTION)

INTRODUCTION

Link to summary slide and video transcript below

Watch this space

Detailed text explanation coming soon. In the meantime, enjoy our video.

The text below is a transcript of the video.

Connect with StatsExamples here

LINK TO SUMMARY SLIDE FOR VIDEO:

StatsExamples-chi-squared-intro.pdf

TRANSCRIPT OF VIDEO:

Slide 1.

The chi-squared test is used to analyze count data - it compares the number of observations in various categories to the numbers predicted. It's a widely used and very versatile technique so let's look at how it works.

Slide 2.

Consider a situation in which we have count data and want to know if it matches our predictions. The categories can be whatever we like, as long as we can put each observed value from our sample into a specific category to generate count data.
We can compare the numbers observed to the numbers expected - that is, the predicted values. What we're doing is seeing how well our prediction method works for the population, based on the sample we have.
If the observed numbers and predicted numbers are similar, then the prediction method seems good.
If the observed numbers and predicted numbers are different, then the prediction method seems bad.
However, even if the prediction method is perfect for the population, our sample may differ because sampling error and chance will cause a mismatch. How different is too different? For this we'll need a statistical test and a probability distribution.

Slide 3.

The standard Statistical Test Procedure for doing a chi-squared analysis works like this.
First, we ask a question about population count data.
For our test we need an idea about how we expect the data to be, some kind of model that makes a prediction about how many values should be in each of the categories.
Then our null hypothesis will be that the observed number of counts in each category will equal the expected number. If this happens then our prediction is good.
Our alternative hypothesis will be that the observed number of counts in each category does not equal the expected number. If this happens then our prediction is bad.
To figure out which of these seems correct, we calculate a test statistic from the data.
Our question then becomes, if the null hypothesis is true, what is the probability of seeing a test statistic as large as the one we get?
The answer to that question is the p value: the probability, p, of getting a test statistic at least that large if the null hypothesis is true.
If p is small, then the null hypothesis is probably not true, so we would reject the null hypothesis and go with the alternative.
If p is not small, then the null hypothesis may well be true, so we would fail to reject the null hypothesis.
So we need a good test statistic, call it chi-squared.
Let's look at the examples to the right.
The first set of values show close agreement so we would want our chi-squared test statistic to return a small result.
The second set of values show little agreement so we would want our chi-squared test statistic to return a large result.

Slide 4.

The equation we'll use is shown here, this is the chi-squared statistic.
The chi-squared statistic is the sum of the squared deviations between the observed and expected values, relative to the expected values.
This value is larger for bigger mismatches. The bigger the difference between the observed and expected value, the larger the numerators for the summed terms will be.
This difference is squared to make every term positive and then divided by the expected number to measure the magnitude of these deviations relative to the expected values.
The overall chi-squared then measures the overall mismatch between the observed and expected values.
Looking at our two data sets from before, the chi-squared values would be 2.03 and 15.45. You can see how the values clearly line up with the amount of mismatch.
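As a minimal sketch of the statistic described on this slide, here it is in Python. The function name and the example counts are made up for illustration; they are not the two data sets shown on the slide.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of categories")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for four categories (not the data from the slide).
expected = [25, 25, 25, 25]
close_match = [27, 24, 23, 26]
poor_match = [40, 15, 30, 15]

print(chi_squared_statistic(close_match, expected))  # small value
print(chi_squared_statistic(poor_match, expected))   # much larger value
```

Squaring keeps every term from going negative, and dividing by the expected count puts a miss of 5 in a category expecting 10 on a different footing from a miss of 5 in a category expecting 1000.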

Slide 5.

Let's look at some of the properties of the chi-squared distribution.
Chi-squared is never negative.
It gets larger as the mismatches increase in magnitude.
It also gets larger as the number of categories, k, increases.
The chi-squared distribution is also skewed right, although it becomes well approximated by the symmetric normal distribution for very large degrees of freedom values.
On the right you can see 3 different chi-squared distributions plotted, for degrees of freedom 4, 8, and 16.
So, if we had a population with values that exactly matched a predictive model and we took a sample, the value of chi-squared we get would come from whichever distribution matched the degrees of freedom that corresponds to the number of categories.
We'll come back to that correspondence later.
If there was no sampling error, the chi-squared we get would always be zero, but sampling error gives us a larger value.
A genuine mismatch between the predictions and observations would also give a larger chi-squared value.
The question then becomes, how large of a chi-squared value is more than just due to sampling error and therefore likely because of a true mismatch between the predictive model and the population data?

Slide 6.

Here's how we use the chi-squared statistic.
First, we line up the observed and expected values.
Second, we calculate the chi-squared statistic.
Note that now I'm showing the calculated value with an English letter capital X squared. This is to distinguish the sample chi-squared test statistic value from the chi-squared distribution.
It's good practice to always make this distinction between capital X and an actual chi-squared, but you'll see lots of people being imprecise about this. You'll have to figure out from context if they're using the Greek symbol to refer to the probability distribution values, or chi-squared values calculated from sample data.
Third, we compare our calculated chi-squared value to the chi-squared distribution.
We identify the point on the horizontal axis that matches our chi-squared value. The area under the curve to the right is the probability of getting a chi-squared value that large or larger by chance alone, the p value.
Finally, we interpret the magnitude of the p value to "reject" or "fail to reject" the null hypothesis that the model used to make predictions is a good one.
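The "area under the curve to the right" can be computed directly. The sketch below is one way to do it using only Python's standard library, relying on the closed-form expressions for the chi-squared right tail that exist when the degrees of freedom are a whole number. The function name is an assumption for illustration; in practice a statistics package would normally be used.

```python
import math


def chi_squared_p_value(x2, df):
    """Area under the chi-squared curve to the right of x2 (integer df >= 1)."""
    if df < 1 or x2 < 0:
        raise ValueError("need df >= 1 and x2 >= 0")
    half = x2 / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i = 0 .. df/2 - 1
        term = 1.0
        total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: a normal-tail piece plus a finite sum of extra terms.
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x2 / math.pi) * math.exp(-half)  # first extra term
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x2 / (2 * r + 1)
    return tail + total


# A calculated value of 7.815 with 3 degrees of freedom sits right at p = 0.05.
print(chi_squared_p_value(7.815, 3))
```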
So what kinds of models is this technique used to test?

Slide 7.

As mentioned, a chi-squared test compares observed to predicted using a model. There are 3 general types of tests.
First, for "Goodness of Fit" tests the model is any one of a number of mathematical models that make predictions once we have the appropriate parameter values to use it. Any model can be used.
► For these tests we would usually arrange our number of observations in each category and line them up with the predicted number from our model as shown to the right.
Second, there are also tests of "independence" or "homogeneity". These are also a mathematical model, but a specific one in which the independence of two factors is assumed and then used to predict the number of observations in categories representing combinations of factors.
Mathematically these two are the same, but their experimental design differs and the interpretation is slightly different.
► For these tests we would usually arrange our observations in a grid with the rows representing one factor and the columns representing another. We then use the proportions of observed values in each row and column to make predictions about what we would expect those values to be if the rows and columns were independent. Let's look at a hypothetical grid. If we saw the observations in the grid at the left we would use the information about the values to make the predictions to the right.
► For example, the upper left box is in a row that has 1/4 of the overall values and a column that has 1/2 of the overall values so we would expect that, if things were independent, it should have 1/4 of 1/2 of the total number of 48 observations. This number is 6, but it actually has 7 so we can see a little mismatch already. The predictions for the other categories work the same way. We'll come back to this example in a moment.

Slide 8.

For goodness of fit tests, any mathematical model can be used to predict the number of observations.
The degrees of freedom will be the number of categories, k, minus 1, and then subtract the number of parameters that the data is used to help estimate.
There are lots of examples of goodness of fit tests:
Predictions can come from the uniform distribution, in which case the degrees of freedom will be the number of categories minus one. When using the uniform distribution, we don't have to look at our observations to tell us that we expect the same number of observations in each category. No parameters are estimated from the data.
Predictions can come from the binomial distribution, in which case the degrees of freedom will be the number of categories minus three. We need to estimate the probability of success and the number of trials from looking at the data, two parameters. (If the number of trials is already known, only the probability of success is estimated and the degrees of freedom will be the number of categories minus two.)
Predictions can come from the normal distribution, in which case the degrees of freedom will also be the number of categories minus three because we would use our data to estimate the mean and variance of the population.
Predictions can come from the Poisson distribution, in which case the degrees of freedom will be the number of categories minus two because although we use our data to estimate the mean and variance, for a Poisson process they're the same so we're only estimating one parameter.
These are just a few examples, there are lots more out there.
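The degrees of freedom rule for goodness of fit tests can be written as a one-line calculation. This is only a sketch; the function name is made up, and the number of estimated parameters is an input, so a parameter that is known in advance rather than estimated from the data is simply not counted.

```python
def goodness_of_fit_df(num_categories, num_estimated_parameters=0):
    """Degrees of freedom: categories minus 1, minus parameters estimated from the data."""
    df = num_categories - 1 - num_estimated_parameters
    if df < 1:
        raise ValueError("not enough categories for this many estimated parameters")
    return df


# With six categories:
print(goodness_of_fit_df(6))     # uniform, nothing estimated
print(goodness_of_fit_df(6, 1))  # Poisson, mean estimated
print(goodness_of_fit_df(6, 2))  # normal, mean and variance estimated
```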

Slide 9.

For tests of independence and homogeneity we saw a little of how the independence of rows and columns was used to predict the number of observations in each combination. Let's look at two examples.
► The first set of observations are the ones from before. The rows have 1/4 and 3/4 of the values while the columns have 1/2, 1/3, and 1/6 of the observations in each. There are 48 observations overall.
Our predicted numbers are therefore 1/4 of 1/2 of 48 for the upper left, that's 6. 3/4 of 1/2 of 48 for the lower left gives 18. 1/4 of 1/3 of 48 for the upper middle gives 4. 3/4 of 1/3 of 48 for the lower middle gives 12. 1/4 of 1/6 of 48 for the upper right gives 2. And finally, 3/4 of 1/6 of 48 for the lower right gives 6.
► Here's another example. The rows have 1/2 of the values each while the columns have 1/4, 1/2, and 1/4 of the observations in each. There are 64 observations overall.
Our predicted numbers are therefore:
1/2 of 1/4 of 64 for the upper left gives 8.
1/2 of 1/4 of 64 for the lower left gives 8.
1/2 of 1/2 of 64 for the upper middle gives 16.
1/2 of 1/2 of 64 for the lower middle gives 16.
1/2 of 1/4 of 64 for the upper right gives 8.
And finally, 1/2 of 1/4 of 64 for the lower right gives 8.
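The same predictions can be generated from the row and column totals, since "1/4 of 1/2 of 48" is the same as (row total x column total) / overall total. Here is a small sketch of that; the function name is an assumption, and the totals are the ones implied by the fractions in the two examples above.

```python
def expected_counts(row_totals, column_totals):
    """Expected count for each cell if rows and columns are independent."""
    total = sum(row_totals)
    if total != sum(column_totals):
        raise ValueError("row totals and column totals must add up to the same number")
    return [[r * c / total for c in column_totals] for r in row_totals]


# First example: rows hold 1/4 and 3/4 of 48, columns hold 1/2, 1/3 and 1/6 of 48.
print(expected_counts([12, 36], [24, 16, 8]))

# Second example: rows hold 1/2 each of 64, columns hold 1/4, 1/2 and 1/4 of 64.
print(expected_counts([32, 32], [16, 32, 16]))
```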
► The degrees of freedom for tests of independence and homogeneity are the number of rows minus 1 multiplied by the number of columns minus 1.
► You can visualize this as the number of cells in the grid except for the last one in each row and column.
In fact, this is where the term "degrees of freedom" comes from - once all the boxes in yellow have values, the remaining cells aren't free to vary, they must be whatever it takes to make the row or column total. Only the yellow cells have any degree of freedom.
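To see why only those cells are free, here is a sketch that fills in the rest of a grid from the free cells and the row and column totals. The function names are made up, and the upper middle value of 3 is a hypothetical count; the upper left value of 7 and the totals come from the first grid.

```python
def independence_df(num_rows, num_columns):
    """Degrees of freedom for a test of independence or homogeneity."""
    return (num_rows - 1) * (num_columns - 1)


def complete_table(free_cells, row_totals, column_totals):
    """Fill in the last column and last row, which are forced by the totals."""
    # Each free row gets a final cell that makes up its row total.
    table = [list(row) + [row_total - sum(row)]
             for row, row_total in zip(free_cells, row_totals)]
    # The last row is whatever makes up each column total.
    last_row = [col_total - sum(row[j] for row in table)
                for j, col_total in enumerate(column_totals)]
    table.append(last_row)
    return table


print(independence_df(2, 3))  # two free cells in a 2 x 3 grid
# Upper left 7 is from the example; upper middle 3 is hypothetical.
print(complete_table([[7, 3]], [12, 36], [24, 16, 8]))
```

Once the two free cells are chosen, the other four cells have no freedom left.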

Slide 10.

So what's the difference between chi-squared tests of independence and homogeneity?
The math is identical, but the experimental design differs.
For tests of independence the samples are taken randomly from one population and measure the numbers observed in two sets of categories.
No special effort is made to include members of each category, the population is just randomly sampled and the numbers will indicate whether the two sets seem independent or not.
In contrast, for tests of homogeneity, samples are randomly taken from multiple populations and we measure numbers in a set of categories for each population. The population sampled is essentially one of the sets of categories.
But in this case, we make sure to include plenty of members of each population. The two sets are being treated a little differently.

Slide 11.

Let's look at the formal procedure for a chi-squared test.
First, create the null and alternative hypotheses.
The null hypothesis is that the observed counts match the predicted counts. The alternative hypothesis is that the observed counts don't match the predicted counts.
Calculate a chi-squared test statistic using the observed count data and expected values from a mathematical model.
Then compare this calculated chi-squared value to critical chi-squared from a table to see if the difference between the observed and expected values is significant. The usual threshold is an alpha of 0.05.
Then determine the probability, p value, of seeing a calculated chi-squared value as large as we do. We can do this by using values from our table or by using a computer to generate an exact P-value.
Technically the P-value is the smallest alpha value we could choose and still reject the null hypothesis, but, a better way to think about this is that it is the probability of getting a calculated chi-squared value as large as we do if our null hypothesis is true.
Then we decide to "reject the null hypothesis" or "fail to reject the null hypothesis" based on the p value.
If the P-value is larger than 0.05 we would fail to reject the null hypothesis and conclude that the observed counts appear to match the predicted counts. Deviations are within the range that sampling error could easily cause.
If the P-value is smaller than 0.05 we would reject the null hypothesis and conclude that the observed counts don't match the predicted counts. Deviations we see are more than sampling error alone could easily cause so there is likely some non-random reason for the mismatch in our data.
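Putting the steps together, here is a rough sketch of the whole procedure for the simplest case, a goodness of fit test against the uniform distribution, using the critical value route. The function name and the counts are hypothetical; the critical values are the standard table entries for an alpha of 0.05.

```python
# Critical chi-squared values for alpha = 0.05, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def uniform_goodness_of_fit(observed):
    """Return (chi-squared, degrees of freedom, decision) for a uniform model."""
    k = len(observed)
    expected = sum(observed) / k          # same prediction for every category
    x2 = sum((o - expected) ** 2 / expected for o in observed)
    df = k - 1                            # no parameters estimated from the data
    decision = "reject" if x2 > CRITICAL_05[df] else "fail to reject"
    return x2, df, decision


# Hypothetical counts from 60 observations in six categories.
print(uniform_goodness_of_fit([8, 9, 12, 11, 10, 10]))
print(uniform_goodness_of_fit([20, 5, 5, 10, 10, 10]))
```

The first set stays below the critical value of 11.070 for 5 degrees of freedom, so we would fail to reject the null hypothesis; the second set exceeds it, so we would reject.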

Slide 12.

So what does it mean to reject the null hypothesis?
For a goodness of fit test this means that our mathematical model is not accurate or valid. Essentially this means that one or more assumptions of the mathematical model are violated. Depending on the model we used, this can have a variety of implications for the system we're studying.
For a test of independence this means that there is a non-random association between some of the factors.
For a test of homogeneity this means that the frequencies of the categories differ between the populations, there is a non-random association between the population and the other factor we're measuring.

Slide 13.

Before we finish, there are a few things about chi-squared tests to be aware of.
Predicted values that are too small are a problem that comes up sometimes. Extremely small predictions for categories can create large chi-squared values from just one or two chance observations. Technically speaking, at this point the continuous chi-squared distribution doesn't match our discrete scenario appropriately.
There are therefore a couple rules of thumb about when we shouldn't do a chi-squared analysis.
First, no prediction less than 1.0 is allowed.
Second, a maximum of 20% of the predictions can be less than five.
To solve this issue, we can combine adjacent categories to create larger predicted values.
In the example shown we wouldn't want to do a chi-squared with these predictions because the predicted values for the orange and grapes categories are too small. We can fix this by combining these to create a larger category of oranges or grapes.
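The two rules of thumb are easy to check before running a test. This is a sketch only; the function names and the predicted values are invented for illustration and are not the numbers on the slide.

```python
def predictions_ok(expected):
    """True if no prediction is below 1 and at most 20% are below 5."""
    if any(e < 1.0 for e in expected):
        return False
    num_small = sum(1 for e in expected if e < 5)
    return num_small * 5 <= len(expected)   # same as num_small / len <= 20%


def combine_categories(counts, indices):
    """Merge the categories at the given positions into one category at the end."""
    merged = sum(counts[i] for i in indices)
    kept = [c for i, c in enumerate(counts) if i not in indices]
    return kept + [merged]


# Hypothetical predictions for apples, bananas, oranges, grapes.
expected = [20, 15, 3, 2]
print(predictions_ok(expected))                 # two of four are below 5

combined = combine_categories(expected, [2, 3])  # "oranges or grapes"
print(combined, predictions_ok(combined))
```

The observed counts have to be combined in exactly the same way as the predictions so the categories still line up.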

Slide 14.

As mentioned, the calculated chi-squared statistic is discrete, but the theoretical chi-squared distribution is continuous. This can cause problems, especially for 2x2 tables with small predictions.
Therefore, the Yates Correction was proposed to improve the match and give more conservative chi-squared values and lower the risk of type I error.
The correction involves modifying the numerator terms to use the absolute difference between the observed and expected values, minus 1/2, and squaring this in the sum.
Most modern statisticians consider this correction to be too conservative, however, because it increases the risk of type II error too much.
Nevertheless, this correction is often used.
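Here is a sketch of the corrected statistic next to the uncorrected one, following the description above. The function names and the 2x2 counts (written out as a flat list of four cells) are hypothetical.

```python
def chi_squared(observed, expected):
    """Uncorrected statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_squared_yates(observed, expected):
    """Yates-corrected statistic: sum of (|O - E| - 1/2)^2 / E."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical 2x2 table, cell by cell.
observed = [12, 8, 8, 12]
expected = [10, 10, 10, 10]

print(chi_squared(observed, expected))        # uncorrected
print(chi_squared_yates(observed, expected))  # smaller, so more conservative
```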

Slide 15.

As mentioned, 2x2 tables may not be the best for chi-squared analyses, especially if the overall total number of observations is less than 20 - or when the overall total number of observations is between 20 and 40, with any predictions less than 5.
In these cases, Fisher's Exact Test should be used instead.
The math is complicated, but this test can give exact p values for these cases when the chi-squared test would be inaccurate.
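For the curious, the calculation for a 2x2 table can be sketched in a few lines. With the row and column totals held fixed, each possible table has an exact probability, and the p value adds up the probabilities of every table that is as unlikely as, or less likely than, the one observed. That two-sided definition is one common convention and is an assumption here, as are the function name and the example counts.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def probability(x):
        # Chance of seeing x in the upper left cell with all totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = probability(a)
    # Small tolerance so ties are not lost to floating point rounding.
    return sum(probability(x) for x in range(lowest, highest + 1)
               if probability(x) <= p_observed * (1 + 1e-9))


# Hypothetical table with only 8 observations, too few for chi-squared.
print(fisher_exact_two_sided(3, 1, 1, 3))
```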

Slide 16.

The chi-squared test is perhaps the most versatile test there is.
Observations can be compared to predictions from ANY mathematical model as long as the data can be expressed as number of observations (i.e., count data).
This allows for an almost unlimited number of types of tests.
This test is often the test that people go to when their data is weird or their system so complicated it can't be easily analyzed with a simple T test or linear regression.

Zoom out.

I hope you found this introduction to the chi-squared method useful.
There's a high-resolution PDF of this screen on the StatsExamples website.
StatsExamples has several videos of worked examples using this method with different scenarios - some goodness of fit tests and some tests of independence and homogeneity.

End screen.

Subscribe if you want to be able to easily find this video, and any of our other videos, in the future.

This information is intended for the greater good; please use statistics responsibly.