CHI-SQUARED EXAMPLE (DIHYBRID CROSS)


Watch this space

Detailed text explanation coming soon. In the meantime, enjoy our video.

The text below is a transcript of the video.






LINK TO SUMMARY SLIDE FOR VIDEO:


StatsExamples-chi-squared-intro.pdf

TRANSCRIPT OF VIDEO:


Slide 1.

The chi-squared test can be used for a variety of analyses of count data. One important example in biology is examining the results from a dihybrid cross between two heterozygous individuals which have dominant and recessive alleles at each of two loci. Let's look at how this works.

Slide 2.

First, a quick review of the chi-squared approach in general. You can check out our introduction to the chi-squared video on this channel if this is new to you.
The first step is to create the null and alternative hypotheses for the analysis.
The null hypothesis will be that the observed counts match the predicted counts.
The alternative hypothesis will be that the observed counts don't match the predicted counts.
Then we calculate a chi-squared test statistic using the observed count data and a mathematical model which makes predictions for what the count values are expected to be.
Then we compare this calculated chi-squared value to the critical chi-squared value from a table. The usual threshold is an alpha of 0.05. This allows us to determine the probability, the P-value, of seeing a calculated chi-squared value as large as we do.
Technically the P-value is the smallest alpha value we could choose and still reject the null hypothesis. However, a better way to think about this is that it is the probability of getting a calculated chi-squared value as large as we do if the count values in the population our sample is from match the predicted values perfectly.
Then we decide to "reject the null hypothesis" or "fail to reject the null hypothesis" based on the P-value.
When the P-value is larger than 0.05 we would fail to reject the null hypothesis and conclude that the observed counts roughly match the predicted counts. Any deviation is within the range that sampling error could easily cause.
When the P-value is smaller than 0.05 we would reject the null hypothesis and conclude that the observed counts don't match the predicted counts. The deviation we see is more than sampling error alone could easily cause so there is likely some other reason for the mismatch in our data.
Our conclusion about the null hypothesis ties directly to the mathematical model we used to generate the predicted values. The conclusion therefore tells us about the accuracy or validity of the mathematical model we are using.
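To make this general procedure concrete, here is a minimal Python sketch (not part of the video; it assumes the scipy library is available and uses made-up counts) that computes the chi-squared statistic from observed and predicted counts, looks up the critical value, and makes the reject/fail-to-reject decision.

# Minimal sketch of a chi-squared goodness-of-fit decision (hypothetical counts).
from scipy.stats import chi2, chisquare

observed = [52, 18, 21, 9]               # hypothetical observed counts
expected = [56.25, 18.75, 18.75, 6.25]   # counts predicted by the model (same total)

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

alpha = 0.05
df = len(observed) - 1                   # no parameters estimated from the data
critical = chi2.ppf(1 - alpha, df)       # the critical value from the chi-squared table

if p_value < alpha:                      # equivalently, stat > critical
    print(f"chi-squared = {stat:.3f} > {critical:.3f}: reject the null hypothesis")
else:
    print(f"chi-squared = {stat:.3f} <= {critical:.3f}: fail to reject the null hypothesis")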

Slide 3.

Let's look at that last part in more detail. What does it mean to reject the null hypothesis in a test like this?
When we do a goodness of fit chi-squared test, if we decide to reject the null hypothesis that means that one or more of the assumptions in our mathematical model are violated.
For Mendelian segregation, the main assumption is that alleles from different loci segregate (that is, go into gametes during meiosis) independently because they are on different chromosomes or very far apart on the same chromosome.
In the diagram shown, a heterozygous individual with alleles represented by capital "A" and lower case "a" at one locus, and by capital "B" and lower case "b" at the other, might have both capital-letter alleles on the same chromosome or they may be on different chromosomes.
If those loci are on different chromosomes, then the pair of alleles that go into gametes will be random because chromosomes segregate randomly during meiosis.
If those loci are on the same chromosome, but are distant from one another, then there is plenty of opportunity for recombination to occur between them and randomize which pair of alleles end up together at the end of meiosis.
In these two cases, the proportions of gametes produced would be 1/4 for each of the four possible allele combinations.
This is a mathematical model that predicts these 1/4 values for each combination of the alleles from each parent.
It assumes that the loci with these alleles are on different chromosomes or far apart on the same chromosome. That is, they're not linked in some way.

Slide 4.

Let's look at what we would therefore predict for the result of a dihybrid cross, a cross between two heterozygotes.
For each individual, color-coded blue and red, I've drawn out the four different types of gametes they may produce.
► The offspring will have genotypes that are the combination of these gametes.
► Since we're assuming that the proportions of each gamete are 1/4, we can predict how often we should get each genotype: each genotype shown would be equally likely, so 1/16 of the offspring should have each genotype.
► Now let's think about a situation in which the alleles represented with capital letters are dominant to those represented by lower case letters. Now our set of 16 genotypes will collapse into a set of four possible phenotypes.
How many out of the 16 genotypes will have each phenotype though?
► First let's think about the double dominant phenotype. Every individual with one or two capital "A" and one or two capital "B" alleles will have this phenotype. Looking through the array of offspring genotypes, we can find 9 that match these conditions.
► Now let's think about the dominant phenotype for the first locus and the recessive phenotype for the second. Every individual with one or two capital "A" and both lower case "b" alleles will have this phenotype. Looking through the array of offspring genotypes, we can find 3 that match these conditions.
► Now let's think about the recessive phenotype for the first locus and the dominant phenotype for the second. These are the individuals with both lowercase "a" alleles and either one or two capital "B" alleles. Looking through the array of offspring genotypes, we can find 3 that match these conditions.
► Lastly, the double recessive phenotype is the individuals with both lowercase "a" alleles and both lowercase "b" alleles. There's just one genotype in the array that matches those conditions.
These fractions indicate the proportions of all the offspring we would expect to have each phenotype if we crossed a pair of heterozygotes and the assumptions of Mendelian segregation are met. Comparing what we see with what we just predicted could therefore allow us to conclude that these loci are on different chromosomes, or far apart on the same one, if the data matches the prediction. If the data doesn't match the predictions, then we have good evidence that these loci are on the same chromosome and too close for free recombination between them.
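That 9:3:3:1 pattern can also be checked by brute force. The short Python sketch below (an illustration added here, not from the video) enumerates the 16 equally likely gamete combinations from an AaBb x AaBb cross, assuming independent segregation, and tallies how many fall into each phenotype class.

# Enumerate the 16 equally likely offspring of an AaBb x AaBb dihybrid cross,
# assuming independent segregation (unlinked loci).
from itertools import product
from collections import Counter

gametes = ["AB", "Ab", "aB", "ab"]   # each gamete type has probability 1/4

phenotype_counts = Counter()
for g1, g2 in product(gametes, repeat=2):
    # The dominant phenotype appears if either inherited allele is uppercase.
    first = "A_" if "A" in (g1[0], g2[0]) else "aa"
    second = "B_" if "B" in (g1[1], g2[1]) else "bb"
    phenotype_counts[(first, second)] += 1

print(phenotype_counts)   # 9 A_B_, 3 A_bb, 3 aaB_, 1 aabb out of 16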

Slide 5.

Consider two diallelic flower loci with dominant alleles at each.
There's one locus for shape with a dominant round allele such that the genotypes represented by "RR" and "Ro" will have the round shape phenotype and the genotype "oo" will have the recessive oval shape phenotype.
There's one locus for color with a dominant blue allele such that the genotypes represented by "BB" and "By" will have the blue color phenotype and the genotype "yy" will have the recessive yellow color phenotype.
Two heterozygous, round blue individuals are crossed. Note that this is the double dominant phenotype.
Then we count the number of offspring with each phenotype to see if the numbers match the prediction of our model.
We do this and get 871 offspring with round blue flowers, 295 offspring with round yellow flowers, 301 offspring with oval blue flowers, and 112 offspring with oval yellow flowers.
► These are the observed counts. We need to compare them to the predicted counts and calculate the chi-squared value to see if they match the predictions or deviate too much for just randomness to explain.

Slide 6.

We start by doing two things.
First, we'll create a set of grids corresponding to the phenotypes as shown. This will help us stay organized. The first grid will be for the observed numbers of individuals. The second grid will be for the predicted numbers of individuals. And the third will have the differences for each category, as measured using the chi-squared method. The total of these last grid values will be our overall chi-squared value.
We also need to figure out how many total offspring we have so that we can calculate what 9/16, 3/16, and 1/16 of them will be. This is just adding up all the observed values to get 1,579.
Now we want to get the predicted values. This just involves going through our phenotypes.
► The predicted number of round blue individuals is 9/16 times 1579 which gives us 888.188. Obviously, it's not possible to have non-integer observations, but our predicted values can be decimal and we should keep them that way to avoid rounding error.
► We'll put that value in our center grid.
► Then the predicted number of oval blue individuals is 3/16 times 1579 which gives us 296.063 and we'll put that in our grid.
► Then the predicted number of round yellow individuals is 3/16 times 1579 which gives us 296.063 which goes in our grid.
► Finally, the predicted number of oval yellow individuals is 1/16 times 1579 which gives us 98.688.
► Now we calculate the first chi-squared term, for the round blue category. This will be 871 minus 888.188, squared, divided by 888.188 to give 0.33260.
► We'll put that value in our bottom grid.
► Now the next chi-squared term, for the oval blue category. This will be 301 minus 296.063, squared, divided by 296.063 to give 0.08234.
► Now the next chi-squared term, for the round yellow category. This will be 295 minus 296.063, squared, divided by 296.063 to give 0.00381. This really small value indicates the almost perfect match of the 295 and the 296.063.
► Finally, the chi-squared term for the oval yellow category. This will be 112 minus 98.688, squared, divided by 98.688 to give 1.79580.
This is the largest value in our grid of chi-squared contributions, showing where the biggest relative mismatch is. If we get a significant deviation from our test, this set of phenotypes will be the one most responsible. It's sometimes useful to keep track of which categories deviate the most for follow-up analyses to figure out what possible causes of deviations may be.
► Now we add all these terms up to get our final calculated chi-squared value of 2.2146.
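For readers who want to check these grid calculations, here is a short Python sketch (assuming the scipy library; the code itself is not part of the video) that reproduces the per-category contributions and the total of about 2.2146.

# Flower example: observed counts versus the 9:3:3:1 Mendelian prediction.
from scipy.stats import chisquare

# Order: round blue, oval blue, round yellow, oval yellow.
observed = [871, 301, 295, 112]
total = sum(observed)                                       # 1579
expected = [r * total for r in (9/16, 3/16, 3/16, 1/16)]    # 888.1875, 296.0625, 296.0625, 98.6875

# Per-category chi-squared contributions, as in the bottom grid.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
print([round(c, 5) for c in contributions])                 # the oval yellow term is the largest

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(round(stat, 4))                                       # about 2.2146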

Slide 7.

We have our chi-squared value and now we want to know if it's significant.
What are the degrees of freedom for our test?
For a goodness of fit test, the degrees of freedom are the number of categories minus one, minus the number of parameters in our mathematical model we had to estimate from our data.
In this case there weren't any - the 1/4 values and the 9/16, 3/16, 3/16, and 1/16 all come from outside the data we had. We didn't use any of our 1579 values to get those predicted proportions.
Our degrees of freedom are therefore 4 minus 1, which equals three.
► Now we can use a table of chi-squared critical values, like the one shown here from the StatsExamples website. We look at the row for three degrees of freedom and the column for alpha equals 0.05 and our critical value of interest is 7.815.
► Since 2.2146 < 7.815 this tells us that p > 0.05.
► Because our P-value is larger than 0.05 we would therefore say the following.
"The observed frequencies of offspring phenotypes from the dihybrid cross do not significantly deviate from the predictions of a model assuming free recombination between these loci (p>0.05)."
This result from our statistical test indicates that the loci for flower shape and color are not linked. They seem to be on different chromosomes, or if they're on the same chromosome they're very far apart.
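If a printed table isn't handy, the same critical value and an exact P-value can be pulled from the chi-squared distribution directly. A brief sketch, assuming scipy and the 2.2146 statistic calculated above:

# Flower example: critical value and P-value for chi-squared = 2.2146 with df = 3.
from scipy.stats import chi2

stat = 2.2146
df = 3   # four categories minus one, no estimated parameters

critical = chi2.ppf(0.95, df)   # critical value for alpha = 0.05, about 7.815
p_value = chi2.sf(stat, df)     # upper-tail probability, roughly 0.53

print(round(critical, 3), round(p_value, 3))
# stat < critical and p > 0.05, so we fail to reject the null hypothesis.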

Slide 8.

Now let's consider two diallelic loci in the fruit fly Drosophila with dominant alleles at each.
There's one locus for eye color with a dominant red allele such that the genotypes represented by "RR" and "Rs" will have the red eye color phenotype and the genotype "ss" will have the recessive sepia eye color phenotype.
There's one locus for wing morphology with a dominant wild type allele such that the genotypes represented by "WW" and "Wc" will have the wild-type phenotype and the genotype "cc" will have the recessive curled wing morphology phenotype.
Two heterozygous individuals with red eyes and wild-type wings are crossed.
Then we count the number of offspring with each phenotype to see if the numbers match the prediction of our model.
We do this and get 42 offspring with red eyes and wild-type wings, 9 offspring with red eyes and curled wings, 11 offspring with sepia eyes and wild-type wings, and 10 offspring with sepia eyes and curled wings.
► These are the observed counts. Now we compare them to the predicted counts and calculate the chi-squared value to see if they match the predictions or deviate more than just randomness can explain.

Slide 9.

As before, we create a set of grids corresponding to the phenotypes. The first grid is for the observed numbers of individuals. The second grid is for the predicted numbers of individuals. And the third will have the differences for each category, the chi-squared contributions. The total of the values in the last grid will be our overall chi-squared value. We also need to figure out how many total offspring we have so that we can calculate what 9/16, 3/16, and 1/16 of them will be. In this case it's 72.
Now we want to get the predicted values. This just involves going through our phenotypes.
► The predicted number of red eyed wild-type individuals is 9/16 times 72 which gives us 40.5.
► We'll put that value in our center grid.
► The predicted number of sepia eyed wild-type individuals is 3/16 times 72 which gives us 13.5.
► The predicted number of red eyed curled wing individuals is 3/16 times 72 which gives us 13.5.
► Finally, the predicted number of sepia eyed curled wing individuals is 1/16 times 72 which gives us 4.5.
► Now we calculate the first chi-squared term, for the red eyed wild-type category. This will be 42 minus 40.5, squared, divided by 40.5 to give 0.05556.
► We'll put that value in our bottom grid.
► Now the next chi-squared term, for the sepia eyed wildtype category. This will be 11 minus 13.5, squared, divided by 13.5 to give 0.46296.
► Now the next chi-squared term, for the red eyed curled wing category. This will be 9 minus 13.5, squared, divided by 13.5 to give 1.5000.
► Finally, the chi-squared term for the sepia eyed curled wing category. This will be 10 minus 4.5, squared, divided by 4.5 to give 6.72222.
► Now we add all these terms up to get our final calculated chi-squared value of 8.7407.
From looking at the contribution values, we can see that most of this comes from the larger than expected number of double recessive phenotype individuals.
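The same few lines of Python used for the flower example (again a sketch assuming scipy, not part of the video) reproduce this result and show which category contributes the most.

# Fruit fly example: observed counts versus the 9:3:3:1 prediction.
from scipy.stats import chisquare

# Order: red wild-type, sepia wild-type, red curled, sepia curled.
observed = [42, 11, 9, 10]
total = sum(observed)                                       # 72
expected = [r * total for r in (9/16, 3/16, 3/16, 1/16)]    # 40.5, 13.5, 13.5, 4.5

contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
print([round(c, 5) for c in contributions])                 # the sepia curled term dominates

stat, _ = chisquare(f_obs=observed, f_exp=expected)
print(round(stat, 4))                                       # about 8.7407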

Slide 10.

We have our chi-squared value and now we want to know if it's significant.
What are the degrees of freedom for our test?
Just like before, our degrees of freedom are 4 minus 1, which equals three.
► Looking at the row for three degrees of freedom and the column for alpha equals 0.05 from our StatsExamples table of critical chi-squared values we get the 7.815 value again.
Our calculated value of 8.7407 is larger than this so we should also look at the other columns to find critical values that bracket it.
► The next critical value is 9.348 which corresponds to alpha equals 0.025.
► Since 7.815 < 8.7407 < 9.348 this tells us that 0.025 < p < 0.05.
► Because our P-value is less than 0.05 we would therefore say the following.
"The observed frequencies of offspring phenotypes from the dihybrid cross significantly deviate from the predictions of a model assuming free recombination between these loci (0.025 This result from our statistical test indicates that the loci for eye color and wing morphology are linked in some way. They seem to be fairly close to one another on the same chromosome.

Zoom out.

This video looked at the most common analysis of a dihybrid cross, when both loci have alleles in which one is dominant. Things can be more complicated if one or more of the alleles are co-dominant, but the principle and procedure are essentially the same.
There's a high-resolution PDF of this screen on the StatsExamples website.

End screen.

Click to subscribe if you found this video useful. Like or share if you think others will find this useful too.


Connect with StatsExamples here


This information is intended for the greater good; please use statistics responsibly.