CORRELATION & REGRESSION (CONCEPTS)

Link to summary slide and video transcript below

The text below is a transcript of the video.

LINK TO SUMMARY SLIDE FOR VIDEO:

StatsExamples-correlation-regression-concepts.pdf

TRANSCRIPT OF VIDEO:

Slide 1.

When we want to know if two variables are associated with one another we use the correlation and regression techniques. This video introduces the general concepts for these techniques and is the first video in a two-part series.
This video is more conceptual whereas the other video describes the mathematical details about the calculations and more technical aspects.

Slide 2.

The regression and correlation methods are for detecting a possible relationship between two variables, X and Y. Generally, we visualize this with an XY plot and the usual idea is to put the causing variable as the X and the caused variable as the Y.
More technically, the X-axis represents values of the independent variable and the Y-axis represents values of the dependent variable.
They have these names because the independent variable can vary and we're looking at how that variable may be related to, or influence the values of, another variable that depends on it. Hence the term dependent variable.
► These are sometimes also called explanatory and response variables. These are better terms because sometimes the X values aren't really independent of the Y values. However, the terms dependent and independent are much more widely used.

Slide 3.

Before we continue, a caution. It's important to always look at the data values using an XY plot before continuing with an analysis.
This is because plotting the data may reveal mistakes in the data set or situations in which a correlation or regression analysis is inappropriate.
► Consider Anscombe's quartet, that's the four data sets shown in the figures. These four data sets obviously have very different values, but the summary statistics for them are extremely similar.
The four data sets all have essentially the same mean and variance of X values and the same mean and variance of Y values. The best fit line (which we will talk about later) has the same equation for each, and the correlation coefficient and coefficient of determination are the same for all four data sets.
If we just relied on statistics and calculations we might think these four data sets are essentially the same, but they're obviously very different.
And only the one on the far left is appropriate for a correlation or regression analysis.
Modeling the other three as data points lying along a straight line doesn't really make any sense.
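To make the point about Anscombe's quartet concrete, here is a minimal Python sketch that computes the summary statistics mentioned above for all four data sets. The values are the published quartet (Anscombe, 1973); the helper name `summarize` and the labels I to IV are just choices for this illustration and are not tied to the order of the panels in the figure.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same X values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample variances, the best fit line Y = a + bX, and r."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx          # slope
    a = my - b * mx        # Y-intercept
    r = sxy / sqrt(sxx * syy)
    return {"mean_x": mx, "var_x": variance(x), "mean_y": my,
            "var_y": variance(y), "a": a, "b": b, "r": r, "r2": r * r}


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

The printed rows agree with each other to about two decimal places even though the four XY plots look nothing alike, which is why the plot has to come first.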

Slide 4.

A regression or correlation analysis generally has two purposes.
► First, to establish causality.
If we know that the only differences in our data are different X values, then differences in the Y values we see are due to those different X values.
If that's the case, then we can conclude that whatever the X-axis represents is causing or influencing the values represented by the Y-axis.
However, this requires detailed knowledge of the overall system so that we can be sure that other things aren't causing an apparent relationship.
► Second, we can use our results to make predictions.
We can obtain the equation for the line that best represents the overall linear or straight relationship between our two variables X and Y.
The equation for the line is Y equals A plus B X, where A is the Y intercept and B is the slope.
Note that we do not use the equation Y equals M X plus B as is often taught in high school. In fact, that equation from high school can be very misleading because in statistics, B usually represents the slope of the line, not the Y-intercept.
► We need to keep in mind that prediction and causality are very different things.
We can identify and measure a strong and consistent relationship between two things that allows us to make predictions, even if neither of those two things directly influences the other.
► This is where the well-known saying "correlation does not imply causation" comes from.
Keep in mind that correlation not automatically implying causation doesn't mean that correlations are useless, just that we have to be cautious about how we use that information.

Slide 5.

Let's take a look at a couple of examples that demonstrate why correlation does not always imply causation.
On the left we can see a hypothetical plot of the number of traffic accidents in different cities plotted against the number of red lights those cities have. The more red lights a city has, the more traffic accidents it has.
This would seem to imply that adding red lights causes more traffic accidents. If we wanted to reduce traffic accidents, we could remove some or all of the red lights.
Obviously, this doesn't make sense. What's really going on is that larger cities with more people will have more red lights and also more traffic accidents.
The lights are not causing the accidents; the overall population size is causing both of those values together and creating the relationship that we see.
On the right is a hypothetical plot of the number of drowning deaths at a state's beaches each year plotted against the sales of ice cream cones in that state.
The more ice cream cones sold, the more people die from drowning.
This would seem to imply that ice cream increases the risk of drowning and that to reduce drowning deaths we should ban the sales of ice cream.
Again, this doesn't make sense. What's going on is that during hotter summer months, people are more prone to purchase ice cream and also more prone to swim, which raises both the sales of ice cream and the risk of drowning.
The ice cream cones are not causing the drowning deaths; the weather patterns are causing both of these values together and creating the relationship we see.
Nevertheless, correlation studies are quite common.
► A major pro, or benefit, of correlation studies is that it is relatively easy to get the data to perform one.
For example, to perform the study on the left, all we need to do is get data from the city about the number of accidents and number of red lights that they have.
No need to set up a complicated experiment or find control groups. This is a major reason why correlation studies are still widespread.
► A major con, or detriment, of correlation studies is that other factors can cause the correlations that we see.
We have to be very careful when we see a clear relationship that something we haven't accounted for isn't influencing both values and creating the correlation between them.
It's often easy to figure out things like population size for the traffic light example, but trickier to think of things like the weather for the ice cream example.
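The red light example can be sketched as a small simulation. Everything in it is made up for illustration: the number of cities, the coefficients, and the noise levels are assumptions, and `pearson_r` and `residuals` are hypothetical helpers written for this sketch. Population size drives both red lights and accidents, and the lights have no effect on accidents at all.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


def residuals(y, x):
    """What is left of y after removing a least-squares straight line on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_cities = 2000

# Hypothetical cities: population (in thousands) is the hidden common cause.
population = [rng.uniform(10, 1000) for _ in range(n_cities)]
red_lights = [0.1 * p + rng.gauss(0, 10) for p in population]
accidents = [0.5 * p + rng.gauss(0, 50) for p in population]

# Raw correlation: strong, even though lights never enter the accident formula.
r_raw = pearson_r(red_lights, accidents)

# Correlation after removing what population explains from both variables.
r_controlled = pearson_r(residuals(red_lights, population),
                         residuals(accidents, population))

print(f"raw r = {r_raw:.2f}, after accounting for population r = {r_controlled:.2f}")
```

Once population is accounted for, the relationship between lights and accidents essentially disappears. Of course, this only works because we thought to measure population; a factor nobody recorded can't be removed this way.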

Slide 6.

In contrast to correlation studies, regression studies can demonstrate causation. Let's take a look at a couple of examples that demonstrate how regression can be used to imply causation.
On the left is hypothetical data for the rate of a chemical reaction plotted against concentrations of an enzyme.
As more enzyme is added, the rate of the reaction seems to increase.
Assuming everything else about the situation is identical, this demonstrates that the enzyme is directly involved in some step of that chemical reaction.
On the right is hypothetical data for the height of plants grown in treatments with different amounts of fertilizer added to their pots.
As more fertilizer is added, the plants grow taller.
Assuming everything else about the situation is identical, this demonstrates that adding fertilizer will cause these plants to grow taller.
► A major pro of regression studies is that if we're able to control everything, we can demonstrate causation and get a very good idea of what exactly is going on.
Controlling everything can be difficult however.
In the example on the left it is probably fairly easy to ensure that the conditions of the chemical reactions are identical.
For the example on the right, it might be more difficult. We would have to ensure that all the plants are genetically identical, that all the plants receive exactly the same amount of sunlight, that all the plants have the same amount of soil, and that all of them receive the same amount of water - in other words, that they are treated identically except for the amount of fertilizer added.
That's doable if we're careful, but it takes planning and equipment to pull that off.
► And that leads us to the major con of regression studies - they are relatively difficult to do.
We need to be able to control our system to a great extent to ensure no other factors can vary for our data points.
This is essentially only possible in a lab, which limits what kinds of things we can study using regression techniques.

Slide 7.

Let's look at an interesting example that shows how subtle things can get.
Have you ever wondered where the legend of storks delivering babies comes from? It's an ancient folk legend, but how does that even make sense?
Well, it turns out there is a long-standing relationship between storks and babies. These two figures are based on real data that come from Germany as presented in the paper with the citation shown.
Direct links to this, and other papers cited in this video, are in the video description.
These two plots show the annual number of births in a town in Germany and the number of pairs of storks that were seen nesting on the roofs of houses in the town.
The left-hand figure shows births at home whereas the right-hand figure shows births in hospital.
The relationship between storks and births at home is statistically significant with a P value of 0.025 whereas the relationship between storks and births in hospitals is not significant.
For most of Germany's history, however, people had their babies at home, so that left-hand relationship has been true for much longer than the lack of a relationship shown on the right.
So what's going on? Are storks delivering babies after all and that's why there are more babies when there are more storks?
Maybe they work like Santa Claus and use chimneys which is why they can deliver to homes, but not to hospitals.
In the absence of a good alternative explanation for this correlation, why shouldn't we think that storks deliver babies? After all this cultural belief predates most of what we know about modern human reproduction.
Perhaps as a responsible science educator I should teach both sides of this issue in my biology classes. In fact, maybe health education in all our schools needs to be adjusted to include this long-held cultural belief.
Of course not. It doesn't actually make sense for storks to deliver human babies. But a significant correlation suggests that something's happening. What's going on?
It turns out that what seems to be happening has to do with the weather.
When the weather is good, and summer is pleasant, then more storks survive to return the next year in the spring to start nests.
And when the weather is pleasant, instead of raining all the time, people's moods are better and they get along with each other and couples are more likely to do what it takes to generate a baby nine months later.
So good weather in this town leads to more babies and more storks the next year which explains the pattern on the left.
The lack of a pattern on the right is due to the fact that births in the hospital draw women from a much wider geographic area. Women who come to the town to have their baby are coming from places with weather patterns less directly related to the weather this town had the year before.
There are plenty of examples of situations like this - an observable pattern between two things that suggests an explanation, but the truth is different.

Slide 8.

So, in summary.
Regression controls the values of all other factors and therefore can be used to demonstrate causation.
Correlation does not control for the values of all of the factors and therefore can be suggestive, but does not necessarily demonstrate causation.
► One implication of this is that we need to beware of medical studies that show links between factors. These kinds of studies are almost always correlation studies.
We should always ask ourselves whether these studies control for, or try to minimize, all the other factors that may be important. And the answer is that some studies do and some don't.
For example, if we study whether taking a particular medication has an impact we need to make sure that people actually take the drugs as they're supposed to, we need to ensure their compliance.
When I tested experimental drugs years ago, the drugs came in a vial with an electronic cap that recorded when I opened it to make sure I opened the bottle when I was supposed to take the medication.
However, it's not like they were spying on me to make sure that I took it, they were only able to make sure I opened the bottle at the right time.
We could require people come to an office to take their drugs in front of a witness, but that would add dramatically to the cost and difficulty of doing the study.
In medical studies with larger samples, think about how much variation there is in the diet and lifestyles of all the subjects.
This includes other drugs that some people use that they would be unlikely to admit to using. There are also genetic differences between individuals that can't be controlled even if you did the study in confinement.
These and other factors are therefore not fully controlled in most studies of drugs or other medical interventions.
Some of these problems can be minimized with large trials and randomization procedures, but that makes them even more expensive.
These issues don't mean that medical studies like these and other correlation studies are useless, just that we need to be very cautious about how quickly we accept proposed explanations for observed correlations.
One thing to point out though is that while correlation alone doesn't automatically imply causation, causations do imply correlations. The lack of a correlation between two factors can therefore be used as good evidence to weaken or reject a claim that one factor causes the other.

Slide 9.

As another example, how do we know that sun exposure causes skin cancer?
► We do seem to observe a correlation between sun exposure and cancer rates. People who spend more time in the sun tend to have higher rates of skin cancer. However, there are potential confounding factors. People who work outdoors all day often come from lower positions on the socio-economic spectrum compared to people who spend all their time indoors. And those different groups of people may have a variety of other things that influence their risk of cancer.
For this reason, medical researchers try to identify situations in which they can minimize these other factors and do a regression style analysis instead of a correlation.
► This was done by looking at skin cancer rates on the left and right arms of individuals in the United States. In the United States, drivers tend to get more sun exposure on their left arms than on their right arms, and data shows that skin cancer is more common on people's left arms than on their right arms.
People's left and right arms have the same diet, drug use, and socioeconomic status as one another. This design allowed researchers to remove all these other factors as possible explanations for the observed correlation.
But maybe it's just a quirk of humans' left arms, how can we address this?
► One way is to also look at data for a country like Australia where people drive on the other side of the road so that now their right arms get more sun exposure than their left arms.
Sure enough, it turns out that skin cancer is more common on right arms than left arms in Australia. In the absence of some other explanation for the flipping of the difference between right and left arms in the US and Australia, now the sun exposure correlation data is becoming very convincing.
Well-designed sets of studies like these are how we can move from correlation to regression and therefore demonstrate causation.
In the non-technical world however, anecdotes and examples can be more persuasive.
► For this reason, the New England Journal of Medicine published a case study examining the effects of sun exposure on the two sides of the face of an individual who drove a delivery truck for 28 years. The left side of his face got more sun exposure than the right side of his face and you can see from the picture what the effects were.
A single individual is not something we should base scientific conclusions on, but in the context of a process for which we have lots of other evidence, it can be a very useful and persuasive piece of evidence.

Slide 10.

The explanations for correlations can sometimes be hard to determine. There is also a relationship between sun exposure and prostate cancer, which is a cancer in a location where the sun does not shine.
This article from the Journal of Health Geographics shows that as individuals experience more sun exposure, they have lower rates of prostate cancer.
This is just a correlation, so it doesn't prove causation, but the pattern was statistically significant, so something is going on. The challenge then is to figure out what that something is.

Slide 11.

A possible explanation for the relationship between sun exposure and prostate cancer was described in this paper from the journal Frontiers in Microbiology.
► In this study they took individuals and either gave them vitamin D supplements or not. Data for the individuals without vitamin D supplementation is indicated in orange and data for the vitamin D supplemented individuals is indicated in gray.
The metabolism and physiology of vitamin D is known to be influenced by the UV radiation a person receives.
They measured various aspects of their subjects' intestinal microbiome before and after treatment with UV radiation.
The paper has lots of data, but let's focus on just a couple general aspects of what they saw.
If you look at figures C and D the amount of bacteria changes more, before and after UV exposure, in the subjects that did not have vitamin D supplementation compared to ones that had extra vitamin D.
In figure B, the species diversity as measured by the Shannon index shows two things. First, vitamin D supplementation seemed to increase the diversity of bacteria in people. Second, the effects of sun exposure on increasing gut microbiome diversity were less apparent in individuals supplemented with vitamin D.
Why that happened is a topic for future studies.
This study may have been motivated by the prior one showing that UV radiation has something to do with prostate cancer, a type of cancer that may well be influenced by the gut microbiome.
The correlation seen in that previous study did not demonstrate causation, but it was not useless because it motivated this study.
► In general, correlation is not useless, it is often the first step for further studies.

Slide 12.

As we saw in that previous study, sometimes the relationship between two variables is not a straight line, but a curve. What do we do about curved relationships when we see them? Our first approach is usually to perform a transformation on the data to see if we can generate a linear pattern.
For example, the X and Y values in the data set shown exhibit this curved relationship, and a straight line would not do a very good job of modeling that relationship.
► But if we look at the log base 10 transformed values for the data, we can see that now the pattern is extremely linear and a straight line would be an appropriate way to model this pattern.
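As a rough sketch of this idea in Python, the made-up data below follow an exponential curve, and only the Y values are log base 10 transformed. Both the data and the choice to transform just Y are assumptions for this illustration; depending on the shape of the curve, transforming X or both variables may be what straightens the pattern. `pearson_r` is a helper written for this sketch.

```python
from math import log10, sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Hypothetical curved data: Y roughly doubles with every step in X.
x = list(range(1, 11))
y = [2 * 10 ** (0.3 * xi) for xi in x]

# Transform Y only; log10(Y) = log10(2) + 0.3 * X is a straight line in X.
log_y = [log10(yi) for yi in y]

r_raw = pearson_r(x, y)
r_log = pearson_r(x, log_y)
print(f"r for X vs Y: {r_raw:.3f}")
print(f"r for X vs log10(Y): {r_log:.3f}")
```

The correlation coefficient for the raw values is noticeably below 1 because a straight line misses the curve, while the transformed values fall on a line. Real data would have noise, so the improvement would be less clean than in this idealized case.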

Slide 13.

What criterion should we use for the best fit line for data? Looking at the top right, obviously a straight line doesn't do a good job for curved data, whereas for perfectly linear data the straight line goes through all the data points.
That makes sense, but what about less obvious situations?
► In the two figures below, the data points are in exactly the same locations. We can imagine two different straight lines to represent these relationships, but which of these two straight lines is better?
Which provides the best fit and how do we decide objectively?

Slide 14.

The criterion used to determine the best fit line for data involves looking at the differences between the data values and the prediction for where we would expect those data values to be based on the independent variable.
For each of our two figures, the line shown is making predictions about where the Y-values should be based on the X-value. And we can see that sometimes the actual values are above or below where we would expect them to be. These differences between the data values and the predictions are called residuals.
All these residuals together give us a sense of how good a job the line is doing at getting close to all the data values. The smaller they are in general, the better the line is doing at showing where the data is.
► To put a number on this, we square the positive and negative residuals to make their magnitudes all positive and then we sum them all up. The best fit line is then the line with the smallest sum of these squared residuals. It's the straight line that gets the closest, in general, to the data values.
Working with sums of squares is something that is very common in statistics and if you've done statistical tests like the chi-squared or the ANOVA you've seen this approach before.
The companion video to this one goes into more detail about these calculations.
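The criterion described above can be sketched in a few lines of Python. The six data points and the two candidate lines are invented for illustration, and the function names are choices made for this sketch. The closed-form slope and intercept used in `best_fit_line` are the standard least-squares formulas for a line Y = A + BX.

```python
def sum_squared_residuals(x, y, a, b):
    """Sum of squared differences between each Y and the prediction a + b*X."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


def best_fit_line(x, y):
    """Least-squares intercept a and slope b for the line Y = a + b*X."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b


# Hypothetical data points (same points for every line we try).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

# Two candidate lines someone might draw by eye, as (intercept, slope).
candidates = {"line 1": (1.0, 1.0), "line 2": (0.0, 1.3)}
candidate_ssr = {name: sum_squared_residuals(x, y, a, b)
                 for name, (a, b) in candidates.items()}

a_hat, b_hat = best_fit_line(x, y)
ssr_best = sum_squared_residuals(x, y, a_hat, b_hat)

for name, ssr in candidate_ssr.items():
    print(f"{name}: sum of squared residuals = {ssr:.3f}")
print(f"best fit Y = {a_hat:.3f} + {b_hat:.3f} X: "
      f"sum of squared residuals = {ssr_best:.3f}")
```

Whichever candidate lines we try, none can have a smaller sum of squared residuals than the line returned by `best_fit_line`, which gives us an objective way to choose between lines drawn through the same points.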

Slide 15.

Lastly, you may wonder where these technical terms get their names.
► Why is our first technique called correlation?
► This name comes from combining the Latin prefix "co-", which means together, and relation to get co-relation. That one makes sense.
► Why is the regression method called regression?
► It turns out that this name comes from Darwin's cousin Francis Galton. He was one of the first people to make plots of data like this and put straight lines through the data points to understand the relationship between variables.
► In particular, he was interested in studying genetics and the relationship between parents and their offspring.
When he started making these plots he noticed that the slope was always less than one.
What that means is, that if you have a pair of parents that have above average values for some trait like height or IQ, the predicted average value for their offspring will actually be less than their own.
A pair of very tall parents will tend to have kids that end up shorter than they are, and a pair of very smart parents will tend to have dumber kids.
► This concerned him and he used the term regression to describe how offspring weren't as good as their parents.
► What he failed to notice, or did not focus on because he was somewhat classist, racist, and sexist like most rich white men in Victorian England, was that it goes both ways.
At the other end of the X-axis a pair of very short parents will tend to have kids that are taller than they are, and a pair of very dumb parents will tend to have kids that are smarter than they are.
There's no need to worry about society collapsing from regression, the overall average tends to stay the same.
► The reason for this is that measurable traits of organisms are typically due to both genetic factors from their parents and random or environmental factors.
The very tallest individuals have both genetic alleles that contribute to height and experienced an environment that made them taller.
When they reproduce their children will inherit the alleles, but not necessarily the environment and therefore not end up as tall.
Similarly, the offspring of very short parents will inherit their short alleles, but probably not the same environment that caused the parents to be short.
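Here is a minimal simulation of that explanation, under deliberately simplified assumptions: each family has one genetic contribution that the offspring inherits unchanged, parents and offspring each get their own independent environmental contribution, and both contributions have the same spread. Real inheritance is messier than this. Values are deviations from the population average in arbitrary units, and `slope` and `average` are helpers written for this sketch.

```python
import random


def slope(x, y):
    """Least-squares slope of y plotted against x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def average(values):
    return sum(values) / len(values)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_families = 5000

# Consistent part: genetic contribution shared by parents and offspring.
genetic = [rng.gauss(0, 1) for _ in range(n_families)]
# Random part: each generation draws its own environment.
parents = [g + rng.gauss(0, 1) for g in genetic]
offspring = [g + rng.gauss(0, 1) for g in genetic]

b = slope(parents, offspring)

# Families at the two extremes of the X-axis.
tall = [(p, o) for p, o in zip(parents, offspring) if p > 2]
short = [(p, o) for p, o in zip(parents, offspring) if p < -2]
tall_parent_mean = average([p for p, _ in tall])
tall_offspring_mean = average([o for _, o in tall])
short_parent_mean = average([p for p, _ in short])
short_offspring_mean = average([o for _, o in short])

print(f"slope of offspring on parents: {b:.2f}")
print(f"tall parents {tall_parent_mean:.2f} -> offspring {tall_offspring_mean:.2f}")
print(f"short parents {short_parent_mean:.2f} -> offspring {short_offspring_mean:.2f}")
```

With equal genetic and environmental spread the slope comes out near one half. The offspring of the tallest parents are still taller than average, just less extreme than their parents, and the same thing happens in the other direction for the shortest parents.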

Slide 16.

This general fact that measurable quantities tend to be a combination of both consistent and random factors can lead to something called the fallacy of regression. When we have repeated measurements or trials, the consistent factor will be seen again, but the random part will not.
► Consider this analogous scenario, drawing sample values from a population.
If we plotted the current value on the X-axis and the next value on the Y-axis we would get a similar plot with a slope less than 1.
The average is the consistent aspect of the values, but the variation in the data adds random noise to the value of each particular sampled value.
Whenever we have a process like this, if we select an extreme value then we expect the next sample value from the population to be more like the average.
Now remember that populations and samples have a broad interpretation.
► Consider a situation in which an athlete or a sports team has a particular performance or season. We can think of that as a sample value drawn from the population of potential performances or seasons they could have had.
If the current performance or season is exceptionally good, then we would expect the next performance or season to "regress" back to their population mean.
An amazing season followed by an inferior one doesn't indicate that the person or team is losing ability, it's what we would expect.
The fallacy involves focusing on any tangential action taken after a given performance or season and assuming it is the cause for the regression to the mean.
If a team wins a championship and then changes their workout schedule, a lower record the next year doesn't mean the new schedule is inferior, the decline may just be what was expected.
Likewise, if an individual's performance is extremely bad, and then we make some kind of change which is followed by an improvement, we can't assume that change caused the improvement.
It may just be a case of returning to the mean because that's exactly what we would expect to see happen if we had done nothing.
This fallacy doesn't just hold for sports.
► Imagine that a person experiences chronic pain. The amount of pain they experience each day is drawn from a population of possible pains based on a consistent factor like their condition and other random factors that are unknown.
► If the person goes and gets some kind of treatment on a day when their pain is especially bad,
► then we expect an improvement in the pain even if the treatment was totally useless because they're just regressing to the mean. However, what it seems like to the person is that the treatment worked.
The apparent effectiveness of the treatment based on an improvement in symptoms is an example of the danger arising from the fallacy of regression.
This fallacy is a major reason why anecdotal reports of patients getting better after some unusual treatment are highly suspect.
It's not that doctors don't believe the patients, it's because there's a good chance they would have improved anyway due to a regression to the mean.
This fallacy is why the medical community generally disregards anecdotal reports until well-designed studies that can detect this process have been performed.
Anecdotes can be used to inspire and motivate studies, and can be very persuasive as we saw earlier with the delivery driver's face, but they should never be used as evidence all by themselves.
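The chronic pain scenario can be sketched as a simulation in which no treatment exists at all. The baseline of 5, the day-to-day spread of 2, and the cutoff of 8 for an "especially bad" day are arbitrary numbers chosen for this illustration, and each day's pain is assumed to be drawn independently.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

baseline = 5.0   # consistent factor: the person's underlying condition
n_days = 10_000

# Daily pain = consistent baseline + independent random variation.
pain = [baseline + rng.gauss(0, 2) for _ in range(n_days)]

# Days on which the person would be most likely to seek treatment.
bad_days = [day for day in range(n_days - 1) if pain[day] > 8.0]

mean_bad = sum(pain[day] for day in bad_days) / len(bad_days)
mean_next = sum(pain[day + 1] for day in bad_days) / len(bad_days)

print(f"especially bad days found: {len(bad_days)}")
print(f"average pain on those days: {mean_bad:.2f}")
print(f"average pain the following day: {mean_next:.2f}")
```

The day after an especially bad day is, on average, back near the baseline even though nothing was done. Any useless treatment taken on the bad day would appear to have produced that improvement, which is why a comparison group is needed to tell a real effect from regression to the mean.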

Zoom out.

Correlation and regression can be powerful tools to help us understand what's going on, but we must use them properly. Maxims like "correlation does not imply causation" and fallacies like the "fallacy of regression" remind us that we need to be cautious about how we use these techniques. These aren't just phrases to memorize, they are useful mottos to hold onto throughout all aspects of our life.
As always, a high-resolution PDF of this summary slide is available on the StatsExamples website.

End screen.

Liking, subscribing, commenting, and sharing is correlated with this video getting more views and helping more people understand correlation and regression.

This information is intended for the greater good; please use statistics responsibly.
