CORRELATION & REGRESSION (PROCEDURES)


The text below is a transcript of the video.






LINK TO SUMMARY SLIDE FOR VIDEO: StatsExamples-correlation-regression-procedures.pdf

TRANSCRIPT OF VIDEO:


Slide 1.

One way to determine whether two quantitative variables are associated with one another is to use the correlation and regression techniques.
This video describes the calculations for the correlation and regression techniques and is the second of a two-part series.
This video is more technical and mathematical whereas the other intro video is more conceptual.

Slide 2.

The goal of a regression or correlation analysis is to identify whether or not there is a non-random relationship between X & Y.
The math we perform is exactly the same whether we are doing a correlation analysis or a regression analysis. The main difference comes from whether we have enough information to determine causation.
If we have controlled all the other factors except the two variables we're measuring, then we can do a regression analysis and demonstrate causation.
If we just have information on the two variables and are looking for a relationship then we're doing a correlation analysis, which is still useful, but doesn't conclusively demonstrate causation.
Either way we're going to use the same five steps to perform our analysis.
► First, we check the assumptions of our model. These include linearity of the data points, independence of the data points, and some assumptions about variances as we'll see.
► Second, we estimate the parameters for the population based on our sample. We estimate the equation for the population relationship based on the best fit line through our sample data. We can also estimate the correlation coefficient and coefficient of determination which measure the strength of the relationship.
► Third, we perform tests for the statistical significance of the slope of our best fit line. There are two approaches to this. The ANOVA approach uses mean sums values to determine whether the slope is significantly different from zero. The t-test approach compares the observed slope to the standard error of that slope to determine whether the slope is significantly different from some hypothesized slope - usually, but not always, a slope of zero. The t-test approach is not restricted to a comparison with a slope of zero, whereas the ANOVA approach is. We can also easily perform a t-test for the significance of the correlation coefficient.
► Fourth, using standard errors for the Y intercept and slope, we can calculate confidence intervals for these values and the best fit line. We can also calculate inclusion intervals for where we predict the population data values will be.
► Fifth, the final step is always to interpret what our statistical and mathematical results mean for the data that we are studying. This is where we think carefully about whether our data allows us to perform a regression analysis or just a correlation analysis. We must also be cautious about the range of data values for which we can make our conclusions.

Slide 3.

Let's look at the assumptions of a correlation or linear regression analysis.
► First, we require that all the data points are independent of one another. That is, the values of the data points do not directly influence one another, and no data point is more closely related to some points than to others.
Testing this assumption is very hard to do from our data alone; we have to know about the nature of the values we are studying.
► Second, the independent variable values, the x-axis values, must be fixed and known without error. This is essentially impossible in the real world, but luckily this is a weak assumption and usually doesn't cause problems as long as our measurement error is low.
► The final set of assumptions have to do with the distribution of the data points.
The pattern must be linear; the residuals, the differences between the Y-axis values of our data points and our best fit line, must exhibit a normal distribution along the entire range of the independent variable; and the data values must exhibit an equal variance along the entire range of the independent variable.
We generally check for these assumptions by plotting the data and the residuals to see whether these appear to be true.
Only if we are fairly confident that these five things are true should we proceed with our analysis.

Slide 4.

As mentioned before, we generally check the assumptions about the pattern of data points visually. In these three pairs of plots the data is shown in the upper plots and the residuals, the differences between the data values and the ones predicted by the best-fit line, are shown in the lower plots.
In the left pair of figures, we can see that the data seems to be fairly linear and the residuals cluster around zero with no obvious pattern indicating a problem.
In the center pair of figures, we can see that the data seems to exhibit a curved relationship, which is a problem.
We can also see that the residuals do not exhibit a normal distribution centered around zero along the entire range of X values.
For small and large X values the residuals are all positive, and for intermediate X-values the residuals are all negative, so the residuals aren't centered around zero along the entire range.
In the right pair of figures, we can see that while the data appears to be fairly linear, the variance of the data points in the Y direction is quite different for different ranges of the X-values.
We can see this in both the plot of the data and the plot of the residuals. These patterns clearly violate our assumption of equal variances along the X-axis.
If we wanted to do an analysis using the center or right set of data we would have to perform a transformation of some kind to get new data values that are linear, exhibit a normal distribution of residuals centered around zero, and have equal variances for the Y values along the entire X-axis.
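As a rough illustration of this kind of visual check, here is a minimal Python sketch (assuming NumPy and Matplotlib are installed, and using made-up data rather than the data from the slides) that fits a least-squares line and plots the data and the residuals in the same upper/lower layout described above.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical example data, not taken from the video.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1, 10.7])

# Least-squares best fit line y = a + b*x (np.polyfit returns the slope first).
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)          # observed Y minus predicted Y

fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
top.scatter(x, y)                    # the data with the best fit line
top.plot(x, a + b * x)
bottom.scatter(x, residuals)         # residuals should scatter evenly around zero
bottom.axhline(0, linestyle="--")
plt.show()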

Slide 5.

The best fit line is the one that represents the data the best. It's the one that minimizes the sum of the squared residuals, also called SS error.
These three pairs of plots show an identical set of data points. In the top figures I've indicated three different straight lines that are candidates for the best fit line and in the bottom figures I've shown what the residual plot would look like for each.
To find the best-fit line we would look at the residual plots and calculate the square of each residual and add them up.
► The SS errors as we move from left to right are 29, 49.5, and 39.7. In general, the data points are closer to the line in the figure on the left compared to the center or right.
We can also see that the assumption of the residuals being centered around zero along the entire range of X -values is violated in the center and right figures.
► For these reasons we would prefer the line on the far left.
But how do we make sure there isn't some other line that is even better? Luckily there is a procedure that will provide us with the one line that minimizes the sum of the squares of the residuals.

Slide 6.

As mentioned, the best fit line is the one that minimizes the sum of the squared residuals, also called SS error.
Conceptually this approach is partitioning the overall sum of squares in the Y direction into two components, sum of squares regression and the sum of squares error. These two sums add up to the sum of squares total in the Y direction.
The sum of squares regression indicates how much variation we expect in the Y direction because of how variable the data points are in the X direction and the overall pattern of the relationship between X and Y.
The sum of squares error indicates how far away the data points are from the exact values we predict from the relationship between X and Y.
Together these two factors, a consistent trend and random error, combine to generate the overall sum of squares in the Y direction, also called SS total.
► This approach of dividing the total sum of squares into a part that comes from a consistent factor and a part that comes from noise is very similar to the approach used in ANOVA analyses.
There is a direct parallel with the ANOVA method of partitioning the overall sum of squares total into the sum of squares among and the sum of squares within. In that situation the sum of squares among was the potentially non-random difference between the means of the groups and the sum of squares within represented the noise.
This channel has a video on the ANOVA that goes over all this that you may want to check out. A link is provided in the video description below.
As shown in the figures, we can think of the sum of squares total, the variation represented by the purple arrows, as being the sum of SS regression (or SS among) in blue and the SS error (or SS within) in red.
In the same way that the ANOVA approach is to calculate the mean sums among and compare it to the mean sums within, we will calculate the mean sums regression and compare it to the mean sums error.
But first we need to have a best fit line to work with.
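Written as an equation, this partition of the variation in the Y direction is simply:

SS_{total} = SS_{regression} + SS_{error}

which directly parallels the ANOVA identity SS_{total} = SS_{among} + SS_{within}.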

Slide 7.

How do we estimate the equation for the best fit line from our sample data, the Y intercept and the slope that would go into our equation Y equals A plus BX?
Luckily, we don't have to guess and try a bunch of different lines, there is a procedure to give us the one line that will minimize the sum of squares error.
First, we make a table of the data with columns for the X and Y values as shown.
In the table there is a small data set of three data points: (3, 6), (5, 8), and (7, 13).
For the X & Y values we also calculate their sum and their mean as indicated. The sums are 15 and 27 and the means are 5 and 9.
► Then we're going to calculate the sum of squares for the X-values SSx, the sum of squares for the Y-values SSy, and a value called the sum of the cross products, SPxy using the equations shown.
► To calculate the sum of squares X, we subtract the mean X value from each X value, square that value, and then add them all up.
For our data that would be (3 − 5)² + (5 − 5)² + (7 − 5)², which is 4 + 0 + 4 = 8 for our SSx value.
► To calculate the sum of squares Y, we subtract the mean Y value from each Y value, square that value, and then add them all up.
For our data that would be (6 − 9)² + (8 − 9)² + (13 − 9)², which is 9 + 1 + 16 = 26 for our SSy value.
► To calculate the sum of the cross products, for each data point with values X and Y, we subtract the mean X value from each X value, then subtract the mean Y value from each Y value, multiply those values, and then add them all up.
For our data that would be (3 − 5)(6 − 9) plus (5 − 5)(8 − 9) plus (7 − 5)(13 − 9)
equals (−2)(−3) plus (0)(−1) plus (2)(4)
equals 6 plus 0 plus 8 = 14 for our sum of the cross products.
► The slope of the best fit line is then given by the sum of the cross products divided by sum of squares X.
For our data that's going to be 14 divided by 8. But first let's look at why this equation makes sense.
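In symbols, with x̄ and ȳ the sample means, the quantities described above are:

SS_X = \sum_i (x_i - \bar{x})^2
SS_Y = \sum_i (y_i - \bar{y})^2
SP_{XY} = \sum_i (x_i - \bar{x})(y_i - \bar{y})
b = \frac{SP_{XY}}{SS_X}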

Slide 8.

The slope is the sum of the cross products divided by sum of squares X. Let's look at the numerator and denominator of this equation.
If we look at these three data sets, the one on the left should give us a positive slope, the one in the center should give us a negative slope, and the one on the right should give us not much of a slope at all.
Let's think about the sign of each of the terms in the sum of the cross products.
When the X value is larger than the mean X value, or the Y value is larger than the mean Y value, the corresponding factor in the cross product is positive.
But if the X value is less than its mean, or the Y value is less than its mean, the corresponding factor is negative.
In the left-hand figure, 4 data points show a situation in which both the X and Y values are above the means, so you would get two positive values multiplied together in our sum of the cross products.
Another 4 data points show a situation in which both the X and Y values are below the means, so you would get two negative values multiplied together to make a positive.
Only one data point multiplies a positive by a negative and puts a negative value into our sum of the cross products.
We therefore expect the sum of the cross products to end up being positive and since that's the numerator of the slope, that would make the slope positive.
We see a similar situation for the center figure.
4 data points show a situation in which the X value is below the mean but the Y value is above the mean, so you would get a negative and a positive value multiplied together, making a negative in our sum of the cross products.
Another 4 data points show a situation in which the X value is above the mean but the Y value is below the mean, so you would get a positive and a negative value multiplied together, making a negative in our sum of the cross products.
Only one data point multiplies a positive by a positive and puts a positive value into our sum of the cross products.
We therefore expect the sum of the cross products to end up being negative and since that's the numerator of the slope, that would make the slope negative.
And in the figure on the right, we have a mix of situations in which we will get positive or negative terms in our sum of the cross products and they will mostly cancel each other out.
That would leave us with a very small value for the sum of the cross products and therefore a very small slope when we put that in the numerator of our fraction.

Slide 9.

Now let's think about how the variation in the X direction as represented by the SSx influences the slope.
The three data sets shown have identical y-values, but the X values are very spread out on the left, much closer on the right, with the center plot intermediate.
The SSx value will therefore be largest on the left and smallest on the right.
Thinking about what that does to our slope, for the data on the left we would be putting a larger value in the denominator of the slope which would result in a smaller slope.
For the data on the right, we would be putting a smaller value in the denominator of the slope which would result in a larger slope. This matches what we expect the slopes to be when we look at the pattern of data.

Slide 10.

So we have an equation for the slope that makes conceptual sense, the sum of the cross products divided by the sum of the squares X, but how do we get the y intercept?
► It turns out that the best fit line will always pass through the center point for our data, the means of the X and Y values.
Since that's true, we can rearrange the equation of the line to solve for the Y intercept. We just need to plug in the slope, the mean Y value, and the mean X value, all of which we already have available.
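In symbols, because the best fit line Y = A + BX passes through the point (x̄, ȳ), we have:

\bar{y} = A + B\,\bar{x} \quad\Rightarrow\quad A = \bar{y} - B\,\bar{x}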

Slide 11.

Let's look at a couple of super-simple examples of calculating the best fit equation.
On the right are two data sets, each with three data points and we will calculate the slope and intercept for each of them. The figures below each data set show the data values and the dotted line is the best fit equation we'll obtain. The sums and means and SS values for the X and Y values are calculated as done earlier when we introduced the equation for the slope. The data set on the right is the exact same one we looked at earlier.
► For the left-hand data set we calculate the slope by looking at the sum of the cross products which is 0 and dividing it by the sum of squares X which is 8.
0 divided by 8 gives us a slope of 0. This is a flat line and makes sense.
► For the Y-intercept we plug in the mean Y value of 7 and the mean X value of 5 into our rearranged equation using a slope of zero and solve for A to obtain a Y-intercept of 7.
That's pretty much where we would expect the line to intersect the Y-axis from looking at the figure.
► For the right-hand data set we calculate the slope by looking at the sum of the cross products which is 14 and dividing it by the sum of squares X which is 8.
14 divided by 8 gives us a slope of 1.75. This line slopes upwards and we can see it matches the data well.
► For the Y-intercept we plug in the mean Y value of 9 and the mean X value of 5 into our rearranged equation using a slope of 1.75 and solve for A to obtain a Y-intercept of 0.25. That is pretty much where we would expect the line to intersect the Y-axis from looking at the figure.
► We can see from looking at each of these plots of the data, and the best fit lines, that the equations make sense. On the left a y-intercept of 7 and a slope of 0 seems appropriate to represent the lack of a relationship between X and Y. On the right a y-intercept of 0.25 and a slope of 1.75 defines a line that represents the pattern of the data quite well.
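As a quick check of the right-hand example (the individual values of the left-hand data set aren't listed in the transcript, so they are omitted here), a minimal Python sketch of these calculations:

x = [3, 5, 7]
y = [6, 8, 13]
n = len(x)

x_bar = sum(x) / n                                   # mean X = 5
y_bar = sum(y) / n                                   # mean Y = 9

ss_x = sum((xi - x_bar) ** 2 for xi in x)            # SSx = 8
sp_xy = sum((xi - x_bar) * (yi - y_bar)
            for xi, yi in zip(x, y))                 # SPxy = 14

b = sp_xy / ss_x                                     # slope = 14 / 8 = 1.75
a = y_bar - b * x_bar                                # intercept = 9 - 1.75 * 5 = 0.25
print(a, b)                                          # 0.25 1.75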

Slide 12.

Once we have a best fit line equation our next step is to perform significance testing of that slope. The first approach we will look at is the ANOVA approach.
The sum of squares total is the overall variation and is the sum of the sum of squares error and the sum of squares regression.
The figure on the left shows what these values represent. The purple arrow indicates the overall sum of squares of the data values in the Y direction and the two contributors to that value.
The residuals, the differences between the data points and the Y-value predicted by the best fit line, are indicated in red. The sum of the squares of these represent the variation caused by randomness and noise.
Then there is a general pattern where the data points with smaller X-values have smaller Y values and the data points with larger X values have larger Y values. The differences between the predicted Y-values and the mean Y-value, summarized by their sum of squares, represent the variation caused by this general linear trend.
Our approach to test the significance of the slope is then to compare the mean sums regression to the mean sums error.
As mentioned previously this is directly analogous to the approach used in the one factor ANOVA which compares the mean sums among to the mean sums within.

Slide 13.

To calculate SS total, which is also SSy, we calculate the differences between every observed Y value and the mean of all the Y values, square them, and add them all up. The Y sub i indicates each observed data value.

Slide 14.

To calculate SS error, we calculate the differences between every observed Y value and the predicted Y-value, using our best fit equation, square them, and add them all up. The Y sub i with the hat symbol indicates each predicted data value.

Slide 15.

To calculate SS regression, we calculate the differences between every predicted Y-value, using our best fit equation, and the mean Y value, square them, and add them all up.
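Putting the last three slides into symbols, with y_i the observed values, ŷ_i the values predicted by the best fit line, and ȳ the mean of the Y values:

SS_{total} = \sum_i (y_i - \bar{y})^2
SS_{error} = \sum_i (y_i - \hat{y}_i)^2
SS_{regression} = \sum_i (\hat{y}_i - \bar{y})^2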

Slide 16.

Once we have these values, we just need to perform an ANOVA F test to see if mean sums regression is significantly larger than the mean sums error.
Our null hypothesis is that the slope of the population is equal to zero. The alternative hypothesis is that the slope of the population is not equal to zero.
► We have our sums of squares terms from before and now we just need the degrees of freedom for regression and error.
► The degrees of freedom for the regression is always one and the degrees of freedom for the error is the number of data points minus 2.
► Then our mean sums error is the sum of squares error divided by n - 2 and mean sums regression is sum of squares regression divided by 1 which is just the sum of squares regression.
► Our F calculated value is the mean sums regression divided by the mean sums error and the significance of this value tests whether the variance in X explains the variance in Y.
Note that we don't say causes, we say explains because demonstrating causation relies on understanding what these numbers represent, it doesn't come from the calculations themselves.
Again, if these ANOVA calculations are new to you then check out the ANOVA videos on this channel and the StatsExamples website.
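To make the arithmetic concrete, here is a minimal sketch of the ANOVA approach applied to the earlier three-point example (assuming SciPy is available; with only one error degree of freedom this tiny data set is purely illustrative):

from scipy import stats

x = [3, 5, 7]
y = [6, 8, 13]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sp_xy / ss_x                                     # slope = 1.75
a = y_bar - b * x_bar                                # intercept = 0.25

y_hat = [a + b * xi for xi in x]                     # predicted Y values
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # 1.5
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)         # 24.5

ms_regression = ss_regression / 1                    # df regression = 1
ms_error = ss_error / (n - 2)                        # df error = n - 2
f_calc = ms_regression / ms_error                    # about 16.3
p_value = stats.f.sf(f_calc, 1, n - 2)               # about 0.15 for this tiny example
print(f_calc, p_value)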

Slide 17.

Significance testing for the slope can also be done using a T Test.
The T-test is a very common technique for comparing the mean of a set of values to another set or to a hypothetical value.
This channel and the website also have several T-test videos and pages. They're linked below in the video description.
The one-sample t-test works by comparing the difference between a hypothesized population value and the observed sample mean in terms of standard errors.
When the difference between those two means is large, in terms of the standard error of the mean, then we can conclude the population mean differs from the hypothesized one.
When that difference is not large, then we lack the evidence to conclude that a population mean differs from the hypothesized value.
The significance test of the slope works the same way.
We calculate the difference between the observed sample slope and a hypothesized population slope and compare it to the standard error of the slope.
The null hypothesis will be that the population slope equals a hypothesized value and the alternative will be that they are not equal.
The degrees of freedom for this t-test is the number of data points minus two.
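In symbols, with b the observed slope, β₀ the hypothesized population slope, and SE_b the standard error of the slope:

t = \frac{b - \beta_0}{SE_b}, \qquad df = n - 2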

Slide 18.

To do the t-test all we really need to do is figure out the standard error of the slope.
An equation for the standard error of the slope is shown in the box. Although this equation is not intuitive, we can see that it just uses a bunch of values we have already calculated, the sum of squares y, the slope, and the sum of squares X.
Then we use this to calculate a T value and use that value in the same way we would any other T value and degrees of freedom.
To test for the significance, we either compare it to critical values to get a range of P values or we calculate the P value that directly corresponds to it.
The result of our T test is a little different from the ANOVA. The T test approach tests whether the slope is different from the null hypothesis slope. If the hypothesized population slope is 0 this approach is equivalent to the ANOVA.
However, this approach can be used to see whether a relationship between X and Y differs from some specific relationship between X and Y, not just to see if there is a nonzero relationship.
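Here is a minimal sketch of this t-test for the three-point example. The exact equation shown on the slide isn't reproduced in the transcript, so this uses the common textbook form of the standard error of the slope, the square root of MS_error divided by SS_X, which only needs SSy, SSx, and the slope because SS_error = SS_Y − b²·SS_X.

import math
from scipy import stats

# Summary values from the three-point example: SSx = 8, SSy = 26, slope = 1.75, n = 3.
n, ss_x, ss_y, b = 3, 8.0, 26.0, 1.75
beta_0 = 0.0                                         # hypothesized population slope

# Standard error of the slope: sqrt(MS_error / SSx),
# where SS_error = SSy - b^2 * SSx and MS_error = SS_error / (n - 2).
se_b = math.sqrt((ss_y - b ** 2 * ss_x) / ((n - 2) * ss_x))    # about 0.433

t_calc = (b - beta_0) / se_b                         # about 4.04
p_value = 2 * stats.t.sf(abs(t_calc), n - 2)         # two-tailed, about 0.15
print(se_b, t_calc, p_value)                         # note t_calc**2 equals the ANOVA F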

Slide 19.

After getting the equation for the best fit line and determining whether it's statistically significant or not, most correlation and regression analyses calculate two additional values.
The first of these is the correlation coefficient, represented by a lowercase r. The goal of this value is to describe the strength, and direction, of the relationship between the two variables.
r is calculated using the equation shown: it is the sum of the cross products divided by the square root of SSx times SSy.
r varies from −1 to +1, where −1 is a negative relationship with no noise and +1 is a positive relationship with no noise.
When data points don't fall exactly on the best fit line, the absolute value of r decreases towards 0.
From the five plots here we can get a sense of how r measures both the consistency and direction of the relationship.
On the far left is a positive relationship with very little noise, hence a value almost as large as one.
On the far right is a negative relationship with a little bit more noise and an r value close to −1, though not as close in magnitude to 1 as the previous example because of the additional noise.
The center plot shows that when the data values are essentially randomly placed, the r value will be close to zero.
Unlike our next value, the exact value of the correlation coefficient does not have a direct quantitative interpretation.
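In symbols, and applied to the earlier three-point example:

r = \frac{SP_{XY}}{\sqrt{SS_X \, SS_Y}} = \frac{14}{\sqrt{8 \times 26}} \approx 0.97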

Slide 20.

The coefficient of determination is represented by an uppercase R squared and this value describes the strength, but not the direction, of the relationship between the two variables.
The coefficient of determination can be calculated in two different ways when we have linear data. The first is as the sum of squares regression divided by the sum of squares total. The second is as the square of the correlation coefficient.
R-squared varies from zero to positive one, where zero indicates no relationship and one represents either a positive or negative relationship with no noise.
When data points don't fall exactly on the best fit line, the value of r-squared decreases towards 0.
Using the same five plots as before we can get a sense of how r-squared measures the consistency, but not the direction, of the relationship.
On the far left is a positive relationship with very little noise, hence an r-squared value almost as large as one.
On the far right is a negative relationship with a little bit more noise. The r-squared value is still positive, but its magnitude is not as close to 1 because of the additional noise.
The center plot shows that when the data values are essentially randomly placed, the r-squared value will be close to zero.
The coefficient of determination has a direct quantitative interpretation. R-squared represents the proportion of variance in Y explained by variance in X.
Note that we did not say "caused" because we don't know if causation is warranted from just the data values.
We can use the term "explain" because this value is measuring how precisely we can predict or explain the likely value of one of our variables with the other.
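In symbols, and again using the three-point example, where SS_regression works out to 24.5 and SS_total to 26:

R^2 = \frac{SS_{regression}}{SS_{total}} = r^2 = \frac{24.5}{26} \approx 0.94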

Slide 21.

There exists a fairly easy T test for the significance of the value of the correlation coefficient.
The null hypothesis is that the correlation coefficient in the population is equal to zero, and the alternative hypothesis is that it is not equal to zero. Note the use of the Greek letter rho instead of the English letter r because we're talking about the population.
We can calculate a T value for the correlation coefficient by the equation shown. T equals r times the square root of the sample size minus 2, divided by the square root of one minus the correlation coefficient squared.
The degrees of freedom for the t-test is the number of data points minus 2.
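In symbols:

t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}, \qquad df = n - 2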
Looking at our five sets of data points, we can see the T values and P values each of them would generate.
No surprise, the data on the left generates a highly significant correlation coefficient.
The data on the far-right also generates a highly significant correlation coefficient.
In fact, even the data in the 4th plot has a correlation coefficient that is statistically significant in terms of being different from zero.
However, the second and third plots have non-significant correlation coefficients, as we might expect from the lack of a clear visible linear trend in the data.

Slide 22.

A final procedure that is occasionally done in correlation and regression analyses is to calculate confidence intervals for our best fit line equation and prediction or inclusion intervals for our data.
These intervals use the T distribution as shown in the two equations and generate a different interval for each x value in our data set.
In the figure the blue lines show the 95% confidence interval for our best fit line.
They are more flared out on the ends because varying the slope within its confidence interval changes the angle of the line, which moves the line more at the far left and right and not at all in the center, since we know the best fit line must go through the mean X and Y values.
The width of the confidence interval in the center essentially represents the confidence interval of our Y intercept.
In the figure the red lines show the 95% inclusion interval, also called the prediction interval, the region where we predict 95% of the population data values will fall.
Published analyses commonly show the confidence interval for the best fit line, but rarely show the prediction intervals because we're usually more interested in the overall relationship than in exactly where the individual population values are.
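The two equations referred to on the slide aren't reproduced in this transcript; the standard textbook forms, which match the description above, are a confidence interval for the best fit line of

\hat{y} \pm t_{\alpha/2,\,n-2} \sqrt{MS_{error} \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_X} \right)}

and a prediction (inclusion) interval for individual values of

\hat{y} \pm t_{\alpha/2,\,n-2} \sqrt{MS_{error} \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_X} \right)}

The (x − x̄)²/SS_X term is what makes both intervals narrowest at the mean X value and flared at the ends.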

Slide 23.

After all the math is said and done, we have to interpret the results of our analysis.
Significant slopes in a regression analysis can imply causation, but the causation may be indirect. Just because we have measured X and see a change in Y, that doesn't mean there aren't intermediate steps that may be important between X & Y.
Significant slopes in a correlation analysis may imply causation, but other factors may be driving the pattern, so we need to interpret our results with caution.
We know that correlation doesn't automatically imply causation, but that doesn't mean it's useless. A statistically significant relationship between two measured values indicates that something non-random is going on to cause them to vary together; we are just unsure of the exact reason why.
A significant correlation between two factors is often the first step in further studies that will be able to uncover the causation.
Also, the lack of a correlation between two factors is a powerful argument against a proposed causation.
If someone claims that X causes Y and then we gather a lot of data and see that there is no relationship between X and Y, that's strong evidence against the initial claim.

Slide 24.

There are two other important things to always keep in mind when interpreting the results.
► First, a significant slope implies a non-random relationship, but is that relationship relevant and important or might that relationship be trivial and unimportant?
► For example, in the box is a citation for a paper in which one of the main results was that men have a 10% higher overall risk of cancer for each 10 centimeters (4 inches) of height they have above 175 cm (5 feet 9 inches).
This is a significant non-random pattern, but is it important or relevant? Is a 10% elevated risk in something that is fairly rare, and caused by something we can't do much about, really something we should worry about?
► Second, we need to keep in mind that our results only hold for the range of X-values studied. We should never try to extrapolate beyond the range of our data.
► For example, the figure shows the mean age of women in the United States when they had their first child on the Y-axis plotted against year on the X-axis.
If we had been researching this in 1985 and just used the data from 1980 to 1985 to make predictions about the future, our best fit line would cause us to predict that the average age of women when they have their first child in 2015 would be almost 30.
This is much larger than the true value of just over 26. This inaccuracy is because the pattern in the data changed after 1985 to no longer exhibit the linear pattern seen earlier.
If we use a much larger data set, all the values from 1980 to 2000, we get a more accurate prediction that is closer to the true value of just over 26.
If we are trying to make predictions outside the range we have examined, we have no way of knowing whether we would get an accurate prediction such as in the second data set or a highly inaccurate prediction such as from the first data set.
And this isn't just a sample size issue, it's a problem that can arise for any sample size because we never know what will happen for ranges of values we haven't examined.

Slide 25.

In summary, the goal of a regression or correlation analysis is to identify a non-random relationship between X and Y.
First, we check the assumptions of our mathematical model to make sure the data exhibits a linear pattern, all the data points are independent of one another, and that the data exhibits the patterns of normality and variance that we require.
Second, we estimate the equation of our best fit line, the correlation coefficient, and the coefficient of determination using equations based on sums of squares and cross products.
Third, we test for the statistical significance of the slope using the ANOVA method or a t test procedure.
Fourth, we can calculate confidence intervals to define a region where our best fit line probably lies and intervals that predict where the population data values are.
Fifth, we interpret our results. What exactly does a significant relationship or lack of a relationship between our two factors represent?
The math is the same for both types of analysis, but our conclusions differ.
Have we controlled all other factors and are able to demonstrate causality with a regression analysis, or are there uncontrolled factors and we are doing a correlation analysis which can be suggestive, but not definitive, in terms of causation?
Have we thought about whether the slope is relevant in addition to whether it's significant?
Are we making sure to confine any conclusions that we make to the range of x-values studied?

Zoom out.

The correlation and regression techniques can be a powerful tool for understanding the natural world.
It is often fairly straightforward to gather the kinds of data we need to perform a correlation analysis.
Although this doesn't allow us to prove causation, it is often a very useful first step in understanding what's going on.
At the very least it does allow us to make predictions for the Y values we expect to see if we know the X values.
And if we are then able to control other factors by performing an experiment, and thereby do a regression analysis, we can really identify causal mechanisms.
And identifying causal mechanisms is essentially what science is all about.
A copy of this summary slide is available on the StatsExamples website, along with links to a variety of completely free resources and more videos.
The video description and the website also have links to the ANOVA and T-test videos and citations for the two studies mentioned.

End screen.

Subscribe, like, or share if you thought this was useful.





This information is intended for the greater good; please use statistics responsibly.