# CORRELATION & REGRESSION (EXAMPLES)

*concepts*

*procedure*

EXAMPLES

## Watch this space

Detailed text explanation coming soon. In the meantime, enjoy our video. The text below is a transcript of the video.

### LINK TO SUMMARY SLIDE FOR VIDEO:

### StatsExamples-correlation-regression-examples.pdf

### TRANSCRIPT OF VIDEO:

**Slide 1.**

The correlation and regression techniques are a great way to see if two variables are associated with one another.

This video walks through how to do a linear regression and correlation analysis step by step with two examples. How to interpret the results is also discussed.

StatsExamples has two other videos which are more conceptual and go into more detail about what each of the values we will calculate represents and what they're for. If you haven't seen those, or aren't already familiar with this type of analysis, I suggest watching them first.

In any case, before doing the math, we'll do a quick review.

**Slide 2.**

The goal of a regression or correlation analysis is to identify any non-random relationship between the X values and the Y values.

The X-axis is for values of the independent or explanatory variable and the Y-axis represents values of the dependent or response variable.

There are five steps to our analysis.

► First, we check the data to make sure it meets the assumptions of our technique.

► Second, we use our formula to estimate the parameters of the relationship in the population based on our sample data. We estimate an equation for the best-fit line and usually calculate the correlation coefficient and coefficient of determination.

► Then we test for statistical significance of the slope and correlation coefficient using ANOVA and t-test techniques.

► If we wish we can calculate confidence intervals for the slope and equation of the line.

► Lastly, we interpret our results using a combination of the mathematical results and our understanding of the nature of the data.

**Slide 3.**

Before we jump straight into an analysis, we need to consider the data and whether it meets the assumptions of our linear regression methods.

► There are five assumptions: The data points have to be independent, the independent variable values must be fixed, the data must show a linear pattern, the distribution of residuals must exhibit a normal distribution all the way along the X-axis, and the variance in the Y-direction must be constant along the range of X-values.

Only if all five of these things are true should we do a regression or correlation analysis.

The first two come from our knowledge of where the data comes from and the other three we determine by looking at the data values themselves.

► If the assumptions are violated we have two options.

We can transform the data values into ones that meet our assumptions or we can use a different method, like nonlinear regression, which doesn't require the same assumptions.

**Slide 4.**

When we finish all the math we need to be careful about interpreting our results.

► Significant slopes in a regression analysis can imply causation, but the causation may be indirect. There may be intermediate steps we're not aware of between our independent variable and the dependent one.

► Significant slopes in a correlation analysis may imply causation, but other factors may be driving the pattern. A significant correlation supports a causal argument, but can't demonstrate it all by itself; a more controlled regression-style study is needed for that.

► Correlation does not prove causation, but it's still useful.

The best-fit line can be used to make predictions. Even if we don't know why things are happening, we can predict what will happen.

Significant correlations are a first step to finding causation. Since correlation studies are usually relatively easy, they make for good exploratory and speculative studies.

A lack of a correlation is also strong evidence against causation. If someone proposes a causal process, but there is no correlation between what they claim the independent variable values are and the observed dependent variable values, their claim is greatly weakened.

► A significant slope implies a non-random relationship, but is it relevant or trivial? Just because something is statistically significant, that doesn't mean it's important.

► Finally, we should always keep in mind that the results only hold for the range of X values studied. If our data doesn't include some range of the independent variable, we should never extrapolate an observed pattern because maybe things are different elsewhere.

**Slide 5.**

So far so good, but what's our procedure?

► First we're going to calculate some sums of squares values, sum of squares X, sum of squares Y, and the sum of the cross products of X and Y. These sums will be the basis for our subsequent calculations.

► Then we use our sums to estimate the slope and Y-intercept for our best-fit line, the line that most closely follows the pattern of the data.

► Then we use our best-fit line to estimate the sums of squares regression and the sums of squares error terms.

► Next we use the SS regression and SS error terms to perform ANOVA and t-test analyses for the significance of the slope. This allows us to decide whether any patterns we think we see are statistically significant or just the results of random noise.

► We usually calculate the correlation coefficient and coefficient of determination.

► Sometimes we perform a significance test for the correlation coefficient, but honestly that's pretty rare. It's the significance of the slope we usually really care about.

► Sometimes we calculate confidence and inclusion intervals for the line and slope. Confidence intervals for the slope are common, but estimation of the inclusion intervals is fairly rare.
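The steps above can be sketched as a single Python function. This is our own sketch, not the video's code; the function name and structure are ours, and it uses only the formulas described in this video (p-values still require a table or a distribution function):

```python
import math

def simple_linear_regression(xs, ys):
    """Sketch of the procedure: sums of squares, best-fit line,
    SS partition, F and t statistics, and the correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)            # sum of squares X
    ss_y = sum((y - mean_y) ** 2 for y in ys)            # sum of squares Y
    sp_xy = sum((x - mean_x) * (y - mean_y)              # sum of cross products
                for x, y in zip(xs, ys))
    slope = sp_xy / ss_x
    intercept = mean_y - slope * mean_x      # line passes through (mean X, mean Y)
    ss_reg = sp_xy ** 2 / ss_x               # sum of squares regression
    ss_err = ss_y - ss_reg                   # sum of squares error
    ms_err = ss_err / (n - 2)                # df error = n - 2; df regression = 1
    if ms_err > 0:
        f_stat = ss_reg / ms_err             # MS regression = SS regression / 1
        se_slope = math.sqrt(ms_err / ss_x)  # standard error of the slope
        t_stat = slope / se_slope
    else:                                    # perfect fit: no residual variance
        f_stat = t_stat = float("inf")
        se_slope = 0.0
    r = sp_xy / math.sqrt(ss_x * ss_y)       # correlation coefficient
    return {"slope": slope, "intercept": intercept, "F": f_stat,
            "t": t_stat, "se_slope": se_slope, "r": r, "r2": r * r}
```

As a sanity check, a set of points lying exactly on a line (say Y = 2X + 1) returns slope 2, intercept 1, and r = 1.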

**Slide 6.**

OK, let's look at our first example. Imagine that we have a set of the first and second exam scores for 8 students in a class. Their scores are shown in the columns to the left. We're leaving a lot of space because we will want to calculate 10 more columns during our analysis.

► First we'll calculate the sum of the X and Y values and their mean values.

► The next step is to calculate a column with each X value minus the mean X value.

63 minus 75 gives negative 12, 67 minus 75 gives negative 8, all the way down to 87 minus 75 gives 12 for the last value.

The sum of all the values in this column will be zero.

► Then a column with each Y value minus the mean Y value.

74 minus 77 gives negative 3, 75 minus 77 gives negative 2, all the way down to 82 minus 77 gives 5 for the last value.

The sum of all the values in this column will be zero.

► Now we want a column for the X minus X mean values, squared.

Negative 12 squared gives 144, negative 8 squared gives 64 all the way down to 12 squared gives 144 for the last value.

The sum of all these values, 496, is the sum of squares X, denoted as SSx.

► Now we want a column for the Y minus Y mean values, squared.

Negative 3 squared gives 9, negative 2 squared gives 4 all the way to 5 squared gives 25 for the last value.

The sum of all these values, 102, is the sum of squares Y, denoted as SSy.

► The next column is slightly more complicated. We take each value from the X minus X mean column and multiply it by the Y minus Y mean value to get something called the cross-product.

Negative 12 times negative 3 gives 36, negative 8 times negative 2 gives 16 all the way to 12 times 5 gives 60 for the last value.

The sum of all these values, 171, is the sum of the cross products, denoted as SPxy.

SHIFT to NEXT SLIDE

Let's take a look at our three values and use them.

► The slope of the best-fit line is the sum of the cross-products divided by the sum of squares X which gives us 171 divided by 496 to give 0.345 for the slope.

► Now we get the Y-intercept by knowing that the best-fit line must pass through the point of mean X and mean Y. Solving this for the Y-intercept tells us that the Y-intercept will be the mean Y value minus the slope times the mean X value.

► This is 77 minus 0.345 times 75 to give us 51.143 (the calculation keeps the unrounded slope, 0.34476, which is why this differs slightly from using the rounded 0.345).

► The equation for our best-fit line is therefore: Y equals 51.143 plus 0.345 times X.
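These two steps can be checked with a few lines of Python, using only the summary values from this slide (variable names are ours):

```python
# Summary values from the exam example (n = 8)
sp_xy = 171.0                 # sum of cross products, SPxy
ss_x = 496.0                  # sum of squares X, SSx
mean_x, mean_y = 75.0, 77.0   # mean exam 1 and exam 2 scores

slope = sp_xy / ss_x                  # 171 / 496
intercept = mean_y - slope * mean_x   # uses the unrounded slope

print(round(slope, 3))      # 0.345
print(round(intercept, 3))  # 51.143
```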

► Now we'll go back to our data and use this best-fit line to calculate the rest of the values we need.

BACK to DATA SLIDE

► Our first column using the equation for the best-fit line comes from using the equation for the line to get predicted Y values. The little hat over the Y symbol indicates that these are the predicted values.

The first value is 51.143 plus 0.345 times 63 to get 72.86.

Then 51.143 plus 0.345 times 67 to get 74.24.

All the way down to 51.143 plus 0.345 times 87 to get 81.14.

► Now we want a column for how these predicted values compare to the overall mean Y value, predicted Y minus mean Y.

72.86 minus 77 gives negative 4.14.

74.24 minus 77 gives negative 2.76.

All the way down to 81.14 minus 77 gives positive 4.14 for the last value.

The sum of all the values in this column will be zero.

► Now we want a column that squares the values we just calculated because that will give us the sum of squares regression.

Negative 4.14 squared is 17.12.

Negative 2.76 squared is 7.61.

All the way down to positive 4.14 squared to give 17.12 for the last value.

The sum of all the values in this column is the sum of squares regression which is 58.954.

Our last two columns will compare the actual values to the predicted ones.

► First a column for each observed Y value minus the predicted Y value. These are the residuals.

74 minus 72.86 gives 1.14.

75 minus 74.24 gives 0.76.

All the way down to 82 minus 81.14 gives 0.86 for the last value.

The sum of all the values in this column will be zero.

► Now we square these values so we can get the sum of squares error.

1.14 squared is 1.29.

0.76 squared is 0.57.

All the way down to 0.86 squared to give 0.74 for the last value.

The sum of all the values in this column is the sum of squares error which is 43.046.

At this point we should do a little check to make sure that the SS regression and SS error add up to the total SSy.

58.954 plus 43.046 should equal 102, which they do.
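This check also follows from a handy identity: since each predicted deviation is the slope times (X minus mean X), the SS regression works out to SPxy squared divided by SSx. A quick Python sketch with this example's sums:

```python
ss_x, ss_y, sp_xy = 496.0, 102.0, 171.0   # sums from the data table

ss_reg = sp_xy ** 2 / ss_x   # SS regression = SPxy^2 / SSx
ss_err = ss_y - ss_reg       # SS error is the remainder of SSy

print(round(ss_reg, 3))  # 58.954
print(round(ss_err, 3))  # 43.046
assert abs((ss_reg + ss_err) - ss_y) < 1e-9   # partition adds back to SSy
```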

**Slide 7.**

Here we have our values so far.

► We can quickly calculate the correlation coefficient using the equation shown and we get a value of 0.760 which indicates a strong positive relationship between these two values.

► We can quickly calculate the coefficient of determination by dividing the SS regression by the SS Y to get 0.578. This is also equal to the 0.760 squared.

Both of these values are suggestive of a strong relationship between the exam 1 and exam 2 scores.
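Both coefficients can be computed directly from the three sums (a Python sketch):

```python
import math

ss_x, ss_y, sp_xy = 496.0, 102.0, 171.0   # sums from the data table

r = sp_xy / math.sqrt(ss_x * ss_y)   # correlation coefficient
r2 = r ** 2                          # coefficient of determination (= SSreg / SSy)

print(round(r, 3))   # 0.76
print(round(r2, 3))  # 0.578
```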

**Slide 8.**

Here is a plot of the data with a smaller plot of the residuals. As we guessed from the r and R squared values, the pattern looks fairly clear and consistent.

Despite how it looks, we need to conduct some statistical tests to be sure it's not just random chance making a figure that has the illusion of a pattern.

**Slide 9.**

For our ANOVA analysis we will use the null hypothesis of slope equals zero and the alternative that the slope differs from zero.

We will be doing an F test using the mean square regression and mean square error.

► Mean square regression is the sum of squares regression divided by the degrees of freedom value for a regression, which is always one. This gives us 58.954 divided by one to get 58.954.

► Mean square error is the sum of squares error divided by the degrees of freedom value for the error, which is the number of data points minus two. This gives us 43.046 divided by eight minus two in the denominator to get 7.174.

► F is therefore 58.954 divided by 7.174 which is 8.217.

► Looking at our tables of critical F values from the StatsExamples website we see that this value is larger than the critical value for p equals 0.05, but not as large as the critical value for p equals 0.025.

► We can therefore say that: "The slope of the best-fit line for exam 1 scores vs exam 2 scores is significantly larger than zero (0.025 < p < 0.05)".

This makes sense because we would expect that the first and second exam scores for students would be correlated with each other.
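The F statistic itself is quick to reproduce (a sketch; the comparison still needs an F table, where the critical values for 1 and 6 degrees of freedom are roughly 5.99 at p = 0.05 and 8.81 at p = 0.025, which is why 8.217 lands between them):

```python
ss_reg, ss_err, n = 58.954, 43.046, 8   # from the data-table calculations

ms_reg = ss_reg / 1         # df regression is always 1
ms_err = ss_err / (n - 2)   # df error = n - 2 = 6
f_stat = ms_reg / ms_err

print(round(ms_err, 3))  # 7.174
print(round(f_stat, 3))  # 8.217
```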

**Slide 10.**

For our t-test analysis we will use the null hypothesis of slope equals zero and the alternative that the slope differs from zero. Unlike with the ANOVA, however, in this case we could compare to a hypothesized slope other than zero if we wanted to.

We will be doing a t-test using the equation shown - with the difference between observed slope and hypothesized one in the numerator and the standard error of the slope in the denominator.

The degrees of freedom for our t-test is the number of values minus two, which is 8 minus 2, giving 6.

► The next step is to calculate the standard error of the slope. We do that using the equation shown and plug in our SS y, slope, SS x, and sample size values.

This gives the big ugly equation there which ends up being 0.1203.

► The t calculated value becomes 0.345 divided by 0.1203 which is 2.867.

► Looking at our tables of critical t values from the StatsExamples website we see that this value is larger than the critical value for alpha equals 0.02, but not as large as the critical value for alpha equals 0.01. This is a two-tailed test so we have to double these values from our tables.

► We can therefore say that: "The slope of the best-fit line for exam 1 scores vs exam 2 scores is significantly larger than zero (0.02 < p < 0.04)".

It's not a surprise that our t-test conclusions closely match our ANOVA test conclusions.

► Note also that if we use this standard error to approximate the 95% confidence interval by using two standard errors above and below the estimated slope we get a range from positive 0.105 to positive 0.595.

This region doesn't include a slope value of zero which implies that the population slope is very unlikely to be zero, just as the results of our two tests have concluded.
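The standard error, t value, and rough confidence interval can be reproduced from the sums (a sketch; the endpoints computed this way, about 0.104 and 0.585, are close to the slide's 0.105 and 0.595, with small differences from rounding along the way):

```python
import math

slope = 171.0 / 496.0           # unrounded slope, SPxy / SSx
ss_x, ss_err, n = 496.0, 43.046, 8

ms_err = ss_err / (n - 2)             # mean square error, df = 6
se_slope = math.sqrt(ms_err / ss_x)   # standard error of the slope
t_stat = slope / se_slope

print(round(se_slope, 4))  # 0.1203
print(round(t_stat, 3))    # 2.867

# rough 95% CI: slope plus or minus two standard errors
lo, hi = slope - 2 * se_slope, slope + 2 * se_slope
assert lo > 0   # the interval excludes zero
```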

**Slide 11.**

We can do a quick test of the significance of the correlation coefficient, comparing it to zero.

We use the equation shown and plug in our correlation coefficient in the two locations and we will get 2.867.

► It turns out that for linear regressions we will get the same t value as we did for our slope analysis and since the degrees of freedom are also the same we get the same result.

► "The correlation coefficient for the relationship between exam 1 scores and exam 2 scores is significantly larger than zero (0.02 < p < 0.04)"

**Slide 12.**

OK, let's look at another example. Imagine that we have a set of ID numbers and exam scores for 7 students in a class as shown.

► First we'll calculate the sum of the X and Y values and their mean values.

► The next step is to create our X minus mean X column.

51 minus 47 gives 4, 67 minus 47 gives 20 all the way down to 85 minus 47 gives that last value of 38.

The sum of all the values will be zero.

► Then our column of Y value minus the mean Y values.

74 minus 76 gives negative 2, 72 minus 76 gives negative 4 all the way to 79 minus 76 gives 3 for the last value.

The sum of all the values in this column will also be zero.

► Now we want a column for the X minus X mean values, squared.

4 squared gives 16, 20 squared gives 400 continuing all the way down to 38 squared gives 1444 for the last value.

The sum of all these values, 4824, is the SSx.

► Now we want a column for the Y minus Y mean values, squared.

Negative 2 squared gives 4, negative 4 squared gives 16 all the way down to 3 squared gives 9 for the last value.

The sum of all these values, 54, is the SSy.

► The next column is our cross product values, the values from the X minus X mean column multiplied by the Y minus Y mean values.

4 times negative 2 gives negative 8, 20 times negative 4 gives negative 80 all the way down to 38 times 3 to give 114 for the last value.

The sum of all these values, 90, is the SPxy.

SHIFT to NEXT SLIDE

Now let's use these values to get the equation for the best-fit line.

► The slope is the sum of the cross-products divided by the sum of squares X which is 90 divided by 4824 to give 0.019.

► We use our equation where we solved for the Y-intercept with the line going through the mean X and mean Y values.

► This is 76 minus 0.019 times 47 to get 75.123 (again keeping the unrounded slope, 0.018657, for the calculation).

The equation for our best-fit line is therefore: Y equals 75.123 plus 0.019 times X.

Now we'll use this best-fit line to calculate the rest of the values we need.

BACK to DATA SLIDE

► Our next column is the predicted Y values using the equation for the best-fit line.

The first value is 75.123 plus 0.019 times 51 to get 76.07.

Then 75.123 plus 0.019 times 67 to get 76.37.

All the way down to 75.123 plus 0.019 times 85 to get 76.71.

► Now our column for how these predicted values compare to the overall mean Y value, the predicted Y values minus the mean Y.

76.07 minus 76 gives 0.07.

76.37 minus 76 gives 0.37.

All the way down to 76.71 minus 76 gives 0.71 for the last value.

The sum of all the values in this column will be zero.

► Now a column that squares the values we just calculated to give us the SS regression.

0.07 squared is 0.01.

0.37 squared is 0.14.

All the way down to 0.71 squared to give 0.50 for the last value.

The sum of all the values in this column is the sum of squares regression which is 1.679.

Our last two columns compare the actual values to the predicted ones.

► First a column with each observed Y value minus the predicted Y value, the residuals.

74 minus 76.07 gives negative 2.07.

72 minus 76.37 gives negative 4.37.

All the way down to 79 minus 76.71 gives 2.29 for the last value.

The sum of all the values in this column is always zero.

► Now we square these values so we can get the sum of squares error.

Negative 2.07 squared is 4.30.

Negative 4.37 squared is 19.12.

All the way down to 2.29 squared to give 5.25 for the last value.

The sum of all the values in this column is the SS error which is 52.321.

At this point we confirm that the SS regression and SS error values add up to the total SSY.

1.679 plus 52.321 should equal 54 which they do.
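All of these values can be reproduced from the three sums alone (a Python sketch with this example's summary values):

```python
ss_x, ss_y, sp_xy = 4824.0, 54.0, 90.0   # SSx, SSy, SPxy from the data table
mean_x, mean_y = 47.0, 76.0

slope = sp_xy / ss_x
intercept = mean_y - slope * mean_x   # uses the unrounded slope
ss_reg = sp_xy ** 2 / ss_x            # SS regression = SPxy^2 / SSx
ss_err = ss_y - ss_reg                # SS error is the remainder of SSy

print(round(slope, 3))      # 0.019
print(round(intercept, 3))  # 75.123
print(round(ss_reg, 3))     # 1.679
print(round(ss_err, 3))     # 52.321
```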

**Slide 13.**

These are our values so far.

► Calculating the correlation coefficient gives us a value of 0.176 which suggests a weak positive relationship between these two values, but with a value this small and with so few values, the relationship may not be statistically significant.

► The coefficient of determination is the SS regression divided by the SS Y which is 0.031. This is also equal to the 0.176 squared.

Both of these values are fairly small, especially the second one, suggesting little support for a clear relationship between the ID numbers and exam scores.

**Slide 14.**

Here is a plot of the data with a smaller plot of the residuals. As we guessed from the r and R squared values, the pattern looks weak and unlikely to be consistent.

Despite how it looks, we need to conduct some statistical tests to be sure. What looks like a lack of pattern may be more than chance alone could create and maybe there is something non-random going on.

**Slide 15.**

For our ANOVA analysis we will use the null hypothesis of slope equals zero and the alternative that the slope differs from zero.

We will be doing an F test using the mean square regression and mean square error.

► Mean square regression is the SS regression divided by the degrees of freedom value for a regression, which is always one. This gives us 1.679 divided by one to get 1.679.

► Mean square error is the SS error divided by the degrees of freedom value for the error, which is the number of data points minus two. This gives us 52.321 divided by seven minus two in the denominator to get 10.464.

► F is therefore 1.679 divided by 10.464 which is 0.160.

► Looking at our tables of critical F values from the StatsExamples website we see that this value is nowhere close to the critical values for p equals 0.05 or 0.025.

► We can therefore say that: "The slope of the best-fit line for ID numbers vs exam scores is not significantly different from zero (0.05 < p)".

This makes sense since we wouldn't have expected ID numbers to be related to exam scores anyway.

**Slide 16.**

For our t-test analysis we will use the null hypothesis of slope equals zero and the alternative that the slope differs from zero.

We will be doing a t-test using the equation shown.

The degrees of freedom for our t-test is the number of values minus two which is 7 minus 2 equals 5.

► Then we calculate the standard error of the slope using the equation shown which gives us the second equation which ends up being 0.0466.

► The t calculated value becomes 0.019 divided by 0.0466 which is 0.401.

► Looking at our tables of critical t values from the StatsExamples website we see that this value is much smaller than the critical value for alpha equals 0.1, which we double since this is a two-tailed test so we know that p is larger than 0.2.

► We can therefore say that: "The slope of the best-fit line for ID numbers vs exam scores is not significantly different from zero (0.2 < p)".

It's not a surprise that our t-test conclusions closely match our ANOVA test conclusions.

► Note also that if we use this standard error to approximate the 95% confidence interval by using two standard errors above and below the estimated slope we get a range from negative 0.075 to positive 0.113.

This region includes a slope value of zero which means that the population slope may well be zero, hence the reason why our two tests have not rejected that hypothesis.
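The same sketch approach reproduces this example's test statistics and the rough confidence interval (the endpoints computed here, about -0.074 and 0.112, are close to the slide's -0.075 and 0.113, with small differences from rounding):

```python
import math

ss_x, ss_y, sp_xy, n = 4824.0, 54.0, 90.0, 7

slope = sp_xy / ss_x
ss_reg = sp_xy ** 2 / ss_x
ss_err = ss_y - ss_reg
ms_err = ss_err / (n - 2)             # df error = 5
f_stat = ss_reg / ms_err              # MS regression = SS regression / 1
se_slope = math.sqrt(ms_err / ss_x)   # standard error of the slope
t_stat = slope / se_slope

print(round(f_stat, 2))    # 0.16
print(round(se_slope, 4))  # 0.0466
print(round(t_stat, 3))    # 0.401

# rough 95% CI: slope plus or minus two standard errors
lo, hi = slope - 2 * se_slope, slope + 2 * se_slope
assert lo < 0 < hi   # the interval includes zero
```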

**Slide 17.**

Our quick test of the significance of the correlation coefficient will show a similar result.

We use the equation shown and plug in our correlation coefficient into the two locations and get 0.401, same as the t calc for the significance of the slope.

► Referencing our t table, "The correlation coefficient for the relationship between ID numbers and exam scores is not significantly different from zero (0.2 < p)"

**Slide 18.**

Now that we've done a couple of examples with simplified data sets, let's look at some real data.

It just so happens that as I'm making this video I'm teaching a statistics course and have data for the first and second exams of a bunch of my students.

► This figure shows the first exam score on the X-axis and the second exam score on the Y-axis for the 126 students who took both exams. The best-fit line is the red dashed line through the data.

We can certainly see what looks like a positive relationship - students who did well on the first exam did well on the second and vice versa.

We also see some noise, the relationship isn't perfect, there are certainly some students who did very differently on the exams.

But maybe this is just an illusion? Humans are very prone to finding patterns even when there aren't any.

► Here are the estimated parameters and the test statistics. The p values for both the ANOVA and the t-test approach are extremely small, there's a less than one in a ten trillion chance that the pattern would be that consistent just by random chance.

We can therefore very confidently say that: "The slope of the best-fit line for exam 1 scores vs exam 2 scores is significantly larger than zero (p < 10⁻¹³)"

Note that our 95% confidence interval for the slope does not include the value of zero, it's not even close.

Obviously, this doesn't prove that the score on exam 1 caused the score on exam 2. There is some other factor, academic ability for example, that is causing both of these values.

This study is clearly a correlation study because we didn't control all the non-exam factors or specify treatment groups with certain values on the X-axis.

If this data had been the number of minutes each student was allowed to take the exam on the X-axis with their scores on the Y-axis it would be valid to consider it to be a regression study. We still haven't controlled all the other factors, but the argument for causality would be much stronger.

**Slide 19.**

Now let's look at more real data from the same class. This time let's think about the last two numbers of the students' ID numbers and their exam scores.

Obviously, we're not going to expect to see a significant pattern for this data.

► This figure shows the two digits from the student ID numbers on the X-axis and the first exam score on the Y-axis for the 127 students who took the first exam. The best-fit line is the red dashed line through the data.

Now there doesn't look to be much of a relationship at all. Maybe there's a slight negative one, but what we see is clearly mostly noise.

► Here are the estimated parameters and the test statistics. The p values for both the ANOVA and the t-test approach are not small, a pattern like the one seen is easily explained by random chance.

► We would therefore say that: "The slope of the best-fit line for ID numbers vs exam 1 scores is not significantly different from zero (0.2 < p)"

This time our 95% confidence interval for the slope completely surrounds zero.

This doesn't prove that there's no relationship between the last two digits of the student ID numbers and their exam scores, but it provides no support for anyone making that claim.

In fact, this lack of a pattern when we have a nice big data set would be very good evidence against someone making that claim.

**Slide 20.**

Finally, I find this chart useful when learning the correlation and regression techniques.

While the math for correlation and regression is the same, the conclusions and interpretations we can make differ.

Let's think about the two main outcomes, when our tests return significant results and when they return nonsignificant results.

By the way, we use the term nonsignificant instead of insignificant to avoid confusion about whether we're talking about test results or real-world relevance.

First, when our tests return significant results.

► For both types of studies this would be when the p values are less than 0.05 and our 95% confidence interval for the slope would not include zero.

The identification of a nonrandom pattern also means that we can predict values of each variable from the other with some degree of accuracy.

We should keep in mind however that while these results demonstrate a real relationship between the two variables, it might be trivial or unimportant in the real world.

► What differs is what we can say about causation.

If the analysis is a correlation one then a causation process is hinted at, but we can't make any definite conclusions about the nature of the causation.

If the analysis is a regression one however then we can make a definite conclusion about the nature of the causation, the independent variable is causing the dependent one. This process may be indirect of course, but we've learned something concrete about the process.

► But what about when our tests return nonsignificant results.

► For both types of studies this would be when the p values are larger than 0.05 and our 95% confidence interval for the slope would include zero.

The lack of a nonrandom pattern means that we can't predict values of each variable from the other. There's no evidence that these two variables have anything to do with one another.

► There is also a slight difference in what we can say about causation.

If the analysis is a correlation one, then any claim for causation is weakened. But since the study is a correlation, this counterargument is not definitive.

If the analysis is a regression one however, then any claim for causation is severely weakened and this kind of study can essentially demonstrate that there is no causal link between the two factors.

**Zoom out.**

Correlation and regression are very widely used to search for relationships between measured factors. As we've seen, although the math can get a bit involved, the whole process is doable and the results fairly straightforward to interpret.

As always, a high-resolution PDF of this summary slide is available on the StatsExamples website.

**End screen.**

If you found this video useful, consider subscribing, liking and sharing to help others find it too.

# Connect with StatsExamples here

This information is intended for the greater good; please use statistics responsibly.