NORMAL PROBABILITY DISTRIBUTION (INTRODUCTION)
Connect with StatsExamples here
LINK TO SUMMARY SLIDE FOR VIDEO:
TRANSCRIPT OF VIDEO:
The normal distribution is the most important probability distribution in statistics. Let's take a look at where it comes from, what it's used for, and why it's so important.
The normal distribution is the most common probability distribution in statistics. It is based on the binomial distribution, which is the most fundamental probability distribution.
In fact, we can think of the normal distribution as the limit of the binomial distribution as the number of trials becomes large and the probability of success of each trial is 0.5.
If you don't remember much about the binomial distribution, this channel's playlist has a video all about the binomial distribution that you can take a look at.
The figure here illustrates the binomial distribution for 10 trials and the probability of success of 0.5.
You can see that as we increase the number of trials the shape of the binomial distribution changes slightly.
And as we increase the number of trials even more, the shape of the distribution changes even more.
And here we can see how the shape of the binomial distribution with lots of trials on the left can be represented by a smooth curve as shown on the right. Even though these two figures look different (one is a histogram and the other is a curved line), they represent the same thing: a distribution of probability values corresponding to values on the X axis.
As we begin to think about this normal probability distribution, it's important to remember exactly what is being represented.
On the left, the histogram is showing a particular probability corresponding to each number of observations on the X axis.
If we wanted to know the probability of some particular range or group of observations, seeing somewhere between 25 and 35 observations for example, we would add up those individual probabilities.
We would essentially be calculating the sum of the heights of that set of individual bars.
On the right, the curve is showing the probability corresponding to each location along the X axis.
If we wanted to know the probability of some particular range on the X-axis, the proportion of the total probability corresponding to between 25 and 35 for example, we would calculate the area underneath that curve.
Note that since this is a probability distribution, we know that the total area underneath the curve is equal to 1. The value we get for the area of a region therefore represents a proportion of the overall set of probabilities.
Looking at our histogram situation in more detail we would calculate the overall probability as the sum of all those individual probabilities for the numbers of observations we're interested in.
Looking at our probability curve situation in more detail, we would calculate the overall probability as the integral, the area underneath the curve, along the range of values on the X axis we're interested in.
To perform these integrals or calculate these areas we would need the equation for that curve that represents those probabilities.
The top equation to the right is the general equation for a normal distribution curve: F of X equals one over Sigma times the square root of 2π, multiplied by E to the exponent negative 1/2 times, in parentheses, X minus mu divided by Sigma, end parentheses, squared.
The bottom equation to the right is the equation we would get for a normal distribution corresponding to the binomial with 50 trials and a probability of success of 0.5. The mean would be 25, the variance would be 12.5, and the standard deviation would be 3.54.
That gives us F of X equals one over 3.54 times the square root of 2π, multiplied by E to the exponent negative 1/2 times, in parentheses, X minus 25 divided by 3.54, end parentheses, squared.
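As a rough numerical check (a sketch using only Python's standard library, not code from the video), we can compare the exact binomial probabilities for 50 trials and probability of success 0.5 with the heights of this normal curve:

```python
import math

def binomial_pmf(k, n=50, p=0.5):
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu=25.0, sigma=math.sqrt(12.5)):
    """Height of the normal curve with mean 25 and SD sqrt(12.5) ≈ 3.54."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# The two values agree closely at each point along the X axis.
for k in (20, 25, 30):
    print(k, round(binomial_pmf(k), 4), round(normal_pdf(k), 4))
```

At the mean itself, for instance, the exact binomial probability is about 0.1123 and the normal curve height is about 0.1128, which is why the smooth curve is such a good stand-in for the histogram.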
This equation causes the normal distribution to be centered around the mean with a width that depends on the standard deviation. And because the overall area must be equal to 1, if the standard deviation is larger, then the height of the normal distribution is lower.
The equation for the normal distribution can be simplified if the mean is equal to 0 and the variance and standard deviations are both equal to 1. Under these conditions we get a normal distribution called the standard normal distribution.
The standard normal distribution has the same shape as the normal distribution but you can see that the equation is much simpler, F of X equals one over the square root of 2π times E to the negative 1/2 X squared.
The standard normal distribution is centered around 0, and you can see that by the time it gets to negative 2 and positive 2 the individual probability values are quite small.
But keep in mind, we generally care much less about the individual probability values and much more about the area underneath the curve for certain regions along the X axis.
It's very unusual that you would ever want to calculate an individual probability value for a particular value on the X axis.
The standard normal distribution is useful because any set of values with a normal distribution of mean equals mu and standard deviation equals Sigma can be transformed into a new set of values that have the standard normal distribution by subtracting mu from every value and then dividing that result by Sigma.
When we do this we transform those values X in the original normally distributed data set into new values called Z values in the standard normal distribution.
Doing this transformation allows us to calculate probabilities for values in regions defined by a mean and standard deviation.
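A minimal sketch of that transformation in Python (the data values here are made up purely for illustration):

```python
import math

# Invented data set, just to show the standardization step.
values = [18, 23, 23, 28, 33]
mu = sum(values) / len(values)                                        # mean
sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))  # SD

# Subtract mu from every value, then divide by sigma.
z_values = [(x - mu) / sigma for x in values]

# The transformed values now have mean 0 and standard deviation 1.
print([round(z, 2) for z in z_values])  # → [-1.37, -0.39, -0.39, 0.59, 1.57]
```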
For example, if we wanted to know what proportion of the values in our original data set are in the center region bounded by 1 standard deviation above and below the mean, that corresponds to the center region of the standard normal distribution from Z equals negative one to positive one.
Instead of having to think about doing integrals and calculating areas for all the different possible normal distributions, we can transform all of those distributions of data into the standard normal distribution and calculate our areas with this to answer the questions we're interested in.
For example, for a data set with a mean of 23 and standard deviation of five we might be interested in how many values are between 22 and 27.
The values of 22 and 27 from the original data set can be transformed into the Z values of -0.2 and +0.8 using the equation Z equals X minus the mean divided by Sigma.
When we do this, our original question is equivalent to asking what proportion of values in the standard normal distribution are between the values of Z = -0.2 and Z = 0.8.
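A quick sketch of that calculation (the function name here is just for illustration):

```python
def z_score(x, mu, sigma):
    """Standardize a value: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma

# Mean 23, standard deviation 5, as in the example above.
print(z_score(22, 23, 5))  # → -0.2
print(z_score(27, 23, 5))  # → 0.8
```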
So to recap.
We have an equation for a normal distribution with mean of mu and standard deviation of Sigma, and we can calculate probabilities by calculating integrals for the area under the curve between two values.
If we transform these values into z values, then that procedure is equivalent to finding the area under the standard normal distribution curve between the z values corresponding to the original values.
So how do we calculate the area under the standard normal distribution curve? Unfortunately, it turns out that this integral cannot be solved exactly. It's not that it's hard; it's impossible to get an exact equation.
This is actually one of those secrets they don't tell you about in early calculus courses: many of the most interesting equations don't integrate easily. Polynomials and trig functions do, but this equation for the normal distribution does not.
However, there are more advanced mathematical techniques that can be used to create extremely accurate numerical approximations. Instead of us learning those techniques, what we will do is borrow tables of the results from those numerical approximations.
These tables of areas are called normal distribution tables or Z tables. The most common format for these tables is to indicate the area under the curve to the left of a particular Z value. We can then calculate areas for ranges by looking up values in the tables and subtracting.
Of course if you're doing this using a computer, it has all these tables in a database somewhere and it's referencing those values without showing you.
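As a sketch of what that software lookup amounts to, the standard normal "area to the left" can be written in terms of the error function, which Python's standard library provides (the function names here are illustrative, not from the video):

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area_between(z_low, z_high):
    """Area between two z values: difference of the two left-of areas."""
    return standard_normal_cdf(z_high) - standard_normal_cdf(z_low)

print(round(standard_normal_cdf(1.0), 4))  # → 0.8413
print(round(area_between(-1.0, 1.0), 4))   # → 0.6827
```

Note that the printed tables round each area to four decimal places before you subtract, so a table-based answer can differ from the direct calculation in the last digit.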
As mentioned, Z tables are tabulated areas that are usually for the area to the left of a given Z value.
That's not a guarantee however, sometimes this same information is presented in a different way. For this reason it's important to understand why this procedure works and makes sense, so that if you have to use tables with different formats you can still obtain the areas you want.
We can figure out the area between A & B in our original distribution by using this table to figure out the area between the Z value corresponding to A and the Z value corresponding to B.
If the table tells us the area to the left of the Z value corresponding to A is 0.3, and the area to the left of the Z value corresponding to B is 0.9, then the area between those Z values must be 0.6.
Let's look at a more specific example, what is the area between the Z scores of negative one and positive one?
We would go to our table of Z values and locate the area value in the table that corresponds to the Z score of 1.00.
In this table, which is the one provided on the StatsExamples website, the first 2 digits of the Z scores are in the column at the far left and the columns correspond to the 2nd decimal place as shown in the row at the top. The values in the table are the areas to the left of the Z score indicated.
When we look in the table and find the location for 1.00, the area listed there is 0.8413. When we look in the table and find the location for negative 1.00, the area listed there is 0.1587. Subtracting 0.1587 from 0.8413 gives us 0.6826 as the area under the standard normal distribution curve between the Z values of negative one and one.
This is equivalent to the proportion of a normally distributed data set, whatever its mean and standard deviation are, having 68.26% of its values in a region 1 standard deviation above and below the mean.
This calculation we just did leads to a rule of thumb that about 2/3 or 66% of the data in a normal distribution is within one standard deviation of the mean.
Let's look at another example, what is the area between the Z scores of negative 2 and positive 2?
Again we would go to the table and find the areas in it that correspond to a Z score of 2.00 and a Z score of negative 2.00. Those values are 0.9772 and 0.0228. 0.9772 minus 0.0228 equals 0.9544.
This calculation leads to the rule of thumb that 95% of the data in a normal distribution is within 2 standard deviations of the mean.
It's actually quite useful to remember these three rules of thumb for regions within the normal distribution.
About 66% of the data will be within one standard deviation of the mean.
About 95% of the data will be within 2 standard deviations of the mean.
And over 99% of the data will be within 3 standard deviations of the mean.
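These three rules of thumb can be checked numerically. A minimal sketch using only Python's standard library (for a region symmetric around the mean, the area simplifies to a single error-function call):

```python
import math

def coverage(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
# → 1 0.6827
# → 2 0.9545
# → 3 0.9973
```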
I've been talking about the normal distribution representing data values, which is often the case, but it also represents probabilities.
These same calculations apply to normally distributed probabilities.
For example, if we randomly choose a value from a normal distribution there is an approximately 66% probability it's within one standard deviation of the mean. If we randomly choose a value from a normal distribution there is an approximately 95% probability it's within 2 standard deviations of the mean. And if we randomly choose a value from a normal distribution there is over a 99% probability it's within 3 standard deviations of the mean.
OK, so what is the normal distribution used for? There are two main uses.
First, many populations exhibit a normal distribution. A normal distribution of values is a natural outcome of summing many independent factors. For example, multiple genetic effects that contribute to a trait, or sequential actions that lead to an outcome, will tend to generate normally distributed traits or outcomes.
Therefore, the normal distribution is useful for estimating frequencies and proportions in many populations from just knowing the mean and variance of the population or estimating the population mean and variance from the sample mean and sample variance.
If we wanted to know what range of heights 95% of people are included within, instead of trying to measure everybody and create a detailed distribution we could just calculate the mean and variance and use the normal distribution.
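A sketch of that shortcut (the mean and standard deviation here are invented for illustration): about 95% of a normal distribution lies within 1.96 standard deviations of the mean, so the range follows directly from two numbers.

```python
# Hypothetical height data: mean 170 cm, standard deviation 8 cm.
mu, sigma = 170.0, 8.0

# Central 95% of a normal distribution: mean ± 1.96 standard deviations.
low, high = mu - 1.96 * sigma, mu + 1.96 * sigma
print(round(low, 1), round(high, 1))  # → 154.3 185.7
```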
The second reason to study normal distributions is because of something called the central limit theorem. This states that the means of samples from a population form an approximately normal distribution as the sample size grows, no matter what the population distribution looks like.
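A small simulation illustrates the central limit theorem. The population here (uniform values between 0 and 1) is deliberately non-normal and chosen purely for illustration:

```python
import random
import statistics

random.seed(42)  # reproducible illustration

# Draw 2000 samples of size 30 from a flat (uniform) population and
# record the mean of each sample.
sample_means = [
    statistics.mean(random.random() for _ in range(30))
    for _ in range(2000)
]

# Even though the population is flat, the sample means pile up in a
# bell shape around the population mean of 0.5.
print(round(statistics.mean(sample_means), 2))  # → 0.5
```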
Therefore, the normal distribution is useful for estimating the means of populations. It allows us to calculate confidence intervals, which are regions that describe where we think the population mean probably is, based on our sample mean and variance.
Estimating the mean value of a population is the most common thing that people want to do when they do statistics. The central limit theorem allows us to use the properties of the normal distribution to do that. Details about estimating the mean of populations using confidence intervals are in this channel's video about confidence intervals.
The normal distribution is the most important probability distribution in all of statistics. Many statistical tests actually require that the values be normally distributed in order to work properly.
The normal distribution is the foundation of calculating confidence intervals to estimate the mean of a population and is the distribution that is the basis for performing t-tests that compare the means of different groups to one another.
The linked video shows some examples of calculations with the normal distribution and the playlist includes videos about confidence intervals and t-tests.
Like or subscribe to make sure you can find this channel again in the future.
This information is intended for the greater good; please use statistics responsibly.