NORMAL PROBABILITY DISTRIBUTION (EXAMPLES)




Connect with StatsExamples here



LINK TO SUMMARY SLIDE FOR VIDEO:


StatsExamples-normal-probability-examples.pdf

TRANSCRIPT OF VIDEO:


Slide 1

The normal distribution is a very useful distribution for understanding data. Let's look at some examples of how to use the normal distribution to understand a dataset.

Slide 2

First of all, let's do a quick review of what the normal, or Gaussian, distribution is all about. You can also watch our intro to the normal distribution video featured on this channel for more background information. That video is linked at the end of this one and in the video description below.
The normal distribution is sometimes called the bell curve distribution and it matches the distribution of many datasets found in nature. In order to use the properties of the normal distribution for our data we typically transform the distribution for our data into something called the standard normal distribution.
Normal distributions with a mean mu and a standard deviation sigma can be transformed into the standard normal distribution with a mean of zero and standard deviation of one using the equation here in the middle of the screen: Z equals X minus mu, all divided by sigma.
The values of X correspond to the values from the original data distribution whereas the values of Z correspond to values on the horizontal axis for the standard normal distribution.
When applied to all the values in our data set, that equation transforms a distribution like the one shown on the left, with a mean of mu and a width described by a standard deviation of Sigma, into a distribution shown on the right, where the mean is zero and the standard deviation is one.
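As a minimal sketch of this transformation, the equation can be written as a small Python function. The function name `z_score` and the sample numbers (a mean of 100 and a standard deviation of 15) are made up purely for illustration and are not from the video.

```python
def z_score(x, mu, sigma):
    """Transform a raw value x into a standard normal Z value."""
    return (x - mu) / sigma


# Hypothetical data set, assumed to have mean 100 and standard deviation 15.
MU, SIGMA = 100.0, 15.0
raw_values = [85.0, 100.0, 115.0, 130.0]

# Apply the same equation to every value in the data set.
z_values = [z_score(x, MU, SIGMA) for x in raw_values]
print(z_values)  # [-1.0, 0.0, 1.0, 2.0]
```

Each Z value says how many standard deviations the original value lies above or below the mean.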

Slide 3

This transformation process is useful because regions within the original normal distribution correspond to regions in the standard normal distribution. Areas in the standard normal distribution are known and tabulated in tables or stored in statistics programs.
This allows us to answer questions such as - what proportion of the overall distribution in the original data set is between the mean and a value one standard deviation above the mean?
We use the equation to transform this question into an equivalent question - what proportion of the standard normal distribution is between zero and one?
Since we have access to the areas of all regions in the standard normal distribution, we can use this information to answer questions about the original normally distributed data.

Slide 4

Tables of the areas in the standard normal distribution usually show areas to the left of certain Z values. This is not always the case, but most tables look like this one here from the StatsExamples website.
The top row and the left column correspond to different values of Z. The left column gives the integer part and first decimal place of Z, and the top row gives the second decimal place. This table shows the areas that correspond to Z values from negative 3.00 to positive 3.00. The values in the rest of the table are the areas under the standard normal distribution curve to the left of that particular Z value.
For example, what is the area in the standard normal distribution between Z = 0 and Z = 1?
First, we would identify the two areas in the table that correspond to our Z values.
►For Z = 0 the area to the left from the table is given as 0.5.
►For Z = 1 the area to the left from the table is given as 0.8413.
The table doesn't directly give us the area between two Z values, but we can use subtraction to figure that area out. The area between Z = 0 and Z = 1 would be the area to the left of 1 minus the area to the left of 0.
►The area between Z = 0 and Z = 1 is therefore 0.8413 - 0.5 = 0.3413.
►Another thing to note is that the total area for a standard normal distribution is equal to 1. If we were interested in areas to the right of a certain point, we can use that to our advantage.
For example, the area above Z = 1 would be 1 (the area under the entire curve), minus 0.8413 (the area to the left of Z = 1), which equals 0.1587.
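If a printed table isn't handy, the same areas can be looked up in software. The sketch below assumes Python 3.8 or later and uses `NormalDist` from the standard library, whose `cdf` method returns the area to the left of a value. The variable names are just for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

left_of_0 = std_normal.cdf(0.0)   # area to the left of Z = 0
left_of_1 = std_normal.cdf(1.0)   # area to the left of Z = 1

between_0_and_1 = left_of_1 - left_of_0   # subtract the two left areas
right_of_1 = 1.0 - left_of_1              # total area is 1

print(round(left_of_0, 4))        # 0.5
print(round(left_of_1, 4))        # 0.8413
print(round(between_0_and_1, 4))  # 0.3413
print(round(right_of_1, 4))       # 0.1587
```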
Now let's look at three concrete examples to see how we can use the properties of the normal distribution to answer questions about our datasets. We'll start with simple questions and work our way up to more complicated ones.

Slide 5

Our first example involves the masses of Guinea pigs. Consider a population of 80 Guinea pigs in which the mean mass is 1.1 kilograms and the standard deviation is 200 grams, which is 0.2 kilograms. For this example and the others we will assume that the original data set is normally distributed.
Keep in mind, for this example and the next two, the calculations we do will only be valid if the original data set is normally distributed.
For this first example we will ask 3 questions. First, how many Guinea pigs are lighter than 1.1 kilograms, then how many Guinea pigs are lighter than 1.4 kilograms, and finally how many Guinea pigs are heavier than 0.8 kilograms?
We won't answer these questions by looking at the actual data values from our data set, we'll use the properties of the normal distribution and our knowledge of the mean and standard deviation of the original data set to answer them.

Slide 6

First question, how many Guinea pigs are lighter than 1.1 kilograms.
Before we even start, we know that our answer should be 50% (which is 40 guinea pigs) because the mean mass of the Guinea pigs was 1.1 kilograms and the normal distribution is symmetric.
But let's use what we know about the normal distribution to answer this. First, we need to translate our question about the 1.1 kilograms, which is referring to values from our original data set, into a question about Z values in the standard normal distribution.
►We use the equation Z equals X minus mu, divided by sigma, which gives us:
1.1 minus 1.1, divided by 0.2 - which is 0.
►Our question then becomes equivalent to what proportion of the standard normal distribution is less than a Z value of 0?
►Now we would go look at our table of Z values. Reading down the left column we get to 0.0 and then looking at the top row we get that second decimal place of 0 which allows us to locate the position within the table that has the area to the left of Z equals 0.00. The value given in the table is 0.5000.
Keep in mind that the total area under the curve is 1 so the area of 0.5000 is equivalent to that exact proportion of the area under the curve.
►Thinking about our original Guinea pigs, 80 times 0.5000 equals 40 Guinea pigs from our original set that are lighter than 1.1 kilograms.

Slide 7

Next question, how many Guinea pigs are lighter than 1.4 kilograms.
Before we even start, we know that our answer should be more than 50% because 1.4 kilograms is greater than the mean mass of 1.1 kilograms and the normal distribution is symmetric.
First, we need to translate our question about the 1.4 kilograms, which is referring to values from our original data set, into a question about Z values in the standard normal distribution.
►The equation Z equals X minus mu, divided by sigma, gives us:
1.4 minus 1.1, divided by 0.2 - which is 0.3 divided by 0.2 - which is 1.5.
►Our question then becomes equivalent to what proportion of the standard normal distribution is less than a Z value of 1.5?
►Now we would go look at our table of Z values. Reading down the left column we get to 1.5 and then looking at the top row we get that second decimal place of 0 which allows us to locate the position within the table that has the area to the left of Z equals 1.50. The value given in the table is 0.9332.
As before, the area of 0.9332 is equivalent to that exact proportion of the area under the curve.
►Thinking about our original Guinea pigs, 80 times 0.9332 equals 74.656 Guinea pigs from our original set that are lighter than 1.4 kilograms. In the real world this would mean that 74 or 75 Guinea pigs are lighter than 1.4 kilograms.
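The two "lighter than" questions can also be checked in software. This is only a sketch under the same normality assumption, using Python's standard library `NormalDist`. The helper name `expected_count_below` is hypothetical.

```python
from statistics import NormalDist

N_PIGS = 80
# Guinea pig masses in kilograms, assumed normally distributed.
masses = NormalDist(mu=1.1, sigma=0.2)


def expected_count_below(x):
    """Expected number of Guinea pigs lighter than x kilograms."""
    return N_PIGS * masses.cdf(x)


print(round(expected_count_below(1.1), 1))  # 40.0
print(round(expected_count_below(1.4), 1))  # 74.7
```

The software does the Z transformation internally, so we can ask about kilograms directly instead of converting to Z values first.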

Slide 8

Next question, how many Guinea pigs are heavier than 0.8 kilograms.
Before we even start, we know that our answer should be more than 50% because 0.8 kilograms is less than the mean mass of 1.1 kilograms and the normal distribution is symmetric. We will also be looking at the area above a certain Z value.
First, we need to translate our question about the 0.8 kilograms into a question about Z values.
►The transformation equation gives us:
0.8 minus 1.1, divided by 0.2 - which is negative 0.3 divided by 0.2 - which is negative 1.5.
►Our question is now, what proportion of the standard normal distribution is larger than a Z value of -1.5?
►Now we look at our table of Z values. Reading down the left column we get to -1.5 and then looking at the top row we get that second decimal place of 0 which allows us to locate the position within the table that has the area to the left of Z equals -1.50. The value given in the table is 0.0668.
As before, the area of 0.0668 is equivalent to that exact proportion of the area under the curve to the left. But we want heavier guinea pigs, which is the area to the right, so we subtract this value from one.
►One minus 0.0668 equals 0.9332.
►Thinking about the original Guinea pigs, 80 times 0.9332 equals 74.656 Guinea pigs from our original set that are heavier than 0.8 kilograms.
This is actually the same as the answer from our previous example. This makes sense because in that example we were asking how many Guinea pigs were lighter than a mass 1 1/2 standard deviations above the mean whereas in this example we're asking how many Guinea pigs are heavier than a mass 1 1/2 standard deviations below the mean.
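A short sketch can confirm both the "heavier than" answer and the symmetry argument. It assumes Python's standard library `NormalDist`, and the variable names are illustrative only.

```python
from statistics import NormalDist

N_PIGS = 80
# Guinea pig masses in kilograms, assumed normally distributed.
masses = NormalDist(mu=1.1, sigma=0.2)

# Area to the right of 0.8 kg is one minus the area to the left.
heavier_than_0_8 = 1.0 - masses.cdf(0.8)
# Area to the left of 1.4 kg, for comparison.
lighter_than_1_4 = masses.cdf(1.4)

print(round(N_PIGS * heavier_than_0_8, 1))              # 74.7
# The two areas agree up to floating point rounding.
print(abs(heavier_than_0_8 - lighter_than_1_4) < 1e-9)  # True
```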

Slide 9

Our second example involves viral load values for infected patients. Consider a population of 15,000 infected patients in which the mean viral load is 60,000 viral particles per milliliter of blood with a standard deviation of 7,000 particles. Again, we'll assume that the original data set is normally distributed.
For this second example we'll ask 3 questions. First, how many patients have a viral load less than 62,000, then how many patients have a viral load between 53,000 and 74,000, and finally what viral load value divides the population into a top 5% and a bottom 95%?
As before, we won't answer these questions by looking at the actual data values for our patients, we'll use the properties of the normal distribution and our summary statistics from the original data set to answer them.

Slide 10

First question, how many patients have a viral load less than 62,000?
Before we even start, we know that our answer should be more than 50% because 62,000 is greater than the mean viral load of 60,000 and the normal distribution is symmetric.
First, we need to translate our question about the viral load of 62,000, which is referring to values from our original data set, into a question about Z values in the standard normal distribution.
►We use the equation Z equals X minus mu, divided by sigma, which gives us:
62,000 minus 60,000, divided by 7,000 - which is 2,000 divided by 7,000 - which is 0.2857.
►Our question then becomes equivalent to what proportion of the standard normal distribution is less than a Z value of 0.2857?
►Now we go look at our table of Z values. Unfortunately, our table only provides areas for Z values to two decimal places and our Z value has four places. Since we don't have the resolution to exactly match 0.2857 we have to use the Z values of 0.28 and 0.29 to estimate it.
Reading down the left column we get to 0.2 and then looking at the top row we get second decimal places of 8 and 9 which allow us to locate the positions within the table that bracket our true Z value. The values given in the table are 0.6103 and 0.6141.
►If we use the first value, we would multiply our 15,000 patients by 0.6103 to get an estimate of 9154.5 patients.
►If we use the second value, we would multiply our 15,000 patients by 0.6141 to get an estimate of 9211.5 patients.
►Our answer would therefore be - the number of patients with a viral load less than 62,000 would be in a range from 9,155 to 9,212.
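Software doesn't have the two decimal place limit of the table, so it can evaluate the area at the unrounded Z value directly. The sketch below, which assumes Python's standard library `NormalDist`, compares the two bracketing estimates with the unrounded one. The counts it prints can differ from the table-based ones in the last digit because the table areas are rounded to four places.

```python
from statistics import NormalDist

N_PATIENTS = 15_000
std_normal = NormalDist()  # mean 0, standard deviation 1

# Unrounded Z value for a viral load of 62,000.
z = (62_000 - 60_000) / 7_000

low = N_PATIENTS * std_normal.cdf(0.28)   # lower bracketing estimate
high = N_PATIENTS * std_normal.cdf(0.29)  # upper bracketing estimate
exact = N_PATIENTS * std_normal.cdf(z)    # estimate at the unrounded Z

print(round(z, 4))  # 0.2857
print(round(low, 1), round(exact, 1), round(high, 1))
print(low < exact < high)  # True
```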

Slide 11

Second question, how many patients have a viral load between 53,000 and 74,000?
Again, we translate our question about the viral load into a question about Z values.
►We use our transformation equation twice:
53,000 minus 60,000, divided by 7,000 - is -7,000 divided by 7,000 - which is -1.
74,000 minus 60,000, divided by 7,000 - is 14,000 divided by 7,000 - which is positive 2.
►Our question then becomes equivalent to what proportion of the standard normal distribution is between -1 and 2?
Unlike some of the previous examples where we could make a good guess about whether we would get more or less than 50%, this situation is a little trickier. Nevertheless we could sketch the normal distribution and look at the middle region and guess that it looks like more than 50%. To get the exact number we need the values from our Z table.
►Reading down the left column we get to -1 and then looking at the top row we get that second decimal place of 0 which allows us to locate the position within the table that has the area to the left of Z equals -1. This area is 0.1587.
►Likewise, reading down the left column we get to 2 and then looking at the top row we use 0 to find the area to the left of Z equals 2 of 0.9772.
►Now to get the area between these two Z scores we have to subtract the area to the left of Z = -1 from the area to the left of Z = 2. This would be 0.9772 - 0.1587 equals 0.8185.
►Multiplying our 15,000 patients by 0.8185 gives us an estimate of 12,277.5 patients with a viral load between 53,000 and 74,000.
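The "area between two values" calculation is a good candidate for a small reusable helper. This is a sketch assuming Python's standard library `NormalDist`. The function name `proportion_between` is hypothetical. For reference, the unrounded area to the left of Z = 2 is about 0.9772 and the area to the left of Z = -1 is about 0.1587.

```python
from statistics import NormalDist

N_PATIENTS = 15_000
# Viral loads, assumed normally distributed.
viral_load = NormalDist(mu=60_000, sigma=7_000)


def proportion_between(dist, lower, upper):
    """Area under dist between lower and upper (left areas subtracted)."""
    return dist.cdf(upper) - dist.cdf(lower)


p = proportion_between(viral_load, 53_000, 74_000)
print(round(p, 4))            # 0.8186
print(round(N_PATIENTS * p))  # 12279
```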

Slide 12

Third question, what viral load value divides the population into a top 5% and a bottom 95%?
This question is a little different from the ones we've been doing up until now. Instead of having data values that we translate into Z scores and then look up the areas, we'll start with the areas and work backwards to the Z scores and then to the values in our original data set.
►First, it's helpful to draw a little diagram to help us think about this.
►Next we go to our table of values and find the area values in the table that are closest to the one we want. In this case we're looking for the Z score that corresponds to 95% of the area to the left. Our table doesn't have any area values of exactly 0.95 but there is a spot on the table with a 0.9495 and a 0.9505 right next to each other. These areas correspond to Z scores of 1.64 and 1.65.
►Now we're going to use our transformation equation, but we have to rearrange it to solve for x, the values in our original data set. This new equation is x equals sigma times z plus mu.
►Using this equation gives us X = 7000 times 1.64 + 60,000 = 71,480 and X = 7000 times 1.65 + 60,000 = 71,550. Depending on which of the two Z scores we use we get a value between 71,480 and 71,550 for the viral load that divides the population into a top 5% and a bottom 95%.
►If we really wanted a more precise value, we could interpolate between these two Z values by taking the mean of those two values, 1.645, and multiplying by 7000 and adding 60,000 to get 71,515.
Interpolating values properly can be kind of tricky so I wouldn't recommend it unless you really know what you're doing. In this case it's easy because 0.95 is exactly halfway between the two area values from our table. But interpolation in other cases, where things aren't symmetric, can be a little more complicated as we'll see in one of the next examples.
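Working backwards from an area to a Z value is what an inverse cumulative distribution function does. As a sketch, Python's standard library `NormalDist` provides this as `inv_cdf`, so no table search or interpolation is needed. The constant names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1
MU, SIGMA = 60_000, 7_000   # viral load mean and standard deviation

# Z value with 95% of the area to its left.
z_95 = std_normal.inv_cdf(0.95)

# Rearranged transformation: x = sigma * z + mu.
cutoff = SIGMA * z_95 + MU

print(round(z_95, 4))  # 1.6449
print(round(cutoff))   # 71514
```

The result falls between the two table-based values of 71,480 and 71,550, very close to the midpoint estimate.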

Slide 13

Our last example involves a set of exam scores. Consider a population of 2,000 exam scores in which the mean is 74 with a standard deviation of 6. Again, we'll assume that the original data set is normally distributed.
For this example we'll ask 3 questions. How many scores are between 70 and 80? Which score corresponds to the 90th percentile? And finally, what is the range for the middle 50% of the scores?

Slide 14

How many scores are between 70 and 80?
We translate our question about scores into a question about Z values.
►We use our transformation equation twice:
70 minus 74, divided by 6 - is -4 divided by 6 - which is -0.67.
80 minus 74, divided by 6 - is 6 divided by 6 - which is positive 1.
►Our question then becomes equivalent to what proportion of the standard normal distribution is between -0.67 and 1?
Again, we could sketch the normal distribution and look at the middle region. This time it's harder to make a good guess so we really do need the values from the Z table.
►Reading down the left column we get to -0.6 and then looking at the top row we get that second decimal place of 7 which allows us to locate the position within the table that has the area to the left of Z equals -0.67. This area is 0.2514.
►Likewise, reading down the left column we get to 1 and then looking at the top row we use 0 to find the area to the left of Z equals 1 of 0.8413.
►Now to get the area between these two Z scores we have to subtract the area to the left of Z = -0.67 from the area to the left of Z = 1. This would be 0.8413 - 0.2514 equals 0.5899.
►Multiplying our 2,000 scores by 0.5899 gives us an estimate of 1,179.8 scores between 70 and 80.
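Rounding the Z value to -0.67 so that it fits the table has a small effect on the answer. The sketch below, assuming Python's standard library `NormalDist`, computes the count once with the table-style rounded Z values and once from the raw scores without rounding. The variable names are illustrative.

```python
from statistics import NormalDist

N_SCORES = 2_000
std_normal = NormalDist()          # mean 0, standard deviation 1
scores = NormalDist(mu=74, sigma=6)  # exam scores, assumed normal

# Table-style: Z values rounded to two decimal places.
table_style = std_normal.cdf(1.00) - std_normal.cdf(-0.67)

# Unrounded: work with the raw scores directly.
unrounded = scores.cdf(80) - scores.cdf(70)

print(round(N_SCORES * table_style, 1))  # 1179.8
print(round(N_SCORES * unrounded, 1))
```

The unrounded estimate comes out a couple of scores lower, which is a reminder that table-based answers are approximations.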

Slide 15

Next question, what score corresponds to the 90th percentile? This is the score such that 90% of the scores are equal to it or lower.
Just like the third viral load question, we will start with the areas and work backwards to the Z scores and then to the values in our original data set.
►First, it's helpful to draw a little diagram to help us think about this.
►Next we go to our table of values and find the area values in the table that are closest to the one we want. In this case we're looking for the Z score that corresponds to 90% of the area to the left. Our table doesn't have any area values of exactly 0.90 but there is a spot on the table with a 0.8997 and a 0.9015 right next to each other. These areas correspond to Z scores of 1.28 and 1.29.
►We use the rearranged version of our transformation equation - x equals sigma times z plus mu.
►Using this equation gives us X = 6 times 1.28 + 74 = 81.68 and X = 6 times 1.29 + 74 = 81.74. Depending on which of the two Z scores we use we get a value between 81.68 and 81.74 for the score that corresponds to the 90th percentile, dividing the population into an upper 10% and a lower 90%.
►If we really wanted a more precise value, we could interpolate between these two Z values. In this case however, because the 0.8997 area is much closer to 0.9 than 0.9015 is, the true Z value is going to be much closer to the Z value of 1.28 than the Z value of 1.29.
The interpolation would therefore be one using the magnitudes of the deviations as shown and give us an estimate of 81.69. Even this interpolation is not perfect because the normal distribution is not linear and interpolations like this one and the one we did for the third viral load example assume linearity.
An interpolation like this is better than just taking the midpoint between 81.68 and 81.74, but if you really need high precision then you should use a computer which can provide the exact Z value corresponding to an area of 0.9.
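A linear interpolation of this kind can be sketched as follows, alongside the value a computer gives directly. The function name `interpolate_z` is hypothetical, and the exact value comes from `inv_cdf` in Python's standard library `NormalDist`.

```python
from statistics import NormalDist

MU, SIGMA = 74, 6  # exam score mean and standard deviation


def interpolate_z(target, z_lo, area_lo, z_hi, area_hi):
    """Linearly interpolate the Z value whose left area equals target."""
    fraction = (target - area_lo) / (area_hi - area_lo)
    return z_lo + fraction * (z_hi - z_lo)


# Table entries that bracket an area of 0.90.
z_interp = interpolate_z(0.90, 1.28, 0.8997, 1.29, 0.9015)
# Z value computed directly, with no table or interpolation.
z_exact = NormalDist().inv_cdf(0.90)

print(round(SIGMA * z_interp + MU, 2))  # 81.69
print(round(SIGMA * z_exact + MU, 2))   # 81.69
```

To two decimal places the interpolated and directly computed scores agree here, although the underlying Z values differ slightly.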

Slide 16

Last question, what is the range for the middle 50% of the scores?
►First, it's helpful to draw a little diagram to help us think about this.
In our diagram we can see that we need 25% on each end around the 50% in the middle. This shows us that we want to get the Z scores that correspond to areas of 0.25 and 0.75.
►Our table doesn't have any area values of exactly 0.25 or 0.75, but there are values in the table for 0.2514 and 0.2483 right next to each other and 0.7486 and 0.7517 right next to each other. These areas correspond to Z scores of -0.67 and -0.68 and Z scores of 0.67 and 0.68. To keep it simple let's use Z scores of -0.675 and positive 0.675 for our calculations.
►We use the rearranged version of our transformation equation.
X = 6 times -0.675 + 74 = 69.95 and X = 6 times 0.675 + 74 = 78.05.
►Our final result would be that we expect half of all the exam scores to be between 69.95 and 78.05. 25% would be less than 69.95 and 25% would be higher than 78.05.
►We could even use this to calculate the IQR, a variation statistic described in one of the other videos on this channel, as 78.05 - 69.95 equals 8.1.
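The quartiles and the IQR can also be obtained without a table. This sketch assumes Python's standard library `NormalDist` and the same normality assumption as the rest of the example. The names `q1` and `q3` are illustrative.

```python
from statistics import NormalDist

# Exam scores, assumed normally distributed.
scores = NormalDist(mu=74, sigma=6)

q1 = scores.inv_cdf(0.25)  # score with 25% of the area to its left
q3 = scores.inv_cdf(0.75)  # score with 75% of the area to its left

print(round(q1, 2), round(q3, 2))  # 69.95 78.05
print(round(q3 - q1, 2))           # 8.09
```

The IQR of about 8.09 is slightly smaller than the 8.1 obtained with Z scores of plus and minus 0.675, because the unrounded Z value is closer to 0.6745.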
Zoom out

These examples illustrate how the normal distribution can be used to obtain useful information about our dataset just from knowing two values, the mean and the standard deviation. We're able to make useful and descriptive statements about our data without having to look at every single value, just a couple of summary statistics.
Keep in mind that this is only possible if we know that our data set has a normal distribution. This isn't always true for all real world data, but it turns out that lots of real world datasets do exhibit a normal distribution, especially when the value we are measuring is the result of numerous factors that contribute to it.
As with all StatsExamples videos there is a link on the StatsExamples website to a PDF of this summary slide.
If you found this video helpful feel free to share this video with others you think might also find it useful.
End screen

You can subscribe to this channel to see when new videos are posted or use the website social media links in the video description and on the website shown to connect with StatsExamples so you can always find it easily in the future. Press the like button to help other people find this video here on YouTube.




This information is intended for the greater good; please use statistics responsibly.