Variation, the normal distribution, and uncertainty

Continuous distributions describe how probability is spread across the values of a continuous random variable. Unlike discrete distributions (which deal with distinct, countable outcomes), continuous distributions cover an unbroken range of possible values. Here are a few key points:

1. Probability Density Function (PDF):

o The PDF describes how probability is spread across the values of a continuous random variable. Its height at a point is a density, not the probability of that exact value (which is zero).

o It’s a function that describes the relative likelihood of different outcomes.

o For example, the normal distribution (or Gaussian distribution) is a common continuous distribution with a bell-shaped PDF.

2. Area Under the Curve:

o In continuous distributions, probabilities are represented as areas under the curve.

o The total area under the curve is always 1 (since the variable must take on some value).

o To find the probability of a specific range of values, integrate the PDF over that range.

3. Examples of Continuous Distributions:

o Normal Distribution: Often seen in natural phenomena (e.g., heights, IQ scores). It’s symmetric and characterized by mean (μ) and standard deviation (σ).

o Exponential Distribution: Models time between events (e.g., time between customer arrivals).

o Uniform Distribution: All values within a specified interval are equally likely.

o Log-Normal Distribution: Useful for modeling positive quantities (e.g., stock prices).

4. Standardization:

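o Standardizing a continuous random variable means subtracting its mean and dividing by its standard deviation, which gives a variable with mean 0 and standard deviation 1. If the original variable is normally distributed, the result follows the standard normal distribution.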

o This process simplifies comparisons and calculations.

5. Central Limit Theorem (CLT):

o The CLT states that the sum (or average) of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed, whatever the shape of the original distribution.

o It’s a fundamental concept in statistics, especially when dealing with sample means.
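The Central Limit Theorem described in point 5 can be explored with a small simulation. The following is a minimal sketch, not a proof: it assumes an exponential distribution with rate 1 as the skewed starting point, and the sample size (50), number of samples (5000), and seed are arbitrary choices made for illustration.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples, seed=0):
    """Return the means of n_samples samples, each made of sample_size draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def draw_exponential(rng):
    """One draw from an exponential distribution (rate 1): skewed, mean 1, sd 1."""
    return rng.expovariate(1.0)


means = sample_means(draw_exponential, sample_size=50, n_samples=5000)

center = statistics.fmean(means)
spread = statistics.stdev(means)

# Share of sample means within one standard deviation of their center.
# For a normal distribution this share is about 68%.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means: {center:.3f}")   # theory: 1
print(f"sd of sample means:   {spread:.3f}")   # theory: 1 / sqrt(50), about 0.141
print(f"share within 1 sd:    {within_one_sd:.3f}")
```

Even though individual exponential values are strongly skewed, the sample means cluster symmetrically around 1, and the share within one standard deviation should come out close to the 68% expected of a normal distribution.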

Remember, continuous distributions play a crucial role in statistical inference, hypothesis testing, and confidence intervals. While we won’t delve into the math here, having an intuitive understanding of these concepts will serve you well as you work with data.

The Understanding Distributions module shows that you can use a histogram to graph the distribution of continuous values. Now, let's look at the concept of continuous distributions.

We won't discuss the formulas used to complete the calculations mentioned in this lesson, but having a general familiarity with these concepts may be useful to you as you continue to explore, understand, and communicate with data.

Density curves

The Understanding Distributions module explains how histograms can represent the distributions of finite samples of continuous variables. The height of each bar in the histogram is proportional to the frequency of the values within that bin. In other words, the higher the bar, the more frequently the data points from the sample are within that bin.

For example, the histogram on the top shows the distribution of stature, in inches, for 40 people. Clearly, this is a data sample of a finite number of data points. However, when you consider all the possible values of the continuous variable of stature, you see that it could vary widely. We would not have enough time in our lives to create a histogram with bins of every possible stature value. This is true for any continuous variable.

Instead of using a histogram to represent every possible value for a continuous variable, we can use a continuous distribution. A continuous distribution looks like a smooth curve, also called a density curve. The density curve represents more than just the values in a particular sample. It represents all possible values, as well as their probabilities of occurrence (how likely the values are to occur).

When looking at histograms, we use the height of each bar to understand how many data points fall within that bin. When we look at continuous distributions, however, we can't interpret the height of the density curve in that way.

Imagine, again, data that contains every possible value for stature. It's not meaningful to ask about the likelihood that someone stands at exactly 61 inches. With an infinite number of values, asking about 61 inches is as arbitrary as asking about the likelihood that someone stands at 61.002 inches or at 60.9997 inches.

Instead, we look at the probability within an interval. The probability within an interval equals the area under the curve within that interval.

The total area under the curve is 1, or 100%, because there is a 100% probability that a value falls somewhere under the curve.
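To make the idea of "probability as area" concrete, here is a minimal sketch using Python's standard library. It assumes, purely for illustration, that stature follows a normal distribution with a mean of 66 inches and a standard deviation of 3 inches; those numbers are hypothetical and do not come from the lesson's data.

```python
from statistics import NormalDist

# Hypothetical stature distribution in inches (illustrative values only).
stature = NormalDist(mu=66, sigma=3)


def probability_between(dist, low, high):
    """Area under the density curve of dist between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Probability that stature falls in the interval from 60 to 62 inches.
p_interval = probability_between(stature, 60, 62)

# A single exact value is a zero-width interval, so its area is zero.
p_exact = probability_between(stature, 61, 61)

# Ten standard deviations either side of the mean covers essentially all the area.
p_total = probability_between(stature, 36, 96)

print(f"P(60 < stature < 62) = {p_interval:.4f}")
print(f"P(stature = 61)      = {p_exact:.4f}")
print(f"area over 36 to 96   = {p_total:.4f}")
```

The cumulative distribution function (`cdf`) returns the area to the left of a value, so subtracting two of them gives the area between them.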

To summarize, here are some concepts to keep in mind when thinking about density curves:

They are continuous distributions that represent all possible data points at once.
The y-axis represents the density of probability, which shows the chance of obtaining values near corresponding points on the x-axis.
The total area under the curve is 100% or 1.

Normal distribution

Now we will focus on a special density curve, the normal distribution or normal curve. It has a symmetrical "bell" shape.

When you looked at the distributions of continuous variables graphed on histograms, you learned to describe a symmetrical distribution. If you folded a symmetrically distributed histogram in half, the two sides would match perfectly. In symmetrical distributions, the mean and the median are equal.

Just as with symmetrical distributions, in a normal distribution, the shape is symmetrical, and the mean is equal to the median.

Here are the major characteristics of normal distributions:

They are symmetrical around the mean.
The mean and median are equal.
The area under the normal curve is equal to 1.0 (or 100%).
They are denser in the center and less dense in the tails.
They are defined by two parameters, the mean and the standard deviation.

Look at the normal distribution shown on the curve above. In a normal distribution, about 68% of the data falls within 1 standard deviation of the mean (between -1 and +1), and about 95% falls within 2 standard deviations of the mean (between -2 and +2). The thin "tails" on both sides of the curve indicate that very few values (about 5%) fall more than 2 standard deviations from the mean.
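These percentages can be checked directly. The sketch below uses `NormalDist` from Python's standard library; the second distribution (mean 50, standard deviation 10) is a hypothetical example added to show that the shares do not depend on the particular mean or standard deviation.

```python
from statistics import NormalDist


def share_within(dist, k):
    """Probability that a value lies within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)


standard_normal = NormalDist(mu=0, sigma=1)
other_normal = NormalDist(mu=50, sigma=10)  # hypothetical example distribution

share_1sd = share_within(standard_normal, 1)
share_2sd = share_within(standard_normal, 2)
share_196sd = share_within(standard_normal, 1.96)

print(f"within 1 sd:    {share_1sd:.4f}")    # 0.6827
print(f"within 2 sd:    {share_2sd:.4f}")    # 0.9545
print(f"within 1.96 sd: {share_196sd:.4f}")  # 0.9500

# Any normal distribution gives the same shares once distances are
# measured in standard deviations (this is what standardizing does).
print(f"other, 1 sd:    {share_within(other_normal, 1):.4f}")  # 0.6827
```

Note that "2 standard deviations" is a convenient rounding: exactly 95% of the area lies within about 1.96 standard deviations of the mean.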

Normal distributions with smaller standard deviations will be narrower and taller than normal distributions with larger standard deviations.

In this image, both normal distributions have a mean of 50. The taller curve has a standard deviation of 5, and the shorter curve has a standard deviation of 10.
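The two curves described here can be compared numerically. This is a small sketch using the mean and standard deviations given above; the interval from 40 to 60 is an arbitrary choice made for illustration.

```python
from statistics import NormalDist

narrow = NormalDist(mu=50, sigma=5)   # the taller curve
wide = NormalDist(mu=50, sigma=10)    # the shorter curve

# Height of each density curve at the shared mean of 50.
peak_narrow = narrow.pdf(50)
peak_wide = wide.pdf(50)

# Area under each curve between 40 and 60.
area_narrow = narrow.cdf(60) - narrow.cdf(40)
area_wide = wide.cdf(60) - wide.cdf(40)

print(f"peak height, sd 5:  {peak_narrow:.4f}")  # 0.0798
print(f"peak height, sd 10: {peak_wide:.4f}")    # 0.0399
print(f"area 40 to 60, sd 5:  {area_narrow:.4f}")  # 0.9545
print(f"area 40 to 60, sd 10: {area_wide:.4f}")    # 0.6827
```

Halving the standard deviation doubles the peak height, because the total area under each curve must still equal 1. The narrower curve also packs more of its area into the same interval around the mean.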

The usefulness of the normal distribution

In his book The Truthful Art, information designer and professor Alberto Cairo explains that "no phenomenon in nature follows a perfect normal distribution, but many approximate it enough as to make it one of the main tools of statistics." Cairo goes on to explain, "If you know that the phenomenon you're studying is normally distributed, even if not perfectly, you can estimate the probability of any case or score with reasonable accuracy." In other words, we can use the properties of the normal curve to estimate the probability of a case or score with reasonable accuracy.

We often have to estimate a population from a sample because we can rarely measure the entire population. If the sample represents the population, the normal curve can be a useful estimation tool.

Confidence intervals

When using the normal curve to make probability estimations on sample data, you can use confidence intervals to arrive at a margin of error.

Confidence intervals are an example of inference. Inference is the process of drawing conclusions about a population based on a sample of the data.

A confidence interval is a range, calculated from a sample, that is built to contain the population mean a specified proportion of the time. For example, a 95% confidence level means that if you took many samples and calculated an interval from each one, about 95% of those intervals would include the true mean.

The 95% confidence interval is derived from the normal distribution, in which 95% of the data falls within about 2 (more precisely, 1.96) standard deviations of the mean.

Let's consider an example adapted from David M. Lane's chapter on confidence intervals in the online, public domain work Introduction to Statistics.

Imagine you are interested in the mean (average) weight, in pounds, of 10-year-old children in the United States. You obviously can't weigh every 10-year-old, so instead you weigh a sample of 16 children and find that the mean weight is 90 pounds. This sample mean of 90 is a point estimate of the population mean, but it doesn't give you a clear idea of how far the mean for the sample may be from the mean for the population. In other words, can you be confident that the mean weight for the entire US population of 10-year-old children is within 5 pounds of 90? From the point estimate alone, you simply cannot know.

However, you can use a calculation (not discussed here) to arrive at a 95% confidence interval. For this sample, the 95% confidence interval runs from 72.85 to 107.15 pounds.
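The lesson does not show the calculation, so the following is only a sketch of one common approach: the normal-based interval, sample mean ± 1.96 × (standard deviation / √n). The standard deviation of 35 pounds is an assumption; it is not stated in the text and was chosen here because it reproduces the quoted interval. The original example may have used different inputs or a different method.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 90.0   # from the example: mean weight of the sample, in pounds
n = 16               # from the example: number of children weighed
assumed_sd = 35.0    # ASSUMPTION: population standard deviation, not given in the text

# z-value that leaves 2.5% in each tail of the standard normal (about 1.96).
z = NormalDist().inv_cdf(0.975)

# Margin of error: z times the standard error of the mean.
margin = z * assumed_sd / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(f"margin of error: {margin:.2f}")                   # 17.15
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")  # 72.85 to 107.15
```

The margin of error shrinks as the sample grows: with the same assumed standard deviation, quadrupling the sample size to 64 children would halve the margin.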

In other words, there is good reason to believe that the mean weight for the entire US population of 10-year-old children falls between 72.85 and 107.15 pounds. If you took repeated samples and calculated a 95% confidence interval for each one, the intervals would contain the true mean 95% of the time.

This also means, however, that 5% of the time, the intervals will not contain the true mean.
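The "95% of the time" interpretation can be checked by simulation. The sketch below invents a population for the purpose: normally distributed weights with a true mean of 90 pounds and a standard deviation of 35 pounds. Both values, along with the seed and the number of trials, are hypothetical choices, and the interval uses the normal-based formula with a known standard deviation.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

# Hypothetical population parameters (assumptions for this simulation only).
TRUE_MEAN, TRUE_SD, SAMPLE_SIZE = 90.0, 35.0, 16

# z-value for a 95% interval (about 1.96) and the resulting margin of error.
Z = NormalDist().inv_cdf(0.975)
MARGIN = Z * TRUE_SD / sqrt(SAMPLE_SIZE)


def interval_covers_true_mean(rng):
    """Draw one sample, build its 95% interval, and report whether it contains TRUE_MEAN."""
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = fmean(sample)
    return sample_mean - MARGIN <= TRUE_MEAN <= sample_mean + MARGIN


rng = random.Random(42)
trials = 10_000
coverage = sum(interval_covers_true_mean(rng) for _ in range(trials)) / trials

print(f"share of intervals containing the true mean: {coverage:.3f}")
```

Each simulated sample produces a different interval. Roughly 95 out of every 100 of them should capture the true mean, and roughly 5 should miss it.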

Real-world examples in seeing uncertainty

Alberto Cairo, the author mentioned earlier in this lesson, has written a number of blog entries describing real-world examples of how uncertainty has been represented (and misunderstood) in visualizations that depict hurricane paths. You can find a blog entry about misinterpreting forecast maps for the 2019 Category 5 storm, Hurricane Dorian, along with other related topics, on Alberto Cairo's professional website.

Knowledge check

Which of the following statements is the most accurate about normal distributions?

Most natural phenomena perfectly follow normal distributions.
In a normal distribution, the median is greater than the mean.
A normal distribution is the same as a symmetrical histogram showing a finite set of continuous values.
When using a data sample that represents the total population, the normal distribution can be a useful estimation tool.

Summary

You've now gotten familiar with continuous distributions, including the special shape of the normal curve. In the next lesson, you'll take a look at the concept of hypothesis testing when using data samples.
