Variation, the normal distribution, and uncertainty
Continuous distributions describe how probability is spread across the values
of a continuous random variable. Unlike discrete distributions (which deal with
distinct, countable outcomes), continuous distributions cover every value
within a range. Here are a few key points:
1. Probability Density Function (PDF):
o The PDF gives the probability density of a continuous random variable at
each value. Density is not itself a probability: the probability of any single
exact value is 0.
o It’s a function that describes the
relative likelihood of different outcomes.
o For example, the normal distribution
(or Gaussian distribution) is a common continuous distribution with a
bell-shaped PDF.
2. Area Under the Curve:
o In continuous distributions,
probabilities are represented as areas under the curve.
o The total area under the curve is
always 1 (since the variable must take on some value).
o To find the probability of a specific
range of values, integrate the PDF over that range.
3. Examples of Continuous Distributions:
o Normal Distribution: Often seen in natural phenomena
(e.g., heights, IQ scores). It’s symmetric and characterized by mean (μ) and
standard deviation (σ).
o Exponential Distribution: Models time between events (e.g.,
time between customer arrivals).
o Uniform Distribution: All values within a specified
interval are equally likely.
o Log-Normal Distribution: Useful for modeling positive
quantities (e.g., stock prices).
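4. Standardization:
o Standardizing a continuous random variable means subtracting its mean and
dividing by its standard deviation, so that the result has mean 0 and standard
deviation 1. If the original variable is normally distributed, the result
follows the standard normal distribution.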
o This process simplifies comparisons
and calculations.
5. Central Limit Theorem (CLT):
o The CLT states that the sum (or average) of a large number of independent,
identically distributed random variables with finite variance is approximately
normally distributed, whatever the shape of the original distribution.
o It’s a fundamental concept in
statistics, especially when dealing with sample means.
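The Central Limit Theorem in point 5 can be illustrated with a small simulation. This is a minimal sketch, not part of the original lesson: it assumes an exponential distribution with mean 1 as the (clearly non-normal) source, and the sample size of 50, the 2,000 repetitions, and the helper name sample_means are all arbitrary choices made for illustration.

```python
import random
import statistics


def sample_means(n_per_sample, n_samples, seed=0):
    """Return the means of many samples drawn from an exponential distribution."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(n_per_sample)]
        means.append(statistics.fmean(sample))
    return means


means = sample_means(n_per_sample=50, n_samples=2000)
center = statistics.fmean(means)   # should sit near 1.0, the population mean
spread = statistics.stdev(means)   # should sit near 1 / sqrt(50)

# Share of sample means within one standard deviation of their center.
# For a normal distribution this share is about 0.68.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means:   {center:.3f}")
print(f"spread of sample means: {spread:.3f}")
print(f"share within 1 SD:      {within_one_sd:.3f}")
```

Although individual exponential values are strongly skewed, the sample means cluster symmetrically around the population mean, which is the behavior the theorem describes.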
Remember,
continuous distributions play a crucial role in statistical inference,
hypothesis testing, and confidence intervals. While we won’t delve into the
math here, having an intuitive understanding of these concepts will serve you
well as you work with data.
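The standardization step described in point 4 can be sketched in a few lines. The stature values below are made up for illustration and do not come from the lesson. Note that standardizing only shifts and rescales the values: it sets the mean to 0 and the standard deviation to 1 without changing the shape of the distribution.

```python
import statistics

# Hypothetical stature measurements in inches (illustrative values only).
statures = [61.0, 63.5, 64.0, 66.0, 67.5, 69.0, 70.5, 72.0]

mean = statistics.fmean(statures)
sd = statistics.pstdev(statures)  # population standard deviation of these values

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / sd for x in statures]

z_mean = statistics.fmean(z_scores)  # 0, up to floating-point rounding
z_sd = statistics.pstdev(z_scores)   # 1, up to floating-point rounding

for x, z in zip(statures, z_scores):
    print(f"{x:5.1f} inches -> z = {z:+.2f}")
```

Because z-scores are unit-free, values measured on different scales can be compared directly once both have been standardized.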
The Understanding
Distributions module shows that you can use a histogram to graph the
distribution of continuous values. Now, let's look at the concept of continuous
distributions.
We won't
discuss the formulas used to complete the calculations mentioned in this
lesson, but having a general familiarity with these concepts may be useful to
you as you continue to explore, understand, and communicate with data.
Density curves
The Understanding
Distributions module explains how histograms can represent the
distributions of finite samples of continuous variables. The height of each bar
in the histogram is proportional to the frequency of the values within that
bin. In other words, the higher the bar, the more frequently the data points
from the sample are within that bin.
For example,
the histogram on the top shows the distribution of stature, in inches, for 40
people. Clearly, this is a data sample of a finite number of data points.
However, when you consider all the possible values of the
continuous variable of stature, you see that it could vary widely. We would not
have enough time in our lives to create a histogram with bins of every possible
stature value. This is true for any continuous variable.
Instead of using a histogram to represent every possible value for a continuous variable, we can use a continuous distribution. A continuous distribution looks like a smooth curve, also called a density curve. The density curve represents more than just the values in a particular sample. It represents all possible values, as well as their probabilities of occurrence (how likely the values are to occur).
When looking
at histograms, we use the height of the bars to understand the number of data
points occurring within that bin, or how frequently the data points are within
that bin. When we look at continuous distributions, however, we can't interpret
the height of the density curve in that way.
Imagine, again, data that contains every possible value for stature. It's not
meaningful to ask about the likelihood that someone stands at exactly 61
inches. With an infinite number of possible values, the probability of any
single exact value is 0, so asking about exactly 61 inches is as arbitrary as
asking about exactly 61.002 inches or exactly 60.9997 inches.
Instead, we
look at the probability within an interval. The probability
within an interval equals the area under the curve within that interval.
The total area under the curve is 1, or 100%, because there is a 100%
probability that the variable takes some value within the range the curve
covers.
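The link between probability and area can be checked numerically. The sketch below assumes, purely for illustration, that stature follows a normal distribution with a mean of 66 inches and a standard deviation of 3 inches; neither figure comes from the lesson, and the helper area_under_pdf is a hypothetical name. It uses NormalDist from Python's standard statistics module.

```python
from statistics import NormalDist

# ASSUMPTION: illustrative model of stature in inches, not taken from the lesson.
stature = NormalDist(mu=66.0, sigma=3.0)


def area_under_pdf(dist, a, b, steps=10_000):
    """Approximate the area under dist.pdf between a and b (trapezoid rule)."""
    width = (b - a) / steps
    heights = [dist.pdf(a + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))


# Probability of a stature between 60 and 62 inches, computed two ways:
# as an area under the density curve and from the cumulative distribution.
p_area = area_under_pdf(stature, 60.0, 62.0)
p_cdf = stature.cdf(62.0) - stature.cdf(60.0)

# An interval of zero width has zero area, so P(exactly 61 inches) is 0.
p_exact = stature.cdf(61.0) - stature.cdf(61.0)

# Nearly all of the area lies within 8 standard deviations of the mean.
total_area = area_under_pdf(stature, 66.0 - 24.0, 66.0 + 24.0)

print(f"P(60 to 62) by area: {p_area:.4f}")
print(f"P(60 to 62) by cdf:  {p_cdf:.4f}")
print(f"P(exactly 61):       {p_exact}")
print(f"total area:          {total_area:.4f}")
```

The two interval calculations agree, the single-point probability is zero, and the total area comes out at 1.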
To
summarize, here are some concepts to keep in mind when thinking about density
curves:
- They are continuous
distributions that represent all possible data points at
once.
- The y-axis represents the density
of probability, which shows the chance of obtaining values near
corresponding points on the x-axis.
- The total area under the curve
is 100% or 1.
Normal distribution
Now we
will focus on a special density curve, the normal distribution or normal
curve. It has a symmetrical "bell" shape.
When you
looked at the distributions of continuous variables graphed on histograms, you
learned to describe a symmetrical distribution. If you folded a symmetrically
distributed histogram in half, the two sides would match perfectly. In
symmetrical distributions, the mean and the median are equal.
Just as
with symmetrical distributions, in a normal distribution, the shape is
symmetrical, and the mean is equal to the median.
Here are the major characteristics of normal distributions:
- They are symmetrical around the
mean.
- The mean and median are equal.
- The area under the normal curve
is equal to 1.0 (or 100%).
- They are denser in the center
and less dense in the tails.
- They are defined by two
parameters, the mean and the standard deviation.
Look at the normal distribution shown above. In a normal distribution, about
68% of the data falls within 1 standard deviation of the mean, and about 95% of
the data falls within 2 standard deviations of the mean. The thin "tails" on
both sides of the curve indicate that very few values (about 5%) fall more than
2 standard deviations from the mean.
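These percentages are rounded. A short check with the standard normal distribution (mean 0, standard deviation 1) shows the more precise values. This sketch uses NormalDist from Python's standard statistics module, and share_within is a helper name chosen for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1 = share_within(1)   # about 0.683
within_2 = share_within(2)   # about 0.954
outside_2 = 1 - within_2     # about 0.046

# Number of standard deviations that captures exactly 95% of the distribution.
z_95 = standard_normal.inv_cdf(0.975)  # about 1.96

print(f"within 1 SD:  {within_1:.3f}")
print(f"within 2 SD:  {within_2:.3f}")
print(f"outside 2 SD: {outside_2:.3f}")
print(f"z for 95%:    {z_95:.2f}")
```

The same shares hold for any normal distribution, because they depend only on distance from the mean measured in standard deviations.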
Normal
distributions with smaller standard deviations will be narrower and taller than
normal distributions with larger standard deviations.
In this
image, both normal distributions have a mean of 50. The taller curve has a
standard deviation of 5, and the shorter curve has a standard deviation of 10.
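The effect of the standard deviation on the height of the curve can be verified directly. The sketch below uses the two distributions described above (mean 50, standard deviations 5 and 10) with NormalDist from Python's standard statistics module; the variable names are illustrative.

```python
from statistics import NormalDist

narrow = NormalDist(mu=50, sigma=5)
wide = NormalDist(mu=50, sigma=10)

# Height of each curve at the mean, where the density is greatest.
peak_narrow = narrow.pdf(50)  # 1 / (5 * sqrt(2 * pi)), about 0.080
peak_wide = wide.pdf(50)      # 1 / (10 * sqrt(2 * pi)), about 0.040

# Share of each distribution between 40 and 60.
share_narrow = narrow.cdf(60) - narrow.cdf(40)  # within 2 SD, about 0.954
share_wide = wide.cdf(60) - wide.cdf(40)        # within 1 SD, about 0.683

print(f"peak height, SD 5:  {peak_narrow:.3f}")
print(f"peak height, SD 10: {peak_wide:.3f}")
print(f"share in 40-60, SD 5:  {share_narrow:.3f}")
print(f"share in 40-60, SD 10: {share_wide:.3f}")
```

Halving the standard deviation doubles the peak height. Both curves enclose a total area of 1, so the narrower curve has to be taller.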
The usefulness of the normal distribution
In his
book The Truthful Art, information designer and professor Alberto
Cairo explains that "no phenomenon in nature follows a perfect normal
distribution, but many approximate it enough as to make it one of the main
tools of statistics." Cairo goes on to explain, "If you know that the
phenomenon you're studying is normally distributed, even if not perfectly, you
can estimate the probability of any case or score with reasonable
accuracy." In other words, we can use the properties of the normal curve
to estimate the probability of a case or score with reasonable accuracy.
We are
often making estimates of a population from a sample because it is rare that we
can measure the entire population. If the sample represents the population, the
normal curve can be a useful estimation tool.
Confidence intervals
When using
the normal curve to make probability estimations on sample data, you can
use confidence intervals to arrive at a margin of error.
Confidence
intervals are an example of inference. Inference is the process of
drawing conclusions about a population based on a sample of the data.
A confidence interval is a range, calculated from a sample, that contains the
population mean a specified proportion of the time. For example, a 95%
confidence level means that if you took many samples and calculated an interval
from each one, about 95% of those intervals would include the true mean.
The 95% confidence interval is derived by using the normal distribution, where
about 95% of the data falls within 2 standard deviations of the mean (more
precisely, within 1.96 standard deviations).
Let's
consider an example adapted from David M. Lane's chapter on confidence
intervals in the online, public domain work Introduction to
Statistics.
Imagine you
are interested in the mean (average) weight, in pounds, of 10-year-old children
in the United States. You obviously can't weigh every 10-year-old, so, instead,
you weigh a sample of 16 children and find that the mean weight is 90 pounds.
This sample mean of 90 is a point estimate of the population
mean, but it doesn't give you a clear idea of how far the mean for the sample
may be from the mean for the population. In other words, can you be confident
that the mean weight for the entire US population of 10-year-old children is
within 5 pounds of 90? You simply cannot know.
However, you can use a calculation (not discussed here) to arrive at a 95%
confidence interval. For this sample, the 95% confidence interval runs from
72.85 to 107.15 pounds.
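For readers curious about the arithmetic, here is one way the interval above could be reproduced. This is a sketch under a stated assumption, not the lesson's own calculation: the lesson gives the sample mean (90) and the sample size (16) but not the standard deviation, so the value of 35 pounds used below is an assumption chosen because it yields the interval quoted above. The formula is the standard normal-based interval: sample mean plus or minus z times the standard deviation divided by the square root of the sample size.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 90.0  # from the example
n = 16              # from the example
assumed_sd = 35.0   # ASSUMPTION: not stated in the lesson
confidence = 0.95

# z is the number of standard deviations that captures the central 95%
# of a normal distribution (about 1.96).
z = NormalDist().inv_cdf(0.5 + confidence / 2)

# Standard error: the standard deviation of the sample mean.
standard_error = assumed_sd / sqrt(n)

margin = z * standard_error
lower = sample_mean - margin
upper = sample_mean + margin

print(f"margin of error: {margin:.2f} pounds")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f} pounds")
```

With these inputs the margin of error is about 17.15 pounds, which gives the interval from 72.85 to 107.15 pounds. A larger sample would shrink the standard error and therefore narrow the interval.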
In other words, there is good reason to believe that the mean weight for the
entire US population of 10-year-old children falls between 72.85 and 107.15
pounds. If you took repeated samples and calculated a 95% confidence interval
for each one, about 95% of those intervals would contain the true mean.
This also means, however, that about 5% of the time, the intervals will not
contain the true mean.
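The "95% of intervals" interpretation can be explored with a simulation. The sketch below invents a population with a known true mean so that each interval can be checked against it. The true mean of 90, the standard deviation of 35, and the 4,000 repetitions are assumptions for illustration, and coverage is a hypothetical helper name.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(true_mean, sd, n, trials, confidence=0.95, seed=0):
    """Fraction of simulated confidence intervals that contain true_mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sd / sqrt(n)  # assumes the standard deviation is known
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        sample_mean = fmean(sample)
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            hits += 1
    return hits / trials


# ASSUMPTION: a made-up population of weights, mean 90 and SD 35 pounds.
rate = coverage(true_mean=90.0, sd=35.0, n=16, trials=4000)

print(f"share of intervals containing the true mean: {rate:.3f}")
```

The share comes out close to 0.95. Each individual interval either contains the true mean or it does not, so the 95% figure describes the long-run behavior of the procedure rather than any single interval.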
Real-world examples in seeing uncertainty
Alberto
Cairo, the author mentioned earlier in this lesson, has written a number of
blog entries describing real-world examples of how uncertainty has been
represented (and misunderstood) in visualizations that depict hurricane paths.
You can access a blog entry about misinterpreting forecasting maps for the
2019 Category 5 storm, Hurricane Dorian, in addition to other related topics,
on Alberto Cairo's professional website.
Knowledge check
Which of the
following statements is the most accurate about normal distributions?
- Most natural phenomena perfectly
follow normal distributions.
- In a normal distribution, the
median is greater than the mean.
- A normal distribution is the
same as a symmetrical histogram showing a finite set of continuous values.
- When using a data sample that
represents the total population, the normal distribution can be a useful
estimation tool.
Summary
You've now
gotten familiar with continuous distributions, including the special shape of
the normal curve. In the next lesson, you'll take a look at the concept of
hypothesis testing when using data samples.








