1. Normal Distribution
Read Tutorial 4 & 5 of Tredoux, C.G. & Durrheim, K.L. (2018). Numbers, hypotheses and conclusions. 3rd edition. Cape Town: Juta Publishers (but the 2nd edition could also be used)
1.1. What does a Normal Distribution look like?
The Normal Distribution is one kind of probability distribution that is commonly used in the study of psychological phenomena. Because the Normal Distribution is a type of probability distribution, it depicts the distribution of a variable at the population level, displaying the probability associated with all possible outcomes.
It is referred to as “normal” because of its characteristic shape:
- Unimodal (i.e. one central peak)
- Symmetrical (i.e. each side of the distribution is a mirror image of the other)
- Asymptotic (i.e. the tails/extreme ends of the graph never reach the x-axis)
Directly below is a Normal Distribution.

The Normal Distribution is particularly useful in the social sciences because many psychological variables that are studied can be approximated by this distribution. In particular, the majority of scores fall in and around the centre (hence the central peak), and a minority of scores fall at extreme values (i.e. very low or very high values, in the tails of the graph).
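If you would like to draw a curve like this for yourself, here is a minimal base R sketch (our own example; it uses dnorm(), the Normal density function, together with curve(); the labels are our own choice):
curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4,
      xlab = "Value", ylab = "Density", main = "A Normal Distribution")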
1.2. Some variables that a Normal Distribution approximates
Let’s look at some rather crude, but useful examples of psychological variables that follow normal distributions:
- Intelligence quotients (IQ)
- The majority of participants score around the average IQ (at the centre), approximately 100
- A minority of participants score more than 20 points below the average IQ, less than 80. They might be considered intellectually less able.
- Another minority of participants score more than 20 points above the average IQ, greater than 120. They might be considered more intelligent than average.

- Working memory (Digit Span Test)
- The majority of people can typically remember 5 to 8 numbers
- A minority of people can remember fewer than 5 numbers. They might be considered to have impaired memory.
- Another minority of participants can remember more than 8 numbers. They might be considered to have above average memory abilities.

1.3. Normal vs Other Distributions
The Normal Distribution is just one type of probability distribution. Each probability distribution has its own associated characteristics.

Exercise 1.3.1
All distributions can be described in terms of their symmetry and kurtosis
1.3.1 Symmetry
Normal Distributions are symmetrical. Asymmetrical distributions, in contrast, can be characterised as
- Negatively skewed
- Most of the values are clustered together at the right-hand side of the distribution, with a longer left-hand tail
- Positively skewed
- Most of the values are clustered together at the left-hand side of the distribution, with a longer right-hand tail
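Here is a minimal R sketch (our own illustration, using the chi-squared distribution via rchisq() simply because it produces skewed values) of what positively and negatively skewed data can look like:
set.seed(1)
skew = rchisq(n = 1000, df = 3)                  # values with a long right-hand tail
hist(skew, xlab = "Value", main = "Positively skewed (long right-hand tail)")
hist(max(skew) - skew, xlab = "Value",           # the same values reflected, giving a long left-hand tail
     main = "Negatively skewed (long left-hand tail)")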

1.3.2 Kurtosis
Kurtosis is a term used to describe the pattern of extreme values (i.e. the tails) of a distribution, relative to the Normal Distribution, which serves as the point of reference.
Normal Distributions are referred to as mesokurtic. Distributions that are NOT mesokurtic can be described as…
- Platykurtic
- Compared to the Normal Distribution, values are more spread out around the mean (i.e. the centre)
- Leptokurtic
- Compared to the Normal Distribution, values are more clustered around the mean
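Here is a minimal R sketch (our own illustration; rt() draws from a t-distribution, which has heavier tails than the Normal, and runif() from a uniform distribution, which has lighter tails) to contrast the two patterns:
set.seed(2)
hist(rt(n = 1000, df = 5), xlab = "Value",
     main = "Heavier tails than the Normal (leptokurtic)")
hist(runif(n = 1000, min = -3, max = 3), xlab = "Value",
     main = "Lighter tails than the Normal (platykurtic)")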

Here’s an easier way to think about kurtosis!

1.4. Normal Distribution parameters
Normal Distributions can also be described in terms of two additional characteristics:
- Central tendency (In this case the Mean (\(\mu\)))
- Spread (In this case the Standard Deviation (\(\sigma\)))
These characteristics are referred to as parameters, as they capture information about the entire population.

1.4.1 Normal Distribution family
The Normal Distribution is not a single probability distribution, but rather consists of a family of distributions.
Normal Distributions can differ from one another in terms of Central Tendency and Spread
Normal distributions of daily hours of sunlight (over a one-year period) in three different cities are displayed below

Exercise 1.4.1.1
Hint: Think about what parameters can affect the shape of the Normal Distribution (look back at section 1.4.), as well as how they affect it
Exercise 1.4.1.2
Exercise 1.4.1.3
Hint: Think about what parameters can affect the appearance of the Normal Distributions (look back at section 1.4.), as well as how they affect it
Normal distributions of daily hours of sunlight in Cape Town are displayed for the following seasons: Summer, Autumn and Winter

Exercise 1.4.1.4
Exercise 1.4.1.5
1.4.2. Effect of changes in central tendency and variability on Normal Distribution
In section 1.4.1. you will have seen how the appearance of the Normal Distribution changes as central tendency and variability change.
Normal distributions can be entirely determined with knowledge of the two parameter values:
- Mean (\(\mu\))
- Standard deviation (\(\sigma\))
Let’s have a look at how changes to the mean and standard deviation can affect the Normal Distribution, through simulation (see Optional advanced section of Guided Tutorial 2 for a brief discussion on simulation).
The function rnorm() allows one to randomly draw values that come from a Normal Distribution characterised by a specific set of parameter values (i.e. mean (\(\mu\)) and standard deviation (\(\sigma\)))
Exercise 1.4.2.1
Inspect the help file for the rnorm() command (familiarise yourself with the inputs/arguments required)
?rnorm()
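As a quick illustration (our own example values), rnorm() simply returns the requested number of random draws; the numbers will differ every time you run it unless a seed is set first:
rnorm(n = 5, mean = 50, sd = 10)   # five random draws from a Normal Distribution with mean 50 and sd 10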
Stop Signal Task (SST). Let’s say that the Stop Signal Task (SST), which is a measure of response inhibition, has a mean (\(\mu\)) commission rate of 50 percent and a standard deviation (\(\sigma\)) of 10 percent
Exercise 1.4.2.2
Use the rnorm() command below to simulate 1000 random commission rates from a Normal Distribution that has a mean of 50 and standard deviation of 10, and plot the values in a histogram.
Hint: Remember to set the seed using the set.seed() function (see section 3.1.1 of Guided Tutorial 2), and label your plot
set.seed(...)
...=rnorm(n=..., mean = ..., sd =...)
...(..., xlab =..., main =..., xlim=c(0, 100))
set.seed(4)
data=rnorm(n=1000, mean = 50, sd =10)
hist(data, xlab= "Commission Rates (percentage)", main= "Graph of distribution of commission rates on SST", xlim=c(0, 100))
Note: The “xlim” argument sets the range of the x-axis. In this case, the x-axis runs from 0 (minimum) to 100 (maximum). If we changed the values supplied to c() (the combine function) to 0 and 200, the axis would extend to 200 and the displayed distribution would adjust accordingly. The x-axis limits can technically be set to anything from -∞ to +∞, but should ideally be suited to the realistic range of your data.
Exercise 1.4.2.3
Now simulate 1000 random commission rates from a Normal Distribution, with the same mean (50) but now with a standard deviation of 20, and plot the values in a histogram.
hist(rnorm(..., ..., ...),xlim=c(0, 150), xlab="...", main="...")
hist(rnorm(n=1000, mean = 50, sd = 20),xlab= "Commission rates (percentage)", main = "Graph of distribution of commission rates on SST", xlim=c(0, 100))
#You should notice that this solution takes a slightly different approach from that presented in exercise 1.4.2.2: the rnorm() call is placed directly inside hist(), which is useful if one doesn't want to save the simulated data into an object. Also, try changing the upper value supplied to the xlim argument (e.g. from 100 to 150, as in the hint) and note what this does to the graph.
Exercise 1.4.2.4
Exercise 1.4.2.5
Hint: Remember that the normal distribution is symmetrical around the mean
2. Area under the curve
2.1 Score to probability
Probabilities are derived from the area under the curve of some probability distribution (see section 2.6.3 of Guided Tutorial 2).
Birth weight. Let’s say we know that birth weight is normally distributed, with a mean (\(\mu\)) weight of 2.8kg and a standard deviation (\(\sigma\)) of 0.7kg.
If one were interested in knowing the likelihood (i.e. probability) of any newborn child having a weight equal to or under 1.6kg, the area under the curve from the value 1.6kg and lower would represent the associated probability (see the birth weight distribution below). The pnorm() command can be used to extract the probability associated with a value that comes from a Normal Distribution.
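If you would like to reproduce a plot like the one described here, the following is a minimal base R sketch (our own addition; it uses dnorm(), the Normal density function, and polygon() to shade the area at or below 1.6kg):
x = seq(0.5, 5.1, length.out = 200)              # range of birth weights to plot
plot(x, dnorm(x, mean = 2.8, sd = 0.7), type = "l",
     xlab = "Birth weight (kg)", ylab = "Density",
     main = "Area under the curve at or below 1.6kg")
shade = seq(0.5, 1.6, length.out = 100)          # x-values for the shaded region
polygon(c(shade, rev(shade)),
        c(dnorm(shade, mean = 2.8, sd = 0.7), rep(0, length(shade))),
        col = "grey")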

Exercise 2.1.1.1
Use the pnorm() command to determine the probability of any infant having a birth weight of 1.6kg or lower
pnorm(q=1.6, mean =..., sd=... )
pnorm(q=1.6, mean = 2.8, sd = 0.7)
Note: the pnorm() command calculates cumulative probabilities, i.e. the probability of obtaining a value at or below a specific cut-off (the total area under the curve to the left of that value). Using the example from exercise 2.1.1.1, pnorm() computes the probability associated with obtaining a weight of 1.6kg or lower, incorporating all possible weights at or below this specific cut-off.
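As an optional check of this idea (our own addition), base R's integrate() function can be used to compute the area under the density curve directly; it should agree with pnorm():
integrate(dnorm, lower = -Inf, upper = 1.6, mean = 2.8, sd = 0.7)   # area under the curve up to 1.6kg
pnorm(q = 1.6, mean = 2.8, sd = 0.7)                                # cumulative probability from pnorm()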
Exercise 2.1.1.2
Use the pnorm() command to determine the probability of any infant having a birth weight of 5kg or greater
1-pnorm(q=..., mean =..., sd=... )
1-pnorm(q=5, mean = 2.8, sd = 0.7)
Note: Notice how exercise 2.1.1.2 differs from 2.1.1.1. This is because the pnorm() command calculates probabilities using the area under the curve to the left of the given value (see the graphs below to visualise how the pnorm() command operates).
An alternative to 1-pnorm() is to change the pnorm() argument from lower.tail = TRUE to lower.tail = FALSE
Exercise 2.1.1.3
Determine the probability of any infant having a birth weight of 5kg or greater, setting lower.tail = FALSE. Check that your answer is the same as that for exercise 2.1.1.2
Note: By default, the lower.tail argument within the pnorm() command is set to TRUE, meaning that the returned probability covers values at or below the cut-off score
pnorm(q=..., mean =..., sd=..., lower.tail = ... )
pnorm(q=5, mean =2.8, sd=0.7, lower.tail = FALSE )
Exercise 2.1.1.4
Use the pnorm() command to determine the probability of any infant having a birth weight between 1.6kg and 5kg (inclusive)
pnorm(q=5, mean =..., sd=... ) - pnorm(q=1.6, mean =..., sd = ...)
pnorm(q=5, mean =2.8, sd=0.7) - pnorm(q=1.6, mean =2.8, sd=0.7)
Exercise 2.1.1.5
Exercise 2.1.1.6
Complete the code chunk to calculate the probability of any infant having a birth weight equal to or greater than 2.8kg
1-pnorm(q=2.8, mean =2.8, sd=0.7)
2.2 Probability to score
Just as one can obtain the probability associated with some value (e.g. a birth weight) by extracting the area under the curve, one can also extract the score associated with a given probability, using the qnorm() command
Using the same example of the birth weight distribution, with a mean (\(\mu\)) of 2.8kg and standard deviation (\(\sigma\)) of 0.7kg, one may wish to find the birth weight associated with the smallest 5% of infants
Exercise 2.2.1.1
Use the qnorm() command to determine the maximum birth weight associated with the smallest 5% (0.05) of infants
qnorm(p=0.05, mean =..., sd=...)
qnorm(p=0.05, mean =2.8, sd=0.7)
Exercise 2.2.1.2
Use the qnorm() command to determine the minimum birth weight associated with the largest 5% (0.95) of infants
Note: Notice that the top 5% of birth weights is represented by a probability of 0.95. This is because qnorm() works with cumulative probabilities from the lower tail: the minimum weight of the largest 5% of infants is the value below which the remaining 95% of birth weights fall (probabilities across the whole distribution sum to 1)
qnorm(p=0.95, mean =..., sd=...)
qnorm(p=0.95, mean =2.8, sd=0.7)
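As a quick check (our own addition), qnorm() and pnorm() are inverses of one another: feeding the weight returned by qnorm() back into pnorm() recovers the original probability:
pnorm(q = qnorm(p = 0.95, mean = 2.8, sd = 0.7), mean = 2.8, sd = 0.7)   # returns 0.95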
The distribution below represents the probability of the bottom 5% and top 5% of birth weights respectively

Exercise 2.2.1.3
Use the qnorm() command to determine the 1st and 3rd quartiles for birth weight
Hint: The 1st quartile is represented by a probability of 0.25, whilst the 3rd quartile is represented by a probability of 0.75
qnorm(p=..., mean =..., sd=...)
qnorm(p=..., mean =..., sd=...)
qnorm(p=0.25, mean =2.8, sd=0.7)
qnorm(p=0.75, mean =2.8, sd=0.7)
3. Standard Normal Distribution
We remind you that the Normal Distribution is a family of distributions rather than a single distribution, whose members can vary in mean (\(\mu\)) and standard deviation (\(\sigma\))
The Standard Normal Distribution is a standardized version of the Normal Distribution, where the mean (\(\mu\)) is zero and the standard deviation (\(\sigma\)) is one.
Why is the Standard Normal Distribution useful?
Whereas individual Normal Distributions can differ in their means (\(\mu\)) and standard deviations (\(\sigma\)), the Standard Normal Distribution always has the same fixed parameters, giving us a single reference distribution to work with.
The following can always be inferred from the Standard Normal Distribution:
- \(\approx\) 68% of the area under the curve lies between -1 and +1
- \(\approx\) 95% of the area under the curve lies between -2 and +2
- \(\approx\) 99.7% of the area under the curve lies between -3 and +3
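You can verify these rules of thumb yourself with the pnorm() command introduced earlier (a quick sketch; pnorm() uses mean = 0 and sd = 1 by default, i.e. the Standard Normal Distribution):
pnorm(1) - pnorm(-1)   # area between -1 and +1, approximately 0.68
pnorm(2) - pnorm(-2)   # area between -2 and +2, approximately 0.95
pnorm(3) - pnorm(-3)   # area between -3 and +3, approximately 0.997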
See an example of a Standard Normal Distribution below. The values on the x-axis are referred to as z-scores

3.1. Generating a Standard Normal Distribution
A Standard Normal Distribution can be simulated like any other Normal Distribution using the rnorm() command, where the mean (\(\mu\)) and standard deviation (\(\sigma\)) are now specified as 0 and 1 respectively (or simply omitted, since they are the default arguments the function assumes)
Exercise 3.1.1.1
Use the rnorm() command to simulate 1000 scores from a Standard Normal Distribution, and plot your data in a histogram
set.seed(...)
...=rnorm(n=..., mean =..., sd=...)
hist(..., xlab="...", main="...")
set.seed(4)
dat.SN=rnorm(n=1000, mean =0, sd=1)
hist(dat.SN,xlab="Standard Normal Distribution", main="Histogram of Standard Normal Distribution")
Exercise 3.1.1.2
3.2. Transform Normal to Standard Normal Distribution (converting to Z-scores)
Any Normal Distribution can be converted to a Standard Normal Distribution by converting all possible raw scores to z-scores
Z-score formula:
\[ \begin{aligned} Zscore =\frac{X - \mu}{\sigma} \end{aligned} \]
\(X\) = any raw score in the range of possible values of some Normal Distribution
\(\mu\) = population mean of the Normal Distribution
\(\sigma\) = population standard deviation of the Normal Distribution
Let’s look at simulated birth weight scores, in the object vector named “birth.weight”, that come from a Normal Distribution of mean (\(\mu\)) 2.8kg and standard deviation (\(\sigma\)) 0.7kg.
Exercise 3.2.1.1
Use an appropriate plot to display the distribution of birth weights from the “birth.weight” object vector.
...(...,...)
hist(birth.weight, xlab ="Birth Weights in kgs", main="Plot of Distribution of Birth Weights")
Exercise 3.2.1.2
Using the Z-score formula, convert a birth weight of 1kg into a z-score.
(...-2.8)/0.7
(1-2.8)/0.7
Exercise 3.2.1.3
Using the Z-score formula, convert all raw birth weight scores from the “birth.weight” vector into z-scores. Plot the saved z-scores.
...=(birth.weight-...)/...
...(...,xlab ="...", main="...")
zscore =(birth.weight-2.8)/0.7
hist(zscore, xlab ="Z-scores", main="Graph of Distribution of Z-scores")
You should notice that the original mean (\(\mu\)) of 2.8kg has now been converted to a mean of 0, and the original standard deviation (\(\sigma\)) of 0.7kg has been converted to a standard deviation of 1.
Given that it may seem silly to speak of a mean (\(\mu\)) birth weight of 0kg, we opt to refer to all transformed raw scores as z-scores.
One quicker method of converting raw scores into z-scores is to use the scale() command
Exercise 3.2.1.4
Use the scale() command to convert all raw birth weight scores from the “birth.weight” vector into z-scores. Plot the saved z-scores.
...=scale(...)
...(..., xlab="...", main="...")
zscore2 =scale(birth.weight)
hist(zscore2, xlab="Z-scores", main="Histogram of distribution of Z-scores")
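One small caveat (our own note): scale() standardises a vector using that vector's own sample mean and standard deviation, so its z-scores can differ very slightly from those computed with the population parameters (2.8 and 0.7) used above:
mean(birth.weight); sd(birth.weight)                          # compare with the population values 2.8 and 0.7
head(cbind(manual = zscore, scaled = as.numeric(zscore2)))    # the two sets of z-scores side by side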
3.3. Z-scores to probability
The pnorm() command can be used to extract the probability associated with a z-score that comes from a Standard Normal Distribution
Exercise 3.3.1.1
Use the pnorm() command to determine the probability of an infant obtaining a z-score of 2 or lower
Reminder: You have previously used the pnorm() command with Normal Distributions.
Hint: Look back at exercise 2.1.1.1
pnorm(q =..., mean=..., sd=...)
pnorm(q = 2, mean=0, sd=1)
3.4. Probability to z-scores
The qnorm() command can be used to calculate the z-score associated with a particular probability
Exercise 3.4.1.1
Use the qnorm() command to determine the minimum z-score associated with the largest 10% (0.90) of infants
Reminder: You have previously used the qnorm() command with Normal Distributions.
Hint: Think about how the mean and standard deviation differ between a Normal and Standard Normal Distribution
qnorm(p=..., mean=..., sd=...)
qnorm(p=0.90, mean =0, sd=1)
4. Advanced Section
Whilst the Normal Distribution is frequently used to characterise psychological variables, not all variables are normally distributed.
For a variable to be normally distributed, it must
- Be a continuous variable, i.e. a numeric variable that can in theory take on any value between minus and plus infinity (but in practice lies between some minimum and maximum; see the advanced section of Guided Tutorial 1)
- Be distributed in such a way that the majority of values are close to the central tendency (mean/median), with fewer extreme values (i.e. in the lower and upper tails)
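As a brief illustration (our own example, using a simulated count of errors via rpois()), a variable that can only take whole, non-negative values typically does not meet these requirements:
set.seed(3)
errors = rpois(n = 1000, lambda = 2)             # simulated error counts (discrete, non-negative)
hist(errors, xlab = "Number of errors",
     main = "A count variable: discrete and skewed, not Normal")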
Exercise 4.1.1
Hint: Scores can only be normally distributed if a variable is continuous and not categorical
Exercise 4.1.2
Hint: More than one answer is correct, choose any correct answer.
Exercise 4.1.3
Free-form exercise
Move on to and complete Statistics Tutorial Assignment 3 on Amathuba (Activities | Assignments)
Other resources & references
Developed by: Marilyn Lake & Colin Tredoux, University of Cape Town