1. Descriptive vs inferential statistics
Read Tutorial 4 of Tredoux, C.G. & Durrheim, K.L. (2018). Numbers, hypotheses and conclusions. 3rd edition. Cape Town: Juta Publishers. (The 2nd edition can also be used.)
1.1. Descriptive statistics
In tutorial 1, we calculated descriptive statistics for samples (including central tendency and variability estimates such as the mean and standard deviation).
A sample refers to the group of observations drawn from the total population of interest (e.g. a sample of 72 female anorexic patients was studied out of the nation-wide hospital registry of female anorexic patients).
Descriptive statistics allow us to summarise the characteristics of a sample.
1.2. Inferential statistics
Unlike descriptive statistics, which summarise a sample, inferential statistics use sample data to draw conclusions about the population.
A population includes all possible elements (e.g. the entire female anorexia patient hospital registry).
Inferential statistics allow us to characterise a population.
One does not always have access to an entire population, but probability theory can be used to infer characteristics of a population from a sample.

2. What is probability?
Probability can be defined in 2 ways:
- Probability as the likelihood of occurrence of random events, referred to as theoretical probability
  - Refers to what we expect to happen
  - Expressed as a theoretical probability (e.g. 0.5)
  - For example: in a random fair coin toss, there is a 0.5 probability of obtaining tails.
- Probability as the frequency of occurrence, known as empirical probability
  - Refers to what we find out through observation
  - Typically expressed as a percentage (e.g. 20%), proportion (e.g. 0.2) or fraction (e.g. 2000/10000)
  - For example: 2000 out of 10000 randomly sampled individuals from the population are HIV-positive, suggesting that there is a 20% chance that any randomly chosen individual is HIV-positive
Probability formula:
\[ \begin{aligned} P(E) =\frac{Number\ of\ outcomes\ of\ interest}{Total\ number\ of\ outcomes} \end{aligned} \]
NB: Probability is conventionally represented as a value ranging between (and including) 0 and 1.
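As a minimal sketch, the formula can be applied in R to the two examples given earlier in this section. The helper name `prob` is hypothetical and is introduced here only for illustration.

```r
# Hypothetical helper: number of outcomes of interest / total number of outcomes
prob <- function(n_interest, n_total) {
  n_interest / n_total
}

# Empirical probability: 2000 HIV-positive individuals out of 10000 sampled
p_hiv <- prob(2000, 10000)
p_hiv

# Theoretical probability of tails on a fair coin: 1 outcome of interest out of 2
p_tails <- prob(1, 2)
p_tails

# A probability always lies between 0 and 1 (inclusive)
p_hiv >= 0 & p_hiv <= 1
```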
Exercise 2.0.1
Exercise 2.0.2
Exercise 2.0.3
Exercise 2.0.4
2.1. Probability example using smarties
Smartie colours. This particular box of smarties consists of the following:
| Smartie colour | Number of smarties |
|---|---|
| PINK | 14 |
| RED | 6 |
| YELLOW | 5 |
| BLUE | 6 |
| BROWN | 4 |
| ORANGE | 2 |
| GREEN | 4 |
| TOTAL | 41 |

Hint: Use the probability formula to answer these questions.
Exercise 2.1.1
Exercise 2.1.2
2.2. Probability rules: Part 1
Continue with the smartie example for the following exercises under 2.2-2.4
2.2.1. Mutually exclusive events
Two or more events are mutually exclusive when they cannot occur at the same time.
- Some examples:
- Turning left OR right in a car are mutually exclusive because one cannot turn both left and right at the same time (i.e. simultaneously).
- You cannot have both a king and an ace on one playing card
- You cannot get both heads and tails on a single coin toss
Exercise 2.2.1.1
Exercise 2.2.1.2
2.2.2. Exhaustive events
Events are exhaustive if they encompass the entire range of possible outcomes. The probabilities of a set of mutually exclusive and exhaustive events sum to 1.
- Some examples:
- The values 1, 2, 3, 4, 5, 6 are exhaustive of all possible values that can be obtained on a die (i.e. no other value is possible on a die)
- The 7 continents are exhaustive of all continents on Earth (a list of only, say, 5 or 6 would not be)
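A quick check in R, assuming the smartie counts from the table in section 2.1: the seven colours are mutually exclusive and exhaustive, so their probabilities should sum to 1. The object names `counts` and `probs` are illustrative only.

```r
# Smartie counts taken from the table in section 2.1
counts <- c(PINK = 14, RED = 6, YELLOW = 5, BLUE = 6,
            BROWN = 4, ORANGE = 2, GREEN = 4)

# Probability of each colour = count of that colour / total count
probs <- counts / sum(counts)
round(probs, 3)

# The colours cover every possible outcome, so the probabilities sum to 1
sum(probs)
```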
Exercise 2.2.2.1
Exercise 2.2.2.2
Hint: More than one answer is correct; choose any correct answer.
2.2.3. “NOT” rule
The probability of an event NOT occurring is equal to 1 minus the probability of the event occurring.
“NOT” rule probability formula:
\[ \begin{aligned} P(NOT~Event) ={1-P(Event)} \end{aligned} \]
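The rule can be sanity-checked in R by computing the same probability in two ways. This is only a sketch using the smartie counts from section 2.1 and the PINK colour as an arbitrary example; the object names are illustrative.

```r
counts <- c(PINK = 14, RED = 6, YELLOW = 5, BLUE = 6,
            BROWN = 4, ORANGE = 2, GREEN = 4)
probs <- counts / sum(counts)

# "NOT" rule: 1 - P(PINK)
p_not_pink <- 1 - probs[["PINK"]]

# Same quantity obtained by adding up every colour except PINK
p_others <- sum(probs[names(probs) != "PINK"])

# Both routes should agree (up to floating-point rounding)
all.equal(p_not_pink, p_others)
```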
Note: A reminder of the smartie colours and their associated amounts
| Smartie colour | Number of smarties |
|---|---|
| PINK | 14 |
| RED | 6 |
| YELLOW | 5 |
| BLUE | 6 |
| BROWN | 4 |
| ORANGE | 2 |
| GREEN | 4 |
| TOTAL | 41 |
Exercise 2.2.3.1
What is the probability that a randomly chosen smartie is NOT BLUE? (Use code chunk below)
Hint: 1 - P(BLUE)
1 - 6/41
Exercise 2.2.3.2
2.2.4. “OR” rule
The probability of either one OR another of two mutually exclusive events occurring is equal to the sum of their individual probabilities: probability of event 1 + probability of event 2. This is otherwise referred to as the ADDITIVE rule.
“OR” rule probability formula:
\[ \begin{aligned} {P(EventA~OR~EventB)} ={P(EventA)+P(EventB)} \end{aligned} \]
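As a brief illustration, the additive rule can be compared against counting the smarties of interest directly. The sketch below assumes the smartie counts from section 2.1 and uses RED and GREEN as an arbitrary pair of mutually exclusive colours.

```r
counts <- c(PINK = 14, RED = 6, YELLOW = 5, BLUE = 6,
            BROWN = 4, ORANGE = 2, GREEN = 4)
probs <- counts / sum(counts)

# "OR" rule: P(RED) + P(GREEN)
p_red_or_green <- probs[["RED"]] + probs[["GREEN"]]

# Direct count: all RED and GREEN smarties out of the total
p_direct <- (counts[["RED"]] + counts[["GREEN"]]) / sum(counts)

all.equal(p_red_or_green, p_direct)
```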
Exercise 2.2.4.1
What is the probability that a randomly chosen smartie will be either BLUE or ORANGE? (Use code chunk below)
6/41 + 2/41
Exercise 2.2.4.2
What is the probability that a randomly chosen smartie will be either BLUE or ORANGE or YELLOW? (Use code chunk below)
Hint: P(BLUE) + P(ORANGE) + P(YELLOW)
6/41 + 2/41 + 5/41
Exercise 2.2.4.3
2.3. Probability rules: Part 2
Note: A reminder of the smartie colours and their associated amounts
| Smartie colour | Number of smarties |
|---|---|
| PINK | 14 |
| RED | 6 |
| YELLOW | 5 |
| BLUE | 6 |
| BROWN | 4 |
| ORANGE | 2 |
| GREEN | 4 |
| TOTAL | 41 |
2.3.1. Independent events
We call events independent when they have no effect on each other. In other words, events are independent when the probability of one event occurring does not influence the probability of another event occurring.
- Some examples:
- When throwing a die, if you obtained a 5 on the previous turn, this will not affect what number you roll on the next turn/roll
- Today’s weather in Cape Town vs weather in Cape Town in exactly a year’s time (we assume that today’s weather will have no bearing on the weather experienced in another year’s time)
Continuing with the smartie example…
2.3.2. Sampling with replacement
Successive draws are independent events when there is sampling with replacement.
Sampling with replacement means that after a sampling unit is drawn (e.g. a smartie from a smartie box), it is returned, so that the event space from which it was drawn is restored to its original state (e.g. the total number of smarties in the smartie box does not change as a result of drawing a smartie).
Let’s illustrate with the smartie example
Exercise 2.3.2.1
Suppose a BLUE smartie was picked at random from the smartie box on a previous turn and was NOT replaced. What is the probability of picking an ORANGE smartie on the current turn? (Use code chunk below)
Hint: Remember that there is now one fewer smartie in the box in total (i.e. sampling without replacement)
2/40
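The contrast between the two sampling schemes can also be sketched in R. The example below assumes the smartie counts from section 2.1 and, purely for illustration, supposes that a PINK smartie was drawn first; the object names are illustrative.

```r
counts <- c(PINK = 14, RED = 6, YELLOW = 5, BLUE = 6,
            BROWN = 4, ORANGE = 2, GREEN = 4)

# With replacement: the box is unchanged, so P(RED) is still 6/41
p_red_with <- counts[["RED"]] / sum(counts)

# Without replacement: one PINK smartie has left the box
after_draw <- counts
after_draw[["PINK"]] <- after_draw[["PINK"]] - 1

# P(RED) is now 6/40, so the first draw has changed the probability
p_red_without <- after_draw[["RED"]] / sum(after_draw)

c(with_replacement = p_red_with, without_replacement = p_red_without)
```

Because the probability of RED depends on what was drawn before, draws without replacement are not independent.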
Exercise 2.3.2.2
2.3.3. “AND” rule
The probability of BOTH of two independent events occurring is equal to the product of their individual probabilities: probability of event 1 X probability of event 2. This is otherwise referred to as the MULTIPLICATIVE rule.
“AND” rule probability formula:
\[ \begin{aligned} {P(EventA~AND~EventB)} ={P(EventA) \times P(EventB)} \end{aligned} \]
This formula can also be extended to include more than 2 events (i.e. calculating the probability of more than two events all occurring).
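One way to build intuition for the multiplicative rule is to compare it with a simulation. The sketch below assumes the smartie counts from section 2.1 and uses `sample()` and `set.seed()`, which are introduced in section 3.1; the seed, the number of trials and the PINK-then-RED example are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so that the simulation is reproducible

# One entry per smartie in the box (41 in total)
box <- rep(c("PINK", "RED", "YELLOW", "BLUE", "BROWN", "ORANGE", "GREEN"),
           times = c(14, 6, 5, 6, 4, 2, 4))

# "AND" rule: P(PINK first) x P(RED second), sampling with replacement
p_rule <- (14 / 41) * (6 / 41)

# Simulate many pairs of independent draws from the full box
n_trials <- 100000
first  <- sample(box, size = n_trials, replace = TRUE)
second <- sample(box, size = n_trials, replace = TRUE)

# Proportion of pairs that were PINK followed by RED
p_sim <- mean(first == "PINK" & second == "RED")

c(rule = p_rule, simulated = p_sim)
```

The simulated proportion should land close to the value given by the rule, with the small remaining gap due to random sampling.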
Exercise 2.3.3.1
Assuming sampling with replacement, what is the probability that you randomly choose a BLUE smartie followed by an ORANGE smartie? (Use code chunk below)
Hint: P(BLUE) X P(ORANGE)
6/41*2/41
Exercise 2.3.3.2
Assuming sampling with replacement, what is the probability that you randomly choose a BLUE, then a GREEN, then an ORANGE smartie? (Use code chunk below)
Hint: P(BLUE) X P(GREEN) X P(ORANGE)
6/41*4/41*2/41
Exercise 2.3.3.3
Now assume sampling without replacement. What is the probability that you randomly choose a BLUE, followed by a BLUE, followed by another BLUE smartie? (Use code chunk below)
Hint: Think about how the numerator and denominator will change as you remove a BLUE smartie each time
6/41*5/40*4/39
2.4. Probability distributions
So far we have been looking at the probability associated with specific events, one at a time but…
Probability distributions allow us to display the probability of all possible mutually exclusive events alongside each other in one graph
barplot(smarties, names.arg = colour, xlab = "Smartie colour", ylab = "Empirical proportion", main = "Distribution of different coloured smarties", ylim = c(0, 0.4))
The bar graph above visually depicts the proportion of the total number of smarties represented by each smartie colour in the smartie box. It is useful because one can easily see which colours are more abundant (in this case, the largest proportion of smarties are PINK), and which are the least abundant (in this case, the smallest proportion of smarties are ORANGE).
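The objects `smarties` and `colour` used in the plotting command are not defined in the text shown here. A minimal sketch of how they could be constructed from the smartie table is given below, assuming that `smarties` holds the proportion of each colour and `colour` holds the colour labels; the intermediate object `counts` is illustrative.

```r
# Assumed definitions: colour labels and counts from the smartie table
colour <- c("PINK", "RED", "YELLOW", "BLUE", "BROWN", "ORANGE", "GREEN")
counts <- c(14, 6, 5, 6, 4, 2, 4)

# Convert counts to proportions of the total (41 smarties)
smarties <- counts / sum(counts)

# Bar graph of the empirical proportions, one bar per colour
barplot(smarties, names.arg = colour,
        xlab = "Smartie colour", ylab = "Empirical proportion",
        main = "Distribution of different coloured smarties",
        ylim = c(0, 0.4))
```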
2.4.1. Cumulative probabilities
We can add up the probabilities of mutually exclusive events, as we have done previously with the additive rule (see the “OR” rule, exercise 2.2.4.1)
Child IQ scores on WISC
Let’s take a look at the wiscsem.csv dataset that was first introduced in tutorial 1, where 174 children were assessed on various measures of IQ using the Wechsler Intelligence Scale for Children (WISC).
Below is a frequency distribution (i.e. empirical) of comprehension scores on the WISC (which can serve as an approximation of a theoretical probability distribution when the sample size is large), with its associated frequency table directly below

Frequency table of comprehension scores
| Scores between 0-4.9 | Scores between 5-9.9 | Scores between 10-14.9 | Scores between 15-20 |
|---|---|---|---|
| 5 | 100 | 63 | 6 |
Reminder: A frequency table displays the number of times each score (or a range of scores) occurs (e.g. there were 5 comprehension scores that ranged between values 0 and 4.9)
Exercise 2.4.1.1
What is the proportion of children scoring less than 5 on the comprehension measure? Use the plot and frequency table, and the code chunk below
Hint: Look at the frequency of scores less than 5, and the total number of scores (i.e. how many scores are there in total?)
5/174
Exercise 2.4.1.2
What proportion of children score between 10 and 20 on the comprehension test? Use the code chunk below
Hint: Tally the frequency of scores between 10 and 20
69/174
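Cumulative proportions for the whole frequency table can be obtained in one step with `cumsum()`. The sketch below types in the frequencies from the table above by hand; the object names `freq`, `prop` and `cum_prop` are illustrative.

```r
# Frequencies from the comprehension score table
freq <- c("0-4.9" = 5, "5-9.9" = 100, "10-14.9" = 63, "15-20" = 6)

# Proportion of children in each score range
prop <- freq / sum(freq)

# Running total of the proportions, from the lowest range upwards
cum_prop <- cumsum(prop)
round(cum_prop, 3)

# Proportion of children scoring below 10 (first two ranges combined)
cum_prop[["5-9.9"]]
```

The final cumulative proportion is 1 because the four score ranges are exhaustive.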
2.4.2. Theoretical distributions
Theoretical probability distributions refer to what we expect to see, as opposed to empirical distributions, which refer to what we actually see in data we have collected. We have only dealt with empirical distributions up until this point.
Unlike empirical distributions, theoretical probability distributions aim to approximate some real-world phenomenon in a population, rather than just in a specific sample.
A theoretical distribution may be approximated by an empirical distribution when the sample size is large, and the sample is representative.
Heights of South African males
Simulated height data of South African males are stored in the objects named heights.smalln and heights.largen. The data in each vector object have approximately the same mean (approx. 168) and standard deviation (approx. 35), except that there are 100 observations stored in heights.smalln, and 10000 observations stored in heights.largen. Height is measured in centimeters.
Note: The objects storing height measurements are referred to as vectors as they only have 1 dimension (i.e. only contain height data). In contrast, dataframes have 2 dimensions (i.e. row and column data)
Exercise 2.4.2.1
Plot the distribution of “heights.smalln” on a histogram in the chunk below
Hint: Use hist()
hist(heights.smalln)
Exercise 2.4.2.2
Now plot the distribution of “heights.largen” on a histogram in the chunk below
hist(heights.largen)
You should notice that there is a difference between the distributions of 100 vs 10 000 observations.
The distribution with the larger sample size shows more clearly that data points cluster at the center, and it has a more evenly symmetrical shape around the center.
The histogram with the larger sample size is a better approximation of the theoretical distribution, which in this case appears to be the normal distribution.
Note: Theoretical probability distributions of continuous variables are often described by probability density functions
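How heights.smalln and heights.largen were generated is not shown in this tutorial. As a sketch only, comparable data could be simulated from a normal distribution with `rnorm()`, assuming a mean of 168 and a standard deviation of 35; the object names `sim.smalln` and `sim.largen` and the seed are illustrative.

```r
set.seed(123)  # arbitrary seed so that the simulated values are reproducible

# Assumption: heights follow a normal distribution with mean 168 and SD 35
sim.smalln <- rnorm(n = 100, mean = 168, sd = 35)
sim.largen <- rnorm(n = 10000, mean = 168, sd = 35)

# Plot the two histograms side by side
par(mfrow = c(1, 2))
hist(sim.smalln, main = "n = 100", xlab = "Height (cm)")
hist(sim.largen, main = "n = 10000", xlab = "Height (cm)")
par(mfrow = c(1, 1))

# The large sample should closely recover the theoretical mean and SD
c(mean = mean(sim.largen), sd = sd(sim.largen))
```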
2.4.3. Area under the curve
Probabilities can be obtained from theoretical distributions by finding an appropriate area under the curve
Integral calculus methods are used to calculate areas under the curve, but R has built-in functions to do this for us, so we don’t have to use calculus.
See the various normal distributions below that each display a distribution of Z scores (i.e., with a mean of 0, and an SD of 1), each with a shaded area under the curve.
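For the standard normal distribution, one such built-in function is `pnorm()`, which returns the area under the curve to the left of a given Z score. The sketch below uses Z scores of 1 and -1 as arbitrary examples and compares `pnorm()` with numerical integration of the density via `integrate()`.

```r
# Area to the left of z = 1, i.e. P(Z < 1)
p_below <- pnorm(1)

# Area to the right of z = 1, using the "NOT" rule
p_above <- 1 - pnorm(1)

# Area between z = -1 and z = 1
p_between <- pnorm(1) - pnorm(-1)

# The same area obtained by numerically integrating the normal density
p_via_integral <- integrate(dnorm, lower = -1, upper = 1)$value

round(c(below = p_below, above = p_above,
        between = p_between, integral = p_via_integral), 4)
```

Roughly 68% of the area lies within one standard deviation of the mean, whichever of the two methods is used.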

Exercise 2.4.3.1
3. Advanced section
3.1. Simulation
Briefly put, simulation is a method used to artificially generate data. Simulated data differs from real data in that the former is fictitious, whilst the latter was actually collected in the real world
Simulation has several advantages:
- Simulation is not limited by sample size (we could simulate extremely large numbers of data points with relative ease).
- Simulation allows us to better understand the true nature of theoretical distributions underlying variables (Remember that larger samples get us closer to the theoretical distribution)
- Simulation allows for pure random sampling, which is typically not possible in real life (there is typically selection bias present in real-world samples).
Exercise 3.1.1
Run the set.seed command below, which ensures that results from random simulations can be reproduced (by setting the same seed)
Hint: The set.seed() command ensures that the same simulated sample will be drawn if the analysis is re-run (in subsequent code). For example, if the heights 150, 172, and 183 were drawn previously, these same values will be drawn the next time the analysis is run
set.seed(4)
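The effect of the seed can be checked with a small sketch. It samples from the numbers 1 to 100 rather than from the height data, and the object names are illustrative.

```r
# Draw 3 values after setting the seed
set.seed(4)
first_run <- sample(x = 1:100, size = 3)

# Reset the same seed and draw again
set.seed(4)
second_run <- sample(x = 1:100, size = 3)

# The two draws are identical because the seed was the same
identical(first_run, second_run)
```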
Exercise 3.1.2
Run the code below that randomly draws 10 male heights from the heights.smalln vector
Note 1: The argument within the set.seed command can be set to any number, as long as that number is consistently used
Note 2: Notice that the x argument refers to the vector to sample from, size refers to the number of data points drawn, and replace refers to whether there is sampling with or without replacement
set.seed(10)
sample(x=heights.smalln, size = 10, replace = TRUE)
You should see 10 artificially sampled heights in centimeters
Exercise 3.1.3
Now randomly sample 20 male heights from the heights.smalln vector without replacement
Hint: You will need to change the size and replace arguments
set.seed(20)
sample(x=..., size = ..., replace = ...)
set.seed(20)
sample(x=heights.smalln, size = 20, replace = FALSE)
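The practical difference made by the replace argument is easiest to see on a very small vector. The sketch below uses the illustrative vector `values` (the numbers 1 to 5) instead of the height data.

```r
set.seed(20)
values <- 1:5  # small illustrative vector

# With replacement: values can repeat, so size may exceed length(values)
with_repl <- sample(x = values, size = 10, replace = TRUE)
with_repl

# Without replacement: every value can be drawn at most once
without_repl <- sample(x = values, size = 5, replace = FALSE)
anyDuplicated(without_repl) == 0

# Asking for more values than exist fails without replacement
too_many <- tryCatch(
  sample(x = values, size = 10, replace = FALSE),
  error = function(e) "error"
)
too_many
```

This is why sampling 20 heights without replacement works for heights.smalln (100 observations) but would fail if more than 100 were requested.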
Exercise 3.1.4
Change the seed value from that in exercise 3.1.3, and complete the code to plot the simulated data points in a histogram
Hint: Use the hist() command to plot the simulated data points
set.seed(...)
sim.dat=sample(x=heights.smalln, size = 20, replace = FALSE)
...(...)
set.seed(32) #Seed can be any value except 20
sim.dat =sample(x=heights.smalln, size = 20, replace = FALSE)
hist(sim.dat)
Free-form exercise
Move on to and complete Statistics Tutorial Assignment 2 on Amathuba (Activities | Assignments)
Other resources & references
WISC-R subscale data from Tabachnick, B. G., & Fidell, L. S. (1996). Using Multivariate Statistics (3rd ed.). New York: HarperCollins.
Developed by: Marilyn Lake & Colin Tredoux, University of Cape Town