
1. Research questions and hypotheses

Read Tutorial 7 of Tredoux, C.G. & Durrheim, K.L. (2018). Numbers, hypotheses and conclusions. 3rd edition. Cape Town: Juta Publishers (but the 2nd edition could also be used)

There are two major branches of statistics in Psychology

  1. Descriptive statistics, which concerns the description and summary of a sample
  2. Inferential statistics, which concerns inferring or generalizing characteristics about a population

For example, in a test of an anti-epileptic drug on epilepsy we found a reduction in the number of epilepsy episodes. We would want to test whether this result came about by chance, or whether it holds in general. In other words, is this result merely true of this particular sample, or does it also extend to the larger population?

Reminder: There is always natural fluctuation in results depending on the sample drawn, which is referred to as random sampling variation.
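
Random sampling variation can be seen in a few lines of R. The sketch below is only an illustration: the population (normally distributed, mean 50, standard deviation 10), the sample size of 20 and the object name `sample_means` are all assumptions, not values from this tutorial.

```r
# Hypothetical population: normally distributed scores with
# mean 50 and standard deviation 10 (assumed for illustration only).
set.seed(123)  # makes the random draws repeatable

# Draw five independent random samples of 20 scores and record each sample mean
sample_means <- replicate(5, mean(rnorm(n = 20, mean = 50, sd = 10)))

# The five means differ from one another and from 50, purely through chance
round(sample_means, 2)
```

Even though every sample comes from the same population, no two sample means are identical.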

In inferential statistics (unlike descriptive statistics), we use hypothesis testing to test whether some sample finding extends to the population. How does this procedure work?

  1. Set out an a priori claim (i.e. a claim made before the data are examined) that the observed finding has arisen solely due to random sampling variation (i.e. the finding is a feature of the sample itself and not true of the population)

    In the case of the anti-epileptic drug, the claim would be that the reduction of epilepsy episodes occurred by chance (is true only of the sample being tested, but not the entire population of epilepsy patients)

  2. The claim is empirically tested (i.e. tested using a sample)

  3. Evidence from testing is used to either reject or fail to reject the claim

In summary, hypothesis testing is a logical and empirical procedure whereby hypotheses are formally set up (in the form of claims) and empirically tested.

1.2. Research questions

The first step of hypothesis testing is to formulate a research question.

A Research question is a question that the researcher wants to answer by doing the research.

In our anti-epileptic drug example we would ask, does our anti-epileptic drug reduce the number of epilepsy episodes?

Exercise 1.2.1.1

Hint: More than one answer is correct, choose any correct answer.

Exercise 1.2.1.2

Hint: More than one answer is correct, choose any correct answer.

1.3. Null and alternate hypotheses

The second step of hypothesis testing is to formulate one or more hypotheses.

A Hypothesis is a tentative statement or prediction about the relationship or difference between two variables, which is derived from a research question.

For hypothesis testing, a set of two hypotheses is formulated, namely the null and alternate hypotheses.

  1. The Null hypothesis, denoted by (\(H_{0}\)), states that there is no effect, or relationship, between variables
  • In the anti-epileptic drug case, the null hypothesis would state that there is no effect of the anti-epileptic drug on the number of epilepsy episodes
  2. The Alternate hypothesis, denoted by (\(H_{1}\)), states that there is an effect, or relationship, between variables
  • In the anti-epileptic drug case, the alternate hypothesis would state that there is an effect of the anti-epileptic drug on the number of epilepsy episodes. More specifically, that the anti-epileptic drug reduces the number of episodes

Exercise 1.3.1.1

Hint: More than one answer is correct, choose any correct answer.

Exercise 1.3.1.2

Hint: More than one answer is correct, choose any correct answer.

1.4. Directional and non-directional hypotheses

Alternate hypotheses can be either directional or non-directional:

  1. A directional alternate hypothesis states the direction of the relationship that is being tested
  • E.g. The anti-epileptic drug will decrease the number of epilepsy episodes OR
  • E.g. An anti-bullying campaign will increase daily satisfaction of students
  2. A non-directional alternate hypothesis does not state the direction of the relationship being tested
  • E.g. The anti-epileptic drug will have an effect on the number of epilepsy episodes

Unlike the directional alternate hypothesis, which states only one possible alternate outcome, the non-directional alternate hypothesis allows for more than one possible alternate outcome. For example, assuming a non-directional hypothesis of the anti-epileptic drug on epilepsy episodes, the anti-epileptic drug could either increase or decrease the number of epilepsy episodes

Because the non-directional alternate hypothesis allows for more than one possible alternate outcome, it is considered more conservative than a directional alternate hypothesis.

Exercise 1.4.1.1

Hint: More than one answer is correct, choose any correct answer.

Exercise 1.4.1.2

Hint: More than one answer is correct, choose any correct answer.

2. Research questions and hypotheses: A symbolic form

Thus far you have been introduced to the null and alternate hypotheses written out in words. However, they can also be represented in symbolic form.

For example (with directional hypothesis):

  1. Research question: Does our anti-epileptic drug reduce the number of epilepsy episodes?
  • Null & directional alternate hypotheses:
  • \(H_{0}\): \(\mu_{1} = \mu_{2}\)
  • \(H_{1}\): \(\mu_{1} < \mu_{2}\)

In the example above, hypotheses are expressed with the use of the Greek letter \(\mu\), which refers to the population mean. This differs from the sample mean which might be represented by \({\bar{X}}\).

The Null hypothesis (\(H_{0}\)): \(\mu_{1} = \mu_{2}\), states that the average number of epilepsy episodes following use of the anti-epileptic drug (\(\mu_{1}\)) is equal to the average number of epilepsy episodes amongst the epilepsy patient population (\(\mu_{2}\)). In other words, the anti-epileptic drug does not change the number of epilepsy episodes in the population.

The Alternate hypothesis (\(H_{1}\)): \(\mu_{1} < \mu_{2}\), states that the average number of epilepsy episodes following use of the anti-epileptic drug (\(\mu_{1}\)) is less than the average number of epilepsy episodes amongst the epilepsy patient population (\(\mu_{2}\)). In other words, the anti-epileptic drug decreases the number of epilepsy episodes in the population.

Another example (with non-directional hypothesis):

  1. Research question: Does age-related cognitive decline differ between patients with neuroinflammation vs healthy controls?
  • Null & non-directional alternate hypotheses:
  • \(H_{0}\): \(\mu_{1} = \mu_{2}\)
  • \(H_{1}\): \(\mu_{1} \neq \mu_{2}\)

In example 2, the null hypothesis takes the same form as in example 1: average age-related cognitive decline for patients with neuroinflammation is equal to that of healthy controls.

The non-directional alternate hypothesis \(H_{1}\): \(\mu_{1} \neq \mu_{2}\), states that the average age-related cognitive decline for patients with neuroinflammation is not equal to that of healthy controls.

Whilst the null and alternate hypotheses are stated in terms of population parameters, they are empirically tested using a sample in order to make inferences about the population (where the null hypothesis is either rejected or not rejected).

3. Probability calculations

3.1 Hypothesis testing with probability

Statistical decisions are made on the basis of probability, and thus there is always a degree of uncertainty tied to any decision that is made. This uncertainty emerges as a result of random sampling variation, otherwise referred to as sampling error, which is the natural deviation of sample estimates from population parameters. For more on sampling error and the Sampling Distribution of the mean, see Guided Tutorial 4.

Let’s take a look at an example. We are interested in testing average post-natal depression:

Pre and Post-natal depression
Let’s say that we know the mean pre-natal depression score to be 34. We draw a random sample of 10 post-natal depression patients, and obtain a sample mean (\(\bar{X}\)) depression score of 40, and standard deviation of 10.

We might be tempted to say that post-natal depression (40) on average is higher than pre-natal depression (34), except that this would not be appropriate given the possible impact of sampling error on the sample mean of the random sample. In other words, if we were to draw a second random sample of post-natally depressed patients, it might be possible that we could obtain a sample mean close to 34, and conclude that there is no difference, which would be in contradiction to the finding from our previously drawn sample.

Hypothesis testing overcomes this issue by assessing how probable or improbable a sample mean is.

Because of sampling error, it is unlikely that the mean of any one random sample will exactly equal the population mean. To address this in our example, mean depression scores from many random samples are collected and presented in a Sampling Distribution under the null hypothesis (empirically derived through simulation), which is centred on a mean depression score of approximately 34. See the figure below.

The sample mean for post-natal depression is indicated by the black vertical line, at 40.

Our major goal is to determine whether the random sample mean differs significantly from the mean of the null hypothesis. In other words, does the sample mean post-natal depression score differ from that of the mean pre-natal depression score?

Based on the simulated Sampling Distribution above, sample means close to the null hypothesis mean of 34 are the most probable (see the large area under the curve around the peak), which makes sense because the distribution was built on the assumption that the null hypothesis is true. On the other hand, obtaining a sample mean of 40 or above if the null hypothesis is true appears to be less probable (see the smaller area under the curve for values of 40 or greater). If a sample mean is highly improbable, as computed assuming the null hypothesis is true, we take this to suggest that the null hypothesis is unlikely to be true (the null is rejected), and the alternate hypothesis is adopted.

In this case, by rejecting the null hypothesis we say that we reject the claim that mean pre-natal and post-natal depression scores are equal, and favour the alternate hypothesis, that mean post-natal depression is higher than mean pre-natal depression.
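
A sampling distribution like the one described above can be approximated by simulation. The sketch below is not the code used to build the tutorial's own distribution: it assumes normally distributed scores, a null-hypothesis mean of 34, a standard deviation of 10 and samples of size 10, and the names `sim_means` and `p_sim` are made up for illustration. Its result will therefore not necessarily match the values obtained from the tutorial's stored object.

```r
# Assumptions (not from the tutorial's own simulation): scores are normally
# distributed, the null-hypothesis mean is 34, the sd is 10, and n = 10.
set.seed(2015)

n_samples <- 10000  # number of simulated random samples
sim_means <- replicate(n_samples, mean(rnorm(n = 10, mean = 34, sd = 10)))

# Centre of the simulated sampling distribution (should be close to 34)
mean(sim_means)

# Proportion of simulated sample means that are 40 or greater
p_sim <- sum(sim_means >= 40) / length(sim_means)
p_sim
```

The proportion in the last line estimates how probable a sample mean of 40 or more is when the null hypothesis is true, under the stated assumptions.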

Exercise 3.1.1.1

Complete the code chunk below to calculate the probability of obtaining a post-natal depression score of 40 or more, on the basis of the simulated sampling distribution means, stored in the object “PNDepression”.

Hint: In the chunk below, sum(PNDepression >= X) counts the number of scores with the value ‘X’ or greater (you need to substitute a value for X), whilst the length() command counts the total number of scores in a vector.

sum(PNDepression>=...)/length(PNDepression)
sum(PNDepression>=40)/length(PNDepression)

Exercise 3.1.1.2

3.2 Hypothesis testing with probability cut-offs

The probability of obtaining a mean depression score of 40 or above was 0.13, which suggests that such scores are less probable than scores near 34 if the null hypothesis is true.

But how does one judge when a mean event or score is considered highly improbable?

Very Important

  1. We typically consider a score highly improbable, assuming the null hypothesis is true, if the probability of obtaining that score or higher (or lower) is less than 0.05

  2. If the probability of obtaining a score is < 0.05, assuming the null hypothesis is true, then the null hypothesis is rejected, and the alternate hypothesis favoured. This is because such a score would rarely occur if the null hypothesis were true.

The probability cut-off is referred to as alpha (\(\alpha\)) or the significance level

Conventionally, the significance level is set to 0.05
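
The decision rule can be written as a small R function. This is a minimal sketch: the function name `decide` is hypothetical and is not part of the tutorial or of any package.

```r
# Hypothetical helper: compares a probability (p-value) with the significance level
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) {
    "Reject the null hypothesis"
  } else {
    "Fail to reject the null hypothesis"
  }
}

decide(0.13)  # the probability reported above for a mean of 40 or more
decide(0.01)  # an example of a probability below the cut-off
```

Changing the `alpha` argument shows how the same probability can lead to a different decision under a stricter or more lenient cut-off.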

Exercise 3.2.1.1

Exercise 3.2.1.2

Exercise 3.2.1.3

3.3 Broad summary of hypothesis testing steps

Here is a quick reminder of the steps involved in hypothesis testing:

  1. Come up with a research question, and formulate testable null and alternate hypotheses
  2. Collect data and test the claim that the null hypothesis is true
  3. Apply a probability model on the data collected (e.g. normal distribution). If the data is unlikely to have occurred, assuming the null hypothesis is true, the null hypothesis is rejected for the alternate hypothesis

Obesity and blood pressure
Researchers suspect that obesity is associated with higher diastolic blood pressure. They wish to investigate this potential relationship. The mean diastolic blood pressure for a healthy adult population is 80, and standard deviation is 10. The mean diastolic blood pressure for a randomly drawn sample of obese individuals is 100. There is evidence that diastolic pressure is approximately normally distributed in healthy populations.

Exercise 3.3.1.1

Exercise 3.3.1.2

Exercise 3.3.1.3

Exercise 3.3.1.4

4. Z-test

4.1. Hypothesis testing and Standard Normal Distribution (Z-test)

Hypothesis testing can be carried out using several types of distributions. When it is based on the Standard Normal Distribution, it is referred to as the Z-test.

The Z-test is used to test how likely it is that a sample mean has come from a normally distributed population defined by the null hypothesis. A Z-test can be used when the population standard deviation is known.

  1. Null hypothesis (\(H_{0}\)): The mean of the population from which the sample was drawn is equal to the comparison population mean, or \(\mu_{1} = \mu_{2}\)
  2. Alternate hypothesis (\(H_{1}\)): The mean of the population from which the sample was drawn is not equal to the comparison population mean, or \(\mu_{1} \neq \mu_{2}\) (a non-directional hypothesis, but it can be reformulated as a directional hypothesis depending on the research assumptions)

Alzheimer’s and depression
Researchers want to test the relationship between Alzheimer’s disease and depression. They predict that depression is higher amongst those suffering from Alzheimer’s disease in comparison to the normal adult population \(\mu_{Normal} < \mu_{Alzheimer's}\).

What possible conclusions can be drawn?

  1. Do not reject the null hypothesis when…
    • the probability of drawing the observed mean from the distribution specified in the null hypothesis is not very low (e.g., it is higher than .05)

For example, assume that we obtained a sample mean z-score of 0.3 and tested it against the Z distribution of a normal adult population (assuming the null hypothesis is true, i.e. with a mean = 0). See the depiction of the Sampling Distribution of the mean below

In this scenario, it seems likely that the sample mean could be drawn from the population distribution, as a large proportion of values can be equal to or greater than 0.3, assuming the null hypothesis is true. This suggests that one should not reject the null hypothesis.

  2. Reject the null hypothesis for the alternate hypothesis when…
    • the probability of obtaining a mean, given the Sampling Distribution of the mean implied by the null hypothesis, is very small (e.g. p < .05)

For example, assume we obtained a sample mean z-score of 2 and tested it against the distribution of a normal adult population (assuming the null hypothesis is true, i.e. the mean = 0). In the Sampling Distributions presented below:

  • Purple curve = distribution assuming the null hypothesis is true
  • Pink curve = distribution assuming the alternate hypothesis is true

Relative to the previous example with sample mean = 0.3, the proportion of samples with a mean z-score greater than 2 is substantially lower (i.e. the area under the curve is smaller). Moreover, the sample mean of 2 appears to be closer to the population mean of the alternate hypothesis (i.e. pink distribution), and further away from the population mean assumed under the null hypothesis (i.e. purple curve). In turn, this suggests that the null hypothesis should be rejected in favour of the alternate hypothesis.
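
The two areas under the curve described in these examples can be computed with pnorm(). The sketch below assumes an upper-tailed test against the Standard Normal Distribution; the object names are made up for illustration.

```r
# Upper-tail probability for each sample mean z-score under the null hypothesis
p_small_z <- pnorm(0.3, mean = 0, sd = 1, lower.tail = FALSE)
p_large_z <- pnorm(2,   mean = 0, sd = 1, lower.tail = FALSE)

round(p_small_z, 3)  # about 0.382: well above 0.05, so do not reject
round(p_large_z, 3)  # about 0.023: below 0.05, so reject
```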

Another way to think about what is being tested:

  1. Null hypothesis is not rejected when:
    • Sample z-scores are thought to be drawn from a single distribution
  2. Null hypothesis is rejected in favour of alternate hypothesis when:
    • Sample z-scores are thought to be drawn from different distributions

4.2. Rejection region

As any one sample mean is likely not to be exactly equal to the population mean, due to sampling error, one needs to distinguish between:

1. Difference in sample and population means as a result of random sampling variation (sampling error), where the sample comes from the same distribution as the population (i.e. a scenario where the null hypothesis is true)
2. Difference in sample and population means as a result of systematic differences, where the sample comes from a different distribution to that of the population (i.e. a scenario where the null hypothesis is false and should be rejected for the alternate)

How do we determine when the sample comes from a different distribution to that of the population?

As previously mentioned in section 3.2, in order to identify cases where scenario 2 is more likely than scenario 1, a probability cut-off score is used. This is referred to as the significance level or the alpha (\(\alpha\)), and is conventionally set to 0.05

The significance level (\(\alpha\)), conventionally 0.05, is the probability with which we are willing to reject the null hypothesis when it is in fact correct.

Remember that there is always a degree of uncertainty in the statistical decisions we make, and we could be rejecting the null hypothesis when we shouldn’t (i.e. there is a possibility that our sample mean differs from the population as a result of sampling error alone, even though the probability of doing so is low). All we can do is aim to minimise the likelihood of this occurring.

Important: If a sample mean falls within the rejection region, which includes the most extreme sample means expected if H0 is true, we reject the null hypothesis nevertheless, and opt for the alternate hypothesis instead. See visual depiction below.

Exercise 4.2.1.1

Note: More than one answer is correct, choose any correct answer.

Exercise 4.2.1.2

Exercise 4.2.1.3

4.3. Rejection region and one-tailed vs two-tailed tests

In the earlier example of Alzheimer’s disease and depression, the alternate hypothesis was directional, and stated that Alzheimer’s patients would have greater depression than a normal adult population. The sample z-score for Alzheimer’s patients fell in the upper tail of the distribution (assuming that higher scores reflect greater depression).

In contrast, the alternate hypothesis could have been that Alzheimer’s patients exhibit less depression than a normal adult population. If this were the case, then the rejection region would fall in the lower tail of the distribution.

A one-tailed test is used to reach a decision concerning a directional hypothesis.

A third possibility is that the alternate hypothesis could have been phrased as a non-directional hypothesis, where Alzheimer’s and depression are related, but we do not specify how. In this example, the rejection region would be split in half between the upper AND lower tails, as there are two possible outcomes - Alzheimer’s patients could be more OR less depressed than the normal adult population. Unlike the directional hypothesis cases, the area under the curve defined by the significance level is divided by two, with half placed in each tail. See the graphical depiction below.

A two-tailed test is used to reach a decision concerning a non-directional hypothesis

  1. Directional hypothesis = one-tailed test (e.g. \(\alpha\) = 0.05 in one tail of the distribution)
  2. Non-directional hypothesis = two-tailed test (e.g. \(\alpha\) = 0.05 split as 0.025 in each tail of the distribution)
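
The effect of splitting the significance level can be seen by converting it to critical values with qnorm(). This is a short sketch for \(\alpha\) = 0.05; the object names are chosen for illustration only.

```r
alpha <- 0.05

# One-tailed tests: all of alpha sits in a single tail
z_crit_upper <- qnorm(1 - alpha)  # about  1.645
z_crit_lower <- qnorm(alpha)      # about -1.645

# Two-tailed test: alpha is split equally between the two tails
z_crit_two <- qnorm(c(alpha / 2, 1 - alpha / 2))  # about -1.96 and 1.96
```

Because only half of \(\alpha\) lies in each tail, the two-tailed critical values sit further from zero than the one-tailed ones.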

Exercise 4.3.1.1

Exercise 4.3.1.2

4.4. Type 1 and Type 2 decision errors

If our sample mean falls within the rejection region, this may mean that we

1. incorrectly decided to reject the null hypothesis when we shouldn't have,
OR
2. correctly decided to reject the null hypothesis

Type 1 error = Incorrectly rejecting the null hypothesis

Type 1 error in the Alzheimer’s and depression example:
- Erroneously conclude that Alzheimer’s patients are more depressed than the normal adult population, when in fact both groups experience similar levels of depression
- How could this happen? –> We happened to draw a very unlikely sample of Alzheimer’s patients who happen to be unusually highly depressed

Important: Significance level or \(\alpha\) = probability of committing a Type 1 error

Why couldn’t we then just set the significance level (\(\alpha\)) to some very small probability, such as 0.0001, in order to minimise chances of committing Type 1 error?

Lower Type 1 error –> Higher Type 2 error

Type 2 error = Incorrectly failing to reject the null hypothesis when it is false

Type 2 error using the Alzheimer’s and depression example:
- Erroneously conclude that Alzheimer’s patients are equally depressed as the normal adult population, when in fact the Alzheimer’s patient population is more depressed
- How could this happen? –> We happened to draw an unlikely sample of Alzheimer’s patients who were less depressed than the rest of the Alzheimer’s patient population
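
The trade-off between the two error types can be sketched numerically. The example below is an illustration under stated assumptions, not a result from the tutorial: it assumes an upper-tailed Z-test with a null mean of 34, a true population mean of 40, a known standard deviation of 10 and a sample size of 10. The function name `type2_error` is hypothetical.

```r
# Assumptions for illustration only: upper-tailed Z-test, null mean 34,
# a true population mean of 40, sigma = 10 and n = 10.
mu_null <- 34
mu_true <- 40
sigma   <- 10
n       <- 10

# Hypothetical helper: probability of a Type 2 error for a given alpha
type2_error <- function(alpha) {
  z_crit <- qnorm(1 - alpha)                         # start of the rejection region
  shift  <- (mu_true - mu_null) / (sigma / sqrt(n))  # true effect in standard error units
  pnorm(z_crit - shift)                              # P(Z calc falls short of Z crit when H1 is true)
}

type2_error(0.05)    # conventional significance level
type2_error(0.0001)  # a much stricter level gives a larger Type 2 error probability
```

Under these assumptions, lowering \(\alpha\) pushes the rejection region further into the tail, so a real effect is missed more often.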

Exercise 4.4.1.1

Exercise 4.4.1.2

4.5. How to carry out a Z-test

In order to test whether our sample mean falls within the rejection region, we need to do the following:

Step 1. Define the rejection region. We do this by setting the significance level or \(\alpha\), and converting it to a Z critical value (\({Z_{crit}}\)). The Z critical value is the z-score that defines the start of the rejection region.

Exercise 4.5.1.1

Use the qnorm() command to convert a significance level of 0.05 to an associated Z critical value on the lower tail of the Standard Normal Distribution

Reminder: The qnorm() command was introduced in Guided tutorial 3, and can be used to convert a probability into a Z-score (now referred to as a Z critical value) from a Standard Normal Distribution.

qnorm(..., mean = ..., sd =...)
qnorm(0.05, mean = 0, sd =1)

Exercise 4.5.1.2

Use the qnorm() command to convert a significance level of 0.05 to an associated Z critical value on the upper tail of the Standard Normal Distribution
qnorm(..., mean = ..., sd =...)
qnorm(0.95, mean = 0, sd =1)

Step 2. Transform the sample mean into a Z score sometimes referred to as the Z statistic or Z calc (\({Z_{calc}}\)), using the Z-score formula for Sampling Distributions (introduced in Guided Tutorial 4)

\[ Z_{calc} = \frac{\overline{X} - \mu_{\overline{X}}}{\frac{\sigma}{\sqrt{n}}} \]

Pre and Post-natal depression
A quick reminder about the post-natal depression example. We know the pre-natal population mean depression score to be 34. We draw a random sample of 10 post-natal depression patients, and obtain a sample mean (\(\bar{X}\)) depression score of 40, and standard deviation of 10.

Exercise 4.5.1.3

Use the code chunk below to calculate the Z-statistic for depression using the Z score formula above

Reminder: See Guided Tutorial 4 for more examples on how to use the Z-score formula

(40-34)/(10/sqrt(10))

Step 3. Reach a decision by comparing \({Z_{calc}}\) against \({Z_{crit}}\). If \({Z_{calc}}\) falls within the rejection region, the null hypothesis is rejected for the alternate hypothesis.

In the case of pre and post-natal depression, the \({Z_{calc}}\) is 1.89 and the \({Z_{crit}}\) is 1.64. See graphical display of \({Z_{crit}}\) and \({Z_{calc}}\) values below.

Exercise 4.5.1.4

Alternate Step 3: In the previous step we based our decision on how \({Z_{calc}}\) differs from \({Z_{crit}}\). In the case of depression the \({Z_{calc}}\) > \({Z_{crit}}\), which suggests that it lies in the rejection region.

Another alternative is to convert the \({Z_{calc}}\) to a p-value which can be compared to the set \(\alpha\). If the p-value falls within the rejection region, i.e. p-value is smaller than the \(\alpha\), then the null hypothesis is rejected for the alternate hypothesis.

Exercise 4.5.1.5

Use the code chunk below to convert the Z calc into a p-value

Reminder: The pnorm() command can be used to convert a Z-score to a p-value (see Guided Tutorial 3 for examples)

1- pnorm(q=1.89, mean = 0, sd =1)

In the distribution presented below, the yellow area under the curve depicts the derived probability (i.e. p-value), whilst the entire area under the curve to the right of the vertical line depicts the size of the probability set by alpha.

Exercise 4.5.1.6

Important: Notice how comparing the \({Z_{calc}}\) to the \({Z_{crit}}\) is equivalent to comparing the p-value to the \(\alpha\), except that:

  1. Z-scores are represented by x-axis values
  2. p-values are represented by size of the area under the curve

4.6 Hypothesis testing steps

  1. State the research question and hypotheses
  2. Define the significance level (\(\alpha\))
  3. Define the critical value (\({Z_{crit}}\))
  4. Calculate the statistic (e.g. for the sample mean, calculate the \({Z_{calc}}\))
  5. Reach a decision

OR

  1. State the research question and hypotheses
  2. Define the significance level (\(\alpha\))
  3. Calculate the statistic (e.g. for the sample mean, calculate the \({Z_{calc}}\))
  4. Convert the statistic into a p-value
  5. Reach a decision
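
Both sequences of steps can be sketched together in R using the values from the pre- and post-natal depression example. The sketch treats the standard deviation of 10 as the known population value and assumes an upper-tailed test, as in the exercises above; the object names are chosen for illustration.

```r
# Values from the pre- and post-natal depression example
x_bar   <- 40    # sample mean
mu_null <- 34    # population mean under the null hypothesis
sigma   <- 10    # standard deviation, treated as known
n       <- 10    # sample size
alpha   <- 0.05  # significance level (upper-tailed test)

z_crit  <- qnorm(1 - alpha)                       # critical value, about 1.64
z_calc  <- (x_bar - mu_null) / (sigma / sqrt(n))  # test statistic, about 1.9
p_value <- pnorm(z_calc, lower.tail = FALSE)      # upper-tail p-value

# Both decision routes agree
z_calc > z_crit  # TRUE: Z calc lies in the rejection region
p_value < alpha  # TRUE: p-value is smaller than alpha
```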

5. Advanced Section

The z.test() command from the BSDA package can be used to run a z-test, and produces several helpful estimates including the z-score, p-value and confidence interval.

The z.test() command consists of several arguments:

Argument | Description
x | data scores of variable x (in the form of a numeric vector)
y | data scores of variable y (in the form of a numeric vector)
alternative | indicates whether the hypothesis is directional or non-directional (options: “greater”, “less” or “two.sided”)
mu | value of the true population mean specified by the null hypothesis
sigma.x | population standard deviation for variable x
sigma.y | population standard deviation for variable y
conf.level | confidence level for the confidence interval, restricted to lie between zero and one

Exercise 5.1.1.1

Go and inspect the help file for the z.test() command for yourself

Reminder: Add a ? before the z.test() command in order to retrieve the help file

?z.test

At this stage we have only conducted one-sample z-tests, where sample estimates are compared to population parameters. However, two-sample z-tests can also be carried out, where estimates are compared between two samples

The z.test() command provides the option for a one-sample z-test, by providing sample data for the x variable only, or a two-sample z-test, by providing sample data for both the x and y variables.
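
To show what a two-sample z-test computes, here is a minimal base R sketch. It is not the BSDA implementation: the function `two_sample_z`, the scores in `group_a` and `group_b`, and the standard deviations of 6 are all made up for illustration, and both population standard deviations are assumed to be known.

```r
# Hypothetical helper: two-sample z statistic and two-sided p-value,
# assuming both population standard deviations are known.
two_sample_z <- function(x, y, sigma_x, sigma_y, mu_diff = 0) {
  se <- sqrt(sigma_x^2 / length(x) + sigma_y^2 / length(y))  # standard error of the difference
  z  <- (mean(x) - mean(y) - mu_diff) / se                   # z statistic
  p  <- 2 * pnorm(abs(z), lower.tail = FALSE)                # two-sided p-value
  list(z = z, p_value = p)
}

# Made-up scores for two groups (illustration only)
group_a <- c(52, 48, 61, 55, 47, 58)
group_b <- c(44, 50, 39, 46, 42, 49)

result <- two_sample_z(group_a, group_b, sigma_x = 6, sigma_y = 6)
result
```

With the BSDA package installed, a call such as z.test(x = group_a, y = group_b, sigma.x = 6, sigma.y = 6) should carry out the corresponding test.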

Travel time
South Africans travel 60 minutes to work on average (\(\mu\)), with a known standard deviation of 30 minutes (\(\sigma\)). The travel times for a sample of 10 South African commuters are stored in the vector object “SAtravel.time”.

Exercise 5.1.1.2

Use the z.test() command to test whether the sample of 10 commuters could plausibly have been drawn from the population of South African commuters (i.e. whether the sample mean travel time is consistent with the population mean). Assume a two-tailed test.

Hint 1: This would be considered a one-sample z-test as the test is only based on one sample (compared against the population)

Hint 2: the z.test() command calculates the sample mean from the raw scores. In other words, the raw scores need to be given to z.test() command rather than the calculated sample mean, as done in previous examples.

z.test(x=..., alternative = "two.sided", mu = ..., sigma.x = ...)
z.test(x=SAtravel.time, alternative = "two.sided", mu = 60 , sigma.x = 30)
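
The quantities that a one-sample z-test reports can also be computed by hand in base R. The sketch below uses made-up travel times, because the contents of the tutorial's stored travel-time vector are not shown here; the object names `travel`, `z_calc`, `p_value` and `ci` are chosen for illustration.

```r
# Made-up travel times in minutes (not the tutorial's data)
travel <- c(45, 70, 85, 30, 65, 90, 55, 75, 40, 95)

mu    <- 60  # population mean under the null hypothesis
sigma <- 30  # known population standard deviation

# Z statistic computed from the raw scores
z_calc <- (mean(travel) - mu) / (sigma / sqrt(length(travel)))

# Two-tailed p-value: area in both tails beyond |z_calc|
p_value <- 2 * pnorm(abs(z_calc), lower.tail = FALSE)

# 95% confidence interval for the population mean
ci <- mean(travel) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(length(travel))

z_calc; p_value; ci
```

For these made-up scores the p-value is well above 0.05 and the confidence interval contains 60, so the null hypothesis would not be rejected.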

Exercise 5.1.1.3

Hint: Look at the calculated p-value of the z-test conducted above in order to draw a conclusion

Free-form exercise

Move on to and complete Statistics Tutorial Assignment 5 on Amathuba (Activities | Assignments)

Other resources & references

See Tutorials 7 & 8 of Tredoux, C.G. & Durrheim, K.L. (2018). Numbers, hypotheses and conclusions. 3rd edition. Cape Town: Juta Publishers (but the 2nd edition could also be used)

Developed by: Marilyn Lake & Colin Tredoux, UCT

PSY2015F tut 5: Hypothesis testing & Z-tests