
1. Introduction

Read Tutorial 6 of Tredoux, C.G. & Durrheim, K.L. (2018). Numbers, hypotheses and conclusions. 3rd edition. Cape Town: Juta Publishers (but the 2nd edition could also be used)

1.1. What is the Sampling Distribution of the mean?

In Tutorial 3, we considered the Normal Distribution as well as another variant, the Standard Normal Distribution.

The Sampling Distribution of the mean is an additional variant of the Normal Distribution. The Sampling Distribution of the mean looks very much like any other Normal Distribution, with its own mean (\(\mu\)) and standard deviation (\(\sigma\)), except that each value on the x-axis represents a sample mean rather than an individual score.

The Sampling Distribution of the mean is one of the most important concepts you will study in statistics, and is the foundation of inferential statistics.

Reminder: Inferential statistics allow us to use data from a sample to draw conclusions about (i.e. characterise) a population.

Let’s look at an example, which illustrates the difference between the Sampling Distribution of the mean and the Normal Distribution.

HIV-AIDS & CD4 count
The CD4 count (the number of CD4 cells, a type of white blood cell that plays a key role in immune system functioning) was recorded for a sample of HIV-AIDS patients. Table 1 below contains the CD4 counts of 5 individuals. The mean (\(\mu\)) CD4 count in the South African HIV-AIDS population is 110.

Table 1

Participant number CD4 count (cells/mm^3)
1 100
2 120
3 200
4 150
5 80

Exercise 1.1.1.1

Hint: More than one answer is correct, choose any correct answer.

Exercise 1.1.1.2

Calculate the sample mean CD4 count (based on the 5 CD4 scores provided) to verify whether the sample mean is equal to the population mean.

Hint: First use the c() command to combine the values into a vector and store it in a named object.

...=c(...)
...(...)#Calculate the mean
...(...)#Calculate the standard deviation
CD4=c(100,120,200,150,80)
mean(CD4)#Calculate the mean
sd(CD4)#Calculate the standard deviation
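For reference (not part of the exercise), the arithmetic behind this check is: (100 + 120 + 200 + 150 + 80)/5 = 650/5 = 130, which is not equal to the population mean of 110. Assuming the CD4 object created above, you can confirm this directly:
sum(CD4)/length(CD4) # = 130, which differs from the population mean of 110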

HIV-AIDS & CD4 count continued
Now let’s say that we calculated the mean CD4 count for 5 different samples of 5 (i.e. we have 5 groups of 5 participants, and calculated the mean CD4 count for each of these groups). Table 2 below contains the mean CD4 counts of the 5 samples. Reminder: The mean (\(\mu\)) CD4 count in the South African HIV-AIDS population is 110.

Table 2

Sample number Average CD4 count (cells/mm^3)
1 120
2 110
3 140
4 100
5 135

Exercise 1.1.1.3

Calculate the mean CD4 count (i.e. the mean of the 5 sample means provided) to verify whether the mean of the sample means is equal to the population mean.

Hint: This exercise is very similar to exercise 1.1.1.2. You are now just calculating the mean of the means.

...=c(...)
...(...)#Calculate the mean
...(...)#Calculate the standard deviation
CD4.SDM=c(120,110,140,100,135)
mean(CD4.SDM)#Calculate the mean
sd(CD4.SDM)#Calculate the standard deviation
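Again for reference, the mean of the five sample means is (120 + 110 + 140 + 100 + 135)/5 = 605/5 = 121, which is closer to the population mean of 110 than the single sample mean of 130 from exercise 1.1.1.2. Assuming the CD4.SDM object created above:
sum(CD4.SDM)/length(CD4.SDM) # = 121, closer to the population mean of 110 than the single-sample mean of 130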

Exercise 1.1.1.4

Hint: More than one answer is correct, choose any correct answer.

Adult weights
Amongst adults, the population mean weight is \(\mu\) and the population standard deviation is \(\sigma\). See the Normal Distribution of individual weights, and the Sampling Distribution of the mean for sample means of adults’ weights, below.

Now, unlike the Normal Distribution, the Sampling Distribution of the mean produces estimates that better approximate the population parameters (i.e. its mean and standard deviation are closer to the population mean and standard deviation values).

2. Sampling

We often don’t have access to the entire population we want to study. For example, let’s say that we are interested in studying the HIV-AIDS patient population in South Africa. Think about the resources it would take to access and study this entire population: it would be very costly, if not impossible.

If we can’t get access to the entire population, we often study a sample (or samples) drawn from the population, and use the sample to make sense of the wider population. A sample is just a smaller group of individuals selected to represent some larger population. For example, in the HIV-AIDS patient case, 100 South African HIV-AIDS patients included in a study may be intended to represent the population of patients.

2.1 Sample size and Normal Distribution

The extent to which an empirical distribution (i.e. based on a sample) approximates a population distribution (i.e. theoretical) is influenced by the sample size.

Grade 5 reading ability in Western Cape
Reading ability of Grade 5 learners follows a normal distribution, with a population mean (\(\mu\)) of 139 words per minute, and a standard deviation (\(\sigma\)) of 35 words per minute.

Exercise 2.1.1.1

Use the rnorm() command to simulate 10000 reading ability scores from the population distribution and save the scores to a vector called "reading.ability". Also remember to first set the seed to 12, using the set.seed() command.

Reminder: For a reminder on how to use set.seed() and rnorm() commands, see exercise 1.4.2.2 of Guided tutorial 3

set.seed(...)
reading.ability = rnorm(..., ...,...)
set.seed(12)
reading.ability = rnorm(10000, 139, 35)
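As a quick sanity check (not part of the exercise), you can confirm that the simulated population of scores has roughly the stated mean and standard deviation:
mean(reading.ability) # should be close to 139
sd(reading.ability) # should be close to 35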

Exercise 2.1.1.2

Use the sample() command to 1) draw one random sample of 10 reading scores, 2) draw another random sample of 100 reading scores, and 3) draw a third random sample of 1000 reading scores, in each case sampling with replacement.

Reminder: For a reminder on how to use the sample() command, see section 3 of Guided tutorial 2.

small.sample = sample(x=..., size=..., replace=...)
medium.sample = sample(x=..., size=..., replace=...)
large.sample = sample(x=..., size=..., replace=...)
small.sample = sample(x = reading.ability,size = 10,replace = TRUE)
medium.sample = sample(x = reading.ability,size = 100,replace = TRUE)
large.sample = sample(x = reading.ability,size = 1000,replace = TRUE)

Exercise 2.1.1.3

Plot the three distributions generated for exercise 2.1.1.2 alongside each other

Reminder 2: For a reminder of how to plot distributions alongside each other using the par() command, see section 3.4 of Guided tutorial 1

par(mfrow = c(1, 3))
hist(..., xlab="...", main="...")
hist(..., xlab="...", main="...")
hist(..., xlab="...", main="...")
par(mfrow = c(1, 3))
hist(small.sample, xlab= "Reading speed per minute", main="Reading ability in sample of 10")
hist(medium.sample, xlab= "Reading speed per minute", main="Reading ability in sample of 100")
hist(large.sample, xlab= "Reading speed per minute", main="Reading ability in sample of 1000")

Exercise 2.1.1.4

Hint: More than one answer is correct, choose any correct answer.

Hint: As you will have noticed from the section 2.1 exercises, as the number of observations in a random sample increases, the sample mean becomes a better approximation of the population mean. This is referred to as the law of large numbers.
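A minimal way to see the law of large numbers at work, assuming the three sample objects created in exercise 2.1.1.2 are still available, is to compare each sample mean with the population mean of 139:
mean(small.sample) # based on 10 scores; typically furthest from 139
mean(medium.sample) # based on 100 scores
mean(large.sample) # based on 1000 scores; typically closest to 139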

2.2. Sample size and Sampling Distribution of the mean

We have seen that when we sample from the population, how well the sample mean approximates the population mean depends on the sample size. More specifically, the sample mean better approximates the population mean as the sample size increases.

We can extend this principle to the Sampling Distribution of the mean, where we now draw multiple samples of a particular size (instead of one sample), calculate the mean for each of these samples, and plot the means on a distribution of their own, i.e. the Sampling Distribution of the mean.

Reminder: The Sampling Distribution of the mean differs from the normal distribution in that it is a distribution of means rather than a distribution of raw scores.

Exercise 2.2.1.1

Run the code below, which draws 100 samples each of size 10 (i.e. a sample size of 10, repeated 100 times) from the population of reading ability scores, and plots the means in a histogram (i.e. produces a Sampling Distribution of the mean).

Important code features: Notice in the code chunk below that the sample() command is used to draw a sample of size 10 from the population (the "reading.ability" object). The mean() command is wrapped around the sample() command, and calculates the mean of the sample that is generated. The replicate() command then repeats the specified command(s) 100 times. More specifically, the code calculates the mean of a randomly drawn sample of size 10, and replicates this process 100 times.

set.seed(2)
yourmeans <- replicate(100, mean(sample(x=reading.ability, size=10, replace = TRUE)))
hist(..., xlab="...", main = "...")
set.seed(2)
yourmeans <- replicate(100, mean(sample(x=reading.ability, size=10, replace = TRUE)))
hist(yourmeans, xlab= "Sampling Distribution means", main="Sampling Distribution of the mean for reading ability, based on sample size of 10")

Exercise 2.2.1.2

Complete the code provided to draw 100 samples each of size 100 (i.e. sample size of 100, repeated 100 times). Save the object under “yourmeans.2”
set.seed(2)
... <- replicate(..., mean(sample(x=reading.ability, size=.., replace = TRUE)))
hist(..., xlab="...", main= "...")
set.seed(2)
yourmeans.2 <- replicate(100, mean(sample(x=reading.ability, size=100, replace = TRUE)))
hist(yourmeans.2, xlab= "Sampling Distribution means", main="Sampling Distribution of the mean for reading ability, based on sample size of 100")

Exercise 2.2.1.3

Use the code chunk below to calculate the mean of means for Sampling Distributions generated in Exercise 2.2.1.1 and 2.2.1.2.
mean(yourmeans)
mean(yourmeans.2)

Exercise 2.2.1.4

Use the code chunk below to calculate the standard deviation of the means for Sampling Distributions generated in Exercise 2.2.1.1 and 2.2.1.2.
sd(yourmeans)
sd(yourmeans.2)

Exercise 2.2.1.5

We have been writing about the Sampling Distribution as a collection of sample means generated from samples of a particular size. The theoretical Sampling Distribution of the mean can be thought of as being made up of the sample means of an infinite number of samples of a given size, which would be impossible to simulate. Theory tells us, however, that the more samples we draw, the closer the empirical Sampling Distribution comes to the theoretical Sampling Distribution of the mean.

In the case of the theoretical Sampling Distribution of the mean, the standard deviation measures the sampling error of the mean. In other words, sampling error refers to the error associated with estimating the mean from a sample (rather than from the population), and the larger the samples from which the means are generated, the smaller the sampling error.

Exercise 2.2.1.6

3. Central Limit Theorem (CLT)

3.1 Application of the Central Limit Theorem

While we were able to generate sample/empirical Sampling Distributions in section 2, where means were generated from a limited number of samples of a particular size, we are more interested in the theoretical Sampling Distribution of the mean, where the means of an infinite number of samples are considered. It is impossible to simulate this, but we can apply what is referred to as the Central Limit Theorem.

The Central Limit Theorem states the following:

  1. Shape: The distribution of sample means (i.e. Sampling Distribution of the mean) will always approximate a normal distribution as the sample size increases, irrespective of the shape of population distribution.
  2. Mean parameter (\(\mu\)): The mean (i.e. the mean of means) of the Sampling Distribution will be equal to the mean of the population.

\[ \begin{aligned} \mu_{\overline{X}} = \mu \end{aligned} \]

\(\mu_{\overline{X}}\) = Mean of means in Sampling Distribution
\(\mu\) = Population mean

  3. Spread parameter (\(\sigma\)): The standard deviation of the Sampling Distribution of the mean will be equal to the population standard deviation divided by the square root of the sample size (n).

\[ \begin{aligned} \sigma_{\overline{X}} = {\frac{\sigma}{\sqrt{n}}} \end{aligned} \]

\(\sigma_{\overline{X}}\) = Standard deviation of the Sampling Distribution of the mean
\(\sigma\) = Standard deviation of the population
\(n\) = Sample size

An alternative spread formula can also be used, which defines spread in terms of the variance (\(\sigma^{2}\)) instead of the standard deviation (\(\sigma\)).

\[ \begin{aligned} \sigma^2_{\overline{X}} = {\frac{\sigma^2}{n}} \end{aligned} \]

Central Limit Theorem applied to reading ability example:

The Population Distribution of Grade 5 reading ability in the Western Cape follows a normal distribution with mean of 139 words per minute and standard deviation of 35 words per minute. If the Sampling Distribution of the mean is based on samples of size 10, then

  1. The mean of the Sampling Distribution will equal the population mean (139)
  2. The standard deviation of the Sampling Distribution will equal the population standard deviation divided by the square root of the sample size (35/sqrt(10) ≈ 11.07)
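The second of these values can be evaluated directly in R (the first is simply the population mean, 139), and, as a rough check, compared with the empirical standard deviation of the "yourmeans" object from exercise 2.2.1.4:
35/sqrt(10) # standard deviation of the Sampling Distribution, approximately 11.07
sd(yourmeans) # empirical value from exercise 2.2.1.4 should be in the same region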

Depression and night shift work
Let’s say that the mean (\(\mu\)) depression score of night shift workers is 30 on the Beck Depression Inventory (BDI), and the standard deviation (\(\sigma\)) is 10. Samples of 100 were drawn from the population.

Exercise 3.1.1.1

Exercise 3.1.1.2

Use the code chunk below to calculate the expected standard deviation of the sampling distribution of depression amongst night shift workers

Hint: The r command for square root is “sqrt()”

10/sqrt(100)

Exercise 3.1.1.3

Hint: More than one answer is correct, choose any correct answer.

Job satisfaction of social scientists
In another example, let’s say that the mean (\(\mu\)) job satisfaction of social scientists is 130 as measured on the Job Satisfaction Survey (JSS), and the variance is 100 (\(\sigma^2\)). Samples of 2000 were drawn from the population

Exercise 3.1.1.4

Use the code chunk below to calculate the expected variance of the sampling distribution of job satisfaction amongst social scientists.

Hint: Take note of the difference in calculating the expected standard deviation versus expected variance of the Sampling Distribution

100/2000
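To see why the hint matters, note that the variance route and the standard deviation route are consistent with each other: the expected variance of the Sampling Distribution is simply the square of its expected standard deviation. A quick check:
100/2000 # expected variance of the Sampling Distribution = 0.05
(sqrt(100)/sqrt(2000))^2 # squaring the expected standard deviation gives the same value, 0.05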

3.2 Central Limit Theorem application to non-normal distributions

The major strength of the Central Limit Theorem is its applicability to both normal and non-normal population distributions.

The Central Limit Theorem states the following:

  1. Regardless of the shape of the population distribution, the sampling distribution of the mean will approximate the Normal Distribution

This is a particularly useful extension of the theorem, given that…

  1. Not all variables we wish to study are normally distributed (e.g. reaction time tasks)
  2. We have an immense range of tools developed for use with normal distributions that can now be applied to other types of distributions

Continuous Performance Task (CPT) and reaction time
Let’s say that the Continuous Performance Task (CPT), which is a measure of sustained attention, has a population mean reaction time of 0.5 seconds (\(\mu\)). Scores are stored in the object vector CPT.rtime.
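Note: The CPT.rtime object is provided within the tutorial environment, so it is not created in the code shown here. If you want to experiment outside the tutorial, a skewed (non-normal) stand-in with a population mean of 0.5 seconds could be simulated, for example from an exponential distribution. This is purely illustrative and is not the actual CPT data:
set.seed(1)
CPT.rtime <- rexp(10000, rate = 2) # skewed, non-normal distribution with theoretical mean 1/rate = 0.5 seconds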

Exercise 3.2.1.1

Use the code chunk below to plot the distribution of CPT reaction times from the object “CPT.rtime”
hist(CPT.rtime, main= "Distribution of CPT reaction times", xlab= "Reaction time")

Exercise 3.2.1.2

Exercise 3.2.1.3

Complete the code chunk below to generate an empirical Sampling Distribution of CPT reaction times. Draw 200 samples each of size 1000 from the CPT.rtime object to produce the Sampling Distribution.
set.seed(2)
CPT.rtime.means <- replicate(.., mean(sample(x=..., size=..., replace = TRUE)))
hist(CPT.rtime.means, xlab="...", main = "...")
set.seed(2)
CPT.rtime.means <- replicate(200, mean(sample(x=CPT.rtime, size=1000, replace = TRUE)))
hist(CPT.rtime.means, xlab= "Reaction times", main="Sampling Distribution of the mean for CPT reaction times")

Exercise 3.2.1.4

Exercise 3.2.1.5

Confirm that the mean population reaction time is approximately equal to the mean of the Sampling Distribution of reaction times (i.e. test the Central Limit Theorem).
mean(CPT.rtime) 
...(...)
mean(CPT.rtime) 
mean(CPT.rtime.means)

4. Area under the curve in Sampling Distributions

Much like probability is derived from the area under the curve of Normal and Standard Normal Distributions, probabilities can be derived from the area under the curve of Sampling Distributions, EXCEPT that

instead of calculating probabilities based on individual scores, probabilities are calculated based on mean scores

Let’s take the example of the reading ability of Grade 5 learners. Instead of asking what the probability is of a single randomly chosen 5th grader reading 150 words per minute or more, we will now ask what the probability is of the mean reading speed of a sample of 5th graders being 150 words per minute or more.

4.1. Mean score to probability

Mean scores from the Sampling Distribution can be converted to probabilities much like individual scores from the Normal Distribution can, using the pnorm() command. The only difference is the application of the Central Limit Theorem.

Reminder: The mean of the Sampling Distribution is equal to that of the population Normal Distribution, and the standard deviation of the Sampling Distribution is equal to the standard deviation of the Normal Distribution divided by the square root of the sample size

Grade 5 reading ability in the Western Cape
Reading ability of Grade 5 learners follows a normal distribution, with a population mean (\(\mu\)) of 139 words per minute, and a standard deviation (\(\sigma\)) of 35 words per minute. We draw a random sample of 100 learners.

Exercise 4.1.1.1

Run the code chunk below, which calculates the probability of a sample of Grade 5 learners reading 90 words per minute or less, on average.
se = 35/sqrt(100)  # se = standard error
pnorm(q=90, mean = 139, sd = se)

Note: You will notice that the calculated probability for exercise 4.1.1.1 has the suffix “e-45”, which means the number is multiplied by \(10^{-45} = \frac{1}{10^{45}}\); in other words, the probability is an exceptionally small number!

Exercise 4.1.1.2

Complete the code chunk below to calculate the probability of a sample of Grade 5 learners reading 150 words per minute or less, on average.
pnorm(q=.., mean = ..., sd= .../sqrt(...))
pnorm(q=150, mean = 139, sd= 35/sqrt(100))

4.2 Probability to mean score

Probabilities from the Sampling Distribution can also be converted back to a mean score, much like with the Normal Distribution, using the qnorm() command.

Job satisfaction of social scientists
The mean (\(\mu\)) job satisfaction of social scientists is 130 as measured on the Job Satisfaction Survey (JSS), and the variance is 100 (\(\sigma^2\)). A random sample of 2000 social scientists was drawn from the population

Exercise 4.2.1.1

What would the mean score of the sample be if it occurred with p = 0.1?

Hint: Use the qnorm() function

standard.dev= sqrt(100)
qnorm(p=..., mean =..., sd=.../sqrt(...))
standard.dev= sqrt(100)
qnorm(p=0.10, mean =130, sd=standard.dev/sqrt(2000))

Exercise 4.2.1.2

What would the mean score of the sample be if it occurred with p = 0.9?

Hint: Use the qnorm() function

standard.dev= sqrt(...)
qnorm(p=..., mean =..., sd=standard.dev/sqrt(...))
standard.dev= sqrt(100)
qnorm(p=0.90, mean =130, sd=standard.dev/sqrt(2000))

4.3. Transform Sampling Distribution to Standard Normal Distribution (converting to Z-scores)

The Sampling Distribution can be converted to a Standard Normal Distribution by converting mean scores to z-scores. The Z-score formula is much the same as that for converting a Normal to a Standard Normal Distribution, except that

  1. the random score, X, is replaced with a random mean score \({\overline{X}}\)
  2. the population mean of the raw scores, \(\mu\), is replaced with the population mean of the Sampling Distribution, \(\mu_{\overline{X}}\).
  3. the population standard deviation \(\sigma\), is replaced with the population standard deviation of the Sampling Distribution, \(\sigma_{\overline{X}}\), which is equivalent to \({\frac{\sigma}{\sqrt{n}}}\)

See two equivalent versions of Z-score formula for Sampling Distribution below:

\[ \begin{aligned} Z =\frac{\overline{X} - \mu_{\overline{X}}}{\sigma_{\overline{X}}} \quad \text{or} \quad Z =\frac{\overline{X} - \mu_{\overline{X}}}{\frac{\sigma}{\sqrt{n}}} \end{aligned} \]

\({\overline{X}}\) = any mean score in the range of possible values of some Sampling Distribution
\({\mu_{\overline{X}}}\) = mean of the Sampling Distribution
\({\sigma_{\overline{X}}}\) = standard deviation of the Sampling Distribution

Exercise 4.3.1.1

Using the Z-score formula for the Sampling Distribution, convert a mean reading ability score of 145 words per minute into a z-score, assuming the sample consists of 100 5th Graders and the standard deviation of the raw scores is 35.
(...-139)/(35/sqrt(...))
(145-139)/(35/sqrt(100))

Exercise 4.3.1.2

Calculate the associated probability of obtaining a z-score equal to, or less than, the one derived in exercise 4.3.1.1.

Hint: You are looking for the associated probability of a Standard Normal Distribution

zscore =(...-...)/(.../sqrt(...))
pnorm (zscore, mean =..., sd=...)
zscore =(145-139)/(35/sqrt(100))
pnorm (zscore, mean = 0, sd = 1)
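As a side note, standardising first and then using the Standard Normal Distribution is equivalent to supplying the mean and standard error directly to pnorm(), as in section 4.1; both approaches should return the same probability:
pnorm(145, mean = 139, sd = 35/sqrt(100)) # same probability as pnorm(zscore, mean = 0, sd = 1)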

5. Standard error and confidence interval

5.1 Standard error

The Central Limit Theorem implies that the spread of the Sampling Distribution can be controlled by the researcher: since the standard deviation of the Sampling Distribution is \({\frac{\sigma}{\sqrt{n}}}\), it decreases as the sample size n increases.

This is why researchers typically want large samples, with the aim of

  1. decreasing the standard deviation of the sampling distribution,
  2. thereby increasing the accuracy of the estimate of the population mean.

The standard deviation of the Sampling Distribution is commonly referred to as the standard error of the estimate, or the standard error for short. When we sample means, it is the standard error of the mean, but we could also obtain the standard error of some other estimator.
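A quick way to see this in R, assuming the "yourmeans" and "yourmeans.2" objects created in section 2.2 are still available, is to compare the empirical standard errors with the theoretical values \({\frac{\sigma}{\sqrt{n}}}\):
sd(yourmeans) # empirical standard error based on samples of size 10
35/sqrt(10) # theoretical standard error, approximately 11.07
sd(yourmeans.2) # empirical standard error based on samples of size 100
35/sqrt(100) # theoretical standard error, 3.5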

Exercise 5.1.1.1

Two sampling distributions of 5th grader reading ability are saved in the objects "yourmeans" and "yourmeans.2": the former means were based on samples of size 10, whilst the latter were based on samples of size 100. Plot the two Sampling Distributions one below the other.

Reminder: Remember to check that there are commas separating each argument in your code, and that all brackets are paired. Don’t let syntax errors get you down!

par(mfrow = c(..., ...))
hist(..., xlab=..., main=..., xlim=c(110, 160))
hist(..., xlab=..., main=..., xlim=c(110, 160))
par(mfrow = c(2,1))
hist(yourmeans, xlab= "Reading speed", main = "Sampling Distribution of 5th Grader reading ability \nbased on sample size of 10", xlim=c(110, 160))
hist(yourmeans.2, xlab= "Reading speed", main = "Sampling Distribution of 5th Grader reading ability \nbased on sample size of 100", xlim=c(110, 160))

Exercise 5.1.1.2

5.2 Confidence limits/Intervals

When we use a sample to estimate characteristics of a population, there is always a degree of uncertainty in the estimation of parameters (e.g. the mean), which is represented by the standard error. The degree of uncertainty is small when the sample is representative of the population, and decreases as the sample size increases.

One can account for this degree of uncertainty with the calculation of confidence limits/intervals. The confidence interval represents the range within which the true parameter lies. (Actually, this is not quite technically correct, but it is close enough in meaning for our purposes)

If one calculates a 95% confidence interval for an estimate, we can say that, across repeated samples, 95% of the intervals constructed in this way would contain the true population parameter value.

For example, the 95% confidence interval for reading ability of 5th graders is between 132 and 146 words per minute (Note that the interval includes the population mean 139), which means that the true parameter value lies somewhere within this interval. See a graphical representation of the confidence interval below.

In order to set up confidence intervals, we use the Standard Normal Distribution. The figure below displays the 95% confidence interval within which the true average reading ability of 5th graders lies. In other words, the entire black area of the curve represents the 95% confidence interval. Notice that on the x-axis, z-scores are presented instead of scores.

The confidence interval is derived from the Z-score formula for the Sampling Distribution

\[ \begin{aligned} Z =\frac{\overline{X} - \mu_{\overline{X}}}{\frac{\sigma}{\sqrt{n}}}, \quad \text{and since } \mu_{\overline{X}} = \mu: \quad Z \frac{\sigma}{\sqrt{n}} = \overline{X} - \mu, \quad \text{thus} \quad \mu = \overline{X} - Z \frac{\sigma}{\sqrt{n}} \end{aligned} \]

Confidence Interval formula

\[ \begin{aligned} \mu = \overline{X} \pm Z \frac{\sigma}{\sqrt{n}} \end{aligned} \]

Note: You should notice a \(\pm\) symbol used in the Confidence Interval formula instead of the \(-\) symbol. This is because the z-score takes both a negative value (cutting off the lower tail) and a positive value (cutting off the upper tail), giving a lower and an upper limit of the interval.

Let’s look at how to calculate the confidence interval in R using the 5th grader reading ability example:

Grade 5 reading ability in Western Cape
Reading ability of Grade 5 learners follows a normal distribution, with a population mean (\(\mu\)) of 139 words per minute, and a standard deviation (\(\sigma\)) of 35 words per minute. We draw a random sample of 100 learners.

Exercise 5.2.1.1

Run the code chunk below to generate lower and upper limits of a 95% confidence interval

Note: Notice that the mean and sd inputs have not been explicitly provided in the qnorm() command. This is because the mean and standard deviation arguments are set to 0 and 1 by default. See the ?qnorm help file.

Reminder: For a 95% confidence interval, the remaining 5% of the probability is split between the two tails (2.5% in each), so you use the z-score with a cumulative probability of 0.975 (i.e. 0.95 + 0.025). See pp. 121-122 of Numbers, Hypotheses & Conclusions.

error <- qnorm(0.975)*35/sqrt(100)

upper =139 + error
lower = 139 - error

c(lower, upper)
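For reference, qnorm(0.975) is approximately 1.96, so the error term is about 1.96 × 35/sqrt(100) = 1.96 × 3.5 ≈ 6.86, giving an interval of roughly 132.1 to 145.9 words per minute; this is the 132 to 146 interval quoted in section 5.2.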

Exercise 5.2.1.2

Run the code chunk below to generate lower and upper limits of a 90% confidence interval
error <- qnorm(0.95)*35/sqrt(100)

upper =139 + error
lower = 139 - error

c(lower, upper)

HIV-AIDS & CD4 count
This follows on from the earlier CD4 example. For the sample of 5, the sample mean CD4 count obtained was 130, with a population standard deviation of 46.9.

Exercise 5.2.1.3

Using the HIV-AIDS and CD4 count example, calculate a 95% confidence interval
error <- qnorm(0.975)*.../sqrt(...)

upper = ... + error
lower = ... - error

c(lower, upper)
error <- qnorm(0.975)*46.9/sqrt(5)

upper = 130 + error
lower = 130 - error

c(lower, upper)

Exercise 5.2.1.4

Using the HIV-AIDS and CD4 count example again, now calculate a 90% confidence interval
error <- qnorm(...)*.../sqrt(...)

upper = ... + error
lower = ... - error

c(lower, upper)
error <- qnorm(0.95)*46.9/sqrt(5)

upper = 130 + error
lower = 130 - error

c(lower, upper)

6. Advanced Section

Confidence intervals are especially useful, because they provide an indication of the precision of the estimate, via the width of the confidence interval.

  1. A wider confidence interval indicates less precision, or greater variability, in the estimate (here, the estimate is the mean)
  2. A narrower confidence interval indicates greater precision, or less variability, in the estimate

Let’s take a look back at the Job satisfaction example.

Job satisfaction of social scientists
The mean (\(\mu\)) job satisfaction of social scientists is 130 as measured on the Job Satisfaction Survey (JSS), and the variance is 100 (\(\sigma^2\)). A random sample of 2000 social scientists was drawn from the population

Exercise 6.1.1.1

Using the Job satisfaction example, calculate the 95% confidence interval
standdev=sqrt(...)

error <- qnorm(...)*standdev/sqrt(...)

upper = ... + error
lower = ... - error

c(lower, upper)
standdev=sqrt(100)

error <- qnorm(0.975)*standdev/sqrt(2000)

upper = 130 + error
lower = 130 - error

c(lower, upper)

Exercise 6.1.1.2

Now let’s assume that the variance in job satisfaction is actually 1000. Use the code chunk below to calculate the 95% confidence interval
standdev2=sqrt(1000)

error2 <- qnorm(0.975)*standdev2/sqrt(2000)

upper2 = 130 + error2
lower2 = 130 - error2

c(lower2, upper2)
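If you compare the two intervals, the error term increases from about 1.96 × 10/sqrt(2000) ≈ 0.44 to about 1.96 × sqrt(1000)/sqrt(2000) ≈ 1.39, so the interval based on the larger variance is noticeably wider. This illustrates the link between variability and precision described at the start of this section.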

Exercise 6.1.1.3

Exercise 6.1.1.4

Free-form exercise

Move on to and complete Statistics Tutorial Assignment 4 on Amathuba (Activities | Assignments)

Other resources & references

Developed by: Marilyn Lake & Colin Tredoux, UCT

PSY2015F tut 4: Sampling Distribution