1 Paired data
Read Tutorials 9 & 10 of Tredoux, C.G. & Durrheim, K.L. (2018). Numbers, hypotheses and conclusions. 3rd edition. Cape Town: Juta Publishers (the 2nd edition could also be used)
Fear of intimacy and relationship length
Researchers are interested in investigating the relationship between fear of intimacy and relationship length. Data on both variables is collected from a sample of 200 participants and stored in the external file called “FearIntimacyRelationship.csv”.
Exercise 1.1.1.1
Import the “FearIntimacyRelationship.csv” dataset and inspect the first 6 rows of data
Hint 1: See Guided Tutorial 1 & 6 for a reminder on how to import external datasets
Hint 2: See Guided Tutorial 1 for how to inspect
data using head() command
FearIntimacyRelationship = ...("...")
head(...)
FearIntimacyRelationship = read.csv("FearIntimacyRelationship.csv")
head(FearIntimacyRelationship)
You will see in the dataframe that for each participant there is a fear of intimacy score, as well as a length of relationship score. For example, participant 1 has a fear of intimacy score of 37.6 and a relationship length of 6.2.
Exercise 1.1.1.2
Run the code chunk below, which produces a scatterplot plotting length of relationship against fear of intimacy
To visualise the relationship between fear of intimacy and length of relationship, we can draw a scatterplot
plot(FearIntimacyRelationship$fear.intimacy, FearIntimacyRelationship$length.relationship, xlab ="Fear of Intimacy", ylab = "Relationship length", main = "Relationship between fear of intimacy and relationship length")
Note: Each dot on the plot represents a different participant
From the scatterplot we can see a trend based on the clustering pattern of data points - more specifically, it appears that as fear of intimacy score increases, relationship length decreases.
2. Regression line: Visual depiction
In section 1, we saw a negative relationship between fear of intimacy and relationship length.
We could add what is called a regression line, otherwise known as a line of best fit, to the scatterplot (see below). This line best represents the relationship between the y-variable (in this case relationship length) and the x-variable (fear of intimacy).
plot(FearIntimacyRelationship$fear.intimacy, FearIntimacyRelationship$length.relationship, xlab ="Fear of Intimacy", ylab = "Relationship length", main = "Relationship between fear of intimacy and relationship length")
abline(lm(FearIntimacyRelationship$length.relationship~FearIntimacyRelationship$fear.intimacy),col = "purple")
Age and Body mass index (BMI)
Researchers want to assess the relationship between age and body mass index (BMI) amongst a sample of 50 participants. Data is stored in the external file called “AgeBMI.csv”, and variables are named “BMI” and “Age”
Exercise 2.0.1.2
Import the “AgeBMI.csv” dataset and plot the relationship between “Age” (x variable) and “BMI” (y variable) on a scatterplot
Hint: Look back at section one on how to plot a scatterplot
Reminder: Remember to add appropriate labels to your plot
AgeBMI = ...("...")
plot(AgeBMI$..., AgeBMI$...,xlab ="...", ylab = "...", main ="...")
AgeBMI = read.csv("AgeBMI.csv")
plot(AgeBMI$Age, AgeBMI$BMI,xlab ="Age", ylab = "BMI", main= "Relationship between Age and BMI")
Exercise 2.0.1.3
Plot the relationship between “Age” (x variable) and “BMI” (y variable) on a scatterplot again, adding in a regression line
Note: A line of best fit can be added to a
scatterplot using the abline() and lm()
commands.
plot(....$..., ....$...,xlab ="...", ylab = "...", main ="...")
abline(lm(AgeBMI$BMI~AgeBMI$Age))
plot(AgeBMI$Age, AgeBMI$BMI,xlab ="Age", ylab = "BMI", main= "Relationship between Age and BMI")
abline(lm(AgeBMI$BMI~AgeBMI$Age))
Exercise 2.0.1.4
2.1 Direction of relationship
As was previously illustrated, relationships between variables can differ in direction; the possibilities are listed below (a simulated sketch of each follows the list):
- Positive relationship - as one variable increases so does the other
- Negative relationship - as one variable increases the other decreases
- No relationship - there is no pattern in the way two variables vary
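To make the three directions concrete, here is a minimal sketch using simulated data (the values are randomly generated purely for illustration):
set.seed(1)
x <- rnorm(100)
par(mfrow = c(1, 3))                              # three plots side by side
plot(x,  x + rnorm(100), xlab = "x", ylab = "y", main = "Positive")
plot(x, -x + rnorm(100), xlab = "x", ylab = "y", main = "Negative")
plot(x, rnorm(100),      xlab = "x", ylab = "y", main = "No relationship")
par(mfrow = c(1, 1))                              # reset the plotting layout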

Exercise 2.1.1.1
Exercise 2.1.1.2
2.2 Form of relationship (Linearity)
You should notice that the regression lines presented so far have all been straight lines. A straight regression line implies a linear relationship between the variables.
- A linear relationship is one in which a one-unit change in the x-variable leads to a constant change in the y-variable
- For example: For every additional year one grows older, BMI is predicted to increase by 0.15.
- Implication: Whether the increase in age is from 20 to 21 or from 50 to 51, the predicted increase in BMI is the same 0.15 (see the sketch below)
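To see what “constant change” means in practice, here is a minimal sketch using the illustrative equation BMI = 0.15*Age + 24.2 that appears later in this tutorial (the ages 20, 21, 50 and 51 are arbitrary examples):
predict.bmi <- function(age) 0.15 * age + 24.2   # illustrative regression equation
predict.bmi(21) - predict.bmi(20)                # increase from age 20 to 21: 0.15
predict.bmi(51) - predict.bmi(50)                # increase from age 50 to 51: also 0.15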
Note: In this course we will be largely focusing on linear relationships
There are other relationship forms…
- A nonlinear relationship is one in which a change in the x-variable leads to a varying change in the y-variable (see the plot directly below, and the simulated sketch after this list)
- For example: Heroin and euphoria. Heroin use may initially generate strong feelings of euphoria, but with continued use can sharply decrease feelings of euphoria as tolerance to the drug develops. In other words, the drop in euphoria with continued use does not occur at a constant rate
- Implication: From the 1st to 2nd month of heroin use, the drop in feelings of euphoria is greater than from the 5th to 6th month of heroin use
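Here is a minimal simulated sketch of such a curve (the euphoria values are made up purely for illustration):
months   <- 1:12
euphoria <- 100 / months                  # steep early drop that flattens out
plot(months, euphoria, type = "b", xlab = "Months of heroin use",
     ylab = "Euphoria score", main = "Example of a nonlinear relationship")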
Exercise 2.2.1.1
See plots displayed below

2.3. Strength of relationship
Whilst the regression line or “line of best fit” can tell us
- direction of the relationship (positive vs negative) - by whether the line is upwards (positive) or downwards (negative) facing, and
- form of the relationship (linear vs nonlinear) - whether the line is straight (linear) or curved (nonlinear)
the regression line can also tell us
- strength of the relationship (weak vs strong) - by the closeness of the points to the regression line
- closer to the line = stronger linear relationship
- further from the line = weaker linear relationship (see the simulated sketch below)
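The simulated sketch below shows two scatterplots with the same underlying slope; the points sit much closer to the line in the first (strong) than in the second (weak). The values are randomly generated for illustration only:
set.seed(2)
x <- rnorm(100)
y.strong <- 2 * x + rnorm(100, sd = 0.5)   # little scatter around the line
y.weak   <- 2 * x + rnorm(100, sd = 3)     # a lot of scatter around the line
par(mfrow = c(1, 2))
plot(x, y.strong, xlab = "x", ylab = "y", main = "Strong"); abline(lm(y.strong ~ x))
plot(x, y.weak,   xlab = "x", ylab = "y", main = "Weak");   abline(lm(y.weak ~ x))
par(mfrow = c(1, 1))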
Exercise 2.3.1.1
See plots displayed below

Note: The closer the points to the line, the stronger the relationship
3. Regression equation: Theory
Regression has been discussed in terms of its visual depiction, in the form of the regression line, but it will now be discussed more in terms of underlying theory.
Simple linear regression is a descriptive method, but can also have inferential methods applied to it. Linear regression aims to
- infer the relationship/association between two continuous variables
- predict the change in the y-variable as a function of a unit change in the x-variable
Linear regression is expressed in the general regression equation form
\[
\begin{aligned}
y = bx + c
\end{aligned}
\] where
\(y\) = y-variable, known as the
dependent, outcome, or criterion variable
\(x\) = x-variable, known as the
independent, or predictor variable
\(b\) = coefficient (coefficient
defined as a constant value that is multiplied by the variable attached
to it)
\(c\) = intercept (a single constant
value)
Note: It is also common to write the equation above in other, equivalent forms, e.g.,
\[ \begin{aligned} y = mx + c \end{aligned} \\ \begin{aligned} y = a + bx \end{aligned} \]
3.1. Information from regression equation
Linear regression equation example: Age and Body mass
index (BMI)
See regression equation below, which estimates the relationship between
age and BMI
\[ \begin{aligned} BMI = 0.15*Age+24.2 \ ~~~~OR~~~~\ y = 0.15*x + 24.2 \end{aligned} \]
\(x\) = \(Age\)
\(y\) = \(BMI\)
Note: Notice how the regression equation above differs from the general form, where the components in the general form are substituted for values and variable names.
How do we interpret this regression equation?
- For every year one gets older (i.e. for every 1 year), there is a predicted 0.15 increase in BMI
The regression equation is a formula: substituting a value for \(x\) (Age) gives the predicted value of \(y\) (BMI).
Exercise 3.1.1.1
Calculate the predicted BMI for individuals of age 40
Hint: Substitute “Age” in the regression equation with the value 40, and calculate predicted BMI based on the entire regression equation
0.15*...+ 24.2
0.15*40 + 24.2
Exercise 3.1.1.2
Exercise 3.1.1.3
Use the code chunk below to check that the x- and y-coordinate values below satisfy the equation (i.e. substitute into the formula)
\[ \begin{aligned} y = 0.15*x + 24.2 \end{aligned} \]

| x_coordinate_age | y_coordinate_bmi |
|---|---|
| 24 | 27.80 |
| 25 | 27.95 |
| 30 | 28.70 |
| 32 | 29.00 |
| 36 | 29.60 |
| 41 | 30.35 |
| 45 | 30.95 |
0.15*24 + 24.2
0.15*25 + 24.2
0.15*30 + 24.2
0.15*32 + 24.2
0.15*36 + 24.2
0.15*41 + 24.2
0.15*45 + 24.2
You should notice that the x- and y-coordinates provided do in fact satisfy the regression equation.
Exercise 3.1.1.4
The x- and y-coordinates displayed above are saved in the dataframe called xycoord, under the variable names xcoord.age and ycoord.BMI. Plot the coordinates in a scatterplot
plot(...$xcoord.age, ...$ycoord.BMI, xlab="...", ylab="...", main ="...")
plot(xycoord$xcoord.age, xycoord$ycoord.BMI, xlab="Age", ylab="BMI", main ="Regression line")
Exercise 3.1.1.5
Hint: More than one answer is correct, choose any correct answer.
Exercise 3.1.1.6
An additional x- and y-coordinate pair is provided below. Use the regression equation to evaluate whether it satisfies the equation
| x_coordinate_age | y_coordinate_bmi |
|---|---|
| 28 | 42.2 |
0.15*28 + 24.2 # gives 28.4, not 42.2
When the additional coordinate pair is plotted together with the original x- and y-coordinates, we see that it does not satisfy the regression equation (i.e. it does not fall on the regression line); see the graphic below

3.2. Altering aspects of regression equations
When the estimates \(b\) and \(c\) are altered in the regression equation, this alters the appearance of the regression line
\[ \begin{aligned} y =bx+c \end{aligned} \]
Exercise 3.2.1.1
The plot below visually depicts various regression lines with altered \(b\) coefficient values

Exercise 3.2.1.2
The plot below visually depicts various regression lines with altered \(c\) intercept values

The \(c\) or intercept value tells us where the regression line cuts the y-axis
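A minimal sketch of this idea: abline() can draw a line directly from an intercept (its argument a, our \(c\)) and a slope (its argument b, our \(b\)), so you can see how changing each value moves or tilts the line. The values below are arbitrary:
plot(NULL, xlim = c(0, 10), ylim = c(0, 12), xlab = "x", ylab = "y",
     main = "Altering the intercept and coefficient")
abline(a = 2, b = 0.5, col = "red")        # cuts the y-axis at 2
abline(a = 6, b = 0.5, col = "blue")       # same slope, cuts the y-axis at 6
abline(a = 2, b = 1.0, col = "darkgreen")  # same intercept as the red line, steeper slope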
3.3 Producing the regression equation
In order to calculate components of the regression equation based on
the raw data, the lm() command, in conjunction with the
summary() command, can be used
Fear of intimacy and relationship
length
Researchers are interested in investigating the relationship between
fear of intimacy (named “fear.intimacy”) and relationship length (named
“length.relationship”). Data on both variables is collected from a
sample of 200.
Exercise 3.3.1.1
Run the code chunk below which runs a simple linear regression, where relationship length is regressed on fear of intimacy, and inspect the output
Note 1: y-variable (Dependent variable) is always regressed on x-variable (independent variable)
Note 2: The ~ sign indicates that a variable is being “regressed on” another.
mod2=lm(length.relationship~fear.intimacy, data = FearIntimacyRelationship)
summary(mod2)
Based on the output obtained, the regression equation is as follows:
\[ \begin{aligned} length.relationship =-0.33*fear.intimacy + 21.37 \end{aligned} \]
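The estimated intercept and coefficient can also be pulled directly out of the fitted model, and the built-in predict() command applies the equation for you. Below is a minimal sketch (the fear-of-intimacy score of 30 is just an illustrative value):
coef(mod2)                                   # intercept and fear.intimacy coefficient
# Predicted relationship length for a fear of intimacy score of 30:
coef(mod2)["(Intercept)"] + coef(mod2)["fear.intimacy"] * 30
predict(mod2, newdata = data.frame(fear.intimacy = 30))   # same value via predict()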
Very important: The regression is directional! You can only use the regression equation above to predict relationship length (y-values) from fear of intimacy scores (x-values). You cannot predict fear of intimacy from relationship length, and would need to re-specify the regression model if you wanted to do so.
Another very important point: One cannot automatically infer a causal relationship between variables modelled using regression. For example, in the case of the relationship between fear of intimacy and relationship length, it could be that increased fear of intimacy leads to shorter relationship length, but it is also possible that shorter relationship lengths lead to greater fear of intimacy. In this case we can only say that these two variables are associated/related to one another.
4. Ordinary Least Squares (OLS) estimation
With any relationship that we choose to investigate, there is always some degree of error (recall sampling error from Guided Tutorial 5).
Because of this, the straight regression line is an approximation of the relationship between two variables.
Ordinary Least Squares (OLS) estimation is a method that is used to determine the regression line or “line of best fit” that minimises the total error between observed values and predicted values, where all predicted values lie on the regression line
The image below depicts the distance between observed and predicted values on the regression line (i.e. error). The OLS method takes the sum of the squared vertical differences between observed and predicted values.
Why “squared” difference in particular? This is because the distances can be both positive and negative in sign, and squaring the distances gets ‘rid’ of the negative sign - if we simply added negative and positive differences, they would add to 0.

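A minimal sketch of these vertical distances for the model fitted in section 3.3 (mod2): adding the raw differences gives roughly zero, which is exactly why OLS squares them first.
observed  <- FearIntimacyRelationship$length.relationship
predicted <- fitted(mod2)        # predicted values all lie on the regression line
distances <- observed - predicted
sum(distances)                   # essentially 0: positive and negative distances cancel
sum(distances^2)                 # sum of squared distances - what OLS minimises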
Exercise 4.1.1.1

Exercise 4.1.1.2
5. Standard Error of the Estimate (SE)
5.1 Calculating Standard Error of Estimate (SE)
The Standard Error of Estimate captures the amount of error in the regression model, or can alternatively be viewed as a measure of the accuracy in the regression model, where the lower the SE, the more accurate/less error in the regression model. The SE is based on the OLS method mentioned in the previous section (section 4), with some additional features from point 3 onwards:
1. Square the distances \((y - y')^2\) between observed (\(y\)) and predicted (\(y'\)) values - remember why we need to square the differences?
2. Take the sum of the squared distances (\(\sum(y - y')^2\))
3. Calculate the average of the squared distances (\(\frac{\sum(y - y')^2}{N}\)) - an average is taken because the sum of squared distances is influenced by sample size (as sample size increases, so does \(\sum(y - y')^2\))
4. Take the square root of the value obtained in step 3 - we are undoing the earlier squaring here, converting a measure of variance to a standard deviation - remember how variance is converted to standard deviation?
Thus, the Standard Error of Estimate can be described as attempting, in principle, to compute an average of the vertical distances between each observed value and its predicted value on the regression line (but it uses squaring and square-rooting to deal with the difficulties mentioned earlier).

See population SE formula below:
\[ \begin{aligned} \sigma_{y.x} = \sqrt{\frac{\sum(y - y')^2}{N}} \end{aligned} \]
\(y\) = observed values (each
observed value is entered into the formula)
\(y'\) = predicted values (each
corresponding predicted value is entered into the formula)
\(N\) = population size
Note: You should notice that the formula for population SE is much like the standard deviation formula. The SE can be regarded as the standard deviation of distances from the regression line. Note also that in R the standard error of estimate of a regression equation is called the “residual standard error”, and if you read that as the “standard error of the residuals” the connection between the formula above and the term in R, i.e., residual standard error, may be clearer to you.
See sample SE formula below
\[ \begin{aligned} s_{y.x} = \sqrt{\frac{\sum(y - y')^2}{n-2}} \end{aligned} \]
Unlike the population SE, which is used when you are dealing with an entire population, the formula for the sample SE differs somewhat.
\(N\) is replaced with \(n-2\), where \(n\) refers to the size of the sample. This is because \(n-2\) represents the degrees of freedom (see Guided tutorial 6)
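A minimal sketch of the sample formula applied to the model from section 3.3; the hand-calculated value should match the “Residual standard error” that summary(mod2) reports (stored in summary(mod2)$sigma):
y      <- FearIntimacyRelationship$length.relationship
y.pred <- fitted(mod2)
n      <- length(y)
sqrt(sum((y - y.pred)^2) / (n - 2))   # sample standard error of estimate
summary(mod2)$sigma                   # R's residual standard error - same value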
5.2 Evaluating Standard Error of Estimate (SE)
In order to evaluate whether a SE is large or small, we need to compare it to something.
The SE (\(s_{y.x}\)) is compared against the standard deviation of y (\(s_{y}\)). Why?
- SE (\(s_{y.x}\)) uses the regression line to predict y for every x-value
- We can think of computing the standard deviation of y (\(s_{y}\)) as a regression type of calculation, in that there is a regression line through \(\bar{Y}\), and we find the “average” of the distances of the y values from this line. Another way of thinking about this is to ask ‘what is the best prediction we could make of y, if we did not have information about x?’ The answer to that is that we could guess \(\bar{Y}\), the mean of y, and that would not be a good prediction, but the best we could do in the absence of knowing anything about the predictor, x.
Run the chunk below, and you will see what we mean. Note that sd(y) is very similar to the standard error of estimate in the output; the differences arise because for the SE of estimate we divide by (n-2), and for the sd by (n-1). Remember that in the output of lm in R, what is called the ‘residual standard error’ is what is more generally called the standard error of estimate, \(s_{y.x}\). In the plot, the red line is the line through \(\bar{Y}\) and represents the equation \(Y = \bar{Y}\); the spread of the points around the red line is measured by \(s_{y}\). The blue line is the line of best fit, and you can see that it does a better job of fitting the data than the red line (which is the best we can do in the absence of x).
x <- rnorm(50)
y <- x + rnorm(50, 1, 1) # y is x plus random noise (mean 1, sd 1)
plot(x,y)
abline(a = mean(y), b= 0, col = "red")
abline(lm(y~x),col = "blue")
summary(lm(y ~ x))
sd(y)
We look at the ratio of \(s_{y.x}\) and \(s_{y}\) to make sense of the SE
\[ \begin{aligned} \frac{s_{y.x}}{s_{y}} \end{aligned} \]
If the SE (\(s_{y.x}\)) is smaller than the standard deviation of y (\(s_{y}\)), we can say that the regression model improves the ability to predict y-values over what we would achieve in the absence of the model
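A minimal sketch of this comparison for the fear-of-intimacy model fitted in section 3.3:
se.est <- summary(mod2)$sigma                               # standard error of estimate
sd.y   <- sd(FearIntimacyRelationship$length.relationship)  # standard deviation of y
se.est / sd.y     # a ratio well below 1 means the model improves prediction of y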
Exercise 5.2.1.1
6. Covariance & correlation
6.1. Covariance
Covariance is an important aspect of both regression and correlation (which we will discuss later). Covariance measures the extent to which two variables co-vary
Covariance formula:
\[ \begin{aligned} cov_{xy} = {\frac{\sum(x - \overline{x})(y - \overline{y})}{n-1}} \end{aligned} \]
\(x\) = x-variable values
\(\overline{x}\) = x-variable
average
\(y\) = y-variable values
\(\overline{y}\) = y-variable
average
\(n\) = sample size
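To see the formula at work, here is a minimal sketch that calculates the covariance “by hand”; the exercise that follows checks it against R’s built-in cov() command:
x <- FearIntimacyRelationship$fear.intimacy
y <- FearIntimacyRelationship$length.relationship
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)   # covariance from the formula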
Exercise 6.1.1.1
Complete the code chunk below to calculate the covariance between “length.relationship” and “fear.intimacy”, and save the output in an r object named “cov.out”
Note 1: The cov() command can be used
to calculate covariance
Note 2: The order in which you insert variables into the cov() command has no effect on the covariance it produces (i.e. order is not important here)
cov.out=cov(FearIntimacyRelationship$..., FearIntimacyRelationship$...)
cov.out
cov.out=cov(FearIntimacyRelationship$length.relationship, FearIntimacyRelationship$fear.intimacy)
cov.out
6.2 Covariance and regression coefficients
Covariance is important because it is used to calculate the regression coefficient (\(b\)) in the regression equation
Recall the general form of the regression equation below:
\[ \begin{aligned} y = bx+c \end{aligned} \] \(b\) is the regression coefficient, and captures the relationship between the x- (independent) and y- (dependent) variables.
It is calculated using the following formula:
\[ \begin{aligned} b = \frac{cov_{xy}}{s_{x}^2} \end{aligned} \]
\(s_{x}^2\) = variance of x
Exercise 6.2.1.1
Calculate the regression coefficient of length.relationship regressed on fear.intimacy using the formula directly above
Hint: The covariance was previously saved in the object called “cov.out”
Note: The order in which you insert variables into the cov() command has no effect on the covariance it produces (i.e. order is not important here)
.../var(FearIntimacyRelationship$...)
cov.out/var(FearIntimacyRelationship$fear.intimacy)
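As a quick check (assuming mod2 from section 3.3 is still in your workspace), the value calculated above should match the coefficient that lm() estimated for fear.intimacy:
coef(mod2)["fear.intimacy"]   # slope from lm(): same value as cov.out / var(x)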
6.3 Correlation
Correlation is another way of quantifying the degree of accuracy, or error, of the regression equation (although a correlation/s can be estimated independently of a regression model).
Correlation is best described as quantifying the strength of the relationship between two variables
The covariance is used to calculate correlation - as covariance increases, so does correlation.
See the correlation formula below, otherwise referred to as the Pearson product-moment correlation coefficient:
\[
\begin{aligned}
r = \frac{cov_{xy}}{s_{x}s_{y}}
\end{aligned}
\] \(r\) = Pearson product-moment correlation coefficient
\(s_{x}s_{y}\) = the product (i.e. the result of multiplying) of the standard deviations of the two variables
Exercise 6.3.1.1
Calculate the correlation coefficient between length.relationship and fear.intimacy using the formula directly above
.../(sd(...$...)*sd(...$...))
cov.out/(sd(FearIntimacyRelationship$fear.intimacy)*sd(FearIntimacyRelationship$length.relationship))
Exercise 6.3.1.2
Calculate the correlation coefficient between length.relationship and fear.intimacy using the cor() command. Check that you get the same answer as that obtained in the previous exercise
Note 1: The cor() command can be used
to calculate correlation, and its arguments are the same as those in the
cov() command
cor(...$..., ...$...)
cor(FearIntimacyRelationship$fear.intimacy, FearIntimacyRelationship$length.relationship)
6.4 Correlation vs regression coefficient
The correlation coefficient is very similar to the regression coefficient found in the regression equation in that it also conveys the
- direction of relationship -> (e.g. an \(r\) of -0.7 indicates a negative relationship, whilst an \(r\) of 0.7 indicates a positive relationship)
- strength of relationship -> (e.g. an \(r\) of 0.7 indicates a stronger relationship than an \(r\) of 0.3)
In the case of the relationship between fear of intimacy and relationship length, both \(r\) (-0.69) and \(b\) (-0.33) are shown in the figure below.

Despite the similarities between correlation and regression coefficients, there are some key differences:
- the correlation coefficient is a standardised measure of the strength of the relationship between two variables: regardless of the units in which the variables are measured, correlations can only range from -1 to 1. Regression coefficients, on the other hand, are expressed in the units of the variables themselves (the predicted change in y per one-unit change in x)
- the correlation coefficient cannot be used to predict values, whilst the regression coefficient can

6.5 Correlation strength classification
A classification system can be used to describe the strength of a relationship based on the size of the correlation coefficient (see below)

7. Summary exercises
Age and Body mass index (BMI)
Researchers want to assess the relationship between age and body mass index (BMI) amongst a sample of 50 participants. Data is stored in the external file called “AgeBMI.csv”, and variables are named “BMI” and “Age” - but the data has been imported already into the dataframe AgeBMI
Exercise 7.1.1.1
Calculate the correlation coefficient between age and BMI using the cor() command.
Reminder: The dataset “AgeBMI.csv” has already been imported
cor(...$..., ...$...)
cor(AgeBMI$Age, AgeBMI$BMI)
Exercise 7.1.1.2
Exercise 7.1.1.3
Now run a linear regression model which regresses BMI on age, and save the output in the object “AgeBMI.mod”. Use the lm() command and inspect the output using the summary() command
Reminder: when running the lm()
command, the y-variable is specified before the ~ and the
x-variable follows.
AgeBMI.mod=lm(...~..., data =...)
summary(AgeBMI.mod)
AgeBMI.mod=lm(BMI~Age, data =AgeBMI)
summary(AgeBMI.mod)
Note 1: The “Residual standard error” presented in the output above represents the Standard error of the estimate
Note 2: To build the regression equation, pull the values out of the “Estimate” column under “Coefficients” in the output. The resulting regression equation is shown below.
\[ \begin{aligned} BMI = 0.26Age + 21 \end{aligned} \]
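Rather than typing the rounded estimates by hand, the coefficients can be pulled out of the fitted object and used for prediction; below is a minimal sketch (the age of 40 is just an illustrative value):
coef(AgeBMI.mod)                                      # intercept and Age coefficient
predict(AgeBMI.mod, newdata = data.frame(Age = 40))   # predicted BMI for a 40-year-old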
8. Advanced Section
Hypothesis testing can be applied to regression and correlation analyses much like we have seen it applied in the Z-test and t-test. In the case of regression and correlation coefficients:
The Null hypothesis (\(H_{0}\)), states that the coefficient is not different from zero. In other words, there is no relationship between two variables.
The Alternate hypothesis, (\(H_{1}\)), states that the coefficient is different from zero. In other words, there is a relationship between two variables.
Exercise 8.1.1.1
Exercise 8.1.1.2
Use the cor.test() command to test whether there is a relationship between age and BMI. Assume a non-directional hypothesis
cor.test(...$..., ...$..., alternative = "...")
cor.test(AgeBMI$Age, AgeBMI$BMI, alternative = "two.sided")
After the correlation or regression coefficient (depending on what one is interested in testing) is calculated, it can either be compared to a critical coefficient value, or one can obtain the associated p-value in order to assess whether the coefficient lies within the rejection region. In the case of the analysis in R, the p-value is computed directly, and so we do not need to compare the coefficient to a critical value.
Exercise 8.1.1.3
Free-form exercise
Move on to and complete Statistics Tutorial Assignment 7 on Amathuba (Activities | Assignments)
Other resources & references
Developed by: Marilyn Lake & Colin Tredoux, UCT