
1. Introduction to R, Rstudio & learnr

Welcome to R!

R is both a programming language and free, open-source software that offers an environment for analysing data.

For this course, we will be operating R through a useful integrated development environment (IDE) known as RStudio, which you can access online through RStudio Cloud (RStudio Cloud weblink). Alternatively, you can download both R (R download for Windows, Mac OS and Linux) and RStudio (RStudio download) onto your personal laptop or computer for use off-campus.

Alongside this tutorial, you should read Tutorials 1, 2 & 3 in Tredoux, C.G. & Durrheim, K.L. (2018). Numbers, hypotheses and conclusions. 3rd edition. Cape Town: Juta Publishers (the 2nd edition may also be used).

1.1. Important information: Statistics tutorial components

You are currently looking at a guided learnr tutorial, which is a file that consists of both text (what you are currently reading) and several exercises that can be performed in R. You will also be working with another file type, the R markdown, which operates in the same way as a learnr tutorial but gives the user more freedom to alter it. You will be utilizing both kinds of files throughout this course.

You will be completing a total of 7 statistics tutorials in this course, and each tutorial will consist of two parts:

  1. A guided tutorial
  • This is a learnr tutorial that will introduce you to a new topic area
  2. A free-form R markdown exercise to be downloaded from Amathuba (Activities | Assignments)
  • This is an application exercise, completed in R markdown, in which you apply what you have learnt in the learnr tutorial

You will upload one file to Amathuba for each tutorial, namely your free-form R markdown exercise.

1.2. Guided tutorial features

Guided tutorials in learnr have two important features that allow us to handle data: the R code chunk and the R code. A code chunk is the entire block in which code can be executed; you will see an example in exercise 1.2.1. R code refers to the set of words and symbols written inside the code chunk, which is applied to the data in order to transform it.

Let’s take a look at an example!

Trait anxiety. We have a dataset called affect, which captures scores on personality, mood and trait anxiety measures.

Exercise 1.2.1

Click the “run code” button in the right-hand corner of the R code chunk to see the dataset called affect, which is also an object.

affect

In the code chunk below, the code plots a histogram of trait anxiety scores, which are stored in the variable named traitanx.

Exercise 1.2.2

Run the following R code chunk, and check that a histogram of trait anxiety scores is produced.

hist(affect$traitanx)

For exercise 1.2.2, we provided the R code hist(affect$traitanx) that acts as an instruction for R to generate a histogram of anxiety scores.

R code is the language we use to communicate with R about how to transform our data

1.3. Layout of RStudio environment

Like learnr, RStudio also allows us to run code in order to transform data, by using R markdown documents. In addition, RStudio has several other helpful features, including the R environment, R console and R files/help.

Exercise 1.3.1

For an introduction to RStudio features, watch the video below.

Note: You’ll be working with RStudio for the free-form tutorial

1.4. R programming language

R is a programming language, which means that it has its own set of words and symbols that instruct R to complete certain tasks. This is also referred to as R code.

All R code consists of 3 major components:

  1. R objects
  • R objects store information, such as data, and are named in order to allow for easy access to the stored contents. For example, in Exercise 1.2.1, personality, mood and trait anxiety data are stored in an R object called affect.

Exercise 1.4.1

Take a look at the data stored inside the R object named “affect”. (In this case, the only code one needs to provide is the name of the R object, in order to obtain its contents.)

Note: The R object called “affect” consists of 4 columns of data, each displaying a different measured variable

affect

Exercise 1.4.2

Insert and run the appropriate code in the code chunk below, which allows you to look inside the object named “survey”

Hint 1: Type the name of the R object in which the data is stored

Hint 2: R is case sensitive (e.g. SURVEY is different from survey)

survey

The contents of an R object can easily be stored under a new name in R

Exercise 1.4.3

Run the code chunk below, which stores the contents of the object “affect” under the new name “emotion”

Note: This code takes an existing named object (on the right-hand side of the equals sign) and assigns its contents to a new name (given on the left-hand side of the equals sign). The original object, “affect”, is not deleted and remains available under its old name.

emotion = affect

Run the following code chunk to take a look at the newly named object, “emotion”

emotion
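
A minimal sketch of what this assignment does, using a small made-up dataframe (the name affect_demo and its values are hypothetical, not the tutorial's data). Assignment stores the contents under a second name; the first name keeps working until it is removed with rm().

```r
# Hypothetical stand-in for the tutorial's affect dataframe
affect_demo <- data.frame(traitanx = c(24, 41, 37),
                          BDI      = c(0.05, 0.35, 0.21))

# Store the same contents under a second name
emotion_demo = affect_demo

# Both names now exist and hold identical contents
identical(emotion_demo, affect_demo)   # TRUE
exists("affect_demo")                  # TRUE

# Optional: remove the old name so that only the new one remains
rm(affect_demo)
exists("affect_demo")                  # FALSE
```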
  2. R commands/functions
  • R commands/functions allow us to apply operations to data that transform it in some way. A command is usually given a name that describes the kind of operation it performs, and is followed by an opening and closing parenthesis (). head() is one example of an R command; by default it prints the first 6 rows of data/pieces of information in an R object. Other examples include: mean(), median(), sd()

Exercise 1.4.4

Run the code chunk below, which uses the head() command to display the first 6 rows of data in the R object called “emotion”.

Note: Notice how the output differs in exercise 1.4.4 relative to exercise 1.4.1

head(emotion)
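
The difference between printing a whole object and printing its head can be checked by counting rows. The sketch below uses mtcars, a dataset that ships with R, as a stand-in for the tutorial's data.

```r
# mtcars is a built-in dataset with 32 rows
nrow(mtcars)                 # 32 rows in the full object

# head() keeps only the first 6 rows by default
first_rows <- head(mtcars)
nrow(first_rows)             # 6
```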
  3. R arguments
  • R arguments are the inputs to an R command. Some must be specified by the user, whilst others are set by default, but every required argument must have a value in order for the R command to run. For example, one can only run the head() command when it is applied to an R object.

Exercise 1.4.5

Try to run the code chunk below, from which the R object emotion has been removed.

head()

The error message produced tells us that “x” is missing, with no default. “x” in this case is the R object, more specifically, a dataframe.
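
As a rough illustration, the error can be captured and inspected with tryCatch(), and it disappears once the missing argument is supplied. The built-in mtcars dataset is used here only as a convenient stand-in for a dataframe.

```r
# Capture the error raised by calling head() without its required argument
msg <- tryCatch(head(), error = function(e) conditionMessage(e))
msg

# Supplying x (here the built-in mtcars dataset) removes the error
head(x = mtcars)
```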

Most R commands take one or more arguments, and an easy way to learn more about the arguments is to look at the associated help file for the command.

NB: Add ? before a command to retrieve the help file for that command

Exercise 1.4.6

  • Retrieve the help file for the head() command, and read the Arguments section of the help file. The help may appear in RStudio, not in this browser.
?head()

1.5. R as an interactive calculator

R also acts as an interactive calculator for arithmetic operations (as well as more advanced operations)

Exercise 1.5.1

Run the following arithmetic operation in the R code chunk below

5 + 2

Hint: Type the sum into the code chunk

5 + 2

Exercise 1.5.2

Now try division

10/2

## [1] 5

Exercise 1.5.3

Let’s try a more difficult multiplication operation that would be difficult to do in one’s head

134*223

134*223

Useful symbols/operators:

  • + (plus)
  • - (minus)
  • * (multiplication)
  • / (division)
  • == (equal to)
  • != (not equal to)
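
A short sketch of these operators in use. Note that in R a comparison for equality is written with a double equals sign (==), whereas a single equals sign assigns a value to a name.

```r
7 - 2      # subtraction: 5
3 * 4      # multiplication: 12
10 / 4     # division: 2.5
5 == 5     # is 5 equal to 5? TRUE
5 != 2     # is 5 different from 2? TRUE
```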

1.6. R scripts/markdowns

Whilst R gives you the power to execute the multiplication operation 134*223, you may want to save the answer to use later. R scripts and R markdowns are text files that give you the freedom to save your R code in order to run it later. They also allow you to add descriptions of what your code is doing (commonly referred to as comments), using the # symbol. Let’s look at a video that shows us how R scripts and markdowns work.
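
A minimal sketch of the kind of lines one might save in a script: a comment introduced by #, and the answer stored in an object for later use. The object name product is only an illustration.

```r
# Store the answer in an object so that it can be reused later
product <- 134 * 223

# Print the stored value
product          # 29882

# Reuse it in a further calculation
product / 2      # 14941
```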

2. Inspecting data in R

R easily allows us to visually inspect our data (remember that this is referred to as an object in R), both in the form of plots and dataframes.

Influence of media on attitudes. The dataframe named media contains data from a study that investigated the influence of media on attitudes. Each column of the dataframe represents a different variable; see the table below for a description of each variable.

Column name   Description
cond          Experimental condition: 0 = low media importance, 1 = high media importance
pmi           Presumed media influence (based on the average of two items)
import        Importance of the issue
reaction      Subjects rated agreement about possible reactions to the story (mean of 4 items)
gender        1 = male, 2 = female
age           Age in years

Note: If three dots (…) are presented in any of the code chunks below, assume that some component of the code is intentionally missing. Replace the three dots with the appropriate object/command or argument.

Exercise 2.0.1

Complete and run the code chunk below to present the first 6 rows of the “media” dataframe using the head() command

Note: Replace the three dots (…) with the name of the dataframe of interest

Hint: See exercise 1.4.4 for another example

head(...)
head(media)

Another way of inspecting a dataframe is to look at its underlying structure (i.e. to see the variables present in the dataframe) without looking at the raw data, using the str() command

Exercise 2.0.2

Complete the code chunk below to inspect the underlying structure of the “media” dataframe

str(media)

One can also look at a brief summary of the central tendency (e.g. mean and median) and spread/variability (min and max) of variables in a dataframe using the summary() command

Exercise 2.0.3

Complete the code chunk below, which provides a summary of all the variables in the “media” dataframe

summary(media)
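
To see what str() and summary() report without the tutorial's data, here is a small sketch on a made-up dataframe. The name media_demo and its values are hypothetical; only the column names cond and pmi are borrowed from the table above.

```r
# Hypothetical mini dataframe with two of the column names used above
media_demo <- data.frame(cond = c(0, 1, 0, 1),
                         pmi  = c(5.5, 6.0, 4.5, 7.0))

str(media_demo)       # variable names, types and first values
summary(media_demo)   # min, quartiles, median, mean and max per column

# summary() also works on a single column
pmi_summary <- summary(media_demo$pmi)
pmi_summary["Mean"]   # 5.75
```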

Exercise 2.0.4

Plot a histogram of the presumed media influence (variable name is pmi).

Note: the $ symbol allows you to pull out a column of data from a dataframe, which is an R object. In other words, you will need to specify both the dataframe name, followed by the name of specific column of data you would like to plot

Hint 1: media is the name of the R object and pmi is the name of the column

Hint 2: See exercise 1.2.2 for an example of how to plot a column of data as a histogram

hist(...$...)
hist(media$pmi)

What does this histogram tell us? Given that higher pmi scores represent greater presumed media influence, and that high pmi scores occur more frequently than low pmi scores, the distribution suggests that presumed media influence is fairly high amongst participants.

2.1. Reading in dataframes

In order to inspect data, we first need to import or “read in” the data into R, in the form of dataframes.

Sometimes the data already exists within R, as part of an R package. If this is the case, we use the data() function to read the data into R.

Exercise 2.1.1

Check that the example below presents only the first 6 rows of data

head(affect)

Exercise 2.1.2

Complete the code to present the first 10 rows of data

Hint 1: The head() command has an argument called “x”, which takes the name of the dataframe, and another argument called “n”, which takes the number of rows of the dataframe to display. By default, 6 rows are presented unless n is changed.

Hint 2: Pay attention to how things like commas, fonts and colours can be used to separate arguments

head(x=...,n=...)
head(x= affect, n=10)
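
A brief sketch of the two ways of supplying arguments, using the built-in mtcars dataset as a stand-in. When arguments are named, their order does not matter; when they are not named, R matches them by position.

```r
# Named arguments: the order does not matter
ten_rows <- head(n = 10, x = mtcars)
nrow(ten_rows)                       # 10

# Positional arguments: x comes first, then n
three_rows <- head(mtcars, 3)
nrow(three_rows)                     # 3
```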

When the data does not already exist in R, there are R commands available for us to use in order to read in data files. The commands differ depending on the file format. For example, read.csv() allows us to read in comma-separated values files with the .csv extension.

Child IQ scores on WISC. We have an external csv dataset called wiscsem, which captures children’s scores on the Wechsler Intelligence Scale for Children (WISC).

Exercise 2.1.3

Run the code in the code chunk below that reads in a .csv file containing children’s scores on the WISC. The file is named “wiscsem.csv”

wiscsem <- read.csv("wiscsem.csv")

Also take note of the assignment operator <-, which allows us to assign values/data to named R objects. In this case, we are reading in the data from the .csv file and assigning/storing it in the R object called wiscsem (more specifically, a dataframe in this case).

Hint: for assignments like this one, the <- operator can be interchanged with =. A shortcut for the ‘<-’ operator on Windows is “alt -” and on Mac is “option -”
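
The read-in step can be tried without the course file by first writing a small made-up table to a temporary .csv file. Everything in this sketch (demo_scores, csv_path and the values) is hypothetical and only illustrates read.csv() together with the two assignment operators.

```r
# Hypothetical scores, written to a temporary .csv file for illustration
demo_scores <- data.frame(client = 1:3, vocab = c(10, 12, 9))
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_scores, csv_path, row.names = FALSE)

# Read the file back in and store it in a named object
scores_arrow  <- read.csv(csv_path)   # using <-
scores_equals = read.csv(csv_path)    # using =

identical(scores_arrow, scores_equals)   # TRUE
str(scores_arrow)
```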

Exercise 2.1.4

Read in the wiscsem.csv file and store it into an R object called wiscsem, using the = operator.

Hint: See exercise 2.1.3 for how to read in the wiscsem.csv file

wiscsem = read.csv("wiscsem.csv")

Exercise 2.1.5

Check the structure of the dataframe wiscsem, using the str function

str(...)
str(wiscsem)

3. Measures of central tendency (CT)

When we collect information/data from a large group of individuals, we may be interested in summarizing the data in order to make sense of it on a group level, rather than just at an individual level.

Child IQ scores on WISC. For example, the Wechsler Intelligence Scale for Children (WISC) was used to measure vocabulary (along with other subscales) in 175 children. Vocabulary and the other intelligence scores are stored in the dataframe named wiscsem. For descriptions of all variables contained in wiscsem, see below

Note: For all intelligence subscales, higher scores reflect higher intelligence

Column name   Description
client        Child identification number
agemate       Age in years
info          Information
comp          Comprehension
arith         Arithmetic knowledge
simil         Similarities
vocab         Vocabulary
digit         Digit span
pictcomp      Picture comprehension
parang        Picture arrangement
block         Block design
object        Object assembly
coding        Coding

Exercise 3.0.1

Run the code chunk below and look at the pattern of vocabulary scores of all children participating in this study

Note: We can add a label to the x-axis of the histogram using the xlab argument, and a heading for the graph using the main argument. Notice that each label is placed within quotation marks (“ ”)

hist(wiscsem$vocab, xlab = "Vocabulary scores", main = "Distribution of vocabulary scores on the WISCSEM measure for children")

From the histogram, we can see that few children scored very high or very low on the vocabulary measure, and that the majority of children scored approximately half-way in between.

In this example, because the majority of children’s vocabulary scores lie in the center, measures of central tendency (CT) can be used to quantify where that center lies. There are several measures of CT, the most popular being the mean, median and mode

3.1. Mean

The mean or average measures CT by dividing the sum of all data points/scores by the total number of data points/scores. See pages 48-52 of Numbers, Hypotheses & Conclusions (3rd Ed)

Let’s look at some examples.

Suppose we measure children on a spatial awareness task, where scores range from 0-10 and higher scores represent better spatial awareness capabilities.

We have 3 children’s spatial awareness scores:

5, 8, 2

Exercise 3.1.1

Calculate the mean spatial awareness score

\[\overline{x} = \frac{1}{n} \sum x\]

(5+8+2)/3

Alternatively, we can use the R command mean() to calculate the average

Exercise 3.1.2

Check to see the mean vocabulary score for children on the WISC, using the mean() command

Hint: First insert the name of the dataframe, followed by the name of the variable/column of interest, separated by a dollar sign ($)

mean(...$...)
mean(wiscsem$vocab)
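
As a quick check that the hand calculation and the mean() command agree, the sketch below uses the three spatial awareness scores from above. The object name spatial_scores is only an illustration.

```r
spatial_scores <- c(5, 8, 2)     # the three scores from the example above

# By hand: sum of the scores divided by the number of scores
sum(spatial_scores) / length(spatial_scores)   # 15 / 3 = 5

# Using the built-in command
mean(spatial_scores)                           # 5
```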

3.2. Median

The median is another measure of CT, which identifies the middle value within a data sample, thus splitting the sample into a lower and an upper half. See pages 52-55 of Numbers, Hypotheses & Conclusions (3rd Ed)

Exercise 3.2.1

Check to see the median vocabulary score for children on the WISC, using the median() command

median(...$...)
median(wiscsem$vocab)
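
A small sketch of how the middle value is found, using made-up scores. With an odd number of scores the median is the middle score after sorting; with an even number it is the average of the two middle scores.

```r
median(c(5, 8, 2))       # sorted: 2 5 8, so the middle value is 5
median(c(5, 8, 2, 9))    # sorted: 2 5 8 9, so the average of 5 and 8 is 6.5
```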

3.3. Mode

The mode is a measure of CT that identifies the value that appears most often in a data sample.

Exercise 3.3.1

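Check to see the mode vocabulary score for children on the WISC, using the mode() function

Note: In base R, mode() reports how an object is stored (e.g. "numeric") rather than its most frequent value, so the output will only be the most frequent score if the tutorial supplies its own mode() function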

mode(...$...)
mode(wiscsem$vocab)
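
Outside this tutorial, base R's mode() returns the storage type of an object rather than its most frequent value. One possible way to obtain the statistical mode, sketched here on made-up scores, is to count the values with table() and pick the most frequent one; the names scores, counts and stat_mode are hypothetical.

```r
# Hypothetical scores in which 5 occurs most often
scores <- c(3, 5, 5, 7, 5, 3)

mode(scores)     # "numeric": the storage type, not the statistical mode

# Count how often each value occurs, then keep the most frequent value(s)
counts <- table(scores)
stat_mode <- as.numeric(names(counts)[counts == max(counts)])
stat_mode        # 5
```

If several values are tied for the highest count, this sketch returns all of them.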

Let’s now visualize all three measures of CT in a histogram

You should notice that the values for each measure of CT are not identical in this case, and this is due to two reasons

  1. Each measure of CT is computed differently
  2. The distribution of scores in the current sample is not perfectly symmetric. With a sufficiently large sample from a symmetric distribution with a single peak, the different CT estimates will converge to the same value; for a skewed distribution they will remain different
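
That the measures are computed differently is easiest to see in a small, lopsided sample. The values below are made up for illustration.

```r
# Hypothetical sample with one extreme value
skewed <- c(1, 2, 3, 4, 100)

mean(skewed)     # 22, pulled upwards by the extreme value
median(skewed)   # 3, unaffected by the extreme value
```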

3.4. Central tendency (CT) measures and variable types

Use of certain measures of CT may be more appropriate depending on the type of variable in question.

Variable types include:

Nominal

A variable that is represented by categories that have no inherent numeric value, and is best summarised using the mode. For example: gender, eye colour, media genre preference, country of origin. None of these variables has a numerical answer, but knowing which category is the most common (i.e. the mode) can tell you something about your data.

Ordinal

A variable that is represented by categories that can be ordered in some logical way, and is best summarised using the median. For example: levels of satisfaction, levels of income, severity levels of illness. Again, these variables have no “correct” numerical answer, but they are ordered, and therefore knowing the middle (i.e. the median) will give you more information about your data.

Scale/continuous

A variable that is quantified by numeric values, and is best summarised using the mean. For example: weight, test scores, temperature. These variables do have a numerical answer that can tell you things like the high/low observations in your data, and what the average (i.e. the mean) across your data is.

4. Measures of variability

Whilst we may be interested in summarizing where the majority of data points lie, with help from measures of CT, we may also be interested in knowing to what extent data points deviate from each other.

Take a look back at the affect dataframe (see section 1. Introduction to R, Rstudio & learnr). The object consists of the following variables:

Column name   Description
traitanx      Trait anxiety scores
BDI           Depression scores based on the Beck Depression Inventory (BDI)
posaffect     Positive affect/emotion scores; higher scores represent higher positive affect
negaffect     Negative affect/emotion scores; higher scores represent higher negative affect

Exercise 4.0.1

Run the code chunk below and look at the differences in the distribution of scores for positive and negative affect.

Hint: par() command allows us to present two graphs at one time. The “mfrow” argument allows you to specify the number of rows and columns of graphs, where number of rows are specified before the number of columns. In this example, we specified 1 row and 2 columns

par(mfrow= c(1,2)) 
hist(affect$posaffect, xlab = "Positive affect", main ="Distribution of positive affect scores")
hist(affect$negaffect, xlab = "Negative affect", main ="Distribution of negative affect scores")

You should be able to notice that the empirical (data) distributions for positive and negative affect look somewhat different from each other. Notably for negative affect, there appears to be a somewhat smaller variation in scores from zero (which appears to be the mean), than for positive affect, where scores deviate more considerably from zero.

4.1. Variance & standard deviation

One way of measuring variability in data is through variance. For a sample, variance is calculated as the sum of squared differences between the data points and the mean, divided by the number of data points minus one (n - 1). Standard deviation is another measure of variability, calculated as the square root of the variance. See pages 55-61 of Numbers, Hypotheses & Conclusions (3rd Ed)

Remember that we have 3 children’s spatial awareness scores (presented below):

5, 8, 2

And we previously obtained a mean of 5

Exercise 4.1.1

Calculate the variance associated with spatial awareness scores

\[s^2 = \frac{\sum (x-\overline{x})^2}{n-1}\]

Note: The ^ operator allows us to raise a value to some power. In this case we are raising to the power of 2

((5-5)^2 + (8-5)^2 + (2-5)^2)/(3-1)

Exercise 4.1.2

Now square root the variance you just calculated in order to obtain the standard deviation

Note: sqrt() command allows us to square root a value

var.spatial=((5-5)^2 + (8-5)^2 + (2-5)^2)/(3-1)
sqrt(var.spatial)

Like with measures of CT, we also have R commands at our disposal to measure variability, including the var() and sd() R commands

Exercise 4.1.3

Complete the code chunk below and check to see that you obtain the same variance of spatial awareness scores as calculated above, but now using the var() command

Note 1: The c() command allows us to combine values, which we can then store in an R object

Note 2: A new object named spatial.aware has been created in the code chunk below, which stores 3 data points. One can just insert the name of the object as an input into the var() command

spatial.aware= c(5,8,2)
var(...)
spatial.aware= c(5,8,2)
var(spatial.aware)

Note: Notice that the “spatial.aware” object is not a dataframe, because it does not consist of several columns of data, but is rather made up of just one column of data (a vector). The implication here is that the dollar sign ($) is not needed: the name spatial.aware already refers directly to the column of data

Exercise 4.1.4

Now complete the code chunk below to calculate the standard deviation of spatial awareness scores, using the sd() command.

Remember that the spatial awareness scores are:

5, 8 and 2

Hint: Look at exercise 4.1.3 for guidance

... = c(...,...,...)
sd(...)
spatial.aware= c(5,8,2)
sd(spatial.aware)
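
For reference, a sketch that works through the sample variance formula (dividing by n - 1) step by step and compares the result with var() and sd(). The intermediate names deviations, var_by_hand and sd_by_hand are only illustrations.

```r
spatial.aware <- c(5, 8, 2)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
deviations <- spatial.aware - mean(spatial.aware)    # 0, 3, -3
var_by_hand <- sum(deviations^2) / (length(spatial.aware) - 1)
var_by_hand                                          # 9

# Standard deviation: square root of the variance
sd_by_hand <- sqrt(var_by_hand)
sd_by_hand                                           # 3

# The built-in commands give the same values
var(spatial.aware)   # 9
sd(spatial.aware)    # 3
```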

4.2. Range & Interquartile range (IQR)

Other measures of variability include the range and interquartile range. The range is calculated as the difference between the maximum and minimum data points. The interquartile range is calculated as the difference between the 3rd and 1st quartiles (i.e. the variation associated with the middle 50% of data). See pages 55-61 of Numbers, Hypotheses & Conclusions (3rd Ed)

Exercise 4.2.1

Complete the following code below and calculate the range of vocabulary scores for children on the WISC

Hint 1: First insert the name of the dataframe, followed by the name of the variable/column of interest, separated by a dollar sign ($)

Hint 2: The name of the column of data containing the vocabulary scores is “vocab”

range(...$...)
range(wiscsem$vocab)

Exercise 4.2.2

Now calculate the interquartile range of vocabulary scores for children on the WISC

IQR(...$...)
IQR(wiscsem$vocab)
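
One detail worth knowing: the range() command returns the minimum and maximum rather than their difference, so the range as a single number needs one extra step. A sketch on made-up scores (the name demo is hypothetical):

```r
demo <- c(2, 4, 6, 8, 10)          # hypothetical scores

range(demo)                        # 2 10: the minimum and maximum
diff(range(demo))                  # 8: maximum minus minimum
quantile(demo, c(0.25, 0.75))      # 1st and 3rd quartiles: 4 and 8
IQR(demo)                          # 4: 3rd quartile minus 1st quartile
```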

5. Advanced section

5.1. Variable types in R

When reading data into R (either as single variables or entire data frames with multiple variables), we will need to specify variable types.

A survey was conducted with statistics students from the University of Adelaide, where several variables were measured, including: gender, age and smoking frequency

Note: There are more variables contained in the dataframe than those already mentioned

Exercise 5.1.1

Inspect the “survey” data frame to see how variables have been measured after being imported into R (i.e. types of data), using the str() command

Hint: str() allows us to identify how each variable has been classified (i.e. its variable type). See exercise 2.0.2

str(...)
str(survey)

All variables (i.e. each column of data) are classified as numeric or scale variables, which is indicated by the shorthand num

However, some variables in the survey data frame should not be classified as numeric, including Gender

We have already discussed that variables can come in the form of nominal, ordinal and scale/continuous variables, but how do we represent these in R?

Variable type      Associated variable types in R
Scale/continuous   numeric() or integer()
Nominal            factor() or character()
Ordinal            factor() or character()

If we sought to alter the Gender variable from a scale to a nominal variable, we could utilize the as.factor() command.

Exercise 5.1.2

Run code chunk below to convert the Gender variable type from numeric to factor, and check that the variable type has successfully been changed to factor

Note: Within the survey data frame, self-reported males and females are assigned the placeholder values 1 and 2 respectively

Hint: Use str() command to check variable type.

survey$Gender <- ifelse(survey$Gender == "1", "Male", "Female")
survey$Gender<-as.factor(survey$Gender)

...(...$...)
survey$Gender<-as.factor(survey$Gender)
str(survey$Gender)

Instead of representing self-reported male and female genders using the placeholder values, 1 and 2 respectively, each participant could be named either male or female

Exercise 5.1.3

Check to see that the placeholder values for Gender (1 and 2) have been replaced by character strings (male and female)

survey$Gender <- ifelse(survey$Gender == "1", "Male", "Female")
survey$Gender<-as.factor(survey$Gender)

...(...$...)
str(survey$Gender)
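
As an alternative sketch, the conversion and the labelling can be done in one step with factor(), shown here on made-up placeholder codes rather than the survey data. The names gender_codes and gender_factor are hypothetical.

```r
# Hypothetical placeholder codes: 1 = male, 2 = female
gender_codes <- c(1, 2, 2, 1)

# Convert to a factor and attach a label to each code
gender_factor <- factor(gender_codes,
                        levels = c(1, 2),
                        labels = c("Male", "Female"))

str(gender_factor)      # Factor w/ 2 levels "Male","Female"
table(gender_factor)    # counts per category: Male 2, Female 2
```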

Exercise 5.1.4

Complete the code chunk in order to look at the structure (i.e. variable types) of the wiscsem data frame, and alter the vocabulary scores to an integer.

Hint 1: Check the underlying structure first

Hint 2: If not already of type integer convert the variable type to integer

...(...$...)

...$... = as...(...$...)
str(wiscsem$vocab)

wiscsem$vocab = as.integer(wiscsem$vocab)
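
A short sketch of what as.integer() does to a numeric variable, on made-up scores (vocab_demo and vocab_int are hypothetical names). Note that decimal values are truncated rather than rounded, so the conversion is safest when the scores are whole numbers.

```r
vocab_demo <- c(10, 12.7, 9)      # hypothetical scores stored as numeric

str(vocab_demo)                   # num
vocab_int <- as.integer(vocab_demo)
str(vocab_int)                    # int
vocab_int                         # 10 12 9 (12.7 is truncated, not rounded)
```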

Free-form exercise

Move on to and complete Statistics Tutorial Assignment 1 on Amathuba (Activities | Assignments)

Other resources & references

WISC-R subscale data from Tabachnick, B. G., & Fidell, L. S. (1996). Using Multivariate Statistics (3rd ed.). New York: Harper Collins.

Developed by: Marilyn Lake & Colin Tredoux

PSY2015F tut 1: Central tendency and variability