1. Introduction to R, RStudio & learnr
Welcome to R!
R is both a programming language and free, open-source software that provides an environment for analysing data.
For this course, we will operate R through an integrated development environment (IDE) known as RStudio, which you can access online through RStudio Cloud (RStudio Cloud weblink). Alternatively, you can download both R (R download for Windows, Mac OS and Linux) and RStudio (RStudio download) onto your personal laptop or computer for use off-campus.
Alongside this tutorial, you should read Tutorials 1, 2 & 3 in Tredoux, C.G. & Durrheim, K.L. (2018). Numbers, hypotheses and conclusions. 3rd edition. Cape Town: Juta Publishers (the 2nd edition could also be used).
1.1. Important information: Statistics tutorial components
You are currently looking at a guided learnr tutorial, which is a file that consists of both text (what you are currently reading) and several exercises that can be performed in R. You will also be working with another file type, the R markdown document, which operates in the same way as a learnr tutorial but gives the user more freedom to alter it. You will be using both kinds of files throughout this course.
You will be completing a total of 7 statistics tutorials in this course, and each tutorial will consist of two parts:
- A guided tutorial
- This is a learnr tutorial that will introduce you to a new topic area
- A free-form R markdown exercise to be downloaded from Amathuba (Activities | Assignments)
- This is an application exercise, completed in R markdown, which applies what you have learnt in the learnr tutorial
You will upload one file to Amathuba for each tutorial, namely your free-form R markdown exercise.
1.2. Guided tutorial features
Guided tutorials in learnr have two important features that allow us to handle data: the R code chunk and the R code. The code chunk is the entire block in which code can be executed; you will see an example in exercise 1.2.1. R code refers to the words and symbols typed inside the code chunk, which are applied to the data in order to transform it.
Let’s take a look at an example!
Trait anxiety. We have a dataset called affect, which captures scores on personality, mood and trait anxiety measures.
Exercise 1.2.1
Click the “run code” button in the right-hand corner of the R code chunk to see the dataset called affect, which is also an object.
affect
In the code chunk below, the code plots a histogram of trait anxiety scores, which are stored in the variable named traitanx.
Exercise 1.2.2
Run the following R code chunk, and check that a histogram of trait anxiety scores is produced.
hist(affect$traitanx)
For exercise 1.2.2, we provided the R code hist(affect$traitanx), which acts as an instruction for R to generate a histogram of trait anxiety scores.
R code is the language we use to communicate with R about how to transform our data.
1.3. Layout of RStudio environment
Like learnr, RStudio also allows us to run code in order to transform data, by using R markdown documents. In addition, RStudio has several other helpful features, including the R environment, R console and R files/help.
Exercise 1.3.1
For an introduction to RStudio features, watch the video below.
Note: You’ll be working with RStudio for the free-form tutorial
1.4. R programming language
R is a programming language, which means that it has its own set of words and symbols that instruct R to complete certain tasks. This is also referred to as R code.
All R code consists of 3 major components:
- R objects
- R objects store information, such as data, and are named to allow easy access to their stored contents. For example, in Exercise 1.2.1, personality, mood and trait anxiety data are stored in an R object called affect.
Exercise 1.4.1
Take a look at the data stored inside the R object named “affect” (In this case, the only code one needs to provide is the name of the R object, in order to obtain its contents)
Note: The R object called “affect” consists of 4 columns of data, each displaying a different measured variable
affect
Exercise 1.4.2
Insert and run the appropriate code in the code chunk below, which allows you to look inside the object named “survey”
Hint 1: Type the name of the R object in which the data is stored
Hint 2: R is case sensitive (e.g. SURVEY is different from survey)
survey
The contents of an R object can easily be stored under a new name in R
Exercise 1.4.3
Run the code chunk below, which stores the contents of the object “affect” under the new name “emotion”
Note: This code takes an existing named object (on the right-hand side of the equals sign) and assigns its contents to a new name (given on the left-hand side of the equals sign). The original object, “affect”, remains available under its old name.
emotion = affect
Run the following code chunk to take a look at the newly named object, “emotion”
emotion
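As a minimal sketch of how assignment behaves, the chunk below uses a small made-up data frame (the names scores and renamed, and all values, are hypothetical and not part of the course data). It suggests that assigning an object to a new name creates a second object with the same contents, while the original name keeps working.

```r
# Hypothetical data frame standing in for a course dataset
scores <- data.frame(id = 1:3, traitanx = c(30, 42, 25))

# Store the same contents under a second name
renamed <- scores

# Both names now exist and hold identical contents
both_exist <- exists("scores") && exists("renamed")
same_contents <- identical(scores, renamed)
both_exist
same_contents
```

If only the new name is wanted, the old object could afterwards be removed with rm(scores).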
- R commands/functions
- R commands/functions allow us to apply operations to data that transform it in some way. A command is usually given a name that describes the kind of operation it performs, and the name is followed by an opening and closing parenthesis (). head() is one example of an R command, which prints the first 6 rows of data in an R object. Other examples include: mean(), median(), sd()
Exercise 1.4.4
Run the code chunk below, which uses the head() command to display the first 6 rows of data in the R object called “emotion”.
Note: Notice how the output differs in exercise 1.4.4 relative to exercise 1.4.1
head(emotion)
- R arguments
- R arguments are the inputs to an R command. Some must be specified by the user, whilst others are set by default, but every required argument must have a value before the R command can run. For example, one can only run the head() command when it is applied to an R object.
Exercise 1.4.5
Try and run the code chunk below with the R object emotion removed.
head()
The error message produced tells us that “x” is missing, with no default. “x” in this case is the R object, more specifically, a dataframe.
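The sketch below illustrates the role of a required argument using a made-up data frame (demo and its values are hypothetical). It calls head() once with the x argument supplied and once without it, using tryCatch() to capture the error message as text instead of stopping the chunk.

```r
# Hypothetical data frame with 8 rows
demo <- data.frame(id = 1:8, score = c(4, 7, 5, 9, 6, 3, 8, 2))

# With the x argument supplied, head() returns the first 6 rows
first_rows <- head(demo)
nrow(first_rows)

# Without x, head() fails; tryCatch() stores the error message as text
msg <- tryCatch(head(), error = function(e) conditionMessage(e))
msg
```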
R commands that take arguments describe them in their help files, so an easy way to learn more about the arguments is to look at the help file for the command.
NB: Add ? before a command to retrieve the help file for that command
Exercise 1.4.6
- Retrieve the help file for the head() command, and read the Arguments section of the help file. The help may appear in RStudio, not in this browser.
?head()
1.5. R as an interactive calculator
R also acts as an interactive calculator for arithmetic operations (as well as more advanced operations)
Exercise 1.5.1
Run the following arithmetic operation in the R code chunk below
5 + 2
Hint: Type the sum into the code chunk
5 + 2
Exercise 1.5.2
Now try division
10/2
## [1] 5
Exercise 1.5.3
Let’s try a multiplication operation that would be difficult to do in one’s head
134*223
134*223
Useful symbols/operators:
- minus: -
- multiplication: *
- division: /
- equal to: ==
- not equal to: !=
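A short sketch of these operators in use follows; the object names is_equal and is_different are illustrative only. Note that in R a test of equality is written with a double equals sign (==), whereas a single equals sign assigns a value to an object.

```r
# Arithmetic operators
5 - 2        # subtraction
134 * 223    # multiplication
10 / 2       # division

# Comparison operators return TRUE or FALSE
is_equal <- (10 / 2) == 5
is_different <- (5 + 2) != 5
is_equal
is_different
```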
1.6. R scripts/markdowns
Whilst R gives you the power to execute the multiplication operation 134*223, you may want to save the answer to use later. R scripts and markdowns are text files that give you the freedom to save your R code in order to run it later. They also allow you to add descriptions of what your code is doing (commonly referred to as comments), using the # symbol. Let’s look at a video that shows us how R scripts and markdowns work.
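As a hedged illustration of what a few lines of a saved script might look like, the sketch below stores the answer in an object and documents each step with a comment. The object names product and half_product are hypothetical.

```r
# Store the answer in an object so that it can be reused later
product <- 134 * 223

# Reuse the stored answer in a new calculation
half_product <- product / 2
half_product
```

Everything after a # on a line is ignored by R, so comments can be added freely without changing the result.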
2. Inspecting data in R
R easily allows us to visually inspect our data (remember that this is referred to as an object in R), both in the form of plots and dataframes.
Influence of media on attitudes. The dataframe named media contains data from a study that investigated the influence of media on attitudes. Each column of the dataframe represents a different variable; see the table below for descriptions of each variable.
| Column names | Description |
|---|---|
| cond | Experimental condition: 0 = low media importance, 1 = high media importance |
| pmi | Presumed media influence (based on the average of two items) |
| import | Importance of the issue |
| reaction | Subjects rated agreement about possible reactions to the story (mean of 4 items) |
| gender | 1 = male, 2 = female |
| age | Age in years |
Note: If three dots (…) are presented in any of the code chunks below, assume that some component of the code is intentionally missing. Replace the three dots with the appropriate object/command or argument.
Exercise 2.0.1
Complete and run the code chunk below to present the first 6 rows of the “media” dataframe using the head() command
Note: Replace the three dots (…) with the name of the dataframe of interest
Hint: See exercise 1.4.4 for another example
head(...)
head(media)
Another way of inspecting the underlying structure of a dataframe (i.e. to see the variables present in the dataframe) without looking at the raw data is to use the str() command.
Exercise 2.0.2
Complete the code chunk below to inspect the underlying structure of the “media” dataframe
str(media)
One can also look at a brief summary of the central tendency
(e.g. mean and median) and spread/variability (min and max) of variables
in a dataframe using the summary() command
Exercise 2.0.3
Complete the code chunk below, which provides a summary of all the variables in the “media” dataframe
summary(media)
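To show what str() and summary() report without relying on the course data, the sketch below builds a tiny made-up data frame (demo is hypothetical; its column names merely mimic two columns of the media data).

```r
# Hypothetical data frame; values are invented for illustration
demo <- data.frame(cond = c(0, 1, 0, 1), pmi = c(5.5, 6.0, 4.5, 7.0))

# str() lists each column with its type and first few values
str(demo)

# summary() reports the minimum, quartiles, median, mean and maximum
pmi_summary <- summary(demo$pmi)
pmi_summary
```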
Exercise 2.0.4
Plot a histogram of the presumed media influence (variable name is pmi).
Note: The $ symbol allows you to pull out a column of data from a dataframe, which is an R object. In other words, you will need to specify the dataframe name, followed by the name of the specific column of data you would like to plot
Hint 1: media is the name of the R object and pmi is the name of the column
Hint 2: See exercise 1.2.2 for an example of how to plot a column of data as a histogram
hist(...$...)
hist(media$pmi)
What does this histogram tell us? Considering that higher pmi scores represent greater presumed media influence, the distribution depicted in the histogram suggests that presumed media influence is relatively high amongst participants, given the higher frequency of high pmi scores relative to low pmi scores.
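The sketch below illustrates what the $ symbol extracts and what hist() counts, using a made-up data frame (demo and its values are hypothetical). Setting plot = FALSE asks hist() to return the bin counts instead of drawing the graph, which makes it easier to inspect.

```r
# Hypothetical data frame; values are invented for illustration
demo <- data.frame(cond = c(0, 1, 0, 1, 1), pmi = c(5.5, 6.0, 4.5, 7.0, 6.5))

# $ pulls one column out of the data frame as a vector of values
pmi_scores <- demo$pmi
pmi_scores

# hist() groups the values into bins; plot = FALSE returns the counts
h <- hist(pmi_scores, plot = FALSE)
h$counts
```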
2.1. Reading in dataframes
In order to inspect data, we first need to import or “read in” the data into R, in the form of dataframes.
Sometimes the data already exists within R, as part of an R package.
If this is the case, we use the data() function to read the
data into R.
Exercise 2.1.1
Check that the example below presents only the first 6 rows of data
head(affect)
Exercise 2.1.2
Complete the code to present the first 10 rows of data
Hint 1: The head() command contains an argument called “x”, which asks for the name of the dataframe, whilst the other argument, “n”, asks for the number of rows of the data frame to display. The number of rows presented is set to 6 by default unless changed.
Hint 2: Pay attention to how things like commas, fonts and colours are used to separate arguments
head(x=...,n=...)
head(x= affect, n=10)
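As a small illustration of default versus user-specified arguments, the sketch below applies head() to a made-up data frame with 12 rows (demo is hypothetical). It also suggests that arguments can be matched by name or by position.

```r
# Hypothetical data frame with 12 rows
demo <- data.frame(id = 1:12)

# n is left at its default, so 6 rows are returned
default_rows <- head(demo)

# n is set explicitly by name
ten_rows <- head(x = demo, n = 10)

# The same call with arguments matched by position
same_call <- head(demo, 10)

nrow(default_rows)
nrow(ten_rows)
```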
When the data does not already exist in R, there are R commands available for us to use in order to read in data files. The commands differ depending on the file format. For example, read.csv() allows us to read in comma-separated values files with the .csv extension, whilst read_excel() (from the readxl package) reads Excel spreadsheets with the .xls or .xlsx extension.
Child IQ scores on WISC. We have an external csv dataset called wiscsem, which captures children’s scores on the Wechsler Intelligence Scale for Children (WISC).
Exercise 2.1.3
Run the code in the code chunk below that reads in a .csv file containing children’s scores on the WISC. The file is named “wiscsem.csv”
wiscsem <- read.csv("wiscsem.csv")
Also take note of the assignment operator <-, which allows us to assign values/data to named R objects. In this case, we are reading in data from a .csv file and assigning/storing it in the R object called wiscsem (more specifically, a dataframe in this case).
Hint: The <- operator can be interchanged with = for assignment. A keyboard shortcut for the ‘<-’ operator in RStudio is “Alt -” on Windows and “Option -” on Mac
Exercise 2.1.4
Read in the wiscsem.csv file and store it into an R object called wiscsem, using the = operator.
Hint: See exercise 2.1.3 for how to read in the wiscsem.csv file
wiscsem = read.csv("wiscsem.csv")
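Because the course file is not available outside the tutorial, the sketch below demonstrates read.csv() with a temporary file instead. The data frame demo, its values and the temporary file are all hypothetical; only write.csv(), read.csv() and tempfile() are standard R commands.

```r
# Hypothetical data frame to write out and read back in
demo <- data.frame(client = 1:3, vocab = c(10, 12, 9))

# Write the data to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Read the file back into R and store it in a new object
demo_in <- read.csv(csv_path)
str(demo_in)
```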
Exercise 2.1.5
Check the structure of the dataframe wiscsem, using the str function
str(...)
str(wiscsem)
3. Measures of central tendency (CT)
When we collect information/data from a large group of individuals, we may be interested in summarizing the data in order to make sense of it on a group level, rather than just at an individual level.
Child IQ scores on WISC. For example, the Wechsler Intelligence Scale for Children (WISC) was used to measure vocabulary (along with other subscales) in 175 children. Vocabulary and the other intelligence scores are stored in the dataframe named wiscsem. For descriptions of all variables contained in wiscsem, see below.
Note: For all intelligence subscales, higher scores reflect higher intelligence
| Column names | Description |
|---|---|
| client | Child identification number |
| agemate | Age in years |
| info | Information |
| comp | Comprehension |
| arith | Arithmetic knowledge |
| simil | Similarities |
| vocab | Vocabulary |
| digit | Digit span |
| pictcomp | Picture comprehension |
| parang | Picture arrangement |
| block | Block design |
| object | Object assembly |
| coding | Coding |
Exercise 3.0.1
Run the code chunk below and look at the pattern of vocabulary scores of all children participating in this study
Note: We can add a label to the x-axis of the histogram using the xlab argument, and a heading for the graph using the main argument. Notice that each descriptive label is placed within quotation marks (“”)
hist(wiscsem$vocab, xlab = "Vocabulary scores", main = "Distribution of vocabulary scores on the WISCSEM measure for children")
From the histogram, we can see that there was a low frequency of children who scored very high or low on the vocabulary measure, but the majority of children appear to score approximately half-way in between.
In this example, because the majority of children’s vocabulary scores lie in the center, measures of central tendency (CT) can be used to quantify where that center lies. There are several measures of CT, the most popular being the mean, median and mode.
3.1. Mean
The mean or average measures CT by dividing the sum of all data points/scores by the total number of data points/scores. See pages 48-52 of Numbers, Hypotheses & Conclusions (3rd Ed).
Let’s look at some examples.
Suppose we measure children on a spatial awareness task, where scores range from 0-10 and higher scores represent better spatial awareness capabilities.
We have 3 children’s spatial awareness scores:
5, 8, 2
Exercise 3.1.1
Calculate the mean spatial awareness score
\[\overline{x} = \frac{1}{n} \sum x\]
(5+8+2)/3
Alternatively, we can use the R command mean() to
calculate the average
Exercise 3.1.2
Check to see the mean vocabulary score for children on the WISC, using the mean() command
Hint: First insert the name of the dataframe, followed by the name of the variable/column of interest, separated by a dollar sign ($)
mean(...$...)
mean(wiscsem$vocab)
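To connect the formula to the mean() command, the sketch below computes the mean of the three spatial awareness scores in two ways. The object names spatial, manual_mean and builtin_mean are illustrative only.

```r
# The three spatial awareness scores from the example above
spatial <- c(5, 8, 2)

# Mean by hand: sum of the scores divided by the number of scores
manual_mean <- sum(spatial) / length(spatial)

# Mean using the built-in command
builtin_mean <- mean(spatial)

manual_mean
builtin_mean
```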
3.2. Median
The median is another measure of CT, which identifies the middle value within a data sample, thus splitting the sample into a lower and upper half. See pages 52-55 of Numbers, Hypotheses & Conclusions (3rd Ed).
Exercise 3.2.1
Check to see the median vocabulary score for children on the WISC, using the median() command
median(...$...)
median(wiscsem$vocab)
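As a brief sketch of how the median is found, the chunk below sorts a few made-up scores and takes the middle value. The fourth score (9) in even_scores is a hypothetical addition, included to show that with an even number of scores the median is the average of the two middle values.

```r
# Odd number of scores: the median is the middle value once sorted
spatial <- c(5, 8, 2)
sort(spatial)
odd_median <- median(spatial)

# Even number of scores: the median averages the two middle values
even_scores <- c(5, 8, 2, 9)
sort(even_scores)
even_median <- median(even_scores)

odd_median
even_median
```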
3.3. Mode
The mode is a measure of CT that identifies the value that appears most often in a data sample.
Exercise 3.3.1
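Check to see the mode vocabulary score for children on the WISC. Note: In base R, the built-in mode() command reports how an object is stored (e.g. “numeric”) rather than its most frequent value, so here we count how often each score occurs with table() and pick the most frequent one with which.max()
names(which.max(table(...$...)))
names(which.max(table(wiscsem$vocab)))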
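As a small illustration of finding the most frequent value, the sketch below uses a made-up vector of scores (hypothetical values). It tabulates the scores with table() and picks the most frequent one with which.max(). It also shows that base R's mode() returns the storage type of an object rather than a statistical mode. If two values tie, which.max() returns only the first.

```r
# Hypothetical scores in which 5 occurs most often
scores <- c(3, 5, 5, 7, 5, 3)

# Count how many times each value occurs
counts <- table(scores)
counts

# Pick the value with the highest count
most_common <- as.numeric(names(counts)[which.max(counts)])
most_common

# Base R's mode() describes how the object is stored
storage_type <- mode(scores)
storage_type
```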
Let’s now visualize all three measures of CT in a histogram

You should notice that the values for each measure of CT are not identical in this case, and this is due to two reasons:
- Each measure of CT is computed differently
- The sample distribution is not perfectly symmetric. The mean, median and mode coincide only when a distribution is symmetric with a single peak; with a sufficiently large sample from such a distribution, the different CT estimates will converge to the same value
3.4. Central tendency (CT) measures and variable types
Use of certain measures of CT may be more appropriate depending on the type of variable in question.
Variable types include:
Nominal
Variable that is represented by categories that have no inherent numeric value, and is best summarized using the mode. For example: gender, eye colour, media genre preference, country of origin. None of these variables has a numerical answer, but knowing which category is the most common (i.e. the mode) can tell you something about your data.
Ordinal
Variable that is represented by categories that can be ordered in some logical way, and is best summarized using the median. For example: levels of satisfaction, levels of income, severity levels of illness. Again, these variables have no “correct” numerical answer, but they are ordered, and therefore knowing the middle (i.e. the median) will give you more information about your data.
Scale/continuous
Variable that is quantified by numeric values, and is best summarized using the mean. For example: weight, test scores, temperature. These variables do have a numerical answer that can tell you things like the high/low observations in your data, and what the average (i.e. the mean) across your data is.
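The pairing of variable type and CT measure can be sketched with made-up data as follows. All object names and values are hypothetical. For the ordinal example, the ordered categories are converted to their position numbers with as.integer() before taking the median, which is one possible approach rather than the only one.

```r
# Nominal: the most common category (mode)
eye_colour <- c("brown", "blue", "brown", "green", "brown")
nominal_mode <- names(which.max(table(eye_colour)))

# Ordinal: the middle category (median) of an ordered factor
satisfaction <- factor(c("low", "high", "medium", "medium", "low"),
                       levels = c("low", "medium", "high"), ordered = TRUE)
ordinal_median <- levels(satisfaction)[median(as.integer(satisfaction))]

# Scale/continuous: the average (mean)
weight <- c(60.5, 72.0, 68.5)
scale_mean <- mean(weight)

nominal_mode
ordinal_median
scale_mean
```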
4. Measures of variability
Whilst we may be interested in summarizing where the majority of data points lie, with help from measures of CT, we may also be interested in knowing to what extent data points deviate from each other.
Take a look back at the affect dataframe (see section 1. Introduction to R, RStudio & learnr). The object consists of the following variables:
| Column names | Description |
|---|---|
| traitanx | trait anxiety scores |
| BDI | depression scores based on the Beck Depression Inventory (BDI) |
| posaffect | positive affect/emotion scores, higher scores represent higher positive affect |
| negaffect | negative affect/emotion scores, higher scores represent higher negative affect |
Exercise 4.0.1
Run the code chunk below and look at the differences in the distribution of scores for positive and negative affect.
Hint: par() command allows us to present two graphs at one time. The “mfrow” argument allows you to specify the number of rows and columns of graphs, where number of rows are specified before the number of columns. In this example, we specified 1 row and 2 columns
par(mfrow= c(1,2))
hist(affect$posaffect, xlab = "Positive affect", main ="Distribution of positive affect scores")
hist(affect$negaffect, xlab = "Negative affect", main ="Distribution of negative affect scores")
You should be able to notice that the empirical (data) distributions for positive and negative affect look somewhat different from each other. Notably for negative affect, there appears to be a somewhat smaller variation in scores from zero (which appears to be the mean), than for positive affect, where scores deviate more considerably from zero.
4.1. Variance & standard deviation
One way of measuring variability in data is through variance. The sample variance is calculated as the sum of squared differences between the data points and the mean, divided by one less than the number of data points (n - 1). Standard deviation is another measure of variability, calculated as the square root of the variance. See pages 55-61 of Numbers, Hypotheses & Conclusions (3rd Ed).
Remember that we have 3 children’s spatial awareness scores (presented below):
5, 8, 2
And we previously obtained a mean of 5
Exercise 4.1.1
Calculate the variance associated with spatial awareness scores
\[s^2 = \frac{\sum (x-\overline{x})^2}{n-1}\]
Note: The ^ operator raises a value to a power. In this case we are raising to the power of 2
((5-5)^2 + (8-5)^2 + (2-5)^2)/(3-1)
Exercise 4.1.2
Now take the square root of the variance you just calculated in order to obtain the standard deviation
Note: The sqrt() command takes the square root of a value
var.spatial = ((5-5)^2 + (8-5)^2 + (2-5)^2)/(3-1)
sqrt(var.spatial)
Like with measures of CT, we also have R commands at our disposal to
measure variability, including the var() and
sd() R commands
Exercise 4.1.3
Complete the code chunk below and check to see that you obtain the same variance of spatial awareness scores as calculated above, but now using the var() command
Note 1: The c() command allows us to combine values, which we can then store in an R object
Note 2: A new object named spatial.aware has been created in the code chunk below, which stores 3 data points. One can just insert the name of the object as an input into the var() command
spatial.aware= c(5,8,2)
var(...)
spatial.aware= c(5,8,2)
var(spatial.aware)
Note: Notice that the “spatial.aware” object is not a dataframe, because it does not consist of several columns of data, but is rather a single set of values (a vector). The implication here is that the dollar sign ($) is not needed: we refer to the data using the object name spatial.aware alone
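As a worked check of the variance formula, the sketch below computes the variance and standard deviation of the three spatial awareness scores step by step, dividing by n - 1, and compares the results with var() and sd(). The object names deviations, manual_var and manual_sd are illustrative only.

```r
# The three spatial awareness scores
spatial.aware <- c(5, 8, 2)

# Difference between each score and the mean
deviations <- spatial.aware - mean(spatial.aware)

# Sum of squared differences divided by n - 1
manual_var <- sum(deviations^2) / (length(spatial.aware) - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

manual_var
var(spatial.aware)
manual_sd
sd(spatial.aware)
```

Dividing by n instead of n - 1 would give a smaller value that does not match the output of var().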
Exercise 4.1.4
Now complete the code chunk and calculate the standard deviation of spatial awareness scores, using the sd() command.
Remember that the spatial awareness scores are:
5, 8 and 2
Hint: Look at exercise 4.1.3 for guidance
... = c(...,...,...)
sd(...)
spatial.aware= c(5,8,2)
sd(spatial.aware)
4.2. Range & Interquartile range (IQR)
Other measures of variability include the range and interquartile range. The range is the difference between the maximum and minimum data points. The interquartile range is the difference between the 3rd and 1st quartiles (i.e. the variation associated with the middle 50% of data). Note that the range() command in R returns the minimum and maximum values rather than their difference. See pages 55-61 of Numbers, Hypotheses & Conclusions (3rd Ed).
Exercise 4.2.1
Complete the following code below and calculate the range of vocabulary scores for children on the WISC
Hint 1: First insert the name of the dataframe, followed by the name of the variable/column of interest, separated by a dollar sign ($)
Hint 2: The name of the column of data containing the vocabulary scores is “vocab”
range(...$...)
range(wiscsem$vocab)
Exercise 4.2.2
Now calculate the interquartile range of vocabulary scores for children on the WISC
IQR(...$...)
IQR(wiscsem$vocab)
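The sketch below illustrates both measures on a made-up vector (scores and its values are hypothetical). It shows that range() returns the minimum and maximum, that diff() turns them into a single width, and that IQR() matches the difference between the 3rd and 1st quartiles returned by quantile() under R's default quartile method.

```r
# Hypothetical scores
scores <- c(2, 4, 6, 8, 10)

# range() returns the minimum and maximum values
score_range <- range(scores)
score_range

# diff() gives the distance between them
range_width <- diff(score_range)
range_width

# 1st and 3rd quartiles, and their difference
quartiles <- quantile(scores, c(0.25, 0.75))
iqr_manual <- unname(quartiles[2] - quartiles[1])

# The built-in command gives the same answer
iqr_builtin <- IQR(scores)
iqr_manual
iqr_builtin
```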
5. Advanced section
5.1. Variable types in R
When reading data into R (either as single variables or entire data frames with multiple variables), we will need to specify variable types.
A survey was conducted with statistics students from the University of Adelaide, where several variables were measured, including: gender, age and smoking frequency
Note: There are more variables contained in the dataframe than those already mentioned
Exercise 5.1.1
Inspect the “survey” data frame to see how variables have been measured after being imported into R (i.e. types of data), using the str() command
Hint: str() allows us to identify how each variable has been classified (i.e. its variable type). See exercise 2.0.2
str(...)
str(survey)
All variables (i.e. each column of data) are classified as numeric or scale variables, which is indicated by the shorthand num.
However, some variables in the survey data frame should not be classified as numeric, including Gender.
We have already discussed that variables can come in the form of nominal, ordinal and scale/continuous variables, but how do we represent these in R?
| Variable types | Associated variable types in R |
|---|---|
| Scale/continuous | numeric() or integer() |
| Nominal | factor() or character() |
| Ordinal | factor() or character() |
If we sought to alter the Gender variable from a scale
to a nominal variable, we could utilize the as.factor()
command.
Exercise 5.1.2
Run code chunk below to convert the Gender variable type from numeric to factor, and check that the variable type has successfully been changed to factor
Note: Within the survey data frame, self-reported males and females are assigned the placeholder values 1 and 2 respectively
Hint: Use str() command to check variable type.
survey$Gender <- as.factor(survey$Gender)
...(...$...)
survey$Gender<-as.factor(survey$Gender)
str(survey$Gender)
Instead of representing self-reported male and female genders using the placeholder values 1 and 2 respectively, each participant could be labelled either male or female.
Exercise 5.1.3
Check to see that the placeholder values for Gender (1 and 2) have been replaced by character strings (male and female)
survey$Gender <- ifelse(survey$Gender == "1", "Male", "Female")
survey$Gender<-as.factor(survey$Gender)
...(...$...)
str(survey$Gender)
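As a self-contained illustration of the recoding steps, the sketch below applies ifelse() and as.factor() to a made-up data frame (demo and its values are hypothetical, with 1 standing for male and 2 for female as in the survey data).

```r
# Hypothetical data frame with numeric placeholder values
demo <- data.frame(Gender = c(1, 2, 2, 1))

# Replace the placeholder values with descriptive labels
demo$Gender <- ifelse(demo$Gender == 1, "Male", "Female")

# Convert the labels to a factor (nominal variable)
demo$Gender <- as.factor(demo$Gender)

# Check the variable type and its categories
str(demo$Gender)
levels(demo$Gender)
```

By default, as.factor() orders the categories alphabetically, which is why “Female” is listed before “Male”.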
Exercise 5.1.4
Complete the code chunk in order to look at the structure (i.e. variable types) of the wiscsem data frame, and alter the vocabulary scores to an integer.
Hint 1: Check the underlying structure first
Hint 2: If not already of type integer convert the variable type to integer
...(...$...)
...$... = as...(...$...)
str(wiscsem$vocab)
wiscsem$vocab = as.integer(wiscsem$vocab)
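The same check-then-convert pattern can be sketched on a made-up data frame (demo and its values are hypothetical). The class() command is used here as an alternative way of reporting the variable type.

```r
# Hypothetical data frame with a numeric column
demo <- data.frame(vocab = c(10, 12, 9))

# Check the variable type before converting
type_before <- class(demo$vocab)

# Convert the column to integer and check again
demo$vocab <- as.integer(demo$vocab)
type_after <- class(demo$vocab)

type_before
type_after
```

Keep in mind that as.integer() drops any decimal part (for example, 9.7 becomes 9), so it is best suited to scores that are already whole numbers.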
Free-form exercise
Move on to and complete Statistics Tutorial Assignment 1 on Amathuba (Activities | Assignments)
Other resources & references
WISC-R subscale data from Tabachnick, B. G., & Fidell, L. S. (1996). Using Multivariate Statistics (3rd ed.). New York: HarperCollins.
Developed by: Marilyn Lake & Colin Tredoux