Data visualisation is a crucial step in any data analysis workflow. Whether you’re exploring new data for the first time or verifying the results of transformations, plotting can help you quickly spot errors, outliers, or unexpected patterns. Effective figures also help interpret and communicate your findings—bridging the gap between raw numbers and meaningful insights.
In practice, you’ll often find yourself visualising data multiple times:
Below are some common types of plots you’ll encounter and when they’re typically used:
Watch the recording from this session from 3 March 2025 and checkout the related slides within our Google Drive.
We will start by showing you how to build a basic plot with theggplot()
function. The ggplot()
function
creates a blank plot to which you add different functions to build your
data visualisation using the +
operator.
There are two main arguments to the ggplot() function
,
these are:
data
: the name of the data frame we are working
with
mapping
: the columns of the data frame we are plotting.
This is created using the aes()
command.
aes()
stands for “aesthetic” and is used to specify what
we want to show in the plot. We will start by looking at plots of simple
time trends using the malaria incidence data for all ages. First, we
filter the pf_incidence_national
data frame forage_group == "total"
.
total_incidence <-
filter(pf_incidence_national, age_group == 'total')
We then start with the ggplot command. We want to use the total incidence dataset and wish to plot the year on the x-axis and the incidence on the y-axis so the command is as follows:
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))
This wont yet produce a figure as we need to tell ggplot what sort of
figure we would like this to be. We will start by making a scatter plot
using geom_point()
. We add layers to the ggplot object
using the +
operator.
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_point()
We can then control how the points look by adding commands to thegeom_point()
function, such as the size, colour and
transparency (alpha) of the points.
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_point(size=5, colour = "blue", alpha = 0.5)
We can also change the type of the point by using theshape =
command.
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_point(size=4, colour = 'blue', alpha = 0.5, shape = 18)
# Available shapes (don't run)
shapes
The general structure of the ggplot()
command remains
similar for different plots.This means that it is easy to change the
type of an established plot, or create a different plot using a similar
code structure.
We can change the type of plot by replacing geom_point()
with other options such as a line graph: geom_line()
, or
smoothed conditional means: geom_smooth()
.
# Line plot
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_line(size=2, colour = "red", alpha = 0.5)
# Smoothed line of fit
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_smooth(colour = "yellow")
The size, colour and alpha commands are similarly used to control the
appearance of the line but instead of using shape =
to
control the appearance of the points we can change the style of the line
in a line plot by specifying linetype =
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_line(size=2, colour = "red", alpha = 0.5, linetype = 'dashed')
# Available line types (not run)
lty
Additionally, you can layer the types of plots together, e.g. plotting the points and lines on the same graph.
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_line(size=1, colour = "red", alpha = 0.5)+
geom_point(size=5, colour = 'blue', shape = 18)
Task 6
- Import the dataset
pf_incidence_national
we created in the demo from the data folder and filter to just contain data for under 5’s- Create a line graph plotting the incidence of malaria (y-axis) by date (x-axis)
- Change this to be a green dotted line and increase the line thickness
Solution
pf_incidence_national <- read_csv("outputs/pf_incidence_national.csv") %>% filter(age_group=="u5") ggplot(pf_incidence_national)+ geom_line(mapping = aes(x = date_tested, y = incidence), color = 'green', size = 2)
Further commands can then be added to the ggplot() object to control
the appearance of the plot. We can label the axis and add a title usinglabs()
, and can select different themes for the background
appearance, such as the black and white theme: theme_bw()
,
or the dark theme: theme_dark()
.
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_point(size=2, colour = "blue")+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()
We can control the x and y axis limits by using xlim()
or ylim()
. As our x-axis is a date we can control this
using the command scale_x_date()
. Otherwise if this was a
continuous variable we could use scale_x_continous()
to
control the scale. This command has additional options to control where
you want the breaks in the axis to be and to alter the data labels
# Option 1 using xlim() and ylim()
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_point(size=5, colour = 'blue')+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
ylim(0,1000)
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_point(size=5, colour = 'blue')+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
scale_x_date(limit=c(as.Date("2018-01-01"),as.Date("2018-06-01")),
date_breaks = "1 month", date_labels = "%b")
Furthermore, we can control the text of the axis and title within thetheme()
command. theme()
provides a large
amount of control over the appearance of the plot. Common examples of
what we can do within this command is rotating the x-axis text using the
option angle =
, changing the font size usingsize =
and the font type using face =
.
ggplot(data = total_incidence, mapping = aes(x = date_tested, y = incidence))+
geom_point(size=5, colour = 'blue')+
labs(x = "Month",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
scale_x_date(date_breaks = "1 month", date_labels = "%b")+
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5),
axis.title.x = element_text(size = 14,face = 'bold'),
axis.title.y = element_text(size = 14,face = 'bold'),
title = element_text(size = 16,face = 'bold'))
We will now look visualising multiple data trends, for example if you
had incidence data for multiple countries there are various ways you
could show this data. Here we will use the datasetpf_incidence_national.csv
we imported earlier, with
incidence for different age groups.
Firstly, we can incorporate the trends for different age_groups on
the same plot by using different colours. We use the colour
argument inside the aes
functions and assign it to the
country variable. This results in ggplot plotting a line for each
country in the dataset in a different colour, with a legend added to the
plot.
ggplot(data = pf_incidence_national, mapping = aes(x = date_tested, y = incidence, colour = age_group))+
geom_line(size = 3)+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018",
colour = 'Age group')+
theme_bw()
We can control the colours in the plots by using thescale_color_manual()
function, and passing it the colours
you wish to use. You can use either the names of colours or the HEX
codes. We can also move the position of the legend to the bottom using
the legend.position = "bottom"
command withintheme()
and change the title of the legend withinlabs()
.
ggplot(data = pf_incidence_national, mapping = aes(x = date_tested, y = incidence, colour = age_group))+
geom_line(size = 3)+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018",
colour = 'Age group')+
theme_bw()+
theme(legend.position = 'bottom')+
scale_colour_manual(values = c('pink', 'yellow', 'purple'))
Another way to look at multiple data trends in one plot is to uses facets. This creates separate plots for each “facet”, as part of the overall plot.
ggplot(data = pf_incidence_national, mapping = aes(x = date_tested, y = incidence))+
geom_line(size=2, colour = "blue")+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
facet_wrap(~age_group)
There are various options in facet_wrap()
. If you have
largely varying scales between the facets you are plotting, you can plot
these on different scales using either free_y = TRUE
, orfree_x = TRUE
as options inside offacet_wrap()
. You can also specify the number of rows or
columns you want in the plot, e.g. if you had 6 facets you could plot it
as nrow = 2, ncol = 3
ornrow = 3, ncol = 2
.
ggplot(data = pf_incidence_national, mapping = aes(x = date_tested, y = incidence))+
geom_line(size=2, colour = "blue")+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
facet_wrap(~age_group, scales = 'free_y', nrow = 3)
Task 7
- Using the
pf_incidence_national
dataset create scatter plots the number of confirmed malaria cases(y-axis) by date tested (x-axis), with facets for age group.- Change the colour of the points based on the age group
- Change the axis labels and add a title to the plot
- Change the plot theme to black and white
Solution
pf_incidence_national <- read_csv("outputs/pf_incidence_national.csv") ggplot(pf_incidence_national)+ geom_line(mapping = aes(x = date_tested, y = incidence, color = >age_group))+ facet_wrap(~age_group)+ labs(x = "Date", y = "Incidence per 1000 PYO", title = "Incidence >Rate")+ theme_bw()
Similarly to using the colour
option to plot your data
by categorical variables, we can use this options to plot continuous
data on a colour scale. For example, if we wanted colour the points on
our plot of incidence by year based on the total population we can addcolour = population
to the aesthetic command.
ggplot(total_incidence, aes(x = date_tested, y = incidence, colour = pop))+
geom_point(size=5)+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()
There are then a few options for how we can control the colour scale
of a continuous variable. Firstly we can use thescale_colour_gradient()
command to specify the low and high
colours of a gradient from which to build the colour ramp. Secondly
ggpplot allows us to use colour scales from ColorBrewer (https://colorbrewer2.org). This has a wide range of
sequential, diverging and qualitative colour schemes you can use for a
rnage of data and projects. For continuous data this is specified using
the command scale_colour_distiller()
and using the optionpalette =
to specify the colour palette to use.
ggplot(total_incidence, aes(x = date_tested, y = incidence, colour = pop))+
geom_point(size=5)+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
scale_colour_gradient(low = "yellow", high = "red")
ggplot(total_incidence, aes(x = date_tested, y = incidence, colour = pop))+
geom_point(size=5)+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
scale_colour_distiller(palette = 'Greens')
An additional option for- continuous colour scales are the viridis
scales. These are well designed colour scales which are excellent for
graphs and maps. They are colourful, appear uniform in both colour and
black-and-white and are easy to view for persons with common forms of
colour blindness. There are various options withing this package and
they can be used within ggplot through the commandscale_colour_viridis_c()
.
# Default viridis colour scale
ggplot(total_incidence, aes(x = date_tested, y = incidence, colour = pop))+
geom_point(size=5)+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
scale_colour_viridis_c()
# Inferno option
ggplot(total_incidence, aes(x = date_tested, y = incidence, colour = pop))+
geom_point(size=5)+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
scale_colour_viridis_c(option = 'inferno')
# Reversing the direction
ggplot(total_incidence, aes(x = date_tested, y = incidence, colour = pop))+
geom_point(size=5)+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018")+
theme_bw()+
scale_colour_viridis_c(option = 'inferno', direction = -1)
Finally, we can use ggplot to have plot different data on different
layers of the plot. We do this by moving the data =
andmapping =
arguments from the ggpplot()
to the
specific layers such as geom_point()
. Here we plot a
scatter plot with different colours for each country, but we add a
smoothed mean line for all of the data.
ggplot()+
geom_point(data = filter(pf_incidence_national, age_group != 'total'),
aes(x = date_tested, y = incidence, colour = age_group), size=4)+
geom_line(data = filter(pf_incidence_national, age_group == 'total'),
aes(x = date_tested, y = incidence), colour = 'grey', size = 2, linetype = 'dashed')+
labs(x = "Date",
y = "Incidence/100,000/month",
title = "Malaria incidence in 2018",
colour = 'Age group')+
theme_bw()
ggplot
can also be used to create plots summarising the
data and incorporating statistical transformations.
Box plots are a good way of summarising continuous data by discrete
variables. For example, in this dataset we have the number of confirmed
malaria cases in different districts for each month. Inggplot
we can use box plots to easiy summarise and
visualise these data.
As in the previous section, we provide ggplot() with the x and y
variables in the aes()
argument and R will calculate the
size of the boxes and whiskers. Similarly, you can control the colour of
the lines and the fill colour of the plot by using thecolour
and fill
arguments.
To look at these plots we will use the “pf_incidence” dataset we
created earlier, with the confirmed cases, incidence, and population by
date and district. We will subset this data to look at all ages
initially. We introduced pipes earlier, these can also be used with
ggplot. Dates in ggplot are treated as continuous variables to here we
must specify group = date_tested
to group the data by
date.
filter(pf_incidence, age_group == 'total') %>%
ggplot(aes(x = date_tested, y = conf, group=date_tested))+
geom_boxplot(colour = 'black', fill = 'dark red')+
labs(x = "Date",
y = "Number of cases",
title = "Number of malaria cases by month")+
theme_bw()
Histograms are created using the geom_histogram()
function. This splits the data into bins using stat_bins()
and counts the number of occurrences in each bin. You can control the
number of bins or the width of the bins using bins
andbinwidth
respectively. Options such as fill
can also be used to control the fill colour of the bars, along with the
additional options already discussed.
filter(pf_incidence, age_group == 'total') %>%
ggplot(aes(x = conf))+
geom_histogram(bins = 20, fill = 'purple', colour = 'black')+
theme_bw()+
labs(x = "Number tested",
y = "Count",
title = "Histogram of malaria tests conducted")
For bar plots the geom_bar()
function is used. The
default transformation is to count the number of rows for each category,stat = "bin"
. There are other transformations we can use,
so if we want to make the heights of the bars to represent values in the
data (provided by the assigning the y aesthetic), we usestat = identity
. Here we plot a bar chart of the total
number of malaria tests conducted each month in our routine data.
filter(pf_incidence, age_group == 'total') %>%
ggplot(aes(x = date_tested, y = conf, group=date_tested))+
geom_bar(stat = "identity", fill = "blue")+
labs(x = "Number tested",
y = "Count",
title = "Bar chart of malaria tests conducted per month") +
theme_bw()
If we want to display different subgroups of data, such as number of
tests by age group, on a bar plot we can either create a stacked bar
plot with different colours representing the subgroups by using thefill = age_group
command, or by creating a side-by-side bar
plot by combining fill = age_group
with the commandposition = "dodge"
.
# Stacked bar chart
filter(pf_incidence, age_group != 'total') %>%
ggplot(mapping = aes(x = date_tested, y = conf, fill = age_group))+
geom_bar(stat = "identity")+
theme_bw()+
labs(x = "Number tested",
y = "Count",
title = "Bar chart of confirmed malaria cases per month, by age group",
fill = NULL)
# Side by side bar chart
filter(pf_incidence, age_group != 'total') %>%
ggplot(mapping = aes(x = date_tested, y = conf, fill = age_group))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(x = "Number tested",
y = "Count",
title = "Bar chart of confirmed malaria cases per month, by age group",
fill = NULL)
This is just an introduction to what you can achieve for ggplot. For further guidance look at the help pages and the cheat-sheet available athttps://www.rstudio.com/resources/cheatsheets.
There are a couple of different ways you can save plots in R. Firstly
you can save them by opening a png()
, jpeg()
or pdf()
depending on the file type you want to save. These
commands contain the output file path and the desired height and width
of the figure (optional). It is then followed by the figure, and closing
down the file with dev.off()
.
png("outputs/malaria_incidence_plot.png")
filter(pf_incidence, age_group != 'total') %>%
ggplot(mapping = aes(x = date_tested, y = conf, fill = age_group))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(x = "Number tested",
y = "Count",
title = "Bar chart of confirmed malaria cases per month, by age group",
fill = NULL)
dev.off()
Another option for saving plots is using ggsave()
. This
takes the file name, including the extension, that you wish to save and
the plot. Leaving the plot name empty will default to the last plot
created. You can also include commands for the desired height and width
of the figure if required. So saving the above plot would entail:
filter(pf_incidence, age_group != 'total') %>%
ggplot(mapping = aes(x = date_tested, y = conf, fill = age_group))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(x = "Number tested",
y = "Count",
title = "Bar chart of confirmed malaria cases per month, by age group",
fill = NULL)
ggsave("outputs/malaria_incidence_plot.png")
Task 8
- Import the
pf_incidence
dataset we created in demo 2.- Create a box plot of the confirmed malaria cases for children under 5 by month
- Using the
pf_incidence
dataset create a stacked bar chart of confirmed > malaria cases in children under 5 and over 5’s by monthSolution
pf_incidence <- read_csv("outputs/pf_incidence.csv") %>% mutate(month = month(date_tested, label = TRUE, abbr = TRUE), year = year(date_tested)) pf_incidence %>% filter(age_group == "u5") %>% ggplot()+ geom_boxplot(mapping = aes(x = month, y = conf))+ labs(y = "Confirmed Cases")+ theme_bw() ggplot(pf_incidence)+ geom_boxplot(mapping = aes(x = month, y = conf, fill = age_group))+ labs(y = "Confirmed Cases under 5")+ theme_bw()
Tip: You can automate generating many plots (e.g., for each country or each
variable) by writing a function or loop. For instance, you might loop through a list of
countries, create four different plots for each, and then save them all in a single PDF.
This can be done by opening a PDF device (pdf("plots.pdf")
), producing plots
in a loop, and then closing the device with dev.off()
, or by using
ggsave()
inside your loop.
Here are some recommended resources to explore and deepen your understanding of ggplot2
and data visualizations:
ggplot2
concepts and examples.