Annotated ggplot2 charts

Author

Pablo Leon

Published

February 1, 2024

About

This is a small tutorial on how to annotate custom ggplot charts. GGPLOT2 includes shadowed boxes, arrows and text to highlight specific data points in charts, helping to drive the story telling, communicating meaningful information using plots and visualization elements in R.

1. Download A&E data

The plot we will build and annotate will use NHS England A&E England Attendances data. We choose to download England Time Series data available from NHS England Statistics section main website.

1.1 NHS England A&E Attendances and Emergency Admissions data

We download from NHS England the existing publicly available A&E England Attendances and Emergency Admissions from thier main website: https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/

We use GET() function from {httr} library to download A&E data from NHS England URL

The Weekly and Monthly A&E Attendances and Emergency Admissions collection collects the total number of attendances in the specified period for all A&E types, including Urgent Treatment Centres, Minor Injury Units and Walk-in Centres, and of these, the number discharged, admitted or transferred within four hours of arrival.

1.2 Download tidy A&E data

We can make use of the {janitor} package to download A&E data and to obtain column names that ease further daat manipulation

Code
library(httr)
# CLEAN DATA SET
# Cleaned data set with all variable as Integers
url1<-'https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2023/10/Monthly-AE-Time-Series-September-2023.xls'
GET(url1, write_disk(tf <- tempfile(fileext = ".xls")))
Response [https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2023/10/Monthly-AE-Time-Series-September-2023.xls]
  Date: 2024-02-01 12:45
  Status: 200
  Content-Type: application/vnd.ms-excel
  Size: 404 kB
<ON DISK>  C:\Users\PABLOL~1\AppData\Local\Temp\RtmpygZwwP\file5f6829c985b.xls
Code
MONTHLY_AE <- read_excel(tf, 2L,
                         col_names = TRUE,
                         col_types = c("date","numeric","numeric","numeric",
                                       "numeric","numeric","numeric",
                                       "numeric","numeric","numeric",
                                       "numeric","numeric","numeric"),skip =13) %>% 
                        clean_names() %>% 
                        select(period,
                               type_1 = type_1_departments_major_a_e_2,
                               type_2 = type_2_departments_single_specialty_3,
                               type_3 = type_3_departments_other_a_e_minor_injury_unit_4) %>%
            # Apply right format to get AE attendances as Numeric values
                        mutate(
                          type_1_num = as.integer(type_1),
                          type_2_num = as.integer(type_2),
                          type_3_num = as.integer(type_3)) %>% 
            # Retain just formatted colmns 
                        select(period,type_1_num,type_2_num,type_3_num)

str(MONTHLY_AE)
tibble [155 x 4] (S3: tbl_df/tbl/data.frame)
 $ period    : POSIXct[1:155], format: "2010-11-01" "2010-12-01" ...
 $ type_1_num: int [1:155] 1065456 1070728 1061897 1007384 1167089 1140917 1168080 1117614 1167234 1091366 ...
 $ type_2_num: int [1:155] 53584 45395 51420 51153 57694 53787 56642 54549 55995 51774 ...
 $ type_3_num: int [1:155] 485550 531699 541588 493882 579483 592289 594166 561550 596844 569890 ...
Code
MONTHLY_AE
# A tibble: 155 x 4
   period              type_1_num type_2_num type_3_num
   <dttm>                   <int>      <int>      <int>
 1 2010-11-01 00:00:00    1065456      53584     485550
 2 2010-12-01 00:00:00    1070728      45395     531699
 3 2011-01-01 00:00:00    1061897      51420     541588
 4 2011-02-01 00:00:00    1007384      51153     493882
 5 2011-03-01 00:00:00    1167089      57694     579483
 6 2011-04-01 00:00:00    1140917      53787     592289
 7 2011-05-01 00:00:00    1168080      56642     594166
 8 2011-06-01 00:00:00    1117614      54549     561550
 9 2011-07-01 00:00:00    1167234      55995     596844
10 2011-08-01 00:00:00    1091366      51774     569890
# i 145 more rows

Check start and end period of this A&E Time series data:

[1] "2010-11-01 UTC"
[1] "2023-09-01 UTC"

This data set from NHS England, includes includes England Time Series data for Type 1, Type 2 and Type 3 Attendances covering the period that starts on 2010-11-01 and ends on 2023-09-01 as we can see in the table below:

2. Present tables using GT package

The gt package is all about making it simple to produce nice-looking display tables. We can use it to improve how we present the previous table:

First we will filter the original data set to include just 2023 data

Code
MONTHLY_AE_sub <- MONTHLY_AE %>% 
filter(period >= '2023-01-01')
MONTHLY_AE_sub
# A tibble: 9 x 4
  period              type_1_num type_2_num type_3_num
  <dttm>                   <int>      <int>      <int>
1 2023-01-01 00:00:00     632299      34327     607412
2 2023-02-01 00:00:00     598368      33612     588390
3 2023-03-01 00:00:00     681584      38023     663719
4 2023-04-01 00:00:00     677126      34498     638453
5 2023-05-01 00:00:00     740252      37249     698619
6 2023-06-01 00:00:00     835710      44292     748918
7 2023-07-01 00:00:00     840480      43664     739830
8 2023-08-01 00:00:00     782158      42366     712635
9 2023-09-01 00:00:00     789191      40887     721379
Code
library(gt)
# Turn previous table into a gt table
MONTHLY_AE_tbl <-  MONTHLY_AE %>% 
filter(period >= '2023-01-01')
ae_gt_tbl <- gt(MONTHLY_AE_tbl) %>% 
    tab_header(
    title = md("**A&E Attendances in England 2023**"),
    subtitle = md("By type (*Type I*,*Type II*,*Type III)*")
  ) %>% 
    tab_source_note(
    source_note = "Source: NHS England A&E Attendances and Emergency Admissions data"
  ) |>
  tab_source_note(
    source_note = md("England Time Series monthly data")
  ) %>% 
      fmt_number(
    columns = c(type_1_num,type_2_num,type_3_num),
    sep_mark = ",",
    decimals = 0
  )
ae_gt_tbl 
A&E Attendances in England 2023
By type (Type I,Type II,Type III)
period type_1_num type_2_num type_3_num
2023-01-01 632,299 34,327 607,412
2023-02-01 598,368 33,612 588,390
2023-03-01 681,584 38,023 663,719
2023-04-01 677,126 34,498 638,453
2023-05-01 740,252 37,249 698,619
2023-06-01 835,710 44,292 748,918
2023-07-01 840,480 43,664 739,830
2023-08-01 782,158 42,366 712,635
2023-09-01 789,191 40,887 721,379
Source: NHS England A&E Attendances and Emergency Admissions data
England Time Series monthly data

3. Subset Type I attendances

We will focus on Type I attendances for this plot

Code
library(gt)
# Turn previous table into a gt table
TypeI  <-  MONTHLY_AE %>% 
select(period,type_1_num)
TypeI
# A tibble: 155 x 2
   period              type_1_num
   <dttm>                   <int>
 1 2010-11-01 00:00:00    1065456
 2 2010-12-01 00:00:00    1070728
 3 2011-01-01 00:00:00    1061897
 4 2011-02-01 00:00:00    1007384
 5 2011-03-01 00:00:00    1167089
 6 2011-04-01 00:00:00    1140917
 7 2011-05-01 00:00:00    1168080
 8 2011-06-01 00:00:00    1117614
 9 2011-07-01 00:00:00    1167234
10 2011-08-01 00:00:00    1091366
# i 145 more rows

4. Building ggplot2 charts

We start by building a bare minimum ggplot2 line chart. This basic line chart includes ggplot2() function with aes() argument including x and y variables, and the geom.

There are four main parts of a basic ggplot2 visualization: the ggplot() function, the data parameter, the aes() function, and the geom. Get an accurate description from each of these elements from: https://www.sharpsightlabs.com/blog/ggplot2-tutorial/

The ggplot function The ggplot() function is the core function of ggplot2. It initiates plotting.

Code
Basic_plot <- ggplot(data = TypeI, aes(x = period, y = type_1_num)) +
  geom_line()
Basic_plot

4.1 Add title and subtitle

The first element we want to include in our chart is the tile and subtitle

Code
Basic_plot <- ggplot(data = TypeI, aes(x = period, y = type_1_num)) +
  geom_line() +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period")
Basic_plot

4.2 Change axis labels

We can use labs to change also x and y axis labels

Code
Basic_plot <- ggplot(data = TypeI, aes(x = period, y = type_1_num)) +
  geom_line() +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period",
       x = "Date",
       y = "Total Attendances")
Basic_plot

4.3 We might need to change Date format

After importing the data into R, date variables are defined as POSIX date, we can change them to standard Date R format.

Code
str(TypeI)
tibble [155 x 2] (S3: tbl_df/tbl/data.frame)
 $ period    : POSIXct[1:155], format: "2010-11-01" "2010-12-01" ...
 $ type_1_num: int [1:155] 1065456 1070728 1061897 1007384 1167089 1140917 1168080 1117614 1167234 1091366 ...
Code
TypeI_date_format <- TypeI %>% mutate(Datef = as.Date(period))
TypeI_date_format
# A tibble: 155 x 3
   period              type_1_num Datef     
   <dttm>                   <int> <date>    
 1 2010-11-01 00:00:00    1065456 2010-11-01
 2 2010-12-01 00:00:00    1070728 2010-12-01
 3 2011-01-01 00:00:00    1061897 2011-01-01
 4 2011-02-01 00:00:00    1007384 2011-02-01
 5 2011-03-01 00:00:00    1167089 2011-03-01
 6 2011-04-01 00:00:00    1140917 2011-04-01
 7 2011-05-01 00:00:00    1168080 2011-05-01
 8 2011-06-01 00:00:00    1117614 2011-06-01
 9 2011-07-01 00:00:00    1167234 2011-07-01
10 2011-08-01 00:00:00    1091366 2011-08-01
# i 145 more rows

4.4 Define date breaks for x axis

We have several years in our data set

Code
library(lubridate)
Years_in_data <- TypeI_date_format %>%  
                 select (Datef) %>% 
                 mutate(Year = year(Datef)) %>% 
                 select(Year) %>% 
                 distinct()
Years_in_data
# A tibble: 14 x 1
    Year
   <dbl>
 1  2010
 2  2011
 3  2012
 4  2013
 5  2014
 6  2015
 7  2016
 8  2017
 9  2018
10  2019
11  2020
12  2021
13  2022
14  2023
Code
Total_years <- count(distinct(Years_in_data))
Total_years
# A tibble: 1 x 1
      n
  <int>
1    14

There are 14 years of data in the original data set.

In the chart below, we can see the X axis only displays two years out of the 14 available in the data set. The way to fix it is by using scale_x_data() function defining labels as years ‘%Y’ with date_breaks defined as ‘1 year’ so X axis will display every single year in the data set.

Code
Basic_plot_format <- ggplot(data = TypeI_date_format, aes(x = Datef, y = type_1_num)) +
  geom_line() +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period",
       x = "Date",
       y = "Total Attendances") + 
 scale_x_date(date_labels="%Y",date_breaks  ="1 year") 
Basic_plot_format

4.5 Apply theme to the plot

We can apply defined theme() setup to the chart. There are several to choose from:(theme_bw(), theme_dark(), them_grey(), theme_light(), theme_minimal(), theme_classic() ) . I will choose theme_minimal() as this will provide a lower inks ratio outcome.

We apply then theme_minimal() to the existing ggplot chart

Code
Basic_plot_format <- ggplot(data = TypeI_date_format, aes(x = Datef, y = type_1_num)) +
  geom_line() +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period",
       x = "Date",
       y = "Total Attendances") + 
 scale_x_date(date_labels="%Y",date_breaks  ="1 year") +
 theme_minimal()
Basic_plot_format

4.6 Finally we apply colour to the line chart

We apply inside geom_line() function two parameters colour to add colour to the existing line and also we include size parameter equal to 1 to increase the size of the line used in the chart:

Code
Basic_plot_format <- ggplot(data = TypeI_date_format, aes(x = Datef, y = type_1_num)) +
  geom_line(color="#69b3a2", size =1) +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period",
       x = "Date",
       y = "Total Attendances") + 
 scale_x_date(date_labels="%Y",date_breaks  ="1 year") +
 theme_minimal()
Basic_plot_format

Then we can save this plot as an initial basic chart

Code
ggsave("plots/01_AE_Attendances_basic_plot.png", width = 6, height = 4)

Another variation would be to include a line chart geom_line() and a dot chart geom_point() combined

Code
Basic_plot_format <- ggplot(data = TypeI_date_format, aes(x = Datef, y = type_1_num)) +
  geom_line(color="#69b3a2", size =1) +
  geom_point(shape=21, color="black", fill="#69b3a2", size=1.5) +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period",
       x = "Date",
       y = "Total Attendances") + 
 scale_x_date(date_labels="%Y",date_breaks  ="1 year") +
 theme_minimal()
Basic_plot_format

Code
ggsave("plots/02_AE_Attendances_line_point_plot.png", width = 6, height = 4)

5. Add annotations to plot

The next step will include several annotations to the chart to help the storytelling part when using this chart.

5.1 Using annotate() function

From {ggplot2} package we use the annotate() function to create an annotation layer in the chart. This function is highly versatile as it allows us to include both text and geoms such as lines and arrows.

We then only require one extra function geom_segment() also from ggplot2 to draw straight vertical or horizontal reference lines in the chart area.

5.1.1 Adding rectangle annotation

We can start by adding a rectangle to the previous chart

Code
Basic_plot_format <- ggplot(data = TypeI_date_format, aes(x = Datef, y = type_1_num)) +
  geom_line(color="#69b3a2", size =1) +
  geom_point(shape=21, color="black", fill="#69b3a2", size=1.5) +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period",
       x = "Date",
       y = "Total Attendances") + 
 scale_x_date(date_labels="%Y",date_breaks  ="1 year") +
 theme_minimal() +
 
    annotate('rect',
    xmin = as.Date("2018-11-01"), xmax = as.Date("2019-11-01"),
    ymin = 700000, ymax = 1100000, 
    alpha = 0.5,
    fill = 'grey40',
    col = 'black'
  )
Basic_plot_format

5.1.2 Adding text annotation

We can start by adding a rectangle to the previous chart Adding a text annotation:

COVID pandemic

  • COVID-19 was confirmed to be present in the UK by the end of January 2020

  • A legally-enforced Stay at Home Order, or lockdown, was introduced on 23 March 2021, banning all non-essential travel and contact with other people, and shut schools, businesses, venues and gathering places.

  • A third wave of daily infections began in July 2021 due to the arrival and rapid spread of the highly transmissible SARS-CoV-2 Delta variant.[62]

So we can establish the COVID-19 pandemic period roughly from 01 January 2020 to 01 August 2021

Code
Basic_plot_format <- ggplot(data = TypeI_date_format, aes(x = Datef, y = type_1_num)) +
  geom_line(color="#69b3a2", size =1) +
  geom_point(shape=21, color="black", fill="#69b3a2", size=1.5) +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period",
       x = "Date",
       y = "Total Attendances") + 
 scale_x_date(date_labels="%Y",date_breaks  ="1 year") +
 theme_minimal() +
 
    annotate('rect',xmin = as.Date("2020-01-01"), xmax = as.Date("2021-08-01"),ymin = 500000, ymax = 1100000, 
    alpha = 0.1 , fill = 'blue',col = 'black') +
    annotate('text',x = as.Date("2020-09-01"),y = 1120000,label = "COVID-19")
  
Basic_plot_format

5.1.3 Adding arrows to highlight single values

We can also add a curved line with an arrow at the end pointing to specific values. We need to include the arrow function to the curve annotation created earlier: arrow = arrow(length = unit(0.4,'cm')))

Note that the annotate() function is a good alternative that can reduces the code length for simple cases.

6. Annotated ggplot chart including Arrows and rectangles

This would be the annotated chart including shadowed rectangle are and an arrow with text to highlight specific data points in the plot area.

Code
Curved_line_annotation <- ggplot(data = TypeI_date_format, aes(x = Datef, y = type_1_num)) +
  geom_line(color="#69b3a2", size =1) +
  geom_point(shape=21, color="black", fill="#69b3a2", size=1.5) +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period",
       x = "Date",
       y = "Total Attendances") + 
 scale_x_date(date_labels="%Y",date_breaks  ="1 year") +
 theme_minimal() +
 
    annotate('rect',xmin = as.Date("2020-01-01"), xmax = as.Date("2021-08-01"),ymin = 500000, ymax = 1100000, 
    alpha = 0.1 , fill = 'blue',col = 'black') +
    annotate('text',x = as.Date("2020-09-01"),y = 1120000,label = "COVID-19") +
    annotate('curve',
    x = as.Date("2018-10-01"),
    xend = as.Date("2019-12-01"),
    y = 620000,
    yend = 520000,
    linewidth = 1, 
    curvature = 0.5,
    # Adding an Arrow to my curvature
    arrow = arrow(length = unit(0.4,'cm'))) +
    annotate('text',x = as.Date("2018-06-27"),y = 700000,label = "Lowest \n Attendances level")
  
Curved_line_annotation

Code
ggsave("plots/07_AE_Attendances_annotations_plot.png", width = 6, height = 4)

7. Including vertical reference lines

In this new section we will include three vertical lines for each of the lock downs, the functions below can also be used to draw horizontal reference lines.

Set of geoms used in charts to annotate different sections:

Draw lines across the chart:

Code
Curved_line_annotation <- ggplot(data = TypeI_date_format, aes(x = Datef, y = type_1_num)) +
  geom_line(color="#69b3a2", size =1) +
  geom_point(shape=21, color="black", fill="#69b3a2", size=1.5) +
  labs(title = "Type I A&E Attendances in England",
       subtitle = "2010-2023 Period",
       x = "Date",
       y = "Total Attendances") + 
 scale_x_date(date_labels="%Y",date_breaks  ="1 year") +
 theme_minimal() +
 annotate('rect',xmin = as.Date("2020-01-01"), xmax = as.Date("2021-08-01"),ymin = 500000, ymax = 1100000, 
    alpha = 0.1 , fill = 'blue',col = 'black') +
    annotate('text',x = as.Date("2020-09-01"),y = 1130000,label = "COVID-19", color="red",size = 5) +
    annotate('curve',
    x = as.Date("2018-10-01"),
    xend = as.Date("2019-12-01"),
    y = 620000,
    yend = 520000,
    linewidth = 1, 
    curvature = 0.5,
    # Adding an Arrow to my curvature
    arrow = arrow(length = unit(0.2,'cm'))) +
    annotate('text',x = as.Date("2018-06-27"),y = 700000,label = "Lowest \n Attendances level") +
    #ADD LOCKDOWN DATES REFERENCE LINES
    # Replacing vline by geom_segment (it allows controlling for reference lines lenght)
    # First lockdown  26 March 2020
    geom_segment(aes(x = as.Date("2020-03-26"), y = 500000, yend = 1100000,
                     xend = as.Date("2020-03-26"), colour = "segment"),
                     linewidth=1.5,color="orange") +
      # Second Lockdown 5 November 2020
    geom_segment(aes(x = as.Date("2020-11-05"), y = 500000, yend = 1100000,
                     xend = as.Date("2020-11-05"), colour = "segment"),
                     linewidth=1.5,color="orange") +
    # Third lockdown 6 January 2021
    geom_segment(aes(x = as.Date("2021-01-06"), y = 500000, yend = 1100000,
                     xend = as.Date("2021-01-06"), colour = "segment"),
                     linewidth=1.5,color="orange") +
    # Text annotations for each lockdown

    annotate('text',x = as.Date("2019-07-20"),y = 1200000,label = "First lockdown") +
    annotate('text',x = as.Date("2019-05-26"),y = 1164000,label = "01/08/2020") +
    annotate('text',x = as.Date("2022-10-20"),y = 1100000,label = "Second lockdown") +
    annotate('text',x = as.Date("2023-04-20"),y = 1070000,label = "05/11/2020") +
    annotate('text',x = as.Date("2022-10-20"),y = 980000,label = "Third lockdown") +
    annotate('text',x = as.Date("2023-04-20"),y = 950000,label = "06/01/2021") +
  # Adding arrows to lockdowns labels 
  # First lockdown arrow 
   annotate('curve',x = as.Date("2019-06-01"),xend = as.Date("2020-02-20"),
            y = 1140000,yend = 1090000,
    linewidth = 1,curvature = 0.6,arrow = arrow(length = unit(0.2,'cm'))) +
  # Second lockdown arrow
     annotate('curve',x = as.Date("2022-01-01"),xend = as.Date("2020-11-01"),y = 1080000,yend = 970000,
    linewidth = 1,curvature = 0.6,arrow = arrow(length = unit(0.2,'cm'))) +
  # Third lockdown arrow
   annotate('curve',x = as.Date("2022-01-01"),xend = as.Date("2021-02-01"),y = 990000,yend = 910000,
    linewidth = 1,curvature = 0.6,arrow = arrow(length = unit(0.2,'cm'))) 
Curved_line_annotation

7.1 List of lockdown relevant dates:

First national lockdown (March to June 2020)

England was in national lockdown between late March and June 2020. Lockdown measures legally came into force for the first time on 26 March 2020.

First nationwide lockdown was legally effective from 1:00pm on 26th March 2020.

Initially, all “non-essential” high street businesses were closed and people were ordered to stay at home, permitted to leave for essential purposes only, such as buying food or for medical reasons. Starting in May 2020, the laws were slowly relaxed. People were permitted to leave home for outdoor recreation (beyond exercise) from 13 May. On 1 June, the restriction on leaving home was replaced with a requirement to be home overnight, and people were permitted to meet outside in groups of up to six people.

Second national lockdown (November 2020)

On 5 November 2020, national restrictions were reintroduced in England. During the second national lockdown, non-essential high street businesses were closed, and people were prohibited from meeting those not in their “support bubble” inside. People could leave home to meet one person from outside their support bubble outdoors.

Third national lockdown (January to March 2021)

Following concerns that the four-tier system was not containing the spread of the Alpha variant, national restrictions were reintroduced for a third time on 6 January 2021.

The rules during the third lockdown were more like those in the first lockdown. People were once again told to stay at home. However, people could still form support bubbles (if eligible) and some gatherings were exempted from the gatherings ban (for example, religious services and some small weddings were permitted).

Reference for National lockdown dates: https://commonslibrary.parliament.uk/research-briefings/cbp-9068/