R Basics - Data Transformation

When working with data we often find a need to transform it in various ways. This might be for reasons of presentation, to make data compatible with a function that requires the input to be in a specific format, or to make it easier to work with. R provides plenty of options for data transformation and the rich package system means that there is typically a ready made solution for most of the problems that you might encounter.

Wide and long format transformations

Consider the following dataset;

dat <- data.frame(
    first_name = c("Olivia", "Amelia", "George"),
    post_code = c("AB12CD", "EF34GH", "IJ56KL"),
    hair_colour = c("brown", "brown", "blond"),
    eye_colour = c("green", "blue", "blue")
)

print(dat)
#'   first_name post_code hair_colour eye_colour
#' 1     Olivia    AB12CD       brown      green
#' 2     Amelia    EF34GH       brown       blue
#' 3     George    IJ56KL       blond       blue

The data has one row per individual and records their name, post code, hair colour, and eye colour. Each observation is held in a column and the layout could be described as being a wide format.

The same data could also be stored in a long format where the post code, hair colour, and eye colour are all stored in a single column of values, with another column used to identify what the value signifies.

Wide format

first_name	post_code	hair_colour	eye_colour
Olivia	AB12CD	brown	green
Amelia	EF34GH	brown	blue
George	IJ56KL	blond	blue

Long format

first_name	measure	value
Olivia	post_code	AB12CD
Amelia	post_code	EF34GH
George	post_code	IJ56KL
Olivia	hair_colour	brown
Amelia	hair_colour	brown
George	hair_colour	blond
Olivia	eye_colour	green
Amelia	eye_colour	blue
George	eye_colour	blue

Wide to Long transformation

Whilst there are a number of functions available to assist in the wide to long transformation, pivot_longer from the tidyr package offers one of the simpler API’s. tidyr is a widely used package and the pivot_longer function is performant with small to mid sized data.

You can install tidyr from CRAN if you don’t already have it installed.

install.packages("tidyr")

To use pivot_longer to transform our data we need to specify at least the data and cols arguments. data allows us to specify the object that we want to transform, and cols specifies the columns to pivot into the longer format.

tidyr::pivot_longer(data = dat, cols = c(post_code, hair_colour, eye_colour))
#' # A tibble: 9 × 3
#'   first_name name        value 
#'   <chr>      <chr>       <chr> 
#' 1 Olivia     post_code   AB12CD
#' 2 Olivia     hair_colour brown 
#' 3 Olivia     eye_colour  green 
#' 4 Amelia     post_code   EF34GH
#' 5 Amelia     hair_colour brown 
#' 6 Amelia     eye_colour  blue  
#' 7 George     post_code   IJ56KL
#' 8 George     hair_colour blond 
#' 9 George     eye_colour  blue

We can also use negation in the cols argument; in our example we actually want to pivot all columns except first_name, so we can also write the function like this;

tidyr::pivot_longer(data = dat, cols = !first_name)
#' # A tibble: 9 × 3
#'   first_name name        value 
#'   <chr>      <chr>       <chr> 
#' 1 Olivia     post_code   AB12CD
#' 2 Olivia     hair_colour brown 
#' 3 Olivia     eye_colour  green 
#' 4 Amelia     post_code   EF34GH
#' 5 Amelia     hair_colour brown 
#' 6 Amelia     eye_colour  blue  
#' 7 George     post_code   IJ56KL
#' 8 George     hair_colour blond 
#' 9 George     eye_colour  blue

You will have noticed that the return of pivot_longer is a tibble. This generally won’t be a problem, but we can keep our data in a data.frame by using as.data.frame() to wrap our call to pivot_longer(), for example as.data.frame(tidyr::pivot_longer(data, cols)).

Typically, we might want to specify the names of the new name and value columns rather than using the default values. This can be achieved with the names_to and values_to arguments.

as.data.frame(
    tidyr::pivot_longer(
        data = dat,
        cols = !first_name,
        names_to = "measure", values_to = "value"
        )
    )
#'   first_name     measure  value
#' 1     Olivia   post_code AB12CD
#' 2     Olivia hair_colour  brown
#' 3     Olivia  eye_colour  green
#' 4     Amelia   post_code EF34GH
#' 5     Amelia hair_colour  brown
#' 6     Amelia  eye_colour   blue
#' 7     George   post_code IJ56KL
#' 8     George hair_colour  blond
#' 9     George  eye_colour   blue

The function has worked well so far, but what if we had another column containing numeric values that we also wanted to include in the transformation?

dat$age <- c(64, 25, 56)

as.data.frame(
    tidyr::pivot_longer(
        data = dat,
        cols = !first_name,
        names_to = "measure", values_to = "value"
        )
    )
#' Error in `pivot_longer_spec()`:
#' ! Can't combine `post_code` <character> and `age` <double>.
#' Run `rlang::last_error()` to see where the error occurred.

Our post_code, hair_colour, and eye_colour columns were all character vectors, whereas our newly added age column is numeric. tidyverse functions usually try to avoid implicit type coercions, so whilst you might have expected age to be coerced into a character, we have instead received an error.

We can use the values_transform argument to specify the as.character function, which will then be applied to all columns in the transformation.

dat$age <- c(64, 25, 56)

dat <- as.data.frame(
    tidyr::pivot_longer(
        data = dat,
        cols = !first_name,
        names_to = "measure", values_to = "value",
        values_transform = as.character
        )
    )

print(dat)
#'    first_name     measure  value
#' 1      Olivia   post_code AB12CD
#' 2      Olivia hair_colour  brown
#' 3      Olivia  eye_colour  green
#' 4      Olivia         age     64
#' 5      Amelia   post_code EF34GH
#' 6      Amelia hair_colour  brown
#' 7      Amelia  eye_colour   blue
#' 8      Amelia         age     25
#' 9      George   post_code IJ56KL
#' 10     George hair_colour  blond
#' 11     George  eye_colour   blue
#' 12     George         age     56

Note that we wrote values_transform = as.character and not values_transform = as.character(). The reasons for this are to do with how the pivot_longer function is written internally, but for now it is sufficient to have this pointed out.

pivot_longer offers a range of arguments allowing the user to exert varying levels of control over the transformation that takes place. You can view the documentation by running ?pivot_longer.

Long to Wide transformation

pivot_wider from the tidyr package is used for transforming data from a long to wide format. Having used pivot_longer, it should feel somewhat familiar.

We can use pivot_wider to return our data.frame, dat back to a wide format. dat currently looks like this;

first_name	measure	value
Olivia	post_code	AB12CD
Olivia	hair_colour	brown
Olivia	eye_colour	green
Olivia	age	64
Amelia	post_code	EF34GH
Amelia	hair_colour	brown
Amelia	eye_colour	blue
Amelia	age	25
George	post_code	IJ56KL
George	hair_colour	blond
George	eye_colour	blue
George	age	56

The minimum arguments that we need to provide to transform our data with pivot_wider are names_from (the column values will be converted to the column names) and values_from (the values that will populate the new columns). We also need to pass our data.frame as the data argument.

as.data.frame(
    tidyr::pivot_wider(
        data = dat,
        names_from = measure,
        values_from = value
        )
)
#'   first_name post_code hair_colour eye_colour age
#' 1     Olivia    AB12CD       brown      green  64
#' 2     Amelia    EF34GH       brown       blue  25
#' 3     George    IJ56KL       blond       blue  56

pivot_wider also offers a range of additional arguments allowing the user to exert control over the transformation that takes place. You can view the documentation by running ?pivot_wider.

Joins

Joining two datasets together is a powerful technique, typically used to combine rows from two or more tables, based on a related column between them.

Let’s use an example of 2 tables; employees and skills. Both tables contain details of a unique identifier (uid), which is used to identify individual employees.

employees also contains details of an employees job role (role) and length of service in years (service).

skills contains 2 columns, python and r, used to record whether an employee knows those languages. Managers don’t do anything practical (obviously) so nobody bothered to record their programming skills, hence they don’t appear in the skills data.

employees <- data.frame(
    uid = c("001", "002", "003", "004", "005"),
    role = c("data scientist", "data scientist", "data engineer", "data scientist", "manager"),
    service = c(1,3,4,5,3)
)

skills <- data.frame(
    uid = c("001", "002", "003", "004"),
    python = c(TRUE, TRUE, FALSE, TRUE),
    r = c(TRUE, FALSE, TRUE, TRUE)
)

We can use the merge function to perform a wide range of different joins. Whatever the type of join required, much of the syntax remains the same. The x and y arguments are used to specify the two data.frame’s to be joined and by is used to specify the columns which the merge will be performed on.

Inner join

Inner joins are used to join data but keep only the rows where the merge column value exists in both x and y.

merge(x = employees, y = skills, by = c("uid"))
#'   uid           role service python     r
#' 1 001 data scientist       1   TRUE  TRUE
#' 2 002 data scientist       3   TRUE FALSE
#' 3 003  data engineer       4  FALSE  TRUE
#' 4 004 data scientist       5   TRUE  TRUE

Employee ‘005’ doesn’t appear in the returned data.frame as “005” only existed in x.

Outer join

The outer join is used to join the datasets whilst keeping all records from x and y. We set the all argument to TRUE to perform this join.

merge(x = employees, y = skills, by = c("uid"), all = TRUE)
#'   uid           role service python     r
#' 1 001 data scientist       1   TRUE  TRUE
#' 2 002 data scientist       3   TRUE FALSE
#' 3 003  data engineer       4  FALSE  TRUE
#' 4 004 data scientist       5   TRUE  TRUE
#' 5 005        manager       3     NA    NA

All records from x and y are included in the return, however, as employee ‘005’ did not exist in y the values for that employee in the python and r column have been set to NA.

Left join

A left join returns all rows from the left table (x) and matched rows from the right table (y).

merge(x = employees, y = skills, by = c("uid"), all.x = TRUE)
#'   uid           role service python     r
#' 1 001 data scientist       1   TRUE  TRUE
#' 2 002 data scientist       3   TRUE FALSE
#' 3 003  data engineer       4  FALSE  TRUE
#' 4 004 data scientist       5   TRUE  TRUE
#' 5 005        manager       3     NA    NA

This time the return includes employee ‘005’ as the row existed in the left table (x) and again the values in the python and r columns have been set to NA as there was no row for the employee in the right table (y).

Right join

A right join returns matched rows from the left table (x) and all rows from the right table (y).

merge(x = employees, y = skills, by = c("uid"), all.y = TRUE)
#'   uid           role service python     r
#' 1 001 data scientist       1   TRUE  TRUE
#' 2 002 data scientist       3   TRUE FALSE
#' 3 003  data engineer       4  FALSE  TRUE
#' 4 004 data scientist       5   TRUE  TRUE

This time the return does not include employee ‘005’ as the row existed only in the left table (x).

Summary

The merge() function provides a simple and intuitive API. In the examples above the x, y, and by arguments remain the same across the different types of join. The table below demonstrates the additional arguments required to achieve each type of join.

inner join	outer join	left join	right join

all = FALSE	all = TRUE	all.x = TRUE	all.y = TRUE

A tidy approach

The tidyverse package dplyr provides functions for performing joins on a pair of data.frames.

Inner join

dplyr::inner_join(x = employees, y = skills, by = c("uid"))
#'   uid           role service python     r
#' 1 001 data scientist       1   TRUE  TRUE
#' 2 002 data scientist       3   TRUE FALSE
#' 3 003  data engineer       4  FALSE  TRUE
#' 4 004 data scientist       5   TRUE  TRUE

Outer join

dplyr::full_join(x = employees, y = skills, by = c("uid"))
#'   uid           role service python     r
#' 1 001 data scientist       1   TRUE  TRUE
#' 2 002 data scientist       3   TRUE FALSE
#' 3 003  data engineer       4  FALSE  TRUE
#' 4 004 data scientist       5   TRUE  TRUE
#' 5 005        manager       3     NA    NA

Left join

dplyr::left_join(x = employees, y = skills, by = c("uid"))
#'   uid           role service python     r
#' 1 001 data scientist       1   TRUE  TRUE
#' 2 002 data scientist       3   TRUE FALSE
#' 3 003  data engineer       4  FALSE  TRUE
#' 4 004 data scientist       5   TRUE  TRUE
#' 5 005        manager       3     NA    NA

Right join

dplyr::right_join(x = employees, y = skills, by = c("uid"))
#'   uid           role service python     r
#' 1 001 data scientist       1   TRUE  TRUE
#' 2 002 data scientist       3   TRUE FALSE
#' 3 003  data engineer       4  FALSE  TRUE
#' 4 004 data scientist       5   TRUE  TRUE

Next steps

The data transformation techniques discussed here are powerful and effective techniques, whilst being simple to use. They are particularly useful in analytical based projects.

Try out some of the tasks below to put the theory into practice.

The below code can be used to generate 2 data.frame’s, transfers and cl_goals. transfers contains data pertaining to the most expensive association football transfers (as of Summer 2021). cl_goals contains data of the 50 all time champions league top scorers (as of 2nd November 2022).

transfers <- structure(list(player_uid = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 
7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 
10L, 10L, 10L, 10L, 10L), measure = c("Player", "FromClub", 
"ToClub", "Position", "FeeEuro", "FeePound", "Player", 
"FromClub", "ToClub", "Position", "FeeEuro", "FeePound", 
"Player", "FromClub", "ToClub", "Position", "FeeEuro", 
"FeePound", "Player", "FromClub", "ToClub", "Position", 
"FeeEuro", "FeePound", "Player", "FromClub", "ToClub", 
"Position", "FeeEuro", "FeePound", "Player", "FromClub", 
"ToClub", "Position", "FeeEuro", "FeePound", "Player", 
"FromClub", "ToClub", "Position", "FeeEuro", "FeePound", 
"Player", "FromClub", "ToClub", "Position", "FeeEuro", 
"FeePound", "Player", "FromClub", "ToClub", "Position", 
"FeeEuro", "FeePound", "Player", "FromClub", "ToClub", 
"Position", "FeeEuro", "FeePound"), value = c("Neymar", 
"Barcelona", "Paris Saint-Germain", "Forward", "222", "£198", 
"Kylian Mbappé", "Monaco", "Paris Saint-Germain", "Forward", 
"180", "£163", "Philippe Coutinho", "Liverpool", "Barcelona", 
"Midfielder", "145", "£105", "João Félix", "Benfica", "Atlético Madrid", 
"Forward", "126", "£104.10", "Antoine Griezmann", "Atlético Madrid", 
"Barcelona", "Forward", "120", "£107", "Jack Grealish", "Aston Villa", 
"Manchester City", "Midfielder", "117", "£100", "Paul Pogba", 
"Juventus", "Manchester United", "Midfielder", "105", "£89", 
"Ousmane Dembélé", "Borussia Dortmund", "Barcelona", "Forward", 
"105", "£97", "Gareth Bale", "Tottenham Hotspur", "Real Madrid", 
"Forward", "100", "£86", "Cristiano Ronaldo", "Real Madrid", 
"Juventus", "Forward", "100", "£88")), row.names = c(2L, 3L, 
4L, 5L, 6L, 7L, 9L, 10L, 11L, 12L, 13L, 14L, 16L, 17L, 18L, 19L, 
20L, 21L, 23L, 24L, 25L, 26L, 27L, 28L, 30L, 31L, 32L, 33L, 34L, 
35L, 37L, 38L, 39L, 40L, 41L, 42L, 44L, 45L, 46L, 47L, 48L, 49L, 
51L, 52L, 53L, 54L, 55L, 56L, 58L, 59L, 60L, 61L, 62L, 63L, 65L, 
66L, 67L, 68L, 69L, 70L), class = "data.frame")

cl_goals <- structure(list(Rank = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
10L, 12L, 12L, 14L, 15L, 15L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 
24L, 25L, 25L, 25L, 25L, 25L, 30L, 30L, 30L, 33L, 33L, 33L, 33L, 
37L, 37L, 37L, 40L, 40L, 42L, 42L, 42L, 45L, 45L, 45L, 45L, 45L, 
45L, 45L), Player = c("Cristiano Ronaldo", "Lionel Messi", 
"Robert Lewandowski", "Karim Benzema", "Raúl", "Ruud van Nistelrooy", 
"Thomas Müller", "Thierry Henry", "Alfredo Di Stéfano ", 
"Andriy Shevchenko", "Zlatan Ibrahimović", "Eusébio ", 
"Filippo Inzaghi", "Didier Drogba", "Mohamed Salah", "Neymar", 
"Alessandro Del Piero", "Sergio Agüero", "Kylian Mbappé", 
"Ferenc Puskás", "Edinson Cavani", "Gerd Müller", 
"Fernando Morientes", "Arjen Robben", "Samuel Eto'o", "Antoine Griezmann", 
"Wayne Rooney", "Kaká", "Francisco Gento", "David Trezeguet", 
"Roy Makaay", "Patrick Kluivert", "Erling Haaland", "Jean-Pierre Papin", 
"Edin Džeko", "Ryan Giggs", "Sadio Mané", "Luis Suárez", 
"Rivaldo", "Mario Gómez", "Raheem Sterling", "Mário Jardel", 
"Robin van Persie", "Hernán Crespo", "José Altafini ", 
"Marco Simone ", "José Augusto ", "Giovane Élber", 
"Gonzalo Higuaín", "Luís Figo", "Paul Scholes"), Goals = c(140L, 
129L, 91L, 86L, 71L, 56L, 53L, 50L, 49L, 48L, 48L, 46L, 46L, 
44L, 43L, 43L, 42L, 41L, 40L, 36L, 35L, 34L, 33L, 31L, 30L, 30L, 
30L, 30L, 30L, 29L, 29L, 29L, 28L, 28L, 28L, 28L, 27L, 27L, 27L, 
26L, 26L, 25L, 25L, 25L, 24L, 24L, 24L, 24L, 24L, 24L, 24L), 
    Apps = c(183L, 161L, 111L, 146L, 142L, 73L, 138L, 112L, 58L, 
    100L, 124L, 65L, 81L, 92L, 77L, 80L, 89L, 79L, 59L, 41L, 
    70L, 35L, 93L, 110L, 78L, 85L, 85L, 86L, 89L, 58L, 61L, 71L, 
    23L, 37L, 68L, 145L, 61L, 73L, 73L, 44L, 79L, 46L, 59L, 65L, 
    28L, 46L, 56L, 69L, 83L, 103L, 124L)), class = "data.frame", row.names = c(NA, 
-51L))

The transfers dataset is in a long format, consisting of 3 columns, player_uid, measure, and value. Transform the dataset into a wide format, the transformed dataset should contain the columns;

player_uid
Player
FromClub
ToClub
Position
FeeEuro
FeePound

Join the transfers and cl_goals datasets using the Player column to create a new dataset named tg containing only the players that are present in both datasets.

Using the tg dataset you have created;

How many players are included in tg?
What is the combined number of Champions League appearences made by players in tg?
Which player scored the most goals per Champions League appearence?
Considering the fee paid for each player in euro’s, calculate the spend per Champions League goal and assign the result to a column named spend_per_goal. Which player had the lowest spend per Champions League goal?

hint: some columns may not be numeric, as.numeric() can be used to change type

Answers

1. 4
2. 407
3. Cristiano Ronaldo
4. Cristiano Ronaldo

PREVIOUSR Basics - Exploring A Dataset

NEXTR Basics - Grouping