When working with data we often find a need to transform it in various ways. This might be for presentation purposes, to make the data compatible with a function that requires input in a specific format, or simply to make it easier to work with. R provides plenty of options for data transformation, and the rich package system means that there is typically a ready-made solution for most of the problems you might encounter.
Wide and long format transformations
Consider the following dataset:
The data has one row per individual and records their name, post code, hair colour, and eye colour. Each variable is held in its own column, and the layout could be described as a wide format.
The same data could also be stored in a long format where the post code, hair colour, and eye colour are all stored in a single column of values, with another column used to identify what the value signifies.
Wide format
first_name | post_code | hair_colour | eye_colour |
---|---|---|---|
Olivia | AB12CD | brown | green |
Amelia | EF34GH | brown | blue |
George | IJ56KL | blond | blue |
Long format
first_name | measure | value |
---|---|---|
Olivia | post_code | AB12CD |
Amelia | post_code | EF34GH |
George | post_code | IJ56KL |
Olivia | hair_colour | brown |
Amelia | hair_colour | brown |
George | hair_colour | blond |
Olivia | eye_colour | green |
Amelia | eye_colour | blue |
George | eye_colour | blue |
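A wide-format `data.frame` matching the table above could be built as follows; the object name `dat` is an assumption, chosen to match the name used later in this section.

```r
# Wide-format data: one row per individual, one column per recorded value
dat <- data.frame(
  first_name  = c("Olivia", "Amelia", "George"),
  post_code   = c("AB12CD", "EF34GH", "IJ56KL"),
  hair_colour = c("brown", "brown", "blond"),
  eye_colour  = c("green", "blue", "blue")
)
```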
Wide to Long transformation
Whilst there are a number of functions available to assist in the wide-to-long transformation, `pivot_longer()` from the `tidyr` package offers one of the simpler APIs. `tidyr` is a widely used package and `pivot_longer()` performs well with small to mid-sized data.
You can install `tidyr` from CRAN if you don’t already have it installed.
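```r
# Install tidyr from CRAN (only needed once)
install.packages("tidyr")
```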
To use `pivot_longer()` to transform our data we need to specify at least the `data` and `cols` arguments. `data` allows us to specify the object that we want to transform, and `cols` specifies the columns to pivot into the longer format.
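A minimal call might look like the sketch below, assuming the wide data shown earlier is stored in `dat`:

```r
library(tidyr)

# Pivot the three measurement columns into the default name/value pair
pivot_longer(
  data = dat,
  cols = c(post_code, hair_colour, eye_colour)
)
```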
We can also use negation in the `cols` argument; in our example we actually want to pivot all columns except `first_name`, so we can also write the function like this:
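```r
# Pivot everything except the identifier column
# (assuming the wide data is stored in `dat`, as above)
pivot_longer(dat, cols = !first_name)
```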
You will have noticed that the return of `pivot_longer()` is a `tibble`. This generally won’t be a problem, but we can keep our data in a `data.frame` by using `as.data.frame()` to wrap our call to `pivot_longer()`, for example `as.data.frame(tidyr::pivot_longer(data, cols))`.
Typically, we might want to specify the names of the new `name` and `value` columns rather than using the default values. This can be achieved with the `names_to` and `values_to` arguments.
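For example, to reproduce the column names used in the long-format table above (again assuming the wide data is in `dat`):

```r
pivot_longer(
  dat,
  cols = !first_name,
  names_to = "measure",
  values_to = "value"
)
```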
The function has worked well so far, but what if we had another column containing numeric values that we also wanted to include in the transformation?
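As a sketch, suppose we add an `age` column to the wide data; the ages used here are taken from the long-format table shown later in this section.

```r
# Add a numeric column alongside the existing character columns
dat$age <- c(64, 25, 56)

# This call now fails: pivot_longer() will not silently combine
# character and numeric values into a single column
pivot_longer(dat, cols = !first_name, names_to = "measure", values_to = "value")
```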
Our `post_code`, `hair_colour`, and `eye_colour` columns were all character vectors, whereas our newly added `age` column is numeric. `tidyverse` functions usually try to avoid implicit type coercions, so whilst you might have expected `age` to be coerced to character, we have instead received an error.
We can use the `values_transform` argument to specify the `as.character` function, which will then be applied to all columns in the transformation.
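A sketch of the full call, wrapped in `as.data.frame()` and re-assigned so that `dat` now holds the long-format data used in the next section:

```r
dat <- as.data.frame(
  pivot_longer(
    dat,
    cols = !first_name,
    names_to = "measure",
    values_to = "value",
    values_transform = as.character  # single function applied to every pivoted column
  )
)
```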
Note that we wrote `values_transform = as.character` and not `values_transform = as.character()`. This is because the argument expects a function to apply to the values, rather than the result of calling that function.
`pivot_longer()` offers a range of arguments allowing the user to exert varying levels of control over the transformation that takes place. You can view the documentation by running `?pivot_longer`.
Long to Wide transformation
`pivot_wider()` from the `tidyr` package is used for transforming data from a long to a wide format. Having used `pivot_longer()`, it should feel somewhat familiar.
We can use `pivot_wider()` to return our `data.frame`, `dat`, back to a wide format. `dat` currently looks like this:
first_name | measure | value |
---|---|---|
Olivia | post_code | AB12CD |
Olivia | hair_colour | brown |
Olivia | eye_colour | green |
Olivia | age | 64 |
Amelia | post_code | EF34GH |
Amelia | hair_colour | brown |
Amelia | eye_colour | blue |
Amelia | age | 25 |
George | post_code | IJ56KL |
George | hair_colour | blond |
George | eye_colour | blue |
George | age | 56 |
The minimum arguments that we need to provide to transform our data with `pivot_wider()` are `names_from` (the column whose values will become the new column names) and `values_from` (the column containing the values that will populate the new columns). We also need to pass our `data.frame` as the `data` argument.
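A sketch of the call, assuming the long-format data shown above is stored in `dat`:

```r
# Spread the measure/value pairs back out into one column per measure
pivot_wider(
  dat,
  names_from = measure,
  values_from = value
)
```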
`pivot_wider()` also offers a range of additional arguments allowing the user to exert control over the transformation that takes place. You can view the documentation by running `?pivot_wider`.
Joins
Joining two datasets together is a powerful technique, typically used to combine rows from two or more tables, based on a related column between them.
Let’s use an example of two tables, `employees` and `skills`. Both tables contain a unique identifier (`uid`), which is used to identify individual employees.
`employees` also contains details of an employee’s job role (`role`) and length of service in years (`service`).
`skills` contains two further columns, `python` and `r`, used to record whether an employee knows those languages. Managers don’t do anything practical (obviously) so nobody bothered to record their programming skills, hence they don’t appear in the `skills` data.
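Illustrative versions of the two tables, with made-up values consistent with the description above, might look like this:

```r
# Five employees; '005' is the manager and has no skills record
employees <- data.frame(
  uid     = c("001", "002", "003", "004", "005"),
  role    = c("analyst", "developer", "analyst", "developer", "manager"),
  service = c(2, 5, 3, 7, 10)
)

skills <- data.frame(
  uid    = c("001", "002", "003", "004"),
  python = c(TRUE, TRUE, FALSE, TRUE),
  r      = c(TRUE, FALSE, TRUE, TRUE)
)
```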
We can use the `merge()` function to perform a wide range of different joins. Whatever the type of join required, much of the syntax remains the same. The `x` and `y` arguments are used to specify the two `data.frame`s to be joined, and `by` is used to specify the column(s) on which the merge will be performed.
Inner join
Inner joins are used to join data but keep only the rows where the merge column value exists in both `x` and `y`.
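With the illustrative data above, an inner join looks like this:

```r
# all = FALSE (the default) keeps only uids present in both tables
merge(x = employees, y = skills, by = "uid")
```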
Employee ‘005’ doesn’t appear in the returned `data.frame` as ‘005’ only existed in `x`.
Outer join
The outer join is used to join the datasets whilst keeping all records from `x` and `y`. We set the `all` argument to `TRUE` to perform this join.
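For example:

```r
# all = TRUE keeps every row from both tables
merge(x = employees, y = skills, by = "uid", all = TRUE)
```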
All records from `x` and `y` are included in the return; however, as employee ‘005’ did not exist in `y`, the values for that employee in the `python` and `r` columns have been set to `NA`.
Left join
A left join returns all rows from the left table (`x`) and matched rows from the right table (`y`).
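Setting `all.x = TRUE` performs the left join:

```r
# Keep all rows from x (employees), matching rows from y (skills)
merge(x = employees, y = skills, by = "uid", all.x = TRUE)
```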
This time the return includes employee ‘005’ as the row existed in the left table (`x`), and again the values in the `python` and `r` columns have been set to `NA` as there was no row for the employee in the right table (`y`).
Right join
A right join returns matched rows from the left table (`x`) and all rows from the right table (`y`).
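Setting `all.y = TRUE` performs the right join:

```r
# Keep matching rows from x, all rows from y (skills)
merge(x = employees, y = skills, by = "uid", all.y = TRUE)
```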
This time the return does not include employee ‘005’ as the row existed only in the left table (`x`).
Summary
The `merge()` function provides a simple and intuitive API. In the examples above the `x`, `y`, and `by` arguments remain the same across the different types of join. The table below shows the additional arguments required to achieve each type of join.
inner join | outer join | left join | right join |
---|---|---|---|
all = FALSE | all = TRUE | all.x = TRUE | all.y = TRUE |
A tidy approach
The `tidyverse` package `dplyr` provides functions for performing joins on a pair of `data.frame`s.
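The sketches under the headings below use the same illustrative `employees` and `skills` data and assume `dplyr` is installed; `full_join()` is dplyr’s equivalent of the outer join.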
Inner join
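```r
dplyr::inner_join(employees, skills, by = "uid")
```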
Outer join
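```r
dplyr::full_join(employees, skills, by = "uid")
```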
Left join
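```r
dplyr::left_join(employees, skills, by = "uid")
```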
Right join
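```r
dplyr::right_join(employees, skills, by = "uid")
```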
Next steps
The data transformation techniques discussed here are powerful and effective, whilst being simple to use. They are particularly useful in analysis-based projects.
Try out some of the tasks below to put the theory into practice.
The below code can be used to generate two `data.frame`s, `transfers` and `cl_goals`. `transfers` contains data pertaining to the most expensive association football transfers (as of summer 2021). `cl_goals` contains data on the 50 all-time Champions League top scorers (as of 2nd November 2022).
The `transfers` dataset is in a long format, consisting of three columns: `player_uid`, `measure`, and `value`. Transform the dataset into a wide format; the transformed dataset should contain the columns:
- player_uid
- Player
- FromClub
- ToClub
- Position
- FeeEuro
- FeePound
Join the `transfers` and `cl_goals` datasets using the `Player` column to create a new dataset named `tg`, containing only the players that are present in both datasets.
Using the `tg` dataset you have created:

1. How many players are included in `tg`?
2. What is the combined number of Champions League appearances made by players in `tg`?
3. Which player scored the most goals per Champions League appearance?
4. Considering the fee paid for each player in euros, calculate the spend per Champions League goal and assign the result to a column named `spend_per_goal`. Which player had the lowest spend per Champions League goal?
Hint: some columns may not be numeric; `as.numeric()` can be used to change their type.
Answers
1. 4
2. 407
3. Cristiano Ronaldo
4. Cristiano Ronaldo