Data often contains information pertaining to a multitude of groups. Producing summary or aggregated statistics for these groups is a common task. Being able to efficiently perform grouping operations is a powerful tool.
There are many ways to produce summary statistics and aggregations in R; however, one of the most intuitive is to use the tidyverse package dplyr. The dplyr package offers functions for aggregating and summarising data that are simple to use and that avoid some of the pitfalls found in alternative approaches.
We can use the built-in iris dataset to explore grouping in R. For a quick reminder of the contents of iris, let’s use the head() function.
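A minimal sketch of that call. iris ships with base R in the datasets package, so nothing needs to be installed or loaded first.

```r
# iris is part of the built-in datasets package
head(iris)   # shows the first six rows by default
```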
We can create a copy of iris, named df, to work with.
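For example, a plain assignment is enough to create the copy:

```r
# Work on a copy so the built-in dataset is left untouched
df <- iris
```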
Grouping data
Creating groups with the dplyr package, so that summary statistics and aggregations can be produced, involves two functions: group_by() and ungroup().
group_by() is used to create groups, and ungroup() is used to remove them.
group_by() - single grouping column
The group_by() function accepts various arguments; you can check these out by running ?group_by. In its most basic usage, though, we only need to specify our data (df) and the column or columns that we want the data to be grouped by.
Let’s start by grouping df using the Species column.
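A sketch of this step, assuming dplyr is installed. The copy of iris is recreated here so that the block runs on its own.

```r
library(dplyr)

df <- iris                    # copy of the built-in dataset

# The data is the first argument, followed by the grouping column
df <- group_by(df, Species)

print(df)
```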
Note that whilst group_by() has converted our data into a tibble (a tidyverse implementation of the data.frame), no changes have been made to the data itself. The output of print(df) does, however, show that we have 3 groups based on Species: # Groups: Species [3].
To see the effects of group_by(), we can check the attributes of our data.
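One way to inspect them is the base R attributes() function. The sketch below rebuilds the grouped data so that it is self-contained; class() and group_vars() are shown as narrower alternatives when the full attribute list is more than is needed.

```r
library(dplyr)

df <- group_by(iris, Species)

attributes(df)   # full list, including $groups and $class

class(df)        # just the class vector
group_vars(df)   # just the names of the grouping columns
```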
You can see that the $groups attribute has been added to record our groupings, and that an additional element, grouped_df, has been added to the $class attribute.
CAUTION: Whilst our data remains grouped, we need to take care when using functions that behave differently when passed a grouped_df. Typically these will be functions from tidyverse packages, though there is nothing to prevent authors of other packages from utilising the grouped_df class. If we don’t want our output to be affected by the groupings, then the grouping will need to be removed explicitly using ungroup().
summarise()
summarise(), from the dplyr package, creates a new data frame which will contain a row for each combination of grouping variables that exists in the data or, if there are no grouping variables, a single row summarising all of the observations in the input. The resulting data frame contains a column for each grouping variable and additional columns for each of the summary statistics that have been specified.
We can use summarise() to work out the mean petal length (Petal.Length) for each Species.
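A sketch of that calculation. The output column name Mean.Petal.Length is chosen here for illustration and is not taken from the original code.

```r
library(dplyr)

df <- group_by(iris, Species)

# One row per species, with the mean of Petal.Length in each group
mean_petal <- summarise(df, Mean.Petal.Length = mean(Petal.Length))

mean_petal
```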
We are not limited to adding one summary statistic at a time. If we want the mean and median values for Petal.Length, we can add both in the same function call.
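For instance, each extra statistic can be passed as another named argument. The column names used below are illustrative choices.

```r
library(dplyr)

df <- group_by(iris, Species)

# Several summary statistics in a single summarise() call
petal_stats <- summarise(
  df,
  Mean.Petal.Length   = mean(Petal.Length),
  Median.Petal.Length = median(Petal.Length)
)

petal_stats
```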
n()
It is often useful to have a count of the number of observations within a group. The dplyr function n() allows us to do this.
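A short sketch: n() takes no arguments and is called inside summarise(), where it returns the size of the current group. The column name Count is an illustrative choice.

```r
library(dplyr)

df <- group_by(iris, Species)

# Number of observations in each species group
counts <- summarise(df, Count = n())

counts
```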
ungroup()
Once we have finished working with the groupings that we created, we need to remove them. Not removing the groupings can cause issues further down the line if you forget that they are present. To remove the current groupings, we need to use the dplyr function ungroup().
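A sketch of removing the grouping, with the grouped data rebuilt first so that the block is self-contained:

```r
library(dplyr)

df <- group_by(iris, Species)   # grouped data

# Remove all groupings; the rows and columns are unchanged
df <- ungroup(df)

print(df)
```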
Like group_by(), ungroup() doesn’t appear to change anything in our data. However, if we review the attributes of df again, we see that the $groups attribute has been removed and grouped_df is no longer an element of the $class attribute.
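The check can be repeated with attributes(), as in this self-contained sketch:

```r
library(dplyr)

df <- ungroup(group_by(iris, Species))

attributes(df)        # no $groups entry any more

class(df)             # grouped_df is no longer listed
attr(df, "groups")    # NULL once the grouping has been removed
```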
group_by() - multiple grouping columns
Grouping can also be applied across multiple columns. Let’s add a new column to df which we can use for grouping. The new column, Big.Petal, will hold a logical value based on whether Petal.Length and Petal.Width are above average.
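One possible way to build this column is sketched below. It assumes that "above average" means both measurements are greater than their overall column means across the whole dataset; the original code may have defined it differently.

```r
library(dplyr)

df <- iris

# Assumption: TRUE only when BOTH petal measurements exceed their overall means
df <- mutate(
  df,
  Big.Petal = Petal.Length > mean(Petal.Length) &
    Petal.Width > mean(Petal.Width)
)

head(df)
```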
We can now group our data by Species and Big.Petal.
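Additional grouping columns are passed as further arguments to group_by(). In this sketch Big.Petal is rebuilt under the assumption that it flags rows where both petal measurements exceed their overall means.

```r
library(dplyr)

# Assumed definition of Big.Petal, recreated so the block runs on its own
df <- mutate(
  iris,
  Big.Petal = Petal.Length > mean(Petal.Length) &
    Petal.Width > mean(Petal.Width)
)

# Group by two columns at once
df <- group_by(df, Species, Big.Petal)

print(df)
```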
This grouping allows us to calculate a mean Sepal.Length value by both Species and Big.Petal.
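A sketch of that summary, again assuming Big.Petal flags rows where both petal measurements exceed their overall means. Mean.Sepal.Length is an illustrative column name.

```r
library(dplyr)

# Assumed definition of Big.Petal, recreated so the block runs on its own
df <- mutate(
  iris,
  Big.Petal = Petal.Length > mean(Petal.Length) &
    Petal.Width > mean(Petal.Width)
)
df <- group_by(df, Species, Big.Petal)

# One row per Species / Big.Petal combination present in the data
sepal_means <- summarise(df, Mean.Sepal.Length = mean(Sepal.Length))

sepal_means
```

By default summarise() drops only the last grouping level, so the result here is still grouped by Species; ungroup() can be applied to it if no grouping is wanted.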
group_by() - keep all columns
summarise() works well, but it reduces the data to one row per distinct combination of the grouping variables and drops columns that aren’t grouping variables. This may not always be the behaviour we want. If we want to add the mean Sepal.Length value by Species and Big.Petal to our data ‘as is’, we need to use another dplyr function: mutate(). mutate() is used to add new columns and, on grouped data, it calculates values within each group while keeping every row.
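A sketch of the mutate() approach. As before, the definition of Big.Petal and the new column name Mean.Sepal.Length are assumptions made for illustration.

```r
library(dplyr)

# Assumed definition of Big.Petal, recreated so the block runs on its own
df <- mutate(
  iris,
  Big.Petal = Petal.Length > mean(Petal.Length) &
    Petal.Width > mean(Petal.Width)
)

df <- group_by(df, Species, Big.Petal)

# The group mean is repeated on every row of its group; no rows are dropped
df <- mutate(df, Mean.Sepal.Length = mean(Sepal.Length))

# Remove the grouping once we are finished with it
df <- ungroup(df)

head(df)
```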
Next steps
Grouping is a powerful and effective way to add summary statistics to your data, and the dplyr package offers easy-to-use functions to achieve this.
Try some of the tasks below to put the theory into practice.
Create a data.frame named df using the code below.
The data contains the daily counts of positive COVID-19 cases for the first seven days of February 2021, broken down by age group and sex. The data is available from the PHS Scotland open data platform.
- Group the data by sex and calculate the mean and median number of first infections.
- Group the data by sex and age group and calculate the mean and median of both first infections and reinfections.
- Group the data by sex and age group and calculate the totals of both first infections and reinfections.
Answers
Your outputs should resemble the examples below: 1. 2. 3.