Data often contains information pertaining to a multitude of groups. Producing summary or aggregated statistics for these groups is a common task. Being able to efficiently perform grouping operations is a powerful tool.
There are many ways to produce summary statistics and aggregations using R, however, the one of most intuitive ways to achieve this is to use the tidyverse
package dplyr
. The dplyr
package offers functions for aggregating and summarising data which are simple to use and that avoid some of the pitfalls found in alternative approaches.
We can use the built in iris
dataset to explore grouping in R. For a quick reminder of the contents of iris
, lets use the head()
function.
head(iris)
#' Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#' 1 5.1 3.5 1.4 0.2 setosa
#' 2 4.9 3.0 1.4 0.2 setosa
#' 3 4.7 3.2 1.3 0.2 setosa
#' 4 4.6 3.1 1.5 0.2 setosa
#' 5 5.0 3.6 1.4 0.2 setosa
#' 6 5.4 3.9 1.7 0.4 setosa
We can create a copy of iris
to work with named df
.
df <- iris
Grouping data
Using the dplyr
package to generate groups to allow for the production of summary statistics and aggregations involves two functions; group_by()
and ungroup()
.
group_by()
is used to create groups, and ungroup()
to remove them.
group_by() - single grouping column
The group_by()
function accepts various arguments, you can check these out by running ?group_by
. In its most basic usage though, we only need to specify our data (df
) and the column or columns that we want the data to be grouped by.
Let’s start by grouping df
using the Species
column.
df <- dplyr::group_by(df, Species)
print(df)
#' # A tibble: 150 × 5
#' # Groups: Species [3]
#' Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#' <dbl> <dbl> <dbl> <dbl> <fct>
#' 1 5.1 3.5 1.4 0.2 setosa
#' 2 4.9 3 1.4 0.2 setosa
#' 3 4.7 3.2 1.3 0.2 setosa
#' 4 4.6 3.1 1.5 0.2 setosa
#' 5 5 3.6 1.4 0.2 setosa
#' 6 5.4 3.9 1.7 0.4 setosa
#' 7 4.6 3.4 1.4 0.3 setosa
#' 8 5 3.4 1.5 0.2 setosa
#' 9 4.4 2.9 1.4 0.2 setosa
#' 10 4.9 3.1 1.5 0.1 setosa
#' # … with 140 more rows
#' # ℹ Use `print(n = ...)` to see more rows
Note that whilst using group_by()
has converted our data into a tibble (a tidyverse
implementation of the data.frame
), no changes have been made to the data itself. The output of print(df)
does show that we have 3 groups based on Species
; # Groups: Species [3]
.
To see the effects of group_by()
we can check the attributes of our data.
attributes(df)
#' $class
#' [1] "grouped_df" "tbl_df" "tbl" "data.frame"
#'
#' $row.names
#' [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
#' [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
#' [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
#' [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
#' [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
#' [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
#' [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
#' [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
#' [145] 145 146 147 148 149 150
#'
#' $names
#' [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#'
#' $groups
#' # A tibble: 3 × 2
#' Species .rows
#' <fct> <list<int>>
#' 1 setosa [50]
#' 2 versicolor [50]
#' 3 virginica [50]
You can see that the $groups
attribute has been added to record our groupings and an additional element has been added to the $class
attribute, grouped_df
.
CAUTION: Whilst our data remains grouped we need to take care when using functions that behave differently when passed a grouped_df
. Typically this will be functions from tidyverse
packages, though there is nothing to prevent authors of other packages utilising the grouped_df
class. If we don’t want our output to be effected by the groupings then the grouping will need to be removed explicitly using ungroup()
.
summarise()
summarise()
, from the dplyr
package, creates a new data frame which will contain a row for each combination of grouping variables that exists in the data, or if there are no grouping variables, a single row summarising all of the observations in the input. The resulting data frame contains a column for each grouping variable and additional columns for each of the summary statistics that have been specified.
We can use summarise()
to work out the mean petal length (Petal.Length
) for each Species
.
dplyr::summarise(df, Mean.Petal.Length = mean(Petal.Length))
#' # A tibble: 3 × 2
#' Species Mean.Petal.Length
#' <fct> <dbl>
#' 1 setosa 1.46
#' 2 versicolor 4.26
#' 3 virginica 5.55
We are not limited to adding one summary statistic at a time, if we want the mean and median values for Petal.Length
, we can add both in the same function call.
dplyr::summarise(
df,
Mean.Petal.Length = mean(Petal.Length),
Median.Petal.Length = median(Petal.Length)
)
#' # A tibble: 3 × 3
#' Species Mean.Petal.Length Median.Petal.Length
#' <fct> <dbl> <dbl>
#' 1 setosa 1.46 1.5
#' 2 versicolor 4.26 4.35
#' 3 virginica 5.55 5.55
n()
It is often useful to have a count of the number of observations within a group. The dplyr
function n()
allows us to do this.
dplyr::summarise(df, observations = dplyr::n())
#' # A tibble: 3 × 2
#' Species observations
#' <fct> <int>
#' 1 setosa 50
#' 2 versicolor 50
#' 3 virginica 50
ungroup()
Once we have finished working with the groupings that we created we need to remove them. Not removing the groupings can cause issues further down the line if you forget that they are present. To remove the current groupings we need to use the dplyr
function ungroup()
.
df <- dplyr::ungroup(df)
Using ungroup()
doesn’t appear to change anything in our data, similarly to group_by()
. However, if we review the attributes of df
again, we see that the $groups
attribute has now been removed and grouped_df
is no longer an element of the $class
attribute.
attributes(df)
#' $class
#' [1] "tbl_df" "tbl" "data.frame"
#'
#' $row.names
#' [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
#' [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
#' [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
#' [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
#' [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
#' [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
#' [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
#' [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
#' [145] 145 146 147 148 149 150
#'
#' $names
#' [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
group_by() - multiple grouping column
Grouping can also be applied across multiple columns. Lets add a new column to df
which we use for grouping. The new column, Big.Petal
, will add a logical value based on whether Petal.Length
and Petal.Width
are above average.
df$Big.Petal <- df$Petal.Length > mean(df$Petal.Length) & df$Petal.Width > mean(df$Petal.Width)
print(df)
#' # A tibble: 150 × 6
#' Sepal.Length Sepal.Width Petal.Length Petal.Width Species Big.Petal
#' <dbl> <dbl> <dbl> <dbl> <fct> <lgl>
#' 1 5.1 3.5 1.4 0.2 setosa FALSE
#' 2 4.9 3 1.4 0.2 setosa FALSE
#' 3 4.7 3.2 1.3 0.2 setosa FALSE
#' 4 4.6 3.1 1.5 0.2 setosa FALSE
#' 5 5 3.6 1.4 0.2 setosa FALSE
#' 6 5.4 3.9 1.7 0.4 setosa FALSE
#' 7 4.6 3.4 1.4 0.3 setosa FALSE
#' 8 5 3.4 1.5 0.2 setosa FALSE
#' 9 4.4 2.9 1.4 0.2 setosa FALSE
#' 10 4.9 3.1 1.5 0.1 setosa FALSE
#' # … with 140 more rows
#' # ℹ Use `print(n = ...)` to see more rows
We can now group our data by Species
and Big.Petal
.
df <- dplyr::group_by(df, Species, Big.Petal)
This grouping allows us to calculate a mean Sepal.Length
value by both Species
and Big.Petal
.
dplyr::summarise(df, Mean.Sepal.Length = mean(Sepal.Length))
#' # A tibble: 4 × 3
#' # Groups: Species [3]
#' Species Big.Petal Mean.Sepal.Length
#' <fct> <lgl> <dbl>
#' 1 setosa FALSE 5.01
#' 2 versicolor FALSE 5.43
#' 3 versicolor TRUE 6.08
#' 4 virginica TRUE 6.59
group_by - keep all columns
summarise()
works well, but it reduces the number of rows down to one row per distinct combination of the grouping variables, and drops columns that aren’t grouping variables. This may not always be our desired behaviour. If we want to add the mean Sepal.Length
value by Species
and Big.Petal
to our data ‘as is’, we need to use another dplyr
function; mutate()
. mutate()
is used to add new columns.
df <- dplyr::mutate(df, Mean.Sepal.Length = mean(Sepal.Length))
print(df)
#' # A tibble: 150 × 7
#' # Groups: Species, Big.Petal [4]
#' Sepal.Length Sepal.Width Petal.Length Petal.Width Species Big.Petal Mean.Se…¹
#' <dbl> <dbl> <dbl> <dbl> <fct> <lgl> <dbl>
#' 1 5.1 3.5 1.4 0.2 setosa FALSE 5.01
#' 2 4.9 3 1.4 0.2 setosa FALSE 5.01
#' 3 4.7 3.2 1.3 0.2 setosa FALSE 5.01
#' 4 4.6 3.1 1.5 0.2 setosa FALSE 5.01
#' 5 5 3.6 1.4 0.2 setosa FALSE 5.01
#' 6 5.4 3.9 1.7 0.4 setosa FALSE 5.01
#' 7 4.6 3.4 1.4 0.3 setosa FALSE 5.01
#' 8 5 3.4 1.5 0.2 setosa FALSE 5.01
#' 9 4.4 2.9 1.4 0.2 setosa FALSE 5.01
#' 10 4.9 3.1 1.5 0.1 setosa FALSE 5.01
#' # … with 140 more rows, and abbreviated variable name ¹Mean.Sepal.Length
#' # ℹ Use `print(n = ...)` to see more rows
Next steps
Grouping is a powerful and effective way to add summary statistics to your data and the dplyr
package offers easy to use functions to achieve this.
Try some of the tasks below to put the theory into practice.
Create a data.frame
named df
using the below code.
Reveal data generation code
df <- structure(list(Date = c(20210201L, 20210201L, 20210201L, 20210201L,
20210201L, 20210201L, 20210201L, 20210201L, 20210201L, 20210201L,
20210201L, 20210201L, 20210201L, 20210201L, 20210201L, 20210201L,
20210201L, 20210201L, 20210201L, 20210201L, 20210202L, 20210202L,
20210202L, 20210202L, 20210202L, 20210202L, 20210202L, 20210202L,
20210202L, 20210202L, 20210202L, 20210202L, 20210202L, 20210202L,
20210202L, 20210202L, 20210202L, 20210202L, 20210202L, 20210202L,
20210203L, 20210203L, 20210203L, 20210203L, 20210203L, 20210203L,
20210203L, 20210203L, 20210203L, 20210203L, 20210203L, 20210203L,
20210203L, 20210203L, 20210203L, 20210203L, 20210203L, 20210203L,
20210203L, 20210203L, 20210204L, 20210204L, 20210204L, 20210204L,
20210204L, 20210204L, 20210204L, 20210204L, 20210204L, 20210204L,
20210204L, 20210204L, 20210204L, 20210204L, 20210204L, 20210204L,
20210204L, 20210204L, 20210204L, 20210204L, 20210205L, 20210205L,
20210205L, 20210205L, 20210205L, 20210205L, 20210205L, 20210205L,
20210205L, 20210205L, 20210205L, 20210205L, 20210205L, 20210205L,
20210205L, 20210205L, 20210205L, 20210205L, 20210205L, 20210205L,
20210206L, 20210206L, 20210206L, 20210206L, 20210206L, 20210206L,
20210206L, 20210206L, 20210206L, 20210206L, 20210206L, 20210206L,
20210206L, 20210206L, 20210206L, 20210206L, 20210206L, 20210206L,
20210206L, 20210206L, 20210207L, 20210207L, 20210207L, 20210207L,
20210207L, 20210207L, 20210207L, 20210207L, 20210207L, 20210207L,
20210207L, 20210207L, 20210207L, 20210207L, 20210207L, 20210207L,
20210207L, 20210207L, 20210207L, 20210207L), Sex = c("Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Female", "Male", "Male", "Male", "Male", "Male", "Male",
"Male", "Male", "Male", "Male", "Female", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male",
"Male", "Male", "Female", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Male", "Male",
"Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Male", "Male", "Male", "Male",
"Male", "Male", "Male", "Male", "Male", "Male", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Male", "Male", "Male", "Male", "Male", "Male", "Male",
"Male", "Male", "Male", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Male",
"Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male",
"Male", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Female", "Male", "Male", "Male",
"Male", "Male", "Male", "Male", "Male", "Male", "Male"), AgeGroup = c("0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus", "0 to 14", "0 to 59", "15 to 19", "20 to 24", "25 to 44",
"45 to 64", "60+", "65 to 74", "75 to 84", "85plus", "0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus", "0 to 14", "0 to 59", "15 to 19", "20 to 24", "25 to 44",
"45 to 64", "60+", "65 to 74", "75 to 84", "85plus", "0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus", "0 to 14", "0 to 59", "15 to 19", "20 to 24", "25 to 44",
"45 to 64", "60+", "65 to 74", "75 to 84", "85plus", "0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus", "0 to 14", "0 to 59", "15 to 19", "20 to 24", "25 to 44",
"45 to 64", "60+", "65 to 74", "75 to 84", "85plus", "0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus"), FirstInfections = c(45L, 454L, 28L, 44L, 203L, 176L,
132L, 37L, 25L, 28L, 36L, 395L, 19L, 46L, 162L, 166L, 114L, 41L,
22L, 17L, 41L, 449L, 26L, 50L, 192L, 162L, 121L, 45L, 34L, 20L,
33L, 407L, 22L, 48L, 183L, 148L, 72L, 20L, 20L, 5L, 31L, 375L,
28L, 45L, 153L, 150L, 91L, 28L, 18L, 13L, 29L, 369L, 27L, 39L,
154L, 143L, 95L, 35L, 25L, 12L, 27L, 362L, 25L, 38L, 160L, 146L,
93L, 22L, 20L, 17L, 37L, 356L, 18L, 30L, 150L, 149L, 87L, 30L,
19L, 10L, 33L, 360L, 31L, 51L, 141L, 127L, 119L, 39L, 22L, 35L,
34L, 349L, 21L, 32L, 160L, 132L, 107L, 33L, 35L, 9L, 27L, 323L,
28L, 34L, 140L, 113L, 74L, 28L, 15L, 12L, 30L, 267L, 16L, 28L,
104L, 105L, 42L, 15L, 7L, 4L, 23L, 280L, 18L, 29L, 127L, 112L,
71L, 17L, 10L, 15L, 31L, 246L, 15L, 20L, 106L, 92L, 57L, 18L,
19L, 2L), Reinfections = c(0L, 5L, 0L, 1L, 3L, 1L, 2L, 1L, 0L,
1L, 1L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L,
2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
7L, 1L, 1L, 1L, 3L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 2L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 1L, 1L, 0L, 0L, 0L,
0L, 0L, 2L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 2L, 0L, 0L, 2L,
0L, 0L, 0L, 0L, 0L, 1L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -140L))
The data contains details of the daily counts of positive COVID-19 cases for the 1st 7 days of February 2021, broken down by age group and sex. The data is available from the PHS Scotland open data platform.
-
Group the data by sex and calculate the mean and median number of first infections.
-
Group the data by sex and age group and calculate the mean and median of both first infections and reinfections.
-
Group the data by sex and age group and calculate the totals of both first infections and reinfections.
Answers
Your outputs should resemble the examples below;1.
#' # A tibble: 2 × 3
#' Sex mean_first_inf median_first_inf
#' <chr> <dbl> <dbl>
#' 1 Female 94.4 40
#' 2 Male 84.7 35
#' # A tibble: 20 × 6
#' # Groups: Sex [2]
#' Sex AgeGroup mean_first_inf median_first_inf mean_reinf median_reinf
#' <chr> <chr> <dbl> <int> <dbl> <int>
#' 1 Female 0 to 14 32.4 31 0.143 0
#' 2 Female 0 to 59 372. 362 3 2
#' 3 Female 15 to 19 26.3 28 0.143 0
#' 4 Female 20 to 24 41.6 44 0.429 0
#' 5 Female 25 to 44 159. 153 1.14 1
#' 6 Female 45 to 64 141. 146 1.14 1
#' 7 Female 60+ 100. 93 0.571 0
#' 8 Female 65 to 74 30.9 28 0.286 0
#' 9 Female 75 to 84 20.6 20 0 0
#' 10 Female 85plus 20 17 0.286 0
#' 11 Male 0 to 14 32.9 33 0.286 0
#' 12 Male 0 to 59 341. 356 1 1
#' 13 Male 15 to 19 19.7 19 0 0
#' 14 Male 20 to 24 34.7 32 0 0
#' 15 Male 25 to 44 146. 154 0.429 0
#' 16 Male 45 to 64 134. 143 0.286 0
#' 17 Male 60+ 82 87 0.143 0
#' 18 Male 65 to 74 27.4 30 0.143 0
#' 19 Male 75 to 84 21 20 0 0
#' 20 Male 85plus 8.43 9 0 0
#' # A tibble: 20 × 4
#' # Groups: Sex [2]
#' Sex AgeGroup total_first_inf total_reinf
#' <chr> <chr> <int> <int>
#' 1 Female 0 to 14 227 1
#' 2 Female 0 to 59 2603 21
#' 3 Female 15 to 19 184 1
#' 4 Female 20 to 24 291 3
#' 5 Female 25 to 44 1116 8
#' 6 Female 45 to 64 986 8
#' 7 Female 60+ 701 4
#' 8 Female 65 to 74 216 2
#' 9 Female 75 to 84 144 0
#' 10 Female 85plus 140 2
#' 11 Male 0 to 14 230 2
#' 12 Male 0 to 59 2389 7
#' 13 Male 15 to 19 138 0
#' 14 Male 20 to 24 243 0
#' 15 Male 25 to 44 1019 3
#' 16 Male 45 to 64 935 2
#' 17 Male 60+ 574 1
#' 18 Male 65 to 74 192 1
#' 19 Male 75 to 84 147 0
#' 20 Male 85plus 59 0