Data often contains information pertaining to many different groups, and producing summary or aggregated statistics for these groups is a common task. Being able to perform grouping operations efficiently is therefore a valuable skill.
There are many ways to produce summary statistics and aggregations in R; however, one of the most intuitive is to use the tidyverse package dplyr. The dplyr package offers functions for aggregating and summarising data that are simple to use and avoid some of the pitfalls found in alternative approaches.
We can use the built-in iris dataset to explore grouping in R. For a quick reminder of the contents of iris, let’s use the head() function.
head(iris)
#' Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#' 1 5.1 3.5 1.4 0.2 setosa
#' 2 4.9 3.0 1.4 0.2 setosa
#' 3 4.7 3.2 1.3 0.2 setosa
#' 4 4.6 3.1 1.5 0.2 setosa
#' 5 5.0 3.6 1.4 0.2 setosa
#' 6 5.4 3.9 1.7 0.4 setosa
We can create a copy of iris to work with, named df.
df <- iris
Grouping data
Using the dplyr package to group data for the production of summary statistics and aggregations involves two functions: group_by() and ungroup().
group_by() is used to create groups, and ungroup() to remove them.
group_by() - single grouping column
The group_by() function accepts various arguments; you can check these out by running ?group_by. In its most basic usage, though, we only need to specify our data (df) and the column or columns that we want the data to be grouped by.
Let’s start by grouping df using the Species column.
df <- dplyr::group_by(df, Species)
print(df)
#' # A tibble: 150 × 5
#' # Groups: Species [3]
#' Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#' <dbl> <dbl> <dbl> <dbl> <fct>
#' 1 5.1 3.5 1.4 0.2 setosa
#' 2 4.9 3 1.4 0.2 setosa
#' 3 4.7 3.2 1.3 0.2 setosa
#' 4 4.6 3.1 1.5 0.2 setosa
#' 5 5 3.6 1.4 0.2 setosa
#' 6 5.4 3.9 1.7 0.4 setosa
#' 7 4.6 3.4 1.4 0.3 setosa
#' 8 5 3.4 1.5 0.2 setosa
#' 9 4.4 2.9 1.4 0.2 setosa
#' 10 4.9 3.1 1.5 0.1 setosa
#' # … with 140 more rows
#' # ℹ Use `print(n = ...)` to see more rows
Note that whilst using group_by() has converted our data into a tibble (a tidyverse implementation of the data.frame), no changes have been made to the data itself. The output of print(df) does, however, show that we have 3 groups based on Species: # Groups: Species [3].
To see the effects of group_by() we can check the attributes of our data.
attributes(df)
#' $class
#' [1] "grouped_df" "tbl_df" "tbl" "data.frame"
#'
#' $row.names
#' [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
#' [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
#' [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
#' [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
#' [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
#' [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
#' [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
#' [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
#' [145] 145 146 147 148 149 150
#'
#' $names
#' [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#'
#' $groups
#' # A tibble: 3 × 2
#' Species .rows
#' <fct> <list<int>>
#' 1 setosa [50]
#' 2 versicolor [50]
#' 3 virginica [50]
You can see that the $groups attribute has been added to record our groupings, and that an additional element, grouped_df, has been added to the $class attribute.
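If you just want to check the current groupings without digging through the attributes, dplyr also provides helper functions for this; a minimal sketch, assuming df is still grouped by Species as above:
dplyr::group_vars(df)
#' [1] "Species"
dplyr::n_groups(df)
#' [1] 3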
CAUTION: Whilst our data remains grouped, we need to take care when using functions that behave differently when passed a grouped_df. Typically these will be functions from tidyverse packages, though there is nothing to prevent authors of other packages from utilising the grouped_df class. If we don’t want our output to be affected by the groupings, they will need to be removed explicitly using ungroup().
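For example, mutate() (used later on this page to add new columns) evaluates its expressions within each group when given a grouped_df. A minimal sketch of the difference, assuming df is still grouped by Species (the column name Group.Mean is just for illustration):
# With the grouping in place, mean() is calculated once per Species...
grouped <- dplyr::mutate(df, Group.Mean = mean(Sepal.Length))
unique(grouped$Group.Mean)
#' [1] 5.006 5.936 6.588
# ...whereas without the grouping the same call uses the mean of all 150 rows.
ungrouped <- dplyr::mutate(dplyr::ungroup(df), Group.Mean = mean(Sepal.Length))
unique(ungrouped$Group.Mean)
#' [1] 5.843333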
summarise()
summarise(), from the dplyr package, creates a new data frame containing one row for each combination of grouping variables present in the data or, if there are no grouping variables, a single row summarising all of the observations in the input. The resulting data frame has a column for each grouping variable plus a column for each of the summary statistics specified.
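As a quick illustration of the ungrouped case, calling summarise() on the original, ungrouped iris data returns a single summary row (a minimal sketch):
dplyr::summarise(iris, Mean.Petal.Length = mean(Petal.Length))
#' Mean.Petal.Length
#' 1 3.758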
We can use summarise() to work out the mean petal length (Petal.Length) for each Species.
dplyr::summarise(df, Mean.Petal.Length = mean(Petal.Length))
#' # A tibble: 3 × 2
#' Species Mean.Petal.Length
#' <fct> <dbl>
#' 1 setosa 1.46
#' 2 versicolor 4.26
#' 3 virginica 5.55
We are not limited to adding one summary statistic at a time; if we want the mean and median values for Petal.Length, we can add both in the same function call.
dplyr::summarise(
df,
Mean.Petal.Length = mean(Petal.Length),
Median.Petal.Length = median(Petal.Length)
)
#' # A tibble: 3 × 3
#' Species Mean.Petal.Length Median.Petal.Length
#' <fct> <dbl> <dbl>
#' 1 setosa 1.46 1.5
#' 2 versicolor 4.26 4.35
#' 3 virginica 5.55 5.55
n()
It is often useful to have a count of the number of observations within a group. The dplyr function n() allows us to do this.
dplyr::summarise(df, observations = dplyr::n())
#' # A tibble: 3 × 2
#' Species observations
#' <fct> <int>
#' 1 setosa 50
#' 2 versicolor 50
#' 3 virginica 50
ungroup()
Once we have finished working with the groupings that we created, we need to remove them. Not removing them can cause issues further down the line if you forget that they are present. To remove the current groupings, we use the dplyr function ungroup().
df <- dplyr::ungroup(df)
As with group_by(), using ungroup() doesn’t appear to change anything in our data. However, if we review the attributes of df again, we see that the $groups attribute has now been removed and grouped_df is no longer an element of the $class attribute.
attributes(df)
#' $class
#' [1] "tbl_df" "tbl" "data.frame"
#'
#' $row.names
#' [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
#' [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
#' [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
#' [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
#' [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
#' [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
#' [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
#' [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
#' [145] 145 146 147 148 149 150
#'
#' $names
#' [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"group_by() - multiple grouping column
Grouping can also be applied across multiple columns. Let’s add a new column to df to use for grouping. The new column, Big.Petal, will hold a logical value indicating whether both Petal.Length and Petal.Width are above their mean values.
df$Big.Petal <- df$Petal.Length > mean(df$Petal.Length) & df$Petal.Width > mean(df$Petal.Width)
print(df)
#' # A tibble: 150 × 6
#' Sepal.Length Sepal.Width Petal.Length Petal.Width Species Big.Petal
#' <dbl> <dbl> <dbl> <dbl> <fct> <lgl>
#' 1 5.1 3.5 1.4 0.2 setosa FALSE
#' 2 4.9 3 1.4 0.2 setosa FALSE
#' 3 4.7 3.2 1.3 0.2 setosa FALSE
#' 4 4.6 3.1 1.5 0.2 setosa FALSE
#' 5 5 3.6 1.4 0.2 setosa FALSE
#' 6 5.4 3.9 1.7 0.4 setosa FALSE
#' 7 4.6 3.4 1.4 0.3 setosa FALSE
#' 8 5 3.4 1.5 0.2 setosa FALSE
#' 9 4.4 2.9 1.4 0.2 setosa FALSE
#' 10 4.9 3.1 1.5 0.1 setosa FALSE
#' # … with 140 more rows
#' # ℹ Use `print(n = ...)` to see more rows
We can now group our data by Species and Big.Petal.
df <- dplyr::group_by(df, Species, Big.Petal)
This grouping allows us to calculate a mean Sepal.Length value by both Species and Big.Petal.
dplyr::summarise(df, Mean.Sepal.Length = mean(Sepal.Length))
#' # A tibble: 4 × 3
#' # Groups: Species [3]
#' Species Big.Petal Mean.Sepal.Length
#' <fct> <lgl> <dbl>
#' 1 setosa FALSE 5.01
#' 2 versicolor FALSE 5.43
#' 3 versicolor TRUE 6.08
#' 4 virginica TRUE 6.59
Notice that the output contains only the combinations of Species and Big.Petal that actually occur in the data; there are no rows for setosa with Big.Petal TRUE or virginica with Big.Petal FALSE.
group_by() - keep all columns
summarise() works well, but it reduces the data to one row per distinct combination of the grouping variables and drops the original columns that are not grouping variables. This may not always be the behaviour we want. If we want to add the mean Sepal.Length value by Species and Big.Petal to our data ‘as is’, we need another dplyr function: mutate(). mutate() is used to add new columns.
df <- dplyr::mutate(df, Mean.Sepal.Length = mean(Sepal.Length))
print(df)
#' # A tibble: 150 × 7
#' # Groups: Species, Big.Petal [4]
#' Sepal.Length Sepal.Width Petal.Length Petal.Width Species Big.Petal Mean.Se…¹
#' <dbl> <dbl> <dbl> <dbl> <fct> <lgl> <dbl>
#' 1 5.1 3.5 1.4 0.2 setosa FALSE 5.01
#' 2 4.9 3 1.4 0.2 setosa FALSE 5.01
#' 3 4.7 3.2 1.3 0.2 setosa FALSE 5.01
#' 4 4.6 3.1 1.5 0.2 setosa FALSE 5.01
#' 5 5 3.6 1.4 0.2 setosa FALSE 5.01
#' 6 5.4 3.9 1.7 0.4 setosa FALSE 5.01
#' 7 4.6 3.4 1.4 0.3 setosa FALSE 5.01
#' 8 5 3.4 1.5 0.2 setosa FALSE 5.01
#' 9 4.4 2.9 1.4 0.2 setosa FALSE 5.01
#' 10 4.9 3.1 1.5 0.1 setosa FALSE 5.01
#' # … with 140 more rows, and abbreviated variable name ¹Mean.Sepal.Length
#' # ℹ Use `print(n = ...)` to see more rows
Next steps
Grouping is a powerful and effective way to add summary statistics to your data, and the dplyr package offers easy-to-use functions to achieve this.
Try some of the tasks below to put the theory into practice.
Create a data.frame named df using the code below.
df <- structure(list(Date = c(20210201L, 20210201L, 20210201L, 20210201L,
20210201L, 20210201L, 20210201L, 20210201L, 20210201L, 20210201L,
20210201L, 20210201L, 20210201L, 20210201L, 20210201L, 20210201L,
20210201L, 20210201L, 20210201L, 20210201L, 20210202L, 20210202L,
20210202L, 20210202L, 20210202L, 20210202L, 20210202L, 20210202L,
20210202L, 20210202L, 20210202L, 20210202L, 20210202L, 20210202L,
20210202L, 20210202L, 20210202L, 20210202L, 20210202L, 20210202L,
20210203L, 20210203L, 20210203L, 20210203L, 20210203L, 20210203L,
20210203L, 20210203L, 20210203L, 20210203L, 20210203L, 20210203L,
20210203L, 20210203L, 20210203L, 20210203L, 20210203L, 20210203L,
20210203L, 20210203L, 20210204L, 20210204L, 20210204L, 20210204L,
20210204L, 20210204L, 20210204L, 20210204L, 20210204L, 20210204L,
20210204L, 20210204L, 20210204L, 20210204L, 20210204L, 20210204L,
20210204L, 20210204L, 20210204L, 20210204L, 20210205L, 20210205L,
20210205L, 20210205L, 20210205L, 20210205L, 20210205L, 20210205L,
20210205L, 20210205L, 20210205L, 20210205L, 20210205L, 20210205L,
20210205L, 20210205L, 20210205L, 20210205L, 20210205L, 20210205L,
20210206L, 20210206L, 20210206L, 20210206L, 20210206L, 20210206L,
20210206L, 20210206L, 20210206L, 20210206L, 20210206L, 20210206L,
20210206L, 20210206L, 20210206L, 20210206L, 20210206L, 20210206L,
20210206L, 20210206L, 20210207L, 20210207L, 20210207L, 20210207L,
20210207L, 20210207L, 20210207L, 20210207L, 20210207L, 20210207L,
20210207L, 20210207L, 20210207L, 20210207L, 20210207L, 20210207L,
20210207L, 20210207L, 20210207L, 20210207L), Sex = c("Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Female", "Male", "Male", "Male", "Male", "Male", "Male",
"Male", "Male", "Male", "Male", "Female", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male",
"Male", "Male", "Female", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Male", "Male",
"Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Male", "Male", "Male", "Male",
"Male", "Male", "Male", "Male", "Male", "Male", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Male", "Male", "Male", "Male", "Male", "Male", "Male",
"Male", "Male", "Male", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Male",
"Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male",
"Male", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Female", "Male", "Male", "Male",
"Male", "Male", "Male", "Male", "Male", "Male", "Male"), AgeGroup = c("0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus", "0 to 14", "0 to 59", "15 to 19", "20 to 24", "25 to 44",
"45 to 64", "60+", "65 to 74", "75 to 84", "85plus", "0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus", "0 to 14", "0 to 59", "15 to 19", "20 to 24", "25 to 44",
"45 to 64", "60+", "65 to 74", "75 to 84", "85plus", "0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus", "0 to 14", "0 to 59", "15 to 19", "20 to 24", "25 to 44",
"45 to 64", "60+", "65 to 74", "75 to 84", "85plus", "0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus", "0 to 14", "0 to 59", "15 to 19", "20 to 24", "25 to 44",
"45 to 64", "60+", "65 to 74", "75 to 84", "85plus", "0 to 14",
"0 to 59", "15 to 19", "20 to 24", "25 to 44", "45 to 64", "60+",
"65 to 74", "75 to 84", "85plus", "0 to 14", "0 to 59", "15 to 19",
"20 to 24", "25 to 44", "45 to 64", "60+", "65 to 74", "75 to 84",
"85plus"), FirstInfections = c(45L, 454L, 28L, 44L, 203L, 176L,
132L, 37L, 25L, 28L, 36L, 395L, 19L, 46L, 162L, 166L, 114L, 41L,
22L, 17L, 41L, 449L, 26L, 50L, 192L, 162L, 121L, 45L, 34L, 20L,
33L, 407L, 22L, 48L, 183L, 148L, 72L, 20L, 20L, 5L, 31L, 375L,
28L, 45L, 153L, 150L, 91L, 28L, 18L, 13L, 29L, 369L, 27L, 39L,
154L, 143L, 95L, 35L, 25L, 12L, 27L, 362L, 25L, 38L, 160L, 146L,
93L, 22L, 20L, 17L, 37L, 356L, 18L, 30L, 150L, 149L, 87L, 30L,
19L, 10L, 33L, 360L, 31L, 51L, 141L, 127L, 119L, 39L, 22L, 35L,
34L, 349L, 21L, 32L, 160L, 132L, 107L, 33L, 35L, 9L, 27L, 323L,
28L, 34L, 140L, 113L, 74L, 28L, 15L, 12L, 30L, 267L, 16L, 28L,
104L, 105L, 42L, 15L, 7L, 4L, 23L, 280L, 18L, 29L, 127L, 112L,
71L, 17L, 10L, 15L, 31L, 246L, 15L, 20L, 106L, 92L, 57L, 18L,
19L, 2L), Reinfections = c(0L, 5L, 0L, 1L, 3L, 1L, 2L, 1L, 0L,
1L, 1L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L,
2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
7L, 1L, 1L, 1L, 3L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 2L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 1L, 1L, 0L, 0L, 0L,
0L, 0L, 2L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 2L, 0L, 0L, 2L,
0L, 0L, 0L, 0L, 0L, 1L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -140L))
The data contains daily counts of positive COVID-19 cases for the first 7 days of February 2021, broken down by age group and sex. The data is available from the PHS Scotland open data platform.
1. Group the data by sex and calculate the mean and median number of first infections.
2. Group the data by sex and age group and calculate the mean and median of both first infections and reinfections.
3. Group the data by sex and age group and calculate the totals of both first infections and reinfections.
Answers
Your outputs should resemble the examples below:
1.
#' # A tibble: 2 × 3
#' Sex mean_first_inf median_first_inf
#' <chr> <dbl> <dbl>
#' 1 Female 94.4 40
#' 2 Male 84.7 35
2.
#' # A tibble: 20 × 6
#' # Groups: Sex [2]
#' Sex AgeGroup mean_first_inf median_first_inf mean_reinf median_reinf
#' <chr> <chr> <dbl> <int> <dbl> <int>
#' 1 Female 0 to 14 32.4 31 0.143 0
#' 2 Female 0 to 59 372. 362 3 2
#' 3 Female 15 to 19 26.3 28 0.143 0
#' 4 Female 20 to 24 41.6 44 0.429 0
#' 5 Female 25 to 44 159. 153 1.14 1
#' 6 Female 45 to 64 141. 146 1.14 1
#' 7 Female 60+ 100. 93 0.571 0
#' 8 Female 65 to 74 30.9 28 0.286 0
#' 9 Female 75 to 84 20.6 20 0 0
#' 10 Female 85plus 20 17 0.286 0
#' 11 Male 0 to 14 32.9 33 0.286 0
#' 12 Male 0 to 59 341. 356 1 1
#' 13 Male 15 to 19 19.7 19 0 0
#' 14 Male 20 to 24 34.7 32 0 0
#' 15 Male 25 to 44 146. 154 0.429 0
#' 16 Male 45 to 64 134. 143 0.286 0
#' 17 Male 60+ 82 87 0.143 0
#' 18 Male 65 to 74 27.4 30 0.143 0
#' 19 Male 75 to 84 21 20 0 0
#' 20 Male 85plus 8.43 9 0 0
3.
#' # A tibble: 20 × 4
#' # Groups: Sex [2]
#' Sex AgeGroup total_first_inf total_reinf
#' <chr> <chr> <int> <int>
#' 1 Female 0 to 14 227 1
#' 2 Female 0 to 59 2603 21
#' 3 Female 15 to 19 184 1
#' 4 Female 20 to 24 291 3
#' 5 Female 25 to 44 1116 8
#' 6 Female 45 to 64 986 8
#' 7 Female 60+ 701 4
#' 8 Female 65 to 74 216 2
#' 9 Female 75 to 84 144 0
#' 10 Female 85plus 140 2
#' 11 Male 0 to 14 230 2
#' 12 Male 0 to 59 2389 7
#' 13 Male 15 to 19 138 0
#' 14 Male 20 to 24 243 0
#' 15 Male 25 to 44 1019 3
#' 16 Male 45 to 64 935 2
#' 17 Male 60+ 574 1
#' 18 Male 65 to 74 192 1
#' 19 Male 75 to 84 147 0
#' 20 Male 85plus 59 0
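For reference, one possible approach to the first task is sketched below (the column names mean_first_inf and median_first_inf simply match the output above; any names will do). The remaining tasks follow the same pattern, grouping by Sex and AgeGroup and using sum() for the totals.
# Group by Sex, summarise first infections, then remove the grouping.
df <- dplyr::group_by(df, Sex)
dplyr::summarise(
  df,
  mean_first_inf = mean(FirstInfections),
  median_first_inf = median(FirstInfections)
)
df <- dplyr::ungroup(df)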