In Data Frames Part 1 we looked at creating data.frame
objects, accessing their columns, and creating subsets of them using their row and column indices. We also looked at some useful functions for interacting with data.frame
objects’s, including ncol()
, nrow()
, and names()
.
In the 2nd part we explore these key data structures in more detail.
Working with data frames
mtcars
is one of a variety of data sets that are included with R and are typically used in tutorials. We can view the first few rows of mtcars
to get a feel for the data set using the head()
function.
The tail()
function can also be used to inspect an object and works similarly to head()
, returning the last rows of a data.frame
. head()
and tail()
can also be used on individual columns, or in fact any vector, to return the first (or last) few elements.
You may have noticed that mtcars
doesn’t actually have a column containing the car names, rather they are stored as row names. We can check and confirm this using the row.names()
function.
Subset rows based on criteria
We know that we can access a data.frame
’s elements using the indices of the rows and columns that we want to include. You can revisit Data Frames Part 1 for a refresher.
The position of the row and column indices along with the ,
are perhaps the most important considerations at this stage, and are the most likely to cause confusion.
It is important to take extra care with the syntax initially, but as you gain experience from putting the theory into practice you will find that it quickly becomes more intuitive, for example, knowing that if we don’t include the comma and only pass one index (i.e. df[1]
), R will assume that we want a column.
As you become more confident with manipulating data.frame
objects in this way you can start to advance towards more powerful applications to create specific subsets of data.frame
’s.
Subset rows using a numeric vector
Lets say that we want to extract all of the rows from mtcars
for cars with over 200 horsepower.
To achieve our goal we could look at mtcars
and write down the row numbers where the value in the hp
column is greater than 200, which would be rows 7, 15, 16, 17, 24, 29, and 31.
Using the index system we are able to create a subset of mtcars
containing those rows.
Note that we have used the c()
function to create a vector containing the row numbers, and we have included the ,
even though we aren’t specifying any columns. If we try to select the rows without including the ,
, R will assume we want columns, and since we only have 11 columns will return an error.
Whilst this approach works, its clearly not very scalable. It requires manually counting the rows we want and would be impractical with larger or more complex data sets.
One way to approach the problem in a programmatic way is to use the which()
function.
which(mtcars$hp > 200)
has returned the indices of elements with values greater than 200 in the mtcars
hp
column. We have assigned the output of which()
to hp_200
, so we can now reuse those values elsewhere, in this case to subset our data.frame.
It is possible to skip the step of assigning the output from which()
altogether by using the function within [ ]
directly.
Subset rows using a logical vector
We can also subset mtcars
using a logical (TRUE or FALSE) vector. If we run mtcars$hp > 200
we get a logical vector in return, where TRUE
indicates that the vector element was greater than 200.
Logical vectors can be used to subset other vectors and data.frame
’s. We can explore this in action with a simple example.
The 1st, 2nd, and 3rd elements of logical_vector
are TRUE
, so the subset operation keeps the 1st, 2nd, and 3rd elements of my_vector
. We can use this approach with mtcars
to create the subset data.frame
of cars with over 200 hp.
R often provides us with lots of ways to do the same thing. Which approach you use will often come down to how the wider project is structured and choosing an approach is something that becomes more intuitive with experience. Considerations such as readability and how obvious the functionality of a piece of code is to a third party, or indeed yourself upon returning to it, are important.
For newcomers it is often best to work with the option you find most intuitive, but in essence all of these approaches give you the same output.
A tidy approach
Use of the tidyverse
is somewhat ubiquitous in R and the tidy style is now often taught before base R. You will notice that so far we haven’t looked at any tidy functions. Whilst the tidyverse
posits itself as being simpler to learn, this isn’t necessarily true. Building a strong understanding of base R will afford you greater flexibility and comprehension of the language when progressing to more advanced usage.
With that in mind, it can be useful to see how the examples we have looked at so far might translate into the tidyverse
. The example below demonstrates how we could produce a subset data.frame
of cars with over 200 hp from mtcars
in a tidy fashion.
You will notice the use of the pipe operator %>%
and that the hp
column is referred to simply as hp
rather than mtcars$hp
. The dplyr
functions try to be more intuitive by assuming hp
in this instance will be a column in mtcars
.
A (very) brief note about pipes: Pipes are commonly used in R to chain together multiple functions. There is also a native pipe operator in R |>
(available in versions 4.1.0 and above). In its simplest usage the pipe operator takes the object to its left and passes it to its right as the first argument.
In the example above, the pipe operator takes mtcars
and passes it to the first argument of dplyr::filter()
.
We can write the snippet without pipes, whilst still using dplyr::filter()
.
Base or tidy? A natural question to arise whilst learning R is whether to prefer base R or tidy R. As you become comfortable with the language you will likely develop a preference for approaching problems in a certain way and may find that you prefer one framework over another.
However, an either or approach is only likely to lead to problems down the line. Whilst learning the language it is preferable to be exposed to multiple approaches in order that you can understand code written by others and be aware of alternatives if you encounter a challenging problem.
Subset columns based on criteria
We can select a subset of columns from mtcars
in a few different ways.
Subset columns using a numeric vector
We can select columns using their indices. We don’t need to include ,
within [ ]
if we only want columns. However, it doesn’t cause any issues if we do include ,
, but it must come before the column indices. If we want to select the first 3 columns of mtcars
we can use a numeric vector.
You will have noticed in the example above that mtcars[, 1:3]
and mtcars[1:3]
return the same object, but mtcars[1:3, ]
doesn’t. The position of the ,
in mtcars[1:3, ]
causes R to extract the first 3 rows, rather than columns.
Subset columns using a character vector
We can use the column names to access specific columns of mtcars
. The names()
function can be really helpful here and allows us to check the name values that are available.
Lets use the column names to extract the mpg
, cyl
, and hp
columns from mtcars
.
This approach is effective and straight forward to use, making it an ideal approach for creating subsets of specific columns.
Subset columns using a logical vector
Sometimes you may want to select columns in a more dynamic way. Lets say we want a data.frame
consisting of any column from mtcars
with a name that is 2 characters long.
We can first get the number of characters in each of the column names and store the vector as name_len
.
We then convert the name_len
numeric vector to logical one.
Finally, we can use name_len
to select all of the columns from mtcars
with names 2 characters long.
As with our examples above, we can also avoid the intermediate steps by refactoring our code into a concise one-liner.
Subset rows and columns simultaneously based on criteria
We can combine what we have looked at so far to create a subset of both columns and rows from mtcars
. To extract the mpg
, cyl
, and hp
columns from mtcars
, but only keeping rows where hp
is greater than 200, we can combine the mtcars[mtcars$hp > 200,]
and mtcars[c("mpg", "cyl", "hp")]
code snippets we used earlier.
At this point we also need to remember the order in which we specify rows and columns inside [ ]
.
Lets give it a try, providing our row subset criteria followed by our column subset criteria.
We can also select rows based on the values in a column that we don’t want to actually include in our output. For example, if we want the mpg
, cyl
, and hp
columns from mtcars
, where carb
is greater than 4, we don’t actually need to include carb
in our new data.frame
.
A tidy approach
Examples of a tidyverse
approach to creating a subset of rows and columns simultaneously.
& AND | OR
The logical &
(AND) and |
(OR) operators are very important in any programming language, allowing us to build logic based on multiple conditions. The following examples demonstrate the returns generated by using the logical operators in different situations.
In one of the examples above we returned rows from mtcars
where the hp
value was greater than 200 and the carb
value was greater than 4. We can combine the separate criteria to return rows where both conditions are TRUE
using the &
operator.
We can also return the rows of mtcars
where either of the conditions are TRUE
.
Using a single column from a data.frame subset
We might not always want to create a new data.frame
, perhaps we simply want to know how many rows fulfil a certain criteria. We can use the nrow()
function for this.
We can also apply this approach to more complex problems. If we want to know the mean value of disp
where the value of carb
is greater than 2 and hp
is greater than 200, we could subset mtcars
and then use mean()
on the disp
column.
However, we don’t actually need to go through the assignment step, rather we can access the column directly by placing the $
operator at the end of our subset syntax.
Missing commas: There are a few common ‘gotchas’ that tend to come up in this topic area. When using [ ]
with data.frame
objects the most common cause of errors is a misplaced or missing ,
.
NA values: When using functions like sum()
and mean()
you may encounter situations where you have missing values within the vector you are applying the function to, which results in an output of NA
.
Many functions, including sum()
and mean()
, have an argument to allow missing values to be removed from consideration.
To check the arguments available for a function you can use ?
to access the functions documentation, for example ?sum
.
There are a few ways to check for missing values in a vector, but a simple one is to use any()
with is.na()
.
A tidy approach
Example of a tidyverse
approach to using a single column from a data.frame
subset.
Next steps
Manipulating data.frame
objects is typically a fundamental part of any analytical or statistical workflow using R. The techniques described above are highly flexible and can be easily adapted to form part of a solution to many problems you might encounter.
You will also be starting to see how we can build more complex code from simpler component parts, such as placing code to produce a subset of a data.frame
inside a function, for example sum(df[df$col_a == "x",]$col_b)
.
Practice and experimentation is key in regards to this topic and it is highly recommended that you play around with some data to fully explore it. You can also try some of the tasks below.
airquality
is a dataset that comes with R, and is available by default.
-
airquality
contains aMonth
andDay
column, both of which are numeric (hint January = 1, February = 2, etc). What was the average temperature (Temp
) during the month of July? -
What was the average wind speed (
Wind
) during the first 10 days of the month of June? -
Calculate the sum of Ozone values from days where the Wind was greater then 10 OR the Temp was greater than 90 (hint
Ozone
contains some missing values).
Answers
1. 83.903232. 10.84
3. 2273