Data frames are a fundamental part of R, and the functionality they provide plays an integral role in many analysis and data science workflows. Data frames are rectangular, two-dimensional tables made up of rows and columns, which makes them flexible and intuitive to work with.
Creating a data.frame
Many functions commonly used to read tabular data into R return a data.frame by default. We can also use the data.frame() function to create a data.frame with any number of columns. Imagine that you have the names, ages, and postcodes of 5 people. A data.frame with 5 rows and 3 columns would be an ideal way to store this information.
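As a minimal sketch of that scenario, the call below builds such a data.frame. The names, ages, and postcodes are made up purely for illustration, and the variable name df is chosen to match the name used later in this article.

```r
# Illustrative values only: five made-up people
df <- data.frame(
  name     = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age      = c(34, 27, 45, 52, 19),
  postcode = c("AB1 2CD", "EF3 4GH", "IJ5 6KL", "MN7 8OP", "QR9 0ST")
)

df  # printing the object shows 5 rows and 3 columns
```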
The main constraint when creating a data.frame is that the columns must be of the same length. A shorter vector is recycled only if its length divides the length of the longest column exactly; otherwise an error is returned.
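The sketch below shows what happens when the lengths do not match. The column names a and b are hypothetical, and tryCatch() is used only so the error message can be captured rather than stopping the script. The second call illustrates the one case where unequal lengths are accepted: R recycles a shorter vector when its length fits into the longest column a whole number of times.

```r
# Columns of length 3 and 2 cannot be combined, so data.frame() raises an error.
result <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
result  # the error message, captured as a character string

# A length-2 vector fits into 4 rows exactly, so it is recycled instead.
recycled <- data.frame(a = 1:4, b = 1:2)
recycled$b  # 1 2 1 2
```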
Each column of a data.frame is actually a vector, so we can also construct a data.frame from vectors, as long as they are of equal length.
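A small sketch of building a data.frame from existing vectors follows. The values are invented for illustration; when unnamed vectors are passed, data.frame() takes the column names from the variable names.

```r
# Three vectors of equal length (illustrative values)
name     <- c("Ann", "Ben", "Cara", "Dev", "Eve")
age      <- c(34, 27, 45, 52, 19)
postcode <- c("AB1 2CD", "EF3 4GH", "IJ5 6KL", "MN7 8OP", "QR9 0ST")

# Column names are taken from the variable names
df <- data.frame(name, age, postcode)
names(df)  # "name" "age" "postcode"
```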
data.frame columns
As data.frame columns are vectors, we can access them and use them inside functions that accept vectors. There are a few different ways to access a data.frame column, but the 3 most common are:
1. Using the $ operator.
2. Using the index.
3. Using the column name.
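The three options can be sketched as follows, using a made-up df in which age is the second column (the data values are assumptions for illustration).

```r
# Illustrative data; age is deliberately the second column
df <- data.frame(
  name     = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age      = c(34, 27, 45, 52, 19),
  postcode = c("AB1 2CD", "EF3 4GH", "IJ5 6KL", "MN7 8OP", "QR9 0ST")
)

df$age     # option 1, the $ operator: returns the age vector
df[2]      # option 2, the index: returns a one-column data.frame
df["age"]  # option 3, the column name: returns a one-column data.frame
```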
Options 2 and 3 return the column in a different format from option 1. Using $ strips the column of its data.frame attributes and returns a plain vector, whereas the other methods retain them and return a one-column data.frame. You can check this by using typeof() and attributes() on df$age and df["age"].
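A quick way to run that check is sketched below, again with invented data. Because the ages here are written as ordinary numbers, the vector's type is "double".

```r
# Illustrative data
df <- data.frame(
  name = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age  = c(34, 27, 45, 52, 19)
)

typeof(df$age)         # "double": a plain numeric vector
typeof(df["age"])      # "list"

attributes(df$age)     # NULL: the vector carries no attributes
attributes(df["age"])  # names, class, and row.names are retained
```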
The output of typeof(df["age"]) may have surprised you slightly. Beneath the surface, data.frames are actually list objects, with each column forming an element of the list.
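The list nature of a data.frame can be confirmed with a couple of base R calls, sketched here on made-up data.

```r
# Illustrative data
df <- data.frame(
  name     = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age      = c(34, 27, 45, 52, 19),
  postcode = c("AB1 2CD", "EF3 4GH", "IJ5 6KL", "MN7 8OP", "QR9 0ST")
)

is.list(df)  # TRUE: a data.frame is a list underneath
length(df)   # 3: one list element per column
```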
For now, we will focus on using the $ operator to access columns. The other options can be useful at times, though as a general rule try to avoid option 2 (df[2]). Specifying column names explicitly makes your code simpler to comprehend, improves reproducibility, and helps to reduce errors if your data changes.
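To illustrate why positional access is fragile, the sketch below reorders the columns of a made-up df. The variable reordered is hypothetical and stands in for data that arrives in a different layout.

```r
# Illustrative data
df <- data.frame(
  name     = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age      = c(34, 27, 45, 52, 19),
  postcode = c("AB1 2CD", "EF3 4GH", "IJ5 6KL", "MN7 8OP", "QR9 0ST")
)

# Suppose the data later arrives with its columns in a different order
reordered <- df[c("name", "postcode", "age")]

reordered[2]      # now the postcode column, not age
reordered["age"]  # still the age column
```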
Accessing a column allows us to use it as we would any other vector. We can get the sum and mean of the age column with the appropriate functions.
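For example, with the invented ages below:

```r
# Illustrative data
df <- data.frame(
  name = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age  = c(34, 27, 45, 52, 19)
)

sum(df$age)   # 177
mean(df$age)  # 35.4
```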
We can also subset a column accessed with the $ operator using the index system.
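A brief sketch, using the same made-up ages:

```r
# Illustrative data
df <- data.frame(
  name = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age  = c(34, 27, 45, 52, 19)
)

df$age[1]        # 34: the first element
df$age[2:3]      # 27 45: the second and third elements
df$age[c(1, 5)]  # 34 19: the first and last elements
```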
data.frame subsets
We can also create subsets of data.frames using indices. With data.frames, two indices are provided, the first for the rows and the second for the columns. Either index can be left blank to select every row or every column, as long as the comma separating the two is kept.
Each individual element in a data.frame has 2 indices.
The examples below show how the index system can be used to access various elements in our data.frame.
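A few such cases are sketched here on a made-up df; the values are assumptions for illustration only.

```r
# Illustrative data
df <- data.frame(
  name     = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age      = c(34, 27, 45, 52, 19),
  postcode = c("AB1 2CD", "EF3 4GH", "IJ5 6KL", "MN7 8OP", "QR9 0ST")
)

df[1, 2]        # row 1, column 2: the first person's age, 34
df[1:2, ]       # rows 1 and 2, every column
df[, 3]         # every row of column 3: the postcodes
df[3, "name"]   # row 3 of the name column
df[4:5, c("name", "age")]  # rows 4 and 5 of two named columns
```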
Functions for data.frames
When working with vectors we can check their length using the length() function. Given that all columns of a data.frame are vectors of equal length, we could use length() on any column; however, it is more convenient to use the nrow() function to identify the number of rows.
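Both approaches give the same answer, as this short sketch on made-up data shows.

```r
# Illustrative data
df <- data.frame(
  name = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age  = c(34, 27, 45, 52, 19)
)

length(df$age)  # 5: the length of one column
nrow(df)        # 5: the number of rows
```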
Some other useful functions for interacting with data.frames include ncol(), used to get the number of columns, and names(), which provides the names of the data.frame's columns.
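Applied to a made-up df, these functions behave as follows.

```r
# Illustrative data
df <- data.frame(
  name     = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age      = c(34, 27, 45, 52, 19),
  postcode = c("AB1 2CD", "EF3 4GH", "IJ5 6KL", "MN7 8OP", "QR9 0ST")
)

ncol(df)   # 3
names(df)  # "name" "age" "postcode"
```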
The names() function can also be used to change the names of columns.
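Renaming works by assigning to names(), either for every column at once or for a single position. The replacement names in this sketch are arbitrary examples.

```r
# Illustrative data
df <- data.frame(
  name     = c("Ann", "Ben", "Cara", "Dev", "Eve"),
  age      = c(34, 27, 45, 52, 19),
  postcode = c("AB1 2CD", "EF3 4GH", "IJ5 6KL", "MN7 8OP", "QR9 0ST")
)

# Rename every column at once
names(df) <- c("full_name", "age_years", "post_code")

# Rename only the second column
names(df)[2] <- "age"

names(df)  # "full_name" "age" "post_code"
```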
Next steps
data.frames are one of the most useful data structures in R, and the fact that this functionality is built into the language is one of the reasons that R is an excellent choice for statistics, analysis, and data science.
To put some of the key points above into practice, try the following tasks.
Try creating a data.frame named “my_df” with 3 columns named “col_a”, “col_b”, and “col_c”.
“col_a” should contain the first 10 letters of the alphabet as individual elements.
“col_b” should contain the numbers 1 to 10.
“col_c” should contain the numbers 11 to 20, but in reverse.
Your resulting data.frame should look like this:
1. What is the product of the sums of col_b and col_c?
2. What is the sum of all values in col_b and col_c?
3. What is the sum of all values in col_b and col_c, but considering only the first 5 rows of my_df?
Answers
1. 8525
2. 210
3. 105
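One possible way to arrive at these answers is sketched below. It assumes lower-case letters for col_a (the task does not specify the case) and is only one of several equivalent approaches.

```r
# Build the data.frame described in the task
my_df <- data.frame(
  col_a = letters[1:10],  # assumption: lower-case letters
  col_b = 1:10,
  col_c = 20:11           # 11 to 20 in reverse
)

# 1. Product of the two column sums (55 * 155)
sum(my_df$col_b) * sum(my_df$col_c)            # 8525

# 2. Sum of all values in both columns (55 + 155)
sum(my_df$col_b) + sum(my_df$col_c)            # 210

# 3. The same sum, restricted to the first 5 rows (15 + 90)
sum(my_df$col_b[1:5]) + sum(my_df$col_c[1:5])  # 105
```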