In statistics we often talk about categorical variables. These are variables with a typically limited and potentially fixed number of possible values. An observation of a categorical variable is based on a qualitative property.
Some common examples of categorical variables are:
- Eye colour (“blue”, “brown”, “green”, etc).
- Rating a service (“good”, “ok”, “bad”).
- Blood group (“A”, “AB”, “O”, etc).
- A numeric value, but where options are limited, for example the result of a die being rolled (1 to 6).
R was primarily designed with statistical applications in mind and the existence of the factor class is a direct result of this. R factors are intended for use with categorical variables.
Using factors in R
Creating factors
The primary function we will need to work with factors in R is `factor()`.
Let’s say we asked 5 customers to rate their experience of a service and stored their answers in a character vector named `satisfaction`. The customers could choose one of 5 values to rate their experience:
- very good
- good
- ok
- bad
- very bad
We can convert `satisfaction` to a factor with the `factor()` function. To test whether a vector is a factor we use `is.factor()`.
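As a minimal sketch of the steps above, the snippet below builds the character vector, converts it, and checks the result. The five responses are hypothetical, since the text does not list the actual answers, and they are stored with underscores (for example `very_good`) to match the level names used later on.

```r
# Hypothetical survey responses; the real answers are not given in the text.
satisfaction <- c("good", "ok", "very_good", "bad", "good")

# Before conversion this is a plain character vector.
was_factor <- is.factor(satisfaction)
print(was_factor)

# Convert to a factor and test again.
satisfaction <- factor(satisfaction)
print(is.factor(satisfaction))
```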
Of course, we don’t need to create the character vector first and then convert it to a factor; we can also create the factor directly.
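A one-step version might look like the following, again using hypothetical responses rather than the original data:

```r
# Create the factor directly from (assumed) survey responses.
satisfaction <- factor(c("good", "ok", "very_good", "bad", "good"))

print(is.factor(satisfaction))
```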
Information: If we use `typeof()` to check the type of our vector, we can see that it is an integer (surprise!).
typeof(satisfaction)
#' [1] "integer"
Whilst factors may appear to be strings based upon the elements that they contain, they also have a numerical representation, which is why `typeof(satisfaction)` returns `integer`. This is important, as it can lead to errors if you haven't realised that factors are being used. This was a common cause of frustration for R users and was addressed by the R Core team: as of R version 4.0.0, many functions which used to convert strings to factors by default ceased to do so, including the `data.frame()` function.
Those using a version of R prior to the 4.0.0 release should take care to pass the `stringsAsFactors = FALSE` argument inside relevant functions if factors are not desired. Alternatively, the behaviour can be turned off globally during a session by running `options(stringsAsFactors = FALSE)`.
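To illustrate, the sketch below builds a small data frame with the argument set each way. The data frame and column names are made up for this example. Setting `stringsAsFactors` explicitly gives the same result on any R version, whereas omitting it gives a factor column before 4.0.0 and a character column from 4.0.0 onwards.

```r
# Hypothetical ratings used only for illustration.
ratings <- c("good", "ok", "bad")

# Explicitly keep strings as character vectors.
df_chr <- data.frame(rating = ratings, stringsAsFactors = FALSE)

# Explicitly request conversion to factors.
df_fct <- data.frame(rating = ratings, stringsAsFactors = TRUE)

print(class(df_chr$rating))
print(class(df_fct$rating))
```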
Levels
If we use the `print()` function, both the elements of our vector and the levels that have been automatically detected are printed to the console. We can also use the `levels()` function to return the levels specifically. Levels are the unique elements that the factor contains, or might contain.
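For example, with the hypothetical responses assumed throughout this walkthrough, the two calls could be run as follows:

```r
# Assumed responses; not the original survey data.
satisfaction <- factor(c("good", "ok", "very_good", "bad", "good"))

# Prints the elements followed by a "Levels:" line.
print(satisfaction)

# Returns only the levels, as a character vector.
detected_levels <- levels(satisfaction)
print(detected_levels)
```

By default the levels are the sorted unique values of the input, so here there are four levels even though there are five observations.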
For simpler categorical variables the levels which have been detected by default may be sufficient. However, as we haven’t specified an order for our factor, if we try to sort it the result won’t be very meaningful.
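A quick sketch of sorting the unordered factor, using the same assumed responses:

```r
# Assumed responses; not the original survey data.
satisfaction <- factor(c("good", "ok", "very_good", "bad", "good"))

# Sorting follows the default level order, not the meaning of the ratings.
sorted_satisfaction <- sort(satisfaction)
print(sorted_satisfaction)
```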
R is not able to identify an order for our variable automatically, so `sort()` has simply returned the elements of `satisfaction` in alphabetical order, and functions like `min()` and `max()` will return errors if we try them.
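The failure can be seen without stopping a script by wrapping the calls in `tryCatch()`. This is only a sketch with assumed data, and the wrapper is a convenience added here for illustration.

```r
# Assumed responses; not the original survey data.
satisfaction <- factor(c("good", "ok", "very_good", "bad", "good"))

# min() and max() are not defined for unordered factors, so each call fails.
# tryCatch() captures the error message instead of halting the script.
min_error <- tryCatch(min(satisfaction), error = function(e) conditionMessage(e))
max_error <- tryCatch(max(satisfaction), error = function(e) conditionMessage(e))

print(min_error)
print(max_error)
```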
This behaviour is likely preferable when dealing with a categorical variable with no implied order, for example hair colour, but for our customer satisfaction ratings we want our ratings to be ordered from worst to best. Let’s specify an order of:
very_bad < bad < ok < good < very_good
We can use the `factor()` function and specify our preferred order by passing a vector to the `levels` argument and setting the `ordered` argument to `TRUE`. You will notice that we are able to specify `levels` that don’t actually exist in our data, ‘very_bad’ for example. None of our 5 respondents actually rated the service as being ‘very bad’, but it was a valid option and therefore can be included as a level.
We can now sort `satisfaction`.
We can also use functions like `min()` and `max()` with our vector.
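Putting sorting and the summary functions together on an ordered factor, again with assumed responses:

```r
# Assumed responses; not the original survey data.
satisfaction <- factor(
  c("good", "ok", "very_good", "bad", "good"),
  levels = c("very_bad", "bad", "ok", "good", "very_good"),
  ordered = TRUE
)

# Sorting now follows the specified level order, worst to best.
sorted_satisfaction <- sort(satisfaction)
print(sorted_satisfaction)

# min() and max() work on ordered factors.
lowest <- min(satisfaction)
highest <- max(satisfaction)
print(lowest)
print(highest)
```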
Reverse a factor
The `forcats` package contains the function `fct_rev()`, which deserves an honourable mention here: it allows us to quickly reverse the order of our factor’s levels.
`fct_rev()` can be particularly useful when using factors to order the axes of plots.
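The sketch below reverses the levels in two ways, using assumed responses. The base R version is included as an alternative for when `forcats` is not installed, and the `requireNamespace()` guard is an addition for this example so that the script still runs without the package.

```r
# Assumed responses; not the original survey data.
satisfaction <- factor(
  c("good", "ok", "very_good", "bad", "good"),
  levels = c("very_bad", "bad", "ok", "good", "very_good"),
  ordered = TRUE
)

# Base R: rebuild the factor with the levels reversed.
reversed <- factor(satisfaction, levels = rev(levels(satisfaction)))
print(levels(reversed))

# forcats: fct_rev() does the same in a single call, if the package is available.
if (requireNamespace("forcats", quietly = TRUE)) {
  reversed_fct <- forcats::fct_rev(satisfaction)
  print(levels(reversed_fct))
}
```

Only the level order changes; the observations themselves stay in their original positions.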
Conversions
Converting our factor, `satisfaction`, to a character vector is straightforward using the `as.character()` function. Note that converting a factor removes the levels.
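For instance, with the assumed responses used in this walkthrough:

```r
# Assumed responses; not the original survey data.
satisfaction <- factor(
  c("good", "ok", "very_good", "bad", "good"),
  levels = c("very_bad", "bad", "ok", "good", "very_good"),
  ordered = TRUE
)

# Convert back to a plain character vector.
satisfaction_chr <- as.character(satisfaction)
print(satisfaction_chr)

# The result no longer carries any levels.
print(levels(satisfaction_chr))
```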
Sometimes you may have a factor where the elements represent something numeric by nature, but that could still be considered a categorical variable, for example rolls of a die. Let’s create a factor with 5 observations of dice rolls.
We can convert `rolls` to a numeric vector.
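A sketch of both steps follows. The five roll values are made up for illustration, and the levels are set explicitly to all six faces of the die.

```r
# Hypothetical dice rolls, with every possible face listed as a level.
rolls <- factor(c(5, 3, 6, 6, 1), levels = 1:6)
print(rolls)

# Convert the factor to a numeric vector.
rolls_num <- as.numeric(rolls)
print(rolls_num)
```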
However, care must be taken when using `as.numeric()` with a factor. In the example above, the result appears correct, but this is only because our levels matched up to our vector elements. Let’s try the same thing, but without specifying the levels explicitly.
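With the same hypothetical rolls but no explicit levels, the conversion might look like this:

```r
# Hypothetical dice rolls; levels are detected from the data this time.
rolls <- factor(c(5, 3, 6, 6, 1))

# Only the values that occur become levels.
print(levels(rolls))

# as.numeric() returns the internal integer codes, not the original values.
rolls_codes <- as.numeric(rolls)
print(rolls_codes)
```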
Rather than returning the elements of `rolls`, converted to numeric type, we have instead received the index position of each element’s level. This is a simple issue to resolve, as we can convert to a character vector first and then to numeric with `as.numeric(as.character(rolls))`, but it represents another reason why caution should be taken when working with factors.
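Applied to the hypothetical rolls, the two-step conversion recovers the original values:

```r
# Hypothetical dice rolls with automatically detected levels.
rolls <- factor(c(5, 3, 6, 6, 1))

# Go via character so the labels, not the level positions, are converted.
rolls_values <- as.numeric(as.character(rolls))
print(rolls_values)
```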
Next steps
Factors are a unique and useful structure that is highly effective when working with categorical variables. However, there are some more complicated aspects to their usage. Try converting some simple vectors to factors and ordering them, before transforming them to another type.