R is often used interactively, and as users grow in confidence they often find it simpler and more convenient to explore new data directly in R rather than opening it in another application. A good IDE such as RStudio makes working this way easier. Exploring data before analysis or further processing helps you make informed decisions about how to structure a project and identify which techniques are likely to work well with the data you have.
Getting an overview
The built-in iris dataset was used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper.
If we know very little about a dataset, the head() and tail() functions print the first or last six rows of a data.frame to the console by default. The n argument lets us specify a different number of rows.
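As a minimal sketch using the built-in iris dataset, the calls below show the default behaviour and the n argument. The names first_rows and last_rows are only illustrative.

```r
# First six rows of iris (the default)
head(iris)

# Last six rows of iris (the default)
tail(iris)

# Use n to ask for a different number of rows
first_rows <- head(iris, n = 3)
last_rows <- tail(iris, n = 10)

nrow(first_rows)
nrow(last_rows)
```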
The typeof() function is useful for identifying the underlying storage type of a column.
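For instance, typeof() can be run on individual columns of iris, or on every column at once with sapply(). The sapply() call and the name column_types are additions for illustration.

```r
# A numeric measurement column
typeof(iris$Sepal.Length)

# The Species column, which looks like text when printed
typeof(iris$Species)

# Check the storage type of every column in one go
column_types <- sapply(iris, typeof)
column_types
```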
The output of typeof(iris$Species)
may have surprised you. Whilst you may have expected the return "character"
, we got "integer"
. The Species
column is actually a factor class object. We can use class()
to check an objects class, and is.factor()
if we wish to check for a factor specifically.
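A short sketch of both checks on the iris columns follows. The levels() call is an extra, included only to show the category labels behind the factor.

```r
# The class of the Species column
class(iris$Species)

# Test specifically for a factor
is.factor(iris$Species)
is.factor(iris$Sepal.Length)

# The category labels that the integer codes refer to
levels(iris$Species)
```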
A note on factors: R was primarily designed with statistical applications in mind, and the existence of the factor class is a direct result of this. R factors are intended for use with categorical variables, such as eye colour. Whilst factors can appear to be strings, they are stored as integer codes that index a set of levels, which is why typeof(iris$Species) returns "integer".
Factors can be useful when you want to apply an order to a categorical vector. For example, I could ask some people to rate this explanation of factors as “good”, “bad”, or “average” and record the responses as a character vector.
There is no inherent order to the vector from R’s perspective, so if we apply the sort() function it will sort the elements alphabetically.
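To illustrate, here is a sketch with made-up survey responses. The vector contents and the names responses and sorted_responses are hypothetical.

```r
# Hypothetical survey responses stored as a plain character vector
responses <- c("good", "bad", "average", "good", "average", "good")

# sort() orders character vectors alphabetically
sorted_responses <- sort(responses)
sorted_responses
```

Alphabetical order places "average" before "bad", which is not the ranking we have in mind.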
However, we can use the factor class in R to add a custom order to the vector. To do so we use the factor() function and specify our preferred order with the levels argument.
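A minimal sketch of this, again using hypothetical responses, might look like the following. The name responses_factor is illustrative.

```r
# Hypothetical survey responses
responses <- c("good", "bad", "average", "good", "average", "good")

# Convert to a factor, listing the levels in our preferred order
responses_factor <- factor(responses, levels = c("bad", "average", "good"))

# sort() now follows the level order rather than the alphabet
sort(responses_factor)

# The integer codes behind each element (position within levels)
as.integer(responses_factor)
```

Sorting now places "bad" first and "good" last, because each element is ordered by its position in levels.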
Caution: Factors can be useful, but they can also lead to errors if you haven't realised that they are being used. This common cause of frustration was addressed by the R Core team: as of R version 4.0.0, many functions which used to convert strings to factors by default ceased to do so, including the data.frame() function.
Those using a version of R prior to the 4.0.0 release should take care to pass the stringsAsFactors = FALSE argument to relevant functions if factors are not desired. Alternatively, the behaviour can be turned off globally for a session by running options(stringsAsFactors = FALSE).
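As a sketch, passing the argument explicitly to data.frame() keeps string columns as character vectors regardless of the R version in use. The data frame people and its contents are invented for illustration.

```r
# Hypothetical data; stringsAsFactors = FALSE keeps strings as character
people <- data.frame(
  name = c("Ana", "Ben", "Cal"),
  eye_colour = c("brown", "blue", "green"),
  stringsAsFactors = FALSE
)

# Confirm the class of each column
sapply(people, class)
```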
Taking a closer look
Once you have an idea of the general structure and contents of a dataset it can be useful to learn more about specific columns. The summary() function is particularly useful for taking a closer look at a dataset and can handle columns of different types.
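For example, summary() can be applied to the whole iris data frame or to single columns. The name species_counts is illustrative.

```r
# Summarise every column; numeric columns get quartiles and the mean,
# while the factor column gets a count per level
summary(iris)

# Summarise a single numeric column
summary(iris$Sepal.Length)

# Summarise a factor column: counts per species
species_counts <- summary(iris$Species)
species_counts
```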
fivenum() is another useful function which quickly provides Tukey’s five-number summary (minimum, lower hinge, median, upper hinge, maximum) for a numeric vector.
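A quick sketch on one of the iris measurement columns follows. The name sepal_summary is illustrative.

```r
# Five-number summary of sepal length:
# minimum, lower hinge, median, upper hinge, maximum
sepal_summary <- fivenum(iris$Sepal.Length)
sepal_summary
```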
Other helpful functions to try include min(), max(), range(), mean(), and median().
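These can be tried on any numeric column, as in the sketch below. The variable petal_length is illustrative, and the final line is an added reminder that these functions accept na.rm = TRUE when a vector contains missing values.

```r
petal_length <- iris$Petal.Length

min(petal_length)     # smallest value
max(petal_length)     # largest value
range(petal_length)   # smallest and largest together
mean(petal_length)    # arithmetic mean
median(petal_length)  # middle value

# With missing values, pass na.rm = TRUE to ignore them
mean(c(1, NA, 3), na.rm = TRUE)
```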
Next steps
You can view a full list of the built-in R datasets by running library(help = "datasets"). Try running some of the functions discussed above on different datasets.
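As one possible starting point, the sketch below repeats a few of the steps on mtcars, another dataset from the same package. The choice of dataset and column is arbitrary.

```r
# Peek at the first few rows
head(mtcars, n = 3)

# Storage type of each column
sapply(mtcars, typeof)

# Closer look at a single column
summary(mtcars$mpg)
fivenum(mtcars$mpg)
```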