iris |>
tidyr::pivot_longer(
cols = where(is.numeric), # using `where()` w/out calling dplyr / tidyselect
names_to = 'Variable',
values_to = 'Measure'
)1 Introduction
Have you ever wondered how where(), starts_with(), and other selection helpers work seamlessly across different tidyverse packages? I recently discovered something surprising: you can actually use these functions in dplyr, tidyr, and other packages that invokes <tidy-select> API, without explicitly loading them.
Here’s how it works:
Take note that I never load tidyselect and dplyr (the where() function in dplyr is just one of many re-exports). Yet, where() works perfectly. It doesn’t belong to / re-exported by tidyr, but you can use where(), if and only if the functions is invoking <tidy-select> API.
2 What Are These Functions Called?
These are officially called tidyselect helpers (or “selection language”). They’re part of the tidyselect package, which provides a domain-specific language (DSL) for selecting columns in data frames.
You might also hear them referred to as:
- Selection helper functions
-
<Tidy-select>helpers - Column selection helpers
3 The Complete Family of Selection Helpers
The tidyselect package can be divided into 3 categories of helpers.
-
Columns starting with a prefix
-
Columns ending with a suffix
-
Columns containing a literal string
-
Columns matching a regular expression
-
Columns following the number pattern
The where() function is similar to SQL WHERE, except it is functional that (should) returns a Boolean value that satisfies the condition.
These are functions that select columns based on their position in the data frame
-
This is equivalent to
relocate(iris, Species):
Offset: 2nd to the last
Offset: Multiple columns from the end
Noticed that I invoked most of <tidy-select> helpers, but never loaded dplyr or tidyselect, not once, just to use them.
4 “Data-Masking” Subset
Just like those <tidy-select> helpers, some functions found in dplyr, but doesn’t in tidyselect. These are all the functions that can be used within “data-masking” functions, such as dplyr::mutate() and dplyr::summarise(). Take a note of the term “within”, which means, you can’t use them outside from “data-masking” functions.
I call across(), if_any(), and if_all() as projection helpers because they correspond to the SELECT clause in SQL, except they both if_any(), and if_all() map over the selected columns and returns the Boolean vector, while the across() function modifies the selected columns. The pick() function, on the other hand serves as a complement of across() by extracting them as a data frame, however, this only applies to subset a data frame to be invoked within the operations in “data-masking” functions. All of them can make use of the <tidy-select> API, meaning you can apply selectors like starts_with() or everything() to specify which columns to project.
Here’s an example: Calculating mean and standard deviation
iris |>
dplyr::group_by(Species) |>
dplyr::summarise(
summary = list({
num = pick(where(is.numeric))
tibble::tibble(
vars = colnames(num),
mean = colMeans(num),
sd = apply(num, 2, sd)
)
})
) |>
tidyr::unnest(summary)I am aware there’s a better approach to calculate the mean and standard deviation of each column by group.
Here’s an example: Apply min-max normalization among numeric columns in iris dataset
iris |>
dplyr::as_tibble() |>
dplyr::mutate(
across(
where(is.numeric),
\(col) { col - min(col) } / { max(col) - min(col) }
)
)And once again, I never attach dplyr into the search path just to use across() and pick().
You can use across() in some dplyr “data-masking” function like filter(), but this is a deprecated behavior and attaching dplyr package is required.
Example: Removing all missing values across all columns in airquality data frame
If if_all() / if_any() is used outside filter(), those functions need dplyr package to be attached to use them.
Though, there are some exceptions: there are helper functions you actually need dplyr to be attached to use them, otherwise they don’t work and R will throw an error.
Here they are:
-
starwars |> group_by(species) |> reframe( species, name, hierarchical_id = sprintf("%02d-%03d", cur_group_id(), row_number()) # 👈 ) |> slice_min(hierarchical_id, n = 15) -
iris |> group_by(Species) |> slice_sample( n = 75, replace = TRUE ) |> summarise( m_sl = mean(Sepal.Length), n = {length(cur_group_rows()) + 30} # 👈 ) -
iris |> as_tibble() |> transmute( across( where(is.numeric), \(col) { if (stringr::str_detect(cur_column(), "Sepal")) { # 👈 col - mean(col) } else if (stringr::str_detect(cur_column(), "Petal")) { # 👈 (col - mean(col)) / sd(col) } else { col } } ) )
5 Conclusion
I hope they don’t change this soon, it is quite a nice feature (definitely not a bug 😋), assembling the DSL strengths across tidyverse APIs. Even if it is subtle. I still suggest you to attach these functions (through e.g. library() and box::use()) for better maintainability.