1 Introduction

Have you ever wondered how where(), starts_with(), and other selection helpers work seamlessly across different tidyverse packages? I recently discovered something surprising: you can actually use these functions in dplyr, tidyr, and other packages that invokes <tidy-select> API, without explicitly loading them.

Here’s how it works:

iris |> 
    tidyr::pivot_longer(
        cols = where(is.numeric), # using `where()` w/out calling dplyr / tidyselect
        names_to = 'Variable',
        values_to = 'Measure'
    )

Take note that I never load tidyselect and dplyr (the where() function in dplyr is just one of many re-exports). Yet, where() works perfectly. It doesn’t belong to / re-exported by tidyr, but you can use where(), if and only if the functions is invoking <tidy-select> API.

2 What Are These Functions Called?

These are officially called tidyselect helpers (or “selection language”). They’re part of the tidyselect package, which provides a domain-specific language (DSL) for selecting columns in data frames.

You might also hear them referred to as:

Selection helper functions
<Tidy-select> helpers
Column selection helpers

3 The Complete Family of Selection Helpers

The tidyselect package can be divided into 3 categories of helpers.

Columns starting with a prefix
```
iris |> 
    dplyr::select(starts_with("Sepal")) |> 
    head(3)
```
Columns ending with a suffix
```
iris |> 
    dplyr::select(ends_with("Width")) |> 
    head(3)
```
Columns containing a literal string
```
iris |> 
    dplyr::select(contains("al")) |> 
    head(3)
```
Columns matching a regular expression
```
iris |> 
    dplyr::select(matches("^Sepal")) |> 
    head(3)
```
Columns following the number pattern
```
iris |> 
    dplyr::select(num_range('x', 1:4)) |> 
    head(3)
```

The where() function is similar to SQL WHERE, except it is functional that (should) returns a Boolean value that satisfies the condition.

iris |> 
    dplyr::select(where(is.numeric)) |> 
    head(3)

iris |> 
    dplyr::select(where(is.factor)) |> 
    head(3)

iris |> 
    dplyr::select(where(\(col) is.numeric(col) && mean(col) > 3.5))

These are functions that select columns based on their position in the data frame

everything()
```
iris |> 
    dplyr::select(everything()) |> 
    head(3)
```
This is equivalent to relocate(iris, Species):
```
iris |> 
    dplyr::select(Species, everything()) |> 
    head(3)
```
last_col()

iris |> 
    dplyr::select(last_col()) |> 
    head(3)

Offset: 2nd to the last

iris |> 
    dplyr::select(last_col(1)) |> 
    head(3)

Offset: Multiple columns from the end

iris |> 
    dplyr::select(last_col(2):last_col()) |> 
    head(3)

Noticed that I invoked most of <tidy-select> helpers, but never loaded dplyr or tidyselect, not once, just to use them.

4 “Data-Masking” Subset

Just like those <tidy-select> helpers, some functions found in dplyr, but doesn’t in tidyselect. These are all the functions that can be used within “data-masking” functions, such as dplyr::mutate() and dplyr::summarise(). Take a note of the term “within”, which means, you can’t use them outside from “data-masking” functions.

I call across(), if_any(), and if_all() as projection helpers because they correspond to the SELECT clause in SQL, except they both if_any(), and if_all() map over the selected columns and returns the Boolean vector, while the across() function modifies the selected columns. The pick() function, on the other hand serves as a complement of across() by extracting them as a data frame, however, this only applies to subset a data frame to be invoked within the operations in “data-masking” functions. All of them can make use of the <tidy-select> API, meaning you can apply selectors like starts_with() or everything() to specify which columns to project.

Here’s an example: Calculating mean and standard deviation

iris |> 
    dplyr::group_by(Species) |> 
    dplyr::summarise(
        summary = list({
            num = pick(where(is.numeric))
            tibble::tibble(
                vars = colnames(num), 
                mean = colMeans(num),
                sd = apply(num, 2, sd)
            )
        })
    ) |> 
    tidyr::unnest(summary)

I am aware there’s a better approach to calculate the mean and standard deviation of each column by group.

Here’s an example: Apply min-max normalization among numeric columns in iris dataset

iris |> 
    dplyr::as_tibble() |> 
    dplyr::mutate(
        across(
            where(is.numeric), 
            \(col) { col - min(col) } / { max(col) - min(col) }
        )
    )

And once again, I never attach dplyr into the search path just to use across() and pick().

You can use across() in some dplyr “data-masking” function like filter(), but this is a deprecated behavior and attaching dplyr package is required.

Example: Removing all missing values across all columns in airquality data frame

airquality |> 
    dplyr::filter(if_all(everything(), \(col) !is.na(col))) |> 
    head(5)

If if_all() / if_any() is used outside filter(), those functions need dplyr package to be attached to use them.

Though, there are some exceptions: there are helper functions you actually need dplyr to be attached to use them, otherwise they don’t work and R will throw an error.

Here they are:

library(dplyr)

n()
```
iris |>
    group_by(Species) |> 
    slice_max(n = 20, order_by = Sepal.Length) |> 
    summarise(
        count = n(), # 👈 
        m_sl = mean(Sepal.Length)
    )
```
cur_group()
```
mtcars |>
    group_by(cyl) |>
    reframe({
        model = lm(mpg ~ wt, data = cur_group()) # 👈 
        coefs = coef(model)

        tibble(
            terms = names(coefs), 
            estimate = coefs
        )
    })
```
cur_group_id()
```
starwars |>
    group_by(species) |>
    reframe(
        species, 
        name, 
        hierarchical_id = sprintf("%02d-%03d", cur_group_id(), row_number()) # 👈 
    ) |> 
    slice_min(hierarchical_id, n = 15)
```
cur_group_rows()
```
iris |> 
    group_by(Species) |> 
    slice_sample(
        n = 75, replace = TRUE
    ) |> 
    summarise(
        m_sl = mean(Sepal.Length),
        n = {length(cur_group_rows()) + 30} # 👈 
    )
```
cur_column()
```
iris |> 
    as_tibble() |> 
    transmute(
        across(
            where(is.numeric),
            \(col) {
                if (stringr::str_detect(cur_column(), "Sepal")) { # 👈 
                    col - mean(col)
                } else if (stringr::str_detect(cur_column(), "Petal")) { # 👈 
                    (col - mean(col)) / sd(col)
                } else {
                    col
                }
            }
        )
    )
```

5 Conclusion

I hope they don’t change this soon, it is quite a nice feature (definitely not a bug 😋), assembling the DSL strengths across tidyverse APIs. Even if it is subtle. I still suggest you to attach these functions (through e.g. library() and box::use()) for better maintainability.