% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/topic-nse.R
\name{topic-data-mask-programming}
\alias{topic-data-mask-programming}
\title{Data mask programming patterns}
\description{
\link[=topic-data-mask]{Data-masking} functions require special programming patterns when used inside other functions. In this topic we'll review and compare the different patterns that can be used to solve specific problems.
If you are a beginner, you might want to start with one of these tutorials:
\itemize{
\item \href{https://dplyr.tidyverse.org/articles/programming.html}{Programming with dplyr}
\item \href{https://ggplot2.tidyverse.org/articles/ggplot2-in-packages.html}{Using ggplot2 in packages}
}
If you'd like to go further and learn about defusing and injecting expressions, read the \link[=topic-metaprogramming]{metaprogramming patterns topic}.
}
\section{Choosing a pattern}{
Two main considerations determine which programming pattern you need to wrap a data-masking function:
\enumerate{
\item What behaviour does the \emph{wrapped} function implement?
\item What behaviour should \emph{your} function implement?
}
Depending on the answers to these questions, you can choose between these approaches:
\itemize{
\item The \strong{forwarding patterns}, with which your function inherits the behaviour of the function it interfaces with.
\item The \strong{names patterns}, with which your function takes strings or character vectors of column names.
\item The \strong{bridge patterns}, with which you change the behaviour of an argument instead of inheriting it.
}
You will also need to use different solutions for single named arguments than for multiple arguments in \code{...}.
}
\section{Argument behaviours}{
In a regular function, arguments can be defined in terms of the \emph{type} of object that they accept. An argument might accept a character vector, a data frame, a single logical value, etc. Data-masked arguments are more complex. Not only do they generally accept a specific type of object (for instance \code{dplyr::mutate()} accepts vectors), they also exhibit special computational behaviours.
\itemize{
\item Data-masked expressions (base): E.g. \code{\link[=transform]{transform()}}, \code{\link[=with]{with()}}. Expressions may refer to the columns of the supplied data frame.
\item Data-masked expressions (tidy eval): E.g. \code{dplyr::mutate()}, \code{ggplot2::aes()}. Same as base data-masking but with tidy eval features enabled. This includes \link[=topic-inject]{injection operators} such as \ifelse{html}{\code{\link[=embrace-operator]{\{\{}}}{\verb{\{\{}} and \code{\link[=injection-operator]{!!}} and the \code{\link{.data}} and \code{\link{.env}} pronouns.
\item Data-masked symbols: Same as data-masked expressions, but the supplied expressions must be simple column names. This often simplifies things. For instance, it is an easy way of avoiding issues of \link[=topic-double-evaluation]{double evaluation}.
\item \href{https://tidyselect.r-lib.org/reference/language.html}{Tidy selections}: E.g. \code{dplyr::select()}, \code{tidyr::pivot_longer()}. This is an alternative to data masking that supports selection helpers like \code{starts_with()} or \code{all_of()}, and implements special behaviour for operators like \code{c()}, \code{|} and \code{&}.
Unlike data masking, tidy selection is an interpreted dialect. There is in fact no masking at all. Expressions are either interpreted in the context of the data frame (e.g. \code{c(cyl, am)} which stands for the union of the columns \code{cyl} and \code{am}), or evaluated in the user environment (e.g. \code{all_of()}, \code{starts_with()}, and any other expressions). This has implications for inheritance of argument behaviour as we will see below.
\item \link[=doc_dots_dynamic]{Dynamic dots}: These may be data-masked arguments, tidy selections, or just regular arguments. Dynamic dots support injection of multiple arguments with the \code{\link[=splice-operator]{!!!}} operator as well as name injection with \link[=glue-operators]{glue} operators.
}
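As a rough illustration of these behaviours, the sketch below contrasts base data masking, the tidy eval pronouns, and dynamic dots. It assumes that dplyr is installed. The names \code{exprs} and \code{name} are made up for the example.
\if{html}{\out{}}\preformatted{library(dplyr)

# Base data masking: `cyl` is looked up in `mtcars`
with(mtcars, mean(cyl))

# Tidy eval data masking: the pronouns disambiguate between a column
# and an env-variable of the same name
cyl <- 100
summarise(mtcars, data = mean(.data$cyl), env = mean(.env$cyl))

# Dynamic dots: splice a list of expressions with `!!!`
exprs <- list(double_cyl = quote(cyl * 2), half_am = quote(am / 2))
mutate(mtcars, !!!exprs)

# Dynamic dots: inject a name with glue syntax
name <- "double_cyl"
mutate(mtcars, "\{name\}" := cyl * 2)
}\if{html}{\out{
}}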
To let users know about the capabilities of your function arguments, document them with the following tags, depending on which set of semantics they inherit from:
\if{html}{\out{
}}\preformatted{@param foo <[`data-masked`][dplyr::dplyr_data_masking]> What `foo` does.
@param bar <[`tidy-select`][dplyr::dplyr_tidy_select]> What `bar` does.
@param ... <[`dynamic-dots`][rlang::dyn-dots]> What these dots do.
}\if{html}{\out{
}}
}
\section{Forwarding patterns}{
With the forwarding patterns, arguments inherit the behaviour of the data-masked or tidy-selection arguments they are passed to.
\subsection{Embrace with \verb{\{\{}}{
The embrace operator \ifelse{html}{\code{\link[=embrace-operator]{\{\{}}}{\verb{\{\{}} is a forwarding syntax for single arguments. You can forward an argument in a data-masked context:
\if{html}{\out{}}\preformatted{my_summarise <- function(data, var) \{
data \%>\% dplyr::summarise(\{\{ var \}\})
\}
}\if{html}{\out{
}}
Or in a tidy selection:
\if{html}{\out{}}\preformatted{my_pivot_longer <- function(data, var) \{
data \%>\% tidyr::pivot_longer(cols = \{\{ var \}\})
\}
}\if{html}{\out{
}}
The function automatically inherits the behaviour of the surrounding context. For instance arguments forwarded to a data-masked context may refer to columns or use the \code{\link{.data}} pronoun:
\if{html}{\out{}}\preformatted{mtcars \%>\% my_summarise(mean(cyl))
x <- "cyl"
mtcars \%>\% my_summarise(mean(.data[[x]]))
}\if{html}{\out{
}}
And arguments forwarded to a tidy selection may use all tidyselect features:
\if{html}{\out{}}\preformatted{mtcars \%>\% my_pivot_longer(cyl)
mtcars \%>\% my_pivot_longer(vs:gear)
mtcars \%>\% my_pivot_longer(starts_with("c"))
x <- c("cyl", "am")
mtcars \%>\% my_pivot_longer(all_of(x))
}\if{html}{\out{
}}
}
\subsection{Forward \code{...}}{
Simple forwarding of \code{...} arguments does not require any special syntax since dots are already a forwarding syntax. Just pass them to another function like you normally would. This works with data-masked arguments:
\if{html}{\out{}}\preformatted{my_group_by <- function(.data, ...) \{
.data \%>\% dplyr::group_by(...)
\}
mtcars \%>\% my_group_by(cyl = cyl * 100, am)
}\if{html}{\out{
}}
As well as tidy selections:
\if{html}{\out{}}\preformatted{my_select <- function(.data, ...) \{
.data \%>\% dplyr::select(...)
\}
mtcars \%>\% my_select(starts_with("c"), vs:carb)
}\if{html}{\out{
}}
Some functions take a tidy selection in a single named argument. In that case, pass the \code{...} inside \code{c()}:
\if{html}{\out{}}\preformatted{my_pivot_longer <- function(.data, ...) \{
.data \%>\% tidyr::pivot_longer(c(...))
\}
mtcars \%>\% my_pivot_longer(starts_with("c"), vs:carb)
}\if{html}{\out{
}}
Inside a tidy selection, \code{c()} is not a vector concatenator but a selection combinator. This makes it handy to interface between functions that take \code{...} and functions that take a single argument.
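A small sketch of this combinator behaviour, assuming dplyr is installed and using \code{dplyr::select()} as a stand-in for any tidy-selection function:
\if{html}{\out{}}\preformatted{library(dplyr)

# Supplying selections separately or combined in `c()` is equivalent
a <- select(mtcars, starts_with("c"), vs:carb)
b <- select(mtcars, c(starts_with("c"), vs:carb))
identical(a, b)

# Operators are interpreted as set operations on selections
select(mtcars, starts_with("c") & ends_with("b"))
}\if{html}{\out{
}}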
}
}
\section{Names patterns}{
With the names patterns you refer to columns by name with strings or character vectors stored in env-variables. Whereas the forwarding patterns are exclusively used within a function to pass \emph{arguments}, the names patterns can be used anywhere.
\itemize{
\item In a script, you can loop over a character vector with \code{for} or \code{lapply()} and use the \code{\link{.data}} pattern to connect a name to its data-variable. A vector can also be supplied all at once to the tidy select helper \code{all_of()}.
\item In a function, using the names patterns on function arguments lets users supply regular data-variable names without any of the complications that come with data-masking.
}
\subsection{Subsetting the \code{.data} pronoun}{
The \code{\link{.data}} pronoun is a tidy eval feature that is enabled in all data-masked arguments, just like \ifelse{html}{\code{\link[=embrace-operator]{\{\{}}}{\verb{\{\{}}. The pronoun represents the data mask and can be subsetted with \code{[[} and \code{$}. These three statements are equivalent:
\if{html}{\out{}}\preformatted{mtcars \%>\% dplyr::summarise(mean = mean(cyl))
mtcars \%>\% dplyr::summarise(mean = mean(.data$cyl))
var <- "cyl"
mtcars \%>\% dplyr::summarise(mean = mean(.data[[var]]))
}\if{html}{\out{
}}
The \code{.data} pronoun can be subsetted in loops:
\if{html}{\out{}}\preformatted{vars <- c("cyl", "am")
for (var in vars) print(dplyr::summarise(mtcars, mean = mean(.data[[var]])))
#> # A tibble: 1 x 1
#> mean
#>
#> 1 6.19
#> # A tibble: 1 x 1
#> mean
#>
#> 1 0.406
purrr::map(vars, ~ dplyr::summarise(mtcars, mean = mean(.data[[.x]])))
#> [[1]]
#> # A tibble: 1 x 1
#> mean
#>
#> 1 6.19
#>
#> [[2]]
#> # A tibble: 1 x 1
#> mean
#>
#> 1 0.406
}\if{html}{\out{
}}
And it can be used to connect function arguments to a data-variable:
\if{html}{\out{}}\preformatted{my_mean <- function(data, var) \{
data \%>\% dplyr::summarise(mean = mean(.data[[var]]))
\}
my_mean(mtcars, "cyl")
#> # A tibble: 1 x 1
#> mean
#>
#> 1 6.19
}\if{html}{\out{
}}
With this implementation, \code{my_mean()} is completely insulated from data-masking behaviour and is called like an ordinary function.
\if{html}{\out{}}\preformatted{# No masking
am <- "cyl"
my_mean(mtcars, am)
#> # A tibble: 1 x 1
#> mean
#>
#> 1 6.19
# Programmable
my_mean(mtcars, tolower("CYL"))
#> # A tibble: 1 x 1
#> mean
#>
#> 1 6.19
}\if{html}{\out{
}}
}
\subsection{Character vector of names}{
The \code{.data} pronoun can only be subsetted with single column names. It doesn't support single-bracket indexing:
\if{html}{\out{}}\preformatted{mtcars \%>\% dplyr::summarise(.data[c("cyl", "am")])
#> Error in `dplyr::summarise()`:
#> ! Can't compute `..1 = .data[c("cyl", "am")]`.
#> Caused by error in `.data[c("cyl", "am")]`:
#> ! `[` is not supported by the `.data` pronoun, use `[[` or $ instead.
}\if{html}{\out{
}}
There is no plural variant of \code{.data} built into tidy eval. Instead, we'll use the \code{all_of()} operator available in tidy selections to supply character vectors. This is straightforward in functions that take tidy selections, like \code{tidyr::pivot_longer()}:
\if{html}{\out{}}\preformatted{vars <- c("cyl", "am")
mtcars \%>\% tidyr::pivot_longer(all_of(vars))
#> # A tibble: 64 x 11
#> mpg disp hp drat wt qsec vs gear carb name value
#>
#> 1 21 160 110 3.9 2.62 16.5 0 4 4 cyl 6
#> 2 21 160 110 3.9 2.62 16.5 0 4 4 am 1
#> 3 21 160 110 3.9 2.88 17.0 0 4 4 cyl 6
#> 4 21 160 110 3.9 2.88 17.0 0 4 4 am 1
#> # ... with 60 more rows
}\if{html}{\out{
}}
If the function does not take a tidy selection, it might be possible to use a \emph{bridge pattern}. This option is presented in the bridge section below. If a bridge is impossible or inconvenient, a little metaprogramming with the \link[=topic-metaprogramming]{symbolise-and-inject pattern} can help.
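As a minimal sketch of what symbolise-and-inject can look like (see the linked topic for the full treatment), the names are first turned into symbols with \code{rlang::syms()} and then spliced into a data-masked call with \code{!!!}. The wrapper name \code{my_count()} is hypothetical, and \code{dplyr::count()} is used purely for illustration. It assumes dplyr and rlang are installed.
\if{html}{\out{}}\preformatted{library(dplyr)

my_count <- function(data, vars) \{
  # Transform the strings to symbols, then splice them as arguments
  data \%>\% count(!!!rlang::syms(vars))
\}

mtcars \%>\% my_count(c("cyl", "am"))
}\if{html}{\out{
}}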
}
}
\section{Bridge patterns}{
Sometimes the function you are calling does not implement the behaviour you would like to give to the arguments of your function. Working around this may require a little thought since there is no systematic way of turning one behaviour into another. The general technique consists of forwarding the arguments inside a context that implements the behaviour that you want, and then finding a way to bridge the result to the target verb or function.
\subsection{\code{across()} as a selection to data-mask bridge}{
dplyr 1.0 added support for tidy selections in all verbs via \code{across()}. This function is normally used for mapping over columns but can also be used to perform a simple selection. For instance, if you'd like to pass an argument to \code{group_by()} with a tidy-selection interface instead of a data-masked one, use \code{across()} as a bridge:
\if{html}{\out{}}\preformatted{my_group_by <- function(data, var) \{
data \%>\% dplyr::group_by(across(\{\{ var \}\}))
\}
mtcars \%>\% my_group_by(starts_with("c"))
}\if{html}{\out{
}}
Since \code{across()} takes selections in a single argument (unlike \code{select()} which takes multiple arguments), you can't directly pass \code{...}. Instead, wrap them in \code{c()}, which is the tidyselect way of supplying multiple selections within a single argument:
\if{html}{\out{}}\preformatted{my_group_by <- function(.data, ...) \{
.data \%>\% dplyr::group_by(across(c(...)))
\}
mtcars \%>\% my_group_by(starts_with("c"), vs:gear)
}\if{html}{\out{
}}
}
\subsection{\code{across(all_of())} as a names to data mask bridge}{
If instead of forwarding variables in \code{across()} you pass them to \code{all_of()}, you create a names to data mask bridge.
\if{html}{\out{}}\preformatted{my_group_by <- function(data, vars) \{
data \%>\% dplyr::group_by(across(all_of(vars)))
\}
mtcars \%>\% my_group_by(c("cyl", "am"))
}\if{html}{\out{
}}
Use this bridge technique to connect vectors of names to a data-masked context.
}
\subsection{\code{transmute()} as a data-mask to selection bridge}{
Passing data-masked arguments to a tidy selection is a little trickier and requires a three-step process.
\if{html}{\out{}}\preformatted{my_pivot_longer <- function(data, ...) \{
# Forward `...` in data-mask context with `transmute()`
# and save the inputs names
inputs <- dplyr::transmute(data, ...)
names <- names(inputs)
# Update the data with the inputs
data <- dplyr::mutate(data, !!!inputs)
# Select the inputs by name with `all_of()`
tidyr::pivot_longer(data, cols = all_of(names))
\}
mtcars \%>\% my_pivot_longer(cyl, am = am * 100)
}\if{html}{\out{
}}
\enumerate{
\item In a first step we pass the \code{...} expressions to \code{transmute()}. Unlike \code{mutate()}, it creates a new data frame from the user inputs. The only goal of this step is to inspect the names in \code{...}, including the default names created for unnamed arguments.
\item Once we have the names, we inject the arguments into \code{mutate()} to update the data frame.
\item Finally, we pass the names to the tidy selection via \href{https://tidyselect.r-lib.org/reference/all_of.html}{\code{all_of()}}.
}
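To see what the first step produces, you can inspect the names of the \code{transmute()} output directly. This small check assumes dplyr is installed:
\if{html}{\out{}}\preformatted{library(dplyr)

# Named and bare-column inputs keep their names
names(transmute(mtcars, cyl, am = am * 100))
#> [1] "cyl" "am"

# Unnamed expressions get a default name derived from the expression
names(transmute(mtcars, cyl * 2))
#> [1] "cyl * 2"
}\if{html}{\out{
}}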
}
}
\section{Transformation patterns}{
\subsection{Named inputs versus \code{...}}{
In the case of a named argument, transformation is easy. We simply surround the embraced input with R code. For instance, the \code{my_summarise()} function is not exactly useful compared to just calling \code{summarise()}:
\if{html}{\out{}}\preformatted{my_summarise <- function(data, var) \{
data \%>\% dplyr::summarise(\{\{ var \}\})
\}
}\if{html}{\out{
}}
We can make it more useful by adding code around the variable:
\if{html}{\out{}}\preformatted{my_mean <- function(data, var) \{
data \%>\% dplyr::summarise(mean = mean(\{\{ var \}\}, na.rm = TRUE))
\}
}\if{html}{\out{
}}
For inputs in \code{...} however, this technique does not work. We would need some kind of templating syntax for dots that lets us specify R code with a placeholder for the dots elements. This isn't built into tidy eval, but you can use operators like \code{dplyr::across()}, \code{dplyr::if_all()}, or \code{dplyr::if_any()}. When that isn't possible, you can template the expression manually.
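A minimal sketch of manual templating, assuming dplyr and rlang are installed. The inputs are defused with \code{rlang::enquos()}, each one is wrapped in a \code{mean()} call with \code{rlang::expr()} and \code{!!}, and the resulting expressions are spliced with \code{!!!}. The name \code{my_mean2()} is hypothetical. With this approach the dots keep their data-masking behaviour, so arbitrary expressions such as \code{carb * 10} can be supplied.
\if{html}{\out{}}\preformatted{library(dplyr)

my_mean2 <- function(data, ...) \{
  # Defuse the dots, giving default names to unnamed inputs
  vars <- rlang::enquos(..., .named = TRUE)

  # Wrap each input in a call to `mean()`
  vars <- lapply(vars, function(var) rlang::expr(mean(!!var, na.rm = TRUE)))

  # Inject the templated expressions
  data \%>\% summarise(!!!vars)
\}

mtcars \%>\% my_mean2(cyl, carb * 10)
}\if{html}{\out{
}}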
}
\subsection{Transforming inputs with \code{across()}}{
The \code{across()} operation in dplyr is a convenient way of mapping an expression across a set of inputs. We will create a variant of \code{my_mean()} that computes the \code{mean()} of all arguments supplied in \code{...}. The easiest way is to forward the dots to \code{across()} (which causes \code{...} to inherit its tidy selection behaviour):
\if{html}{\out{}}\preformatted{my_mean <- function(data, ...) \{
data \%>\% dplyr::summarise(across(c(...), ~ mean(.x, na.rm = TRUE)))
\}
mtcars \%>\% my_mean(cyl, carb)
#> # A tibble: 1 x 2
#> cyl carb
#>
#> 1 6.19 2.81
mtcars \%>\% my_mean(foo = cyl, bar = carb)
#> # A tibble: 1 x 2
#> foo bar
#>
#> 1 6.19 2.81
mtcars \%>\% my_mean(starts_with("c"), mpg:disp)
#> # A tibble: 1 x 4
#> cyl carb mpg disp
#>
#> 1 6.19 2.81 20.1 231.
}\if{html}{\out{
}}
}
\subsection{Transforming inputs with \code{if_all()} and \code{if_any()}}{
\code{dplyr::filter()} requires a different operation than \code{across()} because it needs to combine the logical expressions with \code{&} or \code{|}. To solve this problem dplyr introduced the \code{if_all()} and \code{if_any()} variants of \code{across()}.
In the following example, we keep only the rows in which every selected variable differs from its minimum value:
\if{html}{\out{}}\preformatted{filter_non_baseline <- function(.data, ...) \{
.data \%>\% dplyr::filter(if_all(c(...), ~ .x != min(.x, na.rm = TRUE)))
\}
mtcars \%>\% filter_non_baseline(vs, am, gear)
}\if{html}{\out{
}}
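For comparison, here is a sketch of an \code{if_any()} variant, which combines the conditions with \code{|} instead of \code{&}. The function name is hypothetical and dplyr is assumed to be installed:
\if{html}{\out{}}\preformatted{library(dplyr)

filter_any_non_baseline <- function(.data, ...) \{
  # Keep rows where at least one selected column differs from its minimum
  .data \%>\% filter(if_any(c(...), ~ .x != min(.x, na.rm = TRUE)))
\}

mtcars \%>\% filter_any_non_baseline(vs, am, gear)
}\if{html}{\out{
}}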
}
}
\keyword{internal}