Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't dictionary_encode a dataset column. `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* With the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets, and that's OK. For Tables/RecordBatches, this PR instead pulls the data into R, which is fine.
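
As a rough illustration of the wrapping idea (a sketch, not code taken from this PR): a Table is wrapped in an InMemoryDataset and then queried with dplyr verbs, so that filter/project go through the Dataset machinery. The column names and values are made up, and this assumes an arrow build with dataset support and dplyr installed.

```r
library(arrow)
library(dplyr)

# Hypothetical example data; any Table or RecordBatch would do
tab <- Table$create(x = 1:5, y = letters[1:5])

# Wrap the in-memory Table so it can be queried like any other Dataset
ds <- InMemoryDataset$create(tab)

# filter() and select() are recorded lazily and evaluated on collect()
result <- ds %>%
  filter(x > 2) %>%
  select(x) %>%
  collect()
```

Here `result` is an ordinary R data frame; nothing is evaluated until `collect()` is called.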

There are a lot of changes here, so the diff is big, but I've tried to group the main actions into distinct commits. Highlights:

* 5b501c5 is the main switch to use InMemoryDataset
* b31fb5e deletes `array_expression`
* 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way. I hope it makes adding new functions much simpler
* 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
* a0914f6 + eee491a contain ARROW-12696

Closes #10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

Package: arrow
Title: Integration to 'Apache' 'Arrow'
Version: 4.0.0.9000
Authors@R: c(
    person("Neal", "Richardson", email = "neal@ursalabs.org", role = c("aut", "cre")),
    person("Ian", "Cook", email = "ianmcook@gmail.com", role = c("aut")),
    person("Jonathan", "Keane", email = "jkeane@gmail.com", role = c("aut")),
    person("Romain", "Fran\u00e7ois", email = "romain@rstudio.com", role = c("aut"), comment = c(ORCID = "0000-0002-2444-4226")),
    person("Jeroen", "Ooms", email = "jeroen@berkeley.edu", role = c("aut")),
    person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),
    person("Jeffrey", "Wong", email = "jeffreyw@netflix.com", role = c("ctb")),
    person("Apache Arrow", email = "dev@arrow.apache.org", role = c("aut", "cph"))
  )
Description: 'Apache' 'Arrow' <https://arrow.apache.org/> is a cross-language
    development platform for in-memory data. It specifies a standardized
    language-independent columnar memory format for flat and hierarchical data,
    organized for efficient analytic operations on modern hardware. This
    package provides an interface to the 'Arrow C++' library.
Depends: R (>= 3.3)
License: Apache License (>= 2.0)
URL: https://github.com/apache/arrow/, https://arrow.apache.org/docs/r/
BugReports: https://issues.apache.org/jira/projects/ARROW/issues
Encoding: UTF-8
Language: en-US
SystemRequirements: C++11; for AWS S3 support on Linux, libcurl and openssl (optional)
Biarch: true
Imports:
    assertthat,
    bit64 (>= 0.9-7),
    methods,
    purrr,
    R6,
    rlang,
    stats,
    tidyselect,
    utils,
    vctrs
Roxygen: list(markdown = TRUE, r6 = FALSE, load = "source")
RoxygenNote: 7.1.1
VignetteBuilder: knitr
Suggests:
    decor,
    distro,
    dplyr,
    hms,
    knitr,
    lubridate,
    pkgload,
    reticulate,
    rmarkdown,
    stringr,
    testthat,
    tibble,
    withr
LinkingTo: cpp11 (>= 0.2.0)
Collate:
    'enums.R'
    'arrow-package.R'
    'type.R'
    'array-data.R'
    'arrow-datum.R'
    'array.R'
    'arrow-tabular.R'
    'arrowExports.R'
    'buffer.R'
    'chunked-array.R'
    'io.R'
    'compression.R'
    'scalar.R'
    'compute.R'
    'config.R'
    'csv.R'
    'dataset.R'
    'dataset-factory.R'
    'dataset-format.R'
    'dataset-partition.R'
    'dataset-scan.R'
    'dataset-write.R'
    'deprecated.R'
    'dictionary.R'
    'dplyr-arrange.R'
    'dplyr-collect.R'
    'dplyr-eval.R'
    'dplyr-filter.R'
    'expression.R'
    'dplyr-functions.R'
    'dplyr-group-by.R'
    'dplyr-mutate.R'
    'dplyr-select.R'
    'dplyr-summarize.R'
    'record-batch.R'
    'table.R'
    'dplyr.R'
    'feather.R'
    'field.R'
    'filesystem.R'
    'flight.R'
    'install-arrow.R'
    'ipc_stream.R'
    'json.R'
    'memory-pool.R'
    'message.R'
    'metadata.R'
    'parquet.R'
    'python.R'
    'record-batch-reader.R'
    'record-batch-writer.R'
    'reexports-bit64.R'
    'reexports-tidyselect.R'
    'schema.R'
    'util.R'