apache / arrow

nealrichardson ARROW-12731 : [R] Use InMemoryDataset for Table/RecordBatch in dplyr code

Latest commit 9347731

May 13, 2021

Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't dictionary_encode a dataset column. `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* With the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.

There are a lot of changes here, which means the diff is big, but I've tried to group the main changes into distinct commits. Highlights:

* 5b501c5 is the main switch to use InMemoryDataset
* b31fb5e deletes `array_expression`
* 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way. I hope it makes adding new functions much simpler
* 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
* a0914f6 + eee491a contain ARROW-12696

Closes #10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

14 contributors

111 lines (111 sloc) 2.9 KB

Package: arrow
Title: Integration to 'Apache' 'Arrow'
Version: 4.0.0.9000
Authors@R: c(
    person("Neal", "Richardson", email = "neal@ursalabs.org", role = c("aut", "cre")),
    person("Ian", "Cook", email = "ianmcook@gmail.com", role = c("aut")),
    person("Jonathan", "Keane", email = "jkeane@gmail.com", role = c("aut")),
    person("Romain", "Fran\u00e7ois", email = "romain@rstudio.com", role = c("aut"), comment = c(ORCID = "0000-0002-2444-4226")),
    person("Jeroen", "Ooms", email = "jeroen@berkeley.edu", role = c("aut")),
    person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),
    person("Jeffrey", "Wong", email = "jeffreyw@netflix.com", role = c("ctb")),
    person("Apache Arrow", email = "dev@arrow.apache.org", role = c("aut", "cph"))
  )
Description: 'Apache' 'Arrow' <https://arrow.apache.org/> is a cross-language
    development platform for in-memory data. It specifies a standardized
    language-independent columnar memory format for flat and hierarchical data,
    organized for efficient analytic operations on modern hardware. This
    package provides an interface to the 'Arrow C++' library.
Depends: R (>= 3.3)
License: Apache License (>= 2.0)
URL: https://github.com/apache/arrow/, https://arrow.apache.org/docs/r/
BugReports: https://issues.apache.org/jira/projects/ARROW/issues
Encoding: UTF-8
Language: en-US
SystemRequirements: C++11; for AWS S3 support on Linux, libcurl and openssl (optional)
Biarch: true
Imports:
    assertthat,
    bit64 (>= 0.9-7),
    methods,
    purrr,
    R6,
    rlang,
    stats,
    tidyselect,
    utils,
    vctrs
Roxygen: list(markdown = TRUE, r6 = FALSE, load = "source")
RoxygenNote: 7.1.1
VignetteBuilder: knitr
Suggests:
    decor,
    distro,
    dplyr,
    hms,
    knitr,
    lubridate,
    pkgload,
    reticulate,
    rmarkdown,
    stringr,
    testthat,
    tibble,
    withr
LinkingTo: cpp11 (>= 0.2.0)
Collate:
    'enums.R'
    'arrow-package.R'
    'type.R'
    'array-data.R'
    'arrow-datum.R'
    'array.R'
    'arrow-tabular.R'
    'arrowExports.R'
    'buffer.R'
    'chunked-array.R'
    'io.R'
    'compression.R'
    'scalar.R'
    'compute.R'
    'config.R'
    'csv.R'
    'dataset.R'
    'dataset-factory.R'
    'dataset-format.R'
    'dataset-partition.R'
    'dataset-scan.R'
    'dataset-write.R'
    'deprecated.R'
    'dictionary.R'
    'dplyr-arrange.R'
    'dplyr-collect.R'
    'dplyr-eval.R'
    'dplyr-filter.R'
    'expression.R'
    'dplyr-functions.R'
    'dplyr-group-by.R'
    'dplyr-mutate.R'
    'dplyr-select.R'
    'dplyr-summarize.R'
    'record-batch.R'
    'table.R'
    'dplyr.R'
    'feather.R'
    'field.R'
    'filesystem.R'
    'flight.R'
    'install-arrow.R'
    'ipc_stream.R'
    'json.R'
    'memory-pool.R'
    'message.R'
    'metadata.R'
    'parquet.R'
    'python.R'
    'record-batch-reader.R'
    'record-batch-writer.R'
    'reexports-bit64.R'
    'reexports-tidyselect.R'
    'schema.R'
    'util.R'

arrow/r/DESCRIPTION
