Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't dictionary_encode a dataset column. `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* With the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.

There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights:

* 5b501c5 is the main switch to use InMemoryDataset
* b31fb5e deletes `array_expression`
* 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions
* 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
* a0914f6 + eee491a contain ARROW-12696

Closes #10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Package: arrow
Title: Integration to 'Apache' 'Arrow'
Version: 4.0.0.9000
Authors@R: c(
    person("Neal", "Richardson", email = "neal@ursalabs.org", role = c("aut", "cre")),
    person("Ian", "Cook", email = "ianmcook@gmail.com", role = c("aut")),
    person("Jonathan", "Keane", email = "jkeane@gmail.com", role = c("aut")),
    person("Romain", "Fran\u00e7ois", email = "romain@rstudio.com", role = c("aut"), comment = c(ORCID = "0000-0002-2444-4226")),
    person("Jeroen", "Ooms", email = "jeroen@berkeley.edu", role = c("aut")),
    person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),
    person("Jeffrey", "Wong", email = "jeffreyw@netflix.com", role = c("ctb")),
    person("Apache Arrow", email = "dev@arrow.apache.org", role = c("aut", "cph"))
  )
Description: 'Apache' 'Arrow' <https://arrow.apache.org/> is a cross-language
    development platform for in-memory data. It specifies a standardized
    language-independent columnar memory format for flat and hierarchical data,
    organized for efficient analytic operations on modern hardware. This
    package provides an interface to the 'Arrow C++' library.
Depends: R (>= 3.3)
License: Apache License (>= 2.0)
URL: https://github.com/apache/arrow/, https://arrow.apache.org/docs/r/
BugReports: https://issues.apache.org/jira/projects/ARROW/issues
Encoding: UTF-8
Language: en-US
SystemRequirements: C++11; for AWS S3 support on Linux, libcurl and openssl (optional)
Biarch: true
Imports:
    assertthat,
    bit64 (>= 0.9-7),
    methods,
    purrr,
    R6,
    rlang,
    stats,
    tidyselect,
    utils,
    vctrs
Roxygen: list(markdown = TRUE, r6 = FALSE, load = "source")
RoxygenNote: 7.1.1
VignetteBuilder: knitr
Suggests:
    decor,
    distro,
    dplyr,
    hms,
    knitr,
    lubridate,
    pkgload,
    reticulate,
    rmarkdown,
    stringr,
    testthat,
    tibble,
    withr
LinkingTo: cpp11 (>= 0.2.0)
Collate:
    'enums.R'
    'arrow-package.R'
    'type.R'
    'array-data.R'
    'arrow-datum.R'
    'array.R'
    'arrow-tabular.R'
    'arrowExports.R'
    'buffer.R'
    'chunked-array.R'
    'io.R'
    'compression.R'
    'scalar.R'
    'compute.R'
    'config.R'
    'csv.R'
    'dataset.R'
    'dataset-factory.R'
    'dataset-format.R'
    'dataset-partition.R'
    'dataset-scan.R'
    'dataset-write.R'
    'deprecated.R'
    'dictionary.R'
    'dplyr-arrange.R'
    'dplyr-collect.R'
    'dplyr-eval.R'
    'dplyr-filter.R'
    'expression.R'
    'dplyr-functions.R'
    'dplyr-group-by.R'
    'dplyr-mutate.R'
    'dplyr-select.R'
    'dplyr-summarize.R'
    'record-batch.R'
    'table.R'
    'dplyr.R'
    'feather.R'
    'field.R'
    'filesystem.R'
    'flight.R'
    'install-arrow.R'
    'ipc_stream.R'
    'json.R'
    'memory-pool.R'
    'message.R'
    'metadata.R'
    'parquet.R'
    'python.R'
    'record-batch-reader.R'
    'record-batch-writer.R'
    'reexports-bit64.R'
    'reexports-tidyselect.R'
    'schema.R'
    'util.R'