Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generalize table provider df impl #1712

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

cpcloud
Copy link
Contributor

@cpcloud cpcloud commented Jan 30, 2022

This is a draft PR following up a discussion on #1699.

Issues:

  1. The execution context state is using Default, it compiles and tests pass,
    but I don't know if that's the right thing to do there.
  2. It doesn't appear that it's possible to allow registration of dyn DataFrame without the trait_upcasting feature, which is unstable.
  3. The schema methods are ambiguous when use datafusion::prelude::* is used (because both traits have a schema method), which makes using either's method without fully qualified syntax pretty ugly.

@cpcloud cpcloud requested review from alamb and houqp Jan 30, 2022
@houqp
Copy link
Member

@houqp houqp commented Jan 30, 2022

@cpcloud good point on lack of access to the trait_upcasting feature :( I found an interesting workaround for this online based on https://articles.bchlr.de/traits-dynamic-dispatch-upcasting:

diff --git a/datafusion/src/datasource/datasource.rs b/datafusion/src/datasource/datasource.rs
index 1b59c857f..703d14493 100644
--- a/datafusion/src/datasource/datasource.rs
+++ b/datafusion/src/datasource/datasource.rs
@@ -55,9 +55,15 @@ pub enum TableType {
     Temporary,
 }
 
+pub trait AsDynTableProvider {
+    fn as_dyn_table<'a>(self: Arc<Self>) -> Arc<dyn TableProvider + 'a>
+    where
+        Self: 'a;
+}
+
 /// Source table
 #[async_trait]
-pub trait TableProvider: Sync + Send {
+pub trait TableProvider: Sync + Send + AsDynTableProvider {
     /// Returns the table provider as [`Any`](std::any::Any) so that it can be
     /// downcast to a specific implementation.
     fn as_any(&self) -> &dyn Any;
@@ -94,3 +100,12 @@ pub trait TableProvider: Sync + Send {
         Ok(TableProviderFilterPushDown::Unsupported)
     }
 }
+
+impl<T: TableProvider + Sized> AsDynTableProvider for T {
+    fn as_dyn_table<'a>(self: Arc<Self>) -> Arc<dyn TableProvider + 'a>
+    where
+        Self: 'a,
+    {
+        self
+    }
+}
diff --git a/datafusion/src/execution/dataframe_impl.rs b/datafusion/src/execution/dataframe_impl.rs
index c1933adaa..eb59e3a93 100644
--- a/datafusion/src/execution/dataframe_impl.rs
+++ b/datafusion/src/execution/dataframe_impl.rs
@@ -550,7 +550,7 @@ mod tests {
         let mut ctx = ExecutionContext::new();
 
         // register a dataframe as a table
-        ctx.register_table("test_table", df)?;
+        ctx.register_table("test_table", df.as_dyn_table())?;
         Ok(())
     }

The schema method name conflict is unfortunate and I don't have a good solution for that other than renaming one of the methods :( Perhaps other contributors can jump in to help suggest workarounds.

@houqp
Copy link
Member

@houqp houqp commented Jan 31, 2022

As for the first issue you mentioned, I think we need to avoid referencing both DataFrameImpl and ExecutionContext within the default TableProvider implementation because these two types are specific to the DataFrameImpl implementation of the trait.

However, now that we don't have a separate Dataframe implementation in ballista anymore, I am not sure what we gain from having the DataFrame trait defined. @andygrove @realno @yahoNanJing @alamb @Dandandan do you think it makes sense to just replace DataFrameImpl with Dataframe now and get rid of all the trait object overheads?

@alamb
Copy link
Contributor

@alamb alamb commented Jan 31, 2022

do you think it makes sense to just replace DataFrameImpl with Dataframe now and get rid of all the trait object overheads?

It seems to make sense to me -- it seems to me if someone wanted a different dataframe implementation they would probably just implement it themselves rather than trying to conform to an interface defined by DataFusion. However, I may not be a good arbiter of what is useful and what is not as I don't use the DataFrame impl much

I can't seem to find where the DataFrame trait was defined now (the history gets murky for me when the code got moved around)

@realno
Copy link
Contributor

@realno realno commented Feb 1, 2022

do you think it makes sense to just replace DataFrameImpl with Dataframe now and get rid of all the trait object overheads?

I like the idea of simplifying interfaces - given the issues with this PR I think it seems to be a good idea to just consolidate DataFrame. Please take this as an opinion given my limited knowledge of the code base. Another thought is, will this have any impact on things like Python library?

@houqp
Copy link
Member

@houqp houqp commented Feb 1, 2022

@alamb it's defined in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/dataframe.rs.

@realno should have no impact to the python binding since traits are implementation details in the Rust land that won't get exposed to the python runtime.

Let's leave this on for couple days to see if @andygrove @Jimexist @yahoNanJing has a strong opinion on this or not.

@alamb
Copy link
Contributor

@alamb alamb commented Feb 1, 2022

@alamb it's defined in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/dataframe.rs.

Thanks @houqp -- looking at the history of that file the DataFrame trait appears to have been around a long time, lending credence to the idea that it might be due for some refactoring

@matthewmturner
Copy link
Contributor

@matthewmturner matthewmturner commented Mar 4, 2022

While working on #1922 i also ran into some issues with the DataFrame trait that i wasnt expecting. Since I hadnt really worked on that part of the code base before i was also surprised it was a trait and not a struct. I would be supportive of moving DataFrame to a struct as well.

@houqp
Copy link
Member

@houqp houqp commented Mar 9, 2022

Great, I filed #1962 to track the work.

@andygrove
Copy link
Member

@andygrove andygrove commented Mar 10, 2022

I think my original intent was to provide different implementations for DataFusion vs Ballista but we have a better solution now with the DataFusion context allowing a pluggable query planner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants