-
Notifications
You must be signed in to change notification settings - Fork 260
poc: New ballista python interface #1338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@timsaucer you may be able to advise, new approach to add ballista support to datafusion python without changing datafusion python code directly. It is in state of early POC but I wanted to share it early, as rust python is not my strongest strength 😀 |
3473489 to
3750034
Compare
python/python/ballista/extension.py
Outdated
| class DistributedDataFrame(DataFrame, metaclass=RedefiningMeta): | ||
| def __init__(self, df: DataFrame, session_id: str, address: str): | ||
| super().__init__(df.df) | ||
| self.address = address | ||
| self.session_id = session_id | ||
|
|
||
| def _to_internal_df(self): | ||
| blob_plan = self.logical_plan().to_proto() | ||
| df = PyBallistaRemoteExecutor.create_data_frame( | ||
| blob_plan, self.address, self.session_id | ||
| ) | ||
| return df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main idea, right? That you expose to the user a python class that uses the datafusion class as a Base and then extend on top of it. Did I get that right?
I think this is good approach. The one concern I have is users who accidentally use a mixture of DataFusion objects and Ballista extensions. The way I could imaging this happening was that someone has some portion of their code base that they've copied over from their single node work.
The other approach, and the one that I think in the long term better would be to make it possible for Ballista to just replace the execution plan via a physical plan optimizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for replacing the physical planning/optimizer path to use Ballista instead of DataFusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
physical planner would be the best option, not sure if with @timsaucer FFI effort we're closer to get there
|
Long term plan, and the best plan, would be if we could just plug-in ballista's physical planner to datafusion-python. The main idea of this approach is to extend Regarding your concern @timsaucer, I'm not sure that I fully understand it. Current goal would be for ballista just to expose Current risks I see:
|
Which issue does this PR close?
Closes #1142
Rationale for this change
For quite some time, we wanted to provide a Ballista Python interface and make it an extension of DataFusion Python. For the reasons mentioned in #1142, we haven't been able to do so. The main issue was that we could not use DataFrame as there was a class mismatch, something like
was not possible due to FFI boundary between datafusion and ballista python.
We also wanted to reuse and rely on datafusion python as much as possible, as we do not have bandwidth to maintain duplicated effort.
What changes are included in this PR?
This PR relies on python duck typing, to "fake"
DataFrameinterface, withDistributedDataFrameextension which would execute query on ballista cluster.There is slight inneficiency where original logical plan will be serialised in datafusion python and deserialised in ballista python in order to cross FFI boundary as well as re-creation of BallistaContext
Also, we would need to override few methods to make it work with ballista.
Things left to do / address
Note of caution: there are too many sharp edges, but so far it looks as most promising approach
Are there any user-facing changes?
There will be change in interface, but too early to tell