
Leverage PartialReduce AggregationExec mode to drastically reduce shuffle size #360

@fmonjalet


Summary

When DataFusion 53 is released, we should use the PartialReduce mode of AggregateExec (apache/datafusion#20019, kudos to @njsmith) for maximum local data reduction before shuffling.

The expected benefit is a reduction in the amount of data exchanged by up to a factor of n_partition_per_task (for example, 16x).

Details

Very schematically, here is how a shuffle happens:

[Figure: current plan — partial aggregates from each partition are shuffled directly]

My understanding is that we will send duplicate aggregate keys over the network, when we could have reduced them to a single row per key locally before the shuffle.

We should leverage PartialReduce to further reduce data before shuffling:

[Figure: proposed plan — AggregateExec(PartialReduce) merges local partitions before the shuffle]

In this example, we'd reduce data exchanged by 4x (because there are 4 partitions per task).
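The 4x figure can be checked with a small stdlib-only simulation (hypothetical data, not DataFusion code): one task holds 4 partitions of partially aggregated (key, count) pairs, and in the worst case for the current plan the same group keys appear in every partition.

```rust
use std::collections::HashMap;

// Without PartialReduce, every partition ships its own copy of each key.
fn rows_shuffled(partitions: &[HashMap<&'static str, u64>]) -> usize {
    partitions.iter().map(|p| p.len()).sum()
}

// With PartialReduce, local partitions are merged first, so each key
// leaves the task at most once.
fn partial_reduce(
    partitions: &[HashMap<&'static str, u64>],
) -> HashMap<&'static str, u64> {
    let mut merged = HashMap::new();
    for p in partitions {
        for (k, v) in p {
            *merged.entry(*k).or_insert(0u64) += v;
        }
    }
    merged
}

fn main() {
    // 4 local partitions, each holding the same 3 group keys.
    let keys = ["a", "b", "c"];
    let partitions: Vec<HashMap<&'static str, u64>> = (0..4)
        .map(|_| keys.iter().map(|k| (*k, 1u64)).collect())
        .collect();

    let before = rows_shuffled(&partitions); // 4 partitions x 3 keys = 12 rows
    let after = partial_reduce(&partitions).len(); // 3 rows, one per key
    println!("{before} rows shuffled naively, {after} after PartialReduce"); // 12 vs 3
}
```

When keys fully overlap across local partitions, the reduction equals the number of partitions per task; when keys are disjoint, PartialReduce saves nothing, so the factor is an upper bound.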

I have no strong opinion on whether we should add another RepartitionExec layer as in this graph, or run the AggregateExec(PartialReduce) directly on the "ready to shuffle" partitions. Since shuffles can have hundreds of partitions, I don't know the comparative cost of:

  • RepartitionExec(Hash, output_partitions=16) + AggregateExec(PartialReduce) on 16 partitions
  • vs AggregateExec(PartialReduce) on potentially hundreds of partitions
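One way to see part of this trade-off (a stdlib-only sketch with made-up sizes, not DataFusion code): the number of rows shuffled is identical under both plans, because each key hashes to exactly one shuffle partition either way; what differs is how many hash tables each task maintains, plus the cost of the extra RepartitionExec pass itself.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Assign a key to one of `n` hash partitions.
fn bucket(key: &str, n: u64) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() % n) as usize
}

// Hash-partition rows into `n` buckets and reduce each bucket.
// Returns (rows sent over the network, non-empty hash tables maintained).
fn reduce_into(rows: &[(String, u64)], n: u64) -> (usize, usize) {
    let mut tables: Vec<HashMap<&str, u64>> = vec![HashMap::new(); n as usize];
    for (k, v) in rows {
        *tables[bucket(k, n)].entry(k.as_str()).or_insert(0) += v;
    }
    let sent = tables.iter().map(|t| t.len()).sum();
    let nonempty = tables.iter().filter(|t| !t.is_empty()).count();
    (sent, nonempty)
}

fn main() {
    // Hypothetical task: 10_000 input rows over 500 distinct keys.
    let rows: Vec<(String, u64)> = (0..10_000)
        .map(|i| (format!("k{}", i % 500), 1u64))
        .collect();

    let (sent_16, tables_16) = reduce_into(&rows, 16);
    let (sent_256, tables_256) = reduce_into(&rows, 256);

    // Each key lands in exactly one bucket, so both plans shuffle the
    // same 500 rows; only the per-table overhead differs.
    assert_eq!(sent_16, sent_256);
    println!("16 partitions: {sent_16} rows, {tables_16} tables");
    println!("256 partitions: {sent_256} rows, {tables_256} tables");
}
```

So the comparison likely comes down to per-hash-table overhead and batching (many tiny tables and output batches with hundreds of partitions, versus fewer larger ones plus a repartition pass), rather than the volume of data shuffled.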

Labels: enhancement (New feature or request)