Improve shuffling performance #385

@gabotechs

The following PR added support for micro benchmarking shuffle operations:

The benchmarks shipped in that PR cover several cases. All of them shuffle the same amount of data, but the output tasks and partitioning differ. I'd expect all cases to take roughly the same time, since they shuffle the same amount of data, but there is some variation:

Benchmarking shuffle/stream/1producer_1consumer_8partitions_1000000rows_1024batch_Nonecompression: 
                        time:   [90.704 ms 91.729 ms 94.261 ms]
shuffle/stream/1producer_1consumer_8partitions_1000000rows_1024batch_Some(LZ4_FRAME)compression: 
                        time:   [115.04 ms 125.10 ms 139.33 ms]
Benchmarking shuffle/stream/8producer_1consumer_8partitions_1000000rows_1024batch_Nonecompression: 
                        time:   [60.487 ms 60.780 ms 61.057 ms]
Benchmarking shuffle/stream/1producer_8consumer_8partitions_1000000rows_1024batch_Nonecompression: 
                        time:   [197.23 ms 197.69 ms 198.10 ms]
Benchmarking shuffle/stream/8producer_8consumer_8partitions_1000000rows_1024batch_Nonecompression: 
                        time:   [76.776 ms 77.825 ms 80.544 ms]
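
To put the variation in perspective, the median times above can be normalized against the fastest case. This is only a rough sketch using the medians reported above; the dictionary keys are shortened labels made up for illustration (producers, consumers, compression), not names from the benchmark suite.

```python
# Median times in milliseconds, copied from the benchmark output above.
# Keys are hypothetical short labels: <producers>p_<consumers>c_<compression>.
medians_ms = {
    "1p_1c_none": 91.729,
    "1p_1c_lz4": 125.10,
    "8p_1c_none": 60.780,
    "1p_8c_none": 197.69,
    "8p_8c_none": 77.825,
}
ROWS = 1_000_000  # every case shuffles the same number of rows

fastest = min(medians_ms.values())

# Slowdown of each case relative to the fastest one.
slowdown = {name: ms / fastest for name, ms in medians_ms.items()}

# Throughput in rows per second, derived from the median time.
rows_per_sec = {name: ROWS / (ms / 1000.0) for name, ms in medians_ms.items()}

slowest = max(medians_ms, key=medians_ms.get)

# Print the cases from fastest to slowest.
for name in sorted(medians_ms, key=medians_ms.get):
    print(
        f"{name:12s} {medians_ms[name]:8.3f} ms  "
        f"{slowdown[name]:.2f}x  {rows_per_sec[name] / 1e6:.1f} M rows/s"
    )
```

On these numbers, the 1-producer, 8-consumer case is a bit over 3x slower than the 8-producer, 1-consumer case, even though both move the same rows.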

I think there should be some low-hanging fruit regarding performance, as well as some more complicated improvements:

  • I suspect arrow-flight is cloning data unnecessarily
  • RepartitionExec might be doing something funny depending on the number of output tasks
