Skip to content

[Enhancement] Iceberg: skip files before open() using runtime filter min/max vs manifest column stats #71005

@osscm

Description

@osscm

Feature request

Is your feature request related to a problem? Please describe.

When executing star-schema queries that join a large Iceberg fact table with a small
filtered dimension table, StarRocks waits for runtime filters (RF) to arrive before
scanning (PRECONDITION_BLOCK). However, the RF is then only applied as a row-level
predicate after each file is already opened and data pages are read from remote storage
(S3/HDFS).

The Iceberg manifest already contains per-file column statistics (lower_bounds /
upper_bounds) that are available on the BE before any file is opened. Files whose
entire column range falls outside the RF's [min, max] are still opened and read,
wasting remote I/O unnecessarily.

Describe the solution you'd like

After the runtime filter arrives and before open() is called on each file, compare
the RF's [min, max] against the corresponding column's per-file stats from the Iceberg
manifest. If the file's range and the RF range do not overlap, skip the file entirely —
no file open, no remote read. This eliminates I/O at the file level rather than
discarding rows after they are already read.

Describe alternatives you've considered

Coordinator-side pruning: prune the scan range list at the FE before dispatching to
BEs. This saves split metadata transfer in addition to I/O, but requires protocol
changes (new Thrift fields + RPC). The BE-side file-skip described above delivers the
primary I/O benefit with no protocol changes and is a simpler first step.

Additional context

Trino benchmarks on TPC-DS show 50–66% reduction in data read on queries where dynamic
partition pruning is effective. The per-file min/max statistics are already populated in
THdfsScanRange.min_max_values from Iceberg manifest metadata — no new metadata
infrastructure is needed.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions