Skip to content

[Bug](scan) Preserve IN_LIST runtime filter predicates when key range is a scope range#62027

Open
BiteTheDDDDt wants to merge 6 commits intoapache:masterfrom
BiteTheDDDDt:fix_0402
Open

[Bug](scan) Preserve IN_LIST runtime filter predicates when key range is a scope range#62027
BiteTheDDDDt wants to merge 6 commits intoapache:masterfrom
BiteTheDDDDt:fix_0402

Conversation

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor

This pull request addresses a bug in the OLAP scan operator where IN_LIST predicates could be incorrectly erased when both MINMAX and IN runtime filters targeted the same key column, and the number of IN values exceeded the maximum allowed for pushdown. The changes ensure that IN_LIST predicates are preserved in such cases, preventing incorrect query results. Additionally, a regression test is added to verify the fix.

Bug fix in predicate handling:

  • Modified the logic in _build_key_ranges_and_filters() within olap_scan_operator.cpp to ensure that IN_LIST predicates are not erased when the key range is a scope range (e.g., >= X AND <= Y) and the IN filter's value count exceeds max_pushdown_conditions_per_column. This preserves filtering semantics that are not captured by the scope range. [1] [2] [3]

  • Enhanced the profiling output in _process_conjuncts() to accurately reflect the set of predicates that will reach the storage layer after key range and filter construction. This helps with debugging and verification of predicate pushdown.

Testing and regression coverage:

  • Added a new regression test test_rf_in_list_not_erased_by_scope_range.groovy to verify that IN_LIST predicates are not incorrectly erased when both MINMAX and IN filters are present and the IN list is too large to be absorbed into the key range.

  • Added the corresponding expected output file for the new regression test.

…s a scope range

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:
Commit d7153c0 ("[fix](predicate) Avoid recomputing predicates (apache#60484)")
introduced a regression where IN_LIST predicates from IN_FILTER runtime filters
are incorrectly erased during key range construction in _build_key_ranges_and_filters().

When both MINMAX and IN runtime filters target the same key column, and the IN
filter value count exceeds max_pushdown_conditions_per_column, the IN values are
not absorbed into the ColumnValueRange (which remains a scope range from MINMAX).
However, the erase logic treated the scope range as "exact" and removed all
could_be_erased() predicates including the IN_LIST predicate. This caused the
IN_LIST to never reach the storage layer for row-level filtering, resulting in
significant performance degradation.

The fix preserves IN_LIST predicates when the ColumnValueRange is a scope range
(not a fixed value range), since scope ranges do not capture IN_LIST filtering
semantics. Additionally, the PushDownPredicates profile string is now re-printed
after _build_key_ranges_and_filters() to accurately reflect which predicates
actually reach the storage layer.

### Release note

Fixed a performance regression where IN_LIST runtime filter predicates were
incorrectly erased when a MINMAX runtime filter created a scope range on the
same key column, causing the IN filter to have no effect on row filtering.

### Check List (For Author)

- Test: Regression test
    - Regression test: test_rf_in_list_not_erased_by_scope_range
- Behavior changed: No
- Does this need documentation: No
Copilot AI review requested due to automatic review settings April 2, 2026 02:07
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt BiteTheDDDDt changed the title [fix](be) Preserve IN_LIST runtime filter predicates when key range is a scope range [Bug](scan) Preserve IN_LIST runtime filter predicates when key range is a scope range Apr 2, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a BE predicate-pushdown bug in the OLAP scan operator where IN_LIST runtime-filter predicates could be incorrectly erased when a key column’s ColumnValueRange is a scope range (e.g. from MINMAX RF), leading to incorrect filtering semantics. Adds a regression test intended to validate the scenario.

Changes:

  • Preserve IN_LIST column predicates when the key range is derived from a non-fixed (scope) ColumnValueRange in _build_key_ranges_and_filters().
  • Update profiling output to reflect the final predicate set after key range construction potentially erases predicates.
  • Add a new regression test suite + expected output for the reported runtime-filter interaction.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
be/src/exec/operator/olap_scan_operator.cpp Adjusts key-range/predicate erasure logic to retain IN_LIST predicates for scope ranges; re-prints profile predicate info after erasure.
regression-test/suites/correctness_p0/test_rf_in_list_not_erased_by_scope_range.groovy New regression test case targeting the reported runtime-filter + pushdown-limit scenario.
regression-test/data/correctness_p0/test_rf_in_list_not_erased_by_scope_range.out Expected output for the new regression test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +79 to +87

// The join should only return 6 rows (matching k1 in {2,4,6,8,10,12}).
// If the IN_LIST predicate is incorrectly erased, the MINMAX scope [2,12]
// would let through rows with k1 in {3,5,7,9,11} as well, producing wrong results.
// We verify correctness by checking the result.
order_qt_join """
SELECT p.k1, p.v1, b.k1, b.v1
FROM rf_scope_probe p
INNER JOIN rf_scope_build b ON p.k1 = b.k1
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regression currently uses an INNER JOIN, so the result set is determined by the join condition (p.k1=b.k1) even if the runtime filter predicates are missing/erased; that means the test likely won’t fail under the bug described. To make it sensitive to an over-permissive runtime filter (e.g., MINMAX scope without the IN_LIST), consider using a LEFT SEMI JOIN / EXISTS pattern and enabling enable_left_semi_direct_return_opt (as other runtime-filter correctness tests do), so false positives like k1=3,5,7,9,11 would surface in the output when IN_LIST is erased.

Suggested change
// The join should only return 6 rows (matching k1 in {2,4,6,8,10,12}).
// If the IN_LIST predicate is incorrectly erased, the MINMAX scope [2,12]
// would let through rows with k1 in {3,5,7,9,11} as well, producing wrong results.
// We verify correctness by checking the result.
order_qt_join """
SELECT p.k1, p.v1, b.k1, b.v1
FROM rf_scope_probe p
INNER JOIN rf_scope_build b ON p.k1 = b.k1
sql "set enable_left_semi_direct_return_opt = true;"
// The join should only return 6 rows (matching k1 in {2,4,6,8,10,12}).
// If the IN_LIST predicate is incorrectly erased, the MINMAX scope [2,12]
// would let through rows with k1 in {3,5,7,9,11} as well, producing wrong results.
// We verify correctness by checking the result.
order_qt_join """
SELECT p.k1, p.v1
FROM rf_scope_probe p
WHERE EXISTS (
SELECT 1
FROM rf_scope_build b
WHERE p.k1 = b.k1
)

Copilot uses AI. Check for mistakes.
// Use both IN and MIN_MAX runtime filter types so both are generated on the join key.
sql "set runtime_filter_type = 'IN_OR_BLOOM_FILTER,MIN_MAX';"
sql "set runtime_filter_wait_time_ms = 10000;"
sql "set runtime_filter_wait_infinitely = true;"
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime_filter_wait_infinitely = true makes the effective wait time fall back to the query execution timeout (see RuntimeFilterConsumer), so the preceding runtime_filter_wait_time_ms = 10000 no longer bounds the wait and the test could stall much longer than intended if a runtime filter is not produced. Consider setting runtime_filter_wait_infinitely=false (or removing it) if the goal is to cap waits at 10s, or otherwise update the comments to reflect the longer wait behavior.

Suggested change
sql "set runtime_filter_wait_infinitely = true;"
// Cap runtime filter wait at 10s for this test instead of falling back to the query timeout.
sql "set runtime_filter_wait_infinitely = false;"

Copilot uses AI. Check for mistakes.
Comment on lines +404 to +415
fmt::memory_buffer buf;
for (const auto& [slot_id, predicates] : _slot_id_to_predicates) {
if (predicates.empty()) {
continue;
}
fmt::format_to(buf, "Slot ID: {}: [", slot_id);
for (const auto& predicate : predicates) {
fmt::format_to(buf, "{{{}}}, ", predicate->debug_string());
}
fmt::format_to(buf, "] ");
}
custom_profile()->add_info_string("PushDownPredicates", fmt::to_string(buf));
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block duplicates the formatting logic already implemented in ScanLocalState (see predicates_to_string in scan_operator.cpp). Consider factoring that helper into a shared header/utility so predicate formatting stays consistent across scan operators and avoids future drift when the format needs to change.

Suggested change
fmt::memory_buffer buf;
for (const auto& [slot_id, predicates] : _slot_id_to_predicates) {
if (predicates.empty()) {
continue;
}
fmt::format_to(buf, "Slot ID: {}: [", slot_id);
for (const auto& predicate : predicates) {
fmt::format_to(buf, "{{{}}}, ", predicate->debug_string());
}
fmt::format_to(buf, "] ");
}
custom_profile()->add_info_string("PushDownPredicates", fmt::to_string(buf));
custom_profile()->add_info_string(
"PushDownPredicates", predicates_to_string(_slot_id_to_predicates));

Copilot uses AI. Check for mistakes.
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

/review

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

No issues found in review. I checked the scan-key/predicate rewrite, the runtime-filter end-to-end path, and the added regression coverage; the change looks correct and appropriately scoped.

// Comparison predicates (EQ/LT/LE/GT/GE): subsumed by both
// fixed value and scope ranges.
if (PredicateTypeTraits::is_comparison(pt)) {
return true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 1007 check should after the PredicateTypeTraits::is_comparison(pt)

(pt == PredicateType::NOT_IN_LIST && opposite)) {
return is_fixed_value_range;
}
// Everything else (BF, BITMAP, NOT_IN_LIST, IS_NULL, etc.): keep.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is null should be prune

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (37/37) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.61% (27297/37085)
Line Coverage 57.25% (294080/513698)
Region Coverage 54.53% (245233/449739)
Branch Coverage 56.19% (106218/189044)

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.94% (20053/37882)
Line Coverage 36.53% (188306/515511)
Region Coverage 32.83% (146363/445854)
Branch Coverage 33.95% (64042/188625)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants