[Bug](scan) Preserve IN_LIST runtime filter predicates when key range is a scope range#62027
[Bug](scan) Preserve IN_LIST runtime filter predicates when key range is a scope range#62027BiteTheDDDDt wants to merge 6 commits intoapache:masterfrom
Conversation
…s a scope range ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: Commit d7153c0 ("[fix](predicate) Avoid recomputing predicates (apache#60484)") introduced a regression where IN_LIST predicates from IN_FILTER runtime filters are incorrectly erased during key range construction in _build_key_ranges_and_filters(). When both MINMAX and IN runtime filters target the same key column, and the IN filter value count exceeds max_pushdown_conditions_per_column, the IN values are not absorbed into the ColumnValueRange (which remains a scope range from MINMAX). However, the erase logic treated the scope range as "exact" and removed all could_be_erased() predicates including the IN_LIST predicate. This caused the IN_LIST to never reach the storage layer for row-level filtering, resulting in significant performance degradation. The fix preserves IN_LIST predicates when the ColumnValueRange is a scope range (not a fixed value range), since scope ranges do not capture IN_LIST filtering semantics. Additionally, the PushDownPredicates profile string is now re-printed after _build_key_ranges_and_filters() to accurately reflect which predicates actually reach the storage layer. ### Release note Fixed a performance regression where IN_LIST runtime filter predicates were incorrectly erased when a MINMAX runtime filter created a scope range on the same key column, causing the IN filter to have no effect on row filtering. ### Check List (For Author) - Test: Regression test - Regression test: test_rf_in_list_not_erased_by_scope_range - Behavior changed: No - Does this need documentation: No
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
Fixes a BE predicate-pushdown bug in the OLAP scan operator where IN_LIST runtime-filter predicates could be incorrectly erased when a key column’s ColumnValueRange is a scope range (e.g. from MINMAX RF), leading to incorrect filtering semantics. Adds a regression test intended to validate the scenario.
Changes:
- Preserve
IN_LISTcolumn predicates when the key range is derived from a non-fixed (scope)ColumnValueRangein_build_key_ranges_and_filters(). - Update profiling output to reflect the final predicate set after key range construction potentially erases predicates.
- Add a new regression test suite + expected output for the reported runtime-filter interaction.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
be/src/exec/operator/olap_scan_operator.cpp |
Adjusts key-range/predicate erasure logic to retain IN_LIST predicates for scope ranges; re-prints profile predicate info after erasure. |
regression-test/suites/correctness_p0/test_rf_in_list_not_erased_by_scope_range.groovy |
New regression test case targeting the reported runtime-filter + pushdown-limit scenario. |
regression-test/data/correctness_p0/test_rf_in_list_not_erased_by_scope_range.out |
Expected output for the new regression test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // The join should only return 6 rows (matching k1 in {2,4,6,8,10,12}). | ||
| // If the IN_LIST predicate is incorrectly erased, the MINMAX scope [2,12] | ||
| // would let through rows with k1 in {3,5,7,9,11} as well, producing wrong results. | ||
| // We verify correctness by checking the result. | ||
| order_qt_join """ | ||
| SELECT p.k1, p.v1, b.k1, b.v1 | ||
| FROM rf_scope_probe p | ||
| INNER JOIN rf_scope_build b ON p.k1 = b.k1 |
There was a problem hiding this comment.
This regression currently uses an INNER JOIN, so the result set is determined by the join condition (p.k1=b.k1) even if the runtime filter predicates are missing/erased; that means the test likely won’t fail under the bug described. To make it sensitive to an over-permissive runtime filter (e.g., MINMAX scope without the IN_LIST), consider using a LEFT SEMI JOIN / EXISTS pattern and enabling enable_left_semi_direct_return_opt (as other runtime-filter correctness tests do), so false positives like k1=3,5,7,9,11 would surface in the output when IN_LIST is erased.
| // The join should only return 6 rows (matching k1 in {2,4,6,8,10,12}). | |
| // If the IN_LIST predicate is incorrectly erased, the MINMAX scope [2,12] | |
| // would let through rows with k1 in {3,5,7,9,11} as well, producing wrong results. | |
| // We verify correctness by checking the result. | |
| order_qt_join """ | |
| SELECT p.k1, p.v1, b.k1, b.v1 | |
| FROM rf_scope_probe p | |
| INNER JOIN rf_scope_build b ON p.k1 = b.k1 | |
| sql "set enable_left_semi_direct_return_opt = true;" | |
| // The join should only return 6 rows (matching k1 in {2,4,6,8,10,12}). | |
| // If the IN_LIST predicate is incorrectly erased, the MINMAX scope [2,12] | |
| // would let through rows with k1 in {3,5,7,9,11} as well, producing wrong results. | |
| // We verify correctness by checking the result. | |
| order_qt_join """ | |
| SELECT p.k1, p.v1 | |
| FROM rf_scope_probe p | |
| WHERE EXISTS ( | |
| SELECT 1 | |
| FROM rf_scope_build b | |
| WHERE p.k1 = b.k1 | |
| ) |
| // Use both IN and MIN_MAX runtime filter types so both are generated on the join key. | ||
| sql "set runtime_filter_type = 'IN_OR_BLOOM_FILTER,MIN_MAX';" | ||
| sql "set runtime_filter_wait_time_ms = 10000;" | ||
| sql "set runtime_filter_wait_infinitely = true;" |
There was a problem hiding this comment.
runtime_filter_wait_infinitely = true makes the effective wait time fall back to the query execution timeout (see RuntimeFilterConsumer), so the preceding runtime_filter_wait_time_ms = 10000 no longer bounds the wait and the test could stall much longer than intended if a runtime filter is not produced. Consider setting runtime_filter_wait_infinitely=false (or removing it) if the goal is to cap waits at 10s, or otherwise update the comments to reflect the longer wait behavior.
| sql "set runtime_filter_wait_infinitely = true;" | |
| // Cap runtime filter wait at 10s for this test instead of falling back to the query timeout. | |
| sql "set runtime_filter_wait_infinitely = false;" |
| fmt::memory_buffer buf; | ||
| for (const auto& [slot_id, predicates] : _slot_id_to_predicates) { | ||
| if (predicates.empty()) { | ||
| continue; | ||
| } | ||
| fmt::format_to(buf, "Slot ID: {}: [", slot_id); | ||
| for (const auto& predicate : predicates) { | ||
| fmt::format_to(buf, "{{{}}}, ", predicate->debug_string()); | ||
| } | ||
| fmt::format_to(buf, "] "); | ||
| } | ||
| custom_profile()->add_info_string("PushDownPredicates", fmt::to_string(buf)); |
There was a problem hiding this comment.
This block duplicates the formatting logic already implemented in ScanLocalState (see predicates_to_string in scan_operator.cpp). Consider factoring that helper into a shared header/utility so predicate formatting stays consistent across scan operators and avoids future drift when the format needs to change.
| fmt::memory_buffer buf; | |
| for (const auto& [slot_id, predicates] : _slot_id_to_predicates) { | |
| if (predicates.empty()) { | |
| continue; | |
| } | |
| fmt::format_to(buf, "Slot ID: {}: [", slot_id); | |
| for (const auto& predicate : predicates) { | |
| fmt::format_to(buf, "{{{}}}, ", predicate->debug_string()); | |
| } | |
| fmt::format_to(buf, "] "); | |
| } | |
| custom_profile()->add_info_string("PushDownPredicates", fmt::to_string(buf)); | |
| custom_profile()->add_info_string( | |
| "PushDownPredicates", predicates_to_string(_slot_id_to_predicates)); |
|
run buildall |
|
run buildall |
|
run buildall |
|
/review |
|
run buildall |
|
No issues found in review. I checked the scan-key/predicate rewrite, the runtime-filter end-to-end path, and the added regression coverage; the change looks correct and appropriately scoped. |
| // Comparison predicates (EQ/LT/LE/GT/GE): subsumed by both | ||
| // fixed value and scope ranges. | ||
| if (PredicateTypeTraits::is_comparison(pt)) { | ||
| return true; |
There was a problem hiding this comment.
line 1007 check should after the PredicateTypeTraits::is_comparison(pt)
| (pt == PredicateType::NOT_IN_LIST && opposite)) { | ||
| return is_fixed_value_range; | ||
| } | ||
| // Everything else (BF, BITMAP, NOT_IN_LIST, IS_NULL, etc.): keep. |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
This pull request addresses a bug in the OLAP scan operator where
IN_LISTpredicates could be incorrectly erased when bothMINMAXandINruntime filters targeted the same key column, and the number ofINvalues exceeded the maximum allowed for pushdown. The changes ensure thatIN_LISTpredicates are preserved in such cases, preventing incorrect query results. Additionally, a regression test is added to verify the fix.Bug fix in predicate handling:
Modified the logic in
_build_key_ranges_and_filters()withinolap_scan_operator.cppto ensure thatIN_LISTpredicates are not erased when the key range is a scope range (e.g.,>= X AND <= Y) and theINfilter's value count exceedsmax_pushdown_conditions_per_column. This preserves filtering semantics that are not captured by the scope range. [1] [2] [3]Enhanced the profiling output in
_process_conjuncts()to accurately reflect the set of predicates that will reach the storage layer after key range and filter construction. This helps with debugging and verification of predicate pushdown.Testing and regression coverage:
Added a new regression test
test_rf_in_list_not_erased_by_scope_range.groovyto verify thatIN_LISTpredicates are not incorrectly erased when bothMINMAXandINfilters are present and theINlist is too large to be absorbed into the key range.Added the corresponding expected output file for the new regression test.