Fix scan cache deadlock on DuckDB table scans + skip redundant binding on warm runs#476
ducndh wants to merge 1 commit into sirius-db:dev
Conversation
…warm runs

With `cache = "table_gpu"` in the config, the second identical gpu_execution query on DuckDB tables deadlocks. The preload path in `get_scan_output()` serves cached data through the scan task infrastructure, but DuckDB scan pipeline completion depends on `compute_task()` returning empty — which never happens in preload mode. The pipeline waits forever for scan completion.

Additionally, when the cache level is set via `SET scan_cache_level` mid-session, `set_scan_caching_config()` ran after `cache_scan_results_for_query()` in QueryBegin, so the scan executor's `_cache_level` was still NONE when the cache check ran. This delayed the deadlock from Run 2 to Run 3 by wasting one run that never populated the cache.

Fix: bypass the scan task infrastructure entirely on cache hits.

- Add `preload_into_pipelines()` to inject cached data directly into downstream pipeline repos and explicitly mark scan pipelines finished
- Skip DuckDB/parquet global state creation in preload mode (no `init_global`, no rebind, no footer reads)
- Use `.find()` instead of `.at()` for global state lookups so preloaded scans return `nullptr` instead of throwing
- Move `set_scan_caching_config()` before `cache_scan_results_for_query()` in QueryBegin
- Fix `get_scan_output()` preload fallback to return empty instead of throwing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abd929d to 4a5b1f0
joosthooz
left a comment
Thanks for this PR, it has some crucial fixes for running from duckdb data! I'm just unsure if we want to remove the task scheduling for cached duckdb table scan runs entirely. I think I would prefer the scheduling behavior to stay the same between parquet and duckdb scans, and between cached and non-cached runs, and keep the logic needed for handling cached runs in the scan code. Do you have an idea of what the trade-off would be for that?
```cpp
// In preload mode, all scans are served from cache — no tasks to schedule
if (_scan_executor->is_preload_mode()) {
    while (!_priority_scans.empty()) {
        _priority_scans.pop();
    }
}
```
I think I would prefer the scheduling to still happen in the same way between cached and cold runs, so without these changes to pipeline_executor.cpp, even if it complicates the logic in the scan classes slightly. What do you think?
```cpp
std::vector<op::sirius_physical_operator*> duckdb_scan_executor::preload_into_pipelines(
```
If we keep the scan task scheduling in place, this function would probably no longer be needed?
```cpp
if (type == scan_type::DUCKDB) {
    for (auto& batch : batches) {
        cloned.push_back(batch->clone(get_next_batch_id(), stream));
```
Is there a way for this scan type to use a shallow clone instead? If not, should we reserve memory for this clone?
@felipeblazing Does your pull request solve all of these problems already? I will close the pull request now if that is the case:
I think so, but I am having @kevkrist look into this and we can see what his determination is.
Although #513 does solve the issues stated in this PR, per offline conversation, #513 is missing some important pieces and does not support TABLE_GPU caching. Do you see a way forward layering #513 onto the changes introduced in this PR, so that we get the duckdb scan improvements there while also having genuine GPU caching? This also pertains to @joosthooz's comment above about keeping the scan scheduling intact, as #513 does this.
Reproduction
Start with a config file that has no cache setting, enable the cache via SET, then repeat the same query:
If the cache is set in the config file instead of via SET, the deadlock happens on Run 2 (Run 1 populates, Run 2 hangs).
Why tests never caught this
- `integration.cfg` has no cache setting (defaults to `none`)
- The benchmark scripts issue `SET scan_cache_level` before each query, so every query is the "first query after SET" — the deadlock only triggers when the same query runs a second time with an already-populated cache
- No test runs the same `gpu_execution` query twice on DuckDB tables with `table_gpu` cache

What goes wrong
Two bugs compound:
Bug 1 — Ordering in QueryBegin (Run 2 slow instead of fast)
In `SiriusContext::QueryBegin`, `set_scan_caching_config()` ran after `cache_scan_results_for_query()`:

When the cache level is changed via `SET scan_cache_level` mid-session, the scan executor's `_cache_level` hasn't been propagated yet when the cache check runs. So Run 1 never stores the query hash. On Run 2, the hash doesn't match (it was never stored) → the cache is cleared and repopulated instead of being reused. This wastes one full cold scan.

Bug 2 — Deadlock in preload mode (Run 3 hangs)
On Run 3 (or Run 2 if the cache is set via the config file), `_preload_mode = true` because the cache is non-empty. The old code tries to serve cached data through the scan task infrastructure, which doesn't work:

- `task_creator::prepare_for_query()` creates `duckdb_scan_task_global_state` → calls DuckDB `init_global`, acquires locks
- Scan tasks call `get_scan_output()`, which serves cached batches from `_cache`
- The `get_scan_output()` preload fallback throws `"Scan results for query not cached"` instead of signaling completion
- DuckDB scan pipeline completion is self-scheduling: `compute_task()` returns data and creates continuation tasks until it returns empty, which sets `exhausted = true`. In preload mode `compute_task()` is never called, so `exhausted` is never set.

Parquet scans don't deadlock because they use upfront partition counting (`has_more_partitions`) instead of the self-scheduling `exhausted` flag — but they still waste time re-reading parquet footer metadata (~25-60ms per table) on every warm run via `rebind()`.

Fix
Bypass the scan task infrastructure entirely when the cache is hot:
- `duckdb_scan_executor` — Add `preload_into_pipelines()`: walks the scan operators, clones cached batches, pushes them directly into downstream pipeline data repositories (found via `sink->get_next_port_after_sink()`), and explicitly marks each scan pipeline as finished (`exhausted = true` for DuckDB scans, `has_more_partitions = false` for parquet scans). No scan tasks are created. Extract `clone_cached_batches()` (handles TABLE_GPU zero-copy vs. deep clone vs. parquet `shallow_clone`) and `mark_scan_pipeline_finished()` as helpers. Fix the `get_scan_output()` preload fallback to return empty data instead of throwing.
- `pipeline_executor` — Call `preload_into_pipelines()` during `prepare_for_query()`. If in preload mode, clear the priority scan queue (no scan tasks to schedule). In `start_query()`, schedule the downstream consumers returned by preload so GPU pipeline tasks begin immediately.
- `task_creator` — In preload mode, skip `duckdb_scan_task_global_state` creation (avoids `init_global` and DuckDB lock acquisition) and skip parquet `rebind()` (avoids re-reading parquet file footers). Change `.at()` to `.find()` for both DuckDB and parquet global state lookups so that preloaded scans gracefully return `nullptr` (no more tasks) instead of throwing.
- `sirius_context` — Move `set_scan_caching_config()` before `cache_scan_results_for_query()` so the scan executor has the correct cache level before deciding whether to cache or preload.

After fix:
TPC-H SF10 warm run benchmark (parquet views)
The benchmark script issues
`SET scan_cache_level` before each query group, which avoids the deadlock on old code. But the preload fix still saves time by skipping `init_global` binding and scan task thread pool overhead on every warm run. The exact savings depend on the number of scan pipelines (joins have more) and data source (parquet binding is more expensive than DuckDB table binding).

Most queries save 10-20ms. Q6 and Q8 show noise (single-table scans where binding overhead is minimal and measurement variance dominates).