Fix scan cache deadlock on DuckDB table scans + skip redundant binding on warm runs#476
ducndh wants to merge 1 commit into sirius-db:dev
Conversation
…warm runs

With `cache = "table_gpu"` in the config, the second identical gpu_execution query on DuckDB tables deadlocks. The preload path in `get_scan_output()` serves cached data through the scan task infrastructure, but DuckDB scan pipeline completion depends on `compute_task()` returning empty — which never happens in preload mode. The pipeline waits forever for scan completion.

Additionally, when the cache level is set via `SET scan_cache_level` mid-session, `set_scan_caching_config()` ran after `cache_scan_results_for_query()` in QueryBegin, so the scan executor's `_cache_level` was still NONE when the cache check ran. This delayed the deadlock from Run 2 to Run 3 by wasting one run that never populated the cache.

Fix: bypass the scan task infrastructure entirely on cache hits.

- Add `preload_into_pipelines()` to inject cached data directly into downstream pipeline repos and explicitly mark scan pipelines finished
- Skip DuckDB/parquet global state creation in preload mode (no `init_global`, no rebind, no footer reads)
- Use `.find()` instead of `.at()` for global state lookups so preloaded scans return `nullptr` instead of throwing
- Move `set_scan_caching_config()` before `cache_scan_results_for_query()` in QueryBegin
- Fix `get_scan_output()` preload fallback to return empty instead of throwing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abd929d to 4a5b1f0
joosthooz
left a comment
Thanks for this PR, it has some crucial fixes for running from duckdb data! I'm just unsure if we want to remove the task scheduling for cached duckdb table scan runs entirely. I think I would prefer the scheduling behavior to stay the same between parquet and duckdb scans, and between cached and non-cached runs, and keep the logic needed for handling cached runs in the scan code. Do you have an idea of what the trade-off would be for that?
```cpp
// In preload mode, all scans are served from cache — no tasks to schedule
if (_scan_executor->is_preload_mode()) {
    while (!_priority_scans.empty()) {
        _priority_scans.pop();
    }
}
```
I think I would prefer the scheduling to still happen in the same way between cached and cold runs, so without these changes to pipeline_executor.cpp, even if it complicates the logic in the scan classes slightly. What do you think?
```cpp
std::vector<op::sirius_physical_operator*> duckdb_scan_executor::preload_into_pipelines(
```
If we keep the scan task scheduling in place, this function would probably no longer be needed?
```cpp
if (type == scan_type::DUCKDB) {
    for (auto& batch : batches) {
        cloned.push_back(batch->clone(get_next_batch_id(), stream));
```
Is there a way for this scan type to use a shallow clone instead? If not, should we reserve memory for this clone?
@felipeblazing Does your pull request solve all of these problems already? I will close the pull request now if that is the case:
I think so, but I am having @kevkrist look into this and we can see what his determination is.
Although #513 does solve the issues stated in this PR, per offline conversation, #513 is missing some important pieces and does not support TABLE_GPU caching. Do you see a way forward layering #513 onto the changes introduced in this PR, so that we get the duckdb scan improvements there while also having genuine GPU caching? This also pertains to @joosthooz's comment above about keeping the scan scheduling intact, as #513 does this.
Reproduction
Start with a config file that has no cache setting, enable the cache via SET, then repeat the same query:
If the cache is set in the config file instead of via SET, the deadlock happens on Run 2 (Run 1 populates, Run 2 hangs).
Why tests never caught this
- `integration.cfg` has no cache setting (defaults to `none`)
- The benchmark scripts issue `SET scan_cache_level` before each query, so every query is the "first query after SET" — the deadlock only triggers when the same query runs a second time with an already-populated cache
- No test runs the same `gpu_execution` query twice on DuckDB tables with `table_gpu` cache

What goes wrong
Two bugs compound:
Bug 1 — Ordering in QueryBegin (Run 2 slow instead of fast)
In `SiriusContext::QueryBegin`, `set_scan_caching_config()` ran after `cache_scan_results_for_query()`:

When the cache level is changed via `SET scan_cache_level` mid-session, the scan executor's `_cache_level` hasn't been propagated yet when the cache check runs. So Run 1 never stores the query hash. On Run 2, the hash doesn't match (it was never stored) → the cache is cleared and repopulated instead of being reused. This wastes one full cold scan.

Bug 2 — Deadlock in preload mode (Run 3 hangs)
On Run 3 (or Run 2 if the cache is set via the config file), `_preload_mode = true` because the cache is non-empty. The old code tries to serve cached data through the scan task infrastructure, which doesn't work:

- `task_creator::prepare_for_query()` creates `duckdb_scan_task_global_state` → calls DuckDB `init_global`, acquires locks
- Scan tasks call `get_scan_output()`, which serves cached batches from `_cache`
- The `get_scan_output()` preload fallback throws `"Scan results for query not cached"` instead of signaling completion
- DuckDB scan pipeline completion is self-scheduling: `compute_task()` returns data and creates continuation tasks until it returns empty, which sets `exhausted = true`. In preload mode `compute_task()` is never called, so `exhausted` is never set.

Parquet scans don't deadlock because they use upfront partition counting (`has_more_partitions`) instead of the self-scheduling `exhausted` flag — but they still waste time re-reading parquet footer metadata (~25-60ms per table) on every warm run via `rebind()`.

Fix
Bypass the scan task infrastructure entirely when the cache is hot:
- `duckdb_scan_executor` — Add `preload_into_pipelines()`: walks the scan operators, clones cached batches, pushes them directly into downstream pipeline data repositories (found via `sink->get_next_port_after_sink()`), and explicitly marks each scan pipeline as finished (`exhausted = true` for DuckDB scans, `has_more_partitions = false` for parquet scans). No scan tasks are created. Extract `clone_cached_batches()` (handles TABLE_GPU zero-copy vs. deep clone vs. parquet `shallow_clone`) and `mark_scan_pipeline_finished()` as helpers. Fix the `get_scan_output()` preload fallback to return empty data instead of throwing.
- `pipeline_executor` — Call `preload_into_pipelines()` during `prepare_for_query()`. If in preload mode, clear the priority scan queue (no scan tasks to schedule). In `start_query()`, schedule the downstream consumers returned by preload so GPU pipeline tasks begin immediately.
- `task_creator` — In preload mode, skip `duckdb_scan_task_global_state` creation (avoids `init_global` and DuckDB lock acquisition) and skip parquet `rebind()` (avoids re-reading parquet file footers). Change `.at()` to `.find()` for both DuckDB and parquet global state lookups so that preloaded scans gracefully return `nullptr` (no more tasks) instead of throwing.
- `sirius_context` — Move `set_scan_caching_config()` before `cache_scan_results_for_query()` so the scan executor has the correct cache level before deciding whether to cache or preload.

After fix:
TPC-H SF10 warm run benchmark (parquet views)
The benchmark script issues
`SET scan_cache_level` before each query group, which avoids the deadlock on old code. But the preload fix still saves time by skipping `init_global` binding and scan task thread pool overhead on every warm run. The exact savings depend on the number of scan pipelines (joins have more) and data source (parquet binding is more expensive than DuckDB table binding).

Most queries save 10-20ms. Q6 and Q8 show noise (single-table scans where binding overhead is minimal and measurement variance dominates).