
Fix scan cache deadlock on DuckDB table scans + skip redundant binding on warm runs#476

Open
ducndh wants to merge 1 commit into sirius-db:dev from ducndh:fix/scan-cache-deadlock

Conversation


@ducndh ducndh commented Mar 17, 2026

Reproduction

Start with a config file that has no cache setting, enable via SET, then repeat the same query:

INSTALL tpch; LOAD tpch;
CALL dbgen(sf=1);

SET scan_cache_level = 'table_gpu';

.timer on
-- Run 1 (cold — scans from DuckDB, populates cache)
call gpu_execution('select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty
  from lineitem where l_shipdate <= date ''1995-08-19''
  group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus');

-- Run 2 (should be cached at ~0.06s, but rescans at ~0.2s)
call gpu_execution('select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty
  from lineitem where l_shipdate <= date ''1995-08-19''
  group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus');

-- Run 3: HANGS — must kill process
call gpu_execution('select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty
  from lineitem where l_shipdate <= date ''1995-08-19''
  group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus');
Run 1: 0.58s  — cold scan
Run 2: 0.21s  — cache miss, rescans from DuckDB (should be ~0.06s)
Run 3: HANG   — deadlock, process must be killed

If the cache is set in the config file instead of via SET, the deadlock happens on Run 2 (Run 1 populates, Run 2 hangs).

Why tests never caught this

  • integration.cfg has no cache setting (defaults to none)
  • The benchmark script issues SET scan_cache_level before each query, so every query is the "first query after SET" — the deadlock only triggers when the same query runs a second time with an already-populated cache
  • No test runs the same gpu_execution query twice on DuckDB tables with table_gpu cache

What goes wrong

Two bugs compound:

Bug 1 — Ordering in QueryBegin (Run 2 slow instead of fast)

In SiriusContext::QueryBegin, set_scan_caching_config() ran after cache_scan_results_for_query():

// BEFORE
cache_scan_results_for_query(query);                   // _cache_level is still NONE → bails, hash not stored
set_scan_caching_config(config_.get_cache_level());    // too late — sets _cache_level to TABLE_GPU

When cache level is changed via SET scan_cache_level mid-session, the scan executor's _cache_level hasn't been propagated yet when the cache check runs. So Run 1 never stores the query hash. On Run 2, the hash doesn't match (was never stored) → cache is cleared and repopulated instead of being reused. This wastes one full cold scan.

Bug 2 — Deadlock in preload mode (Run 3 hangs)

On Run 3 (or Run 2 if cache is set via config file), _preload_mode = true because the cache is non-empty. The old code tries to serve cached data through the scan task infrastructure, which doesn't work:

  1. task_creator::prepare_for_query() creates duckdb_scan_task_global_state → calls DuckDB init_global, acquires locks
  2. Scan tasks are scheduled through the kiosk and thread pool as normal
  3. Each task calls get_scan_output() which serves cached batches from _cache
  4. When all cached batches are exhausted, get_scan_output() throws "Scan results for query not cached" instead of signaling completion
  5. DuckDB scans use a self-scheduling model: compute_task() returns data and creates continuation tasks until it returns empty, which sets exhausted = true. In preload mode compute_task() is never called, so exhausted is never set.
  6. The pipeline waits for scan completion that never arrives → deadlock

Parquet scans don't deadlock because they use upfront partition counting (has_more_partitions) instead of the self-scheduling exhausted flag — but they still waste time re-reading parquet footer metadata (~25-60ms per table) on every warm run via rebind().

Fix

Bypass the scan task infrastructure entirely when the cache is hot:

duckdb_scan_executor — Add preload_into_pipelines(): walks the scan operators, clones cached batches, pushes them directly into downstream pipeline data repositories (found via sink->get_next_port_after_sink()), and explicitly marks each scan pipeline as finished (exhausted = true for DuckDB scans, has_more_partitions = false for parquet scans). No scan tasks are created. Extract clone_cached_batches() (handles TABLE_GPU zero-copy vs deep clone vs parquet shallow_clone) and mark_scan_pipeline_finished() as helpers. Fix the get_scan_output() preload fallback to return empty data instead of throwing.

pipeline_executor — Call preload_into_pipelines() during prepare_for_query(). If in preload mode, clear the priority scan queue (no scan tasks to schedule). In start_query(), schedule the downstream consumers returned by preload so GPU pipeline tasks begin immediately.

task_creator — In preload mode, skip duckdb_scan_task_global_state creation (avoids init_global and DuckDB lock acquisition) and skip parquet rebind() (avoids re-reading parquet file footers). Change .at() to .find() for both DuckDB and parquet global state lookups so that preloaded scans gracefully return nullptr (no more tasks) instead of throwing.

sirius_context — Move set_scan_caching_config() before cache_scan_results_for_query() so the scan executor has the correct cache level before deciding whether to cache or preload.

After fix:

Run 1: 0.67s  — cold scan, populates cache
Run 2: 0.06s  — preload, no deadlock, no binding overhead
Run 3: 0.06s  — stable

TPC-H SF10 warm run benchmark (parquet views)

The benchmark script issues SET scan_cache_level before each query group, which avoids the deadlock on old code. But the preload fix still saves time by skipping init_global binding and scan task thread pool overhead on every warm run. The exact savings depend on the number of scan pipelines (joins have more) and data source (parquet binding is more expensive than DuckDB table binding).

| Query | Old warm | New warm | Saved |
|-------|----------|----------|-------|
| Q1  | 0.102s | 0.101s | -1ms  |
| Q2  | 0.081s | 0.061s | -20ms |
| Q3  | 0.061s | 0.051s | -10ms |
| Q4  | 0.061s | 0.050s | -11ms |
| Q5  | 0.072s | 0.061s | -11ms |
| Q6  | 0.041s | 0.051s | +10ms |
| Q7  | 0.091s | 0.091s | 0ms   |
| Q8  | 0.071s | 0.081s | +10ms |
| Q9  | 0.132s | 0.112s | -20ms |
| Q10 | 0.102s | 0.081s | -21ms |
| Q11 | 0.051s | 0.040s | -11ms |
| Q12 | 0.061s | 0.051s | -10ms |
| Q13 | 0.111s | 0.102s | -9ms  |
| Q14 | 0.040s | 0.031s | -9ms  |
| Q15 | 0.051s | 0.051s | 0ms   |
| Q16 | 0.091s | 0.071s | -20ms |
| Q17 | 0.101s | 0.082s | -19ms |
| Q18 | 0.132s | 0.111s | -21ms |
| Q19 | 0.092s | 0.070s | -22ms |
| Q20 | 0.080s | 0.059s | -21ms |
| Q21 | 0.182s | 0.173s | -9ms  |
| Q22 | 0.051s | 0.040s | -11ms |

Most queries save 10-20ms. Q6 and Q8 show noise (single-table scans where binding overhead is minimal and measurement variance dominates).

…warm runs

With `cache = "table_gpu"` in the config, the second identical
gpu_execution query on DuckDB tables deadlocks. The preload path in
get_scan_output() serves cached data through the scan task
infrastructure, but DuckDB scan pipeline completion depends on
compute_task() returning empty — which never happens in preload mode.
The pipeline waits forever for scan completion.

Additionally, when cache level is set via SET scan_cache_level mid-
session, set_scan_caching_config() ran after cache_scan_results_for_query()
in QueryBegin, so the scan executor's _cache_level was still NONE when
the cache check ran. This delayed the deadlock from Run 2 to Run 3 by
wasting one run that never populated the cache.

Fix: bypass the scan task infrastructure entirely on cache hits.
- Add preload_into_pipelines() to inject cached data directly into
  downstream pipeline repos and explicitly mark scan pipelines finished
- Skip DuckDB/parquet global state creation in preload mode (no
  init_global, no rebind, no footer reads)
- Use .find() instead of .at() for global state lookups so preloaded
  scans return nullptr instead of throwing
- Move set_scan_caching_config() before cache_scan_results_for_query()
  in QueryBegin
- Fix get_scan_output() preload fallback to return empty instead of
  throwing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ducndh ducndh force-pushed the fix/scan-cache-deadlock branch from abd929d to 4a5b1f0 on March 22, 2026 at 06:56

@joosthooz joosthooz left a comment


Thanks for this PR, it has some crucial fixes for running from duckdb data! I'm just unsure if we want to remove the task scheduling for cached duckdb table scan runs entirely. I think I would prefer the scheduling behavior to stay the same between parquet and duckdb scans, and between cached and non-cached runs, and keep the logic needed for handling cached runs in the scan code. Do you have an idea of what the trade-off would be for that?

Comment on lines +155 to +160
// In preload mode, all scans are served from cache — no tasks to schedule
if (_scan_executor->is_preload_mode()) {
while (!_priority_scans.empty()) {
_priority_scans.pop();
}
}

I think I would prefer the scheduling to still happen in the same way between cached and cold runs, so without these changes to pipeline_executor.cpp, even if it complicates the logic in the scan classes slightly. What do you think?

}
}

std::vector<op::sirius_physical_operator*> duckdb_scan_executor::preload_into_pipelines(

If we keep the scan task scheduling in place, this function would probably no longer be needed?


if (type == scan_type::DUCKDB) {
for (auto& batch : batches) {
cloned.push_back(batch->clone(get_next_batch_id(), stream));

Is there a way for this scan type to use a shallow clone instead? If not, should we reserve memory for this clone?


ducndh commented Mar 24, 2026

@felipeblazing Does your pull request solve all of these problems already? I will close the pull request now if that is the case: #513

@felipeblazing

> @felipeblazing Does your pull request solve all of these problems already? I will close the pull request now if that is the case: #513

I think so but I am having @kevkrist look into this and we can see what his determination is


kevkrist commented Apr 3, 2026

Although #513 does solve the issues stated in this PR, per offline conversation, #513 is missing some important pieces, and does not support TABLE_GPU caching. Do you see a way forward layering #513 onto the changes introduced in this PR so that we get the duckdb scan improvements there while also having genuine GPU caching? This also pertains to @joosthooz comment above about keeping the scan scheduling intact, as #513 does this.