feat: Transparent GPU execution via optimizer extension #518
mbrobbel wants to merge 17 commits into sirius-db:dev
Users can now run plain SQL that automatically executes on the GPU,
eliminating the need to wrap queries in CALL gpu_execution('...').
The implementation uses two DuckDB extension hooks:
- OptimizerExtension (post-optimization): captures a copy of the
optimized logical plan when the query uses only GPU-supported operators
- OnFinalizePrepare: generates a Sirius physical plan from the captured
copy and replaces DuckDB's CPU physical plan with a custom
PhysicalSiriusExecution source operator
Queries with unsupported operators (WINDOW, UNNEST, etc.) silently fall
back to CPU execution. Controlled by SET sirius_transparent_execution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three key fixes to make transparent execution work end-to-end:
1. Override CanRequestRebind() to return true. DuckDB only calls OnFinalizePrepare when at least one registered state can request rebind; without this, our hook was never invoked.
2. Add pre_optimize_function to disable the IN_CLAUSE and COMPRESSED_MATERIALIZATION optimizers before DuckDB's built-in optimizers run. These produce internal functions Sirius can't handle. The post-optimize hook re-enables them to avoid leaking state.
3. Return a real LocalSourceState instead of nullptr to avoid a null pointer dereference in DuckDB's PipelineExecutor.
Also updates tests/scripts to use transparent execution:
- test_gpu_execution_tpch.cpp: uses SET sirius_transparent_execution instead of the CALL gpu_execution() wrapper
- run_tpch_parquet.sh: both engines use the orig/ query directory
- run_tpcds_super.sh: plain SQL instead of the gpu_execution() wrapper
- performance_test.py: SET-based toggle instead of a wrapping function
Validated on RTX PRO 6000: filter, projection, aggregation, and ORDER BY queries all execute on GPU transparently.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sorted comparison pass was re-running the query wrapped in a subquery (SELECT * FROM (...) t ORDER BY), which itself went through transparent GPU execution and could fail for complex plans. Instead, collect rows from the already-materialized GPU and CPU results, sort them in C++, and compare directly. This avoids re-running queries entirely and is also simpler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
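The in-memory comparison described in this commit could look roughly like the sketch below. It is not the test helper from the PR: the `Row` alias and the function name `same_rows_ignoring_order` are hypothetical, and it assumes every cell has already been rendered to a string.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical row representation: each cell already rendered as a string.
using Row = std::vector<std::string>;

// Returns true when both result sets hold the same rows, ignoring row order.
// Inputs are taken by value so the caller's materialized results stay untouched.
bool same_rows_ignoring_order(std::vector<Row> gpu_rows, std::vector<Row> cpu_rows) {
  if (gpu_rows.size() != cpu_rows.size()) {
    return false;
  }
  // std::vector<std::string> compares lexicographically, so a plain sort
  // puts both sides into the same canonical order without re-running a query.
  std::sort(gpu_rows.begin(), gpu_rows.end());
  std::sort(cpu_rows.begin(), cpu_rows.end());
  return gpu_rows == cpu_rows;
}
```

Sorting string-rendered rows is only a canonicalization step; it does not reproduce SQL ORDER BY semantics, which is fine when the goal is just an order-insensitive equality check.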
Convert all remaining inline CALL gpu_execution() and SELECT * FROM gpu_execution() usages to use compare_gpu_vs_cpu(), which uses transparent execution. This includes:
- order by multipartition parquet
- order by with decimal column (duckdb + parquet)
- order by with varchar column (duckdb + parquet)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pre-optimization plan shape can differ from the post-optimization shape (e.g. subqueries get flattened), so is_acceleratable_query() may not match the same queries in both hooks. Unconditionally disable IN_CLAUSE and COMPRESSED_MATERIALIZATION in the pre-hook; the post-hook re-enables them regardless. This fixes TPC-H Q2 which has subqueries that transform during optimization. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
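A minimal sketch of the pre-hook/post-hook pairing follows. It is a variation rather than the PR's code: the commit re-enables the optimizers unconditionally, while this version remembers which entries the pre-hook actually inserted so that an optimizer the user had already disabled stays disabled. `OptimizerType` and `DisabledSet` are stand-ins for the DuckDB types, and both function names are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Stand-in for duckdb::OptimizerType; only a few members are listed.
enum class OptimizerType { IN_CLAUSE, COMPRESSED_MATERIALIZATION, FILTER_PUSHDOWN };

// Stand-in for the disabled_optimizers set held in the database config.
using DisabledSet = std::set<OptimizerType>;

// Pre-optimize hook: disable both optimizers regardless of plan shape and
// report which ones this call newly inserted.
std::vector<OptimizerType> pre_optimize_disable(DisabledSet& disabled) {
  std::vector<OptimizerType> added;
  for (auto type : {OptimizerType::IN_CLAUSE, OptimizerType::COMPRESSED_MATERIALIZATION}) {
    // insert() returns {iterator, bool}; the bool is true only for new entries.
    if (disabled.insert(type).second) {
      added.push_back(type);
    }
  }
  return added;
}

// Post-optimize hook: undo only what the pre-hook added.
void post_optimize_restore(DisabledSet& disabled, const std::vector<OptimizerType>& added) {
  for (auto type : added) {
    disabled.erase(type);
  }
}
```

Whether tracking the inserted entries is worth the extra state depends on how the real config is shared between connections, which this sketch does not model.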
# Conflicts:
#	src/sirius_context.cpp
…urce of truth
The duplicated operator allow-list in is_acceleratable_query() was fragile: it could easily get out of sync with create_plan(). Instead, always Copy() the plan in the optimizer hook and let create_plan() in OnFinalizePrepare determine GPU support. If create_plan() throws NotImplementedException, we silently fall back to CPU.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
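The "let the planner decide" pattern from this commit can be sketched as below. Everything here is a stand-in: `NotImplementedError` models duckdb::NotImplementedException, `Plan` is a hypothetical handle, and `finalize_prepare` is an illustrative name, not the actual hook signature.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Stand-in for duckdb::NotImplementedException.
struct NotImplementedError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical plan handle; a real plan would be an operator tree.
struct Plan {
  std::string engine;
};

// Try the GPU planner first. If it reports an unsupported operator,
// keep the CPU plan that was already built.
Plan finalize_prepare(const std::function<Plan()>& create_gpu_plan, Plan cpu_plan) {
  try {
    return create_gpu_plan();
  } catch (const NotImplementedError&) {
    // Silent fallback; any other exception type still propagates to the caller.
    return cpu_plan;
  }
}
```

Catching only the "not implemented" type keeps genuine planner bugs visible instead of hiding them behind the CPU path.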
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- README: Update intro to describe transparent execution as the primary usage mode, add usage example with plain SQL
- execution-flow: Add Step 1 (optimizer extension hooks + OnFinalizePrepare) and Step 2 (PhysicalSiriusExecution), move explicit gpu_execution() path to Step 1b (legacy)
- configuration: Add sirius_transparent_execution SET variable
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    // Disable optimizers that produce DuckDB-internal functions Sirius can't handle.
    // The post-hook re-enables them so non-GPU queries aren't affected.
    auto& disabled = duckdb::DBConfig::GetConfig(context).options.disabled_optimizers;
    disabled.insert(duckdb::OptimizerType::IN_CLAUSE);
Why are IN_CLAUSE and COMPRESSED_MATERIALIZATION treated separately from the ones in the config? Or are those disabled optimizers separate from the ones we are disabling for DuckDB?
So this is a DuckDB-specific optimization that doesn't apply to Sirius. E.g., Compressed Materialization stores intermediate results in a compressed DuckDB format.
    // Disable fallback so GPU errors are not silently hidden
    con->Query("SET enable_duckdb_fallback = false;");
    // Enable transparent GPU execution
    con->Query("SET sirius_transparent_execution = true;");
Do we not still need to disable the DuckDB fallback so it doesn't fall back to DuckDB on failure?
I think we can enable the fallback again now that GTC has passed?
    if (float_tolerance.has_value()) {
      // Check if this looks like a float value for tolerance comparison
      try {
        double gpu_d = std::stod(gpu_rows[r][c]);
What if it's a decimal value? We want to compare decimals exactly. Could we not do something like use the DuckDB-to-cuDF converter and then compare tables?
It should be checking the types.
Okay, this looks good to me based on what I see. One thing, I guess: let's name it set
One other thing we should maybe start taking note of: I believe the config that we modify here is global and affects all connections. For example, if I set
# Conflicts:
#	src/sirius_extension.cpp
Addresses PR sirius-db#518 review feedback to use a shorter, clearer name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ison
Check the result's column LogicalType to determine if float tolerance should be applied, rather than attempting stod() on every value. Decimals and other numeric types are now compared exactly via string equality.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
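A type-driven cell comparison along the lines of this commit might look like the following sketch. `ColumnType` stands in for the relevant subset of DuckDB's LogicalType ids, `cells_match` is a hypothetical name, and the relative-tolerance formula is an assumption, not necessarily what the test helper uses.

```cpp
#include <cassert>
#include <cmath>
#include <exception>
#include <string>

// Stand-in for the relevant subset of DuckDB's LogicalType ids.
enum class ColumnType { FLOAT, DOUBLE, DECIMAL, INTEGER, VARCHAR };

bool is_floating(ColumnType type) {
  return type == ColumnType::FLOAT || type == ColumnType::DOUBLE;
}

// Compare one cell of the GPU and CPU results, both rendered as strings.
bool cells_match(const std::string& gpu, const std::string& cpu, ColumnType type,
                 double tolerance) {
  if (gpu == cpu) {
    return true;  // identical text matches for every type, including NULL markers
  }
  if (!is_floating(type)) {
    return false;  // decimals, integers, and strings must match exactly
  }
  try {
    const double g = std::stod(gpu);
    const double c = std::stod(cpu);
    // Assumed policy: tolerance relative to the CPU value, with a floor of 1.0.
    return std::fabs(g - c) <= tolerance * std::fmax(1.0, std::fabs(c));
  } catch (const std::exception&) {
    return false;  // one side is not a parseable number (e.g. NULL vs 1.0)
  }
}
```

Deciding by column type rather than by whether `std::stod` happens to succeed is what keeps a DECIMAL value such as `1.10` from being treated as approximately equal to `1.1`.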
- Override GetDataInternal instead of GetData (GetData is no longer virtual in PhysicalOperator)
- Use OptimizerExtension::Register() instead of directly pushing to config.optimizer_extensions (field moved to ExtensionCallbackManager)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Decouple sirius_pipeline, sirius_meta_pipeline, and sirius_pipeline_converter from sirius_engine so that pipeline construction can happen at plan/bind time without an engine instance.
Introduce pipeline_build_context, a lightweight struct carrying only the plan-time parameters that pipeline construction needs (currently just preserve_insertion_order). This replaces the sirius_engine& reference that was previously threaded through the entire pipeline build chain.
Key changes:
- sirius_pipeline: constructor takes pipeline_build_context& instead of sirius_engine&. Runtime ClientContext set via set_client_context() before execution.
- sirius_meta_pipeline: takes pipeline_build_context& instead of sirius_engine&. get_engine() replaced with get_build_context().
- sirius_pipeline_converter: takes pipeline_build_context& for plan-time work. construct_sirius_specific_operator extracted as a free function. wire_data_repositories takes sirius_engine& as parameter (only remaining runtime dependency).
- sirius_engine::create_child_pipeline removed (logic inlined into sirius_pipeline_build_state::create_child_pipeline).
- sirius_engine::initialize_internal creates the pipeline_build_context and sets client_context on all pipelines after conversion.
This is Phase 1 of moving planning from initialize_internal() to the optimizer stage (issue sirius-db#545, PRs sirius-db#518, sirius-db#529).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
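The split between plan-time and run-time state described in this commit can be illustrated with the sketch below. Only `pipeline_build_context`, `preserve_insertion_order`, and `set_client_context()` come from the commit message; `client_context_stub`, `pipeline_sketch`, and `ready_to_execute()` are hypothetical, and the sketch copies the build context where the real constructor takes a reference.

```cpp
#include <cassert>

// Plan-time parameters only; per the commit it currently carries one field.
struct pipeline_build_context {
  bool preserve_insertion_order = true;
};

// Stand-in for duckdb::ClientContext, which only exists at execution time.
struct client_context_stub {};

// Hypothetical pipeline: built from plan-time data, bound to a client later.
class pipeline_sketch {
 public:
  explicit pipeline_sketch(const pipeline_build_context& build_ctx)
      : build_ctx_(build_ctx) {}  // copied so no engine or context must outlive it

  // Runtime wiring happens after construction and before execution.
  void set_client_context(client_context_stub& ctx) { client_ctx_ = &ctx; }

  bool preserve_insertion_order() const { return build_ctx_.preserve_insertion_order; }
  bool ready_to_execute() const { return client_ctx_ != nullptr; }

 private:
  pipeline_build_context build_ctx_;
  client_context_stub* client_ctx_ = nullptr;
};
```

With this shape a pipeline can be constructed during planning and stays inert until the runtime context is attached.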
…te multi-format tests
The transparent execution optimizer pre-hook was missing STATISTICS_PROPAGATION in the disabled optimizer set. This optimizer can fold ungrouped aggregates into EXPRESSION_GET + DUMMY_SCAN plans with COLUMN_DATA_SCAN sources that the GPU pipeline cannot schedule, causing the test suite to hang.
Also update test_gpu_execution_multi_format.cpp to use the transparent execution pattern (SET gpu_execution = true/false, in-memory result comparison) instead of the old CALL gpu_execution() + re-query approach. With ENABLE_GPU_EXECUTION defaulting to true, the old "CPU baseline" queries were silently routing through the transparent GPU path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
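- Users can now run plain SQL that automatically executes on the GPU, eliminating the need to wrap queries in CALL gpu_execution('...')
- OptimizerExtension (post-optimization): captures a Copy() of the optimized logical plan when the query uses only GPU-supported operators
- OnFinalizePrepare: generates a Sirius physical plan from the captured copy and replaces DuckDB's CPU physical plan with a custom PhysicalSiriusExecution source operator
- Controlled by SET sirius_transparent_execution = true/false (default: true)
Usage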
New files
- src/include/transparent/sirius_optimizer_extension.hpp + .cpp: … is_acceleratable_query()
- src/include/transparent/physical_sirius_execution.hpp + .cpp: PhysicalOperator (EXTENSION type) wrapping Sirius GPU engine
- test/cpp/integration/test_transparent_execution.cpp
Test plan
- sirius_unittest "[transparent]" …
- … gpu_execution() wrapper)
- SET sirius_transparent_execution = false disables GPU interception
- … gpu_execution() tests still pass
🤖 Generated with Claude Code