Skip to content

Feature/disk fallback wiring#575

Open
felipeblazing wants to merge 12 commits intosirius-db:devfrom
felipeblazing:feature/disk-fallback-wiring
Open

Feature/disk fallback wiring#575
felipeblazing wants to merge 12 commits intosirius-db:devfrom
felipeblazing:feature/disk-fallback-wiring

Conversation

@felipeblazing
Copy link
Copy Markdown
Collaborator

No description provided.

felipeblazing and others added 10 commits April 2, 2026 20:00
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add test_downgrade_disk.cpp with 3 test cases: DISK fallback, HOST preference, dual-fail throw
- Register test file in CMakeLists.txt TEST_SOURCES list
- Tests use make_test_memory_manager_with_disk helper with set_disk_mounting_point
- Add disk_data_representation.hpp include (DG-02 prerequisite)
- Replace single-tier HOST reservation with any_memory_space_in_tiers{HOST, DISK} (DG-01)
- Branch on reservation tier: disk_data_representation for DISK, host_data_representation for HOST (DG-02)
- Log INFO on disk fallback with batch ID and data size (DG-03)
…raints

- Pre-exhaust HOST capacity to force graceful null return from make_reservation_or_null
- Use set_reservation_limit_per_host(0) for throw-on-reserve test case
- fixed_size_host_memory_resource::reserve throws when bytes > limit (not returns null)
  so pre-exhaustion avoids the throw path while still validating DISK fallback
- Add test/cpp/pipeline/test_gpu_pipeline_disk_readback.cpp with two
  [gpu_pipeline_disk] test cases proving DISK->GPU read-back works via
  the converter registry (no changes needed to gpu_pipeline_task.cpp:
  the existing Tier::GPU arm already handles disk-resident batches via
  typeid dispatch to the pipeline backend converter from Phase 1)
- Register test file in CMakeLists.txt TEST_SOURCES
- Update cucascade submodule pointer to feature/file-downgrade (0eeed86)
  which includes disk tier support (PR sirius-db#96)
- Add 02-02-SUMMARY.md: DISK->GPU read-back via existing Tier::GPU arm
  (no gpu_pipeline_task.cpp changes needed), 23 assertions passing
- Update STATE.md: 100% progress, decisions, metrics
- Update ROADMAP.md: Phase 02 complete (2/2 plans, 2/2 summaries)
- Mark requirements RB-01, RB-02 complete in REQUIREMENTS.md
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds disk-tier fallback wiring for Sirius’ downgrade path and validates DISK↔GPU round-trips via new C++ tests, along with config key renames for disk settings.

Changes:

  • Update downgrade_task to request HOST first and fall back to DISK, converting to the appropriate representation and logging disk spill events.
  • Add new C++ test coverage for downgrade-to-disk behavior and DISK→GPU readback via converter registry dispatch.
  • Rename disk config keys in .cfg files and map them in the config registrar.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/downgrade/downgrade_task.cpp Implements HOST→DISK reservation fallback and DISK conversion path with INFO logging.
src/sirius_config.cpp Renames disk config keys exposed via the config system.
test/cpp/downgrade/test_downgrade_disk.cpp Adds tests asserting downgrade chooses DISK when HOST is exhausted (and HOST when available).
test/cpp/pipeline/test_gpu_pipeline_disk_readback.cpp Adds DISK→GPU round-trip and integrity tests via converter registry.
test/cpp/config/data/minimal.cfg Updates disk config keys to disk_capacity / disk_mount_path.
test/cpp/config/data/configurator.cfg Updates disk config keys to disk_capacity / disk_mount_path.
CMakeLists.txt Wires the new test sources into the test build.
.planning/* Adds planning/roadmap/requirements/state documentation for the disk fallback work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

std::vector<cudf::data_type> col_types = {cudf::data_type{cudf::type_id::INT32}};
std::vector<std::optional<std::pair<int, int>>> ranges = {std::make_pair(0, 100000)};

auto table = sirius::create_cudf_table_with_random_data(num_rows, col_types, ranges, stream, mr);
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_gpu_batch() builds the cudf table on cudf::get_default_stream(), but the tests later convert the batch using a separate rmm::cuda_stream stream without any synchronization. Other tests in this repo typically pass the same stream into create_cudf_table_with_random_data(...) and synchronize before using the batch on another stream; as written this can introduce stream-ordering races/flaky failures on platforms using async allocators/stream-ordered semantics. Consider passing the test’s stream into make_gpu_batch() (or synchronizing after table creation) so all GPU work is ordered on the same stream.

Suggested change
auto table = sirius::create_cudf_table_with_random_data(num_rows, col_types, ranges, stream, mr);
auto table = sirius::create_cudf_table_with_random_data(num_rows, col_types, ranges, stream, mr);
stream.synchronize();

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applicable — this is a pre-existing pattern used across all test files. The table is fully materialized on the default stream before the test stream is used, so there's no actual race condition.

// The registry dispatches on typeid(*source). When source is disk_data_representation,
// the DISK->GPU converter registered in Phase 1 (register_builtin_converters with pipeline
// backend) is selected automatically. This test verifies the round-trip works end-to-end.
auto [mgr, tmp_dir] = make_test_memory_manager_with_disk();
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tmp_dir is only captured to keep the temporary directory alive but is never referenced, which can trigger -Wunused-variable warnings (and potentially break builds if warnings are treated as errors). Consider marking it [[maybe_unused]], casting to void, or otherwise using it explicitly.

Suggested change
auto [mgr, tmp_dir] = make_test_memory_manager_with_disk();
auto [mgr, tmp_dir] = make_test_memory_manager_with_disk();
(void)tmp_dir; // Keep the temp directory alive for the duration of the test.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

False positive — tmp_dir is intentionally held alive via structured binding so the temporary directory remains on disk for the duration of the test.

Comment on lines +122 to +123
auto tmp_dir = std::filesystem::temp_directory_path() / "sirius_test_disk";
std::filesystem::create_directories(tmp_dir);
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests write disk spill files into a fixed path under the system temp directory (e.g., /tmp/sirius_test_disk). When tests are run concurrently (or if a prior run crashed), leftover files can interfere with later runs and the temp directory can grow over time. Consider using a per-test unique directory (PID/random suffix) and/or removing the directory contents (e.g., remove_all) before/after the test.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applicable — Catch2 tests run sequentially (single-threaded). No concurrent execution risk with fixed paths.

Comment on lines +210 to +211
setter.add_config("disk_capacity", opt.capacity_bytes);
setter.add_config("disk_mount_path", opt.downgrade_root_dirs);
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming the disk config keys from capacity_bytes / downgrade_root_dirs to disk_capacity / disk_mount_path is a breaking change for existing .cfg files. If backward compatibility matters, consider supporting the old keys as aliases (possibly with a deprecation warning) so older configs continue to load.

Suggested change
setter.add_config("disk_capacity", opt.capacity_bytes);
setter.add_config("disk_mount_path", opt.downgrade_root_dirs);
setter.add_config("disk_capacity", opt.capacity_bytes);
setter.add_config("capacity_bytes", opt.capacity_bytes);
setter.add_config("disk_mount_path", opt.downgrade_root_dirs);
setter.add_config("downgrade_root_dirs", opt.downgrade_root_dirs);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applicable — these are brand new config keys introduced in this PR. There are no existing configs using old names, so no backward compatibility concern.

Comment on lines +63 to +65
auto tmp_dir = std::filesystem::temp_directory_path() / "sirius_test_disk_readback";
std::filesystem::create_directories(tmp_dir);

Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test helper uses a fixed directory name under the system temp directory (.../sirius_test_disk_readback). Similar to other disk-based tests, this can lead to cross-test interference or leftover files if tests run concurrently or a prior run crashed. Consider using a unique per-test temp directory (PID/random suffix) and/or clearing it with remove_all before/after the test.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applicable — Catch2 tests run sequentially (single-threaded). No concurrent execution risk with fixed paths.

@@ -0,0 +1,108 @@
---
phase: 02-end-to-end-spill-flow
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete these planning files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants