Iceberg V3 + equality deletes format coherent + schema evolution + snapshot-aware deletes#574
ducndh wants to merge 37 commits into sirius-db:dev from
Conversation
…ble extension loading DuckDB 1.5.1's iceberg extension resolves parquet columns by PARQUET:field_id metadata instead of column names. Our test parquet files (from tpchgen-rs) lack field IDs, causing all-NULL reads. Add schema.name-mapping.default property to generated iceberg metadata so DuckDB can map names→field IDs. Also replace hardcoded v1.4.4 extension path with INSTALL/LOAD for version portability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
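The name-mapping fix amounts to a small metadata rewrite: derive a `schema.name-mapping.default` property from the current schema so DuckDB can map parquet column names back to field IDs. A minimal Python sketch (the helper name and the simplified metadata shape are hypothetical; real Iceberg metadata carries many more fields, and this assumes schema-id equals list position):

```python
import json

def add_name_mapping(metadata: dict) -> dict:
    """Derive schema.name-mapping.default from the current schema.
    Illustrative only: assumes current-schema-id indexes the schemas list."""
    schema = metadata["schemas"][metadata["current-schema-id"]]
    mapping = [
        {"field-id": f["id"], "names": [f["name"]]}
        for f in schema["fields"]
    ]
    metadata.setdefault("properties", {})[
        "schema.name-mapping.default"
    ] = json.dumps(mapping)
    return metadata

# Toy metadata with two TPC-H-style columns.
meta = {
    "current-schema-id": 0,
    "schemas": [{"schema-id": 0, "fields": [
        {"id": 1, "name": "l_orderkey", "type": "long"},
        {"id": 2, "name": "l_partkey", "type": "long"},
    ]}],
}
out = add_name_mapping(meta)
```

With a name mapping present, readers that resolve columns by `PARQUET:field_id` can fall back to name-based resolution for files written without field IDs.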
Replace runtime-generated test data with pre-built Iceberg fixtures checked into test/cpp/integration/data/. This eliminates the Python/PyIceberg dependency for running tests and ensures deterministic, reproducible test data. Also skips count(*) tests that hit a pre-existing parquet_scan_task byte-range mismatch bug (unrelated to iceberg), and removes fs::exists() guards since fixtures are now always present. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Iceberg V3 deletion vector support (Puffin/Roaring bitmap format) and refactor the delete infrastructure for maintainability:
- Add puffin_reader for V3 deletion vector parsing
- Extend Avro manifest reader to extract V3 fields (file_format, referenced_data_file, content_offset, content_size_in_bytes)
- Normalize file_format to lowercase at the Avro parse boundary
- Consolidate all delete I/O into the IcebergDeleteData struct with a single read_iceberg_delete_data() entry point, called once under InternalQueryGuard during initialize() — fixes a V2 deadlock where build_delete_pipeline() opened DuckDB connections without the guard
- Operator carries shared_ptr<const IcebergDeleteData> instead of raw file path vectors — eliminates the operator-as-data-shuttle pattern
- build_delete_pipeline() is now I/O-free pure filter construction (~130 lines, down from ~320)
- equality_delete_filter uses shared ownership of the pre-built hash join
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Iceberg field ID extraction and matching for schema-evolution-safe equality delete column resolution:
- extract_field_id_map() utility: walks the parquet FileMetaData schema to build a column_name → field_id map for leaf columns
- Equality delete reader now extracts field IDs from parquet footers and stores them in IcebergDeleteData::equality_key_field_ids
- build_delete_pipeline() prefers field-ID matching over name matching when both delete keys and data columns have IDs, falling back to name matching when unavailable
- Add get_file_metadata() / num_files() accessors to parquet_scan_task_global_state for data file schema access
Per-file schema resolution (different schemas per data file) is a documented follow-on — the current implementation uses the first data file's schema for all files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
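The leaf-column walk behind extract_field_id_map() can be modeled in a few lines. An illustrative Python sketch, not the actual parquet FileMetaData API — the dict-based schema and dotted leaf naming are assumptions for this example:

```python
def extract_field_id_map(schema_fields):
    """Build a column_name -> field_id map for leaf columns.
    Model: each field is a dict; nested fields carry a 'children' list,
    and leaf names are joined with '.' (an assumption of this sketch)."""
    result = {}

    def walk(fields, prefix=""):
        for f in fields:
            name = prefix + f["name"]
            if f.get("children"):
                walk(f["children"], name + ".")  # recurse into groups
            else:
                result[name] = f["field_id"]      # leaf: record its ID

    walk(schema_fields)
    return result

# Toy schema: one flat leaf plus one nested group.
fields = [
    {"name": "id", "field_id": 1},
    {"name": "addr", "children": [
        {"name": "city", "field_id": 3},
        {"name": "zip", "field_id": 4},
    ]},
]
fid_map = extract_field_id_map(fields)
```

Matching delete keys to data columns via these IDs (rather than names) is what makes the resolution robust to renames across snapshots.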
Partitioned tables (Iceberg, Hive-partitioned parquet) encode partition
values in directory paths (e.g., partition_col=42/), not in the parquet
files. DuckDB includes these as columns in the query schema, but Sirius
previously tried to read them from the parquet file — causing index
mismatches and crashes ("flat projected columns" error on all partitioned
tables).
Changes:
- Detect partition columns by comparing DuckDB schema names against the
parquet file schema (leaf column names). Works for both Iceberg and
plain hive-partitioned parquet regardless of DuckDB's
hive_partitioning_indexes mechanism.
- Filter partition column indices from _selected_column_indices so cuDF
only reads actual data columns.
- Change projected_columns_are_flat() and accumulate_row_group_byte_sizes()
to use name-based column lookup instead of index-based. Fixes the
fundamental issue where DuckDB schema indices != parquet column indices
when partition columns shift the numbering.
- Add partition_inject_fn_t: after GPU decompression, inject constant
columns for partition values (parsed from file path) at the correct
output positions. Runs as a separate step after the iceberg delete
post_convert hook, avoiding composition issues.
- Supports INT32, INT64, FLOAT, DOUBLE, BOOL, VARCHAR, and DATE
partition value types.
Tested: 353/362 existing tests pass (same 9 pre-existing TPC-H failures).
8/12 DuckDB iceberg fixtures with partition columns now pass with exact
CPU match (up from 0/12 before this change).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ersion for partitions
Extract the DuckDB Value → cudf scalar conversion into a shared utility in cudf_utils.hpp (DuckDBValueToCudfScalar). This eliminates the manual per-type parsing (stoi, stod, Date::FromString, etc.) in the partition injection lambda and provides a reusable conversion bridge between DuckDB's and cudf's type systems.
The same Value→scalar switch exists in 3 places:
- gpu_execute_constant.cpp (constant expression materialization)
- gpu_expression_translator.cpp (AST literal nodes)
- partition column injection (new, this PR)
The first two can be refactored to use DuckDBValueToCudfScalar in a follow-up, reducing ~60 lines of duplicated switch statements. Partition values are now cast via DuckDB's Value::DefaultCastAs(), which handles DATE, DECIMAL, TIMESTAMP, and all other types correctly without custom parsing code.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Iceberg manifest list and manifest files can use the Avro "deflate" codec (RFC 1951 raw deflate). Previously, our Avro reader only supported the "null" (uncompressed) codec — deflate-compressed files silently failed and the table was treated as V1 with no deletes. This affected all DuckDB iceberg extension test fixtures, which use deflate-compressed manifest lists generated by Spark/REST catalog. Uses DuckDB's bundled miniz library for decompression. Both block reading loops (manifest list reader and manifest entry reader) now decompress deflate blocks before parsing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
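The Avro "deflate" codec is RFC 1951 raw deflate — no zlib header or trailer. The commit uses DuckDB's bundled miniz in C++; the equivalent decompression can be sketched in Python with zlib and a negative `wbits`:

```python
import zlib

def inflate_deflate(block: bytes) -> bytes:
    """Decompress an Avro 'deflate' codec block.
    wbits=-15 selects raw deflate (no zlib header/trailer), which is
    what Avro writers emit for this codec."""
    return zlib.decompress(block, wbits=-15)

# Round-trip check: compress with raw deflate the way an Avro writer would.
comp = zlib.compressobj(wbits=-15)
payload = comp.compress(b"avro block bytes") + comp.flush()
assert inflate_deflate(payload) == b"avro block bytes"
```

Using the default `wbits` (zlib-wrapped) on these blocks fails with a header error, which is why a reader that only handles the "null" codec sees such files as unreadable.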
…partition tests
Two fixes:
1. Manifest list content-type: the Iceberg spec defines the manifest list content field as 0=data manifest, 1=delete manifest (any type). Our code incorrectly treated 1=positional deletes, 2=equality deletes. Since 2 never appears in manifest lists, equality deletes were never discovered. Now reads ALL delete entries from content=1 manifests and classifies by data_file.content (1=positional, 2=equality).
2. Hive partition unit tests: add 3 integration tests for GPU execution on hive-partitioned parquet (basic scan, filter on data column, aggregation). Test data: 3 parquet files in year=/month=/ layout with partition columns not in the parquet files.
Known limitation: filter/aggregation on partition columns (WHERE year=X, GROUP BY year) requires refactoring _selected_column_indices to decouple DuckDB schema indices from parquet column indices. Tracked separately.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move deflate Avro support and manifest content-type fix to the iceberg branch where they belong. This keeps the hive partition PR clean — only parquet scan infrastructure changes, no iceberg code modifications. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge hive partition support into the iceberg branch. Resolves conflicts in parquet_scan_task.hpp (keep both V3 accessors and hive partition API) and test_gpu_execution_multi_format.cpp (keep both V3 DV tests and hive partition tests). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s schemas
Three fixes that enable equality deletes on real-world Iceberg tables:
1. Manifest content-type: the manifest list `content` field is 0=data, 1=delete (any type). Previously we looked for content=2 for equality deletes, which never exists in spec-compliant tables. Now we read all entries from content=1 manifests and classify by data_file.content.
2. Deflate Avro: add inflate_deflate() using DuckDB's bundled miniz. Spark and REST catalog tables use deflate-compressed Avro manifests; our reader previously threw on any codec except "null".
3. Heterogeneous equality-delete schemas: real tables can have delete files with different key column sets (e.g., delete by "name" vs "name+bir"). Changed from a single equality key set to a vector of EqualityDeleteGroup, each with its own hash join and key mapping.
Also fixes test fixture data: manifest list content=2 → content=1 to match the Iceberg spec (previously both generator and reader had the same wrong assumption, masking the bug).
Verified: the DuckDB equality_deletes fixture (unpartitioned + partitioned) now matches CPU results. All 17 iceberg unit tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
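The corrected content-type handling reduces to: iterate content=1 manifests, then classify each entry by data_file.content. A minimal Python model — the record shapes and the `read_manifest` callable are illustrative stand-ins, not the real Avro reader:

```python
def classify_delete_entries(manifest_list, read_manifest):
    """Per the Iceberg spec: manifest-list content 0=data manifest,
    1=delete manifest (any type). The delete *type* lives on each
    entry's data_file content field: 1=positional, 2=equality."""
    positional, equality = [], []
    for m in manifest_list:
        if m["content"] != 1:          # skip data manifests
            continue
        for entry in read_manifest(m["path"]):
            if entry["content"] == 1:
                positional.append(entry)
            elif entry["content"] == 2:
                equality.append(entry)
    return positional, equality

# Stubbed manifest list with one data and one delete manifest.
manifest_list = [
    {"content": 0, "path": "m-data"},
    {"content": 1, "path": "m-del"},
]
entries = {
    "m-del": [{"content": 1, "path": "pos.parquet"},
              {"content": 2, "path": "eq.parquet"}],
}
pos, eq = classify_delete_entries(manifest_list, lambda p: entries[p])
```

The earlier bug — looking for content=2 at the manifest-list level — would make `eq` permanently empty on spec-compliant tables.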
…ing, snapshots
Five new Iceberg test tables that exercise known gaps between DuckDB and Sirius GPU:
- iceberg_schema_evolution: added column across snapshots (field-ID mapping)
- iceberg_type_widening: INT32→INT64 across files (type widening)
- iceberg_multi_snapshot: 3 snapshots for time-travel testing
- iceberg_snapshot_deletes: positional deletes in snap2, snap1 has none
- iceberg_deflate_manifest: deflate-compressed Avro (basic scan)
All pass on DuckDB CPU. GPU failures confirm remaining gaps:
- schema_evolution: empty strings instead of NULL for missing columns
- type_widening: cuDF "Type mismatch in columns to concatenate"
- snapshot_deletes at snap1: wrong delete metadata applied
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ULL injection
When Iceberg tables have files with different schemas (e.g., column added
in a later snapshot), Sirius now handles this correctly:
1. Detect schema evolution by checking ALL files' schemas, not just the
first. Columns missing from ALL files are partition columns; columns
missing from SOME files trigger per-file projection.
2. Per-file reader options: make_reader() overrides column_names per file
so cuDF only reads columns that exist in each file.
3. Schema reconciliation: a new build_schema_reconciliation() method
builds a partition_inject_fn that adds typed NULL columns for columns
missing from specific files, producing a consistent output schema.
4. Fix partition_inject_fn propagation: was gated on has_hive_partitions()
which missed schema reconciliation without partitions.
Verified: iceberg_schema_evolution fixture (file1={fruit,count},
file2={fruit,count,color}) now returns correct NULL values for the
missing color column in file1, matching DuckDB CPU output.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
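The per-file projection plus NULL-injection decision above reduces to a set difference per file: read what exists, inject typed NULLs for the rest. A hedged sketch (function name hypothetical; the real build_schema_reconciliation() works on cuDF columns, not name lists):

```python
def reconcile_schema(file_columns, table_columns):
    """For one data file, split the table schema into columns to read
    from the file and columns to inject as typed NULLs downstream,
    so every file yields the same output schema."""
    present = [c for c in table_columns if c in file_columns]
    inject_null = [c for c in table_columns if c not in file_columns]
    return present, inject_null

# Mirrors the iceberg_schema_evolution fixture:
# file1={fruit,count}, table schema={fruit,count,color}.
read_cols, null_cols = reconcile_schema(
    ["fruit", "count"], ["fruit", "count", "color"]
)
```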
When querying a specific Iceberg snapshot (snapshot_from_id), delete metadata must come from that snapshot, not the latest one. Previously the metadata reader always queried the latest snapshot's manifest list.
Changes:
- Add a named_parameters field to sirius_physical_table_scan and wire it through from LogicalGet in sirius_plan_get.cpp
- Extract snapshot_from_id from named parameters in prefetch_iceberg_delete_data and pass it to read_iceberg_delete_data
- discover_delete_file_paths now filters iceberg_snapshots() by snapshot_id when provided, falling back to latest when not
Verified: iceberg_snapshot_deletes fixture — snap1 returns 5 rows (no deletes), snap2 returns 3 rows (deletes applied). Both match CPU.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
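The snapshot selection logic can be sketched as: prefer the requested snapshot_from_id, else fall back to the latest snapshot. Illustrative Python (the snapshot record shape is assumed; the ids below are borrowed from the iceberg_snapshot_deletes fixture filenames):

```python
def select_manifest_list(snapshots, snapshot_from_id=None):
    """Pick the manifest list used for delete discovery: the requested
    snapshot when snapshot_from_id is given, otherwise the latest
    snapshot by sequence number."""
    if snapshot_from_id is not None:
        for s in snapshots:
            if s["snapshot_id"] == snapshot_from_id:
                return s["manifest_list"]
        raise KeyError(f"unknown snapshot id: {snapshot_from_id}")
    # No explicit snapshot requested: fall back to the latest.
    return max(snapshots, key=lambda s: s["sequence_number"])["manifest_list"]

snapshots = [
    {"snapshot_id": 9400000000000001, "sequence_number": 1,
     "manifest_list": "snap-9400000000000001-1-ml.avro"},
    {"snapshot_id": 9400000000000002, "sequence_number": 2,
     "manifest_list": "snap-9400000000000002-2-ml.avro"},
]
```

Always taking the latest manifest list is exactly the prior bug: a time-travel query at snap1 would still pick up snap2's delete files.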
Per the Iceberg spec, equality deletes only apply to data files with a strictly lower sequence number than the delete. Without this, newer data inserted after a delete could be incorrectly filtered out.
Changes:
- Extract sequence_number from manifest entries in the Avro reader
- Store min_sequence_number per EqualityDeleteGroup
- Read data file sequence numbers from data manifests
- equality_delete_filter skips groups where data_seq > delete_seq
- Conservative: when sequence numbers are unavailable or equal, apply (matches behavior for same-snapshot data+delete operations)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
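The gating rule above condenses to one comparison per (data file, delete group) pair. A sketch matching the commit's stated semantics — skip when data_seq > delete_seq; apply when equal or when either number is unknown (the function name is hypothetical):

```python
def delete_group_applies(data_seq, delete_seq):
    """Equality-delete sequence gating. Per the Iceberg spec a delete
    applies only to data files with a strictly lower sequence number;
    this sketch, like the commit, is conservative and also applies the
    delete when the numbers are equal or unavailable (None)."""
    if data_seq is None or delete_seq is None:
        return True  # unknown: apply conservatively
    return data_seq <= delete_seq

# Fixture pattern from the sequence-number test: delete at seq=2.
assert delete_group_applies(1, 2)        # pre-delete data: filtered
assert delete_group_applies(2, 2)        # same seq: conservative apply
assert not delete_group_applies(3, 2)    # post-delete insert survives
```

This is a plain CPU comparison done before probing, so skipped groups incur no GPU work.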
Tests that equality deletes respect sequence numbers: a delete at seq=2 should NOT apply to data inserted at seq=3, even if the values match. Fixture: seq1=[apple/1, banana/2, cherry/3], seq2=delete(banana), seq3=[banana/4, date/5]. Expected: 4 rows (banana/4 survives). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…q numbers
4 new test cases under the [iceberg] tag:
- Schema evolution: added column returns NULL for old files
- Snapshot-aware deletes: snap1 has no deletes, snap2 has deletes
- Equality delete sequence numbers: insert-delete-insert pattern
- Deflate-compressed manifests: basic scan correctness
Total: 21 iceberg tests (was 17).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ename
Three cleanup items from the patch-by-patch implementation:
1. Extract detect_columns_and_apply_projection() — the ~70 lines of schema detection + projection setup were copy-pasted between both parquet_scan_task constructors. Now a single shared method.
2. Single-pass manifest discovery — discover_from_manifests() reads the manifest list once and each manifest once. Previously we read each delete manifest twice (positional + equality) and re-queried iceberg_snapshots() a second time for data file sequence numbers.
3. Rename read_iceberg_manifest_delete_entries → read_iceberg_manifest_entries, since it reads data file entries too (content=0). Accept target_content=-1 for unfiltered reads.
Net -112 lines. No behavior change — 21/21 iceberg tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Apply clang-format to all changed iceberg files
- Fix stale manifest list content comment (was 0=DATA, 1=POS, 2=EQ; should be 0=data manifest, 1=delete manifest)
- Update IcebergDeleteFileEntry doc (used for data files too, not just deletes)
- Update equality delete field-ID map comment (schema evolution is no longer a "future follow-on"; it's implemented)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR expands Sirius’s Iceberg scan support beyond the baseline in #570 by improving delete discovery and application (including Iceberg V3 deletion vectors), enabling snapshot-aware metadata reads via named parameters, and adding broader integration fixtures for equality deletes and schema evolution scenarios.
Changes:
- Propagate DuckDB `named_parameters` through planning/scan operators to support snapshot-aware Iceberg delete discovery (e.g., `snapshot_from_id`).
- Introduce/extend pre-materialized Iceberg delete metadata usage in the scan path (shared delete data, equality-delete grouping, and Puffin DV reader wiring).
- Add extensive Iceberg/Hive-partition integration fixtures and update the TPC-H Iceberg wrapper metadata to include a default schema name-mapping.
Reviewed changes
Copilot reviewed 53 out of 142 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/tpch_performance/wrap_parquet_as_iceberg.py | Adds Iceberg metadata properties with a default schema name-mapping for field-id resolution. |
| test/cpp/integration/data/iceberg_v3_deletion_vector/metadata/version-hint.text | Adds Iceberg V3 fixture metadata version hint. |
| test/cpp/integration/data/iceberg_v3_deletion_vector/metadata/v1.metadata.json | Adds Iceberg V3 table metadata fixture. |
| test/cpp/integration/data/iceberg_v3_deletion_vector/metadata/snap-7000000000000000001-1-d7e8f9a0-0007-0007-0007-000000000005.avro | Adds V3 snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_v3_deletion_vector/data/dv-00000-0-d7e8f9a0-0007-0007-0007-000000000002-00001.puffin | Adds Puffin deletion-vector fixture. |
| test/cpp/integration/data/iceberg_v3_deletion_vector/data/00000-0-d7e8f9a0-0007-0007-0007-000000000001-00001.parquet | Adds V3 data parquet fixture. |
| test/cpp/integration/data/iceberg_v2_equality_delete/metadata/version-hint.text | Adds Iceberg V2 equality-delete fixture metadata version hint. |
| test/cpp/integration/data/iceberg_v2_equality_delete/metadata/v1.metadata.json | Adds Iceberg V2 table metadata fixture. |
| test/cpp/integration/data/iceberg_v2_equality_delete/metadata/snap-3000000000000000001-1-c3d4e5f6-0003-0003-0003-000000000005.avro | Adds V2 snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_v2_equality_delete/data/delete-eq-00000-0-c3d4e5f6-0003-0003-0003-000000000002-00001.parquet | Adds equality-delete parquet fixture. |
| test/cpp/integration/data/iceberg_v2_equality_delete/data/00000-0-c3d4e5f6-0003-0003-0003-000000000001-00001.parquet | Adds data parquet fixture. |
| test/cpp/integration/data/iceberg_v2_eq_single_col/metadata/version-hint.text | Adds single-column equality-delete fixture metadata version hint. |
| test/cpp/integration/data/iceberg_v2_eq_single_col/metadata/v1.metadata.json | Adds single-column equality-delete table metadata fixture. |
| test/cpp/integration/data/iceberg_v2_eq_single_col/metadata/snap-4000000000000000001-1-d4e5f6a7-0004-0004-0004-000000000005.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_v2_eq_single_col/data/delete-eq-d4e5f6a7-0004-0004-0004-000000000002-00001.parquet | Adds single-column equality-delete parquet fixture. |
| test/cpp/integration/data/iceberg_v2_eq_single_col/data/00000-0-d4e5f6a7-0004-0004-0004-000000000001-00001.parquet | Adds data parquet fixture. |
| test/cpp/integration/data/iceberg_v2_eq_pos_combined/metadata/version-hint.text | Adds combined positional+equality delete fixture version hint. |
| test/cpp/integration/data/iceberg_v2_eq_pos_combined/metadata/v1.metadata.json | Adds combined delete table metadata fixture. |
| test/cpp/integration/data/iceberg_v2_eq_pos_combined/metadata/snap-7000000000000000001-1-a7b8c9d0-0007-0007-0007-000000000007.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_v2_eq_pos_combined/data/delete-pos-a7b8c9d0-0007-0007-0007-000000000002-00001.parquet | Adds positional-delete parquet fixture. |
| test/cpp/integration/data/iceberg_v2_eq_pos_combined/data/delete-eq-a7b8c9d0-0007-0007-0007-000000000003-00001.parquet | Adds equality-delete parquet fixture. |
| test/cpp/integration/data/iceberg_v2_eq_pos_combined/data/00000-0-a7b8c9d0-0007-0007-0007-000000000001-00001.parquet | Adds data parquet fixture. |
| test/cpp/integration/data/iceberg_v2_eq_multi_delete_file/metadata/version-hint.text | Adds multi-delete-file fixture version hint. |
| test/cpp/integration/data/iceberg_v2_eq_multi_delete_file/metadata/v1.metadata.json | Adds multi-delete-file table metadata fixture. |
| test/cpp/integration/data/iceberg_v2_eq_multi_delete_file/metadata/snap-5000000000000000001-1-e5f6a7b8-0005-0005-0005-000000000006.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_v2_eq_multi_delete_file/data/delete-eq-2-e5f6a7b8-0005-0005-0005-000000000003-00001.parquet | Adds equality-delete parquet fixture (file 2). |
| test/cpp/integration/data/iceberg_v2_eq_multi_delete_file/data/delete-eq-1-e5f6a7b8-0005-0005-0005-000000000002-00001.parquet | Adds equality-delete parquet fixture (file 1). |
| test/cpp/integration/data/iceberg_v2_eq_multi_delete_file/data/00000-0-e5f6a7b8-0005-0005-0005-000000000001-00001.parquet | Adds data parquet fixture. |
| test/cpp/integration/data/iceberg_v2_eq_multi_data_file/metadata/version-hint.text | Adds multi-data-file fixture version hint. |
| test/cpp/integration/data/iceberg_v2_eq_multi_data_file/metadata/v1.metadata.json | Adds multi-data-file table metadata fixture. |
| test/cpp/integration/data/iceberg_v2_eq_multi_data_file/metadata/snap-8000000000000000001-1-b8c9d0e1-0008-0008-0008-000000000006.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_v2_eq_multi_data_file/data/delete-eq-b8c9d0e1-0008-0008-0008-000000000003-00001.parquet | Adds equality-delete parquet fixture. |
| test/cpp/integration/data/iceberg_v2_eq_multi_data_file/data/00001-0-b8c9d0e1-0008-0008-0008-000000000002-00001.parquet | Adds data parquet fixture (file 2). |
| test/cpp/integration/data/iceberg_v2_eq_multi_data_file/data/00000-0-b8c9d0e1-0008-0008-0008-000000000001-00001.parquet | Adds data parquet fixture (file 1). |
| test/cpp/integration/data/iceberg_v2_eq_all_deleted/metadata/version-hint.text | Adds “all rows deleted” fixture version hint. |
| test/cpp/integration/data/iceberg_v2_eq_all_deleted/metadata/v1.metadata.json | Adds “all rows deleted” table metadata fixture. |
| test/cpp/integration/data/iceberg_v2_eq_all_deleted/metadata/snap-6000000000000000001-1-f6a7b8c9-0006-0006-0006-000000000005.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_v2_eq_all_deleted/metadata/f6a7b8c9-0006-0006-0006-000000000003-m0.avro | Adds manifest file fixture (Avro). |
| test/cpp/integration/data/iceberg_v2_eq_all_deleted/data/delete-eq-f6a7b8c9-0006-0006-0006-000000000002-00001.parquet | Adds equality-delete parquet fixture (deletes all). |
| test/cpp/integration/data/iceberg_v2_eq_all_deleted/data/00000-0-f6a7b8c9-0006-0006-0006-000000000001-00001.parquet | Adds data parquet fixture. |
| test/cpp/integration/data/iceberg_v2_delete/metadata/version-hint.text | Adds positional-delete-only fixture version hint. |
| test/cpp/integration/data/iceberg_v2_delete/metadata/v1.metadata.json | Adds positional-delete-only table metadata fixture. |
| test/cpp/integration/data/iceberg_v2_delete/metadata/snap-2000000000000000001-1-b2c3d4e5-0002-0002-0002-000000000005.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_v2_delete/metadata/b2c3d4e5-0002-0002-0002-000000000004-m1.avro | Adds manifest file fixture (Avro). |
| test/cpp/integration/data/iceberg_v2_delete/data/00000-0-b2c3d4e5-0002-0002-0002-000000000001-00001.parquet | Adds data parquet fixture. |
| test/cpp/integration/data/iceberg_v1/metadata/version-hint.text | Adds Iceberg v1 fixture version hint. |
| test/cpp/integration/data/iceberg_v1/metadata/v1.metadata.json | Adds Iceberg v1 table metadata fixture. |
| test/cpp/integration/data/iceberg_v1/metadata/snap-1000000000000000001-1-a1b2c3d4-0001-0001-0001-000000000001.avro | Adds v1 snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_v1/metadata/a1b2c3d4-0001-0001-0001-000000000002-m0.avro | Adds v1 manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_v1/data/00000-0-a1b2c3d4-0001-0001-0001-000000000003-00001.parquet | Adds v1 data parquet fixture. |
| test/cpp/integration/data/iceberg_type_widening/metadata/version-hint.text | Adds type-widening fixture version hint. |
| test/cpp/integration/data/iceberg_type_widening/metadata/v1.metadata.json | Adds type-widening table metadata fixture (multiple schemas/snapshots). |
| test/cpp/integration/data/iceberg_type_widening/metadata/snap-9200000000000002-2-ml.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_type_widening/metadata/snap-9200000000000001-1-ml.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_type_widening/metadata/manifest-snap2-m0.avro | Adds manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_type_widening/metadata/manifest-snap1-m0.avro | Adds manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_type_widening/data/00000-0-wide-type.parquet | Adds wide-type data parquet fixture. |
| test/cpp/integration/data/iceberg_type_widening/data/00000-0-narrow-type.parquet | Adds narrow-type data parquet fixture. |
| test/cpp/integration/data/iceberg_snapshot_deletes/metadata/version-hint.text | Adds snapshot-aware deletes fixture version hint. |
| test/cpp/integration/data/iceberg_snapshot_deletes/metadata/v1.metadata.json | Adds multi-snapshot table metadata fixture (delete introduced in later snapshot). |
| test/cpp/integration/data/iceberg_snapshot_deletes/metadata/snap-9400000000000002-2-ml.avro | Adds later-snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_snapshot_deletes/metadata/snap-9400000000000001-1-ml.avro | Adds earlier-snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_snapshot_deletes/metadata/manifest-delete-m1.avro | Adds delete manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_snapshot_deletes/metadata/manifest-data-m0.avro | Adds data manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_snapshot_deletes/data/delete-00000.parquet | Adds positional delete parquet fixture. |
| test/cpp/integration/data/iceberg_snapshot_deletes/data/00000-data.parquet | Adds data parquet fixture. |
| test/cpp/integration/data/iceberg_schema_evolution/metadata/version-hint.text | Adds schema-evolution fixture version hint. |
| test/cpp/integration/data/iceberg_schema_evolution/metadata/v1.metadata.json | Adds schema-evolution metadata fixture (schema add column). |
| test/cpp/integration/data/iceberg_schema_evolution/metadata/snap-9100000000000002-2-ml.avro | Adds later-snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_schema_evolution/metadata/snap-9100000000000001-1-ml.avro | Adds earlier-snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_schema_evolution/metadata/manifest-snap2-m0.avro | Adds later snapshot manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_schema_evolution/metadata/manifest-snap1-m0.avro | Adds earlier snapshot manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_schema_evolution/data/00000-0-data-file-2.parquet | Adds data parquet fixture (new schema). |
| test/cpp/integration/data/iceberg_schema_evolution/data/00000-0-data-file-1.parquet | Adds data parquet fixture (old schema). |
| test/cpp/integration/data/iceberg_multi_snapshot/metadata/version-hint.text | Adds multi-snapshot fixture version hint. |
| test/cpp/integration/data/iceberg_multi_snapshot/metadata/v1.metadata.json | Adds multi-snapshot table metadata fixture. |
| test/cpp/integration/data/iceberg_multi_snapshot/metadata/snap-9300000000000003-3-ml.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_multi_snapshot/metadata/snap-9300000000000002-2-ml.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_multi_snapshot/metadata/snap-9300000000000001-1-ml.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_multi_snapshot/metadata/manifest-snap3-m0.avro | Adds manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_multi_snapshot/metadata/manifest-snap2-m0.avro | Adds manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_multi_snapshot/metadata/manifest-snap1-m0.avro | Adds manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_multi_snapshot/data/data-snap3.parquet | Adds data parquet fixture (snapshot 3). |
| test/cpp/integration/data/iceberg_multi_snapshot/data/data-snap2.parquet | Adds data parquet fixture (snapshot 2). |
| test/cpp/integration/data/iceberg_multi_snapshot/data/data-snap1.parquet | Adds data parquet fixture (snapshot 1). |
| test/cpp/integration/data/iceberg_eq_seq_number/metadata/version-hint.text | Adds equality-delete sequence-number fixture version hint. |
| test/cpp/integration/data/iceberg_eq_seq_number/metadata/v1.metadata.json | Adds sequence-number test table metadata fixture. |
| test/cpp/integration/data/iceberg_eq_seq_number/metadata/snap-9600000000000003-3-ml.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_eq_seq_number/metadata/snap-9600000000000002-2-ml.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_eq_seq_number/metadata/snap-9600000000000001-1-ml.avro | Adds snapshot manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_eq_seq_number/metadata/manifest-delete.avro | Adds delete manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_eq_seq_number/metadata/manifest-data2.avro | Adds data manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_eq_seq_number/metadata/manifest-data1.avro | Adds data manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_eq_seq_number/data/delete-eq.parquet | Adds equality-delete parquet fixture. |
| test/cpp/integration/data/iceberg_eq_seq_number/data/data2.parquet | Adds data parquet fixture (post-delete insert). |
| test/cpp/integration/data/iceberg_eq_seq_number/data/data1.parquet | Adds data parquet fixture (pre-delete insert). |
| test/cpp/integration/data/iceberg_deflate_manifest/metadata/version-hint.text | Adds deflate-manifest fixture version hint. |
| test/cpp/integration/data/iceberg_deflate_manifest/metadata/v1.metadata.json | Adds deflate-manifest table metadata fixture. |
| test/cpp/integration/data/iceberg_deflate_manifest/metadata/snap-9500000000000001-1-ml.avro | Adds deflate-coded manifest-list fixture (Avro). |
| test/cpp/integration/data/iceberg_deflate_manifest/metadata/manifest-m0.avro | Adds deflate-coded manifest fixture (Avro). |
| test/cpp/integration/data/iceberg_deflate_manifest/data/00000-data.parquet | Adds data parquet fixture. |
| test/cpp/integration/data/hive_partitioned/year=2025/month=01/data.parquet | Adds hive-partitioned parquet fixture (year=2025/month=01). |
| test/cpp/integration/data/hive_partitioned/year=2024/month=02/data.parquet | Adds hive-partitioned parquet fixture (year=2024/month=02). |
| test/cpp/integration/data/hive_partitioned/year=2024/month=01/data.parquet | Adds hive-partitioned parquet fixture (year=2024/month=01). |
| src/planner/sirius_plan_get.cpp | Moves DuckDB named_parameters onto physical scan nodes. |
| src/op/scan/equality_delete_filter.cpp | Updates equality delete filter to use shared delete data and adds sequence-number gating. |
| src/include/sirius_engine.hpp | Renames/reshapes Iceberg prefetch API and changes cache to store shared delete data. |
| src/include/op/sirius_physical_table_scan.hpp | Adds named_parameters to physical table scans. |
| src/include/op/sirius_physical_iceberg_scan.hpp | Replaces per-scan delete file path vectors with shared immutable delete-data pointer. |
| src/include/op/scan/puffin_reader.hpp | Adds Puffin deletion-vector reader API. |
| src/include/op/scan/iceberg_delete_filter.hpp | Updates equality delete filter interface to reference shared delete data. |
| src/include/op/scan/iceberg_avro_reader.hpp | Adds richer manifest-entry metadata (incl. DV offsets/sequence numbers) and updates docs for deflate support. |
| src/include/data/host_parquet_representation.hpp | Adds a partition-column injection hook alongside the existing post-convert hook. |
| src/data/host_parquet_representation_converters.cpp | Applies the partition-column injection hook during host→GPU conversion and preserves it in host→host conversion. |
| CMakeLists.txt | Adds Puffin reader implementation file to the build. |
Pre-commit hooks: pretty-format-json, fix-end-of-files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DuckDB's iceberg extension reads version-hint.text verbatim and appends it to the metadata path. A trailing newline causes "v1\n.metadata.json" lookup which fails. Exclude version-hint.text from end-of-file-fixer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…andling
Fixes from Copilot code review:
1. Column ordering: output_map now uses a duckdb_idx→cudf_idx map built from _selected_column_indices, ensuring correct column positions even when column_ids order differs from primary index order.
2. Memory: use tbl->release() + move instead of copying data columns during partition injection — eliminates a redundant GPU allocation.
3. Error handling: throw on a missing partition key in the file path instead of silently casting an empty string (which crashes on non-VARCHAR types).
4. Type safety: DuckDBValueToCudfScalar throws InvalidInputException on unsupported types instead of silently returning a string scalar with a mismatched type.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… error handling" This reverts commit e130fc8.
…handling Fixes from code review: 1. Column ordering: build output_map in column_ids order (matching pipeline expectations) instead of schema order. Uses a duckdb_idx→cudf_position map built from _selected_column_indices. Fixes wrong results on queries like SELECT year, name, id (reversed column order). 2. Zero-copy: use tbl->release() to move data columns out instead of constructing new cudf::column from column_view (which deep-copies device memory). Partition columns are still allocated fresh since they're constant-filled. 3. Missing partition key: throw with descriptive error instead of silently casting empty string (which crashes on non-VARCHAR types). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous approach grouped only by schema and used min_sequence_number which was incorrect: deletes from different sequence numbers in the same group could be wrongly skipped or applied. Now each group has exactly one sequence number, matching DuckDB's approach (map<seq_num, DeleteData>). Scan-time check is a simple CPU comparison: skip group if data_file_seq >= group.seq (no extra GPU work for skipped groups). Also fixes equality delete test fixtures: delete manifest entries now have sequence_number=2 (data files at seq=1). Per Iceberg spec, equality deletes only apply to data files with strictly lower seq. Previously both were at seq=1, making deletes silently not apply. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are small committed unit test fixtures, not external stress tests. The "stress" naming was misleading — they test edge cases like single-column keys, multiple delete files, all-rows-deleted, etc. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DuckDB's optimizer prunes columns not referenced in the query (e.g. SELECT count(tx_id) drops user_id, tx_date). But equality delete hash joins need those key columns to probe against. Without them, deletes silently fail — returning all rows instead of filtered rows. Fix: in prepare(), detect equality delete key column names and append any missing ones to _selected_column_indices. The delete pipeline's build_hook() already strips extra trailing columns after filtering, so downstream operators see only the originally requested columns. Before: SELECT count(tx_id) FROM iceberg_scan(...) → 1000000 (wrong) After: SELECT count(tx_id) FROM iceberg_scan(...) → 997865 (correct) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ST catalog support Replace custom Avro manifest reading with DuckDB's iceberg_metadata() for the common path (latest snapshot). This eliminates our Avro codec compatibility issues and adds automatic support for REST catalog tables. - REST catalog: derive table path from bind_data file list when parameters[] is empty (e.g. SELECT FROM rc.test.interleaved) - iceberg_metadata(): handles all manifest versions, codecs, catalogs - Custom Avro reader kept as fallback for: snapshot-specific queries (iceberg_metadata doesn't support snapshot selection) and V3 deletion vectors (iceberg_metadata doesn't expose content_offset/size fields) - Force-project equality delete key columns into scan projection so deletes work even when query doesn't SELECT those columns Verified: REST catalog interleaved table (create, append, delete x2) CPU=2450, GPU=2450. 21/21 unit tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review findings from code reuse, quality, and efficiency agents: - Fix double-fetch bug in snapshot path: Fetch() consumed the chunk in the validation condition, then re-ran the entire SQL query. Now fetches once and validates the stored chunk. - Add DV manifest cache: when iceberg_metadata() returns multiple PUFFIN entries from the same manifest, we now read the manifest once instead of N times (N+1 → 1 pattern). - Named constants for iceberg_metadata() content strings: replace raw "POSITION_DELETES", "EQUALITY_DELETES", "EXISTING", "PUFFIN" with static constexpr variables. - Remove carry-through field _extra_eq_delete_columns: was stored on the class just to ferry between prepare() and build_delete_pipeline(). Now passed as a parameter directly. - Replace triple-nested loop in prepare() with name→index map for O(1) column lookups instead of O(groups × keys × schema_cols). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ode duplication - Unified snapshot + latest paths: iceberg_metadata() supports snapshot_from_id, so the custom Avro reader fallback for snapshot queries is eliminated. One code path for all cases. - Extracted AvroBlockDecoder: deduplicated ~40 lines of deflate decompression + pointer management from both Avro reader functions into a reusable struct with next()/advance_source()/verify_sync(). - Positional delete filter shared ownership: now holds shared_ptr<const IcebergDeleteData> instead of deep-copying the entire positional delete map. Matches equality_delete_filter pattern. - Moved forward declaration of IcebergDeleteData before positional filter class (was after, causing compilation error). Net: -49 lines, 5 files changed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tables with partition evolution (e.g., a column that's a data column in some files but a partition column in others) previously crashed with vector::_M_range_check because init_hive_partitions assumed a fixed cuDF column count across all files. Replace the two separate chained inject functions (hive partition + schema reconciliation) with a single unified function that handles all three cases per file: 1. Column exists in parquet → read from cuDF table 2. Column is a partition value in file path → inject from path 3. Column truly missing → inject typed NULL The unified function uses per-file column sets from detect_columns_and_apply_projection() to determine the correct source for each column on each file. Verified: DuckDB hive_partitioned_table fixture (partition evolution from event_date-only to event_date+event_type) now matches CPU exactly. Stress test: 16/25 pass (was 15/25). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… by tests Extract shared constructor logic into initialize_scan() — footer reading, partition detection, projection, and row group partitioning were copy-pasted across both constructors (~130 lines each). Now both constructors set _file_paths + _selected_column_indices then call initialize_scan(). Net -71 lines, single place for future fixes. Add 3 new hive partition tests covering the cases that previously failed (now fixed by column ordering fix in previous commit): - WHERE on partition column (year = 2024) - GROUP BY partition column - Reversed column order in SELECT Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merges feature/hive-partition-columns commits d9cee21 and 0a2d10c: - extract initialize_scan() for shared footer/projection/partition init - fix column ordering for filter pushdown (duckdb_to_cudf mapping) - add partition filter/group-by tests (WHERE year=2024, GROUP BY year) Conflict resolution: - Keep detect_columns_and_apply_projection() (all-file schema evolution) over inline first-file-only detection in initialize_scan() - Keep hive branch's stricter error handling for missing partition keys - Fix build_schema_reconciliation: only activates for schema evolution (per-file column differences), not plain hive partitions — the hive fn's duckdb_to_cudf mapping handles column reordering for filter pushdown correctly, but the unified fn uses schema order which breaks WHERE/GROUP BY on partition columns All tests: 21/21 iceberg, 6/6 hive partition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The unified inject fn built output_cols in schema order (names[0..N]), but the cuDF table has columns in _selected_column_indices order (which DuckDB reorders for filter pushdown / GROUP BY). This caused type mismatch when concatenating batches from different files — e.g., event_type (STRING from parquet) vs user_id (INT64 from parquet) swapped. Fix: build output_cols from column_ids in DuckDB's query-specific order, matching init_hive_partitions' duckdb_to_cudf mapping. Verified: GROUP BY event_type on hive_partitioned_table (partition evolution) now matches CPU: click=2, purchase=2, view=2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 56 out of 145 changed files in this pull request and generated 4 comments.
…validation - CRC32 table: replace mutable static + bool flag with std::call_once for thread-safe lazy initialization. - Roaring offset header: clarify spec-correct behavior: NO_RUNCONTAINER always has offset header, RUN format only when num_containers >= 4. Previous code was correct but had misleading comments; Copilot's suggested fix was wrong (broke 1-container DVs). - Input validation: guard against negative content_offset/size before attempting file I/O. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Makes Iceberg tables work correctly on GPU for real-world tables — not just our simple test fixtures. Before this PR, GPU iceberg scans silently returned wrong results on tables written by Spark, REST catalogs, or any tool that uses standard Iceberg features like equality deletes, schema evolution, or partition changes.
Depends on: #570 (hive partition columns)
Problems Fixed
1. Heterogeneous equality delete schemas
Real Iceberg tables can have multiple equality delete files with different key columns (e.g., one delete file keyed on `name`, another on `name` + `birthdate`). Our GPU path used a single hash join, which broke when delete files had different schemas.

Fix: Changed from a single hash join to `vector<EqualityDeleteGroup>`, each with its own GPU hash join.

2. Schema evolution produced wrong data

When an Iceberg table adds a column (e.g., `ALTER TABLE ADD COLUMN color`), old parquet files don't have that column. DuckDB handles this by injecting NULL values. Our GPU path either returned empty strings or crashed.

Fix: Detect which columns each file has, give each file its own cuDF reader projection, and inject typed NULL columns for missing ones after reading.
3. Snapshot time-travel applied wrong deletes
Querying a historical snapshot (`snapshot_from_id = X`) should only see deletes from that snapshot. We always read deletes from the latest snapshot, so historical queries could show fewer rows than they should.

Fix: Pass the snapshot ID to `iceberg_metadata()` so DuckDB gives us the correct manifest for that snapshot.

4. Equality deletes applied to wrong data files

Iceberg tracks when each file was written (its sequence number). A delete at sequence 5 should only affect data files written before sequence 5, not files written after. We applied all deletes to all files.

Fix: Group equality deletes by sequence number. At scan time, skip any delete group whose sequence number isn't higher than the data file's. This is a CPU-side check with no extra GPU work.
5. Queries that don't SELECT the delete key columns skipped deletes
`SELECT count(tx_id) FROM iceberg_table` would return all rows because `user_id` (the equality delete key) wasn't in the projection. DuckDB's optimizer prunes columns not referenced in the query, so the delete filter couldn't find the key columns to match against.

Fix: Force equality delete key columns into the scan projection, then strip them after delete filtering so downstream operators see only the requested columns.
6. REST catalog tables had no delete support
Tables accessed via REST catalog (`ATTACH '' AS rc (TYPE iceberg, ...); SELECT FROM rc.schema.table`) had no deletes applied because the table path wasn't available through the normal parameter mechanism.

Fix: Derive the table path from the first data file in the bind data. Also switched the main discovery path to DuckDB's `iceberg_metadata()` function instead of our custom Avro reader, which automatically handles all manifest versions, compression codecs, and catalog types.

7. Partition evolution crashed

Some Iceberg tables change their partition scheme over time (e.g., `event_type` starts as a data column in the parquet file, then later becomes a partition value in the directory path). Our code assumed all files have the same column layout and crashed with an out-of-bounds access.

Fix: A unified inject function checks per file whether each column comes from parquet data, from the file path, or is truly missing (NULL). It replaces the previous two-stage approach, which couldn't handle per-file column count variation.
Test Plan
Unit Tests

21/21 iceberg and 6/6 hive partition unit tests pass (including `GROUP BY` on partition columns).

Stress Tests
DuckDB Iceberg Extension Fixtures: 34 real-world tables from the `duckdb/duckdb_iceberg` test suite, generated by Spark, PyIceberg, and DuckDB. Covers V1/V2 tables, equality deletes, positional deletes, hive partitioning, partition evolution, and various data types. GPU output was compared against CPU output row by row.

Result: 16/25 pass. The failures are all pre-existing unsupported cuDF types (UUID, STRUCT, BLOB, small DECIMAL, TIMESTAMP_TZ), plus one empty-table edge case and one decimal sort bug, none introduced by this PR.
Apache Iceberg REST Catalog: tested against the official `apache/iceberg-rest-fixture` Docker image. Created tables with interleaved INSERT + DELETE operations through the REST API. Verified GPU matches CPU for tables with positional deletes across multiple snapshots.

Targeted Correctness Checks
`GROUP BY` on an evolved partition column (the column is data in some files, a partition value in others): GPU = CPU

Known Limitations
- `ALTER TABLE RENAME COLUMN` is not supported. Iceberg tracks renames via field IDs, but cuDF's parquet reader doesn't expose Thrift-level field IDs from the parquet footer. The fix is implemented (field-ID resolution in `detect_columns_and_apply_projection`) but blocked on cuDF metadata access. I might be wrong on this one and need to investigate it more deeply.
- `ALTER TABLE ALTER COLUMN x SET DATA TYPE BIGINT` (was `INT`) causes a cuDF "Type mismatch in columns to concatenate" error because old files have INT32 and new files have INT64. This needs a post-read cast step and might affect other downstream operators.
- `count(*)` with equality deletes: DuckDB optimizes `count(*)` to read row counts from parquet footer metadata, bypassing our delete filter entirely. I have submitted a PR that creates a cpu_source_task for these types of operations.