fix(soda): detect missing columns in CSV/Parquet files via field_is_present by armorbreak001 · Pull Request #1185 · datacontract/datacontract-cli

armorbreak001 · 2026-04-21T04:51:25Z

Summary

Fixes #1065 — field_is_present always passes for CSV/Parquet files even when the column is missing from the data.

Root Cause

In create_view_with_schema_union(), when reading CSV/Parquet files via DuckDB, the table is created with every column from the datacontract schema, even columns that don't exist in the actual data file. Missing columns are filled with NULL values. Soda's "when required column missing" check only verifies that the column exists in the table, so it always passes, regardless of whether the column was actually in the source file.
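To illustrate the failure mode, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for DuckDB (an assumption made only to keep the example dependency-free). The table name, column names, and rows are hypothetical and not taken from the project's code.

```python
import sqlite3

# Hypothetical contract schema and source-file rows (illustrative only).
contract_columns = ["id", "name", "email"]
source_rows = [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}]  # no "email"

conn = sqlite3.connect(":memory:")

# The table is created from the contract schema, not from the file.
column_defs = ", ".join(f"{c} TEXT" for c in contract_columns)
conn.execute(f"CREATE TABLE orders ({column_defs})")

# Only the columns present in the source are inserted; the rest become NULL.
for row in source_rows:
    cols = list(row)
    placeholders = ", ".join("?" for _ in cols)
    conn.execute(
        f"INSERT INTO orders ({', '.join(cols)}) VALUES ({placeholders})",
        [row[c] for c in cols],
    )

# An existence-only check sees "email" in the table and passes.
table_columns = [r[1] for r in conn.execute("PRAGMA table_info(orders)")]
column_exists = "email" in table_columns

# The values show that the column was never in the source data.
email_values = [r[0] for r in conn.execute("SELECT email FROM orders")]
```

Here `column_exists` is true even though every `email` value is NULL, which is the situation described above: a check that only looks at the table's columns cannot tell a padded column from a real one.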

Fix

Two coordinated changes:

duckdb_connection.py: After inserting data into the schema-based table, drop any columns that don't exist in the source data file. This ensures Soda's field_is_present check correctly detects missing columns.
check_soda_execute.py: Before running Soda scans, pre-filter constraint/quality checks (e.g., field_required, field_minimum) that reference dropped columns. These are marked as failed with reason "Column X not found in data file" and excluded from the SodaCL YAML to prevent SQL errors. field_is_present checks are intentionally left in so Soda can evaluate them normally (and correctly report them as failed).
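The two changes above can be sketched in plain Python. This is a hedged illustration under stated assumptions, not the code in the PR: the `Check` dataclass and the helpers `columns_to_drop` and `prefilter_checks` are hypothetical names, and the real implementation operates on the project's own check objects and the DuckDB connection.

```python
from dataclasses import dataclass


@dataclass
class Check:
    # Hypothetical, simplified stand-in for a datacontract check.
    type: str
    field: str
    result: str = "pending"
    reason: str = ""


def columns_to_drop(schema_columns, source_columns):
    """Return schema columns that are absent from the source data file."""
    present = set(source_columns)
    return [c for c in schema_columns if c not in present]


def prefilter_checks(checks, dropped):
    """Fail constraint checks on dropped columns; return the checks to run.

    field_is_present checks are kept so the scan itself reports them.
    """
    dropped = set(dropped)
    runnable = []
    for check in checks:
        if check.field in dropped and check.type != "field_is_present":
            check.result = "failed"
            check.reason = f"Column {check.field} not found in data file"
        else:
            runnable.append(check)
    return runnable


schema = ["id", "name", "email"]
source = ["id", "name"]
dropped = columns_to_drop(schema, source)

checks = [
    Check("field_is_present", "email"),
    Check("field_required", "email"),
    Check("field_minimum", "id"),
]
runnable = prefilter_checks(checks, dropped)
```

In this sketch only `field_required` on `email` is marked as failed up front; `field_is_present` on `email` and `field_minimum` on `id` remain in the list passed on to the scan.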

Test Updates

Updated 2 tests to reflect the corrected behavior:

test_csv_optional_field_missing_from_old_data — now expects field_is_present failure for missing optional field
test_parquet_optional_field_missing_from_old_data — same for Parquet

All 10 schema evolution tests pass.

…resent

When using DuckDB to read CSV/Parquet files, create_view_with_schema_union() creates all columns from the datacontract schema even if they don't exist in the data file (filling missing ones with NULLs). This caused field_is_present to always pass since Soda's 'when required column missing' check only verifies the column exists in the table.

Fix: After inserting data, drop columns that aren't present in the source file so that field_is_present correctly detects them as missing. Constraint checks for dropped columns are pre-filtered to avoid SQL errors while still reporting them as failed.

Updates tests to reflect the new correct behavior: missing optional fields now fail field_is_present as expected.

Fixes datacontract#1065

armorbreak001 · 2026-04-22T01:52:54Z

Update: PyPI now has Linux x86_64 wheels for pyarrow==24.0.0. The CI failure was likely a temporary resolution issue. Triggered a CI rerun — if it still fails, the root cause may be different (e.g., uv.lock pinning an incompatible transitive dependency).

jschoedl · 2026-04-22T10:28:23Z

Closing in favour of #1163. Please avoid creating a PR if another one already exists for the issue, unless the existing one has been stale for a long time.

armorbreak001 added 2 commits April 22, 2026 06:34

style: fix ruff linting and formatting

7de485b

armorbreak001 force-pushed the fix/1065-field-is-present-csv-parquet branch from c0fca17 to 7de485b Compare April 21, 2026 22:34

ci: trigger CI rerun

11f9815

jschoedl closed this Apr 22, 2026

armorbreak001 wants to merge 3 commits into datacontract:main from armorbreak001:fix/1065-field-is-present-csv-parquet
