fix(soda): detect missing columns in CSV/Parquet files via field_is_present #1185

Closed
armorbreak001 wants to merge 3 commits into datacontract:main from armorbreak001:fix/1065-field-is-present-csv-parquet

@armorbreak001
Contributor

Summary

Fixes #1065: field_is_present always passes for CSV/Parquet files, even when the column is missing from the data.

Root Cause

In create_view_with_schema_union(), when reading CSV/Parquet files via DuckDB, all columns from the datacontract schema are created in the table (even if they don't exist in the actual data file). Missing columns are filled with NULL values. Soda's "when required column missing" check only verifies column existence in the table, so it always passes, regardless of whether the column was actually in the source file.
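A minimal sketch of the root cause, not the project's actual code. It uses Python's built-in sqlite3 as a stand-in for DuckDB so it runs without extra dependencies. The table name, the contract columns, the CSV content, and the helper names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical contract schema and CSV payload, for illustration only.
CONTRACT_COLUMNS = ["id", "name", "email"]
CSV_TEXT = "id,name\n1,alice\n2,bob\n"  # note: no "email" column


def load_with_schema_union(conn, table, contract_columns, csv_text):
    """Create the table from the contract schema, then insert CSV rows by name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    file_columns = list(reader.fieldnames or [])

    # Every contract column is created, whether or not the file has it.
    column_defs = ", ".join(c + " TEXT" for c in contract_columns)
    conn.execute("CREATE TABLE " + table + " (" + column_defs + ")")

    # Only columns shared by contract and file receive values; the rest stay NULL.
    shared = [c for c in contract_columns if c in file_columns]
    placeholders = ", ".join("?" for _ in shared)
    conn.executemany(
        "INSERT INTO " + table + " (" + ", ".join(shared) + ") VALUES (" + placeholders + ")",
        [[row[c] for c in shared] for row in reader],
    )
    return file_columns


def table_columns(conn, table):
    """Column names as the database sees them (what an existence check inspects)."""
    return [row[1] for row in conn.execute("PRAGMA table_info(" + table + ")")]


conn = sqlite3.connect(":memory:")
file_columns = load_with_schema_union(conn, "orders", CONTRACT_COLUMNS, CSV_TEXT)

email_in_table = "email" in table_columns(conn, "orders")  # existence check on the table
email_in_file = "email" in file_columns                    # what the check should reflect
email_values = [r[0] for r in conn.execute("SELECT email FROM orders")]
```

Here the existence check against the table succeeds although the file never contained `email`, and every `email` value is NULL.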

Fix

Two coordinated changes:

  1. duckdb_connection.py: After inserting data into the schema-based table, drop any columns that don't exist in the source data file. This ensures Soda's field_is_present check correctly detects missing columns.

  2. check_soda_execute.py: Before running Soda scans, pre-filter constraint/quality checks (e.g., field_required, field_minimum) that reference dropped columns. These are marked as failed with reason "Column X not found in data file" and excluded from the SodaCL YAML to prevent SQL errors. field_is_present checks are intentionally left in so Soda can evaluate them normally (and correctly report them as failed).
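The two changes above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the `Check` dataclass, its attributes, and both helper functions are hypothetical names, and the column drop itself (an `ALTER TABLE ... DROP COLUMN` statement in DuckDB) is only indicated in a comment.

```python
from dataclasses import dataclass


@dataclass
class Check:
    # Hypothetical stand-in for a contract check; not the project's real class.
    type: str          # e.g. "field_is_present", "field_required", "field_minimum"
    field: str
    result: str = "pending"
    reason: str = ""


def columns_missing_from_file(contract_columns, file_columns):
    """Contract columns that the source file does not contain."""
    present = set(file_columns)
    # In the real fix each of these would be dropped from the table,
    # e.g. via ALTER TABLE <table> DROP COLUMN <column>.
    return [c for c in contract_columns if c not in present]


def prefilter_checks(checks, dropped_columns):
    """Split checks into those Soda should run and those failed up front."""
    dropped = set(dropped_columns)
    to_run, pre_failed = [], []
    for check in checks:
        if check.field in dropped and check.type != "field_is_present":
            # Running this check would reference a dropped column and break the SQL.
            check.result = "failed"
            check.reason = f"Column {check.field} not found in data file"
            pre_failed.append(check)
        else:
            # field_is_present stays in, so Soda reports the missing column itself.
            to_run.append(check)
    return to_run, pre_failed


dropped = columns_missing_from_file(["id", "name", "email"], ["id", "name"])
checks = [
    Check("field_is_present", "email"),
    Check("field_required", "email"),
    Check("field_minimum", "id"),
]
to_run, pre_failed = prefilter_checks(checks, dropped)
```

With this input, only the `field_required` check on `email` is failed in advance; the `field_is_present` check on `email` and the `field_minimum` check on `id` are still handed to the scan.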

Test Updates

Updated 2 tests to reflect the corrected behavior:

  • test_csv_optional_field_missing_from_old_data — now expects field_is_present failure for missing optional field
  • test_parquet_optional_field_missing_from_old_data — same for Parquet

All 10 schema evolution tests pass.
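As a rough illustration of why the expectations in those two tests flip, the sketch below compares the table columns before and after the change and evaluates a simplified field_is_present on each. All function names and the sample columns are hypothetical; the real tests run the full datacontract test pipeline.

```python
def table_columns_before_fix(contract_columns, file_columns):
    # Previous behaviour: every contract column exists in the table.
    return list(contract_columns)


def table_columns_after_fix(contract_columns, file_columns):
    # New behaviour: contract columns absent from the file are dropped.
    present = set(file_columns)
    return [c for c in contract_columns if c in present]


def field_is_present(table_columns, field):
    # Simplified stand-in for the Soda column-existence check.
    return "passed" if field in table_columns else "failed"


contract_columns = ["id", "name", "nickname"]  # "nickname" is an optional field
old_file_columns = ["id", "name"]              # old data written before "nickname" existed

before = field_is_present(
    table_columns_before_fix(contract_columns, old_file_columns), "nickname"
)
after = field_is_present(
    table_columns_after_fix(contract_columns, old_file_columns), "nickname"
)
```

Under these assumptions `before` is "passed" and `after` is "failed", which matches the corrected expectation for a missing optional field.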

…resent

When using DuckDB to read CSV/Parquet files, create_view_with_schema_union()
creates all columns from the datacontract schema even if they don't exist in
the data file (filling missing ones with NULLs). This caused field_is_present
to always pass since Soda's 'when required column missing' check only verifies
the column exists in the table.

Fix: After inserting data, drop columns that aren't present in the source file
so that field_is_present correctly detects them as missing. Constraint checks
for dropped columns are pre-filtered to avoid SQL errors while still reporting
them as failed.

Updates tests to reflect the new correct behavior: missing optional fields
now fail field_is_present as expected.

Fixes datacontract#1065
@armorbreak001 armorbreak001 force-pushed the fix/1065-field-is-present-csv-parquet branch from c0fca17 to 7de485b Compare April 21, 2026 22:34
@armorbreak001
Contributor Author

armorbreak001 commented Apr 22, 2026

Update: PyPI now has Linux x86_64 wheels for pyarrow==24.0.0. The CI failure was likely a temporary resolution issue. Triggered a CI rerun — if it still fails, the root cause may be different (e.g., uv.lock pinning an incompatible transitive dependency).

@jschoedl
Collaborator

Closing in favour of #1163. Please avoid creating a PR if another one already exists for the same issue, unless the existing one has been stale for a long time.

@jschoedl jschoedl closed this Apr 22, 2026
Development

Successfully merging this pull request may close these issues.

Check field_is_present always passes for CSV/Parquet files