fix(soda): detect missing columns in CSV/Parquet files via field_is_present #1185
Closed
armorbreak001 wants to merge 3 commits into
…resent

When using DuckDB to read CSV/Parquet files, create_view_with_schema_union() creates all columns from the datacontract schema even if they don't exist in the data file (filling missing ones with NULLs). This caused field_is_present to always pass, since Soda's 'when required column missing' check only verifies that the column exists in the table.

Fix: After inserting data, drop columns that aren't present in the source file so that field_is_present correctly detects them as missing. Constraint checks for dropped columns are pre-filtered to avoid SQL errors while still being reported as failed.

Updates tests to reflect the new correct behavior: missing optional fields now fail field_is_present as expected.

Fixes datacontract#1065
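The behavior described in this commit message can be reproduced in a few lines. The sketch below is not the project's code: it uses sqlite3 from the Python standard library as a stand-in for DuckDB so that it runs without extra dependencies, and the table name, column names, and rows are all hypothetical. It only illustrates why an existence check against a contract-shaped table cannot tell a missing column from a present one.

```python
import sqlite3

# Columns declared in a hypothetical data contract schema.
contract_columns = ["id", "name", "email"]
# Columns actually present in a hypothetical source file: "email" is missing.
file_columns = ["id", "name"]
file_rows = [(1, "Ada"), (2, "Grace")]

con = sqlite3.connect(":memory:")
# The table is created from the contract, not from the file.
con.execute(f"CREATE TABLE orders ({', '.join(contract_columns)})")
# Only the file's columns are inserted; every other column defaults to NULL.
placeholders = ", ".join("?" for _ in file_columns)
con.executemany(
    f"INSERT INTO orders ({', '.join(file_columns)}) VALUES ({placeholders})",
    file_rows,
)

# An existence check against the table sees every contract column ...
table_columns = [row[1] for row in con.execute("PRAGMA table_info(orders)")]
# ... even though "email" holds nothing but NULLs.
email_values = [row[0] for row in con.execute("SELECT email FROM orders")]

# Comparing the contract against the file, not the table, reveals the gap.
missing_in_file = [c for c in contract_columns if c not in file_columns]
```

Comparing against the file's own column list, rather than the table built from the contract, is what makes the missing column visible.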
c0fca17 to 7de485b
Contributor
Author
Update: PyPI now has Linux x86_64 wheels for
Collaborator
Closing in favour of #1163. Please avoid opening a PR when another one already exists for the same issue, unless the existing one has been stale for a long time.
Summary
Fixes #1065: field_is_present always passes for CSV/Parquet files even when the column is missing from the data.
Root Cause
In create_view_with_schema_union(), when reading CSV/Parquet files via DuckDB, all columns from the datacontract schema are created in the table, even if they don't exist in the actual data file. Missing columns are filled with NULL values. Soda's 'when required column missing' check only verifies that the column exists in the table, so it always passes, regardless of whether the column was actually in the source file.
Fix
Two coordinated changes:
duckdb_connection.py: After inserting data into the schema-based table, drop any columns that don't exist in the source data file. This ensures Soda's field_is_present check correctly detects missing columns.
check_soda_execute.py: Before running Soda scans, pre-filter constraint/quality checks (e.g., field_required, field_minimum) that reference dropped columns. These are marked as failed with reason "Column X not found in data file" and excluded from the SodaCL YAML to prevent SQL errors. field_is_present checks are intentionally left in so Soda can evaluate them normally (and correctly report them as failed).
Test Updates
Updated 2 tests to reflect the corrected behavior:
test_csv_optional_field_missing_from_old_data: now expects a field_is_present failure for the missing optional field
test_parquet_optional_field_missing_from_old_data: same for Parquet
All 10 schema evolution tests pass.
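The two coordinated changes described under Fix can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from duckdb_connection.py or check_soda_execute.py: the Check dataclass, the helper names, and the sample columns are hypothetical, and the ALTER TABLE statements are only built as strings, not executed against a database.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """Hypothetical stand-in for a generated quality check."""
    type: str            # e.g. "field_is_present", "field_required"
    field: str           # column the check refers to
    result: str = "pending"
    reason: str = ""


def columns_to_drop(contract_columns, file_columns):
    """Contract columns that are absent from the source file."""
    present = set(file_columns)
    return [c for c in contract_columns if c not in present]


def drop_statements(table, dropped):
    """One DROP COLUMN statement per column missing from the file."""
    return [f'ALTER TABLE "{table}" DROP COLUMN "{c}"' for c in dropped]


def prefilter_checks(checks, dropped):
    """Fail checks on dropped columns up front; keep field_is_present.

    Returns the checks that should still be sent to the scan.
    """
    dropped_set = set(dropped)
    runnable = []
    for check in checks:
        if check.field in dropped_set and check.type != "field_is_present":
            check.result = "failed"
            check.reason = f"Column {check.field} not found in data file"
        else:
            runnable.append(check)
    return runnable


# Usage with hypothetical columns: "email" is in the contract but not the file.
checks = [
    Check("field_is_present", "email"),
    Check("field_required", "email"),
    Check("field_minimum", "id"),
]
dropped = columns_to_drop(["id", "name", "email"], ["id", "name"])
statements = drop_statements("orders", dropped)
runnable = prefilter_checks(checks, dropped)
```

In this sketch the field_is_present check on the dropped column stays in the runnable list, mirroring the PR's intent that the scan itself reports it as failed, while the field_required check is failed in advance so that it never produces SQL against a column that no longer exists.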