fix: detect missing fields in CSV/Parquet files correctly #1163

Open
hieusats wants to merge 1 commit into datacontract:main from hieusats:fix/field-is-present-check

@hieusats

Problem

For CSV and Parquet files accessed via DuckDB (local, S3, GCS, Azure), the check field_is_present always passes regardless of whether the column exists in the file.

Root cause: create_view_with_schema_union() creates an empty table with all contract columns, then inserts only the columns that the contract and the data have in common. Columns missing from the data therefore still exist in the table (filled with NULLs), so SodaCL's "when required column missing" check sees them as present.
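The root cause can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 as a stand-in for DuckDB, and the table name `orders`, the `source_data` table, and the column names are all hypothetical; it only illustrates the mechanism, not the actual code in duckdb_connection.py.

```python
import sqlite3

# Hypothetical contract and data: the file lacks the "email" column.
contract_columns = ["id", "name", "email"]
data_columns = ["id", "name"]

con = sqlite3.connect(":memory:")

# Stand-in for the CSV/Parquet file (sqlite3 replaces DuckDB in this sketch).
con.execute("CREATE TABLE source_data (id INTEGER, name TEXT)")
con.execute("INSERT INTO source_data VALUES (1, 'alice')")

# Step 1: create an empty table with ALL contract columns.
con.execute(f"CREATE TABLE orders ({', '.join(contract_columns)})")

# Step 2: insert only the columns shared by contract and data.
shared = [c for c in contract_columns if c in data_columns]
cols = ", ".join(shared)
con.execute(f"INSERT INTO orders ({cols}) SELECT {cols} FROM source_data")


def column_names(connection, table):
    """Return the column names of a table, in declaration order."""
    return [row[1] for row in connection.execute(f"PRAGMA table_info({table})")]


# The missing column is still part of the table schema, holding only NULLs,
# so a schema-level presence check cannot tell that the file lacks it.
unioned_columns = column_names(con, "orders")
email_values = [row[0] for row in con.execute("SELECT email FROM orders")]
```

Under these assumptions, `email` appears in `unioned_columns` even though the source data never contained it, which is why a schema check against the unioned table passes.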

Fix

  1. Create a {model}_raw view alongside the unioned table that exposes only actual data columns (no contract-only columns)
  2. Use the _raw view for field_is_present checks on CSV/Parquet data, so missing columns are truly absent
  3. Other checks (type, required, unique, constraints) continue using the unioned table as before
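The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: sqlite3 stands in for DuckDB, the `orders` model and `source_data` table are hypothetical, and `field_is_present` here is a simplified helper rather than the project's SodaCL-based check.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Stand-in for the file: it has no "email" column.
con.execute("CREATE TABLE source_data (id INTEGER, name TEXT)")

# Unioned table: contains every contract column, including the missing one.
con.execute("CREATE TABLE orders (id INTEGER, name TEXT, email TEXT)")

# Raw view: exposes only the columns that actually exist in the data.
con.execute("CREATE VIEW orders_raw AS SELECT * FROM source_data")


def column_names(table):
    """Return the column names of a table or view."""
    return [row[1] for row in con.execute(f"PRAGMA table_info({table})")]


def field_is_present(field, model, use_raw_model=False):
    """Simplified presence check; targets the _raw view when requested."""
    target = f"{model}_raw" if use_raw_model else model
    return field in column_names(target)


against_unioned = field_is_present("email", "orders")
against_raw = field_is_present("email", "orders", use_raw_model=True)
```

Checked against the unioned table, `email` looks present; checked against the `_raw` view, it is correctly reported as absent, while columns that do exist in the data are still found.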

Files changed

  • duckdb_connection.py: Create {model}_raw view in create_view_with_schema_union()
  • data_contract_checks.py: Pass use_raw_model=True for field_is_present on csv/parquet
  • tests/test_test_schema_evolution.py: Update tests to expect correct behavior

All 10 schema evolution tests pass.

Closes #1065

…ct#1065)

field_is_present check always passed for CSV/Parquet because
create_view_with_schema_union creates a table with ALL contract columns
(missing ones filled with NULLs), making SodaCL schema check see the
column as present.

Fix: Create a _raw view that exposes actual data columns (without
contract-only columns), and use it for field_is_present checks on
CSV/Parquet data.

- duckdb_connection.py: Create {model}_raw view alongside unioned table
- data_contract_checks.py: Use _raw view for field_is_present on csv/parquet
- tests: Update schema evolution tests to expect correct behavior
@bfgoh

bfgoh commented Apr 20, 2026

@hieusats thank you so much for working on this. I need this for my project!

Collaborator

@jschoedl left a comment


Thanks! One minor note, and please add a CHANGELOG entry noting the breaking change.


-def check_property_is_present(model_name, field_name, quoting_config: QuotingConfig = QuotingConfig()) -> Check:
+def check_property_is_present(
+    model_name, field_name, quoting_config: QuotingConfig = QuotingConfig(), use_raw_model: bool = False
Collaborator


To me, use_raw_model is not quite self-explanatory: I need to read the code to understand what it does. What about adding a parameter view_name: str = None instead, which falls back to model_name when it is None?
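One way to read this suggestion is sketched below. The helper name `resolve_check_target` is hypothetical and not part of the PR; it only shows the proposed fallback, where the caller names the view explicitly and the model name is used when no view is given.

```python
from typing import Optional


def resolve_check_target(model_name: str, view_name: Optional[str] = None) -> str:
    """Return the table or view a check should run against.

    Hypothetical helper: falls back to model_name when view_name is None.
    """
    return model_name if view_name is None else view_name


# Default: the check targets the model (the unioned table).
default_target = resolve_check_target("orders")

# Explicit: the caller points the check at the raw view.
raw_target = resolve_check_target("orders", view_name="orders_raw")
```

Compared with a boolean flag, this keeps the "_raw" naming decision at the call site, so the check function does not need to know how the view name is derived.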

# Also create a raw view for field_is_present checks, so missing columns
# are actually absent (not filled with NULLs from the unioned table).
# See https://github.com/datacontract/datacontract-cli/issues/1065
raw_view_name = f"{model_name}_raw"

The field_is_present bug affects JSON too. Line 73's read_json_auto(..., columns=...) fills missing columns with NULL, so the check still passes there. Maybe fix here as well, or as a follow-up?

Collaborator


Good catch. @hieusats You can include this if you like; the required change looks quite similar. But feel free to just resolve this as a follow-up if it's getting complicated.

@github-actions

This PR has been inactive for 30 days. It will be closed in 14 days if there is no further activity. Feel free to reopen if you'd like to continue working on it.

@github-actions bot added the stale label ("Issue was created, but no updates for a long time.") on May 25, 2026

Labels

stale ("Issue was created, but no updates for a long time."), waiting-for-response

Development

Successfully merging this pull request may close these issues.

Check field_is_present always passes for CSV/Parquet files

4 participants