fix: detect missing fields in CSV/Parquet files correctly by hieusats · Pull Request #1163 · datacontract/datacontract-cli

hieusats · 2026-04-18T03:27:19Z

Problem

For CSV and Parquet files accessed via DuckDB (local, S3, GCS, Azure), the check field_is_present always passes regardless of whether the column exists in the file.

Root cause: create_view_with_schema_union() creates an empty table with all contract columns, then inserts only the columns that exist in both the contract and the data. Columns missing from the data remain in the table (filled with NULLs), so SodaCL's "when required column missing" check sees the column as present.
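To make the root cause concrete, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for DuckDB. The table names (source_data, orders) and columns (id, email) are hypothetical, and the code mirrors the steps described above rather than the actual create_view_with_schema_union() implementation.

```python
import sqlite3

# sqlite3 stands in for DuckDB here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")

# The "file": it only has an id column, while the contract also expects email.
conn.execute("CREATE TABLE source_data (id INTEGER)")
conn.executemany("INSERT INTO source_data VALUES (?)", [(1,), (2,)])

contract_columns = ["id", "email"]
data_columns = [row[1] for row in conn.execute("PRAGMA table_info(source_data)")]

# Step 1: create an empty table with every contract column.
conn.execute("CREATE TABLE orders (id INTEGER, email TEXT)")

# Step 2: insert only the columns present in both the contract and the data.
shared = [c for c in contract_columns if c in data_columns]
cols = ", ".join(shared)
conn.execute(f"INSERT INTO orders ({cols}) SELECT {cols} FROM source_data")

# The missing column still exists in the unioned table, filled with NULLs,
# so a schema-level presence check cannot tell that the file lacks it.
unioned_columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
emails = [row[0] for row in conn.execute("SELECT email FROM orders")]

print(unioned_columns)  # ['id', 'email']
print(emails)           # [None, None]
```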

Fix

- Create a {model}_raw view alongside the unioned table that exposes only actual data columns (no contract-only columns)
- Use the _raw view for field_is_present checks on CSV/Parquet data, so missing columns are truly absent
- Other checks (type, required, unique, constraints) continue using the unioned table as before
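The idea behind the fix can be sketched as follows, again with sqlite3 standing in for DuckDB. Only the {model}_raw naming comes from the PR; the helper functions column_names and field_is_present, and the orders/source_data/email names, are hypothetical and are not part of datacontract-cli.

```python
import sqlite3

# sqlite3 stands in for DuckDB; helper and table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_data (id INTEGER)")
conn.execute("INSERT INTO source_data VALUES (1)")

model_name = "orders"

# Unioned table: all contract columns, missing ones end up as NULL.
conn.execute(f"CREATE TABLE {model_name} (id INTEGER, email TEXT)")
conn.execute(f"INSERT INTO {model_name} (id) SELECT id FROM source_data")

# Raw view: exposes only the columns that actually exist in the data.
raw_view_name = f"{model_name}_raw"
conn.execute(f"CREATE VIEW {raw_view_name} AS SELECT * FROM source_data")


def column_names(relation: str) -> list:
    """Return the column names of a table or view."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({relation})")]


def field_is_present(relation: str, field: str) -> bool:
    """Presence check against the schema of the given relation."""
    return field in column_names(relation)


present_in_unioned = field_is_present(model_name, "email")   # contract-only column looks present
present_in_raw = field_is_present(raw_view_name, "email")    # truly absent in the data

print(present_in_unioned, present_in_raw)  # True False
```

Keeping the unioned table for the other checks means type, required, and uniqueness checks still see every contract column, while only the presence check is pointed at the raw view.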

Files changed

- duckdb_connection.py: Create {model}_raw view in create_view_with_schema_union()
- data_contract_checks.py: Pass use_raw_model=True for field_is_present on csv/parquet
- tests/test_test_schema_evolution.py: Update tests to expect correct behavior

All 10 schema evolution tests pass.

Closes #1065

…ct#1065)

field_is_present check always passed for CSV/Parquet because create_view_with_schema_union creates a table with ALL contract columns (missing ones filled with NULLs), making SodaCL schema check see the column as present.

Fix: Create a _raw view that exposes actual data columns (without contract-only columns), and use it for field_is_present checks on CSV/Parquet data.

- duckdb_connection.py: Create {model}_raw view alongside unioned table
- data_contract_checks.py: Use _raw view for field_is_present on csv/parquet
- tests: Update schema evolution tests to expect correct behavior

bfgoh · 2026-04-20T07:41:56Z

@hieusats thank you so much for working on this. I need this for my project!

jschoedl

Thanks! One minor note, and please add a CHANGELOG entry noting the breaking change.

jschoedl · 2026-04-20T08:14:18Z


-def check_property_is_present(model_name, field_name, quoting_config: QuotingConfig = QuotingConfig()) -> Check:
+def check_property_is_present(
+    model_name, field_name, quoting_config: QuotingConfig = QuotingConfig(), use_raw_model: bool = False


To me, use_raw_model is not quite self-explanatory: I need to read the code to understand what it does. What about adding a parameter view_name: str = None instead, which falls back to model_name when it is None?
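A rough sketch of what the suggested signature could look like is shown below. It is deliberately simplified: the function name check_property_is_present_sketch and the returned dict are hypothetical, and QuotingConfig and the real Check return type are left out, so this only illustrates the view_name fallback and not the project's actual code.

```python
from typing import Optional


def check_property_is_present_sketch(
    model_name: str, field_name: str, view_name: Optional[str] = None
) -> dict:
    """Simplified stand-in for check_property_is_present (hypothetical).

    view_name names the relation the presence check should query;
    when it is None, the check falls back to model_name.
    """
    relation = model_name if view_name is None else view_name
    return {"model": model_name, "field": field_name, "relation": relation}


# Default behavior: the check runs against the model itself.
default_check = check_property_is_present_sketch("orders", "email")

# CSV/Parquet case: the caller points the check at the raw view explicitly.
raw_check = check_property_is_present_sketch("orders", "email", view_name="orders_raw")

print(default_check["relation"])  # orders
print(raw_check["relation"])      # orders_raw
```

With this shape the caller decides which relation to inspect, so the function does not need to know about the _raw naming convention.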

eyupcanakman · 2026-04-21T21:45:36Z

+        # Also create a raw view for field_is_present checks, so missing columns
+        # are actually absent (not filled with NULLs from the unioned table).
+        # See https://github.com/datacontract/datacontract-cli/issues/1065
+        raw_view_name = f"{model_name}_raw"


The field_is_present bug affects JSON too. Line 73's read_json_auto(..., columns=...) fills missing columns with NULL, so the check still passes there. Maybe fix here as well, or as a follow-up?

Good catch. @hieusats You can include this if you like; the required change looks quite similar. But feel free to just resolve this as a follow-up if it's getting complicated.
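For the JSON case, the underlying question is the same: which contract fields never appear in the data at all? The sketch below answers that with plain Python on newline-delimited JSON. It does not use DuckDB or read_json_auto, and the helper missing_fields and the sample records are hypothetical; it only illustrates the distinction between an absent key and a NULL value, not how the project would implement the follow-up.

```python
import json


def missing_fields(json_lines: str, contract_fields: list) -> list:
    """Return contract fields that appear as a key in none of the records."""
    seen = set()
    for line in json_lines.splitlines():
        if line.strip():
            seen.update(json.loads(line).keys())
    return [field for field in contract_fields if field not in seen]


# Hypothetical newline-delimited JSON: "email" never occurs as a key.
sample = '{"id": 1}\n{"id": 2, "name": null}\n'
missing = missing_fields(sample, ["id", "name", "email"])

print(missing)  # ['email']
```

Note that name counts as present even though its only value is null: the key exists in the data, which is exactly the difference that is lost once missing columns are filled with NULL.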

github-actions · 2026-05-25T12:33:50Z

This PR has been inactive for 30 days. It will be closed in 14 days if there is no further activity. Feel free to reopen if you'd like to continue working on it.

jschoedl mentioned this pull request Apr 20, 2026

Check field_is_present always passes for CSV/Parquet files #1065

Open

jschoedl requested changes Apr 20, 2026

eyupcanakman reviewed Apr 21, 2026

jschoedl added the waiting-for-response label Apr 22, 2026

This was referenced Apr 22, 2026

fix(duckdb): use explicit contract schema columns for CSV/Parquet reads (#1065) #1183

Closed

fix(soda): detect missing columns in CSV/Parquet files via field_is_present #1185

Closed

github-actions Bot added the stale label (Issue was created, but no updates for a long time.) May 25, 2026

hieusats wants to merge 1 commit into datacontract:main from hieusats:fix/field-is-present-check

4 participants
