fix(duckdb): use explicit contract schema columns for CSV/Parquet reads (#1065) #1183

Closed
barry0451 wants to merge 1 commit into datacontract:main from 0451-software:upstream-pr-1065

Conversation

@barry0451
Contributor

Summary

Fixes #1065: the field_is_present check always passes for CSV/Parquet files because the current implementation compares row counts instead of checking actual column overlap. This PR uses an explicit SELECT of the contract schema columns, so columns missing from the data become NULL and field_is_present can catch them.

Changes

  • Modified create_view_with_schema_union in datacontract/engines/soda/connections/duckdb_connection.py
  • Previously used INTERSECT to find columns in both contract and data (missing contract columns were silently ignored)
  • Now explicitly SELECTs contract schema columns by name, so missing data columns become NULL
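
To illustrate the difference between the two approaches described above, here is a minimal sketch in plain Python. It is not the code from this PR: the helper names `intersect_columns` and `explicit_select_list` and the sample column names are hypothetical, and the sketch only builds the column list that a SELECT could use, assuming a missing column is represented as `NULL AS "<name>"`.

```python
def intersect_columns(contract_cols, data_cols):
    """Old behaviour as described: keep only columns present in both.

    A contract column missing from the data simply disappears here,
    so a later presence check never sees it.
    """
    data = set(data_cols)
    return [c for c in contract_cols if c in data]


def explicit_select_list(contract_cols, data_cols):
    """New behaviour as described: name every contract column explicitly.

    Columns absent from the data are emitted as NULL placeholders
    (an assumption made for this sketch), so they stay visible to checks.
    """
    data = set(data_cols)
    parts = []
    for col in contract_cols:
        # Double any embedded quotes so the identifier is safely quoted.
        quoted = '"' + col.replace('"', '""') + '"'
        parts.append(quoted if col in data else f"NULL AS {quoted}")
    return ", ".join(parts)


contract = ["id", "name", "email"]  # hypothetical contract schema
data = ["id", "name"]               # hypothetical CSV header, "email" missing

print(intersect_columns(contract, data))     # ['id', 'name']
print(explicit_select_list(contract, data))  # "id", "name", NULL AS "email"
```

With the intersection, `email` is dropped before any check runs. With the explicit list, it survives as an all-NULL column that a presence check can flag.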

Testing

  • All existing tests pass

…ds (datacontract#1065)

Previously the INTERSECT query selected only columns present in BOTH contract
and data, so missing contract columns were silently ignored — field_is_present
always passed. Now we SELECT explicitly the contract schema columns by name,
so missing data columns become NULL and field_is_present can properly catch them.
@jschoedl
Collaborator

Closing in favour of #1163. Please avoid creating a PR if another one already exists for the same issue, unless the existing one has been stale for a long time.

@jschoedl jschoedl closed this Apr 22, 2026

Development

Successfully merging this pull request may close these issues.

Check field_is_present always passes for CSV/Parquet files

2 participants