fix: detect missing fields in CSV/Parquet files correctly#1163
Conversation
…ct#1065) field_is_present check always passed for CSV/Parquet because create_view_with_schema_union creates a table with ALL contract columns (missing ones filled with NULLs), making SodaCL schema check see the column as present. Fix: Create a _raw view that exposes actual data columns (without contract-only columns), and use it for field_is_present checks on CSV/Parquet data. - duckdb_connection.py: Create {model}_raw view alongside unioned table - data_contract_checks.py: Use _raw view for field_is_present on csv/parquet - tests: Update schema evolution tests to expect correct behavior
|
@hieusats thank you so much for working on this. I need this for my project! |
jschoedl
left a comment
There was a problem hiding this comment.
Thanks! One minor note, and please add a CHANGELOG entry noting the breaking change.
|
|
||
| def check_property_is_present(model_name, field_name, quoting_config: QuotingConfig = QuotingConfig()) -> Check: | ||
| def check_property_is_present( | ||
| model_name, field_name, quoting_config: QuotingConfig = QuotingConfig(), use_raw_model: bool = False |
There was a problem hiding this comment.
To me, use_raw_model is not quite self-explanatory - I need to read the code to understand what it does. What about adding a parameter view_name: str = None instead, which gets set to model_name on None?
| # Also create a raw view for field_is_present checks, so missing columns | ||
| # are actually absent (not filled with NULLs from the unioned table). | ||
| # See https://github.com/datacontract/datacontract-cli/issues/1065 | ||
| raw_view_name = f"{model_name}_raw" |
There was a problem hiding this comment.
The field_is_present bug affects JSON too. Line 73's read_json_auto(..., columns=...) fills missing columns with NULL, so the check still passes there. Maybe fix here as well, or as a follow-up?
There was a problem hiding this comment.
Good catch. @hieusats You can include this if you like, the required change looks quite similar. But feel free to just resolve this as a follow-up if its getting complicated.
|
This PR has been inactive for 30 days. It will be closed in 14 days if there is no further activity. Feel free to reopen if you'd like to continue working on it. |
Problem
For CSV and Parquet files accessed via DuckDB (local, S3, GCS, Azure), the check
field_is_presentalways passes regardless of whether the column exists in the file.Root cause:
create_view_with_schema_union()creates an empty table with all contract columns, then inserts only intersecting columns from the data. Missing columns remain in the table (filled with NULLs), so SodaCL'swhen required column missingcheck sees the column as present.Fix
{model}_rawview alongside the unioned table that exposes only actual data columns (no contract-only columns)_rawview forfield_is_presentchecks on CSV/Parquet data, so missing columns are truly absentFiles changed
duckdb_connection.py: Create{model}_rawview increate_view_with_schema_union()data_contract_checks.py: Passuse_raw_model=Trueforfield_is_presenton csv/parquettests/test_test_schema_evolution.py: Update tests to expect correct behaviorAll 10 schema evolution tests pass.
Closes #1065