feat: support for relative (percent) metrics in data contract checks.#1248
feat: support for relative (percent) metrics in data contract checks.#1248HenrikSpiegel wants to merge 4 commits into
Conversation
| return None | ||
| threshold_suffix = "%" if quality.unit == "percent" else "" | ||
|
|
||
| if quality.mustBe is not None: |
There was a problem hiding this comment.
discussion:
I initially implemented this going for minimal changes and ensuring the changes matched the style of the repository.
I would however personally prefer to re-write the function using a operator mapping:
arg_to_op = {
'mustBe' : '=',
'mustNotBe': '!='
}
for arg, op in arg_to_op.items():
if val := getattr(arg, quality) is not none:
return f"{op} {val}{threshold_suffix}"
There was a problem hiding this comment.
I get your point, but this won't work out for the 'between' cases, right?
The reason for this repetitive style is that it is easier to understand the code with all relevant cases, whereas a mapping as you proposed has more mental overhead to read. Of course this is a trade-off, since copying things like the {threshold_suffix} you added can be error-prone as well. But in this case, I'd vote for keeping it as is.
jschoedl
left a comment
There was a problem hiding this comment.
Thank you for the PR! A few notes, and please add test cases for this in the tests directory - e.g.:
- test that
percentreally appends% - test that no
%exists for the default case orrows - test that e.g.
check_property_null_valueswith percent produces the correct sodacl check - test that a percent sign in
check_row_countbehaves correctly
| if "%" in threshold: | ||
| logger.warning("Row count threshold cannot be specified as a percentage.") |
There was a problem hiding this comment.
After the warning, we should not create the check (Soda would ignore the percent sign and only use the integer, which is not what the user is intending)
| f"Unsupported quality.unit ={quality.unit} in quality check, must be 'rows' or 'percent' or None" | ||
| ) |
There was a problem hiding this comment.
| f"Unsupported quality.unit ={quality.unit} in quality check, must be 'rows' or 'percent' or None" | |
| ) | |
| f"Unsupported quality.unit '{quality.unit}' in quality check, must be 'rows' (default) or 'percent'" | |
| ) |
None basically means that the value is not present and 'rows' will be used.
| if quality.unit is not None and quality.unit not in ("rows", "percent"): | ||
| logger.warning( |
There was a problem hiding this comment.
| if quality.unit is not None and quality.unit not in ("rows", "percent"): | |
| logger.warning( | |
| if quality.unit is not None and quality.unit.lower() not in ("rows", "percent"): |
| threshold_suffix = "%" if quality.unit == "percent" else "" | ||
|
|
There was a problem hiding this comment.
| threshold_suffix = "%" if quality.unit == "percent" else "" | |
| threshold_suffix = "%" if quality.unit.lower() == "percent" else "" | |
| key=check_key, | ||
| category="quality", | ||
| type=check_type, | ||
| name=f"Check that model {model_name} has duplicate_count {threshold} for columns {col_joined}", |
There was a problem hiding this comment.
This still includes duplicate_count. (5 similar locations need to be fixed)
uv run pytest)uv run ruff check --fix && uv run ruff format)#1228