Skip to content

feat: support for relative (percent) metrics in data contract checks.#1248

Open
HenrikSpiegel wants to merge 4 commits into
datacontract:mainfrom
HenrikSpiegel:main
Open

feat: support for relative (percent) metrics in data contract checks.#1248
HenrikSpiegel wants to merge 4 commits into
datacontract:mainfrom
HenrikSpiegel:main

Conversation

@HenrikSpiegel
Copy link
Copy Markdown

  • Tests pass (uv run pytest)
  • Code formatted (uv run ruff check --fix && uv run ruff format)
  • README.md updated (if relevant)
  • CHANGELOG.md entry added

#1228

return None
threshold_suffix = "%" if quality.unit == "percent" else ""

if quality.mustBe is not None:
Copy link
Copy Markdown
Author

@HenrikSpiegel HenrikSpiegel May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussion:

I initially implemented this going for minimal changes and ensuring the changes matched the style of the repository.

I would however personally prefer to re-write the function using a operator mapping:

arg_to_op = {
  'mustBe' : '=',
  'mustNotBe': '!='
}
for arg, op in arg_to_op.items():
  if val := getattr(arg, quality) is not none:
    return f"{op} {val}{threshold_suffix}"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point, but this won't work out for the 'between' cases, right?
The reason for this repetitive style is that it is easier to understand the code with all relevant cases, whereas a mapping as you proposed has more mental overhead to read. Of course this is a trade-off, since copying things like the {threshold_suffix} you added can be error-prone as well. But in this case, I'd vote for keeping it as is.

Copy link
Copy Markdown
Collaborator

@jschoedl jschoedl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! A few notes, and please add test cases for this in the tests directory - e.g.:

  • test that percent really appends %
  • test that no % exists for the default case or rows
  • test that e.g. check_property_null_values with percent produces the correct sodacl check
  • test that a percent sign in check_row_count behaves correctly

Comment on lines +561 to +562
if "%" in threshold:
logger.warning("Row count threshold cannot be specified as a percentage.")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the warning, we should not create the check (Soda would ignore the percent sign and only use the integer, which is not what the user is intending)

Comment on lines +933 to +934
f"Unsupported quality.unit ={quality.unit} in quality check, must be 'rows' or 'percent' or None"
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"Unsupported quality.unit ={quality.unit} in quality check, must be 'rows' or 'percent' or None"
)
f"Unsupported quality.unit '{quality.unit}' in quality check, must be 'rows' (default) or 'percent'"
)

None basically means that the value is not present and 'rows' will be used.

Comment on lines +931 to +932
if quality.unit is not None and quality.unit not in ("rows", "percent"):
logger.warning(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if quality.unit is not None and quality.unit not in ("rows", "percent"):
logger.warning(
if quality.unit is not None and quality.unit.lower() not in ("rows", "percent"):

Comment on lines +936 to +937
threshold_suffix = "%" if quality.unit == "percent" else ""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
threshold_suffix = "%" if quality.unit == "percent" else ""
threshold_suffix = "%" if quality.unit.lower() == "percent" else ""

key=check_key,
category="quality",
type=check_type,
name=f"Check that model {model_name} has duplicate_count {threshold} for columns {col_joined}",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still includes duplicate_count. (5 similar locations need to be fixed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for inbuilt quality metrics with relative thresholds as per ODCS 3.1.

2 participants