feat(smart_grid): add 25 additional transformer scenarios #292
eggrollofchaos wants to merge 6 commits
…rio corpus
Adds the Smart Grid transformer-maintenance domain to AssetOpsBench as a
focused upstream cut from the SmartGridBench source project (Columbia
University, 2026). New surfaces:
- Smart Grid MCP servers under `src/servers/smart_grid/` for IoT, FMSR/DGA,
TSFM/RUL, and work-order workflows. Nested under a domain-specific
sub-namespace to coexist with the existing domain-general
`src/servers/{iot,fmsr,tsfm,wo}` servers (different backends, asset
types, and data assumptions; PR body documents the design rationale).
- A direct adapter exposing the Smart Grid tools as plain Python callables.
- 36 canonical Smart Grid scenarios + 5 negative-check fixtures in the AOB
local scenario array convention; extended evaluator metadata documented
in `docs/smart_grid_data_provenance.md`.
- `SG_DATA_DIR` runtime data-provenance contract and a no-CSV-port policy:
no raw or processed source-project CSV datasets are shipped.
- Console-script entry points for the four Smart Grid MCP servers.
- Unit tests for the direct adapter, IEC 60599 DGA classification,
JSON-safe divergent ratios, and scenario shape/uniqueness.
Validation: uv run pytest src/servers/smart_grid/ -- 25 passed.
Scenario JSON contains 36 unique canonical records and 5 unique
negative-check records.
Refs: HPML6998-S26-Team13/hpml-assetopsbench-smart-grid-mcp#46
Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
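The "JSON-safe divergent ratios" item in the commit above concerns gas ratios whose denominator can be zero. A minimal sketch of the idea, assuming the three basic IEC 60599 ratios (C2H2/C2H4, CH4/H2, C2H4/C2H6) and a hypothetical `safe_ratio` helper that returns `None` in place of an infinite or undefined value. The function names, the input shape, and the `None` convention are illustrative assumptions, not the code in this PR:

```python
import json
import math
from typing import Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator/denominator, or None when the result is not finite."""
    if denominator == 0:
        return None  # divergent ratio: report as JSON null instead of inf/NaN
    value = numerator / denominator
    return value if math.isfinite(value) else None


def dga_ratios(gases_ppm: dict) -> dict:
    """Compute the three basic gas ratios from ppm readings (hypothetical shape)."""
    return {
        "C2H2/C2H4": safe_ratio(gases_ppm["C2H2"], gases_ppm["C2H4"]),
        "CH4/H2": safe_ratio(gases_ppm["CH4"], gases_ppm["H2"]),
        "C2H4/C2H6": safe_ratio(gases_ppm["C2H4"], gases_ppm["C2H6"]),
    }


# Made-up readings; H2 = 0 makes CH4/H2 divergent.
sample = {"H2": 0.0, "CH4": 12.0, "C2H2": 1.0, "C2H4": 4.0, "C2H6": 8.0}
ratios = dga_ratios(sample)
# Strict serialization rejects NaN/Infinity, so None (JSON null) is used instead.
encoded = json.dumps(ratios, allow_nan=False)
```

Returning `None` keeps the record serializable under `allow_nan=False` while still signalling that the ratio could not be computed.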
Normalize pandas Timestamp values from get_dga_record before returning them through MCP JSON-RPC. Without this, valid DGA lookups over parsed CSV fixtures fail strict JSON serialization even though the in-process Python call succeeds.
Adds a regression test that builds a temporary SG_DATA_DIR fixture, calls get_dga_record, and verifies json.dumps(..., allow_nan=False) succeeds.
This is a follow-up from PR IBM#287 self-review and should remain a separate review-iteration commit on the published branch.
Validation:
- SG_DATA_DIR=/Users/wax/coding/hpml-assetopsbench-smart-grid-mcp/data/processed uv run pytest src/servers/smart_grid/
- uv run ruff format --check src/servers/smart_grid/
- uv run ruff check src/servers/smart_grid/
- SG_DATA_DIR=/Users/wax/coding/hpml-assetopsbench-smart-grid-mcp/data/processed uv run python <19-tool JSON serialization smoke>
Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
Move the JSON-safe record normalizer from `fmsr/main.py` (where it was added in `fix(smart_grid): serialize DGA sample dates`) up to `base.py` as the public canonical helper `json_safe_record`. Replace the latent pre-fix `_normalize_record` in `wo/main.py` (which only handled `pd.isna`, not `pd.Timestamp`) with the canonical helper.
`wo._normalize_record` was correct in behavior at the time it ran because `load_fault_records` does not currently pass `parse_dates`, so no `pd.Timestamp` ever leaked through. Adding `parse_dates=["report_date"]` (or similar) later would have silently broken JSON-RPC the same way the DGA path broke before its fix. Centralizing the boundary normalizer prevents that regression class.
Verification:
- `uv run pytest src/servers/smart_grid/` -- 42 passed.
- `uv run ruff format --check src/servers/smart_grid/` clean.
- `uv run ruff check src/servers/smart_grid/` clean.
Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
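A minimal sketch of what a boundary normalizer in the spirit of `json_safe_record` might look like. The real helper's body is not shown in this conversation, so `json_safe_record_sketch` and `json_safe_value` below are assumptions. The sketch uses only the standard library and relies on `pd.Timestamp` being a subclass of `datetime.datetime`; a real implementation working with pandas may need more cases:

```python
import json
import math
from datetime import date, datetime
from typing import Any


def json_safe_value(value: Any) -> Any:
    """Convert one value to something strict JSON can encode."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()  # a pd.Timestamp would take this branch too
    if isinstance(value, float) and not math.isfinite(value):
        return None  # NaN and infinities are not valid strict JSON
    return value


def json_safe_record_sketch(record: dict) -> dict:
    """Hypothetical stand-in for the shared helper: normalize every field."""
    return {key: json_safe_value(val) for key, val in record.items()}


raw = {
    "transformer_id": "T-001",  # made-up fixture values
    "sample_date": datetime(2024, 1, 15, 8, 30),
    "h2_ppm": float("nan"),
    "ch4_ppm": 12.5,
}
safe = json_safe_record_sketch(raw)
encoded = json.dumps(safe, allow_nan=False)  # the raw record would raise here
```

Keeping one normalizer at the tool boundary means a later change to how a CSV is parsed cannot reintroduce unserializable values in a single server.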
Add `tests/test_json_safety.py`, which walks every `@mcp.tool()`-decorated callable across `iot`, `fmsr`, `tsfm`, and `wo` and asserts that `json.dumps(result, allow_nan=False)` succeeds against a hermetic `SG_DATA_DIR` fixture. This catches the boundary-contract bug class fixed in `fmsr.get_dga_record` for any current or future Smart Grid tool, without per-tool test boilerplate.
The fixture writes minimal CSVs for all six processed-data files, sets `SG_DATA_DIR` to a `tmp_path`, and resets module-level dataframe caches across all four servers so each test gets a clean read path. The 16 parametrized cases bring the total to 42 passing tests (was 26).
Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
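The tool-walking check described in this commit can be sketched without any MCP machinery. The registry below is a hypothetical stand-in for the discovered `@mcp.tool()` callables, and the two toy tools are invented for illustration:

```python
import json
import math
from typing import Callable, Dict, List


def find_json_unsafe_tools(tools: Dict[str, Callable[[], object]]) -> List[str]:
    """Return names of tools whose result fails strict JSON serialization."""
    unsafe = []
    for name, tool in tools.items():
        try:
            json.dumps(tool(), allow_nan=False)
        except (TypeError, ValueError):
            # TypeError: unsupported type; ValueError: NaN or Infinity
            unsafe.append(name)
    return unsafe


def good_tool() -> dict:
    return {"status": "ok", "ratio": None}


def leaky_tool() -> dict:
    return {"status": "ok", "ratio": math.nan}  # NaN leaks through the boundary


registry = {"good_tool": good_tool, "leaky_tool": leaky_tool}
unsafe_tools = find_json_unsafe_tools(registry)
```

In a pytest suite the same loop would typically become one parametrized case per tool, so that a failure names the offending tool directly.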
Expand the Smart Grid scenario corpus on top of IBM PR IBM#287 from 36 records to 61 records by adding SGT-036 through SGT-060 from the SmartGridBench source project. The added batch includes domain-coverage gap-fill scenarios plus capability-targeted discrimination checks with benchmark_design metadata and negative must_NOT_include rubric fields.
Update the Smart Grid provenance docs and README count so the corpus size and new evaluator-facing metadata are documented. Extend the scenario JSON tests to assert the full SGT-001..SGT-060 ID set and guard against silently dropping the capability-targeted rubric fields.
Validation:
- uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q
- uv run pytest src/servers/smart_grid/ -q
- uv run ruff format --check src/servers/smart_grid/
- uv run ruff check src/servers/smart_grid/
Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
Self-review
Upstream-context pass after opening. Stacked on #287; new commit specific to #292 is 5b7d8b0.
PR context checked: head 5b7d8b0fed1c1ee744a4e6753282e80c826cd779; top-level comments 0; review records 0; inline comments 0; review threads 0; DCO SUCCESS at 2026-05-11T03:38:24Z; all 5 commits signed off; mergeStateStatus: BLOCKED, reviewDecision: REVIEW_REQUIRED (IBM-maintainer gates, expected).
Critical: none.
High: none.
Medium: none.
Low: none.
Nit
- N1: PR size +754 lines exceeds IBM's `<300 lines` preference without acknowledgment in PR body. #287 had a `## Size and split offer` section; #292 doesn't. Adding a short paragraph naming the natural split boundary (gap-fill `SGT-036..SGT-050` vs capability-targeted `SGT-051..SGT-060`) defuses the size question pre-emptively. No code change.
- N2 (`src/servers/smart_grid/tests/test_scenarios.py:108-112`): per-record assertions are truthy-only (`assert design.get("target_capability"), raw["id"]`). A future regression that stuffs `target_capability: " "` (whitespace) would still pass. Optional: wrap in `isinstance(...) and value.strip()` for safer guard.
- N3: PR body validation section says "10 records with `benchmark_design` and 13 records with `must_NOT_include`" as exact numbers; tests assert `>= 10` and `>= 13` (forward-compatible floor). Intentional but a reader checking the body against the test might briefly stall. Optional clarification in body wording ("10 records (floor)" or similar).
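The floor-versus-exact distinction in N3 can be illustrated with a small sketch. The record shape and the `count_with` helper are assumptions for illustration; only the field names `benchmark_design` and `ground_truth.must_NOT_include` and the floors 10 and 13 come from the review:

```python
from typing import Iterable


def count_with(records: Iterable[dict], *path: str) -> int:
    """Count records that carry a non-empty value at the nested key path."""
    total = 0
    for record in records:
        node = record
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
        if node:
            total += 1
    return total


# Toy corpus: 11 records with benchmark_design, 13 with must_NOT_include.
corpus = [
    {"id": f"X-{i:03d}", "benchmark_design": {"target_capability": "demo"}}
    for i in range(11)
] + [
    {"id": f"Y-{i:03d}", "ground_truth": {"must_NOT_include": ["demo"]}}
    for i in range(13)
]

design_count = count_with(corpus, "benchmark_design")
excluded_count = count_with(corpus, "ground_truth", "must_NOT_include")

# Floors stay green when the corpus grows; an exact `== 10` would not.
floors_hold = design_count >= 10 and excluded_count >= 13
exact_would_hold = design_count == 10
```

With one extra `benchmark_design` record the floor assertion still passes while an exact-count assertion would fail, which is the forward-compatibility trade-off the nit describes.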
Verified non-findings
- README `36 → 61` records update accurate; new phrasing fairly describes corpus shape.
- Provenance doc additions correctly distinguish SGT-036..SGT-050 (gap-fill) from SGT-051..SGT-060 (capability-targeted); two new evaluator-metadata table rows describe optional fields without overclaiming.
- Acknowledgments paragraph credits source-project authors alphabetically (Akshat Bhandari, Aaron Fan, Tanisha Rathod, Wei Alexander Xin). Consistent with #287.
- All `expected_tools` namespace-prefixed correctly across the 25 new scenarios.
- No raw or processed CSV/data files included; `SG_DATA_DIR` runtime contract preserved.
- Stacking is clean: `5b7d8b0` is the only PR292-specific commit; the four predecessors are #287 content visible because #287 hasn't landed yet.
- DCO check green; all commits signed off with consistent identity.
- Branch name `feature/smart-grid-additional-scenarios` matches `<type>/<description>` convention.
Verification
- `uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q` → 7 passed.
- `uv run pytest src/servers/smart_grid/ -q` → 44 passed.
- `uv run ruff format --check src/servers/smart_grid/` → clean.
- `uv run ruff check src/servers/smart_grid/` → clean.
- JSON corpus: 61 records, unique IDs, exact set `{AOB-FMSR-001} ∪ {SGT-001..SGT-060}`.
- PR292 batch (SGT-036..SGT-060): 25 records, all schema-valid, `benchmark_design`=10/25, `must_NOT_include`=13/25.
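The exact-ID-set check noted above can be sketched as follows. The helper names are hypothetical and the in-memory `records` list stands in for the parsed scenario JSON; the ID pattern (`AOB-FMSR-001` plus `SGT-001..SGT-060`) is taken from the verification notes:

```python
from typing import List, Set


def expected_ids(last_sgt: int = 60) -> Set[str]:
    """Build {AOB-FMSR-001} ∪ {SGT-001..SGT-<last_sgt>} with zero-padded numbers."""
    return {"AOB-FMSR-001"} | {f"SGT-{n:03d}" for n in range(1, last_sgt + 1)}


def check_ids(records: List[dict], last_sgt: int = 60) -> List[str]:
    """Return human-readable problems; an empty list means the ID set is exact."""
    ids = [record["id"] for record in records]
    problems = []
    if len(ids) != len(set(ids)):
        problems.append("duplicate IDs")
    missing = expected_ids(last_sgt) - set(ids)
    extra = set(ids) - expected_ids(last_sgt)
    if missing:
        problems.append(f"missing: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected: {sorted(extra)}")
    return problems


# Stand-in for the loaded corpus: one record per expected ID.
records = [{"id": scenario_id} for scenario_id in sorted(expected_ids())]
problems = check_ids(records)
```

Comparing against the full expected set, rather than only counting records, also catches a renamed or skipped ID that leaves the total at 61.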
Verdict
LGTM with 3 Nits. None blocks merge. IBM-maintainer review is the remaining external gate.
Address the PR IBM#292 v1 review nit by treating whitespace-only evaluator rubric strings as invalid. The capability-targeted test now shares a small non-empty-string predicate for benchmark_design fields and must_NOT_include entries, so future corpus edits cannot satisfy the preservation guard with blank strings.
Validation:
- uv run ruff format src/servers/smart_grid/tests/test_scenarios.py
- uv run ruff check src/servers/smart_grid/tests/test_scenarios.py
- uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q
- uv run pytest src/servers/smart_grid/ -q
Signed-off-by: Wei Alexander Xin <eggrollofchaos@gmail.com>
Self-review follow-up
PR context checked at head 1cffe54ff734ed9bb300957f5d06059268276c09; top-level comments 0; review records 1 prior (v1 at 5b7d8b0); inline comments 0; review threads 0; DCO SUCCESS at 2026-05-11T21:41:48Z on v3 head; mergeStateStatus: BLOCKED, reviewDecision: REVIEW_REQUIRED (IBM-maintainer gates, expected).
Body originally posted at v2 head 5b7d8b0 after the PR-body edit closing N1+N3; edited in place after v3 head 1cffe54 landed with the N2 fixup commit, to keep one consolidated follow-up record rather than two.
v1 Nit closure
- N1: closed in v2 PR-body edit. New `## Size and split offer` section honestly acknowledges size, names natural split boundary, offers to split post-#287.
- N3: closed in v2 PR-body edit. Validation bullet rewritten to "current-batch floors" + forward-compat rationale.
- N2: closed at v3 commit `1cffe54` (`test(smart_grid): tighten rubric string assertions`). Added `_non_empty_string(value)` predicate (`isinstance(value, str) and bool(value.strip())`); replaced 3 truthy-only assertions uniformly for `target_capability`, `discrimination_hypothesis`, and `must_NOT_include` per-item checks. Surgical 7+/-3-line diff. Helper extraction appropriate (used 3x). Whitespace-string regression now blocked.
Probed and ruled out
- `_non_empty_string` edge cases: `""` → False; `" "` → False; `"\t\n"` → False (handles whitespace beyond regular spaces); `"a"` → True; non-string values (`None`, `int`, `dict`) → False (defensive). Minimal but correctly typed.
- All 3 call sites use the predicate uniformly; no inconsistency.
- `must_NOT_include` outer assertion (`isinstance(excluded, list) and excluded`) retained; inner per-item check is now `_non_empty_string(item)`. Two-layer correctness.
- DCO re-ran on new commit (not carried): SUCCESS at v3 head.
- PR body unchanged since v2 edit; `## Size and split offer` + clarified `## Validation` still present.
- Stacked-on-#287 scope discipline preserved. New commit `1cffe54` is PR292-specific test-only change; no scope drift.
- No new top-level comments, review threads, or inline comments since v1.
Verification at v3 head
- `uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q` → 7 passed.
- `uv run ruff format --check src/servers/smart_grid/tests/test_scenarios.py` → already formatted.
- `uv run ruff check src/servers/smart_grid/tests/test_scenarios.py` → all checks passed.
- Commit chain verified: `a5b35a9` → `3fb6943` → `e8b3ab0` → `c5067b9` → `5b7d8b0` → `1cffe54`.
Summary counts
- Critical: 0
- High: 0
- Medium: 0
- Low: 0
- Nit: 0 (N1+N3 closed in v2 body edit; N2 closed in v3 commit `1cffe54`)
Verdict
LGTM — final-confirmation clean at v3 head 1cffe54. All three v1 Nits closed via the right vehicles: N1+N3 in PR-body edit (no commit cost), N2 in surgical fixup commit. Remaining gate is purely IBM-maintainer external review.
Summary
Adds the remaining SmartGridBench transformer-maintenance scenarios on top of #287, expanding the Smart Grid local corpus from 36 records to 61 records.
This PR adds:
- `SGT-036..SGT-050`: domain-coverage gap-fill scenarios across FMSR, IoT, TSFM, work-order, and multi-tool workflows.
- `SGT-051..SGT-060`: capability-targeted discrimination checks covering calibration/abstention, prompt-premise contradiction, cross-tool reconciliation, strict output formatting, and truncated-tool-result discipline.
- Scenario tests that assert the `AOB-FMSR-001` + `SGT-001..SGT-060` ID set and guard the new `benchmark_design` / `ground_truth.must_NOT_include` rubric fields.
Relationship to #287
This is a follow-on to #287 and is intentionally based on the #287 branch head. Until #287 lands, GitHub will show the domain/server port plus this additional-scenario commit together in this PR's diff. The new commit here is only:
5b7d8b0 feat(smart_grid): add 25 additional transformer scenarios
If maintainers prefer, #287 can be reviewed first and this PR can then be rebased so the visible diff contains only the 25-scenario expansion.
Size and split offer
This follow-on is larger than IBM's preferred small-PR guideline because it updates one canonical scenario array, its count/metadata tests, and the matching provenance documentation as one invariant-preserving unit. The new PR-specific code/data delta is the single 25-scenario expansion commit on top of #287. If maintainers prefer a smaller review path after #287 lands, I can split this into separate gap-fill (`SGT-036..SGT-050`) and capability-targeted (`SGT-051..SGT-060`) scenario PRs.
Data policy
No raw or processed SmartGridBench CSV files are included. The added records use the same `SG_DATA_DIR` runtime data contract already documented in #287.
Validation
- `uv run pytest src/servers/smart_grid/tests/test_scenarios.py -q`: 7 passed.
- `uv run pytest src/servers/smart_grid/ -q`: 44 passed.
- `uv run ruff format --check src/servers/smart_grid/`: clean.
- `uv run ruff check src/servers/smart_grid/`: clean.
- `src/scenarios/local/smart_grid.json` contains 61 unique records: `AOB-FMSR-001` plus `SGT-001..SGT-060`.
- 10 records with `benchmark_design` and 13 records with `ground_truth.must_NOT_include`. Tests assert these as minimum floors so future additions or harmless reorders do not require test rewrites.
Acknowledgments
Source-project authors (Columbia SmartGridBench, Spring 2026): Akshat Bhandari, Aaron Fan, Tanisha Rathod, Wei Alexander Xin.