Cache label identifier data to eliminate redundant parsing#1607
Cache label identifier data to eliminate redundant parsing#1607jordanpadams wants to merge 8 commits into
Conversation
After each label is parsed by pds4-jparser, extract and cache the logical identifiers, lid/lidvid references, and context area references into a LabelCacheEntry. In additionalReferentialIntegrityChecks(), use cached context area refs to skip three expensive Saxon XPath evaluations per label instead of re-running them against a freshly-reparsed DOM. Main identifiers (logicalIdentifiers, lidOrLidVidReferences) still re-parse from disk in additionalReferentialIntegrityChecks() to correctly detect and report INVALID_FIELD_VALUE for identifier values containing newlines — pds4-jparser normalizes newlines away, so the cached values cannot be used for that check. Also fixes CrossLabelFileAreaReferenceChecker.reset() to clear the isObservational map alongside knownRefs, preventing static state from leaking across validation runs. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…ntegrity phase - LabelValidationRule.cacheIdentifiers() now reports \n errors (reportCarriageReturns=true) so the referential integrity phase can safely use cached identifiers without re-parsing - CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available, falling back to disk parse only for labels not in the initial validation pass - ReferentialIntegrityUtil.additionalReferentialIntegrityChecks() uses cached logicalIdentifiers and lidOrLidVidReferences when available, eliminating all disk re-parsing for the common case; fallback parse retained for uncached labels All 297 tests pass. Resolves the full acceptance criteria for #1568. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Acceptance Criteria VerificationFrom issue #1568:
How each label is now parsed exactly oncePhase 1 — Initial label validation ( The DOM is built once from disk.
Phase 2 — Referential integrity checks: three former re-parse sites, all now cache-first:
A fallback disk-parse is retained in all three sites for labels that were not in the initial validation pass (e.g. labels that failed parsing and were never cached). This keeps correctness for edge cases while achieving the optimization for the common case. Test evidence
🤖 Generated with Claude Code |
Generates a synthetic PDS4 bundle with N product labels on-the-fly and times validate --rule pds4.bundle --skip-content-validation against it. Exercises additionalReferentialIntegrityChecks, collectAllContextReferences, and CrossLabelFileAreaReferenceChecker — all three paths optimized by #1568. Usage: python3 scripts/benchmark_bundle.py # smoke test (10 products) python3 scripts/benchmark_bundle.py --products 1000 --runs 2 Auto-detects and extracts freshly built distribution from target/ when newer than the pre-built snapshot. Pass --validate to override. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…porting cacheIdentifiers() was using the Saxon-processed DOM which normalizes xs:token whitespace, stripping \n from identifier values and silently breaking carriage-return detection (broke tests #15-2 and #401-1). Fix: re-parse with DocumentBuilderFactory (preserves raw whitespace). Backwards compatibility: lid_reference \n errors were only ever reported during bundle/collection validation (additionalReferentialIntegrityChecks), not single-label validation. cacheIdentifiers() uses getLidVidReferences with reportCarriageReturns=false to preserve this behavior; the cache path in additionalReferentialIntegrityChecks re-parses for error reporting only (using cache for accumulation), matching the pre-cache code path exactly. Also removes incidental \n whitespace from 5 lid_reference values in github15 test data that were never part of that test's intent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fallout fix: carriage-return detection broken by Saxon DOM normalizationTwo CI test failures (#15-2, #401-1) were caused by a subtle interaction between the new caching code and Saxon's XML processing: Root cause: Effect on tests:
Fix: Re-parse with plain Backwards compatibility: The old code only reported Also removed incidental 🤖 Generated with Claude Code |
|
Testing using the new benchmark script: |
Summary
Resolves #1568
LabelCacheEntryPOJO to hold pre-extracted identifier data (logical IDs, lid/lidvid refs, context area refs) from a parsed labelLabelValidationRule, cache identifiers (with\ndetection enabled) intoReferentialIntegrityUtil'slabelIdentifierCacheadditionalReferentialIntegrityChecks()now uses cachedlogicalIdentifiersandlidOrLidVidReferenceswhen available — no disk re-parse for the common case; fallback parse retained for labels not in the initial validation passCrossLabelFileAreaReferenceChecker.add()uses cachedlogicalIdentifierswhen available — no disk re-parse for file area reference checkscollectAllContextReferences()uses cached context area refs to skip three Saxon XPath evaluations per label — fallback to fresh parse when no cache entry exists\ndetection (INVALID_FIELD_VALUE) happens once duringcacheIdentifiers()(withreportCarriageReturns=true), so the referential integrity phase can safely use cached values without risking double-reporting or missed errorsCrossLabelFileAreaReferenceChecker.reset()to clear theisObservationalmap alongsideknownRefs, preventing static state from leaking across validation runsCrossLabelFileAreaReferenceChecker.reset()fromValidateLauncheralongsideReferentialIntegrityUtil.reset()Test plan
NASA-PDS/validate#15-2passes (\ninlogical_identifier— 1INVALID_FIELD_VALUEreported, no double-reporting)NASA-PDS/validate#401-1passes (\ninlid_reference— 3INVALID_FIELD_VALUEdetected from cache, no re-parse needed)🤖 Generated with Claude Code