Cache label identifier data to eliminate redundant parsing by jordanpadams · Pull Request #1607 · NASA-PDS/validate

jordanpadams · 2026-05-19T06:59:19Z

Summary

Resolves #1568

Add LabelCacheEntry POJO to hold pre-extracted identifier data (logical IDs, lid/lidvid refs, context area refs) from a parsed label
After each label is parsed in LabelValidationRule, cache identifiers (with \n detection enabled) into ReferentialIntegrityUtil's labelIdentifierCache
additionalReferentialIntegrityChecks() now uses cached logicalIdentifiers and lidOrLidVidReferences when available — no disk re-parse for the common case; fallback parse retained for labels not in the initial validation pass
CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available — no disk re-parse for file area reference checks
collectAllContextReferences() uses cached context area refs to skip three Saxon XPath evaluations per label — fallback to fresh parse when no cache entry exists
\n detection (INVALID_FIELD_VALUE) happens once during cacheIdentifiers() (with reportCarriageReturns=true), so the referential integrity phase can safely use cached values without risking double-reporting or missed errors
Fix CrossLabelFileAreaReferenceChecker.reset() to clear the isObservational map alongside knownRefs, preventing static state from leaking across validation runs
Call CrossLabelFileAreaReferenceChecker.reset() from ValidateLauncher alongside ReferentialIntegrityUtil.reset()

Test plan

NASA-PDS/validate#15-2 passes (\n in logical_identifier — 1 INVALID_FIELD_VALUE reported, no double-reporting)
NASA-PDS/validate#401-1 passes (\n in lid_reference — 3 INVALID_FIELD_VALUE detected from cache, no re-parse needed)
Full Cucumber suite: 297/297 scenarios pass

🤖 Generated with Claude Code

After each label is parsed by pds4-jparser, extract and cache the logical identifiers, lid/lidvid references, and context area references into a LabelCacheEntry. In additionalReferentialIntegrityChecks(), use cached context area refs to skip three expensive Saxon XPath evaluations per label instead of re-running them against a freshly-reparsed DOM. Main identifiers (logicalIdentifiers, lidOrLidVidReferences) still re-parse from disk in additionalReferentialIntegrityChecks() to correctly detect and report INVALID_FIELD_VALUE for identifier values containing newlines — pds4-jparser normalizes newlines away, so the cached values cannot be used for that check. Also fixes CrossLabelFileAreaReferenceChecker.reset() to clear the isObservational map alongside knownRefs, preventing static state from leaking across validation runs. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…ntegrity phase - LabelValidationRule.cacheIdentifiers() now reports \n errors (reportCarriageReturns=true) so the referential integrity phase can safely use cached identifiers without re-parsing - CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available, falling back to disk parse only for labels not in the initial validation pass - ReferentialIntegrityUtil.additionalReferentialIntegrityChecks() uses cached logicalIdentifiers and lidOrLidVidReferences when available, eliminating all disk re-parsing for the common case; fallback parse retained for uncached labels All 297 tests pass. Resolves the full acceptance criteria for #1568. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

jordanpadams · 2026-05-19T10:59:15Z

Acceptance Criteria Verification

From issue #1568:

Given a bundle with many product labels
When I perform validation including referential integrity checks
Then I expect each label file is read and parsed from disk only once

How each label is now parsed exactly once

Phase 1 — Initial label validation (LabelValidationRule.validateLabel(), line 319):

validator.parseAndValidate(processor, target)   ← one disk read + one DOM parse per label
  └─ cacheIdentifiers(document, targetUrl)       ← extracts LIDs, LIDVIDs, context refs
       └─ ReferentialIntegrityUtil.cacheLabelIdentifiers(url, entry)

The DOM is built once from disk. cacheIdentifiers() (line 345) runs against the already-in-memory Document, extracting:

logicalIdentifiers (LIDs/LIDVIDs registered by the product)
lidOrLidVidReferences (all lid_reference/lidvid_reference values)
contextAreaRefs (Investigation Area, Observation System Component, Target Identification refs)

\n detection (INVALID_FIELD_VALUE) is also performed here (once) with reportCarriageReturns=true.

Phase 2 — Referential integrity checks: three former re-parse sites, all now cache-first:

Former re-parse site	Now
`ReferentialIntegrityUtil.additionalReferentialIntegrityChecks()` (line 823) — `db.parse(url.openStream())` for every label	`getCachedLabelIdentifiers(url)` hit → uses `cached.getLogicalIdentifiers()` / `getLidOrLidVidReferences()` directly; no disk read, no parse
`CrossLabelFileAreaReferenceChecker.add()` (line 41) — `DocumentBuilderFactory…newDocumentBuilder().parse(target.getUrl().openStream())`	`getCachedLabelIdentifiers(target.getUrl())` hit → uses `cached.getLogicalIdentifiers()` directly; no disk read, no parse
`collectAllContextReferences()` — three Saxon XPath evaluations over a re-parsed DOM	`getCachedLabelIdentifiers(url)` hit → uses `cached.getContextAreaRefs()` directly; no Saxon XPath, no re-parse

A fallback disk-parse is retained in all three sites for labels that were not in the initial validation pass (e.g. labels that failed parsing and were never cached). This keeps correctness for edge cases while achieving the optimization for the common case.

Test evidence

NASA-PDS/validate#15-2 — label with \n in logical_identifier: INVALID_FIELD_VALUE reported exactly once (from cacheIdentifiers()), not duplicated by the referential integrity phase ✅
NASA-PDS/validate#401-1 — bundle with \n in lid_reference (3 occurrences): all 3 INVALID_FIELD_VALUE errors detected from cache; referential integrity phase uses cached values with no re-parse ✅
Full Cucumber suite: 297/297 scenarios pass ✅

🤖 Generated with Claude Code

Generates a synthetic PDS4 bundle with N product labels on-the-fly and times validate --rule pds4.bundle --skip-content-validation against it. Exercises additionalReferentialIntegrityChecks, collectAllContextReferences, and CrossLabelFileAreaReferenceChecker — all three paths optimized by #1568. Usage: python3 scripts/benchmark_bundle.py # smoke test (10 products) python3 scripts/benchmark_bundle.py --products 1000 --runs 2 Auto-detects and extracts freshly built distribution from target/ when newer than the pre-built snapshot. Pass --validate to override. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…porting cacheIdentifiers() was using the Saxon-processed DOM which normalizes xs:token whitespace, stripping \n from identifier values and silently breaking carriage-return detection (broke tests #15-2 and #401-1). Fix: re-parse with DocumentBuilderFactory (preserves raw whitespace). Backwards compatibility: lid_reference \n errors were only ever reported during bundle/collection validation (additionalReferentialIntegrityChecks), not single-label validation. cacheIdentifiers() uses getLidVidReferences with reportCarriageReturns=false to preserve this behavior; the cache path in additionalReferentialIntegrityChecks re-parses for error reporting only (using cache for accumulation), matching the pre-cache code path exactly. Also removes incidental \n whitespace from 5 lid_reference values in github15 test data that were never part of that test's intent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jordanpadams · 2026-05-29T21:29:29Z

Fallout fix: carriage-return detection broken by Saxon DOM normalization

Two CI test failures (#15-2, #401-1) were caused by a subtle interaction between the new caching code and Saxon's XML processing:

Root cause: cacheIdentifiers() was building a DOMSource from the Saxon-processed Document returned by parseAndValidate(). Saxon normalizes xs:token field values, stripping \n from element text content. LabelUtil.getIdentifiersCommon() detects carriage returns via node.getTextContent().contains("\n") — with Saxon's normalized DOM, this check silently never fires.

Effect on tests:

#15-2: \n in logical_identifier was stripped → error.validation.invalid_field_value=1 and error.validation.internal_error=1 were lost; a new error.label.missing_file appeared instead (22 errors instead of 23)
#401-1: 3 \n in lid_reference values were stripped → error.validation.invalid_field_value=3 were lost (2 errors instead of 5)

Fix: Re-parse with plain DocumentBuilderFactory in cacheIdentifiers() (preserves raw whitespace), instead of reusing the Saxon-normalized Document.

Backwards compatibility: The old code only reported lid_reference carriage-return errors during bundle/collection validation (additionalReferentialIntegrityChecks), never for single-label validation. To preserve this behavior exactly, cacheIdentifiers() uses getLidVidReferences(reportCarriageReturns=false); additionalReferentialIntegrityChecks re-parses for error reporting on the cache path (same as the pre-cache uncached path). This means single-label users with \n in lid_reference do not suddenly get new errors.

Also removed incidental \n whitespace from 5 lid_reference values in the github15 test data — these were formatting artifacts, not intentional test cases.

🤖 Generated with Claude Code

jordanpadams · 2026-05-29T22:29:19Z

Testing using the new benchmark script:

# Execution for each scenario
$ python scripts/benchmark_bundle.py --products 100000

# Current version on main
=== Validate ===
  $ /tmp/validate_benchmark_bin/validate-4.2.0-SNAPSHOT/bin/validate --rule pds4.bundle --skip-content-validation /var/folders/7y/5wn5jqmx2v37tj9270z2nf3h0000gq/T/pds4_benchmark_l5le0ijv
  Exit code : 0
  Elapsed   : 1401.76s
  Summary   :
    Summary:

      100002 product(s)
      0 error(s)
      0 warning(s)

      Product Validation Summary:
        100002     product(s) passed
        0          product(s) failed
        0          product(s) skipped
        100002     product(s) total

      Referential Integrity Check Summary:
        100002     check(s) passed
        0          check(s) failed
        0          check(s) skipped
        100002     check(s) total


    End of Report
    Completed execution in 1401533 ms

# After updates from this PR
TBD

jordanpadams requested a review from a team as a code owner May 19, 2026 06:59

jordanpadams assigned al-niessner May 19, 2026

jordanpadams mentioned this pull request May 19, 2026

As a PDS data engineer, I want parsed DOM trees cached and reused during referential integrity checks so that labels are not re-parsed from disk a second time #1568

Open

jordanpadams and others added 6 commits May 27, 2026 04:38

Merge branch 'main' into issues/1568-cache-label-identifiers

7267615

Merge branch 'main' into issues/1568-cache-label-identifiers

ab68a7d

Merge branch 'main' into issues/1568-cache-label-identifiers

dc9099e

Fix typo

505cf89

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache label identifier data to eliminate redundant parsing#1607

Cache label identifier data to eliminate redundant parsing#1607
jordanpadams wants to merge 8 commits into
mainfrom
issues/1568-cache-label-identifiers

jordanpadams commented May 19, 2026 •

edited

Loading

Uh oh!

jordanpadams commented May 19, 2026

Uh oh!

jordanpadams commented May 29, 2026

Uh oh!

jordanpadams commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jordanpadams commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

jordanpadams commented May 19, 2026

Acceptance Criteria Verification

How each label is now parsed exactly once

Test evidence

Uh oh!

jordanpadams commented May 29, 2026

Fallout fix: carriage-return detection broken by Saxon DOM normalization

Uh oh!

jordanpadams commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jordanpadams commented May 19, 2026 •

edited

Loading