Skip to content

Cache label identifier data to eliminate redundant parsing#1607

Open
jordanpadams wants to merge 8 commits into
mainfrom
issues/1568-cache-label-identifiers
Open

Cache label identifier data to eliminate redundant parsing#1607
jordanpadams wants to merge 8 commits into
mainfrom
issues/1568-cache-label-identifiers

Conversation

@jordanpadams
Copy link
Copy Markdown
Member

@jordanpadams jordanpadams commented May 19, 2026

Summary

Resolves #1568

  • Add LabelCacheEntry POJO to hold pre-extracted identifier data (logical IDs, lid/lidvid refs, context area refs) from a parsed label
  • After each label is parsed in LabelValidationRule, cache identifiers (with \n detection enabled) into ReferentialIntegrityUtil's labelIdentifierCache
  • additionalReferentialIntegrityChecks() now uses cached logicalIdentifiers and lidOrLidVidReferences when available — no disk re-parse for the common case; fallback parse retained for labels not in the initial validation pass
  • CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available — no disk re-parse for file area reference checks
  • collectAllContextReferences() uses cached context area refs to skip three Saxon XPath evaluations per label — fallback to fresh parse when no cache entry exists
  • \n detection (INVALID_FIELD_VALUE) happens once during cacheIdentifiers() (with reportCarriageReturns=true), so the referential integrity phase can safely use cached values without risking double-reporting or missed errors
  • Fix CrossLabelFileAreaReferenceChecker.reset() to clear the isObservational map alongside knownRefs, preventing static state from leaking across validation runs
  • Call CrossLabelFileAreaReferenceChecker.reset() from ValidateLauncher alongside ReferentialIntegrityUtil.reset()

Test plan

  • NASA-PDS/validate#15-2 passes (\n in logical_identifier — 1 INVALID_FIELD_VALUE reported, no double-reporting)
  • NASA-PDS/validate#401-1 passes (\n in lid_reference — 3 INVALID_FIELD_VALUE detected from cache, no re-parse needed)
  • Full Cucumber suite: 297/297 scenarios pass

🤖 Generated with Claude Code

After each label is parsed by pds4-jparser, extract and cache the logical
identifiers, lid/lidvid references, and context area references into a
LabelCacheEntry. In additionalReferentialIntegrityChecks(), use cached
context area refs to skip three expensive Saxon XPath evaluations per label
instead of re-running them against a freshly-reparsed DOM.

Main identifiers (logicalIdentifiers, lidOrLidVidReferences) still re-parse
from disk in additionalReferentialIntegrityChecks() to correctly detect and
report INVALID_FIELD_VALUE for identifier values containing newlines —
pds4-jparser normalizes newlines away, so the cached values cannot be used
for that check.

Also fixes CrossLabelFileAreaReferenceChecker.reset() to clear the
isObservational map alongside knownRefs, preventing static state from leaking
across validation runs.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…ntegrity phase

- LabelValidationRule.cacheIdentifiers() now reports \n errors (reportCarriageReturns=true)
  so the referential integrity phase can safely use cached identifiers without re-parsing
- CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available,
  falling back to disk parse only for labels not in the initial validation pass
- ReferentialIntegrityUtil.additionalReferentialIntegrityChecks() uses cached
  logicalIdentifiers and lidOrLidVidReferences when available, eliminating all
  disk re-parsing for the common case; fallback parse retained for uncached labels

All 297 tests pass. Resolves the full acceptance criteria for #1568.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@jordanpadams
Copy link
Copy Markdown
Member Author

Acceptance Criteria Verification

From issue #1568:

Given a bundle with many product labels
When I perform validation including referential integrity checks
Then I expect each label file is read and parsed from disk only once


How each label is now parsed exactly once

Phase 1 — Initial label validation (LabelValidationRule.validateLabel(), line 319):

validator.parseAndValidate(processor, target)   ← one disk read + one DOM parse per label
  └─ cacheIdentifiers(document, targetUrl)       ← extracts LIDs, LIDVIDs, context refs
       └─ ReferentialIntegrityUtil.cacheLabelIdentifiers(url, entry)

The DOM is built once from disk. cacheIdentifiers() (line 345) runs against the already-in-memory Document, extracting:

  • logicalIdentifiers (LIDs/LIDVIDs registered by the product)
  • lidOrLidVidReferences (all lid_reference/lidvid_reference values)
  • contextAreaRefs (Investigation Area, Observation System Component, Target Identification refs)

\n detection (INVALID_FIELD_VALUE) is also performed here (once) with reportCarriageReturns=true.


Phase 2 — Referential integrity checks: three former re-parse sites, all now cache-first:

Former re-parse site Now
ReferentialIntegrityUtil.additionalReferentialIntegrityChecks() (line 823) — db.parse(url.openStream()) for every label getCachedLabelIdentifiers(url) hit → uses cached.getLogicalIdentifiers() / getLidOrLidVidReferences() directly; no disk read, no parse
CrossLabelFileAreaReferenceChecker.add() (line 41) — DocumentBuilderFactory…newDocumentBuilder().parse(target.getUrl().openStream()) getCachedLabelIdentifiers(target.getUrl()) hit → uses cached.getLogicalIdentifiers() directly; no disk read, no parse
collectAllContextReferences() — three Saxon XPath evaluations over a re-parsed DOM getCachedLabelIdentifiers(url) hit → uses cached.getContextAreaRefs() directly; no Saxon XPath, no re-parse

A fallback disk-parse is retained in all three sites for labels that were not in the initial validation pass (e.g. labels that failed parsing and were never cached). This keeps correctness for edge cases while achieving the optimization for the common case.


Test evidence

  • NASA-PDS/validate#15-2 — label with \n in logical_identifier: INVALID_FIELD_VALUE reported exactly once (from cacheIdentifiers()), not duplicated by the referential integrity phase ✅
  • NASA-PDS/validate#401-1 — bundle with \n in lid_reference (3 occurrences): all 3 INVALID_FIELD_VALUE errors detected from cache; referential integrity phase uses cached values with no re-parse ✅
  • Full Cucumber suite: 297/297 scenarios pass

🤖 Generated with Claude Code

jordanpadams and others added 6 commits May 27, 2026 04:38
Generates a synthetic PDS4 bundle with N product labels on-the-fly and
times validate --rule pds4.bundle --skip-content-validation against it.
Exercises additionalReferentialIntegrityChecks, collectAllContextReferences,
and CrossLabelFileAreaReferenceChecker — all three paths optimized by #1568.

Usage:
  python3 scripts/benchmark_bundle.py              # smoke test (10 products)
  python3 scripts/benchmark_bundle.py --products 1000 --runs 2

Auto-detects and extracts freshly built distribution from target/ when newer
than the pre-built snapshot. Pass --validate to override.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…porting

cacheIdentifiers() was using the Saxon-processed DOM which normalizes
xs:token whitespace, stripping \n from identifier values and silently
breaking carriage-return detection (broke tests #15-2 and #401-1).

Fix: re-parse with DocumentBuilderFactory (preserves raw whitespace).

Backwards compatibility: lid_reference \n errors were only ever reported
during bundle/collection validation (additionalReferentialIntegrityChecks),
not single-label validation. cacheIdentifiers() uses getLidVidReferences
with reportCarriageReturns=false to preserve this behavior; the cache path
in additionalReferentialIntegrityChecks re-parses for error reporting only
(using cache for accumulation), matching the pre-cache code path exactly.

Also removes incidental \n whitespace from 5 lid_reference values in
github15 test data that were never part of that test's intent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jordanpadams
Copy link
Copy Markdown
Member Author

Fallout fix: carriage-return detection broken by Saxon DOM normalization

Two CI test failures (#15-2, #401-1) were caused by a subtle interaction between the new caching code and Saxon's XML processing:

Root cause: cacheIdentifiers() was building a DOMSource from the Saxon-processed Document returned by parseAndValidate(). Saxon normalizes xs:token field values, stripping \n from element text content. LabelUtil.getIdentifiersCommon() detects carriage returns via node.getTextContent().contains("\n") — with Saxon's normalized DOM, this check silently never fires.

Effect on tests:

  • #15-2: \n in logical_identifier was stripped → error.validation.invalid_field_value=1 and error.validation.internal_error=1 were lost; a new error.label.missing_file appeared instead (22 errors instead of 23)
  • #401-1: 3 \n in lid_reference values were stripped → error.validation.invalid_field_value=3 were lost (2 errors instead of 5)

Fix: Re-parse with plain DocumentBuilderFactory in cacheIdentifiers() (preserves raw whitespace), instead of reusing the Saxon-normalized Document.

Backwards compatibility: The old code only reported lid_reference carriage-return errors during bundle/collection validation (additionalReferentialIntegrityChecks), never for single-label validation. To preserve this behavior exactly, cacheIdentifiers() uses getLidVidReferences(reportCarriageReturns=false); additionalReferentialIntegrityChecks re-parses for error reporting on the cache path (same as the pre-cache uncached path). This means single-label users with \n in lid_reference do not suddenly get new errors.

Also removed incidental \n whitespace from 5 lid_reference values in the github15 test data — these were formatting artifacts, not intentional test cases.

🤖 Generated with Claude Code

@jordanpadams
Copy link
Copy Markdown
Member Author

Testing using the new benchmark script:

# Execution for each scenario
$ python scripts/benchmark_bundle.py --products 100000

# Current version on main
=== Validate ===
  $ /tmp/validate_benchmark_bin/validate-4.2.0-SNAPSHOT/bin/validate --rule pds4.bundle --skip-content-validation /var/folders/7y/5wn5jqmx2v37tj9270z2nf3h0000gq/T/pds4_benchmark_l5le0ijv
  Exit code : 0
  Elapsed   : 1401.76s
  Summary   :
    Summary:

      100002 product(s)
      0 error(s)
      0 warning(s)

      Product Validation Summary:
        100002     product(s) passed
        0          product(s) failed
        0          product(s) skipped
        100002     product(s) total

      Referential Integrity Check Summary:
        100002     check(s) passed
        0          check(s) failed
        0          check(s) skipped
        100002     check(s) total


    End of Report
    Completed execution in 1401533 ms

# After updates from this PR
TBD

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants