Skip to content

Phase 33: ArticleProvenance + FeedbackLog write paths (dev → master)#613

Open
paulalbert1 wants to merge 6 commits intomasterfrom
development
Open

Phase 33: ArticleProvenance + FeedbackLog write paths (dev → master)#613
paulalbert1 wants to merge 6 commits intomasterfrom
development

Conversation

@paulalbert1
Copy link
Copy Markdown
Contributor

Summary

Bundles the three Phase 33 PRs that have been deployed and validated on dev EKS:

git cherry origin/master origin/development confirms these three commits are the only delta between master and development as of 2026-04-30. Master was last updated 2026-04-27 via PR #609; nothing else has accumulated since.

Net diff: 11 files changed, 1,153 insertions, 12 deletions (mostly tests).

What this enables on prod

  1. ArticleProvenance populated on every retrieval cycle — established users get fresh rs/frd/ads writes per strategy hit; previously the table was only populated by Phase 31's one-time Python backfill and would have aged out organically.

  2. FeedbackLog populated on every curator action — curator history mirrored from MySQL admin_feedback_log to the new DynamoDB table during the 4-week dual-write transition window per Phase 33 D-07. PM's existing direct-MySQL writes are unchanged; we add a side-effect write inside DynamoDbGoldStandardService.save().

  3. Algo-coverage signal queryable — composite src values (MAN_FROM_PM, MAN_FROM_CTSC) now populate forward-going so analysts can run:

    • WHERE src='MAN' → curator-only PMIDs (algo missed entirely)
    • WHERE src='MAN_FROM_PM' AND rs='PM_UI_SEARCH' → PM UI's PubMed search rediscovered
    • WHERE src='MAN_FROM_PM' AND contains(ads,'PM_UI_SEARCH') AND rs!='PM_UI_SEARCH' → sub-threshold rediscovery
    • WHERE src='MAN_FROM_PM' AND NOT contains(ads,'PM_UI_SEARCH') → regular candidate-list curation

Dev validation evidence

All recorded in ~/Dropbox/Projects/ReCiter Research/analysis/phase_33_local_smoke_test.md:

  • Local smoke test (pre-deploy): 4 calls exercising all D-11/D-13 paths against real DynamoDB; all expected writes verified; cleanup left 0 residual items
  • Dev EKS deploy validation (post-deploy):
    • kubectl rollout status clean for both image deploys (586.2026-04-30.16.02.55.2e11b461 and 587.2026-04-30.17.02.31.e78508fd)
    • scripts/spot_check_feedback_log.py --strict — 4/4 continuity uids pass schema invariants
    • 4 continuity uids exercised with full feature-generator/by/uid?refreshFlag=ALL_PUBLICATIONS runs (meb7002, sfs2002, rsb2005, ajg9004); all HTTP 200; AP writes verified per uid
    • 615 src-absent items discovered and backfilled out-of-band before fix Phase 33-01b: write src=PM on retrieval-found AP items via if_not_exists #612 landed; 0 residual src-absent rs-present items

Backwards compatibility

  • POST/PUT /reciter/goldstandard keeps the existing 3-arg shape; the new ?entryPath=CANDIDATE_LIST|PUBMED_SEARCH query param is optional and defaults to CANDIDATE_LIST. PM Next.js callers don't need to change to keep working.
  • IDynamoDbGoldStandardService.save() 3-arg overloads delegate to 4-arg overloads with EntryPath.CANDIDATE_LIST. Any external Java callers (none found in this repo) keep working.
  • PmidProvenance writes are intentionally preserved — both legacy and new tables get writes during the 4-week soak. Removal is a separate post-soak follow-up.
  • EntryPath enum is in this repo's reciter.feedback package (not in the reciter-article-model JAR), avoiding a cross-repo Maven release.

Test plan

  • mvn test — 39 new tests pass (11 FeedbackLogServiceImplTest + 28 ArticleProvenanceServiceImplTest); 2 pre-existing TargetAuthorSelectionTest NPEs unchanged (unrelated)
  • Local smoke against real DynamoDB — all D-11/D-13 paths verified
  • Dev EKS deploy verified at scale — 4 continuity uids, 5,507 AP items examined post-deploy, all schema invariants hold
  • Prod cutover via ReCiter-Production CodePipeline ManualApproval ← Mahender's action
  • 24h prod soak per Phase 33 D-19: compare FeedbackLog write rate to historical admin_feedback_log baseline (~50/day); track AP write volume from first prod inst-client run (1:30 PM UTC the day after deploy)

Follow-ups (not blocking this PR)

  • ReCiter-Publication-Manager Next.js to emit ?entryPath=PUBMED_SEARCH on PubMed-search-originated curator actions — small, separate PR
  • PmidProvenance Java cleanup — open after the 4-week dual-write soak passes
  • ads set strategy-name dedup — Phase 31's short codes (e.g., "FNI") and Phase 33-01's class-name codes (e.g., "FirstNameInitialRetrievalStrategy") coexist; functional no-op, optional cosmetic cleanup

Phase 33-01. Introduces ArticleProvenanceService for the new
ArticleProvenance DynamoDB table. Wires it into trackNewPmids() so every
PMID returned by every retrieval strategy gets a provenance upsert.

Existing trackNewPmids() only attributed strategy to NEW pmids (those
absent from existingPmids). Established users got zero PmidProvenance
writes on refresh because all their pmids were already known. After
Phase 31's Python backfill, ArticleProvenance is populated, but going
forward only brand-new users would update.

This change writes ArticleProvenance for EVERY PMID a strategy returns,
not just new ones. The if_not_exists semantics on rs/frd preserve
original retrieval attribution across re-runs; ads grows via DynamoDB
String Set ADD as new strategies re-find existing PMIDs.

src is NOT touched — curator/CTSC paths own it (Phase 33 D-04).

Service uses raw AmazonDynamoDB.updateItem (not DynamoDBMapper) because
the UpdateExpression combines if_not_exists with ADD on String Sets in
one round-trip; DynamoDBMapper does not support that shape.

Provenance write failures are logged but not thrown — retrieval must
not break if DDB is degraded.

Tests: 11 Mockito-backed JUnit cases covering:
- TableName + key shape
- if_not_exists(rs, ...) and if_not_exists(frd, ...) usage
- ADD ads :strategySet semantics
- src never touched
- Expression attribute value types (S/N/SS)
- AmazonDynamoDBException + RuntimeException both swallowed
- Null/empty uid and strategyCode short-circuit

Build: full mvn test passes the new 11 cases. Two pre-existing
failures in TargetAuthorSelectionTest (NullPointerException in test
setup, unrelated to this change) are unchanged.

Files:
  + src/main/java/reciter/service/ArticleProvenanceService.java
  + src/main/java/reciter/service/dynamo/ArticleProvenanceServiceImpl.java
  + src/test/java/reciter/service/dynamo/ArticleProvenanceServiceImplTest.java
  M src/main/java/reciter/xml/retriever/engine/AliasReCiterRetrievalEngine.java
… on curator actions

Phase 33-02. Wires curator actions (PM UI POST/PUT /reciter/goldstandard)
through to two new DynamoDB tables that the system can later mine for
algo-coverage signal and per-PMID curator history.

New components:

- reciter.feedback.EntryPath enum (CANDIDATE_LIST | PUBMED_SEARCH).
  PM UI emits this discriminator on the goldstandard endpoint via
  ?entryPath=... query param. Default CANDIDATE_LIST keeps existing
  callers backwards-compatible.

- FeedbackLogService + FeedbackLogServiceImpl: writes one item per
  curator action to DynamoDB FeedbackLog (Phase 32 schema). Uses raw
  AmazonDynamoDB.putItem with conditional attribute_not_exists(sk).
  sk = '<epoch>#<8hex>' where the hex is derived from System.nanoTime;
  collisions retry up to 5 times with fresh suffixes. src='MAN'
  (Phase 32 D-03). curatedBy=0 for now (the goldstandard endpoint
  doesn't carry a userID; future work).

- ArticleProvenanceService.upsertCuratorAction extends the existing
  service with the D-11 unified transition rule plus D-13's
  PM_UI_SEARCH retrieval-record write for the PUBMED_SEARCH path:
    existing src absent or 'GS'  -> 'MAN'
    existing src='PM'            -> 'MAN_FROM_PM'
    existing src='CTSC'          -> 'MAN_FROM_CTSC'
    existing src='MAN'/'MAN_FROM_*' -> noop (frd preserved)
  Implementation: GetItem (consistent read), compute new src,
  conditional UpdateItem with one retry on CCE; fire-and-forget.

Wiring:

- DynamoDbGoldStandardService.save() now has 4-arg overloads taking
  EntryPath; the 3-arg signatures default to CANDIDATE_LIST.
- A new private recordFeedbackLogAndArticleProvenance() helper
  computes 3 diff buckets per uid: newlyAccepted (incoming \
  existing accepted), newlyRejected, newlyPending (existing \
  incoming combined). Emits one FeedbackLog row + one D-11 transition
  per pmid. PUBMED_SEARCH adds a D-13 PM_UI_SEARCH retrieval write
  before the D-11 transition.
- Both single-GoldStandard save() and List<GoldStandard> save()
  call the helper inside the existing UPDATE branch where
  existingAccepted/Rejected are in scope.
- Controller accepts ?entryPath=CANDIDATE_LIST|PUBMED_SEARCH and
  passes through to the service.

Tests:
- 17 new ArticleProvenanceServiceImplTest cases (D-11 transition table
  via computeNewSrc; CANDIDATE_LIST / PUBMED_SEARCH path semantics; race
  retry; null/default handling).
- 11 new FeedbackLogServiceImplTest cases (schema, sk format, ACCEPTED/
  REJECTED/PENDING, sk collision retry, exception swallowing, null guards).
- mvn test full suite: 77 total, 39 new tests pass, 0 new failures.
  Pre-existing TargetAuthorSelectionTest (2 NPEs) remain unchanged.

Behavior:
- All curator actions through /reciter/goldstandard now produce
  FeedbackLog rows and ArticleProvenance D-11 transitions.
- The PM UI 'PubMed search' workflow flagged via entryPath=PUBMED_SEARCH
  is recorded with rs='PM_UI_SEARCH' added to ads, then upgraded to
  src=MAN_FROM_PM via D-11 (so the algo-coverage queries in
  33-CONTEXT.md work for PM-search-found articles).
- Provenance write failures never block PM UI requests.
Phase 33-01: write ArticleProvenance on every retrieval strategy hit
Phase 33-02: write FeedbackLog + apply ArticleProvenance D-11/D-13 on curator actions
Phase 33-01 originally wrote rs/frd/ads but not src, on the assumption
(D-04) that src is owned exclusively by curator/CTSC paths. Reality is
inconsistent with that intent: Phase 31's Python backfill wrote src=PM
for every retrieval-found item, so the existing table state has src=PM
on Phase-31-era items.

Without this fix, brand-new ArticleProvenance rows created by Phase
33-01 (when retrieval surfaces a PMID that wasn't in Phase 31's
backfill) get rs/frd/ads but src=absent. A subsequent curator action
would then transition src=null -> 'MAN' (algo-missed) instead of
src=null -> 'MAN_FROM_PM' (algo-found, then curator-touched). That
mis-attribution silently inflates the 'algo came up short' algo-coverage
cohort with PMIDs the algo actually surfaced.

Verified live on dev EKS post-deploy: ajg9004's feature-generator run
created 615 new AP rows, all with rs+frd+ads but src=absent. Backfilled
those 615 with src=PM via the same if_not_exists semantics this code
change applies going forward.

The if_not_exists guard preserves existing src values (CTSC, MAN,
MAN_FROM_PM, MAN_FROM_CTSC) so curator/CTSC ownership of src is
maintained — that's still D-04's spirit; we're just expanding the
write to cover the previously-missing 'first ever retrieval' case.

Tests: existing 'doesNotTouchSrc' assertion is replaced with
'preservesExistingSrc_onlySetsPmIfAbsent' which verifies the new
expression contains 'src = if_not_exists(src, :pm)' and that ':pm' is
'PM'. mvn test: 39/39 pass.

Net behavioral change:
- Brand-new AP rows: src='PM' (was: src absent)
- Existing src=PM rows: unchanged (if_not_exists no-op)
- Existing src=MAN/CTSC/MAN_FROM_*: unchanged (if_not_exists no-op)
- Pre-existing items with rs but no src in dev: backfilled out-of-band
  via one-off Python (see commit message of dev backfill)
Phase 33-01b: write src=PM on retrieval-found AP items via if_not_exists
@paulalbert1
Copy link
Copy Markdown
Contributor Author

Closing in favor of #614 — same diff, single commit instead of 3 commits + 3 merge commits. One thing to merge.

@paulalbert1 paulalbert1 reopened this Apr 30, 2026
@paulalbert1
Copy link
Copy Markdown
Contributor Author

Recommend using Squash and merge so master gets a single Phase 33 commit instead of 3 commits + 3 merge commits. The 3 underlying commits (#610 / #611 / #612) stay visible in development history; squash just consolidates the master entry.

Same diff as before — fully validated on dev EKS:

  • 11 files / 1,154 insertions / 12 deletions
  • 39 new Mockito tests pass
  • Local smoke + dev EKS deploy + 4-uid feature-generator validation
  • IAM verified (dynamodb:* on Resource:*); no policy change needed
  • See full evidence in analysis/phase_33_local_smoke_test.md (planning repo)

@paulalbert1 paulalbert1 requested a review from mrj4001 April 30, 2026 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant