Phase 33: ArticleProvenance + FeedbackLog write paths (dev → master)#613
Open
paulalbert1 wants to merge 6 commits intomasterfrom
Open
Phase 33: ArticleProvenance + FeedbackLog write paths (dev → master)#613paulalbert1 wants to merge 6 commits intomasterfrom
paulalbert1 wants to merge 6 commits intomasterfrom
Conversation
Phase 33-01. Introduces ArticleProvenanceService for the new ArticleProvenance DynamoDB table. Wires it into trackNewPmids() so every PMID returned by every retrieval strategy gets a provenance upsert. Existing trackNewPmids() only attributed strategy to NEW pmids (those absent from existingPmids). Established users got zero PmidProvenance writes on refresh because all their pmids were already known. After Phase 31's Python backfill, ArticleProvenance is populated, but going forward only brand-new users would update. This change writes ArticleProvenance for EVERY PMID a strategy returns, not just new ones. The if_not_exists semantics on rs/frd preserve original retrieval attribution across re-runs; ads grows via DynamoDB String Set ADD as new strategies re-find existing PMIDs. src is NOT touched — curator/CTSC paths own it (Phase 33 D-04). Service uses raw AmazonDynamoDB.updateItem (not DynamoDBMapper) because the UpdateExpression combines if_not_exists with ADD on String Sets in one round-trip; DynamoDBMapper does not support that shape. Provenance write failures are logged but not thrown — retrieval must not break if DDB is degraded. Tests: 11 Mockito-backed JUnit cases covering: - TableName + key shape - if_not_exists(rs, ...) and if_not_exists(frd, ...) usage - ADD ads :strategySet semantics - src never touched - Expression attribute value types (S/N/SS) - AmazonDynamoDBException + RuntimeException both swallowed - Null/empty uid and strategyCode short-circuit Build: full mvn test passes the new 11 cases. Two pre-existing failures in TargetAuthorSelectionTest (NullPointerException in test setup, unrelated to this change) are unchanged. Files: + src/main/java/reciter/service/ArticleProvenanceService.java + src/main/java/reciter/service/dynamo/ArticleProvenanceServiceImpl.java + src/test/java/reciter/service/dynamo/ArticleProvenanceServiceImplTest.java M src/main/java/reciter/xml/retriever/engine/AliasReCiterRetrievalEngine.java
… on curator actions
Phase 33-02. Wires curator actions (PM UI POST/PUT /reciter/goldstandard)
through to two new DynamoDB tables that the system can later mine for
algo-coverage signal and per-PMID curator history.
New components:
- reciter.feedback.EntryPath enum (CANDIDATE_LIST | PUBMED_SEARCH).
PM UI emits this discriminator on the goldstandard endpoint via
?entryPath=... query param. Default CANDIDATE_LIST keeps existing
callers backwards-compatible.
- FeedbackLogService + FeedbackLogServiceImpl: writes one item per
curator action to DynamoDB FeedbackLog (Phase 32 schema). Uses raw
AmazonDynamoDB.putItem with conditional attribute_not_exists(sk).
sk = '<epoch>#<8hex>' where the hex is derived from System.nanoTime;
collisions retry up to 5 times with fresh suffixes. src='MAN'
(Phase 32 D-03). curatedBy=0 for now (the goldstandard endpoint
doesn't carry a userID; future work).
- ArticleProvenanceService.upsertCuratorAction extends the existing
service with the D-11 unified transition rule plus D-13's
PM_UI_SEARCH retrieval-record write for the PUBMED_SEARCH path:
existing src absent or 'GS' -> 'MAN'
existing src='PM' -> 'MAN_FROM_PM'
existing src='CTSC' -> 'MAN_FROM_CTSC'
existing src='MAN'/'MAN_FROM_*' -> noop (frd preserved)
Implementation: GetItem (consistent read), compute new src,
conditional UpdateItem with one retry on CCE; fire-and-forget.
Wiring:
- DynamoDbGoldStandardService.save() now has 4-arg overloads taking
EntryPath; the 3-arg signatures default to CANDIDATE_LIST.
- A new private recordFeedbackLogAndArticleProvenance() helper
computes 3 diff buckets per uid: newlyAccepted (incoming \
existing accepted), newlyRejected, newlyPending (existing \
incoming combined). Emits one FeedbackLog row + one D-11 transition
per pmid. PUBMED_SEARCH adds a D-13 PM_UI_SEARCH retrieval write
before the D-11 transition.
- Both single-GoldStandard save() and List<GoldStandard> save()
call the helper inside the existing UPDATE branch where
existingAccepted/Rejected are in scope.
- Controller accepts ?entryPath=CANDIDATE_LIST|PUBMED_SEARCH and
passes through to the service.
Tests:
- 17 new ArticleProvenanceServiceImplTest cases (D-11 transition table
via computeNewSrc; CANDIDATE_LIST / PUBMED_SEARCH path semantics; race
retry; null/default handling).
- 11 new FeedbackLogServiceImplTest cases (schema, sk format, ACCEPTED/
REJECTED/PENDING, sk collision retry, exception swallowing, null guards).
- mvn test full suite: 77 total, 39 new tests pass, 0 new failures.
Pre-existing TargetAuthorSelectionTest (2 NPEs) remain unchanged.
Behavior:
- All curator actions through /reciter/goldstandard now produce
FeedbackLog rows and ArticleProvenance D-11 transitions.
- The PM UI 'PubMed search' workflow flagged via entryPath=PUBMED_SEARCH
is recorded with rs='PM_UI_SEARCH' added to ads, then upgraded to
src=MAN_FROM_PM via D-11 (so the algo-coverage queries in
33-CONTEXT.md work for PM-search-found articles).
- Provenance write failures never block PM UI requests.
Phase 33-01: write ArticleProvenance on every retrieval strategy hit
Phase 33-02: write FeedbackLog + apply ArticleProvenance D-11/D-13 on curator actions
Phase 33-01 originally wrote rs/frd/ads but not src, on the assumption (D-04) that src is owned exclusively by curator/CTSC paths. Reality is inconsistent with that intent: Phase 31's Python backfill wrote src=PM for every retrieval-found item, so the existing table state has src=PM on Phase-31-era items. Without this fix, brand-new ArticleProvenance rows created by Phase 33-01 (when retrieval surfaces a PMID that wasn't in Phase 31's backfill) get rs/frd/ads but src=absent. A subsequent curator action would then transition src=null -> 'MAN' (algo-missed) instead of src=null -> 'MAN_FROM_PM' (algo-found, then curator-touched). That mis-attribution silently inflates the 'algo came up short' algo-coverage cohort with PMIDs the algo actually surfaced. Verified live on dev EKS post-deploy: ajg9004's feature-generator run created 615 new AP rows, all with rs+frd+ads but src=absent. Backfilled those 615 with src=PM via the same if_not_exists semantics this code change applies going forward. The if_not_exists guard preserves existing src values (CTSC, MAN, MAN_FROM_PM, MAN_FROM_CTSC) so curator/CTSC ownership of src is maintained — that's still D-04's spirit; we're just expanding the write to cover the previously-missing 'first ever retrieval' case. Tests: existing 'doesNotTouchSrc' assertion is replaced with 'preservesExistingSrc_onlySetsPmIfAbsent' which verifies the new expression contains 'src = if_not_exists(src, :pm)' and that ':pm' is 'PM'. mvn test: 39/39 pass. Net behavioral change: - Brand-new AP rows: src='PM' (was: src absent) - Existing src=PM rows: unchanged (if_not_exists no-op) - Existing src=MAN/CTSC/MAN_FROM_*: unchanged (if_not_exists no-op) - Pre-existing items with rs but no src in dev: backfilled out-of-band via one-off Python (see commit message of dev backfill)
Phase 33-01b: write src=PM on retrieval-found AP items via if_not_exists
8 tasks
Contributor
Author
|
Closing in favor of #614 — same diff, single commit instead of 3 commits + 3 merge commits. One thing to merge. |
Contributor
Author
|
Recommend using Squash and merge so master gets a single Phase 33 commit instead of 3 commits + 3 merge commits. The 3 underlying commits (#610 / #611 / #612) stay visible in development history; squash just consolidates the master entry. Same diff as before — fully validated on dev EKS:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Bundles the three Phase 33 PRs that have been deployed and validated on dev EKS:
git cherry origin/master origin/developmentconfirms these three commits are the only delta between master and development as of 2026-04-30. Master was last updated 2026-04-27 via PR #609; nothing else has accumulated since.Net diff: 11 files changed, 1,153 insertions, 12 deletions (mostly tests).
What this enables on prod
ArticleProvenancepopulated on every retrieval cycle — established users get freshrs/frd/adswrites per strategy hit; previously the table was only populated by Phase 31's one-time Python backfill and would have aged out organically.FeedbackLogpopulated on every curator action — curator history mirrored from MySQLadmin_feedback_logto the new DynamoDB table during the 4-week dual-write transition window per Phase 33 D-07. PM's existing direct-MySQL writes are unchanged; we add a side-effect write insideDynamoDbGoldStandardService.save().Algo-coverage signal queryable — composite
srcvalues (MAN_FROM_PM,MAN_FROM_CTSC) now populate forward-going so analysts can run:WHERE src='MAN'→ curator-only PMIDs (algo missed entirely)WHERE src='MAN_FROM_PM' AND rs='PM_UI_SEARCH'→ PM UI's PubMed search rediscoveredWHERE src='MAN_FROM_PM' AND contains(ads,'PM_UI_SEARCH') AND rs!='PM_UI_SEARCH'→ sub-threshold rediscoveryWHERE src='MAN_FROM_PM' AND NOT contains(ads,'PM_UI_SEARCH')→ regular candidate-list curationDev validation evidence
All recorded in
~/Dropbox/Projects/ReCiter Research/analysis/phase_33_local_smoke_test.md:kubectl rollout statusclean for both image deploys (586.2026-04-30.16.02.55.2e11b461and587.2026-04-30.17.02.31.e78508fd)scripts/spot_check_feedback_log.py --strict— 4/4 continuity uids pass schema invariantsfeature-generator/by/uid?refreshFlag=ALL_PUBLICATIONSruns (meb7002, sfs2002, rsb2005, ajg9004); all HTTP 200; AP writes verified per uidBackwards compatibility
POST/PUT /reciter/goldstandardkeeps the existing 3-arg shape; the new?entryPath=CANDIDATE_LIST|PUBMED_SEARCHquery param is optional and defaults toCANDIDATE_LIST. PM Next.js callers don't need to change to keep working.IDynamoDbGoldStandardService.save()3-arg overloads delegate to 4-arg overloads withEntryPath.CANDIDATE_LIST. Any external Java callers (none found in this repo) keep working.PmidProvenancewrites are intentionally preserved — both legacy and new tables get writes during the 4-week soak. Removal is a separate post-soak follow-up.EntryPathenum is in this repo'sreciter.feedbackpackage (not in thereciter-article-modelJAR), avoiding a cross-repo Maven release.Test plan
mvn test— 39 new tests pass (11 FeedbackLogServiceImplTest + 28 ArticleProvenanceServiceImplTest); 2 pre-existingTargetAuthorSelectionTestNPEs unchanged (unrelated)ReCiter-ProductionCodePipeline ManualApproval ← Mahender's actionadmin_feedback_logbaseline (~50/day); track AP write volume from first prod inst-client run (1:30 PM UTC the day after deploy)Follow-ups (not blocking this PR)
ReCiter-Publication-ManagerNext.js to emit?entryPath=PUBMED_SEARCHon PubMed-search-originated curator actions — small, separate PRPmidProvenanceJava cleanup — open after the 4-week dual-write soak passesadsset strategy-name dedup — Phase 31's short codes (e.g., "FNI") and Phase 33-01's class-name codes (e.g., "FirstNameInitialRetrievalStrategy") coexist; functional no-op, optional cosmetic cleanup