feat: Claude LLM fallback for PDF metadata extraction#1265
feat: Claude LLM fallback for PDF metadata extraction#1265
Conversation
promote main
Promote Main
promote main
promote main
promote main
promote main
promote main
promote main
promote main
promote main
Promote main
Promote main
Promote main
Promote main
Promote dev
Promote main
Sub model changes
Generic {"error":"failed"} made it impossible to diagnose delete failures.
Now returns distinct messages for: ownership check, manifest not found,
IPFS persist failure, and unhandled exceptions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: specific error messages for data delete endpoint
Data references with null paths caused "Cannot read properties of null (reading 'startsWith')" when filtering refs to delete. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: handle null path in data references during delete
chore: merge develop to main (dpid metadata fix)
chore: merge develop to main (cover image fix)
Grobid frequently returns incomplete metadata (missing title, authors, or abstract). This adds Claude Sonnet as a 3-tier fallback: Grobid header → Grobid fulltext → Claude LLM. If Grobid is completely unavailable, falls back to LLM-only extraction. Uses tool_use with forced tool_choice for deterministic structured output. Also passes keywords through the DOI controller and fixes duplicate authors on re-extraction (Set vs Add). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughAdded Anthropic Claude LLM integration to the automated metadata extraction service. Introduced a new Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant GrobidService
participant IPFSStorage
participant ClaudeLLM
participant MetadataStore
Client->>GrobidService: queryFromGrobid(cid)
GrobidService->>GrobidService: Extract metadata from Grobid
alt Grobid Success & Complete
GrobidService->>MetadataStore: Return metadata
else Grobid Success & Incomplete
GrobidService->>ClaudeLLM: queryFromLLM (missing fields)
ClaudeLLM->>ClaudeLLM: Extract via tool_use
ClaudeLLM->>GrobidService: Return LLM metadata
GrobidService->>GrobidService: Merge results
GrobidService->>MetadataStore: Return merged metadata
else Grobid Connection Fails
GrobidService->>IPFSStorage: Fetch PDF by cid
IPFSStorage->>GrobidService: Return PDF buffer
GrobidService->>ClaudeLLM: queryFromLLM(buffer)
ClaudeLLM->>ClaudeLLM: Extract via tool_use
ClaudeLLM->>GrobidService: Return LLM metadata
GrobidService->>MetadataStore: Return LLM results
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
desci-server/src/services/AutomatedMetadata.ts (2)
302-316: Condition logic could be clearer but is functionally safe.The condition
error?.response?.status || error?.code === 'ECONNREFUSED'works because this is in a catch block (axios throws on non-2xx), but the intent would be clearer with explicit error checking.♻️ Optional: More explicit error condition
- if (error?.response?.status || error?.code === 'ECONNREFUSED') { + // Grobid service error (HTTP error or connection refused) + if (error?.code === 'ECONNREFUSED' || error?.response?.status >= 400) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@desci-server/src/services/AutomatedMetadata.ts` around lines 302 - 316, In the Grobid catch block, make the fallback condition explicit so intent is clear: replace the loose check `error?.response?.status || error?.code === 'ECONNREFUSED'` with an explicit test such as checking for an Axios error or defined response status (e.g. `axios.isAxiosError(error) && typeof error.response?.status !== 'undefined'`) or `error?.code === 'ECONNREFUSED'`; keep the rest of the flow that builds `pdfUrl` (using IPFS_RESOLVER and cid), fetches into a Buffer, calls `this.queryFromLLM(buffer)`, and returns `DEFAULT_GROBID_METADATA` only when LLM fallback fails. Ensure you import/use `axios.isAxiosError` if you choose that form.
336-360: Consider adding a timeout for the Anthropic API call.The
client.messages.createcall has no timeout configured. For large PDFs or network issues, this could block indefinitely. Add a timeout to prevent the request from hanging:const response = await client.messages.create({ model: 'claude-sonnet-4-20250514', max_tokens: 4096, tools: [EXTRACT_METADATA_TOOL], tool_choice: { type: 'tool', name: 'extract_paper_metadata' }, messages: [ { role: 'user', content: [ { type: 'document', source: { type: 'base64', media_type: 'application/pdf', data: pdfBase64, }, }, { type: 'text', text: 'Extract the metadata from this academic paper. Be precise with author names and affiliations. If keywords are listed explicitly, use those; otherwise infer 3-5 key terms from the abstract and content.', }, ], }, ], - }); + }, { timeout: 120000 }); // 2 minute timeout for large PDFs🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@desci-server/src/services/AutomatedMetadata.ts` around lines 336 - 360, The Anthropic API call via client.messages.create in AutomatedMetadata.ts has no timeout and can hang on large PDFs or network issues; add a timeout using an AbortController (or the client's built-in timeout option if available) when calling client.messages.create (the block that constructs the model 'claude-sonnet-4-20250514' request), set a reasonable timeout (e.g., 30–120s), ensure you abort the controller on completion/cleanup to avoid leaks, and surface a clear error when the request is aborted so callers of the function can handle timeouts appropriately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@desci-server/src/services/AutomatedMetadata.ts`:
- Around line 362-366: The logger call in AutomatedMetadata.ts uses incorrect
syntax; change logger.error('No tool_use block in LLM response') to the pino
convention by passing structured context as the first argument and the message
as the second (for example include response or toolBlock details), e.g., call
logger.error({ response, toolBlock }, 'No tool_use block in LLM response');
update the error branch around the toolBlock check to use this form so pino
receives the object payload first.
- Around line 326-330: The logger.warn call in AutomatedMetadata.ts uses the old
string-only syntax; update the ANTHROPIC_API_KEY check to follow pino's (params,
message) convention by passing a structured param object (e.g., indicating the
missing env key or context) as the first argument and the human-readable message
as the second; locate the apiKey variable and the logger.warn call near the
ANTHROPIC_API_KEY check and replace the single-string logger.warn with the
two-argument pino-style call for consistent logging.
---
Nitpick comments:
In `@desci-server/src/services/AutomatedMetadata.ts`:
- Around line 302-316: In the Grobid catch block, make the fallback condition
explicit so intent is clear: replace the loose check `error?.response?.status ||
error?.code === 'ECONNREFUSED'` with an explicit test such as checking for an
Axios error or defined response status (e.g. `axios.isAxiosError(error) &&
typeof error.response?.status !== 'undefined'`) or `error?.code ===
'ECONNREFUSED'`; keep the rest of the flow that builds `pdfUrl` (using
IPFS_RESOLVER and cid), fetches into a Buffer, calls
`this.queryFromLLM(buffer)`, and returns `DEFAULT_GROBID_METADATA` only when LLM
fallback fails. Ensure you import/use `axios.isAxiosError` if you choose that
form.
- Around line 336-360: The Anthropic API call via client.messages.create in
AutomatedMetadata.ts has no timeout and can hang on large PDFs or network
issues; add a timeout using an AbortController (or the client's built-in timeout
option if available) when calling client.messages.create (the block that
constructs the model 'claude-sonnet-4-20250514' request), set a reasonable
timeout (e.g., 30–120s), ensure you abort the controller on completion/cleanup
to avoid leaks, and surface a clear error when the request is aborted so callers
of the function can handle timeouts appropriately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3a61245b-63d0-49cb-b85b-becc8b2dd53a
⛔ Files ignored due to path filters (2)
desci-server/package-lock.jsonis excluded by!**/package-lock.jsondesci-server/yarn.lockis excluded by!**/yarn.lock,!**/*.lock
📒 Files selected for processing (3)
desci-server/package.jsondesci-server/src/controllers/nodes/doi.tsdesci-server/src/services/AutomatedMetadata.ts
| const apiKey = process.env.ANTHROPIC_API_KEY; | ||
| if (!apiKey) { | ||
| logger.warn('ANTHROPIC_API_KEY not set, skipping LLM metadata extraction'); | ||
| return null; | ||
| } |
There was a problem hiding this comment.
Logger syntax should follow pino convention.
The logger calls should use the (params, message) format per coding guidelines.
As per coding guidelines: **/*.{js,ts,jsx,tsx}: Use correct pino logger syntax, e.g. logger.error(params, message)
🔧 Proposed fix
if (!apiKey) {
- logger.warn('ANTHROPIC_API_KEY not set, skipping LLM metadata extraction');
+ logger.warn({}, 'ANTHROPIC_API_KEY not set, skipping LLM metadata extraction');
return null;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const apiKey = process.env.ANTHROPIC_API_KEY; | |
| if (!apiKey) { | |
| logger.warn('ANTHROPIC_API_KEY not set, skipping LLM metadata extraction'); | |
| return null; | |
| } | |
| const apiKey = process.env.ANTHROPIC_API_KEY; | |
| if (!apiKey) { | |
| logger.warn({}, 'ANTHROPIC_API_KEY not set, skipping LLM metadata extraction'); | |
| return null; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@desci-server/src/services/AutomatedMetadata.ts` around lines 326 - 330, The
logger.warn call in AutomatedMetadata.ts uses the old string-only syntax; update
the ANTHROPIC_API_KEY check to follow pino's (params, message) convention by
passing a structured param object (e.g., indicating the missing env key or
context) as the first argument and the human-readable message as the second;
locate the apiKey variable and the logger.warn call near the ANTHROPIC_API_KEY
check and replace the single-string logger.warn with the two-argument pino-style
call for consistent logging.
| const toolBlock = response.content.find((block) => block.type === 'tool_use'); | ||
| if (!toolBlock || toolBlock.type !== 'tool_use') { | ||
| logger.error('No tool_use block in LLM response'); | ||
| return null; | ||
| } |
There was a problem hiding this comment.
Logger syntax should follow pino convention.
As per coding guidelines: **/*.{js,ts,jsx,tsx}: Use correct pino logger syntax, e.g. logger.error(params, message)
🔧 Proposed fix
const toolBlock = response.content.find((block) => block.type === 'tool_use');
if (!toolBlock || toolBlock.type !== 'tool_use') {
- logger.error('No tool_use block in LLM response');
+ logger.error({ response: response.content }, 'No tool_use block in LLM response');
return null;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const toolBlock = response.content.find((block) => block.type === 'tool_use'); | |
| if (!toolBlock || toolBlock.type !== 'tool_use') { | |
| logger.error('No tool_use block in LLM response'); | |
| return null; | |
| } | |
| const toolBlock = response.content.find((block) => block.type === 'tool_use'); | |
| if (!toolBlock || toolBlock.type !== 'tool_use') { | |
| logger.error({ response: response.content }, 'No tool_use block in LLM response'); | |
| return null; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@desci-server/src/services/AutomatedMetadata.ts` around lines 362 - 366, The
logger call in AutomatedMetadata.ts uses incorrect syntax; change
logger.error('No tool_use block in LLM response') to the pino convention by
passing structured context as the first argument and the message as the second
(for example include response or toolBlock details), e.g., call logger.error({
response, toolBlock }, 'No tool_use block in LLM response'); update the error
branch around the toolBlock check to use this form so pino receives the object
payload first.
Summary
tool_usewith forcedtool_choicefor deterministic structured output (title, abstract, authors, DOI, keywords)MetadataResponseSet Contributorsinstead ofAdd Contributors)Setup
ANTHROPIC_API_KEYenv var (gracefully no-ops if missing)@anthropic-ai/sdk(added to package.json)Test plan
ANTHROPIC_API_KEYunset → verify graceful fallback to Grobid-only🤖 Generated with Claude Code
Summary by CodeRabbit