
feat: Claude LLM fallback for PDF metadata extraction#1265

Open
hubsmoke wants to merge 45 commits into develop from feat/llm-metadata-extraction

@hubsmoke hubsmoke commented Apr 3, 2026

Summary

  • Adds Claude Sonnet as the third tier of a fallback chain used when Grobid returns incomplete metadata: Grobid header → Grobid fulltext → Claude LLM
  • If Grobid is completely unavailable (ECONNREFUSED/HTTP error), falls back to LLM-only extraction
  • Uses tool_use with a forced tool_choice so the model always answers with a structured tool call (title, abstract, authors, DOI, keywords) rather than free text
  • Passes keywords from LLM/Grobid through the DOI controller to MetadataResponse
  • Fixes duplicate authors on manuscript re-extraction (Set Contributors instead of Add Contributors)
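
The tiered fallback described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in AutomatedMetadata.ts: `PaperMetadata`, `Extractor`, `isComplete`, `fillMissing`, and `extractWithFallback` are hypothetical names, and the completeness rule (title, abstract, and at least one author) is an assumption.

```typescript
// Hypothetical shape; the real GrobidMetadata type lives in AutomatedMetadata.ts.
interface PaperMetadata {
  title?: string;
  abstract?: string;
  authors?: string[];
  doi?: string;
  keywords?: string[];
}

// One tier of the chain, e.g. Grobid header, Grobid fulltext, or the LLM.
type Extractor = () => Promise<PaperMetadata | null>;

// Assumed completeness rule: title, abstract, and at least one author.
function isComplete(m: PaperMetadata | null): boolean {
  return Boolean(m?.title && m?.abstract && (m?.authors?.length ?? 0) > 0);
}

// Keep what earlier tiers found; fill only the fields they left empty.
function fillMissing(base: PaperMetadata, extra: PaperMetadata | null): PaperMetadata {
  if (!extra) return base;
  return {
    title: base.title || extra.title,
    abstract: base.abstract || extra.abstract,
    authors: base.authors?.length ? base.authors : extra.authors,
    doi: base.doi || extra.doi,
    keywords: base.keywords?.length ? base.keywords : extra.keywords,
  };
}

// Run tiers in order and stop as soon as the merged result is complete.
async function extractWithFallback(tiers: Extractor[]): Promise<PaperMetadata> {
  let result: PaperMetadata = {};
  for (const tier of tiers) {
    try {
      result = fillMissing(result, await tier());
    } catch {
      // A failing tier (e.g. Grobid unreachable) is skipped; later tiers still run.
    }
    if (isComplete(result)) break;
  }
  return result;
}
```

Passing the tiers as an ordered array keeps the LLM call last, so it only runs when the cheaper Grobid passes leave gaps or fail outright.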

Setup

  • Requires ANTHROPIC_API_KEY env var (gracefully no-ops if missing)
  • Requires @anthropic-ai/sdk (added to package.json)

Test plan

  • Upload a PDF that Grobid handles well → verify existing behavior unchanged
  • Upload a non-standard PDF (scanned, unusual layout) that Grobid returns empty metadata for → verify LLM fallback kicks in and extracts title/authors/abstract
  • Verify keywords appear in metadata response when extracted by LLM
  • Replace manuscript on existing node → verify authors are replaced (not duplicated)
  • Test with ANTHROPIC_API_KEY unset → verify graceful fallback to Grobid-only

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Improved manuscript metadata extraction with intelligent fallback mechanism for enhanced accuracy
    • Added automatic keyword identification from manuscripts to improve discoverability and categorization

hubsmoke and others added 30 commits July 8, 2025 01:22
kadamidev and others added 15 commits January 29, 2026 15:51
Generic {"error":"failed"} made it impossible to diagnose delete failures.
Now returns distinct messages for: ownership check, manifest not found,
IPFS persist failure, and unhandled exceptions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: specific error messages for data delete endpoint
Data references with null paths caused "Cannot read properties of null
(reading 'startsWith')" when filtering refs to delete.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: handle null path in data references during delete
chore: merge develop to main (dpid metadata fix)
chore: merge develop to main (cover image fix)
Grobid frequently returns incomplete metadata (missing title, authors, or
abstract). This adds Claude Sonnet as a 3-tier fallback: Grobid header →
Grobid fulltext → Claude LLM. If Grobid is completely unavailable, falls
back to LLM-only extraction. Uses tool_use with forced tool_choice for
deterministic structured output. Also passes keywords through the DOI
controller and fixes duplicate authors on re-extraction (Set vs Add).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Apr 3, 2026

📝 Walkthrough


Added Anthropic Claude LLM integration to the automated metadata extraction service. Introduced a new queryFromLLM method for PDF-based metadata extraction and extended queryFromGrobid with LLM fallback capabilities when Grobid extraction is incomplete or fails. Updated DOI controller to propagate keywords through the metadata chain.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependency Addition (desci-server/package.json) | Added @anthropic-ai/sdk ^0.82.0 to enable Claude LLM integration. |
| Metadata Service Core (desci-server/src/services/AutomatedMetadata.ts) | Integrated Claude LLM with queryFromLLM() method for PDF metadata extraction using tool_use. Extended queryFromGrobid() with LLM fallback when Grobid results lack required fields or connection fails; fetches PDF from IPFS as needed. Exported GrobidMetadata type with new optional keywords field. Changed action type from 'Add Contributors' to 'Set Contributors'. |
| DOI Controller (desci-server/src/controllers/nodes/doi.ts) | Updated grobidMetadata structure to include optional keywords field. Modified metadata construction logic to propagate keywords from OpenAlex data with fallback to Grobid keywords. |
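
The keyword propagation summarized for the DOI controller (OpenAlex first, Grobid as fallback) might look like the sketch below. `KeywordSource` and `resolveKeywords` are hypothetical names chosen for illustration; the actual field layout in doi.ts may differ.

```typescript
// Hypothetical, simplified shape for illustration only.
interface KeywordSource {
  keywords?: string[];
}

// Prefer OpenAlex keywords; fall back to Grobid/LLM keywords when OpenAlex has none.
function resolveKeywords(openAlex?: KeywordSource, grobid?: KeywordSource): string[] {
  const primary = openAlex?.keywords ?? [];
  return primary.length > 0 ? primary : (grobid?.keywords ?? []);
}
```

Treating an empty OpenAlex array the same as a missing one is a design choice in this sketch: it avoids returning no keywords when the fallback source has some.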

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant GrobidService
    participant IPFSStorage
    participant ClaudeLLM
    participant MetadataStore

    Client->>GrobidService: queryFromGrobid(cid)
    GrobidService->>GrobidService: Extract metadata from Grobid
    
    alt Grobid Success & Complete
        GrobidService->>MetadataStore: Return metadata
    else Grobid Success & Incomplete
        GrobidService->>ClaudeLLM: queryFromLLM (missing fields)
        ClaudeLLM->>ClaudeLLM: Extract via tool_use
        ClaudeLLM->>GrobidService: Return LLM metadata
        GrobidService->>GrobidService: Merge results
        GrobidService->>MetadataStore: Return merged metadata
    else Grobid Connection Fails
        GrobidService->>IPFSStorage: Fetch PDF by cid
        IPFSStorage->>GrobidService: Return PDF buffer
        GrobidService->>ClaudeLLM: queryFromLLM(buffer)
        ClaudeLLM->>ClaudeLLM: Extract via tool_use
        ClaudeLLM->>GrobidService: Return LLM metadata
        GrobidService->>MetadataStore: Return LLM results
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A bunny hops through metadata streams,
Where Claude now helps extract the dreams,
When Grobid falters, LLM steps in bright,
IPFS and keywords dance through the night,
Smart fallbacks ensure the data's right! 📄✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding Claude LLM as a fallback mechanism for PDF metadata extraction, which is the primary feature added across all modified files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
desci-server/src/services/AutomatedMetadata.ts (2)

302-316: Condition logic could be clearer but is functionally safe.

The condition error?.response?.status || error?.code === 'ECONNREFUSED' works because this is in a catch block (axios throws on non-2xx), but the intent would be clearer with explicit error checking.

♻️ Optional: More explicit error condition
-        if (error?.response?.status || error?.code === 'ECONNREFUSED') {
+        // Grobid service error (HTTP error or connection refused)
+        if (error?.code === 'ECONNREFUSED' || error?.response?.status >= 400) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@desci-server/src/services/AutomatedMetadata.ts` around lines 302 - 316, In
the Grobid catch block, make the fallback condition explicit so intent is clear:
replace the loose check `error?.response?.status || error?.code ===
'ECONNREFUSED'` with an explicit test such as checking for an Axios error or
defined response status (e.g. `axios.isAxiosError(error) && typeof
error.response?.status !== 'undefined'`) or `error?.code === 'ECONNREFUSED'`;
keep the rest of the flow that builds `pdfUrl` (using IPFS_RESOLVER and cid),
fetches into a Buffer, calls `this.queryFromLLM(buffer)`, and returns
`DEFAULT_GROBID_METADATA` only when LLM fallback fails. Ensure you import/use
`axios.isAxiosError` if you choose that form.
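
One way to make the fallback condition explicit is to move it into a small predicate. The sketch below is dependency-free and uses a hypothetical structural type (`HttpLikeError`) in place of the Axios error type; `isGrobidServiceError` is likewise an illustrative name, not something from the PR.

```typescript
// Minimal structural type for the parts of an Axios-style error used here (assumption).
interface HttpLikeError {
  code?: string;
  response?: { status?: number };
}

// True when Grobid refused the connection or answered with an HTTP error status.
function isGrobidServiceError(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false;
  const e = error as HttpLikeError;
  if (e.code === 'ECONNREFUSED') return true;
  const status = e.response?.status;
  return typeof status === 'number' && status >= 400;
}
```

In the catch block this would read `if (isGrobidServiceError(error)) { ... }`, which keeps unrelated errors (for example a bug in response parsing) from silently triggering the LLM path.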

336-360: Consider adding a timeout for the Anthropic API call.

The client.messages.create call sets no explicit timeout. For large PDFs or network issues, the request could hang far longer than callers expect. Add a timeout so a stalled request fails promptly:

       const response = await client.messages.create({
         model: 'claude-sonnet-4-20250514',
         max_tokens: 4096,
         tools: [EXTRACT_METADATA_TOOL],
         tool_choice: { type: 'tool', name: 'extract_paper_metadata' },
         messages: [
           {
             role: 'user',
             content: [
               {
                 type: 'document',
                 source: {
                   type: 'base64',
                   media_type: 'application/pdf',
                   data: pdfBase64,
                 },
               },
               {
                 type: 'text',
                 text: 'Extract the metadata from this academic paper. Be precise with author names and affiliations. If keywords are listed explicitly, use those; otherwise infer 3-5 key terms from the abstract and content.',
               },
             ],
           },
         ],
-      });
+      }, { timeout: 120000 }); // 2 minute timeout for large PDFs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@desci-server/src/services/AutomatedMetadata.ts` around lines 336 - 360, The
Anthropic API call via client.messages.create in AutomatedMetadata.ts has no
timeout and can hang on large PDFs or network issues; add a timeout using an
AbortController (or the client's built-in timeout option if available) when
calling client.messages.create (the block that constructs the model
'claude-sonnet-4-20250514' request), set a reasonable timeout (e.g., 30–120s),
ensure you abort the controller on completion/cleanup to avoid leaks, and
surface a clear error when the request is aborted so callers of the function can
handle timeouts appropriately.
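
The AbortController variant mentioned in the prompt above can be sketched as a generic helper. `withTimeout` is a hypothetical name; whether the SDK call accepts the resulting `signal` in its request options is an assumption that should be checked against the SDK version in use.

```typescript
// Hypothetical helper: runs `task` with an AbortSignal that fires after `ms` milliseconds.
async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // tell the task to stop its underlying request
      reject(new Error(`LLM request timed out after ${ms} ms`));
    }, ms);
  });
  try {
    // Whichever settles first wins: the task result or the timeout error.
    return await Promise.race([task(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // always clear the timer so it cannot leak
  }
}
```

A caller would wrap the request as `withTimeout((signal) => doRequest(signal), 120_000)`, where `doRequest` stands in for the actual API call, and handle the timeout error like any other LLM failure.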

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 082a440 and c3ca5ce.

⛔ Files ignored due to path filters (2)
  • desci-server/package-lock.json is excluded by !**/package-lock.json
  • desci-server/yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (3)
  • desci-server/package.json
  • desci-server/src/controllers/nodes/doi.ts
  • desci-server/src/services/AutomatedMetadata.ts

Comment on lines +326 to +330
const apiKey = process.env.ANTHROPIC_API_KEY;
if (!apiKey) {
  logger.warn('ANTHROPIC_API_KEY not set, skipping LLM metadata extraction');
  return null;
}

⚠️ Potential issue | 🟡 Minor

Logger syntax should follow pino convention.

The logger calls should use the (params, message) format per coding guidelines.

As per coding guidelines: **/*.{js,ts,jsx,tsx}: Use correct pino logger syntax, e.g. logger.error(params, message)

🔧 Proposed fix
     if (!apiKey) {
-      logger.warn('ANTHROPIC_API_KEY not set, skipping LLM metadata extraction');
+      logger.warn({}, 'ANTHROPIC_API_KEY not set, skipping LLM metadata extraction');
       return null;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@desci-server/src/services/AutomatedMetadata.ts` around lines 326 - 330, The
logger.warn call in AutomatedMetadata.ts uses the old string-only syntax; update
the ANTHROPIC_API_KEY check to follow pino's (params, message) convention by
passing a structured param object (e.g., indicating the missing env key or
context) as the first argument and the human-readable message as the second;
locate the apiKey variable and the logger.warn call near the ANTHROPIC_API_KEY
check and replace the single-string logger.warn with the two-argument pino-style
call for consistent logging.
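
To illustrate the object-first convention without pulling in pino, the sketch below uses a stand-in logger that only records its calls. `LogFn`, `recorded`, and `getApiKey` are hypothetical; the real code would call the project's pino logger, and the `envVar` field is just one possible piece of context.

```typescript
// Stand-in for the project's pino logger so the sketch runs without dependencies.
type LogFn = (params: Record<string, unknown>, message: string) => void;

const recorded: Array<{ params: Record<string, unknown>; message: string }> = [];
const logger: { warn: LogFn } = {
  warn: (params, message) => {
    recorded.push({ params, message });
  },
};

// Returns the key, or null (after logging) when the env var is missing.
function getApiKey(env: Record<string, string | undefined>): string | null {
  const apiKey = env.ANTHROPIC_API_KEY;
  if (!apiKey) {
    // Structured context first, human-readable message second.
    logger.warn(
      { envVar: 'ANTHROPIC_API_KEY' },
      'ANTHROPIC_API_KEY not set, skipping LLM metadata extraction',
    );
    return null;
  }
  return apiKey;
}
```

Typing the logger as `(params, message)` makes a string-only call a compile error, which is one way to enforce the guideline rather than rely on review.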

Comment on lines +362 to +366
const toolBlock = response.content.find((block) => block.type === 'tool_use');
if (!toolBlock || toolBlock.type !== 'tool_use') {
  logger.error('No tool_use block in LLM response');
  return null;
}

⚠️ Potential issue | 🟡 Minor

Logger syntax should follow pino convention.

As per coding guidelines: **/*.{js,ts,jsx,tsx}: Use correct pino logger syntax, e.g. logger.error(params, message)

🔧 Proposed fix
       const toolBlock = response.content.find((block) => block.type === 'tool_use');
       if (!toolBlock || toolBlock.type !== 'tool_use') {
-        logger.error('No tool_use block in LLM response');
+        logger.error({ response: response.content }, 'No tool_use block in LLM response');
         return null;
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@desci-server/src/services/AutomatedMetadata.ts` around lines 362 - 366, The
logger call in AutomatedMetadata.ts uses incorrect syntax; change
logger.error('No tool_use block in LLM response') to the pino convention by
passing structured context as the first argument and the message as the second
(for example include response or toolBlock details), e.g., call logger.error({
response, toolBlock }, 'No tool_use block in LLM response'); update the error
branch around the toolBlock check to use this form so pino receives the object
payload first.
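
The repeated `toolBlock.type !== 'tool_use'` check in the snippet under review exists only to narrow the type after `find`. A type predicate does the same narrowing in one step, as in this sketch. The block types here are simplified local stand-ins for the SDK's content block types (an assumption), and `findToolUse` is a hypothetical helper.

```typescript
// Simplified local stand-ins for the SDK's content block types (assumption).
interface TextBlock {
  type: 'text';
  text: string;
}
interface ToolUseBlock {
  type: 'tool_use';
  id: string;
  name: string;
  input: unknown;
}
type ContentBlock = TextBlock | ToolUseBlock;

// Returns the first tool_use block for `toolName`, or null if the model produced none.
function findToolUse(content: ContentBlock[], toolName: string): ToolUseBlock | null {
  // The `block is ToolUseBlock` predicate lets `find` return a narrowed type.
  const block = content.find(
    (b): b is ToolUseBlock => b.type === 'tool_use' && b.name === toolName,
  );
  return block ?? null;
}
```

Matching on the tool name as well as the block type guards against reading the input of an unrelated tool call if more tools are added later.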
