feat: optional pre-extraction webhook for custom text enrichment #282
Open
gloeckle-direct-ki wants to merge 1 commit into danny-avila:main from
Adds a tiny hook at the end of the document-loading pipeline that forwards
the original file to an HTTP webhook whenever text extraction produced
effectively-empty pages (e.g. scanned PDFs). The webhook is expected to
return `{"text": "...", "provider": "..."}`; its output replaces the
extraction and is tagged with `ocr_used=True` so downstream pipelines can
tell text-backed chunks from OCR-backed chunks.
Disabled by default; enabled only when PRE_EXTRACTION_WEBHOOK_URL is set.
Configurable threshold (PRE_EXTRACTION_WEBHOOK_MIN_CHARS, default 100 chars
per page) and timeout (PRE_EXTRACTION_WEBHOOK_TIMEOUT, default 60 s).
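A minimal sketch of how the configuration and threshold check described above might look. The helper names `load_webhook_config` and `needs_webhook` are hypothetical and not taken from the PR diff; only the env var names and defaults come from the description. Treating an empty page list as "do not call the webhook" is an assumption.

```python
import os
from typing import Mapping, Optional, Sequence, TypedDict


class WebhookConfig(TypedDict):
    url: Optional[str]
    min_chars: int
    timeout: float


def load_webhook_config(env: Mapping[str, str] = os.environ) -> WebhookConfig:
    """Read the three env vars; an unset or blank URL means the feature is off."""
    url = env.get("PRE_EXTRACTION_WEBHOOK_URL", "").strip()
    return {
        "url": url or None,
        # Defaults as stated in the description: 100 chars per page, 60 s.
        "min_chars": int(env.get("PRE_EXTRACTION_WEBHOOK_MIN_CHARS", "100")),
        "timeout": float(env.get("PRE_EXTRACTION_WEBHOOK_TIMEOUT", "60")),
    }


def needs_webhook(page_texts: Sequence[str], min_chars: int) -> bool:
    """True when the average stripped length per page is below the threshold."""
    if not page_texts:
        # Assumption: with no pages at all there is nothing to average,
        # so this sketch skips the webhook.
        return False
    average = sum(len(text.strip()) for text in page_texts) / len(page_texts)
    return average < min_chars
```

With this shape, a scanned PDF whose pages extract to whitespace averages 0 characters per page and triggers the webhook, while a text-backed PDF stays on the normal path.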
Webhook failures never break ingest — on any error we fall back to the
original extraction with a warning log.
This keeps the core simple while letting external services participate in
ingest (OCR, translation, custom chunking, …) without subclassing loaders.
Tests: 6 new pytest cases covering the disabled path, the threshold
branches, HTTP failures, empty-response fallback, and the empty-documents
edge case.
Problem
`rag_api` ships no fallback for PDFs where PyPDF extraction yields little or no text, typically scanned documents and image-heavy reports. These files silently produce 0 chunks, and there is no hook to plug in OCR or any other enrichment.
Solution
An optional, env-gated webhook, called before the standard text pipeline runs. When the URL is set, `rag_api` POSTs the source file to the configured URL and uses the returned text instead of (or in addition to) PyPDF output, but only if PyPDF's per-page character average falls below a threshold.
Two new env vars, both optional, both off by default:
`PRE_EXTRACTION_WEBHOOK_URL`
`PRE_EXTRACTION_WEBHOOK_MIN_CHARS`
The webhook contract is intentionally minimal: `POST` multipart `file=`, response is JSON `{"text": "..."}`. Anyone can plug in Azure Document Intelligence, Tesseract, Marker, Mistral OCR, or whatever fits their stack.
What this PR does NOT do
It does not bundle an OCR engine. The point is to keep `rag_api` lean and let users compose. We run a tiny TypeScript sidecar (~150 LOC) for Azure Document Intelligence behind it; that's our concern, not upstream's.
Soft-fail policy
If the webhook is unreachable, returns 5xx, or returns malformed JSON: log a warning, continue with the PyPDF result. The new code path can never break the existing one.
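The contract and soft-fail behavior could be sketched roughly as follows, using only the standard library. This is an illustration under assumptions, not the code in the PR: `post_file` and `extract_with_fallback` are hypothetical names, the multipart body is hand-built, and the boolean in the return value stands in for the `ocr_used` tag.

```python
import json
import logging
import urllib.request
import uuid
from typing import Callable, Tuple

logger = logging.getLogger("pre_extraction_webhook")


def post_file(url: str, filename: str, data: bytes, timeout: float) -> bytes:
    """POST `data` as the multipart form field `file`; return the raw response body."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    # urlopen raises HTTPError on 4xx/5xx and URLError when the host is unreachable.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def extract_with_fallback(
    original_text: str,
    url: str,
    filename: str,
    data: bytes,
    timeout: float = 60.0,
    send: Callable[[str, str, bytes, float], bytes] = post_file,
) -> Tuple[str, bool]:
    """Return (text, ocr_used); any webhook problem keeps the original text."""
    try:
        payload = json.loads(send(url, filename, data, timeout))
        text = payload["text"]
        if not isinstance(text, str) or not text.strip():
            raise ValueError("webhook returned no usable text")
    except Exception as exc:  # soft-fail: never let the webhook break ingest
        logger.warning("pre-extraction webhook failed, keeping original text: %s", exc)
        return original_text, False
    return text, True
```

The `send` parameter exists so that tests can substitute a fake transport instead of opening a socket; a caller would normally leave it at its default.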
Tests
Unit tests for the threshold logic (above/below char average), the request shape, and the soft-fail behavior on transport errors.
Why upstream
We've been running this in production for two weeks on ~50 mixed-content PDFs. It plugged the OCR gap cleanly without bringing OCR into `rag_api` itself. Other users running scanned-document workloads (legal, archival, scientific PDFs) probably want the same hook point.
Happy to iterate on the webhook contract / config knobs / docs if anything feels off.
Co-authored-by: Claude (Anthropic)