
feat: optional pre-extraction webhook for custom text enrichment #282

Open
gloeckle-direct-ki wants to merge 1 commit into danny-avila:main from gloeckle-direct-ki:feature/pre-extraction-webhook

Conversation

@gloeckle-direct-ki

Problem

rag_api ships no fallback for PDFs where PyPDF extraction yields little or no text — typically scanned documents and image-heavy reports. The result is 0 chunks for these files, silently. There's no hook to plug in OCR or any other enrichment.

Solution

Optional, env-gated webhook called before the standard text pipeline runs. When set, rag_api POSTs the source file to the configured URL and uses the returned text instead of (or in addition to) PyPDF output, but only if PyPDF's per-page average falls below a threshold.

Two new env vars, both optional, both off by default:

| Var | Effect |
| --- | --- |
| `PRE_EXTRACTION_WEBHOOK_URL` | If set, the webhook is consulted on text-extraction-poor files. Unset → no behavior change. |
| `PRE_EXTRACTION_WEBHOOK_MIN_CHARS` | Per-page char threshold (default 50). PyPDF results above this go through unchanged. |
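The threshold decision itself is simple. A minimal sketch (hypothetical helper name, not the PR's actual code), using the description's default of 50 chars per page:

```python
def needs_enrichment(pages: list[str], min_chars_per_page: int = 50) -> bool:
    """Consult the webhook only when PyPDF's per-page average is poor.

    Hypothetical helper: both an empty document and a low average
    character count trigger the webhook path.
    """
    if not pages:
        return True
    avg = sum(len(page.strip()) for page in pages) / len(pages)
    return avg < min_chars_per_page
```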

The webhook contract is intentionally minimal: POST the source file as multipart field `file`; the response is JSON `{"text": "..."}`. Anyone can plug in Azure Document Intelligence, Tesseract, Marker, Mistral OCR, or whatever fits their stack.
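A sketch of the caller side of that contract, stdlib only (the function name and multipart details beyond the `file` field are assumptions, not the PR's actual code):

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def call_pre_extraction_webhook(url: str, file_path: str, timeout: float = 60.0) -> str:
    """POST the source file as multipart field `file`; expect {"text": "..."} back.

    Hypothetical sketch of the proposed contract, not the PR's code.
    """
    boundary = uuid.uuid4().hex
    filename = os.path.basename(file_path)
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    with open(file_path, "rb") as fh:
        payload = fh.read()
    # Hand-rolled multipart/form-data body with a single `file` part.
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode() + payload + f"\r\n--{boundary}--\r\n".encode()
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode())["text"]
```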

What this PR does NOT do

It does not bundle an OCR engine. The point is to keep rag_api lean and let users compose. We run a tiny TypeScript sidecar (~150 LOC) for Azure Document Intelligence behind it; that's our concern, not upstream's.
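The PR's actual sidecar is TypeScript and not part of this change; purely as an illustration of how small such a service can be, here is a hedged Python stand-in using only the standard library (`run_ocr` is a stub, not a real OCR call):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_ocr(multipart_body: bytes) -> str:
    # Stub: a real sidecar would parse the multipart payload and run
    # Azure Document Intelligence, Tesseract, etc. on the file part.
    return "extracted text"


class EnrichmentHandler(BaseHTTPRequestHandler):
    """Minimal stand-in for an enrichment sidecar (hypothetical):
    read the POSTed body and answer with {"text": "..."}."""

    def do_POST(self):
        raw = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        body = json.dumps({"text": run_ocr(raw)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Serving it is one line: `HTTPServer(("0.0.0.0", 8080), EnrichmentHandler).serve_forever()`.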

Soft-fail policy

If the webhook is unreachable, returns 5xx, or returns malformed JSON: log a warning, continue with the PyPDF result. The new code path can never break the existing one.
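A sketch of what that policy looks like in code (hypothetical names; `call_webhook` stands in for the HTTP request):

```python
import logging

logger = logging.getLogger("rag_api.pre_extraction")


def extract_with_optional_webhook(call_webhook, pypdf_text: str) -> str:
    """Soft-fail wrapper sketch, not the PR's actual code: any webhook
    error (or an empty answer) logs a warning and keeps the PyPDF result."""
    try:
        text = call_webhook()
    except Exception as exc:
        logger.warning("pre-extraction webhook failed, using PyPDF output: %s", exc)
        return pypdf_text
    return text if text else pypdf_text
```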

Tests

Unit tests for the threshold logic (above/below char average), the request shape, and the soft-fail behavior on transport errors.

Why upstream

We've been running this on prod for two weeks, ~50 mixed-content PDFs. It plugged the OCR gap cleanly without bringing OCR into rag_api itself. Other users running scanned-document workloads (legal, archival, scientific PDFs) probably want the same hook point.

Happy to iterate on the webhook contract / config knobs / docs if anything feels off.


Co-authored-by: Claude (Anthropic)

Adds a tiny hook at the end of the document-loading pipeline that forwards
the original file to an HTTP webhook whenever text extraction produced
effectively empty pages (e.g. scanned PDFs). The webhook is expected to
return `{"text": "...", "provider": "..."}`; its output replaces the
extraction and is tagged with `ocr_used=True` so downstream pipelines can
tell text-backed chunks from OCR-backed chunks.
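As an illustration of that tagging step, a hedged sketch with a minimal LangChain-style document stand-in (all names here are assumptions, not the PR's code):

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Minimal stand-in for a LangChain-style document (hypothetical).
    page_content: str
    metadata: dict = field(default_factory=dict)


def replace_with_webhook_output(docs: list[Document], text: str, provider: str) -> list[Document]:
    """Collapse the poor extraction into a single document carrying the
    webhook's text, tagged ocr_used=True so downstream pipelines can
    tell text-backed chunks from OCR-backed chunks."""
    meta = dict(docs[0].metadata) if docs else {}
    meta.update({"ocr_used": True, "provider": provider})
    return [Document(page_content=text, metadata=meta)]
```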

Disabled by default — enabled only when `PRE_EXTRACTION_WEBHOOK_URL` is set.
Configurable threshold (PRE_EXTRACTION_WEBHOOK_MIN_CHARS, default 100 chars
per page) and timeout (PRE_EXTRACTION_WEBHOOK_TIMEOUT, default 60 s).
Webhook failures never break ingest — on any error we fall back to the
original extraction with a warning log.

This keeps the core simple while letting external services participate in
ingest (OCR, translation, custom chunking, …) without subclassing loaders.

Tests: 6 new pytest cases covering the disabled path, the threshold
branches, HTTP failures, empty-response fallback, and the empty-documents
edge case.
