
feat: optional pre-extraction webhook for custom text enrichment #282

Open
gloeckle-direct-ki wants to merge 1 commit into danny-avila:main from gloeckle-direct-ki:feature/pre-extraction-webhook

Conversation

@gloeckle-direct-ki

Problem

rag_api ships no fallback for PDFs where PyPDF extraction yields little or no text — typically scanned documents and image-heavy reports. The result is 0 chunks for these files, silently. There's no hook to plug in OCR or any other enrichment.

Solution

Optional, env-gated webhook called before the standard text pipeline runs. When set, rag_api POSTs the source file to the configured URL and uses the returned text instead of (or in addition to) PyPDF output, but only if PyPDF's per-page average falls below a threshold.

Two new env vars, both optional, both off by default:

| Var | Effect |
| --- | --- |
| `PRE_EXTRACTION_WEBHOOK_URL` | If set, the webhook is consulted on text-extraction-poor files. Unset → no behavior change. |
| `PRE_EXTRACTION_WEBHOOK_MIN_CHARS` | Per-page char threshold (default 50). PyPDF results above this go through unchanged. |
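The threshold decision itself is simple. A minimal sketch (hypothetical helper name, not the PR's actual code), using the description's default of 50 chars per page:

```python
def needs_enrichment(pages: list[str], min_chars_per_page: int = 50) -> bool:
    """Consult the webhook only when PyPDF's per-page average is poor.

    Hypothetical helper: both an empty document and a low average
    character count trigger the webhook path.
    """
    if not pages:
        return True
    avg = sum(len(page.strip()) for page in pages) / len(pages)
    return avg < min_chars_per_page
```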

The webhook contract is intentionally minimal: POST the source file as multipart field `file`; the response is JSON `{"text": "..."}`. Anyone can plug in Azure Document Intelligence, Tesseract, Marker, Mistral OCR, or whatever fits their stack.
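A sketch of the caller side of that contract, stdlib only (the function name and multipart details beyond the `file` field are assumptions, not the PR's actual code):

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def call_pre_extraction_webhook(url: str, file_path: str, timeout: float = 60.0) -> str:
    """POST the source file as multipart field `file`; expect {"text": "..."} back.

    Hypothetical sketch of the proposed contract, not the PR's code.
    """
    boundary = uuid.uuid4().hex
    filename = os.path.basename(file_path)
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    with open(file_path, "rb") as fh:
        payload = fh.read()
    # Hand-rolled multipart/form-data body with a single `file` part.
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode() + payload + f"\r\n--{boundary}--\r\n".encode()
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode())["text"]
```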

What this PR does NOT do

It does not bundle an OCR engine. The point is to keep rag_api lean and let users compose. We run a tiny TypeScript sidecar (~150 LOC) for Azure Document Intelligence behind it; that's our concern, not upstream's.
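The PR's actual sidecar is TypeScript and not part of this change; purely as an illustration of how small such a service can be, here is a hedged Python stand-in using only the standard library (`run_ocr` is a stub, not a real OCR call):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_ocr(multipart_body: bytes) -> str:
    # Stub: a real sidecar would parse the multipart payload and run
    # Azure Document Intelligence, Tesseract, etc. on the file part.
    return "extracted text"


class EnrichmentHandler(BaseHTTPRequestHandler):
    """Minimal stand-in for an enrichment sidecar (hypothetical):
    read the POSTed body and answer with {"text": "..."}."""

    def do_POST(self):
        raw = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        body = json.dumps({"text": run_ocr(raw)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Serving it is one line: `HTTPServer(("0.0.0.0", 8080), EnrichmentHandler).serve_forever()`.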

Soft-fail policy

If the webhook is unreachable, returns 5xx, or returns malformed JSON: log a warning, continue with the PyPDF result. The new code path can never break the existing one.
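A sketch of what that policy looks like in code (hypothetical names; `call_webhook` stands in for the HTTP request):

```python
import logging

logger = logging.getLogger("rag_api.pre_extraction")


def extract_with_optional_webhook(call_webhook, pypdf_text: str) -> str:
    """Soft-fail wrapper sketch, not the PR's actual code: any webhook
    error (or an empty answer) logs a warning and keeps the PyPDF result."""
    try:
        text = call_webhook()
    except Exception as exc:
        logger.warning("pre-extraction webhook failed, using PyPDF output: %s", exc)
        return pypdf_text
    return text if text else pypdf_text
```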

Tests

Unit tests for the threshold logic (above/below char average), the request shape, and the soft-fail behavior on transport errors.

Why upstream

We've been running this on prod for two weeks, ~50 mixed-content PDFs. It plugged the OCR gap cleanly without bringing OCR into rag_api itself. Other users running scanned-document workloads (legal, archival, scientific PDFs) probably want the same hook point.

Happy to iterate on the webhook contract / config knobs / docs if anything feels off.


Co-authored-by: Claude (Anthropic)

Adds a tiny hook at the end of the document-loading pipeline that forwards
the original file to an HTTP webhook whenever text extraction produced
effectively empty pages (e.g. scanned PDFs). The webhook is expected to
return `{"text": "...", "provider": "..."}`; its output replaces the
extraction and is tagged with `ocr_used=True` so downstream pipelines can
tell text-backed chunks from OCR-backed chunks.
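As an illustration of that tagging step, a hedged sketch with a minimal LangChain-style document stand-in (all names here are assumptions, not the PR's code):

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Minimal stand-in for a LangChain-style document (hypothetical).
    page_content: str
    metadata: dict = field(default_factory=dict)


def replace_with_webhook_output(docs: list[Document], text: str, provider: str) -> list[Document]:
    """Collapse the poor extraction into a single document carrying the
    webhook's text, tagged ocr_used=True so downstream pipelines can
    tell text-backed chunks from OCR-backed chunks."""
    meta = dict(docs[0].metadata) if docs else {}
    meta.update({"ocr_used": True, "provider": provider})
    return [Document(page_content=text, metadata=meta)]
```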

Disabled by default — enabled only when `PRE_EXTRACTION_WEBHOOK_URL` is set.
Configurable threshold (PRE_EXTRACTION_WEBHOOK_MIN_CHARS, default 100 chars
per page) and timeout (PRE_EXTRACTION_WEBHOOK_TIMEOUT, default 60 s).
Webhook failures never break ingest — on any error we fall back to the
original extraction with a warning log.

This keeps the core simple while letting external services participate in
ingest (OCR, translation, custom chunking, …) without subclassing loaders.

Tests: 6 new pytest cases covering the disabled path, the threshold
branches, HTTP failures, empty-response fallback, and the empty-documents
edge case.
