Context
The current ingester in src/documents/ingester.py has an OCR fallback path that calls
_extract_text_ocr(). That path depends on the optional extras pdf2image and pytesseract,
plus the system packages tesseract-ocr and poppler-utils.
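A small preflight check can show which of these dependencies are missing on a given machine. The sketch below is illustrative only: the function name check_ocr_dependencies is hypothetical, and it assumes the Tesseract and Poppler binaries are exposed on PATH as tesseract and pdftoppm.

```python
import importlib.util
import shutil
from typing import Dict

# Optional Python extras named in this issue.
PYTHON_EXTRAS = ("pdf2image", "pytesseract")
# Assumed executable names for tesseract-ocr and poppler-utils.
SYSTEM_BINARIES = ("tesseract", "pdftoppm")


def check_ocr_dependencies() -> Dict[str, bool]:
    """Report which OCR dependencies can be located (hypothetical helper)."""
    status: Dict[str, bool] = {}
    for module in PYTHON_EXTRAS:
        # find_spec returns None when a top-level module is not installed.
        status[module] = importlib.util.find_spec(module) is not None
    for binary in SYSTEM_BINARIES:
        # shutil.which returns None when the executable is not on PATH.
        status[binary] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for name, found in check_ocr_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Running this before the OCR path is attempted would let the ingester report a missing dependency by name instead of failing partway through a document.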
Work Needed
- Document exact system dependency installation steps for Linux, macOS, and Windows
- Test the OCR path against representative scanned USDA APHIS inspection PDFs
- Verify the in-memory decryption → OCR pipeline (BytesIO, never writes temp files)
- Add per-page OCR fallback: only fall back on pages where native text is empty, not the whole doc
- Add quality check: if OCR yields < N chars on a page, flag for manual review
- Wire up to the src/api/server.py document ingest endpoint
- Add integration test with a sample scanned PDF (use synthetic/non-sensitive test fixture)
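The per-page fallback and quality check items above could be sketched roughly as follows. This is a minimal sketch, not the ingester's actual code: PageResult, extract_pages, and the ocr_page callable are hypothetical names, and MIN_OCR_CHARS is a placeholder for the threshold N, which is still to be chosen.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Placeholder for "N"; the real threshold is an open decision.
MIN_OCR_CHARS = 40


@dataclass
class PageResult:
    page_number: int    # 1-based page number
    text: str           # native text, or OCR text when the fallback was used
    used_ocr: bool      # True when this page went through OCR
    needs_review: bool  # True when the OCR output looks too short to trust


def extract_pages(
    native_texts: Sequence[str],
    ocr_page: Callable[[int], str],
    min_chars: int = MIN_OCR_CHARS,
) -> List[PageResult]:
    """Use native text where present; OCR only the pages that have none.

    native_texts holds the already-extracted text of each page, and
    ocr_page(index) is a hypothetical callable that OCRs one page by
    0-based index.
    """
    results: List[PageResult] = []
    for index, native in enumerate(native_texts):
        if native.strip():
            # Native text exists, so OCR is skipped for this page.
            results.append(PageResult(index + 1, native, False, False))
            continue
        ocr_text = ocr_page(index)
        # Flag pages whose OCR output falls below the threshold.
        needs_review = len(ocr_text.strip()) < min_chars
        results.append(PageResult(index + 1, ocr_text, True, needs_review))
    return results
```

Passing the OCR step in as a callable keeps the fallback logic testable without Tesseract installed, which may also help with the integration test item.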
Security Note
OCR must process in RAM only (BytesIO). No unencrypted temp files — this is a hard requirement
for ag-gag jurisdictions where device seizure is a real threat.
References
AutomatedFOIA/backroom.py — reference implementation (has Windows-specific paths to clean up)
src/documents/ingester.py — current implementation
docs/security.md §Encrypted Storage