Implement OCR pipeline from AutomatedFOIA for scanned document intake #1

@stuckvgn

Description

Context

The current ingester in src/documents/ingester.py has an OCR fallback path that calls
_extract_text_ocr(). That path only works when the optional Python extras (pdf2image,
pytesseract) and the system packages (tesseract-ocr, poppler-utils) are installed.
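
A minimal sketch of a startup check that reports which of these dependencies are present, so the ingester can skip OCR or fail with a clear message instead of an import error deep inside the fallback path. The function names are hypothetical and not part of ingester.py. It assumes the system packages expose their usual `tesseract` and `pdftoppm` binaries on PATH (pdftoppm is the poppler-utils tool that pdf2image invokes).

```python
import importlib.util
import shutil

# Optional Python extras and the executables the system packages provide.
PYTHON_EXTRAS = ("pdf2image", "pytesseract")
SYSTEM_BINARIES = ("tesseract", "pdftoppm")


def check_ocr_dependencies():
    """Return a mapping of dependency name -> True if it can be found."""
    status = {}
    for module in PYTHON_EXTRAS:
        # find_spec locates the module without importing it.
        status[module] = importlib.util.find_spec(module) is not None
    for binary in SYSTEM_BINARIES:
        # shutil.which searches PATH, on Linux, macOS, and Windows alike.
        status[binary] = shutil.which(binary) is not None
    return status


def missing_ocr_dependencies():
    """Return the sorted names of dependencies that were not found."""
    return sorted(name for name, ok in check_ocr_dependencies().items() if not ok)
```

Calling `missing_ocr_dependencies()` before entering the OCR path gives a list that can be logged or surfaced to the user along with the platform-specific install steps.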

Work Needed

  • Document exact system dependency installation steps for Linux, macOS, and Windows
  • Test the OCR path against representative scanned USDA APHIS inspection PDFs
  • Verify the in-memory decryption → OCR pipeline (BytesIO, never writes temp files)
  • Add per-page OCR fallback: only fall back on pages where native text is empty, not the whole doc
  • Add quality check: if OCR yields < N chars on a page, flag for manual review
  • Wire up to src/api/server.py document ingest endpoint
  • Add integration test with a sample scanned PDF (use synthetic/non-sensitive test fixture)
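
The per-page fallback and quality-check items could be sketched roughly as follows. This is an illustration under assumptions, not the ingester's actual design: `PageResult`, `extract_pages`, and the `ocr_page` callback are hypothetical names, and the default `min_chars=50` is only a placeholder for the unspecified threshold N. The OCR step is passed in as a callable so the logic can be tested without tesseract installed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageResult:
    page_number: int    # 1-based page number
    text: str
    used_ocr: bool      # True if the text came from the OCR fallback
    needs_review: bool  # True if OCR produced fewer than min_chars characters


def extract_pages(
    native_texts: List[str],
    ocr_page: Callable[[int], str],
    min_chars: int = 50,
) -> List[PageResult]:
    """Use native text where present; OCR only the pages that have none."""
    results = []
    for index, native in enumerate(native_texts):
        if native.strip():
            # Native text exists, so this page never touches OCR.
            results.append(PageResult(index + 1, native, False, False))
            continue
        # ocr_page receives the 0-based page index.
        ocr_text = ocr_page(index)
        # Flag pages where OCR produced too little text to trust.
        needs_review = len(ocr_text.strip()) < min_chars
        results.append(PageResult(index + 1, ocr_text, True, needs_review))
    return results
```

In the integration test, `ocr_page` can be a stub that returns canned strings, which keeps the fallback and flagging logic under test separate from OCR accuracy.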

Security Note

OCR must process in RAM only (BytesIO). No unencrypted temp files — this is a hard requirement
for ag-gag jurisdictions where device seizure is a real threat.
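
One possible way to check this requirement in tests is to record every file the Python process opens for writing while the OCR path runs. The sketch below uses the standard `sys.addaudithook` mechanism (Python 3.8+); `record_file_writes` is a hypothetical helper name. It only observes opens made by the Python interpreter itself, so it cannot see files written by external binaries such as tesseract or pdftoppm, and wrappers that shell out to such binaries may stage data on disk. Treat it as one guard among several, not as proof.

```python
import os
import sys
from contextlib import contextmanager

# Open flags that indicate a file may be created or written.
_WRITE_FLAGS = os.O_WRONLY | os.O_RDWR | os.O_CREAT

_state = {"active": False, "writes": []}


def _audit(event, args):
    # The "open" audit event carries (path, mode, flags).
    if event == "open" and _state["active"]:
        path, _mode, flags = args
        if isinstance(flags, int) and flags & _WRITE_FLAGS:
            _state["writes"].append(str(path))


# Audit hooks cannot be removed, so recording is gated by a flag instead.
sys.addaudithook(_audit)


@contextmanager
def record_file_writes():
    """Yield a list that collects paths opened for writing inside the block."""
    _state["writes"] = []
    _state["active"] = True
    try:
        yield _state["writes"]
    finally:
        _state["active"] = False
```

A test would wrap the decrypt-then-OCR call in `with record_file_writes() as writes:` and assert that `writes` is empty afterwards.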

References

  • AutomatedFOIA/backroom.py — reference implementation (has Windows-specific paths to clean up)
  • src/documents/ingester.py — current implementation
  • docs/security.md §Encrypted Storage

Labels

enhancement (New feature or request)
