Implement OCR pipeline from AutomatedFOIA for scanned document intake

## Context

The ingester in `src/documents/ingester.py` already has an OCR fallback path that calls
`_extract_text_ocr()`. That path depends on the optional Python extras `pdf2image` and
`pytesseract`, plus the system packages `tesseract-ocr` and `poppler-utils`.

## Work Needed

- Document exact system dependency installation steps for Linux, macOS, and Windows
- Test the OCR path against representative scanned USDA APHIS inspection PDFs
- Verify that the decryption → OCR pipeline stays in memory end to end (`BytesIO` only, no temp files written)
- Add per-page OCR fallback: run OCR only on pages whose native text is empty, not on the whole document
- Add a quality check: if OCR yields fewer than N characters on a page, flag that page for manual review
- Wire the OCR path into the document ingest endpoint in `src/api/server.py`
- Add an integration test with a sample scanned PDF (use a synthetic, non-sensitive test fixture)
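
The per-page fallback and quality-check items above could be sketched roughly as follows. This is a minimal illustration, not the ingester's actual code: `PageText`, `extract_with_fallback`, `MIN_CHARS_PER_PAGE`, and the `ocr_page` callable are all hypothetical names, and the threshold value stands in for the unspecified N. The OCR step is passed in as a callable so the selection logic can be tested without tesseract installed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical placeholder for the issue's unspecified threshold "N".
MIN_CHARS_PER_PAGE = 40


@dataclass
class PageText:
    page_number: int    # 1-based page number
    text: str           # native text, or OCR text if the fallback ran
    used_ocr: bool      # True if this page went through OCR
    needs_review: bool  # True if OCR output was below the threshold


def extract_with_fallback(
    native_pages: Sequence[str],
    ocr_page: Callable[[int], str],
    min_chars: int = MIN_CHARS_PER_PAGE,
) -> List[PageText]:
    """Run OCR only on pages whose native text is empty.

    native_pages: text already extracted per page (empty string if none).
    ocr_page: hypothetical callable taking a 0-based page index and
              returning the OCR text for that page.
    """
    results: List[PageText] = []
    for index, native in enumerate(native_pages):
        if native.strip():
            # Native text exists: keep it and skip OCR for this page.
            results.append(PageText(index + 1, native, False, False))
            continue
        ocr_text = ocr_page(index)
        # Flag pages where OCR produced too little text to trust.
        needs_review = len(ocr_text.strip()) < min_chars
        results.append(PageText(index + 1, ocr_text, True, needs_review))
    return results
```

In the real pipeline, `ocr_page` would presumably wrap whatever `_extract_text_ocr()` does for a single page. Treating whitespace-only native text as empty is a choice made for this sketch, not something the issue specifies.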

## Security Note

OCR must process documents in RAM only (`BytesIO`). No unencrypted temp files may be written.
This is a hard requirement for ag-gag jurisdictions, where device seizure is a real threat.

## References

- `AutomatedFOIA/backroom.py` — reference implementation (has Windows-specific paths to clean up)
- `src/documents/ingester.py` — current implementation
- `docs/security.md` §Encrypted Storage
