Feature Proposal: Add VLM (Vision Language Model) as an Optional OCR Engine #14
## Core Concept
Coexistence, not replacement.
Offer multiple choices for different user preferences:
- Privacy-first users: continue using local OCR (RapidOCR), where all data stays on-device.
- Lightweight or accuracy-focused users: opt in to a cloud-based VLM OCR API, reducing local resource load and improving recognition quality.
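The coexistence model above could be expressed as a pluggable engine interface. The following is a minimal sketch; all class and function names are illustrative assumptions, not taken from the actual codebase:

```python
from abc import ABC, abstractmethod


class OCREngine(ABC):
    """Common interface so local and cloud engines are interchangeable."""

    @abstractmethod
    def recognize(self, image_bytes: bytes) -> str:
        """Return the text recognized in the image."""


class LocalRapidOCREngine(OCREngine):
    """Privacy-first path: a real implementation would wrap RapidOCR;
    nothing leaves the machine."""

    def recognize(self, image_bytes: bytes) -> str:
        return "<local OCR result>"  # placeholder for a RapidOCR call


class CloudVLMEngine(OCREngine):
    """Opt-in path: uploads the image to a user-configured VLM endpoint."""

    def __init__(self, endpoint: str, api_key: str):
        self.endpoint = endpoint
        self.api_key = api_key

    def recognize(self, image_bytes: bytes) -> str:
        # A real implementation would POST image_bytes to self.endpoint.
        raise NotImplementedError("network call omitted in this sketch")


def make_engine(prefer_cloud: bool, **cloud_kwargs) -> OCREngine:
    """Factory: the user's setting decides which engine is constructed."""
    if prefer_cloud:
        return CloudVLMEngine(**cloud_kwargs)
    return LocalRapidOCREngine()
```

With this shape, the rest of the application calls `recognize()` without caring which backend the user selected.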
## Current Limitations of Local OCR
- Memory usage: the resident model consumes 200–500 MB of RAM.
- CPU usage: local inference consumes CPU resources.
- Startup delay: model loading slows down initialization.
## Advantages of the VLM-based OCR Option
1. Resource Optimization
- Zero memory footprint: no local model required.
- Zero CPU consumption: inference is handled entirely in the cloud.
- Faster startup: no model-initialization delay.
2. Feature Enhancements
- Higher accuracy in complex or noisy scenarios.
- Contextual understanding of image content and layout.
- Structured extraction (tables, lists, key-value pairs).
- Improved multilingual support, especially for mixed-language content.
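To make the structured-extraction point concrete, a request to a VLM could pair the image with a prompt asking for Markdown-preserved structure. This sketch assumes an OpenAI-style chat payload with a base64 data URL; the payload shape, prompt wording, and model name are illustrative assumptions, not a committed design:

```python
import base64


def build_vlm_ocr_request(image_bytes: bytes, model: str = "some-vision-model") -> dict:
    """Assemble a chat-style vision request asking for structured OCR output.

    The payload is only built here, never sent; a real engine would POST it
    to the user-configured endpoint.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Extract all text from this image. Preserve tables, "
                            "lists, and key-value pairs as Markdown."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }
```

Because the instruction travels with the image, the model can return layout-aware output (e.g. a Markdown table) rather than a flat character stream.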
## Privacy and User Control
- Local OCR: full data privacy; images and text are processed entirely offline.
- Cloud-based VLM OCR: opt-in feature; users are clearly informed before any data upload.
This dual approach respects user choice, letting users decide between privacy-first and performance-first workflows without compromise.
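The opt-in guarantee can be enforced at a single routing point: the cloud path runs only when both the feature flag and explicit upload consent are set. A minimal sketch with hypothetical setting keys and stand-in engine functions:

```python
def local_ocr(image_bytes: bytes) -> str:
    return "local"  # stand-in for the RapidOCR path


def cloud_ocr(image_bytes: bytes) -> str:
    return "cloud"  # stand-in for the VLM API path


def ocr_with_consent(image_bytes: bytes, settings: dict) -> str:
    """Route to cloud OCR only when the user has explicitly opted in.

    Missing or false flags always fall back to the offline path, so the
    privacy-preserving behavior is the default.
    """
    if settings.get("cloud_ocr_enabled") and settings.get("upload_consent_given"):
        return cloud_ocr(image_bytes)  # image leaves the device
    return local_ocr(image_bytes)  # default: everything stays on-device
```

Keeping the check in one place makes the privacy promise auditable: no image can reach the network without passing through this gate.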