Multimodal-OCR3

Multimodal-OCR3 is an experimental optical character recognition and visual processing suite designed for precise text extraction, document parsing, and markdown generation. It leverages a selection of vision-language and causal language models, including Nanonets-OCR2-3B, Chandra-OCR, Dots.OCR, and olmOCR-2-7B, and specializes in deciphering complex document layouts, dense text, and real-world scene imagery. The tool features a customized, interactive web interface that lets users upload screenshots, receipts, and multi-page documents for rapid analysis. With built-in support for GPU-accelerated inference and granular control over generation parameters, Multimodal-OCR3 gives researchers and developers a streamlined environment for building, testing, and deploying document intelligence and multimodal workflows.

(Screenshot: Multimodal OCR3, a Hugging Face Space by prithivMLmods)

Key Features

  • Multi-Model Architecture: Seamlessly switch between specialized models directly from the interface. Supported models include Nanonets-OCR2-3B, Chandra-OCR, Dots.OCR, and olmOCR-2-7B-1025.
  • Custom User Interface: Features a bespoke, responsive Gradio frontend built with custom HTML, CSS, and JavaScript. It includes a drag-and-drop media zone, real-time output streaming, and an integrated advanced settings panel.
  • Granular Inference Controls: Fine-tune the AI's output by adjusting text generation parameters such as Maximum New Tokens, Temperature, Top-p, Top-k, and Repetition Penalty.
  • Output Management: Built-in actions allow users to instantly copy the raw output text to their clipboard or save the generated response directly as a .txt file.
  • Flash Attention 2 Integration: Utilizes kernels-community/flash-attn2 for optimized, memory-efficient inference on compatible GPUs.
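To make the inference controls above concrete, here is a rough illustration (plain Python, not the app's actual code) of what Top-k filtering and Temperature do to a model's next-token distribution: Top-k discards all but the k highest-scoring tokens, and Temperature rescales the remaining logits before the softmax.

```python
import math

def top_k_filter(logits, k):
    """Keep only the k highest logits; mask the rest to -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [l if l >= threshold else float("-inf") for l in logits]

def softmax(logits, temperature=1.0):
    """Convert (possibly masked) logits into a probability distribution."""
    scaled = [l / temperature for l in logits]
    m = max(s for s in scaled if s != float("-inf"))
    exps = [math.exp(s - m) if s != float("-inf") else 0.0 for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Four hypothetical token logits; keep the top 2, then sharpen with T < 1.
logits = [2.0, 1.0, 0.5, -1.0]
filtered = top_k_filter(logits, k=2)
probs = softmax(filtered, temperature=0.7)
```

Lower temperatures push probability mass toward the top token (more deterministic output); higher values flatten the distribution. In the app itself these values are simply passed through to the model's generation call.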

Repository Structure

├── examples/
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── 3.jpg
│   └── 4.jpg
├── app.py
├── LICENSE
├── pre-requirements.txt
├── README.md
└── requirements.txt

Installation and Requirements

To run Multimodal-OCR3 locally, you need to configure a Python environment with the following dependencies. Ensure you have a compatible CUDA-enabled GPU for optimal performance.

1. Install Pre-requirements. Run the following command to upgrade pip to the required version:

pip install "pip>=23.0.0"

2. Install Core Requirements. Install the necessary machine learning and UI libraries. Place the following in a requirements.txt file and run pip install -r requirements.txt.

git+https://github.com/huggingface/transformers.git@v4.57.6
git+https://github.com/huggingface/accelerate.git
git+https://github.com/huggingface/peft.git
transformers-stream-generator
huggingface_hub
qwen-vl-utils
sentencepiece
opencv-python
torch==2.8.0
torchvision
matplotlib
requests
kernels
hf_xet
spaces
pillow
gradio
av

Usage

Once your environment is set up and the dependencies are installed, you can launch the application by running the main Python script:

python app.py

After the script initializes the interface, it will provide a local web address (usually http://127.0.0.1:7860/) which you can open in your browser to interact with the models. Note that the selected models will be downloaded and loaded into VRAM upon their first invocation.
