A Gradio-based demonstration for the zai-org/GLM-OCR multimodal OCR model. Supports text, formula, and table recognition from uploaded images, with outputs in plain text and markdown formats. Features custom HotPink theme, GPU acceleration, image orientation handling (EXIF transpose), and temporary file management for processing.
- Recognition Types: Text Recognition, Formula Recognition, Table Recognition with predefined prompts.
- Image Handling: Supports upload/clipboard sources; auto-converts RGBA/LA/P modes to RGB; handles EXIF orientation.
- Outputs: Dual tabs for plain text and markdown rendering.
- Custom Theme: HotPinkTheme with responsive, animated styling via CSS.
- GPU Inference: Uses spaces.GPU decorator for efficient processing.
- Examples: 5 curated images for quick testing.
- Queueing: Up to 50 concurrent jobs.
- Python 3.10 or higher.
- CUDA-compatible GPU (recommended for bfloat16 inference).
- Stable internet for initial model downloads from Hugging Face.
- Clone the repository:
git clone https://github.com/PRITHIVSAKTHIUR/GLM-OCR-Demo.git cd GLM-OCR-Demo - Install dependencies:
First, install pre-requirements:
Then, install main requirements:
pip install -r pre-requirements.txtpre-requirements.txt content:pip install -r requirements.txtrequirements.txt content:pip>=23.0.0flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl #skippable git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/peft.git huggingface_hub sentencepiece opencv-python torch==2.6.0 torchvision matplotlib markdown requests hf_xet spaces pillow gradio #@gradio6 av - Start the application:
The demo launches at
python app.pyhttp://localhost:7860.
- Upload Image: Add an image via upload or clipboard.
- Select Task: Choose Text, Formula, or Table recognition.
- Recognize: Click "Recognize" to process.
- View Outputs: Check results in Text or Markdown tabs.
| Task | Prompt |
|---|---|
| Text | "Text Recognition:" |
| Formula | "Formula Recognition:" |
| Table | "Table Recognition:" |
| Input Image | Task |
|---|---|
| examples/1.jpg | Text |
| examples/4.jpg | Text |
| examples/5.webp | Text |
| examples/2.jpg | Formula |
| examples/3.jpg | Table |
- Model Loading: First run downloads GLM-OCR; monitor console.
- Image Errors: Ensure valid RGB images; check console for processing issues.
- OOM: Use smaller images or reduce max_new_tokens (default 8192).
- No Output: Upload image first; select task.
- Flash Attention: Requires compatible CUDA; fallback if fails.
Contributions welcome! Add new tasks to TASK_PROMPTS, enhance CSS, or improve processing. Submit pull requests via the repository.
Repository: https://github.com/PRITHIVSAKTHIUR/GLM-OCR-Demo.git
Apache License 2.0. See LICENSE for details. Built by Prithiv Sakthi. Report issues via the repository.