Architecture

Tmob edited this page Jan 28, 2026 · 2 revisions

Model Architecture

Kiri OCR uses a hybrid architecture that combines the spatial feature extraction of Convolutional Neural Networks (CNNs) with the sequence modeling power of Transformers. This design achieves high accuracy on both English and Khmer text, and handles complex scripts and variable-length sequences robustly.

High-Level Pipeline

The recognition model processes an input image in three main stages:

  1. Visual Feature Extraction (CNN): Converts the input image ($H \times W$) into a sequence of high-level feature vectors.
  2. Sequence Modeling (Transformer): Captures global context and long-range dependencies between characters.
  3. Decoding (Hybrid): Two parallel heads decode the contextualized features into text:
    • CTC Head: Connectionist Temporal Classification for alignment-free decoding.
    • Attention Head: Autoregressive sequence generation (Seq2Seq).
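
The three stages above can be traced as a simple shape sketch (function and parameter names here are illustrative, not the actual Kiri OCR API):

```python
# Hypothetical shape trace of the three-stage pipeline.
def pipeline_shapes(height=48, width=320, channels=256, vocab=120):
    # 1. CNN backbone: (1, H, W) -> (C, 1, W') with W' = W / 8
    w_prime = width // 8
    cnn_out = (channels, 1, w_prime)
    # 2. Transformer encoder: flatten to a length-W' sequence of C-dim vectors
    encoder_out = (w_prime, channels)
    # 3. Decoding heads project each timestep to the character vocabulary
    ctc_logits = (w_prime, vocab)
    return cnn_out, encoder_out, ctc_logits

print(pipeline_shapes())  # ((256, 1, 40), (40, 256), (40, 120))
```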

Detailed Components

1. CNN Backbone

The backbone is responsible for extracting visual features from the raw image pixels. It typically uses a custom ResNet-like architecture tailored for OCR tasks.

  • Input: Grayscale image tensor ($1 \times 48 \times W$).
  • Layers: Multiple blocks of Conv2d, BatchNorm, ReLU, and MaxPool.
  • Downsampling: The network collapses the height dimension to 1 while downsampling the width only moderately (to $W/8$), effectively converting the 2D image into a 1D sequence of visual features.
  • Output: Feature sequence of shape ($C \times 1 \times W'$), where $W' = W / 8$.
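
A minimal backbone in this style can be sketched as follows. This is illustrative only (Kiri OCR's exact layer configuration may differ): three stride-2 stages give $W' = W/8$, and a final pool collapses the remaining height to 1.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),            # halves both H and W
            )
        self.stages = nn.Sequential(
            block(1, 64), block(64, 128), block(128, channels),
        )
        self.squeeze = nn.AdaptiveAvgPool2d((1, None))  # H -> 1, keep W'

    def forward(self, x):                    # x: (B, 1, 48, W)
        f = self.squeeze(self.stages(x))     # (B, C, 1, W/8)
        return f.squeeze(2).transpose(1, 2)  # (B, W/8, C) sequence

feats = Backbone()(torch.zeros(2, 1, 48, 320))
print(feats.shape)  # torch.Size([2, 40, 256])
```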

2. Transformer Encoder

The feature sequence from the CNN is flattened and fed into a standard Transformer Encoder.

  • Positional Encoding: Since Transformers are permutation-invariant, a 2D positional encoding is added to the feature sequence to retain spatial information.
  • Self-Attention: The encoder layers use Multi-Head Self-Attention mechanisms to understand the relationship between different parts of the image. This is crucial for:
    • Khmer Script: Handling vowels that may be placed before, after, above, or below the consonant they modify.
    • Context: Disambiguating similar-looking characters based on surrounding text.
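
The encoder stage can be sketched with a standard sinusoidal positional encoding added to the CNN feature sequence (hyperparameters are illustrative; Kiri OCR's actual encoding is 2D as noted above, while this sketch uses the simpler 1D form for the flattened sequence):

```python
import math
import torch
import torch.nn as nn

def positional_encoding(length, dim):
    # Classic sinusoidal encoding: sin on even dims, cos on odd dims.
    pos = torch.arange(length).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

seq = torch.zeros(2, 40, 256)                      # (B, W', C) from the CNN
out = encoder(seq + positional_encoding(40, 256))  # same shape, contextualized
print(out.shape)  # torch.Size([2, 40, 256])
```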

3. Hybrid Decoder & Training

Kiri OCR employs a multi-task learning approach to stabilize training and improve convergence speed.

A. CTC Loss (Connectionist Temporal Classification)

The CTC head projects the encoder output directly to the character vocabulary size.

  • Pros: Fast inference, monotonic alignment.
  • Cons: Assumes conditional independence between characters (can struggle with language dependencies).
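
The CTC head can be sketched with PyTorch's built-in loss. Sizes here are illustrative; index 0 is reserved for the CTC blank symbol:

```python
import torch
import torch.nn as nn

vocab = 120                                  # character classes (+ blank at 0)
head = nn.Linear(256, vocab + 1)             # project encoder dim -> vocab+1
features = torch.randn(2, 40, 256)           # (B, W', C) encoder output

# CTCLoss expects (T, B, V) log-probabilities.
log_probs = head(features).log_softmax(-1).transpose(0, 1)
targets = torch.randint(1, vocab + 1, (2, 10))   # label indices, no blanks

loss = nn.CTCLoss(blank=0)(
    log_probs,
    targets,
    input_lengths=torch.full((2,), 40),
    target_lengths=torch.full((2,), 10),
)
print(float(loss) > 0)  # True
```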

B. Attention Decoder (Cross-Entropy Loss)

A standard Transformer Decoder generates characters one by one, attending to the encoder output at each step.

  • Pros: Models strong language dependencies (like a language model), handles complex reordering.
  • Cons: Slower autoregressive inference.
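
A sketch of this decoder head: character embeddings attend to the encoder memory under a causal mask, so each position only sees previously generated characters. Names and sizes are illustrative:

```python
import torch
import torch.nn as nn

vocab, d = 120, 256
embed = nn.Embedding(vocab, d)
layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
project = nn.Linear(d, vocab)

memory = torch.zeros(2, 40, d)               # encoder output (B, W', C)
prev = torch.randint(0, vocab, (2, 10))      # previously generated characters

# Causal mask: -inf above the diagonal blocks attention to future positions.
mask = torch.triu(torch.full((10, 10), float("-inf")), diagonal=1)
logits = project(decoder(embed(prev), memory, tgt_mask=mask))
print(logits.shape)  # torch.Size([2, 10, 120])
```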

Joint Optimization

During training, the model optimizes a weighted sum of both losses: $$ L = \lambda_{CTC} \cdot L_{CTC} + \lambda_{Attn} \cdot L_{Attn} $$ By default, $\lambda_{CTC} = 0.5$ and $\lambda_{Attn} = 0.5$. The CTC loss helps the encoder learn alignment quickly, while the Attention loss refines the sequence modeling.
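
In code, the weighted objective above is a one-liner (the defaults match the lambdas stated):

```python
# Weighted multi-task objective: L = lam_ctc * L_ctc + lam_attn * L_attn.
def joint_loss(l_ctc, l_attn, lam_ctc=0.5, lam_attn=0.5):
    return lam_ctc * l_ctc + lam_attn * l_attn

print(joint_loss(2.0, 1.0))  # 1.5
```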

4. Inference Strategies

At inference time, you can choose between different decoding strategies:

  • Greedy CTC: Simply picks the most likely character at each timestep. Very fast.
  • Beam Search: Explores multiple possible sequences to find the one with the highest total probability. Enabled by default for best accuracy.
  • Streaming: Supports character-by-character output for real-time feedback (like LLM streaming).
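
The greedy CTC rule can be illustrated in a few lines: take the argmax at each timestep, collapse consecutive repeats, and drop blanks (index 0):

```python
# Greedy CTC decode sketch (pure Python, illustrative).
def greedy_ctc(timestep_argmax, blank=0):
    out, prev = [], blank
    for idx in timestep_argmax:
        if idx != blank and idx != prev:   # drop blanks, collapse repeats
            out.append(idx)
        prev = idx
    return out

# A blank between two 5s means they decode as two separate characters.
print(greedy_ctc([0, 5, 5, 0, 5, 7, 7, 0]))  # [5, 5, 7]
```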

Text Detection Architecture

For detecting text regions in full documents, Kiri OCR supports two backends:

  1. DB (Differentiable Binarization):

    • Predicts a segmentation map and an adaptive threshold map.
    • Robust to curved text and complex layouts.
    • Faster than regression-based methods.
  2. CRAFT (Character Region Awareness for Text Detection):

    • Predicts character regions and affinity scores to link them.
    • Extremely precise for character-level localization.
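
The core idea of DB's adaptive threshold can be shown with its approximate-step formula from the DB paper, $\hat{B} = 1 / (1 + e^{-k(P - T)})$, which makes binarization differentiable during training:

```python
import math

# DB's differentiable binarization: a steep sigmoid of (probability - threshold).
# k is the amplification factor (the paper uses k = 50).
def db_binarize(p, t, k=50):
    return 1.0 / (1.0 + math.exp(-k * (p - t)))

print(db_binarize(0.9, 0.3) > 0.99)   # well above threshold -> ~1 (text)
print(db_binarize(0.1, 0.5) < 0.01)   # well below threshold -> ~0 (background)
```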