Data Generation

To train a robust recognition model, you need a large and diverse dataset. Kiri OCR includes a synthetic data generator that creates realistic images of text lines from text files and fonts. This tool uses PIL (Python Imaging Library) to render text with various augmentations to simulate real-world conditions.

Prerequisites

Text Source: A .txt file containing the text content you want to generate images for. Each line in the file will become one sample image.
Fonts: A directory containing .ttf or .otf font files. The generator will randomly select a font for each image.
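
Before running the generator, it can help to sanity-check both inputs. The sketch below is not part of Kiri OCR: `check_inputs` is a hypothetical helper, and it assumes that blank lines in the text file are not meant to become samples.

```python
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf"}

def check_inputs(text_file, fonts_dir):
    """Return (non-empty line count, sorted font file names). Hypothetical helper."""
    lines = Path(text_file).read_text(encoding="utf-8").splitlines()
    # Assumption: blank lines are not meaningful samples, so they are not counted.
    samples = [line for line in lines if line.strip()]
    fonts = sorted(
        p.name for p in Path(fonts_dir).iterdir()
        if p.is_file() and p.suffix.lower() in FONT_SUFFIXES
    )
    if not samples:
        raise ValueError(f"{text_file} contains no text lines")
    if not fonts:
        raise ValueError(f"{fonts_dir} contains no .ttf or .otf files")
    return len(samples), fonts
```

Calling `check_inputs("corpus.txt", "my_fonts/")` either raises a clear error or returns the number of usable lines and the fonts that would be available for selection.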

Generating Data

Use the generate command:

kiri-ocr generate \
    --train-file corpus.txt \
    --fonts-dir my_fonts/ \
    --output generated_data/ \
    --height 48 \
    --width 640

Command Arguments

Argument	Description	Default
`--train-file`, `-t`	Path to input text file. (Required)	-
`--val-file`, `-v`	Path to validation text file. If not provided, no validation set is generated.	`None`
`--output`, `-o`	Output directory.	`data`
`--fonts-dir`	Directory containing fonts.	`fonts`
`--height`	Output image height.	`32`
`--width`	Output image width.	`512`
`--augment`, `-a`	Number of variations to generate for each line of text.	`1`
`--random-augment`	Apply random noise, rotation, and blur.	`False`
`--language`, `-l`	Rendering mode: `english`, `khmer`, or `mixed`. Adjusts font handling.	`mixed`
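
When the generator is driven from a script, the arguments above can be assembled programmatically. This is a minimal sketch using only the flags and defaults listed in the table; `build_generate_command` is a hypothetical helper, and it assumes `--random-augment` is a boolean switch that takes no value.

```python
import shlex

def build_generate_command(train_file, output="data", fonts_dir="fonts",
                           height=32, width=512, augment=1,
                           random_augment=False, val_file=None,
                           language="mixed"):
    """Assemble the argv list for `kiri-ocr generate` (hypothetical helper)."""
    cmd = ["kiri-ocr", "generate",
           "--train-file", str(train_file),
           "--output", str(output),
           "--fonts-dir", str(fonts_dir),
           "--height", str(height),
           "--width", str(width),
           "--augment", str(augment),
           "--language", language]
    if val_file is not None:
        cmd += ["--val-file", str(val_file)]
    if random_augment:
        # Assumed to be a switch without a value.
        cmd.append("--random-augment")
    return cmd

cmd = build_generate_command("corpus.txt", output="generated_data",
                             fonts_dir="my_fonts", height=48, width=640,
                             augment=3, random_augment=True)
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste
```

The list can be passed to `subprocess.run(cmd, check=True)` to launch the generator.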

Augmentation Details

When --random-augment is enabled, the generator applies a random combination of the following effects to each image:

Rotation: Slight rotation (+/- 2 degrees) to simulate skew.
Noise: Gaussian noise to simulate camera sensor grain.
Blur: Gaussian blur to simulate out-of-focus images.
Distortion: Elastic distortion to simulate paper warping.
Background: Random background color variations (shades of white/gray).
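
To make three of these effects concrete, here is a rough sketch using Pillow. It is an illustration only and may differ from what Kiri OCR does internally: `render_line` and `random_augment` are hypothetical helpers, the 50% probabilities, noise sigma, and blur radius are assumed values, and elastic distortion and background variation are left out.

```python
import random
from PIL import Image, ImageChops, ImageDraw, ImageFilter, ImageFont

def render_line(text, height=48, width=640):
    """Render one text line in black on a white grayscale canvas."""
    img = Image.new("L", (width, height), color=255)
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # stand-in for a font chosen from --fonts-dir
    draw.text((8, height // 3), text, fill=0, font=font)
    return img

def random_augment(img, rng):
    """Apply a random subset of rotation, noise, and blur (assumed parameters)."""
    if rng.random() < 0.5:
        # Slight skew within +/- 2 degrees; exposed corners are filled with white.
        angle = rng.uniform(-2.0, 2.0)
        img = img.rotate(angle, resample=Image.BICUBIC, fillcolor=255)
    if rng.random() < 0.5:
        # effect_noise yields Gaussian noise centered on 128; offset recenters it on 0.
        noise = Image.effect_noise(img.size, 10)
        img = ImageChops.add(img, noise, scale=1, offset=-128)
    if rng.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.3, 1.0)))
    return img

augmented = random_augment(render_line("Hello World"), random.Random(0))
```

Each effect is applied independently, so a single image may receive none, some, or all of them, and the output keeps the size of the input.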

Output Structure

The command creates a directory structure ready for training:

generated_data/
├── train/
│   ├── images/
│   │   ├── 00001_0.jpg  # Original line 1, augmentation 0
│   │   ├── 00001_1.jpg  # Original line 1, augmentation 1
│   │   └── ...
│   └── labels.txt       # Mapping file
└── val/                 # Optional
    ├── images/
    └── labels.txt

Format of labels.txt:

train/images/00001_0.jpg	Hello World
train/images/00001_1.jpg	Hello World
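
A labels file in this shape can be read back with a few lines of Python. The sketch below assumes the separator is a single tab, as in the example above, and that file names follow the `<line number>_<augmentation index>.jpg` pattern shown in the directory tree; both function names are hypothetical.

```python
from pathlib import PurePosixPath

def parse_label_line(line):
    """Split one labels.txt line into (image_path, text)."""
    # Split on the first tab only, so any tab inside the text is preserved.
    path, text = line.rstrip("\n").split("\t", 1)
    return path, text

def parse_image_name(path):
    """Return (line_number, augmentation_index) from a name like 00001_0.jpg."""
    line_no, aug = PurePosixPath(path).stem.split("_")
    return int(line_no), int(aug)

sample = "train/images/00001_1.jpg\tHello World\n"
path, text = parse_label_line(sample)
print(path, parse_image_name(path), repr(text))
```

Grouping entries by the parsed line number is one way to keep all augmentations of the same source line in the same split.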

Tips for High-Quality Data

Diverse Fonts: Use as many different fonts as possible. For Khmer, ensure you have fonts that handle subscripts and vowels correctly (e.g., Khmer OS, Battambang).
Realism: Always use --random-augment for training data. Clean data is good for initial tests, but models trained on it often fail on real scanned documents.
Content Balance: Ensure your corpus.txt covers the vocabulary, special characters, and numbers expected in your target documents.
Image Size: Match --height and --width to the settings you plan to use for training (default 48x640 for Kiri OCR). The generate command itself defaults to 32x512, so pass both flags explicitly.
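
The content-balance tip can be checked mechanically before generating anything. The following is a minimal sketch, not a Kiri OCR feature: `char_coverage` is a hypothetical helper that counts characters in the corpus and reports which characters from a required set never appear.

```python
from collections import Counter

def char_coverage(lines, required):
    """Count non-space characters and list required ones that never appear."""
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    missing = sorted(set(required) - set(counts))
    return counts, missing

corpus = ["Hello World", "Invoice 2024", "សួស្តី"]
counts, missing = char_coverage(corpus, required="0123456789")
print("missing digits:", "".join(missing))
print("rarest characters:", counts.most_common()[-5:])
```

In practice, `lines` would come from reading corpus.txt, and `required` would list the digits, punctuation, and script-specific characters expected in the target documents.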
