Data Generation
Tmob edited this page Jan 28, 2026 · 2 revisions
To train a robust recognition model, you need a large and diverse dataset. Kiri OCR includes a synthetic data generator that creates realistic images of text lines from text files and fonts. This tool uses PIL (Python Imaging Library) to render text with various augmentations to simulate real-world conditions.
- Text Source: A `.txt` file containing the text content you want to generate images for. Each line in the file will become one sample image.
- Fonts: A directory containing `.ttf` or `.otf` font files. The generator will randomly select a font for each image.
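Before running the generator, it can help to confirm that both inputs look as expected. The following is a minimal sketch using only the Python standard library. `count_samples` and `find_fonts` are hypothetical helper names, not part of Kiri OCR, and the idea that blank lines produce no sample is an unverified assumption.

```python
from pathlib import Path

# Font extensions named in the input requirements.
FONT_SUFFIXES = {".ttf", ".otf"}


def count_samples(corpus_text: str) -> int:
    """Count non-blank lines; each is expected to become one sample image.

    Assumption: the generator skips blank lines.
    """
    return sum(1 for line in corpus_text.splitlines() if line.strip())


def find_fonts(fonts_dir: Path) -> list[Path]:
    """Return the .ttf/.otf files directly inside fonts_dir, sorted by path."""
    return sorted(
        p for p in fonts_dir.iterdir()
        if p.is_file() and p.suffix.lower() in FONT_SUFFIXES
    )
```

For example, `count_samples(Path("corpus.txt").read_text(encoding="utf-8"))` gives a rough estimate of how many base images to expect, and an empty result from `find_fonts(Path("my_fonts"))` signals a wrong fonts directory early.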
Use the `generate` command:

    kiri-ocr generate \
      --train-file corpus.txt \
      --fonts-dir my_fonts/ \
      --output generated_data/ \
      --height 48 \
      --width 640

| Argument | Description | Default |
|---|---|---|
| `--train-file`, `-t` | Path to input text file. (Required) | - |
| `--val-file`, `-v` | Path to validation text file. If not provided, no validation set is generated. | None |
| `--output`, `-o` | Output directory. | data |
| `--fonts-dir` | Directory containing fonts. | fonts |
| `--height` | Output image height. | 32 |
| `--width` | Output image width. | 512 |
| `--augment`, `-a` | Number of variations to generate for each line of text. | 1 |
| `--random-augment` | Apply random noise, rotation, and blur. | False |
| `--language`, `-l` | Rendering mode: english, khmer, or mixed. Adjusts font handling. | mixed |

When --random-augment is enabled, the generator applies a random combination of the following effects to each image:
- Rotation: Slight rotation (+/- 2 degrees) to simulate skew.
- Noise: Gaussian noise to simulate camera sensor grain.
- Blur: Gaussian blur to simulate out-of-focus images.
- Distortion: Elastic distortion to simulate paper warping.
- Background: Random background color variations (shades of white/gray).
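The idea of "a random combination of effects" can be sketched as independent coin flips, one per effect, each followed by a parameter draw. This is an illustration only and not the generator's actual code: `sample_augmentation` is a hypothetical name, and apart from the rotation range of +/- 2 degrees stated above, the probability and every parameter range below are assumptions.

```python
import random


def sample_augmentation(rng: random.Random, p: float = 0.5) -> dict[str, float]:
    """Draw a random combination of effects and one parameter for each.

    Only the rotation range (+/- 2 degrees) comes from the documentation;
    the probability p and all other ranges are illustrative assumptions.
    """
    params: dict[str, float] = {}
    if rng.random() < p:
        params["rotation_deg"] = rng.uniform(-2.0, 2.0)  # slight skew
    if rng.random() < p:
        params["noise_sigma"] = rng.uniform(1.0, 10.0)  # Gaussian noise std dev
    if rng.random() < p:
        params["blur_radius"] = rng.uniform(0.3, 1.5)  # Gaussian blur radius
    if rng.random() < p:
        params["distortion_alpha"] = rng.uniform(1.0, 5.0)  # elastic strength
    if rng.random() < p:
        params["background_gray"] = float(rng.randint(200, 255))  # white to light gray
    return params


# A seeded generator makes the sampled combination reproducible.
print(sample_augmentation(random.Random(42)))
```

Sampling each effect independently means some images receive no augmentation and others receive all five, which gives the trained model a mix of clean and degraded samples.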
The command creates a directory structure ready for training:
    generated_data/
    ├── train/
    │   ├── images/
    │   │   ├── 00001_0.jpg  # Original line 1, augmentation 0
    │   │   ├── 00001_1.jpg  # Original line 1, augmentation 1
    │   │   └── ...
    │   └── labels.txt       # Mapping file
    └── val/                 # Optional
        ├── images/
        └── labels.txt

Format of labels.txt:

    train/images/00001_0.jpg Hello World
    train/images/00001_1.jpg Hello World
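A short sketch of reading this mapping file back, for example to sanity-check a generated dataset. It assumes the image path and the label are separated by whitespace and that image paths contain no spaces; the exact separator is not stated here. `parse_labels` and `split_sample_name` are hypothetical helpers, not part of Kiri OCR.

```python
from pathlib import PurePosixPath


def parse_labels(text: str) -> list[tuple[str, str]]:
    """Split each labels.txt line into (image_path, label_text)."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first run of whitespace only, so spaces inside the label survive.
        parts = line.strip().split(maxsplit=1)
        label = parts[1] if len(parts) > 1 else ""
        pairs.append((parts[0], label))
    return pairs


def split_sample_name(image_path: str) -> tuple[int, int]:
    """Decode a name like '00001_0.jpg' into (line_number, augmentation_index)."""
    line_no, aug_idx = PurePosixPath(image_path).stem.rsplit("_", 1)
    return int(line_no), int(aug_idx)


sample = (
    "train/images/00001_0.jpg Hello World\n"
    "train/images/00001_1.jpg Hello World\n"
)
for path, label in parse_labels(sample):
    print(split_sample_name(path), label)
```

With `--augment N`, each source line should appear `N` times in the mapping, once per augmentation index, so grouping the parsed pairs by line number is a quick way to spot missing images.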
- Diverse Fonts: Use as many different fonts as possible. For Khmer, ensure you have fonts that handle subscripts and vowels correctly (e.g., Khmer OS, Battambang).
- Realism: Always use `--random-augment` for training data. Clean data is good for initial tests, but models trained on it often fail on real scanned documents.
- Content Balance: Ensure your `corpus.txt` covers the vocabulary, special characters, and numbers expected in your target documents.
- Image Size: Match `--height` and `--width` to the settings you plan to use for training (default 48x640 for Kiri OCR). The generator's own defaults are 32x512, so pass both flags explicitly.
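The content-balance advice can be checked mechanically before generating images. The sketch below uses only the standard library to report characters that never appear in the corpus, or appear only rarely; `missing_characters`, `rare_characters`, and the `min_count` threshold are hypothetical choices made for illustration.

```python
from collections import Counter


def missing_characters(corpus_text: str, required: str) -> set[str]:
    """Return the characters in `required` that never occur in the corpus."""
    seen = set(corpus_text)
    return {ch for ch in required if ch not in seen}


def rare_characters(corpus_text: str, min_count: int = 5) -> dict[str, int]:
    """Return non-whitespace characters seen fewer than `min_count` times."""
    counts = Counter(ch for ch in corpus_text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n < min_count}


corpus = "Invoice 123\nTotal: 45.00\n"
print(sorted(missing_characters(corpus, "0123456789")))
print(rare_characters(corpus, min_count=2))
```

Characters that turn up as missing or rare are candidates for extra corpus lines, since the model sees them only as often as the text source contains them.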
© 2026 Kiri OCR. Released under the Apache 2.0 License.