KoViDoRe Data Generator

Synthetic data generation pipeline for the KoViDoRe v2 benchmark

Overview

KoViDoRe Data Generator is a synthetic data generation pipeline for constructing the KoViDoRe v2 benchmark, which evaluates Korean Vision Document Retrievers. Inspired by ViDoRe V3, it addresses a key limitation of KoViDoRe v1, single-page matching, by generating queries that require synthesizing information across multiple pages rather than retrieving an answer from one page in isolation.

Pipeline

The pipeline consists of four main stages: corpus building, summary generation, query generation, and false negative filtering. For detailed documentation, see PIPELINE.md.

Installation

uv sync
source .venv/bin/activate

Quick Start

Export your Upstage API key

export UPSTAGE_API_KEY=your_upstage_api_key_here

Build the corpus from PDF documents

python build_corpus.py --subsets cybersecurity

Run the pipeline for the target task

# KoViDoRe v2 followed the process below:

# -----------------------------------------------
# 1. generate query from cross-section summary
# -----------------------------------------------
# 1-1. generate single-section summary based on corpus
bash scripts/run.sh --subsets cybersecurity --task single_section_summary

# 1-2. generate cross-section summary based on single-section summary
bash scripts/run.sh --subsets cybersecurity --task cross_section_summary

# 1-3. generate query from cross-section summary
bash scripts/run.sh --subsets cybersecurity --task query_from_summary

# 1-4. filter false negatives with LLM
bash scripts/run.sh --subsets cybersecurity --task filter_query_from_summary

# -----------------------------------------------
# 2. generate query from context
# -----------------------------------------------
# 2-1. generate query from context
bash scripts/run.sh --subsets cybersecurity --task query_from_context

# 2-2. filter false negatives with LLM
bash scripts/run.sh --subsets cybersecurity --task filter_query_from_context

# -----------------------------------------------
# 3. quality control and audit checks (by human)
# -----------------------------------------------
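
The task order above can also be driven from a short script. The following is a minimal sketch, not part of the repository: it assumes `scripts/run.sh` accepts `--subsets` and `--task` exactly as shown in the commands above, and that the tasks need `UPSTAGE_API_KEY` to be exported first. The names `TASK_ORDER`, `build_commands`, and `run_pipeline` are hypothetical helpers introduced only for illustration.

```python
import os
import subprocess

# Task names in the order listed in the Quick Start commands above.
TASK_ORDER = [
    "single_section_summary",
    "cross_section_summary",
    "query_from_summary",
    "filter_query_from_summary",
    "query_from_context",
    "filter_query_from_context",
]


def build_commands(subset: str) -> list[list[str]]:
    """Return one scripts/run.sh invocation per task, in the listed order."""
    return [
        ["bash", "scripts/run.sh", "--subsets", subset, "--task", task]
        for task in TASK_ORDER
    ]


def run_pipeline(subset: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    # Assumption: the tasks call the Upstage API, so fail early without a key.
    if not dry_run and not os.environ.get("UPSTAGE_API_KEY"):
        raise RuntimeError("UPSTAGE_API_KEY is not set")
    for cmd in build_commands(subset):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises on a non-zero exit code, so a failed task
            # is not silently followed by the next one.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline("cybersecurity", dry_run=True)
```

With `dry_run=True` the script only prints the six commands, which is a cheap way to confirm the order before spending API calls. The human quality control step is deliberately left out, since it is not a scripted task.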

Datasets

KoViDoRe v2 includes four subsets, each focusing on a distinct, enterprise-relevant domain:

Subset	Description	Link
HR	Workforce outlook and employment policy	🤗 Dataset
Energy	Energy policy and power market trends	🤗 Dataset
Economic	Quarterly economic trend reports	🤗 Dataset
Cybersecurity	Cyber threat analysis and security guides	🤗 Dataset

License

MIT

Acknowledgements

This pipeline is inspired by ViDoRe V3, and we thank the original authors for their foundational work. We also extend our gratitude to the NVIDIA NeMo Data Designer team for open-sourcing their library. Finally, we thank the Upstage x AWS AI Initiative for granting us free access to their API services.

We also acknowledge the datasets provided by the Public Data Portal (공공데이터포털), which were used to construct the tasks in KoViDoRe v2.

Contact

For questions or suggestions, please open an issue on the GitHub repository or contact the maintainers:

Yongbin Choi - [email protected]

Citation

If you use KoViDoRe v2 in your research, please cite as follows:

@misc{choi2026kovidorev2,
  author = {Yongbin Choi},
  title = {KoViDoRe v2: a comprehensive evaluation of vision document retrieval for enterprise use-cases},
  year = {2026},
  url = {https://github.com/whybe-choi/kovidore-data-generator},
  note = {A benchmark for evaluating Korean vision document retrieval with multi-page reasoning queries in practical domains}
}
