Synthetic data generation pipeline for KoViDoRe v2 benchmark
KoViDoRe Data Generator is a synthetic data generation pipeline designed to construct the KoViDoRe v2 benchmark for evaluating Korean Vision Document Retrievers. Inspired by ViDoRe V3, this pipeline addresses a key limitation of KoViDoRe v1—single-page matching—by generating queries that require synthesizing information across multiple pages rather than retrieving answers from a single page in isolation.
The pipeline consists of four main stages: corpus building, summary generation, query generation, and false negative filtering. For detailed documentation, see PIPELINE.md.
uv sync
source .venv/bin/activate-
Export your Upstage API key
export UPSTAGE_API_KEY=your_upstage_api_key_here -
Build the corpus from PDF documents
python build_corpus.py --subsets cybersecurity
-
Run the pipeline for the target task
# KoViDoRe v2 followed the process below: # ----------------------------------------------- # 1. generate query from cross-section summary # ----------------------------------------------- # 1-1. generate single-section summary based on corpus bash scripts/run.sh --subsets cybersecurity --task single_section_summary # 1-2. generate cross-section summary based on single-section summary bash scripts/run.sh --subsets cybersecurity --task cross_section_summary # 1-3. generate query from cross-section summary bash scripts/run.sh --subsets cybersecurity --task query_from_summary # 1-4. filter false negatives with LLM bash scripts/run.sh --subsets cybersecurity --task filter_query_from_summary # ----------------------------------------------- # 2. generate query from context # ----------------------------------------------- # 2-1. generate query from context bash scripts/run.sh --subsets cybersecurity --task query_from_context # 2-2. filter false negatives with LLM bash scripts/run.sh --subsets cybersecurity --task filter_query_from_context # ----------------------------------------------- # 3. quality control and audit checks (by human) # -----------------------------------------------
KoViDoRe v2 includes four subsets, each focusing on a distinct, enterprise-relevant domain:
| Subset | Description | Link |
|---|---|---|
| HR | Workforce outlook and employment policy | 🤗 Dataset |
| Energy | Energy policy and power market trends | 🤗 Dataset |
| Economic | Quarterly economic trend reports | 🤗 Dataset |
| Cybersecurity | Cyber threat analysis and security guides | 🤗 Dataset |
MIT
This pipeline is inspired by the ViDoRe V3, and we thank the original authors for their foundational work. We also extend our gratitude to the NVIDIA NeMo Data Designer team for open-sourcing their library. Finally, We thank the Upstage x AWS AI Initiative for granting us free access to their API services.
We also acknowledge the datasets provided by the Public Data Portal(공공데이터포털), which were utilized to construct the tasks in KoViDoRe v2.
For questions or suggestions, please open an issue on the GitHub repository or contact the maintainers:
If you use KoViDoRe v2 in your research, please cite as follows:
@misc{choi2026kovidorev2,
author = {Yongbin Choi},
title = {KoViDoRe v2: a comprehensive evaluation of vision document retrieval for enterprise use-cases},
year = {2026},
url = {https://github.com/whybe-choi/kovidore-data-generator},
note = {A benchmark for evaluating Korean vision document retrieval with multi-page reasoning queries in practical domains}
}