DiaFill Toolkit

A toolkit for synthesizing filler-rich, short-utterance Japanese dialogue scripts for speech-based interaction using Large Language Models (LLMs). Generation runs in two phases: Seed Generation (metadata creation) and Dialogue Generation (script creation).

Example of a Generated Dialogue Script

Below is an example of a Japanese call-center dialogue script automatically generated by DiaFill. The dialogue is characterized by short, filler-rich utterances and an incremental, spoken-style structure.

オペレーター: お電話ありがとうございます。Aモバイル担当の加藤でございます本日はいかがなさいましたか？
カスタマー: はい。えっとー最近電力料金がすごく上がってしまってちょっと使用量が多いのかなと思いまして、
オペレーター: あはい。
カスタマー: 使用量を確認したいんですけども、どのようにしたらいいでしょうか？
オペレーター: かしこまりました。ではえっとー電力料金の使用量がえーと確認されたいということでございますね。はい。ではお調べいたしますので少々お待ちくださいませ。はい。お待たせいたしました。はい。ではあのー電気の使用量の確認方法でございますが、こちらのご案内でよろしいですか？
カスタマー: はい大丈夫です。
オペレーター: かしこまりました。ではあのーご案内の前にですね、あのーご確認でございますが、えーっとー弊社のあのーサービスあーのご契約でいらっしゃいますか？
カスタマー: はいそうです。
オペレーター: あはい。かしこまりました。ではあのーご使用量のあのー確認の方法でございますが、まずですねマイページの方にログインをしていただきます。
カスタマー: マイページはい。
オペレーター: はい。でその後にあのー使用量とま料金が表示されますのでそちらから確認ができます。
・・・

Directory Structure

.
├── dialogue_generation/ ... Module for generating dialogue scripts
├── seed_generation/     ... Module for generating dialogue seeds (metadata)
├── examples/            ... Configuration examples (Jsonnet)
│   └── configs/
│       ├── dialogue_generation/
│       └── seed_generation/
├── samples/             ... Sample output directory for generated files
└── requirements.txt     ... Dependency definitions

Usage

Step 1: Seed Generation

Generates the metadata for dialogues (topics, speaker names, tones, summaries). The module filters out Chinese-specific characters (to keep the generated Japanese natural) and avoids duplicate seeds by tracking the generation history.

python -m seed_generation.main \
    --settings examples/configs/seed_generation/callcenter_unseen.jsonnet
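
The character filtering and history tracking described for this step can be pictured with the following minimal sketch. It is not the toolkit's implementation: the Shift_JIS encodability check is an assumed heuristic for spotting Chinese-specific characters, and the `topic` key and the helper names are hypothetical.

```python
import unicodedata


def is_japanese_encodable(text: str) -> bool:
    """Return True if every character can be encoded as Shift_JIS.

    Assumption: characters outside the JIS X 0208 repertoire (for example
    simplified Chinese forms) count as "Chinese-specific". This is a rough
    heuristic, not the rule used by the toolkit.
    """
    try:
        text.encode("shift_jis")
        return True
    except UnicodeEncodeError:
        return False


def normalize(topic: str) -> str:
    """Normalize a topic so trivially different strings compare equal."""
    return unicodedata.normalize("NFKC", topic).strip().lower()


def filter_seeds(candidates, history):
    """Keep seeds that pass the character check and are not yet in `history`.

    `candidates` is a list of dicts with a hypothetical "topic" key.
    `history` is a set of normalized topics and is updated in place.
    """
    accepted = []
    for seed in candidates:
        key = normalize(seed["topic"])
        if not is_japanese_encodable(seed["topic"]) or key in history:
            continue
        history.add(key)
        accepted.append(seed)
    return accepted


history = set()
candidates = [
    {"topic": "電気料金の確認"},
    {"topic": "我们的服务"},      # contains simplified Chinese characters
    {"topic": "電気料金の確認"},  # duplicate of the first seed
]
accepted = filter_seeds(candidates, history)
```

Persisting `history` between runs (for example as a text file) would let later runs skip topics that were already generated.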

Configuration Parameters (inside .jsonnet):

Key	Description	Example
`total`	Number of seeds to generate.	`50`
`type`	Generation type (`chit_chat`, `call_center_seen`, or `call_center_unseen`).	`"call_center_unseen"`
`model`	Hugging Face model ID used for generation.	`"Qwen/Qwen2.5-32B-Instruct"`
`output_file`	Path to save the output JSONL file.	`"samples/seeds.jsonl"`
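
As a rough illustration of how the keys above fit together, the sketch below checks a settings dictionary as it might look after the Jsonnet file has been evaluated. The function `validate_seed_settings` is hypothetical and is not part of the toolkit; only the key names and the allowed `type` values come from the table.

```python
ALLOWED_TYPES = {"chit_chat", "call_center_seen", "call_center_unseen"}


def validate_seed_settings(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    total = settings.get("total")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(total, bool) or not isinstance(total, int) or total <= 0:
        problems.append("total must be a positive integer")
    if settings.get("type") not in ALLOWED_TYPES:
        problems.append("type must be one of " + ", ".join(sorted(ALLOWED_TYPES)))
    for key in ("model", "output_file"):
        value = settings.get(key)
        if not isinstance(value, str) or not value:
            problems.append(key + " must be a non-empty string")
    return problems


# Values taken from the "Example" column of the table.
example_settings = {
    "total": 50,
    "type": "call_center_unseen",
    "model": "Qwen/Qwen2.5-32B-Instruct",
    "output_file": "samples/seeds.jsonl",
}
problems = validate_seed_settings(example_settings)
```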

Step 2: Dialogue Script Generation

Generates the actual dialogue scripts from input seeds. You can use the output of Step 1 or your own manually created seeds (JSONL format). The module detects exact phrase repetitions and, when one is found, automatically retries generation with a higher repetition penalty.
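
For manually created seeds, a JSONL file holds one JSON object per line. The sketch below shows one way to write and read such a file while keeping Japanese text readable. The field names (`topic`, `speakers`, `tone`, `summary`) are placeholders chosen for illustration; check a file produced by Step 1 for the actual schema before writing seeds by hand.

```python
import io
import json


def dump_jsonl(records, fp):
    """Write one JSON object per line; ensure_ascii=False keeps Japanese readable."""
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(fp):
    """Read JSONL, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


# Hypothetical seed record; the real keys may differ.
records = [
    {
        "topic": "電気使用量の確認",
        "speakers": ["オペレーター", "カスタマー"],
        "tone": "polite",
        "summary": "The customer asks how to check electricity usage.",
    }
]

# An in-memory buffer stands in for a file such as samples/seeds.jsonl.
buffer = io.StringIO()
dump_jsonl(records, buffer)
buffer.seek(0)
loaded = load_jsonl(buffer)
```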

We provide sbintuitions/diafill-sarashina2.2-3b-instruct (3B) and sbintuitions/diafill-llm-jp-3.1-13b-instruct4 (13B) as the generation models.

python -m dialogue_generation.main \
    --settings examples/configs/dialogue_generation/callcenter_unseen.jsonnet

Configuration Parameters (inside .jsonnet):

Key	Description	Example
`input_file`	Path to the seed file (Step 1 output or manual file).	`"samples/seeds.jsonl"`
`output_file`	Path to save the final scripts.	`"samples/scripts.jsonl"`
`model_name`	Hugging Face model ID.	`"sbintuitions/diafill-sarashina2.2-3b-instruct"`
`repetition_penalty`	Penalty score for repetition (Default: `1.1`). A value slightly higher than `1.0` (e.g., the default setting) is generally sufficient.	`1.1`
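
The repetition check and retry described at the start of this step could look roughly like the sketch below. It is an illustration under stated assumptions, not the toolkit's code: the character n-gram test, the thresholds, the penalty step, and the `generate` callable are all hypothetical.

```python
from collections import Counter


def has_repeated_phrase(text: str, n: int = 12, max_count: int = 2) -> bool:
    """Return True if any n-character substring occurs more than max_count times."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return any(count > max_count for count in counts.values())


def generate_with_retry(generate, seed, penalty=1.1, step=0.05, max_retries=3):
    """Call generate(seed, penalty), raising the penalty while output repeats.

    Returns (script, penalty_used). If every retry still repeats, the last
    attempt is returned as-is.
    """
    script = generate(seed, penalty)
    for _ in range(max_retries):
        if not has_repeated_phrase(script):
            break
        penalty = round(penalty + step, 2)
        script = generate(seed, penalty)
    return script, penalty


def fake_generate(seed, penalty):
    """Stand-in for a model call: loops on one phrase until the penalty reaches 1.2."""
    if penalty < 1.2:
        return "はい。かしこまりました。少々お待ちくださいませ。" * 5
    return "はい。かしこまりました。"


script, used_penalty = generate_with_retry(fake_generate, seed={})
```

Starting from the default `1.1` and raising the penalty in small steps keeps the value close to `1.0`, in line with the guidance in the table.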

Licence

Apache License 2.0
