Adapting Chat Language Models Using Only Target Unlabeled Language Data

This is a repository for the paper "Adapting Chat Language Models Using Only Target Unlabeled Language Data" accepted at TMLR 2025.

Requirements

See requirements.txt for the required packages. Also, we require PyTorch v2 or higher.

If you are using the conda package manager, you can create a new environment with the required packages by running:

# Create a new env for training and evaluation
conda create --name feb2025 python=3.12
conda activate feb2025
conda install conda-forge::pytorch
mkdir -m 700 src
cd src && git clone https://github.com/huggingface/transformers.git
cd transformers
pip3 install -e .
pip3 install peft datasets evaluate bitsandbytes scikit-learn sentencepiece huggingface-hub tqdm pyarrow protobuf tiktoken lighteval

# Create a new env for IFEval/MGSM/GSM8K evaluation
conda create --name feb2025_eval python=3.12
conda activate feb2025_eval
conda install conda-forge::pytorch
cd transformers
pip3 install -e .
pip3 install peft datasets evaluate bitsandbytes scikit-learn sentencepiece huggingface-hub tqdm pyarrow protobuf tiktoken
cd ..
git clone git@github.com:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout main
pip3 install -e .
pip3 install langdetect immutabledict nltk

Reproducing the results

1. Preprocessing

Please visit the preprocessing directory.

2. Initializing the model

Please visit the instantiation directory.

3. Training the model

Please visit the training directory.

4. Model merging

Please visit the merging directory.

5. Evaluation

Please visit the evaluation directory.

Models

All models are available on the Hugging Face Hub.

Llama 3.1 8B

Model	Links
Base+CPT	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Base+VE	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
CV	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Chat+CPT	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Chat+VE	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat\Merge	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat\Copy	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat (L)	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu

Qwen2.5 7B

Model	Links
Base+CPT	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Base+VE	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
CV	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Chat+CPT	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Chat+VE	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat	Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu

Qwen3 14B

Model	Links
Base+CPT	Amharic / Bengali / Telugu
Base+VE	Amharic / Bengali / Telugu
CV	Amharic / Bengali / Telugu
Chat+CPT	Amharic / Bengali / Telugu
Chat+VE	Amharic / Bengali / Telugu
ElChat	Amharic / Bengali / Telugu

Citation

If you use this code or the models in your research, please cite the following paper:

@article{yamaguchi2025adapting,
      title={Adapting Chat Language Models Using Only Target Unlabeled Language Data}, 
      author={Atsuki Yamaguchi and Terufumi Morishita and Aline Villavicencio and Nikolaos Aletras},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=6IdoIKowfe},
      note={}
}

License

This code is licensed under the MIT License unless otherwise stated in the file.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
evaluation		evaluation
instantiation		instantiation
merging		merging
preprocessing		preprocessing
training		training
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
motivation.png		motivation.png
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Adapting Chat Language Models Using Only Target Unlabeled Language Data

Requirements

Reproducing the results

1. Preprocessing

2. Initializing the model

3. Training the model

4. Model merging

5. Evaluation

Models

Llama 3.1 8B

Qwen2.5 7B

Qwen3 14B

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Adapting Chat Language Models Using Only Target Unlabeled Language Data

Requirements

Reproducing the results

1. Preprocessing

2. Initializing the model

3. Training the model

4. Model merging

5. Evaluation

Models

Llama 3.1 8B

Qwen2.5 7B

Qwen3 14B

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages