This is a repository for the paper "Adapting Chat Language Models Using Only Target Unlabeled Language Data" accepted at TMLR 2025.
See requirements.txt for the required packages. Also, we require PyTorch v2 or higher.
If you are using the conda package manager, you can create a new environment with the required packages by running:
# Create a new env for training and evaluation
conda create --name feb2025 python=3.12
conda activate feb2025
conda install conda-forge::pytorch
mkdir -m 700 src
cd src && git clone https://github.com/huggingface/transformers.git
cd transformers
pip3 install -e .
pip3 install peft datasets evaluate bitsandbytes scikit-learn sentencepiece huggingface-hub tqdm pyarrow protobuf tiktoken lighteval
# Create a new env for IFEval/MGSM/GSM8K evaluation
conda create --name feb2025_eval python=3.12
conda activate feb2025_eval
conda install conda-forge::pytorch
cd transformers
pip3 install -e .
pip3 install peft datasets evaluate bitsandbytes scikit-learn sentencepiece huggingface-hub tqdm pyarrow protobuf tiktoken
cd ..
git clone git@github.com:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout main
pip3 install -e .
pip3 install langdetect immutabledict nltkPlease visit the preprocessing directory.
Please visit the instantiation directory.
Please visit the training directory.
Please visit the merging directory.
Please visit the evaluation directory.
All models are available on the Hugging Face Hub.
| Model | Links |
|---|---|
| Base+CPT | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| Base+VE | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| CV | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| Chat+CPT | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| Chat+VE | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| ElChat | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| ElChat\Merge | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| ElChat\Copy | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| ElChat (L) | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| Model | Links |
|---|---|
| Base+CPT | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| Base+VE | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| CV | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| Chat+CPT | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| Chat+VE | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| ElChat | Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu |
| Model | Links |
|---|---|
| Base+CPT | Amharic / Bengali / Telugu |
| Base+VE | Amharic / Bengali / Telugu |
| CV | Amharic / Bengali / Telugu |
| Chat+CPT | Amharic / Bengali / Telugu |
| Chat+VE | Amharic / Bengali / Telugu |
| ElChat | Amharic / Bengali / Telugu |
If you use this code or the models in your research, please cite the following paper:
@article{yamaguchi2025adapting,
title={Adapting Chat Language Models Using Only Target Unlabeled Language Data},
author={Atsuki Yamaguchi and Terufumi Morishita and Aline Villavicencio and Nikolaos Aletras},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=6IdoIKowfe},
note={}
}
This code is licensed under the MIT License unless otherwise stated in the file.
