Skip to content

gucci-j/chat-cve

Repository files navigation

Adapting Chat Language Models Using Only Target Unlabeled Language Data

This is a repository for the paper "Adapting Chat Language Models Using Only Target Unlabeled Language Data" accepted at TMLR 2025.

motivation

Requirements

See requirements.txt for the required packages. Also, we require PyTorch v2 or higher.

If you are using the conda package manager, you can create a new environment with the required packages by running:

# Create a new env for training and evaluation
conda create --name feb2025 python=3.12
conda activate feb2025
conda install conda-forge::pytorch
mkdir -m 700 src
cd src && git clone https://github.com/huggingface/transformers.git
cd transformers
pip3 install -e .
pip3 install peft datasets evaluate bitsandbytes scikit-learn sentencepiece huggingface-hub tqdm pyarrow protobuf tiktoken lighteval

# Create a new env for IFEval/MGSM/GSM8K evaluation
conda create --name feb2025_eval python=3.12
conda activate feb2025_eval
conda install conda-forge::pytorch
cd transformers
pip3 install -e .
pip3 install peft datasets evaluate bitsandbytes scikit-learn sentencepiece huggingface-hub tqdm pyarrow protobuf tiktoken
cd ..
git clone git@github.com:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout main
pip3 install -e .
pip3 install langdetect immutabledict nltk

Reproducing the results

1. Preprocessing

Please visit the preprocessing directory.

2. Initializing the model

Please visit the instantiation directory.

3. Training the model

Please visit the training directory.

4. Model merging

Please visit the merging directory.

5. Evaluation

Please visit the evaluation directory.

Models

All models are available on the Hugging Face Hub.

Llama 3.1 8B

Model Links
Base+CPT Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Base+VE Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
CV Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Chat+CPT Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Chat+VE Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat\Merge Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat\Copy Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat (L) Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu

Qwen2.5 7B

Model Links
Base+CPT Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Base+VE Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
CV Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Chat+CPT Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
Chat+VE Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu
ElChat Amharic / Bengali / Burmese / Gujarati / Sinhala / Tamil / Telugu

Qwen3 14B

Model Links
Base+CPT Amharic / Bengali / Telugu
Base+VE Amharic / Bengali / Telugu
CV Amharic / Bengali / Telugu
Chat+CPT Amharic / Bengali / Telugu
Chat+VE Amharic / Bengali / Telugu
ElChat Amharic / Bengali / Telugu

Citation

If you use this code or the models in your research, please cite the following paper:

@article{yamaguchi2025adapting,
      title={Adapting Chat Language Models Using Only Target Unlabeled Language Data}, 
      author={Atsuki Yamaguchi and Terufumi Morishita and Aline Villavicencio and Nikolaos Aletras},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=6IdoIKowfe},
      note={}
}

License

This code is licensed under the MIT License unless otherwise stated in the file.

About

TMLR - Adapting Chat Language Models Using Only Target Unlabeled Language Data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors