Hello, and thank you for your amazing work! @zanussbaum
I have been using your repository extensively for domain-specific adaptations, but I've encountered an issue that I couldn't find an explanation for.
To clarify, I fine-tuned the model starting from nomic-ai/nomic-embed-text-v1-unsupervised using the contrastive_finetune.yaml configuration with the exact data you used for the Weakly-Supervised Contrastive Pretraining stage. My aim was simply to verify my fine-tuning process before proceeding further.
However, during evaluation on a custom MTEB-like retrieval task, I observed a notable performance gap between:

- The final model uploaded to the Hugging Face Hub with `convert_to_hf.py` (loaded via `modeling_hf_nomic_bert.py` using `SentenceTransformer` and `mteb.get_model`).
- A model loaded via `modeling_nomic_bert.py` with the following snippet (inspired by your MTEB evaluation method here, and assuming I have written a wrapper for it inspired by `nomic_models.py` in mteb):
```python
import torch

from contrastors import BiEncoder, BiEncoderConfig

config = BiEncoderConfig(
    model_name="src/ckpts/test/final_model",
    encoder=True,
    pooling="mean",
)
model = BiEncoder(config).to(torch.bfloat16)
```

To double-check, I encoded the same sentence using both methods and computed their cosine similarity. Surprisingly, the similarity was nearly zero.
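For reference, the cross-check itself reduces to mean pooling followed by cosine similarity; here is a minimal numpy sketch of that computation (function names are mine, not from the repo):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions.

    token_embeddings: (batch, seq, dim); attention_mask: (batch, seq) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

As a sanity check, the same sentence encoded twice through the same path should score ~1.0; a score near zero means the two loading paths produce essentially unrelated vectors, not merely numerically noisy ones.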
After reviewing my training setups, I suspect this issue might relate to the differing sequence lengths (`seq_len: 2048` in `contrastive_pretrain.yaml` vs. `seq_len: 512` in `contrastive_finetune.yaml`). Could this discrepancy cause such divergence, perhaps through the rotary embeddings or flash attention? Or is there another configuration argument I may have missed?
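On the rotary hypothesis: with vanilla RoPE, the rotation applied at position p depends only on p, the head dimension, and the base, not on the configured `seq_len`, so a plain 2048-vs-512 mismatch would not by itself change the math. It would matter if some NTK-style/dynamic scaling of the rotary base is active (whether it is in this code path is an assumption on my part, and the exact scaling rule below is illustrative, not copied from the repo):

```python
def rope_inv_freq(dim, base=10000.0):
    """Per-pair inverse frequencies for vanilla RoPE.

    The angle applied at position p for pair i is p * inv_freq[i].
    """
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def ntk_scaled_base(base, dim, seq_len, orig_max_len):
    """NTK-style dynamic base adjustment (assumed form; check the repo's
    rotary implementation for the exact rule it applies, if any)."""
    if seq_len <= orig_max_len:
        return base
    scale = seq_len / orig_max_len
    return base * scale ** (dim / (dim - 2))

dim = 64
# If such scaling were active, the same position would be rotated with
# different frequencies under the two sequence lengths:
f_512 = rope_inv_freq(dim, ntk_scaled_base(10000.0, dim, 512, 512))
f_2048 = rope_inv_freq(dim, ntk_scaled_base(10000.0, dim, 2048, 512))
```

If the checkpoint's rotary buffers were built under one rule and inference runs under another, every position past 0 gets a different rotation, which is the kind of silent, global mismatch that could plausibly produce near-zero cosine similarity.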
Notably, this issue does not happen when the same seq_len value is maintained across training stages (e.g., fine-tuning from the nomic unsupervised model using a seq_len of 2048).
Also, could you clarify the recommended approach for loading and benchmarking models? In my experience, loading via `modeling_nomic_bert.py` reflects the effects of in-domain training more accurately than `modeling_hf_nomic_bert.py`, which tends to show similar performance across most models, although the former runs in bf16 and the latter in fp32.
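On the bf16 vs. fp32 point: dtype alone seems very unlikely to explain the gap. Rounding a fp32 embedding to bfloat16 perturbs each element by at most roughly 0.4% relative, which barely moves cosine similarity. A quick numpy check, emulating bfloat16 by truncating the float32 mantissa (truncation rather than round-to-nearest, so slightly pessimistic):

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 by zeroing the low 16 bits of each float32 value."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb = rng.standard_normal(768).astype(np.float32)
sim = cosine(emb, to_bfloat16(emb))
# sim stays very close to 1.0, so a near-zero similarity between the two
# loading paths points at a modeling/weights difference, not precision.
```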
For reference, I am using the version before your Mixture-of-Experts (MoE) commit.
Any suggestions or insights would be greatly appreciated.
Thank you!