
Performance gap between models loaded with modeling_hf_nomic_bert.py and modeling_nomic_bert.py #73

@HSILA

Description


Hello, and thank you for your amazing work! @zanussbaum

I have been using your repository extensively for domain-specific adaptations, but I've run into an issue I couldn't find an explanation for.

To clarify, I fine-tuned the model starting from nomic-ai/nomic-embed-text-v1-unsupervised using the contrastive_finetune.yaml configuration with the exact data you used for the Weakly-Supervised Contrastive Pretraining stage. My aim was simply to verify my fine-tuning process before proceeding further.

However, during evaluation on a custom MTEB-like retrieval task, I observed a notable performance gap between:

  • The final model uploaded to Hugging Face Hub with convert_to_hf.py (loaded via modeling_hf_nomic_bert.py using SentenceTransformer and mteb.get_model).

  • A model loaded with modeling_nomic_bert.py via the following snippet (inspired by your MTEB evaluation method here; assume I have written a wrapper for it modeled on nomic_models.py in mteb):

import torch
from contrastors import BiEncoder, BiEncoderConfig

config = BiEncoderConfig(
    model_name="src/ckpts/test/final_model",
    encoder=True,
    pooling="mean",
)
model = BiEncoder(config).to(torch.bfloat16)

To double-check, I encoded the same sentence using both methods and computed their cosine similarity. Surprisingly, the similarity was nearly zero.
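For concreteness, the comparison itself is just the cosine between the two embedding vectors. A minimal sketch (the vectors below are placeholders standing in for the actual outputs of the two loading paths, not real model outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings of the same sentence from the two loading paths.
emb_hf = np.array([0.1, 0.3, -0.2])           # via modeling_hf_nomic_bert.py
emb_contrastors = np.array([0.1, 0.3, -0.2])  # via modeling_nomic_bert.py

# Identical vectors give a similarity of ~1.0; in my case the two paths
# produced embeddings with similarity close to 0.
print(cosine_similarity(emb_hf, emb_contrastors))
```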

After reviewing my training setups, I suspect this issue might relate to the differing sequence lengths (seq_len: 2048 in contrastive_pretrain.yaml vs. seq_len: 512 in contrastive_finetune.yaml). Could this discrepancy cause such divergence, perhaps due to rotary embeddings or flash attention? Or is there possibly another argument I missed in the configuration?

Notably, this issue does not happen when the same seq_len value is maintained across training stages (e.g., fine-tuning from the nomic unsupervised model using a seq_len of 2048).

Also, could you clarify the recommended approach for loading and benchmarking models? In my experience, loading via modeling_nomic_bert.py reflects the effects of in-domain training more accurately than modeling_hf_nomic_bert.py, which tends to show similar performance across most models, even though the former runs in bf16 and the latter in fp32.

For reference, I am using the version before your Mixture-of-Experts (MoE) commit.

Any suggestions or insights would be greatly appreciated.

Thank you!
