
Performance gap between models loaded with modeling_hf_nomic_bert.py and modeling_nomic_bert.py #73

@HSILA

Description


Hello, and thank you for your amazing work! @zanussbaum

I have been using your repository extensively for domain-specific adaptations, but I've run into an issue I couldn't find an explanation for.

To clarify, I fine-tuned the model starting from nomic-ai/nomic-embed-text-v1-unsupervised using the contrastive_finetune.yaml configuration with the exact data you used for the Weakly-Supervised Contrastive Pretraining stage. My aim was simply to verify my fine-tuning process before proceeding further.

However, during evaluation on a custom MTEB-like retrieval task, I observed a notable performance gap between:

  • The final model uploaded to Hugging Face Hub with convert_to_hf.py (loaded via modeling_hf_nomic_bert.py using SentenceTransformer and mteb.get_model).

  • A model loaded with modeling_nomic_bert.py via the following snippet (inspired by your MTEB evaluation method here; assume I have written a wrapper for it modeled on nomic_models.py in mteb):

import torch
from contrastors import BiEncoder, BiEncoderConfig

config = BiEncoderConfig(
    model_name="src/ckpts/test/final_model",
    encoder=True,
    pooling="mean",
)
model = BiEncoder(config).to(torch.bfloat16)

To double-check, I encoded the same sentence using both methods and computed their cosine similarity. Surprisingly, the similarity was nearly zero.
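For concreteness, the comparison itself is just the cosine between the two embedding vectors. A minimal sketch (the vectors below are placeholders standing in for the actual outputs of the two loading paths, not real model outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings of the same sentence from the two loading paths.
emb_hf = np.array([0.1, 0.3, -0.2])           # via modeling_hf_nomic_bert.py
emb_contrastors = np.array([0.1, 0.3, -0.2])  # via modeling_nomic_bert.py

# Identical vectors give a similarity of ~1.0; in my case the two paths
# produced embeddings with similarity close to 0.
print(cosine_similarity(emb_hf, emb_contrastors))
```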

After reviewing my training setups, I suspect this issue might relate to the differing sequence lengths (seq_len: 2048 in contrastive_pretrain.yaml vs. seq_len: 512 in contrastive_finetune.yaml). Could this discrepancy cause such divergence, perhaps due to rotary embeddings or flash attention? Or is there possibly another argument I missed in the configuration?

Notably, this issue does not happen when the same seq_len value is maintained across training stages (e.g., fine-tuning from the nomic unsupervised model using a seq_len of 2048).

Also, could you clarify the recommended approach for loading and benchmarking models? In my experience, loading via modeling_nomic_bert.py reflects the effects of in-domain training more accurately than modeling_hf_nomic_bert.py, which tends to show similar performance across most models, even though the former runs in bf16 and the latter in fp32.

For reference, I am using the version before your Mixture-of-Experts (MoE) commit.

Any suggestions or insights would be greatly appreciated.

Thank you!
