### Feature description
`UlyssesSPAttentionHF.register_with_transformers` assumes that a text-only model will be used. I am hoping to use it with a VLM (for example, Qwen/Qwen3-VL-8B-Instruct).
`UlyssesSPAttentionHF.register_with_transformers` also assumes that certain keys exist in the config, for example `attn_head_count=hf_model_config.num_attention_heads`. For Qwen3VL, however, the correct key would be `hf_model_config.text_config.num_attention_heads`. This would be easy enough to change to make it specific to the language model (or the vision encoder, as necessary).
The broader problem, however, is that this new attention implementation is architecture-dependent (i.e., it depends on the number of heads) yet is used to overwrite the global attention implementation:
```python
# So instead we hack `ALL_ATTENTION_FUNCTIONS` to override all existing keys
# with our implementation, since it only gets used at the point of calling the
# attention and that's what we want, all other code branches relying on the
# original core `attn_implementation` will still be executed. This is what we
# called "Being John Malkovich"
for key in ALL_ATTENTION_FUNCTIONS.keys():
    ALL_ATTENTION_FUNCTIONS[key] = uattn_wrapper
```
This enables loading the model with no changes, i.e. via `AutoModel.from_pretrained`. However, the vision encoder has a different architecture with a different number of heads. When it is loaded, it receives the same sequence-parallel attention as the language model, which is incorrect. Moreover, sequence-parallel attention is likely unnecessary for the vision encoder, since it is much smaller than the language model. But because the global attention implementation is overwritten with the Ulysses attention, changing the attention implementation for the language model changes it for the vision encoder as well.
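The effect of the global override can be illustrated with stand-in names (this is a toy sketch, not DeepSpeed or transformers code): once every key in the registry maps to the Ulysses wrapper, any submodule that resolves its attention implementation by name gets the wrapper, whatever its architecture.

```python
# Stand-in registry: real transformers keeps a global mapping from
# implementation name (e.g. "sdpa", "eager") to an attention function.
ALL_ATTENTION_FUNCTIONS = {
    "sdpa": lambda: "original sdpa",
    "eager": lambda: "original eager",
}

def uattn_wrapper():
    # Stand-in for the Ulysses sequence-parallel attention wrapper.
    return "ulysses attention"

# The global override quoted above: every key now points at the wrapper.
for key in ALL_ATTENTION_FUNCTIONS.keys():
    ALL_ATTENTION_FUNCTIONS[key] = uattn_wrapper

# The language model resolves its implementation by name -- intended:
assert ALL_ATTENTION_FUNCTIONS["sdpa"]() == "ulysses attention"
# The vision encoder resolves its own name -- also overridden, unintentionally:
assert ALL_ATTENTION_FUNCTIONS["eager"]() == "ulysses attention"
```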
### Describe the solution you'd like
The ability to use UlyssesSPAttentionHF.register_with_transformers to enable sequence parallel for the language model, but not the vision encoder, of a VLM.
### Describe alternatives you've considered
For selecting the correct config keys, an additional parameter could be added to `UlyssesSPAttentionHF.register_with_transformers` that specifies which config key contains the architectural details used by the new attention function.
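A hypothetical sketch of that parameter (`sub_config_key` is an invented name, and the config objects below are stand-ins, not real transformers classes): read the architectural fields from a nested sub-config when one is named, and from the top level otherwise.

```python
from types import SimpleNamespace

def resolve_attn_head_count(hf_model_config, sub_config_key=None):
    """Return num_attention_heads, optionally from a nested sub-config."""
    cfg = hf_model_config
    if sub_config_key is not None:
        # e.g. "text_config" for the language model of a VLM
        cfg = getattr(cfg, sub_config_key)
    return cfg.num_attention_heads

# Text-only model: heads live at the top level of the config.
text_only = SimpleNamespace(num_attention_heads=32)
assert resolve_attn_head_count(text_only) == 32

# VLM shaped like Qwen3VL: the language model's heads live under text_config.
vlm = SimpleNamespace(text_config=SimpleNamespace(num_attention_heads=32))
assert resolve_attn_head_count(vlm, sub_config_key="text_config") == 32
```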
For the global overwriting of the attention implementations, the fix is less clear to me. Perhaps the vision encoder could be loaded before the global overwrite and the language model after, with the two then reconnected? However, that would mean interfering with model loading.
I am happy to make the necessary changes and submit a PR, but direction on the best approach to solving these issues would be appreciated.