### Feature description
`UlyssesSPAttentionHF.register_with_transformers` assumes that a text-only model will be used. I am hoping to use it with a VLM (for example, Qwen/Qwen3-VL-8B-Instruct).
`UlyssesSPAttentionHF.register_with_transformers` also assumes that certain keys exist in the config, for example `attn_head_count=hf_model_config.num_attention_heads`. For Qwen3VL, however, the correct key would be `hf_model_config.text_config.num_attention_heads`. This would be easy enough to change to make it specific to the language model (or the vision encoder, as necessary).
The broader problem, however, is that this new attention implementation is architecture-dependent (i.e., it depends on the number of heads) yet is used to overwrite the global attention implementation:
```python
# So instead we hack `ALL_ATTENTION_FUNCTIONS` to override all existing keys
# with our implementation, since it only gets used at the point of calling the
# attention and that's what we want, all other code branches relying on the
# original core `attn_implementation` will still be executed. This is what we
# called "Being John Malkovich"
for key in ALL_ATTENTION_FUNCTIONS.keys():
    ALL_ATTENTION_FUNCTIONS[key] = uattn_wrapper
```
This enables loading the model with no changes, i.e. via `AutoModel.from_pretrained`. However, the vision encoder has a different architecture with a different number of heads. When it is loaded, it receives the same sequence-parallel attention as the language model, which is incorrect. Moreover, sequence-parallel attention is likely unnecessary for the vision encoder, since it is much smaller than the language model. But because the global attention implementation is overwritten with the Ulysses attention, changing the attention implementation for the language model changes it for the vision encoder as well.
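The effect of the global override can be illustrated with stand-in names (this is a toy sketch, not DeepSpeed or transformers code): once every key in the registry maps to the Ulysses wrapper, any submodule that resolves its attention implementation by name gets the wrapper, whatever its architecture.

```python
# Stand-in registry: real transformers keeps a global mapping from
# implementation name (e.g. "sdpa", "eager") to an attention function.
ALL_ATTENTION_FUNCTIONS = {
    "sdpa": lambda: "original sdpa",
    "eager": lambda: "original eager",
}

def uattn_wrapper():
    # Stand-in for the Ulysses sequence-parallel attention wrapper.
    return "ulysses attention"

# The global override quoted above: every key now points at the wrapper.
for key in ALL_ATTENTION_FUNCTIONS.keys():
    ALL_ATTENTION_FUNCTIONS[key] = uattn_wrapper

# The language model resolves its implementation by name -- intended:
assert ALL_ATTENTION_FUNCTIONS["sdpa"]() == "ulysses attention"
# The vision encoder resolves its own name -- also overridden, unintentionally:
assert ALL_ATTENTION_FUNCTIONS["eager"]() == "ulysses attention"
```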
### Describe the solution you'd like
The ability to use UlyssesSPAttentionHF.register_with_transformers to enable sequence parallel for the language model, but not the vision encoder, of a VLM.
### Describe alternatives you've considered
For selecting the correct config keys, an additional parameter could be added to `UlyssesSPAttentionHF.register_with_transformers` that specifies which config key contains the architectural details used by the new attention function.
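A hypothetical sketch of that parameter (`sub_config_key` is an invented name, and the config objects below are stand-ins, not real transformers classes): read the architectural fields from a nested sub-config when one is named, and from the top level otherwise.

```python
from types import SimpleNamespace

def resolve_attn_head_count(hf_model_config, sub_config_key=None):
    """Return num_attention_heads, optionally from a nested sub-config."""
    cfg = hf_model_config
    if sub_config_key is not None:
        # e.g. "text_config" for the language model of a VLM
        cfg = getattr(cfg, sub_config_key)
    return cfg.num_attention_heads

# Text-only model: heads live at the top level of the config.
text_only = SimpleNamespace(num_attention_heads=32)
assert resolve_attn_head_count(text_only) == 32

# VLM shaped like Qwen3VL: the language model's heads live under text_config.
vlm = SimpleNamespace(text_config=SimpleNamespace(num_attention_heads=32))
assert resolve_attn_head_count(vlm, sub_config_key="text_config") == 32
```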
For the global overwriting of the attention implementations, the fix is less clear to me. Perhaps the vision encoder could be loaded before the global overwrite and the language model after, with the two then reconnected? However, that would mean interfering with model loading.
I am happy to make the necessary changes and submit a PR, but direction on the best approach to solving these issues would be appreciated.