
[Hopper] Support sink attention (learnable sink) #4

Open
aoxy wants to merge 20 commits into main from feature/hopper_attention_with_sink

aoxy commented Mar 4, 2026

Owner

Description

This PR adds support for sink attention (learnable sink) in FlashAttention-3 (Hopper architecture), matching the functionality already available in FlashAttention-2.

Sink attention adds one learnable logit per head that takes part in the softmax normalization alongside the attention scores but has no value vector attached. This "sink" absorbs part of the attention probability, so the weights over the real keys can sum to less than 1. Models such as GPT-OSS use this mechanism.
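
To make the mechanism concrete, here is a minimal pure-Python sketch for a single query row of a single head. It is a reference-style illustration only, not the Hopper kernel; the function names `sink_softmax` and `sink_attention_row` are hypothetical, and it assumes the sink enters only the softmax denominator.

```python
import math


def sink_softmax(scores, sink):
    """Softmax over `scores` with one extra sink logit in the denominator.

    Assumption: the sink contributes exp(sink) to the normalizer but has
    no value vector, so the returned weights sum to less than 1.
    """
    m = max(max(scores), sink)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]


def sink_attention_row(scores, values, sink):
    """Attention output for one query row: sum_j p_j * v_j."""
    probs = sink_softmax(scores, sink)
    dim = len(values[0])
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
```

Passing `sink=float("-inf")` recovers the ordinary softmax, which is a convenient sanity check when comparing against a path without a sink.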

Key Changes:

  • Core Implementation: Added support for learnable_sink in the Hopper forward and backward kernels.
  • Python Interface: Updated flash_attn_func, flash_attn_varlen_func, and flash_attn_with_kvcache to accept the learnable_sink tensor.
  • Backward Pass: Implemented the gradient calculation for learnable_sink, enabling end-to-end training of the sink parameters.
  • C++/CUDA Bindings: Updated the mha_fwd and mha_bwd bindings in hopper/flash_api.cpp to handle the new learnable_sink and dsink tensors.
  • Compatibility: The implementation is consistent with the sink attention support in FlashAttention-2.
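
The sink gradient mentioned in the backward-pass item follows from the standard softmax derivative: for each query row, the contribution is `-p_sink * (dO · O)`, summed over all rows of a head. The sketch below shows this in pure Python for one head. It is an assumption-laden illustration, not the kernel's reduction strategy; `attend_row` and `dsink_for_head` are hypothetical names, and it assumes the sink appears only in the softmax denominator.

```python
import math


def attend_row(scores, values, sink):
    """Return (output, p_sink) for one query row with a sink logit."""
    m = max(max(scores), sink)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    sink_exp = math.exp(sink - m)
    z = sum(exps) + sink_exp
    dim = len(values[0])
    out = [sum(e / z * v[d] for e, v in zip(exps, values)) for d in range(dim)]
    return out, sink_exp / z


def dsink_for_head(score_rows, values, dout_rows, sink):
    """Gradient of the loss w.r.t. one head's sink logit.

    Per row: d p_j / d sink = -p_j * p_sink, hence
    dL/dsink = -p_sink * sum_j p_j * (dO . v_j) = -p_sink * (dO . O).
    """
    total = 0.0
    for scores, dout in zip(score_rows, dout_rows):
        out, p_sink = attend_row(scores, values, sink)
        delta = sum(o * g for o, g in zip(out, dout))  # dO . O for this row
        total -= p_sink * delta
    return total
```

A finite-difference check on the scalar loss `sum(dO · O)` is a simple way to validate this formula against a reference forward pass.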

Testing:

  • Added comprehensive tests for sink attention in hopper/test_flash_attn.py.
  • Updated the reference attention implementation in hopper/test_util.py to include sink attention logic.
  • Verified both forward and backward passes across various configurations (MHA, MQA, GQA, causal, varlen, KV cache, etc.).
  • All sink attention tests pass.

I have already added the tests to tests/ and confirmed they pass.

aoxy force-pushed the feature/hopper_attention_with_sink branch from 9df7964 to 1b090a4 on March 4, 2026 07:41
