Why not use ScatterMoE as the fused MoE kernel? #548
serdardoesml
started this conversation in General
The dev log mentions: "What's really needed is a fused 'FlashMoE' kernel that handles routing + expert dispatch + matmul in one shot (like FlashAttention did for attention), with all the needed features. This doesn't exist yet."

I think ScatterMoE should be able to handle all of those requirements. It's also quite lightweight and could be included and tinkered with directly in the nanochat repository. What requirement is needed that ScatterMoE does not fulfill, or won't fulfill with very light tinkering?

https://github.com/shawntan/scattermoe
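For context, a minimal sketch of how ScatterMoE is typically dropped into a top-k-routed MoE block, assuming the `scattermoe.mlp.MLP` interface shown in the repo's README (argument names and expected input shapes may differ between versions, so treat this as approximate rather than as nanochat code):

```python
# Sketch only: routing stays in plain PyTorch, while expert dispatch and the
# grouped expert matmuls are delegated to ScatterMoE's kernels.
import torch
import torch.nn as nn
from scattermoe.mlp import MLP  # fused grouped-expert MLP from the linked repo


class ScatterMoEBlock(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int):
        super().__init__()
        # Plain linear router producing one logit per expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # ScatterMoE handles dispatch + grouped matmuls internally, so there is
        # no explicit per-expert loop or token padding here. The constructor
        # arguments follow the README example and may need adjusting.
        self.experts = MLP(
            input_size=d_model,
            hidden_size=d_hidden,
            activation=nn.GELU(),
            num_experts=num_experts,
            top_k=top_k,
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); depending on the version, inputs may need
        # to be flattened to (tokens, d_model) before the expert call.
        logits = self.router(x)                                 # (B, T, E)
        weights, idxs = torch.topk(logits, self.top_k, dim=-1)  # (B, T, k)
        weights = torch.softmax(weights, dim=-1)                # normalize over chosen experts
        # ScatterMoE consumes the routing decision directly and returns the
        # weighted combination of the selected experts' outputs.
        return self.experts(x, weights, idxs)
```

The appeal is that the per-expert Python loop and padding of a naive MoE implementation disappear: only the routing logic stays in eager PyTorch, and the scatter/gather plus expert matmuls run inside ScatterMoE's fused kernels.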
Replies: 1 comment

The reason I'm curious is that I'm using ScatterMoE for some of my own training experiments at a smaller scale than nanochat, and for me the overhead is small enough that MoE became a free lunch (less train time + higher CORE score). I don't really have the compute to try it out with nanochat, though.