Why not use ScatterMoE as the fused MoE kernel? #548
serdardoesml
started this conversation in General
The dev log mentions: "What's really needed is a fused 'FlashMoE' kernel that handles routing + expert dispatch + matmul in one shot (like FlashAttention did for attention), with all the needed features. This doesn't exist yet."

I think ScatterMoE should be able to handle all of those requirements. It's also quite lightweight and could be included and tinkered with directly in the nanochat repository. What requirement is needed that ScatterMoE does not fulfill, or won't fulfill with very light tinkering?

https://github.com/shawntan/scattermoe
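For context, a minimal sketch of how ScatterMoE is typically dropped into a top-k-routed MoE block, assuming the `scattermoe.mlp.MLP` interface shown in the repo's README (argument names and expected input shapes may differ between versions, so treat this as approximate rather than as nanochat code):

```python
# Sketch only: routing stays in plain PyTorch, while expert dispatch and the
# grouped expert matmuls are delegated to ScatterMoE's kernels.
import torch
import torch.nn as nn
from scattermoe.mlp import MLP  # fused grouped-expert MLP from the linked repo


class ScatterMoEBlock(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int):
        super().__init__()
        # Plain linear router producing one logit per expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # ScatterMoE handles dispatch + grouped matmuls internally, so there is
        # no explicit per-expert loop or token padding here. The constructor
        # arguments follow the README example and may need adjusting.
        self.experts = MLP(
            input_size=d_model,
            hidden_size=d_hidden,
            activation=nn.GELU(),
            num_experts=num_experts,
            top_k=top_k,
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); depending on the version, inputs may need
        # to be flattened to (tokens, d_model) before the expert call.
        logits = self.router(x)                                 # (B, T, E)
        weights, idxs = torch.topk(logits, self.top_k, dim=-1)  # (B, T, k)
        weights = torch.softmax(weights, dim=-1)                # normalize over chosen experts
        # ScatterMoE consumes the routing decision directly and returns the
        # weighted combination of the selected experts' outputs.
        return self.experts(x, weights, idxs)
```

The appeal is that the per-expert Python loop and padding of a naive MoE implementation disappear: only the routing logic stays in eager PyTorch, and the scatter/gather plus expert matmuls run inside ScatterMoE's fused kernels.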
Replies: 1 comment

The reason I'm curious is that I'm using ScatterMoE for some of my own training experiments at a smaller scale than nanochat, and for me the overhead is small enough that MoE became a free lunch (less train time + higher CORE score). I don't really have the compute to try it out with nanochat, though.