Hi authors,
I recently read your paper (arXiv:2502.07529v2), and it's an awesome piece of work! The idea of unifying optimizers like Muon and SignSGD under a single LMO (linear minimization oracle) framework is super cool.
In the paper, you contrast your a priori approach of fixing a norm up front with on-the-fly methods like Adam. This got me thinking, and I'd love to hear your thoughts: is it possible to also understand the Adam optimizer from a norm-based perspective?
I understand that Adam's update step size is variable, which is different from the fixed-norm output of an LMO. However, the operation in Adam that divides by the root of the accumulated squared gradients ($\sqrt{v_t}$) feels like it's performing some kind of geometric normalization, which might have an intrinsic connection to norms.
More specifically, I was wondering:
- Could Adam be viewed as using a "dynamic norm ball," where the ball itself changes at each step based on the gradient history?
- I found the connection you drew between weight decay and norm constraints very insightful, and it immediately made me think of AdamW. You also cited the paper linking AdamW to the $l_\infty$ norm, which seems to suggest a real connection here. Do you think this could be a path to fitting Adam into your framework?
- Alternatively, is there a fundamental reason why this idea wouldn't work? For instance, is the key obstacle to unification that the LMO output is scale-invariant, while Adam's update is sensitive to the gradient's scale?
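To make the normalization intuition concrete, here's a tiny sketch (my own, not from the paper) of Adam's per-coordinate update. In the degenerate case $\beta_1 = \beta_2 = 0$ and small $\epsilon$, the step $g/\sqrt{v}$ collapses to $\mathrm{sign}(g)$, i.e. exactly the SignSGD / $l_\infty$-LMO direction, which is what makes me suspect the connection is more than a coincidence:

```python
import numpy as np

def adam_step(g, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update direction (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2       # second-moment EMA
    return m / (np.sqrt(v) + eps), m, v      # normalized step

g = np.array([0.5, -2.0, 0.01])

# With no history (beta1 = beta2 = 0), the step is g / (|g| + eps),
# i.e. elementwise approximately sign(g) -- the l_inf LMO output.
step, _, _ = adam_step(g, np.zeros(3), np.zeros(3), beta1=0.0, beta2=0.0)
assert np.allclose(step, np.sign(g), atol=1e-3)
```

With nonzero $\beta_2$, the denominator $\sqrt{v_t}$ mixes in past gradients, which is what made me wonder whether the norm ball should be thought of as history-dependent rather than fixed.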
Sorry for the long question; I'm just a student who is very interested in optimizers, and your paper has been very inspiring.
Thanks for any thoughts you might have and for the great work!
Best regards.