Hi, I see the gradient norm curve comparison between Adam and the Muon hybrid. Do you also have an evaluation loss curve? Is the Muon optimizer expected to have a better loss curve than the Adam optimizer? I'd like to hear your insights on this, thanks!
Guokai Ma
delock
AI & ML interests
None yet
Recent Activity
commented on an article 13 days ago
Muon vs MuonClip vs Muon+AdamW for Fine-Tuning
new activity about 1 month ago
moonshotai/Moonlight-16B-A3B: fix(modeling): add training-path MoE dispatch and KV cache API compat
updated a model about 1 month ago
delock/Moonlight-16B-A3B-finetune-fixed
Organizations
None yet