Guokai Ma

delock

AI & ML interests

None yet

Recent Activity

commented on an article 2 months ago

Muon vs MuonClip vs Muon+AdamW for Fine-Tuning

new activity 3 months ago

moonshotai/Moonlight-16B-A3B:fix(modeling): add training-path MoE dispatch and KV cache API compat

updated a model 3 months ago

delock/Moonlight-16B-A3B-finetune-fixed

Organizations

None yet

commented on Muon vs MuonClip vs Muon+AdamW for Fine-Tuning 2 months ago

Hi, I see a gradient norm curve comparison between Adam and the Muon hybrid. Do you also have an evaluation loss curve? Is it expected for the Muon optimizer to have a better loss curve than the Adam optimizer? I'd like to hear your insights on this, thanks!

New activity in moonshotai/Moonlight-16B-A3B 3 months ago

fix(modeling): add training-path MoE dispatch and KV cache API compat

#9 opened 3 months ago by delock

updated a model 3 months ago

delock/Moonlight-16B-A3B-finetune-fixed

Updated Apr 13

published a model 3 months ago

delock/Moonlight-16B-A3B-finetune-fixed

Updated Apr 13

New activity in microsoft/Phi-3-small-128k-instruct almost 2 years ago

Move flash_attn assert from __init__ into calling func

👍 1

#32 opened almost 2 years ago by rogerxfeng8

liked a model over 2 years ago

Qwen/Qwen-14B-Chat

Text Generation • 14B • Updated Dec 13, 2023 • 4.69k • 373
