Per-step record (every 500 steps, 40 checkpoints) of the 0.6B Qwen3 softmax-attention baseline, trained with AdamW and a WSD (warmup-stable-decay) learning-rate schedule on 80B tokens.
Yifei Zuo
YifeiZuo
AI & ML interests
None yet
Recent Activity
authored a paper about 10 hours ago
Parallax: Parameterized Local Linear Attention for Language Modeling
upvoted a paper about 15 hours ago
Parallax: Parameterized Local Linear Attention for Language Modeling
submitted a paper about 16 hours ago
Parallax: Parameterized Local Linear Attention for Language Modeling
Organizations
None yet