NEO-unify: Building Native Multimodal Unified Models End to End
Same, but 40k on hardware and then train hard
Could sell my goats and get some API subscriptions.
Final recipe locked: Qwen3-MoE with 3 experts and top-1 routing; vocab 262144 (Gemma 3 SentencePiece, per-digit input splitting); GQA 3:1; Muon for the hidden 2D weight matrices and AdamW for embeddings and router; WSD schedule with sqrt cooldown; beta2 ramped from 0.95 to 0.97; z-loss 1e-4, now actually in the gradient graph (the last build wrapped it in no_grad, which silently disabled it); Qwen3-style aux load-balancing loss with coefficient 0.001; and an expert-load monitor that warns on expert starvation. Three phases: 8K pretrain, then 32K continued pretrain, then 8K SFT.
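A minimal sketch of the router-side pieces of that recipe: the z-loss, a Switch-style top-1 aux load-balancing loss, the expert-load starvation monitor, and a WSD learning-rate curve with sqrt cooldown. Function names, the warmup/cooldown fractions, the starvation floor, and the exact cooldown shape are my assumptions, not from the post; the coefficients (1e-4, 0.001) are from the recipe. Shown in NumPy for clarity, with the autograd caveat as a comment.

```python
import numpy as np

def router_losses(logits, z_coef=1e-4, aux_coef=1e-3):
    """Top-1 MoE router losses. logits: [tokens, experts].

    NOTE: in the real training build these terms must live inside the
    autograd graph (no no_grad wrapper) -- wrapping them in no_grad
    silently zeroes their contribution, the bug described above.
    """
    n_tok, n_exp = logits.shape
    # z-loss: penalize the squared logsumexp of the router logits,
    # keeping them from drifting to large magnitudes.
    m = logits.max(axis=1, keepdims=True)
    lse = (np.log(np.exp(logits - m).sum(axis=1, keepdims=True)) + m).squeeze(1)
    z_loss = z_coef * float(np.mean(lse ** 2))
    # Softmax probabilities and hard top-1 assignment.
    probs = np.exp(logits - m)
    probs /= probs.sum(axis=1, keepdims=True)
    top1 = probs.argmax(axis=1)
    load = np.bincount(top1, minlength=n_exp) / n_tok   # f_i: token fraction per expert
    mean_prob = probs.mean(axis=0)                      # P_i: mean router prob per expert
    # Switch-style balance loss: n_exp * sum_i f_i * P_i (== aux_coef at perfect balance).
    aux_loss = aux_coef * n_exp * float(load @ mean_prob)
    return z_loss, aux_loss, load

def starved_experts(load, floor=0.05):
    """Expert-load monitor: return experts receiving < floor of tokens."""
    return [i for i, f in enumerate(load) if f < floor]

def wsd_lr(step, total, base_lr, warmup_frac=0.02, cooldown_frac=0.2):
    """WSD schedule: linear warmup, constant plateau, sqrt-shaped cooldown to 0."""
    w, c = int(total * warmup_frac), int(total * cooldown_frac)
    if step < w:
        return base_lr * step / max(w, 1)
    if step < total - c:
        return base_lr
    t = (step - (total - c)) / c
    return base_lr * (1.0 - np.sqrt(t))
```

The aux loss is minimized at uniform load, so it pushes against exactly the collapse the starvation monitor is watching for; the monitor is just the cheap early-warning signal on top of it.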
I might. Join https://discord.gg/vaEquJ6UJT if you need to contact me.