NeuroMamba v5-NoFastWeight -- FineWeb-Edu Validation
Architecture
A 5:1 hybrid of sliding-window local attention and sparse global attention, inspired by Gemma 4 and Jamba.
- 12 layers: 10x LocalBlock (sliding window GQA, w=128) + 2x GlobalBlock (full causal GQA, unified KV)
- GQA: 6 query heads / 2 KV heads (3:1 compression)
- SwiGLU FFN + RoPE + weight-tied lm_head
- d_model=312, d_ffn=896, ~29M params
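
As a rough illustration of the layout above, the sketch below builds a 10-local / 2-global layer pattern, the attention rule each block type applies, and a back-of-envelope parameter count. Several details are assumptions made for illustration and are not stated in this card: the position of the two GlobalBlocks (here, closing each group of six layers), the window convention (the current token counts toward w=128), bias-free projections, and a 50,257-entry vocabulary. The function names are hypothetical.

```python
# Hyperparameters taken from the architecture list above.
N_LAYERS, D_MODEL, D_FFN = 12, 312, 896
N_Q_HEADS, N_KV_HEADS, WINDOW = 6, 2, 128
HEAD_DIM = D_MODEL // N_Q_HEADS  # 312 / 6 = 52


def layer_pattern(n_layers=N_LAYERS, period=6):
    """ASSUMPTION: one GlobalBlock closes each group of `period` layers."""
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]


def can_attend(q_pos, k_pos, kind, window=WINDOW):
    """Return True if the query at q_pos may attend to the key at k_pos."""
    if k_pos > q_pos:  # causal: never look ahead
        return False
    if kind == "global":  # full causal attention
        return True
    # ASSUMPTION: the window of `window` tokens includes the current token.
    return q_pos - k_pos < window


def estimate_params(vocab_size=50257):
    """Rough count. ASSUMPTIONS: GPT-2-sized vocab, no biases, norms ignored."""
    kv_dim = N_KV_HEADS * HEAD_DIM                        # 2 * 52 = 104
    attn = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * kv_dim   # Q and O, then K and V
    ffn = 3 * D_MODEL * D_FFN                             # SwiGLU: gate, up, down
    embed = vocab_size * D_MODEL                          # shared with the tied lm_head
    return N_LAYERS * (attn + ffn) + embed


if __name__ == "__main__":
    print(layer_pattern())
    print(f"{estimate_params() / 1e6:.1f}M parameters (estimate)")
```

Under these assumptions the estimate comes to about 28.9M, in line with the ~29M quoted above; a different vocabulary size would shift the embedding term.
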
Results vs Baseline
| Model | Params | Best eval_loss | Perplexity | vs Baseline |
|---|---|---|---|---|
| GPT-2 style baseline | 30M | 4.3645 | 78.6 | -- |
| NeuroMamba v5 | 29M | 4.4426 | 85.0 | +0.0781 (worse than baseline) |
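
The perplexity column follows from the eval loss if that loss is the mean per-token cross-entropy in nats, which is the usual convention and is assumed here. A minimal sketch that reproduces the table values (the helper name `perplexity` is hypothetical):

```python
import math


def perplexity(eval_loss):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(eval_loss)


baseline_loss = 4.3645  # GPT-2 style baseline, from the table
model_loss = 4.4426     # NeuroMamba v5, from the table

print(round(perplexity(baseline_loss), 1))  # 78.6
print(round(perplexity(model_loss), 1))     # 85.0
print(round(model_loss - baseline_loss, 4)) # 0.0781
```

Because perplexity is exponential in the loss, the 0.0781 gap in eval loss corresponds to roughly 8% higher perplexity than the baseline.
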
Training
- Dataset: FineWeb-Edu 10BT sample
- Tokens: 600M of 600M planned
- Hardware: 1x NVIDIA L4 GPU
Training curve
| Step | Train loss | Eval loss | Tokens |
|---|---|---|---|
| 500 | 8.0368 | 6.3743 | 16M |
| 1,000 | 5.9692 | 5.6882 | 33M |
| 1,500 | 5.5035 | 5.4098 | 49M |
| 2,000 | 5.3346 | 5.2246 | 66M |
| 2,500 | 5.1988 | 5.0980 | 82M |
| 3,000 | 5.0113 | 5.0130 | 98M |
| 3,500 | 4.9781 | 4.9107 | 115M |
| 4,000 | 4.9170 | 4.8307 | 131M |
| 4,500 | 4.7849 | 4.7754 | 147M |
| 5,000 | 4.7676 | 4.7213 | 164M |
| 5,500 | 4.7555 | 4.6904 | 180M |
| 6,000 | 4.6316 | 4.6546 | 197M |
| 6,500 | 4.6721 | 4.6264 | 213M |
| 7,000 | 4.6754 | 4.6041 | 229M |
| 7,500 | 4.5705 | 4.5837 | 246M |
| 8,000 | 4.5953 | 4.5623 | 262M |
| 8,500 | 4.6099 | 4.5519 | 279M |
| 9,000 | 4.5118 | 4.5389 | 295M |
| 9,500 | 4.5603 | 4.5234 | 311M |
| 10,000 | 4.5763 | 4.5162 | 328M |
| 10,500 | 4.4792 | 4.5065 | 344M |
| 11,000 | 4.5335 | 4.4955 | 360M |
| 11,500 | 4.5467 | 4.4885 | 377M |
| 12,000 | 4.4512 | 4.4827 | 393M |
| 12,500 | 4.5102 | 4.4778 | 410M |
| 13,000 | 4.5238 | 4.4703 | 426M |
| 13,500 | 4.4315 | 4.4659 | 442M |
| 14,000 | 4.5033 | 4.4602 | 459M |
| 14,500 | 4.5134 | 4.4567 | 475M |
| 15,000 | 4.4316 | 4.4535 | 492M |
| 15,500 | 4.4869 | 4.4502 | 508M |
| 16,000 | 4.4946 | 4.4484 | 524M |
| 16,500 | 4.4123 | 4.4462 | 541M |
| 17,000 | 4.4870 | 4.4441 | 557M |
| 17,500 | 4.4845 | 4.4433 | 573M |
| 18,000 | 4.3965 | 4.4426 | 590M |
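
The Tokens column is consistent with a fixed 32,768 tokens per optimizer step. That figure is inferred from the table rather than stated in this card, so the sketch below treats it as an assumption; how it splits into batch size and sequence length is not known.

```python
TOKENS_PER_STEP = 32_768  # ASSUMPTION: inferred from the Tokens column, not stated


def tokens_seen(step, tokens_per_step=TOKENS_PER_STEP):
    """Cumulative training tokens after `step` optimizer steps."""
    return step * tokens_per_step


def as_millions(n_tokens):
    """Format a token count the way the table does, e.g. 16M."""
    return f"{round(n_tokens / 1e6)}M"


# Spot-check a few rows of the training curve.
for step, expected in {500: "16M", 9_000: "295M", 18_000: "590M"}.items():
    assert as_millions(tokens_seen(step)) == expected

# Steps needed to exhaust a 600M-token budget under the same assumption.
steps_for_budget = 600_000_000 / TOKENS_PER_STEP
print(f"about {steps_for_budget:.0f} steps")
```

Under this assumption a 600M-token budget ends shortly after step 18,000, so the last row of the table would be the final logged evaluation.
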