Pretrained 100M-parameter GLA (Gated Linear Attention) model with low-rank parameterization
(rank setting: rfull) on FineWeb-Edu. Part of a 16-cell ablation
(4 architectures × 4 ranks: r8, r32, r64, rfull) studying whether constraining
the q/k/v fast-weight projections (or LaCT's SwiGLU MLP) to low rank can match
or exceed full-rank performance.
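To make the rank settings concrete, the sketch below counts weights in a single projection under each rank. It assumes the common two-factor form (a `d_in x rank` matrix followed by a `rank x d_out` matrix) and treats rfull as an ordinary dense projection; neither detail is confirmed by this card. `D_MODEL = 768` is a hypothetical width chosen only for illustration.

```python
# Rough parameter count for one projection, dense vs. two-factor low rank.
# Assumptions (not from the model card): two-factor factorization, no bias,
# and a hypothetical model width of 768.

D_MODEL = 768  # hypothetical width, NOT taken from this model card


def full_rank_params(d_in: int, d_out: int) -> int:
    """Weights in a dense d_in x d_out projection (bias omitted)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights when the projection is factored as (d_in x rank) @ (rank x d_out)."""
    return rank * (d_in + d_out)


def projection_param_table(d_model: int = D_MODEL, ranks=(8, 32, 64)) -> dict:
    """Map each rank label used in the ablation to a per-projection weight count."""
    table = {f"r{r}": low_rank_params(d_model, d_model, r) for r in ranks}
    table["rfull"] = full_rank_params(d_model, d_model)
    return table


if __name__ == "__main__":
    for name, n in projection_param_table().items():
        print(f"{name:>5}: {n:,} weights per projection")
```

Under these assumptions, the factored form is smaller than the dense one whenever `rank < d_in * d_out / (d_in + d_out)`, which is half the width for a square projection.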
| Setting | Value |
|---|---|
| Architecture | GLA (Gated Linear Attention) |
| Rank | rfull |
| Params | ~100M |
| Dataset | HuggingFaceFW/fineweb-edu (streaming) |
| Steps | 5000 |
| Effective batch | 256 |
| Sequence length | 8000 |
| Optimizer | AdamW (lr=3e-4, eps=1e-15) |
| LR schedule | Cosine, 256-step warmup, decay to 10% |
| Precision | bf16 |
| Activation checkpointing | selective (option 1) |
| Tokens | ~10.24 B |
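The token budget and learning-rate schedule in the table can be sanity-checked with a short sketch. It assumes the effective batch is counted in sequences, that warmup is linear, and that "decay to 10%" means a floor of 10% of the peak learning rate; the actual training script may differ on any of these points.

```python
import math

# Values copied from the table above.
STEPS = 5000
EFFECTIVE_BATCH = 256      # assumed to be sequences per optimizer step
SEQ_LEN = 8000
PEAK_LR = 3e-4
WARMUP_STEPS = 256
FINAL_LR_FRACTION = 0.10   # assumed to be a fraction of PEAK_LR


def total_tokens(steps: int = STEPS, batch: int = EFFECTIVE_BATCH,
                 seq_len: int = SEQ_LEN) -> int:
    """Tokens seen over the run: steps x sequences per step x tokens per sequence."""
    return steps * batch * seq_len


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed step: linear warmup (assumed), then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / max(1, STEPS - WARMUP_STEPS), 1.0)
    floor = PEAK_LR * FINAL_LR_FRACTION
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + (PEAK_LR - floor) * cosine


if __name__ == "__main__":
    print(f"total tokens: {total_tokens() / 1e9:.2f}B")
    for s in (0, 255, 256, 2500, 5000):
        print(f"step {s:>4}: lr = {lr_at(s):.3e}")
```

The product 5000 × 256 × 8000 gives 10.24B tokens, which agrees with the Tokens row.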
Code: see run_main_100M.sh.
22.11
K=4: 0.131, K=16: 0.700, K=64: 0.859, K=256: 0.896
nlproj) with repo names matching the local dump folder, e.g.
nlproj/gla_100M_{r8|r32|r64|rfull}_bs256_lr3e-4_steps5000.
Run name: gla_100M_rfull_bs256_lr3e-4_steps5000
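The repo naming pattern above can be expanded programmatically, for example to loop over the four GLA checkpoints. This is only a sketch: the helper names `run_name` and `repo_id` are hypothetical, and only the `gla` prefix is confirmed by this card, so prefixes for the other three architectures are left to the caller.

```python
# Expand the naming pattern into concrete run names and repo ids.
# Helper names are hypothetical; the pattern itself comes from the card.

RANKS = ("r8", "r32", "r64", "rfull")


def run_name(arch: str, rank: str) -> str:
    """Run name following <arch>_100M_<rank>_bs256_lr3e-4_steps5000."""
    return f"{arch}_100M_{rank}_bs256_lr3e-4_steps5000"


def repo_id(arch: str, rank: str, org: str = "nlproj") -> str:
    """Repo id of the form <org>/<run name>."""
    return f"{org}/{run_name(arch, rank)}"


gla_repos = [repo_id("gla", rank) for rank in RANKS]

if __name__ == "__main__":
    for name in gla_repos:
        print(name)
```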