GLA (Gated Linear Attention) 100M (full rank) — Low-rank Fast-Weight Ablation

A 100M-parameter GLA (Gated Linear Attention) model pretrained on FineWeb-Edu in the full-rank (rfull) setting, i.e. the unconstrained end of the rank sweep. It is part of a 16-cell ablation (4 architectures × 4 ranks: r8, r32, r64, rfull) that studies whether constraining the q/k/v fast-weight projections (or, for LaCT, the SwiGLU MLP) to low rank can match or exceed full-rank performance.

Training


Architecture	GLA (Gated Linear Attention)
Rank	`rfull`
Params	~100M
Dataset	`HuggingFaceFW/fineweb-edu` (streaming)
Steps	5000
Effective batch	256
Sequence length	8000
Optimizer	AdamW (lr=3e-4, eps=1e-15)
LR schedule	Cosine, 256-step warmup, decay to 10%
Precision	bf16
Activation checkpointing	selective (option 1)
Tokens	~10.24 B
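
The token count in the table follows from the other rows. A quick arithmetic check, assuming the effective batch is counted in sequences of the listed length (the variable names are illustrative, not taken from the training script):

```python
# Values copied from the Training table above.
steps = 5000
effective_batch = 256  # assumption: sequences per optimizer step
seq_len = 8000         # tokens per sequence

tokens_per_step = effective_batch * seq_len
total_tokens = steps * tokens_per_step

print(f"{tokens_per_step:,} tokens/step, {total_tokens / 1e9:.2f} B tokens total")
# prints: 2,048,000 tokens/step, 10.24 B tokens total
```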

Code: see run_main_100M.sh.
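
For readers without access to the script, the LR schedule row can be read as the sketch below. It is not taken from run_main_100M.sh. It assumes a linear warmup to the peak LR over 256 steps, followed by a cosine decay that ends at 10% of the peak at step 5000; the function name `lr_at` and the exact warmup shape are illustrative assumptions.

```python
import math

PEAK_LR = 3e-4       # from the Optimizer row
WARMUP_STEPS = 256   # from the LR schedule row
TOTAL_STEPS = 5000   # from the Steps row
MIN_LR_RATIO = 0.10  # "decay to 10%", read as 10% of the peak LR


def lr_at(step: int) -> float:
    """Hypothetical schedule: linear warmup, then cosine decay to a floor."""
    if step < WARMUP_STEPS:
        # Assumption: warmup is linear and reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


for s in (0, 255, 256, 2628, 5000):
    print(s, f"{lr_at(s):.3e}")
```

Under these assumptions the LR peaks at 3e-4 at the end of warmup and finishes at about 3e-5.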

Eval results

FineWeb-Edu val PPL: 22.11
MQAR (multi-query associative recall):
- K=4: 0.131
- K=16: 0.700
- K=64: 0.859
- K=256: 0.896
LAMBADA acc: 0.128
HellaSwag acc_norm: 0.287
ARC-Easy acc_norm: 0.423
PIQA acc_norm: 0.597
WinoGrande acc: 0.504

Notes

This model is one of the 16 cells. The other rank/architecture combinations are uploaded under the same Hugging Face org (nlproj), with repo names that match the local dump folder names, e.g. nlproj/gla_100M_{r8|r32|r64|rfull}_bs256_lr3e-4_steps5000.
Key finding of the ablation: at this scale, low rank often matches or beats full rank on downstream tasks, consistent with the LoRA-style hypothesis that adaptation is intrinsically low-rank. GatedDeltaNet is the exception: its rfull cell is the strongest in the whole sweep on PPL, LAMBADA, HellaSwag, and ARC-Easy.
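
The repo naming pattern can be expanded programmatically. A minimal sketch for the four GLA cells only; the helper `repo_id` is hypothetical, and the prefixes used for the other three architectures (and whether they share the same hyperparameter suffix) are not stated in this card, so they are left out:

```python
ORG = "nlproj"
RANKS = ["r8", "r32", "r64", "rfull"]


def repo_id(arch: str, rank: str) -> str:
    """Build a repo ID following the naming pattern quoted in the Notes."""
    return f"{ORG}/{arch}_100M_{rank}_bs256_lr3e-4_steps5000"


# Only the "gla" prefix is confirmed by this card.
gla_repos = [repo_id("gla", rank) for rank in RANKS]
for name in gla_repos:
    print(name)
```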

Run name: gla_100M_rfull_bs256_lr3e-4_steps5000

Checkpoint: Safetensors, 0.1B params, tensor type F32
