Laguna-XS.2 — Fully-NVFP4 Attention + a Quantization Study
Poolside Research Hackathon submission (Foundations / Quantization track).
Submitted by: Kannappan Sirchabesan (@kannappans) ·
companion to the primary submission patchrecoverygym-laguna.
We asked a simple question about the released Laguna-XS.2-NVFP4: it quantizes the
256 MoE experts to NVFP4 but leaves all attention in bf16 — is that untapped
optimization headroom, or a deliberate choice? We built the missing quantization end
to end on an NVIDIA B200 and measured it. Answer: it's the right choice — and
we explain why, with data.
What's here
- A hand-built fully-NVFP4-attention Laguna (q/k/v/o → NVFP4, experts NVFP4) that serves in vLLM — including solving the qkv-fusion quantized-scale loading, which no reference checkpoint demonstrates.
- A reproducible quantization study (latency, GEMM microbench, compression frontier) with all scripts.
Key results (B200, decode latency, 20×256 tok)
| Variant | decode tok/s | vs baseline | quality |
|---|---|---|---|
| baseline (experts NVFP4, attn bf16) | 295.4 | — | reference |
| attn FP8 W8A8 (o_proj) | 213.5 | −28% | act-quant overhead |
| attn NVFP4 (o_proj) | 271.5 | −8% | — |
| attn NVFP4 (full q/k/v/o) | 278.9 | −5.6% | healthy (256 tok) |
Finding 1 — quantizing attention doesn't speed up decode. The clean progression (FP8 −28% → NVFP4 o_proj −8% → NVFP4 full −5.6%) shows even the optimal scheme never crosses positive: at decode (bandwidth/overhead-bound), per-layer quant overhead + attention's modest share of per-token time outweigh the saving. This empirically explains why Poolside left attention in bf16.
Finding 2 — 4-bit is Laguna's practical floor. Compression frontier (reconstruction error vs size):
| scheme | recon. error | size |
|---|---|---|
| 4-bit NVFP4 | 8.6% | 21.6 GB |
| 2-bit (data-free) | 60% | 13.7 GB |
| 1.58-bit ternary (data-free) | 44% | 12.1 GB |
The ~40% size headroom is real but unreachable today — data-free sub-4-bit destroys
the weights (needs calibration), and vLLM has no sub-4-bit kernel. The blockers are
ecosystem gaps, not the math: usable 2-bit needs GPTQ/AWQ calibration, which is
blocked because Laguna's remote code requires transformers 5.x while the entire quant
stack (llmcompressor) pins transformers ≤4.57.6.
Model weights & quality
The fully-NVFP4-attention weights are included in this repo (config is
nvfp4-pack-quantized over experts + attention; serve with vLLM). The variant
generates healthy full-length output (256 tokens), matching the baseline.
On the strict raw recovery-bench harness, single-attempt base Laguna scores 0/6
— at floor (which is exactly why the companion primary submission
patchrecoverygym-laguna
uses pass@k + apply-aware selection to reach pass@8 0.833). Because the task harness
is at floor for every variant, quantization fidelity here is better captured by
reconstruction error (8.6% for NVFP4) than by task pass-rate.
Why this matters
A concrete, reproducible map of exactly what's missing to push Laguna past 4-bit (a transformers-5.x-compatible calibration path + a sub-4-bit vLLM kernel) — and proof that the fully-NVFP4-attention variant is the most aggressive viable Laguna today.
Reproduce
All scripts are included: quant_attn_nvfp4.py (build), serve_and_bench.sh,
attn_precision_microbench.py, quant_frontier.py, quality_eval.sh. Built with
compressed_tensors' own NVFP4 packing (correct by construction), validated at 8.6%
reconstruction error.
Honest non-claims
- No decode speedup (we explain why it's not achievable, which is the contribution).
- Not retrained; post-hoc quantization of the released checkpoint.
- Sub-4-bit shown only as a fidelity/size frontier (not servable on current vLLM).
- Downloads last month
- 36