Laguna-XS.2 — Fully-NVFP4 Attention + a Quantization Study

Poolside Research Hackathon submission (Foundations / Quantization track).

Submitted by: Kannappan Sirchabesan (@kannappans) · companion to the primary submission patchrecoverygym-laguna.

We asked a simple question about the released Laguna-XS.2-NVFP4: it quantizes the 256 MoE experts to NVFP4 but leaves all attention in bf16. Is that untapped optimization headroom, or a deliberate choice? We built the missing attention quantization end to end on an NVIDIA B200 and measured it. Answer: leaving attention in bf16 is the right choice, and we explain why, with data.

What's here

  • A hand-built fully-NVFP4-attention Laguna (q/k/v/o → NVFP4, experts NVFP4) that serves in vLLM. This includes loading the quantized scales under qkv fusion, which no reference checkpoint demonstrates.
  • A reproducible quantization study (latency, GEMM microbench, compression frontier) with all scripts.

Key results (B200, decode latency, 20×256 tok)

| Variant | decode tok/s | vs baseline | quality / notes |
|---|---|---|---|
| baseline (experts NVFP4, attn bf16) | 295.4 | reference | |
| attn FP8 W8A8 (o_proj) | 213.5 | −28% | act-quant overhead |
| attn NVFP4 (o_proj) | 271.5 | −8% | |
| attn NVFP4 (full q/k/v/o) | 278.9 | −5.6% | healthy (256 tok) |
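
The "vs baseline" figures can be re-derived from the tok/s column. The short sketch below does only that arithmetic; the dictionary and function names are illustrative and are not taken from the repo's scripts.

```python
# Decode throughput (tok/s) copied from the key-results table.
TOK_PER_S = {
    "baseline (experts NVFP4, attn bf16)": 295.4,
    "attn FP8 W8A8 (o_proj)": 213.5,
    "attn NVFP4 (o_proj)": 271.5,
    "attn NVFP4 (full q/k/v/o)": 278.9,
}


def relative_change(variant_tps, baseline_tps):
    """Signed fractional change in throughput relative to the baseline."""
    return variant_tps / baseline_tps - 1.0


baseline = TOK_PER_S["baseline (experts NVFP4, attn bf16)"]
deltas = {name: relative_change(tps, baseline) for name, tps in TOK_PER_S.items()}

for name, delta in deltas.items():
    # The baseline row prints +0.0% by construction.
    print(f"{name:40s} {delta:+.1%}")
```

Every non-baseline variant comes out negative, which is the basis for Finding 1.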

Finding 1: quantizing attention doesn't speed up decode. The clean progression (FP8 −28% → NVFP4 o_proj −8% → NVFP4 full −5.6%) shows that even the best scheme we tested never yields a net speedup. Decode is bandwidth- and overhead-bound, so the per-layer quantization overhead, combined with attention's modest share of per-token time, outweighs the saving from smaller weights. This gives an empirical reason for Poolside's choice to leave attention in bf16.
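
One way to see why the sign stays negative is a simple Amdahl-style time model. The sketch below is an illustration only: the model, the function names, and especially the input numbers are assumptions made for this example, not measurements from the study.

```python
def decode_throughput_ratio(attn_share, gemm_speedup, quant_overhead):
    """Throughput relative to baseline (1.0 = unchanged) under a toy time model.

    Baseline per-token time is normalised to 1.0:
      attn_share      fraction of that time spent in the quantized projections
      gemm_speedup    factor by which those projections get faster (>1 is faster)
      quant_overhead  extra time added by activation quantization, expressed
                      as a fraction of the baseline per-token time
    """
    new_time = (1.0 - attn_share) + attn_share / gemm_speedup + quant_overhead
    return 1.0 / new_time


def break_even_overhead(attn_share, gemm_speedup):
    """Largest overhead for which quantization does not slow decode down."""
    return attn_share * (1.0 - 1.0 / gemm_speedup)


# HYPOTHETICAL inputs, chosen only to illustrate the trade-off.
ratio = decode_throughput_ratio(attn_share=0.10, gemm_speedup=1.5, quant_overhead=0.09)
limit = break_even_overhead(attn_share=0.10, gemm_speedup=1.5)
print(f"throughput ratio: {ratio:.3f}, break-even overhead: {limit:.3f}")
```

In this model a net speedup requires `quant_overhead < attn_share * (1 - 1/gemm_speedup)`. When the attention share is small, that budget is small too, so even modest per-layer overhead pushes the ratio below 1.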

Finding 2 — 4-bit is Laguna's practical floor. Compression frontier (reconstruction error vs size):

| scheme | recon. error | size |
|---|---|---|
| 4-bit NVFP4 | 8.6% | 21.6 GB |
| 2-bit (data-free) | 60% | 13.7 GB |
| 1.58-bit ternary (data-free) | 44% | 12.1 GB |
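
The size headroom implied by this table is simple to compute. The snippet below uses only the sizes listed above; the names are illustrative.

```python
# Checkpoint sizes (GB) copied from the compression-frontier table.
SIZE_GB = {
    "4-bit NVFP4": 21.6,
    "2-bit (data-free)": 13.7,
    "1.58-bit ternary (data-free)": 12.1,
}


def size_saving(size_gb, reference_gb):
    """Fraction of the reference size that the smaller scheme would save."""
    return 1.0 - size_gb / reference_gb


reference = SIZE_GB["4-bit NVFP4"]
savings = {
    name: size_saving(size, reference)
    for name, size in SIZE_GB.items()
    if name != "4-bit NVFP4"
}

for name, saving in savings.items():
    print(f"{name:30s} {saving:.1%} smaller than 4-bit NVFP4")
```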

The ~40% size headroom (37% to 44% below the 4-bit size) is real but unreachable today, for two reasons: data-free sub-4-bit quantization destroys the weights, so it needs calibration, and vLLM has no sub-4-bit kernel. The blockers are ecosystem gaps, not the math. Usable 2-bit needs GPTQ/AWQ calibration, which is blocked because Laguna's remote code requires transformers 5.x while the entire quant stack (llmcompressor) pins transformers ≤4.57.6.
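
To make "data-free quantization destroys the weights" concrete, here is a minimal NumPy sketch that fake-quantizes a random Gaussian matrix two ways and compares relative error. It is not the repo's quant_frontier.py. Several things are assumptions for illustration: the error metric (relative Frobenius norm), the simplified NVFP4-like scheme (E2M1 value grid, blocks of 16, unrounded absmax scales), and the absmean ternary rule. Real model weights will give different numbers.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def rel_error(w, w_hat):
    """Relative Frobenius-norm error (assumed metric, for illustration)."""
    return float(np.linalg.norm(w - w_hat) / np.linalg.norm(w))


def fake_quant_fp4_blocks(w, block=16):
    """Simplified NVFP4-like round trip: per-block absmax scale, E2M1 grid.

    Assumption: scales stay in full precision here; the real format also
    quantizes the block scales, which this sketch ignores.
    """
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero
    scaled = np.abs(blocks) / scale
    # Snap each magnitude to the nearest grid point.
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(blocks) * E2M1_GRID[idx] * scale).reshape(w.shape)


def fake_quant_ternary(w):
    """Data-free ternary round trip: values in {-1, 0, +1} times mean |w|."""
    scale = np.abs(w).mean()
    return np.clip(np.round(w / scale), -1, 1) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in weights

err_fp4 = rel_error(w, fake_quant_fp4_blocks(w))
err_ternary = rel_error(w, fake_quant_ternary(w))
print(f"FP4-like error: {err_fp4:.1%}, ternary error: {err_ternary:.1%}")
```

Even on synthetic weights the gap between the 4-bit and ternary round trips is large. Calibration-based methods such as GPTQ/AWQ exist to close that gap.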

Model weights & quality

The fully-NVFP4-attention weights are included in this repo (config is nvfp4-pack-quantized over experts + attention; serve with vLLM). The variant generates healthy full-length output (256 tokens), matching the baseline.

On the strict raw recovery-bench harness, single-attempt base Laguna scores 0/6, i.e. it is at floor. (That is exactly why the companion primary submission patchrecoverygym-laguna uses pass@k + apply-aware selection to reach a pass@8 of 0.833.) Because the task harness is at floor for every variant, quantization fidelity here is better captured by reconstruction error (8.6% for NVFP4) than by task pass rate.

Why this matters

A concrete, reproducible map of exactly what's missing to push Laguna past 4-bit (a transformers-5.x-compatible calibration path + a sub-4-bit vLLM kernel) — and proof that the fully-NVFP4-attention variant is the most aggressive viable Laguna today.

Reproduce

All scripts are included: quant_attn_nvfp4.py (build), serve_and_bench.sh, attn_precision_microbench.py, quant_frontier.py, quality_eval.sh. Built with compressed_tensors' own NVFP4 packing (correct by construction), validated at 8.6% reconstruction error.

Honest non-claims

  • No decode speedup (we explain why it's not achievable, which is the contribution).
  • Not retrained; post-hoc quantization of the released checkpoint.
  • Sub-4-bit shown only as a fidelity/size frontier (not servable on current vLLM).