Laguna-XS.2 — Fully-NVFP4 Attention + a Quantization Study

Poolside Research Hackathon submission (Foundations / Quantization track).

Submitted by: Kannappan Sirchabesan (@kannappans) · companion to the primary submission patchrecoverygym-laguna.

We asked a simple question about the released Laguna-XS.2-NVFP4: it quantizes the 256 MoE experts to NVFP4 but leaves all attention in bf16 — is that untapped optimization headroom, or a deliberate choice? We built the missing quantization end to end on an NVIDIA B200 and measured it. Answer: it's the right choice — and we explain why, with data.

What's here

A hand-built fully-NVFP4-attention Laguna (q/k/v/o → NVFP4, experts NVFP4) that serves in vLLM — including solving the qkv-fusion quantized-scale loading, which no reference checkpoint demonstrates.
A reproducible quantization study (latency, GEMM microbench, compression frontier) with all scripts.

Key results (B200, decode latency, 20×256 tok)

Variant	decode tok/s	vs baseline	quality
baseline (experts NVFP4, attn bf16)	295.4	—	reference
attn FP8 W8A8 (o_proj)	213.5	−28%	act-quant overhead
attn NVFP4 (o_proj)	271.5	−8%	—
attn NVFP4 (full q/k/v/o)	278.9	−5.6%	healthy (256 tok)

Finding 1 — quantizing attention doesn't speed up decode. The clean progression (FP8 −28% → NVFP4 o_proj −8% → NVFP4 full −5.6%) shows even the optimal scheme never crosses positive: at decode (bandwidth/overhead-bound), per-layer quant overhead + attention's modest share of per-token time outweigh the saving. This empirically explains why Poolside left attention in bf16.

Finding 2 — 4-bit is Laguna's practical floor. Compression frontier (reconstruction error vs size):

scheme	recon. error	size
4-bit NVFP4	8.6%	21.6 GB
2-bit (data-free)	60%	13.7 GB
1.58-bit ternary (data-free)	44%	12.1 GB

The ~40% size headroom is real but unreachable today — data-free sub-4-bit destroys the weights (needs calibration), and vLLM has no sub-4-bit kernel. The blockers are ecosystem gaps, not the math: usable 2-bit needs GPTQ/AWQ calibration, which is blocked because Laguna's remote code requires transformers 5.x while the entire quant stack (llmcompressor) pins transformers ≤4.57.6.

Model weights & quality

The fully-NVFP4-attention weights are included in this repo (config is nvfp4-pack-quantized over experts + attention; serve with vLLM). The variant generates healthy full-length output (256 tokens), matching the baseline.

On the strict raw recovery-bench harness, single-attempt base Laguna scores 0/6 — at floor (which is exactly why the companion primary submission patchrecoverygym-laguna uses pass@k + apply-aware selection to reach pass@8 0.833). Because the task harness is at floor for every variant, quantization fidelity here is better captured by reconstruction error (8.6% for NVFP4) than by task pass-rate.

Why this matters

A concrete, reproducible map of exactly what's missing to push Laguna past 4-bit (a transformers-5.x-compatible calibration path + a sub-4-bit vLLM kernel) — and proof that the fully-NVFP4-attention variant is the most aggressive viable Laguna today.

Reproduce

All scripts are included: quant_attn_nvfp4.py (build), serve_and_bench.sh, attn_precision_microbench.py, quant_frontier.py, quality_eval.sh. Built with compressed_tensors' own NVFP4 packing (correct by construction), validated at 8.6% reconstruction error.

Honest non-claims

No decode speedup (we explain why it's not achievable, which is the contribution).
Not retrained; post-hoc quantization of the released checkpoint.
Sub-4-bit shown only as a fidelity/size frontier (not servable on current vLLM).

Downloads last month: 6

Safetensors

Model size

19B params

Tensor type

F32

BF16

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for poolside-laguna-hackathon/laguna-xs2-nvfp4-attention

Base model

poolside/Laguna-XS.2

Quantized

poolside/Laguna-XS.2-NVFP4

Quantized

(1)

this model