Laguna XS.2 - GGUF Quantizations

First-ever GGUF conversion of poolside/Laguna-XS.2, produced as part of the Poolside Research Hackathon.

Laguna XS.2 is a Mixture-of-Experts model with 33.4B total parameters (3B activated per token), built for long-horizon agentic coding. These GGUF files enable local deployment via llama.cpp and compatible inference engines.

โš ๏ธ Inference status: Full CPU/GPU inference requires 6 C++ patches to llama-model.cpp (documented below). All benchmarks were run using vLLM on the original BF16 checkpoint.

Available Files

| Filename | Quant | Size | Notes |
|---|---|---|---|
| Laguna-XS.2-F16.gguf | F16 | 63 GB | Fixed source (experts F16, attention F16) |
| Laguna-XS.2-Q4_K_M.gguf | Q4_K_M | ~60 GB | Attention Q4, experts F16 |

Note on quantization sizes: llama.cpp does not quantize 3D MoE expert tensors. Attention/embedding weights are fully quantized; expert weights remain F16. This is a known llama.cpp limitation for large MoE models. Proper small quants (~18GB Q4_K_M) require upstream llama.cpp changes.
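The size gap described above can be approximated with back-of-the-envelope arithmetic. The sketch below is only an estimate: the 4.5 bits-per-weight figure and the share of parameters held in expert tensors are assumptions chosen for illustration, not values measured from these files, and metadata overhead is ignored.

```python
GIB = 1024 ** 3

def tensor_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate tensor payload in bytes, ignoring GGUF metadata overhead."""
    return n_params * bits_per_weight / 8

def mixed_bytes(n_params: float, frac_f16: float, quant_bpw: float) -> float:
    """Payload when a fraction of the weights stays F16 and the rest is quantized."""
    kept_f16 = tensor_bytes(n_params * frac_f16, 16.0)
    quantized = tensor_bytes(n_params * (1.0 - frac_f16), quant_bpw)
    return kept_f16 + quantized

TOTAL_PARAMS = 33.4e9        # from the model description
ASSUMED_Q4_BPW = 4.5         # assumption: rough average for a 4-bit K-quant
ASSUMED_EXPERT_FRAC = 0.95   # assumption: share of parameters in expert tensors

cases = {
    "everything F16": tensor_bytes(TOTAL_PARAMS, 16.0),
    "everything quantized": tensor_bytes(TOTAL_PARAMS, ASSUMED_Q4_BPW),
    "experts left F16": mixed_bytes(TOTAL_PARAMS, ASSUMED_EXPERT_FRAC, ASSUMED_Q4_BPW),
}
for label, size in cases.items():
    print(f"{label:>22}: {size / 1e9:6.1f} GB  ({size / GIB:6.1f} GiB)")
```

Under these assumptions, leaving the expert tensors in F16 keeps the file close to the full F16 size, which matches the pattern in the table: only the small non-expert share benefits from quantization.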

Benchmark Results

All benchmarks: H200 MIG 71GB, vLLM 0.21.0, BF16, temperature=0.

| Benchmark | Score | Details |
|---|---|---|
| HumanEval pass@1 | 90.2% (148/164) | thinking=off |
| GPQA Diamond | 42.6% (84/197) | thinking=on, 17.6 percentage points above the 25% random baseline |
| MATH500 | 46.6% (233/500) | full 500 problems, thinking=on |
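The percentages above follow directly from the raw counts. As a small check, the sketch below recomputes them and the GPQA margin over chance. The 25% baseline assumes four-option multiple choice, which is how GPQA is structured.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage of answered problems."""
    return 100.0 * correct / total

# (correct, total) pairs as reported in the benchmark results
reported = {
    "HumanEval pass@1": (148, 164),
    "GPQA Diamond": (84, 197),
    "MATH500": (233, 500),
}

GPQA_RANDOM_PCT = 25.0  # four answer options, so random guessing scores 25%

for name, (correct, total) in reported.items():
    print(f"{name}: {accuracy_pct(correct, total):.1f}%")

gpqa_margin = accuracy_pct(*reported["GPQA Diamond"]) - GPQA_RANDOM_PCT
print(f"GPQA margin over random: {gpqa_margin:.1f} percentage points")
```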

MATH500 by Subject

| Subject | Score |
|---|---|
| Algebra | 69.4% |
| Prealgebra | 65.9% |
| Number Theory | 54.8% |
| Counting & Probability | 36.8% |
| Geometry | 31.7% |
| Precalculus | 21.4% |
| Intermediate Algebra | 20.6% |

Throughput (BF16, vLLM, H200 71GB)

| Mode | tok/s |
|---|---|
| Single request | 57–74 |
| 5 concurrent | 100–106 |

Benchmark Visualizations

[Figures: Capability Benchmarks, Throughput, MATH500 by Subject]

Architecture - Key Differences from Mixtral

| Feature | Laguna XS.2 | Mixtral |
|---|---|---|
| Routing | Sigmoid | Softmax |
| Shared expert | Yes (always runs) | No |
| Routed scaling | 2.5× | 1.0× |
| Attention | Interleaved SWA/GA 3:1 | Global only |
| RoPE θ | 500k (GA) / 10k (SWA) | Single value |
| Per-head gating | Softplus g_proj | None |
| Expert count | 256, top-8 | 8, top-2 |
| Q-heads per layer | 48 (GA) / 64 (SWA) | Uniform |
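To make the routing rows concrete, here is a minimal pure-Python sketch of a sigmoid-routed MoE layer with an always-on shared expert. It is not the reference implementation. The function names are hypothetical, and it assumes the selected weights are used without renormalization, which the table does not specify.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def route(logits, top_k=8, routed_scaling=2.5):
    """Return (expert_index, weight) pairs for one token.

    Each expert is scored independently with a sigmoid, so the scores do not
    sum to 1 as they would with a softmax router.
    """
    scores = [sigmoid(z) for z in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: no renormalization; the scaling factor is folded into the weights.
    return [(i, routed_scaling * scores[i]) for i in chosen]

def moe_output(x, logits, experts, shared_expert, top_k=8, routed_scaling=2.5):
    """Sum the shared expert (always runs) and the scaled routed experts."""
    out = list(shared_expert(x))
    for idx, weight in route(logits, top_k, routed_scaling):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy usage: 4 experts and top-2 instead of 256 experts and top-8.
toy_experts = [lambda x, k=k: [k * v for v in x] for k in range(4)]
toy_logits = [0.0, 2.0, -1.0, 1.0]
print(route(toy_logits, top_k=2))
print(moe_output([1.0, -1.0], toy_logits, toy_experts, lambda x: list(x), top_k=2))
```

With softmax routing, raising one expert's logit lowers every other expert's weight. With sigmoid routing the scores are independent, which is why a separate scaling factor on the routed sum is meaningful.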

Custom GGUF Metadata Keys

| Key | Value |
|---|---|
| laguna.attention.layer_types | [0,1,1,1,0,...] (GA=0, SWA=1) |
| laguna.attention.heads_per_layer | [48,64,64,64,48,...] |
| laguna.rope.theta_swa | 10000.0 |
| laguna.rope.partial_rotary_factor | 0.5 |
| laguna.moe.routed_scaling_factor | 2.5 |
| laguna.moe.sigmoid_routing | true |
| laguna.attention.softplus_gating | true |
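As an illustration of how a loader might consume these keys, the sketch below expands the per-layer arrays into per-layer settings. The `LayerConfig` class, the function names, the layer count, and `head_dim=128` are all hypothetical. It also assumes SWA layers rotate the full head dimension, since partial rotation is only described for GA layers.

```python
from dataclasses import dataclass

GA, SWA = 0, 1  # encoding stated for laguna.attention.layer_types

@dataclass
class LayerConfig:
    kind: str
    n_q_heads: int
    rope_theta: float
    rotary_dims: int

def interleaved_types(n_layers: int, swa_per_ga: int = 3) -> list:
    """One GA layer followed by `swa_per_ga` SWA layers, repeated."""
    period = swa_per_ga + 1
    return [GA if i % period == 0 else SWA for i in range(n_layers)]

def layer_configs(types, heads, head_dim,
                  theta_ga=500_000.0, theta_swa=10_000.0,
                  partial_rotary_factor=0.5):
    """Combine the per-layer arrays into one settings object per layer."""
    if len(types) != len(heads):
        raise ValueError("layer_types and heads_per_layer must have equal length")
    configs = []
    for kind, n_heads in zip(types, heads):
        if kind == GA:
            rotary = int(head_dim * partial_rotary_factor)
            configs.append(LayerConfig("GA", n_heads, theta_ga, rotary))
        else:
            # Assumption: SWA layers rotate every dimension of the head.
            configs.append(LayerConfig("SWA", n_heads, theta_swa, head_dim))
    return configs

types = interleaved_types(8)                     # hypothetical layer count
heads = [48 if t == GA else 64 for t in types]   # mirrors heads_per_layer
for i, cfg in enumerate(layer_configs(types, heads, head_dim=128)):  # head_dim is hypothetical
    print(i, cfg)
```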

C++ Patches for llama.cpp Inference

  1. Sigmoid routing: ggml_soft_max → ggml_sigmoid in MoE router
  2. Routed scaling: multiply routed output by laguna.moe.routed_scaling_factor (2.5)
  3. e_score_correction_bias: add to router scores before top-k
  4. Dual RoPE θ: use laguna.rope.theta_swa for SWA layers
  5. Partial RoPE: rotate only 50% of head_dim on GA layers
  6. Softplus g_proj gating: softplus(g_proj(x)) per head after attention, before o_proj
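The attention-side patches (4 to 6) can be sketched in plain Python to show the intended math. This is not the C++ patch itself. The pairing convention inside the rotated slice (half-split, NeoX-style) is an assumption, as is reading "per head" as one scalar gate per head. The function names are hypothetical.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def partial_rope(vec, pos, theta, rotary_factor=0.5):
    """Rotate the leading share of one head vector and pass the rest through.

    Assumption: element i is paired with element i + rot/2 inside the rotated
    slice (half-split pairing).
    """
    rot = int(len(vec) * rotary_factor)
    half = rot // 2
    out = list(vec)
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / rot)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def gate_heads(attn_heads, gate_logits):
    """Scale each head's attention output by softplus of its gate logit.

    In the model this sits after attention and before o_proj; gate_logits
    stands in for g_proj(x), one value per head (assumption).
    """
    return [[softplus(g) * v for v in head] for head, g in zip(attn_heads, gate_logits)]

head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# GA layer: theta 500k, half of the dimensions rotated.
print(partial_rope(head, pos=3, theta=500_000.0, rotary_factor=0.5))
# SWA layer: theta 10k, full rotation assumed.
print(partial_rope(head, pos=3, theta=10_000.0, rotary_factor=1.0))
print(gate_heads([[1.0, 2.0], [3.0, 4.0]], gate_logits=[0.0, 1.5]))
```

Because softplus is strictly positive, the gate can shrink or amplify a head's contribution but never flips its sign.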

Conversion

Registration in conversion/__init__.py:

"LagunaForCausalLM": "laguna"

The conversion script (conversion/laguna.py) handles:

  • Stacking 256 individual expert weights → 3D tensors [256, dim_in, dim_out]
  • Per-layer variable Q-heads (48 GA / 64 SWA)
  • Sigmoid router, shared expert, g_proj, e_score_correction_bias
  • All Laguna metadata as custom GGUF keys
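The expert-stacking step can be illustrated with nested lists. This is a simplified sketch rather than the code in conversion/laguna.py: the tensor naming scheme and the function names are hypothetical, and a real conversion would typically operate on array objects instead of Python lists.

```python
def stack_experts(tensors, layer, proj, n_experts):
    """Collect per-expert 2D weights into one [n_experts, dim_in, dim_out] nested list."""
    stacked = []
    for e in range(n_experts):
        # Hypothetical naming scheme, used only for this illustration.
        name = f"layers.{layer}.experts.{e}.{proj}.weight"
        stacked.append(tensors[name])
    shapes = {(len(w), len(w[0])) for w in stacked}
    if len(shapes) != 1:
        raise ValueError(f"expert shapes differ: {sorted(shapes)}")
    return stacked

def shape3d(t):
    """Shape of a rectangular 3-level nested list."""
    return (len(t), len(t[0]), len(t[0][0]))

# Toy checkpoint: 4 experts with 2x3 weights instead of 256 full-size experts.
toy = {
    f"layers.0.experts.{e}.up_proj.weight": [[float(e)] * 3 for _ in range(2)]
    for e in range(4)
}
stacked = stack_experts(toy, layer=0, proj="up_proj", n_experts=4)
print(shape3d(stacked))
```

Stacking along a new leading axis lets the runtime index experts by ID from a single tensor per projection instead of tracking hundreds of separate tensors per layer.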

Original Model

Produced by Saurabh Mallik for the Poolside Research Hackathon.
