Laguna-XS.2 GGUF

GGUF quantizations of poolside/Laguna-XS.2. Converted and tested with a patched fork of llama.cpp that implements LLM_ARCH_LAGUNA from scratch.

Requires the custom fork. Upstream llama.cpp does not support this architecture. Fork: https://github.com/linuxid10t/llama.cpp-add-laguna

Files

| File | Quant | Size | Notes |
|---|---|---|---|
| Laguna-XS.2-f16.gguf | f16 | ~67 GB | Full precision, reference |
| Laguna-XS.2-Q4_K_M.gguf | Q4_K_M | ~20 GB | Recommended for most users |
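
As a rough sanity check on the sizes listed above, file size can be estimated as parameter count times bits per weight. The sketch below assumes an average of about 4.85 bits per weight for Q4_K_M, which is a commonly quoted figure and not something measured on these files; GGUF metadata is ignored, and the 33B parameter count is itself rounded.

```python
# Back-of-envelope GGUF size estimate: parameters x bits-per-weight / 8.
PARAMS = 33e9  # total parameter count as stated in this card (rounded)


def approx_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate file size in decimal gigabytes, ignoring metadata."""
    return params * bits_per_weight / 8 / 1e9


f16_gb = approx_size_gb(PARAMS, 16.0)
# 4.85 bits/weight is an ASSUMED average for Q4_K_M, not a measured value.
q4_gb = approx_size_gb(PARAMS, 4.85)

print(f"f16    ~{f16_gb:.0f} GB")
print(f"Q4_K_M ~{q4_gb:.0f} GB")
```

The estimates land close to the listed sizes, which also gives a lower bound on the RAM or VRAM needed to load each file before any KV cache is allocated.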

Usage

git clone https://github.com/linuxid10t/llama.cpp-add-laguna
cd llama.cpp-add-laguna && cmake -B build && cmake --build build -j$(nproc)

./build/bin/llama-cli \
  -m Laguna-XS.2-Q4_K_M.gguf \
  --ctx-size 131072 \
  --temp 0 \
  -p "The capital of France is"

Chat / thinking mode

# Thinking on (default): the model prefills <think> and generates a reasoning trace
./build/bin/llama-cli -m Laguna-XS.2-Q4_K_M.gguf -cnv --ctx-size 32768

# Thinking off: direct answer, no reasoning trace
./build/bin/llama-cli -m Laguna-XS.2-Q4_K_M.gguf -cnv --ctx-size 32768 --reasoning off

The chat template, thinking-mode prefix, EOT token (</assistant>, token 24), and streaming parser are all handled automatically; no manual prompt engineering is needed.

Architecture

| Property | Value |
|---|---|
| Parameters | 33B total, ~3B active per token |
| Layers | 40 (10 global + 30 SWA, 3:1 pattern) |
| Attention heads | 64 (global layers) / 48 (SWA layers) |
| KV heads | 8 (GQA) |
| Q/K norm | RMSNorm per head |
| Attention gate | Head-wise softplus gate on o_proj input, SWA layers only |
| SWA window | 512 tokens |
| Experts | 256 routed (top-8) + 1 shared |
| Dense layers | Layer 0 only |
| MoE router | Sigmoid + e_score_correction_bias (bias added for top-k selection only, not to the routing weights) |
| Routing scale | moe_routed_scaling_factor with weight normalization |
| Global RoPE | YaRN: base 500K, factor 32, original_max 4096, β_fast 64, β_slow 1, partial_rotary 0.5 |
| SWA RoPE | Default: base 10K, full rotary |
| Context length | 131,072 tokens |

Implementation Notes

Attention gate: SWA layers have an extra self_attn.g_proj that produces a per-head scalar. This is applied as a softplus gate to the SDPA output before o_proj. Global layers have no gate.
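
The gating step described above can be sketched in plain Python for a single token. This is an illustration only, not the fork's code: the function names and the list-of-lists layout are assumptions, and the real implementation operates on ggml tensors.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gate_attention_output(attn_out, gate_logits):
    """Scale each head's SDPA output by softplus of that head's gate scalar.

    attn_out:    [n_heads][head_dim] attention output for one token
    gate_logits: [n_heads] per-head scalars (what g_proj would produce)
    """
    if len(attn_out) != len(gate_logits):
        raise ValueError("one gate scalar is expected per head")
    return [
        [softplus(g) * v for v in head]
        for head, g in zip(attn_out, gate_logits)
    ]


# Toy example: 2 heads, head_dim 2. The result would then be fed to o_proj.
attn_out = [[1.0, -2.0], [0.5, 0.5]]
gate_logits = [0.0, 3.0]
gated = gate_attention_output(attn_out, gate_logits)
```

Because softplus is strictly positive, the gate rescales a head's contribution but never flips its sign or zeroes it out exactly.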

Partial rotary: Global layers rotate only the first half of head dimensions (matching partial_rotary_factor=0.5). SWA layers use full rotary. This is handled via llama.cpp's existing n_rot_full / 2 mechanism.
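
A minimal sketch of partial rotary for one head vector follows. It assumes the split-half pairing (dimension i rotates with dimension i + n_rot/2), which this card does not confirm, and it leaves out YaRN scaling entirely; the function name and the toy head size are hypothetical.

```python
import math


def apply_partial_rope(x, pos, n_rot, base=500_000.0):
    """Rotate the first n_rot dims of one head vector; pass the rest through.

    Uses split-half pairing (dim i with dim i + n_rot/2). The pairing
    convention is an ASSUMPTION, and YaRN scaling is omitted.
    """
    if n_rot % 2 or n_rot > len(x):
        raise ValueError("n_rot must be even and at most the head dimension")
    half = n_rot // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / n_rot)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


head_dim = 8  # toy size for illustration, not the model's head dimension
x = [float(i + 1) for i in range(head_dim)]
# partial_rotary 0.5: only the first half of the head dims is rotated
y = apply_partial_rope(x, pos=5, n_rot=head_dim // 2)
```

With n_rot set to the full head dimension and base set to 10K, the same function would describe the SWA layers' full rotary case.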

Dual RoPE: Global and SWA layers use entirely different RoPE configs. The fork adds per-layer-type YaRN siblings to llama_hparams / llama_cparams and selects at graph build time via hparams.is_swa(il).
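
The per-layer selection can be pictured as a lookup keyed on layer type. The sketch below uses the RoPE values stated in this card, but the `RopeConfig` type, the `rope_for_layer` helper, and in particular the layer pattern inside `is_swa` (every fourth layer global) are assumptions made for illustration; the fork's actual pattern and data structures may differ.

```python
from typing import NamedTuple


class RopeConfig(NamedTuple):
    base: float             # RoPE frequency base
    rotary_fraction: float  # fraction of each head's dims that is rotated
    yarn_factor: float      # 1.0 means plain RoPE, no YaRN scaling


# Values taken from the Architecture section of this card.
GLOBAL_ROPE = RopeConfig(base=500_000.0, rotary_fraction=0.5, yarn_factor=32.0)
SWA_ROPE = RopeConfig(base=10_000.0, rotary_fraction=1.0, yarn_factor=1.0)


def is_swa(il: int) -> bool:
    """HYPOTHETICAL layer pattern: every fourth layer is global."""
    return il % 4 != 0


def rope_for_layer(il: int) -> RopeConfig:
    """Pick the RoPE config for layer il, mirroring a graph-build-time choice."""
    return SWA_ROPE if is_swa(il) else GLOBAL_ROPE


n_layers = 40
n_global = sum(1 for il in range(n_layers) if not is_swa(il))
print(n_global, n_layers - n_global)
```

Under the assumed pattern this gives 10 global and 30 SWA layers, consistent with the 3:1 ratio stated in this card.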

MoE routing: Router logits pass through sigmoid (not softmax), biased by a per-expert e_score_correction_bias that is added only during top-k selection, not during weight computation. Routing weights are L1-normalized before scaling by moe_routed_scaling_factor.
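
The routing rule above can be sketched for one token as follows. This is a simplified illustration under stated assumptions: the function name is hypothetical, the scaling factor of 2.5 in the example is a placeholder and not the model's value, and the toy example uses 4 experts with top-2 where the model uses 256 experts with top-8.

```python
import math


def route(logits, bias, top_k, scale):
    """Return {expert_index: weight} for one token.

    Selection uses sigmoid(logit) + bias; the weights use the unbiased
    sigmoid scores, L1-normalized over the chosen experts, then scaled.
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)
    chosen = chosen[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scale * scores[e] / total for e in chosen}


logits = [2.0, 1.0, 0.0, -1.0]
bias = [0.0, 0.0, 0.0, 5.0]  # large bias forces expert 3 into the top-k
weights = route(logits, bias, top_k=2, scale=2.5)  # 2.5 is a placeholder
```

In this example expert 3 is selected because of its bias, yet it receives a smaller weight than expert 0, since the bias plays no part in the weight computation.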

Thinking mode prefill: The chat template injects <think> as the generation prompt prefix when thinking is enabled. The streaming parser is patched to treat this as a delimiter-style boundary (not a generated token), so the <think> tag never appears in streaming output.

Stop token: </assistant> is token 24, a regular vocabulary token (not a special token). The fork adds it to antiprompt so the stop-word erase logic strips it before output.
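
The effect of the stop-word erase step can be illustrated with a small string-level sketch. It is not the fork's implementation: `erase_stop_string` is a hypothetical helper, and the real logic works on the token stream during generation and not on a finished string.

```python
def erase_stop_string(text: str, stop_strings: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop string."""
    hits = [i for i in (text.find(s) for s in stop_strings) if i != -1]
    return text[: min(hits)] if hits else text


raw = "Paris is the capital of France.</assistant>"
print(erase_stop_string(raw, ["</assistant>"]))
# prints: Paris is the capital of France.
```

Text without a stop string passes through unchanged, so the helper is safe to apply to every completed response.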

Tested

  • Greedy decode on "The capital of France is" → " Paris.\nThe capital of Germany is" ✓
  • 10K-token prompt (exercises SWA layers): no corruption ✓
  • 35K-token needle-in-haystack (SWORDFISH99): correctly retrieved (exercises global YaRN layers) ✓
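
For readers who want to repeat a needle-in-haystack check, a prompt can be generated along the following lines. The filler sentence, the question wording, and the insertion depth are all assumptions, since the exact haystack used for the test above is not documented here; the resulting token count depends on the tokenizer and should be measured, not inferred from the sentence count.

```python
def build_needle_prompt(needle: str, n_filler: int, depth: float = 0.5) -> str:
    """Build a long filler text with one 'needle' sentence placed at `depth`.

    depth=0.0 places the needle at the start, 1.0 at the end.
    """
    filler = "The quick brown fox jumps over the lazy dog."  # hypothetical filler
    sentences = [filler] * n_filler
    idx = int(len(sentences) * depth)
    sentences.insert(idx, f"The secret passphrase is {needle}.")
    question = "What is the secret passphrase mentioned in the text above?"
    return " ".join(sentences) + "\n\n" + question


# n_filler is a guess at what reaches tens of thousands of tokens.
prompt = build_needle_prompt("SWORDFISH99", n_filler=3500)
with open("needle_prompt.txt", "w", encoding="utf-8") as f:
    f.write(prompt)
```

The saved file can then be passed to llama-cli with -f needle_prompt.txt and a --ctx-size large enough to hold it; a pass means the reply contains the passphrase.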

Known Limitations

  • Numerical validation against HF Transformers not yet done (requires CUDA/ROCm).
  • Q4_K_M vs f16 top-1 token agreement not formally checked (f16 GGUF exceeds available RAM for full eval).