Piscina-XS.2-GGUF β€” Laguna XS.2 with Active-Path-Precision Quantization

Piscina shrinks Poolside's 33B-A3B MoE coding model, Laguna XS.2, to as low as ~2.1 bits per weight (8.8 GB) β€” small enough for consumer GPUs and laptops β€” using Active-Path-Precision, a MoE-aware mixed-precision recipe, and shows with benchmarks that it stays near-lossless. Built for the Poolside Research Hackathon (Foundations track).

A laguna is a lagoon; a piscina is a pool you can fit at home.

TL;DR

  • Result (head-to-head at ~equal size): Piscina-IQ2 cuts KL-divergence 28% (0.555 vs 0.772) and lifts top-1 token agreement +5.3 pts (72.6% vs 67.3%) vs generic IQ2_M at the same size. HumanEval pass@1 90.0% vs 80.0%.
  • Active-Path-Precision β€” keep Laguna's always-active path (attention, router, shared expert, embeddings, output, dense layer 0) near-full-precision (Q6_K/Q8_0) and quantize only the 256 dormant routed experts to ~2-bit. Result: near-lossless quality at the size of a generic 2-bit GGUF.
  • Portable GGUF (llama.cpp / Ollama, any NVIDIA/CPU box) and reproducible with llama.cpp --tensor-type overrides (no custom kernels).
  • Head-to-head vs generic 2-bit at equal size, measured by KL-divergence + top-1 token agreement vs Q8_0.

Quality vs the Q8_0 reference (wiki corpus, n_ctx=512)

Variant Method bpw Size (GB) Mean KL-div ↓ Top-1 agree ↑ PPL ↓ tok/s ↑
Q8_0 (ref) reference ~8.5 35.6 0.000 100% 14.38 160
Piscina-IQ2 Active-Path-Precision ~2.9 11.9 0.555 72.6% 17.06 149
IQ2_M generic uniform ~2.7 11.0 0.772 67.3% 17.19 156
Piscina-IQ1 Active-Path-Precision ~2.1 8.8 1.347 58.0% 26.31 151
IQ1_M generic uniform ~1.75 7.6 2.454 42.6% 56.89 164

Pareto frontier: size vs quality, generic vs Piscina

Recommended pick: Piscina-IQ2 β€” near-lossless on a 16 GB GPU. Piscina-IQ1 for 12 GB. Generic IQ1_* only when memory is the hard constraint.

Usage (llama.cpp)

hf download poolside-laguna-hackathon/Piscina-XS.2-GGUF Piscina-XS.2-IQ2.gguf --local-dir .
# requires a laguna-aware llama.cpp build (see Credits)
./llama-server -m Piscina-XS.2-IQ2.gguf -ngl 99 -c 8192 --port 8080

Method β€” Active-Path-Precision

Plain sub-4-bit quantization treats every tensor equally and crushes the parts of a Mixture-of-Experts that are most fragile. Active-Path-Precision allocates bits by how often a tensor is on the active compute path:

  • High precision (Q6_K / Q8_0): attention (q/k/v/output), the router (ffn_gate_inp, the most quantization-sensitive component), the shared expert (fires every token), the leading dense layer 0, token embeddings and the output head.
  • Aggressive (~2-bit, IQ2_M / IQ1_M): the 256 dormant routed experts (ffn_{gate,up,down}_exps), which hold most parameters but each activate rarely.

Routed experts dominate the parameter count, so the model still lands at ~2-3 bpw and fits a 16 GB GPU, while the per-token compute path stays near-full-precision β€” hence near-lossless quality. Stock llama.cpp --tensor-type overrides + an imatrix; no custom kernels.

Recipe (flagship Piscina-IQ2):

llama-quantize --imatrix author.imatrix \
  --token-embedding-type q6_K --output-tensor-type q6_K \
  --tensor-type attn=q6_K --tensor-type ffn_gate_inp=q8_0 \
  --tensor-type shexp=q6_K --tensor-type blk.0.ffn=q6_K \
  Laguna-XS.2-f16.gguf Piscina-XS.2-IQ2.gguf IQ2_M

Grounded in recent MoE-quantization literature: MoQE (arXiv:2310.02410), Examining MoE quantization (arXiv:2406.08155), QMoE (arXiv:2310.16795), MxMoE (arXiv:2505.05799), EAQuant (arXiv:2506.13329, router fragility β†’ validate with KL-divergence).

Limitations (honest)

  • 1-bit variants (IQ1_*) show a sharp quality cliff β€” published to map where Laguna breaks, not for production.
  • KL-div/top-1 measured on a wiki corpus at n_ctx=512; treat as directional.
  • KV cache / long-context memory is separate from weight size; budget VRAM accordingly.

Functional code quality β€” HumanEval pass@1

Directional subset (n=20 problems), greedy decoding, served via the laguna-aware llama.cpp llama-server with the model's chat template.

Variant HumanEval pass@1 ↑
Q8_0 (reference) 95.0%
Piscina-IQ2 (active-path) 90.0%
IQ2_M (generic) 80.0%

The Laguna XS.2 model card reports SWE-bench Verified/Multilingual/Pro and Terminal-Bench 2.0. Those require full Dockerized agentic harnesses (hours of compute) and are out of scope for this hackathon's time/compute budget. HumanEval pass@1 is included as a lightweight functional proxy for how well code-generation quality survives quantization.

Credits

Base model: poolside/Laguna-XS.2 (Apache 2.0). Source f16 GGUF: linuxid10t/Laguna-XS.2-GGUF. Runtime: laguna-aware llama.cpp fork linuxid10t/llama.cpp-add-laguna (mainline llama.cpp does not yet support the laguna architecture). imatrix recomputed from the f16 on wiki calibration text. Built for the Poolside Research Hackathon.

Downloads last month
647
GGUF
Model size
33B params
Architecture
laguna
Hardware compatibility
Log In to add your hardware

1-bit

2-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for poolside-laguna-hackathon/Piscina-XS.2-GGUF

Quantized
(16)
this model

Papers for poolside-laguna-hackathon/Piscina-XS.2-GGUF