Piscina-XS.2-GGUF — Laguna XS.2 with Active-Path-Precision Quantization

Piscina shrinks Poolside's 33B-A3B MoE coding model, Laguna XS.2, to as low as ~2.1 bits per weight (8.8 GB), small enough for consumer GPUs and laptops. It uses Active-Path-Precision, a MoE-aware mixed-precision recipe, and benchmarks every variant against the Q8_0 reference: the flagship Piscina-IQ2 (~2.9 bpw, 11.9 GB) stays near-lossless and closer to Q8_0 than a generic 2-bit quant of similar size. Built for the Poolside Research Hackathon (Foundations track).

A laguna is a lagoon; a piscina is a pool you can fit at home.

TL;DR

Result (head-to-head at similar size, 11.9 GB vs 11.0 GB): Piscina-IQ2 cuts mean KL-divergence by 28% (0.555 vs 0.772) and raises top-1 token agreement by 5.3 points (72.6% vs 67.3%) relative to generic IQ2_M, both measured against Q8_0. HumanEval pass@1 on a 20-problem subset is 90.0% vs 80.0%.
Active-Path-Precision: keep Laguna's always-active path (attention, router, shared expert, embeddings, output, dense layer 0) at near-full precision (Q6_K/Q8_0) and quantize only the 256 rarely activated routed experts to ~2-bit. The outcome is near-lossless quality at roughly the size of a generic 2-bit GGUF.
Portable GGUF for llama.cpp and Ollama on NVIDIA GPUs or CPU, reproducible with llama.cpp --tensor-type overrides (no custom kernels). Running it requires a laguna-aware llama.cpp build (see Credits).
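
The headline percentages can be rechecked from the raw figures in the quality table. This is only a small arithmetic sketch; the helper name `relative_reduction` is made up for illustration and is not part of any tool used in this project.

```python
def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved (0.25 means 25% lower)."""
    return (baseline - improved) / baseline


# Figures copied from the quality table (Piscina-IQ2 vs generic IQ2_M).
kl_cut = relative_reduction(baseline=0.772, improved=0.555)
top1_gain = 72.6 - 67.3          # percentage points
size_overhead = 11.9 / 11.0 - 1  # Piscina-IQ2 is slightly larger

print(f"KL-divergence reduction: {kl_cut:.1%}")
print(f"Top-1 agreement gain:    {top1_gain:+.1f} pts")
print(f"Size overhead:           {size_overhead:.1%}")
```

The size overhead line is a reminder that the two files are close in size rather than identical.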

Quality vs the Q8_0 reference (wiki corpus, n_ctx=512)

Variant	Method	bpw	Size (GB)	Mean KL-div ↓	Top-1 agree ↑	PPL ↓	tok/s ↑
Q8_0 (ref)	reference	~8.5	35.6	0.000	100%	14.38	160
Piscina-IQ2	Active-Path-Precision	~2.9	11.9	0.555	72.6%	17.06	149
IQ2_M	generic uniform	~2.7	11.0	0.772	67.3%	17.19	156
Piscina-IQ1	Active-Path-Precision	~2.1	8.8	1.347	58.0%	26.31	151
IQ1_M	generic uniform	~1.75	7.6	2.454	42.6%	56.89	164
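
For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute them from per-position logits of a reference model and a quantized model. It is an illustration under assumptions, not the evaluation code used for the numbers above: the function names are hypothetical, the logits are made-up toy values, and the result is in nats, which may not match the units of the reported figures.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def top1_agrees(ref_logits, test_logits):
    """True when both models rank the same token first."""
    return ref_logits.index(max(ref_logits)) == test_logits.index(max(test_logits))


def summarize(ref_rows, test_rows):
    """Mean KL-divergence and top-1 agreement rate over all positions."""
    pairs = list(zip(ref_rows, test_rows))
    mean_kl = sum(kl_divergence(r, t) for r, t in pairs) / len(pairs)
    agree = sum(top1_agrees(r, t) for r, t in pairs) / len(pairs)
    return mean_kl, agree


# Toy logits over a 3-token vocabulary (made-up values, two positions).
ref = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
quant = [[1.8, 0.7, -0.9], [1.2, 1.0, 0.3]]
mean_kl, agree = summarize(ref, quant)
print(f"mean KL = {mean_kl:.3f} nats, top-1 agreement = {agree:.0%}")
```

In the toy data the two models agree on the first position and disagree on the second, so the agreement rate is 50%.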

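Recommended pick: Piscina-IQ2 (11.9 GB), near-lossless on a 16 GB GPU. Use Piscina-IQ1 (8.8 GB) on a 12 GB GPU, accepting the quality drop shown in the table (top-1 agreement 58.0%). Use generic IQ1_* only when memory is the hard constraint.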

Usage (llama.cpp)

hf download poolside-laguna-hackathon/Piscina-XS.2-GGUF Piscina-XS.2-IQ2.gguf --local-dir .
# requires a laguna-aware llama.cpp build (see Credits)
./llama-server -m Piscina-XS.2-IQ2.gguf -ngl 99 -c 8192 --port 8080
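
Once the server is running, it can be queried over HTTP. The sketch below assumes the fork keeps llama-server's OpenAI-compatible `/v1/chat/completions` endpoint and that the server listens on port 8080 as in the command above; the helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumes --port 8080 as in the command above


def build_chat_request(prompt, base_url=BASE_URL, max_tokens=256):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy-style decoding for repeatable output
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Write a Python function that reverses a string."))` should print the model's answer. Only the standard library is used, so no extra client package is needed.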

Method — Active-Path-Precision

Uniform sub-4-bit quantization treats every tensor the same way, so it degrades the most fragile parts of a Mixture-of-Experts model along with everything else. Active-Path-Precision instead allocates bits by how often a tensor is on the active compute path:

High precision (Q6_K / Q8_0): attention (q/k/v/output), the router (ffn_gate_inp, the most quantization-sensitive component), the shared expert (fires on every token), the leading dense layer 0, the token embeddings, and the output head.
Aggressive (IQ2_M at ~2.7 bpw, or IQ1_M at ~1.75 bpw): the 256 routed experts (ffn_{gate,up,down}_exps), which hold most of the parameters but are each activated rarely.

Routed experts dominate the parameter count, so the model still lands at ~2-3 bpw and fits a 16 GB GPU, while the per-token compute path stays at near-full precision, hence the near-lossless quality. The recipe needs only llama.cpp's standard --tensor-type overrides plus an imatrix; no custom kernels.
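
The size argument can be made concrete with a weighted average. The sketch below is a back-of-the-envelope estimate under loud assumptions: the 95% routed-expert share is a hypothetical illustration (this card does not state the real split), the whole active path is treated as Q6_K, and the 33B parameter count is taken from the model's name. Q6_K's 6.5625 bpw follows from its block layout (210 bytes per 256 weights).

```python
Q6_K_BPW = 6.5625  # 210 bytes per 256-weight block
IQ2_M_BPW = 2.7    # approximate figure from the quality table


def mixed_bpw(expert_fraction, expert_bpw, active_bpw):
    """Average bits per weight when experts and active path use different types."""
    return expert_fraction * expert_bpw + (1 - expert_fraction) * active_bpw


def size_gb(n_params, bpw):
    """File size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * bpw / 8 / 1e9


EXPERT_FRACTION = 0.95  # hypothetical share of weights in routed experts
bpw = mixed_bpw(EXPERT_FRACTION, IQ2_M_BPW, Q6_K_BPW)
print(f"estimated bpw:  {bpw:.2f}")
print(f"estimated size: {size_gb(33e9, bpw):.1f} GB")
```

Under these assumptions the estimate lands near the reported ~2.9 bpw and 11.9 GB, which is consistent with the routed experts holding the large majority of the weights.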

Recipe (flagship Piscina-IQ2):

llama-quantize --imatrix author.imatrix \
  --token-embedding-type q6_K --output-tensor-type q6_K \
  --tensor-type attn=q6_K --tensor-type ffn_gate_inp=q8_0 \
  --tensor-type shexp=q6_K --tensor-type blk.0.ffn=q6_K \
  Laguna-XS.2-f16.gguf Piscina-XS.2-IQ2.gguf IQ2_M
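
To see which tensors each override in the recipe is meant to catch, the following sketch applies the same rules to a few tensor names. It is an illustration only: the tensor names follow the usual llama.cpp GGUF naming but are assumed rather than read from the Laguna file, the patterns are treated as plain substrings, and llama-quantize's own matching may differ in detail.

```python
# Override rules mirroring the recipe's flags, checked in order.
OVERRIDES = [
    ("ffn_gate_inp", "q8_0"),  # router
    ("attn", "q6_K"),          # attention tensors
    ("shexp", "q6_K"),         # shared expert
    ("blk.0.ffn", "q6_K"),     # leading dense layer 0
    ("token_embd", "q6_K"),    # stands in for --token-embedding-type
    ("output", "q6_K"),        # stands in for --output-tensor-type
]
BASE_TYPE = "IQ2_M"  # everything else, i.e. the routed experts


def pick_quant_type(tensor_name):
    """Return the quant type of the first matching rule, else the base type."""
    for pattern, qtype in OVERRIDES:
        if pattern in tensor_name:
            return qtype
    return BASE_TYPE


# Hypothetical tensor names for illustration.
for name in [
    "token_embd.weight",
    "blk.0.ffn_up.weight",
    "blk.5.attn_q.weight",
    "blk.5.ffn_gate_inp.weight",
    "blk.5.ffn_up_shexp.weight",
    "blk.5.ffn_down_exps.weight",
]:
    print(f"{name:30s} -> {pick_quant_type(name)}")
```

Only the last name, a routed-expert tensor, falls through to the aggressive base type; everything on the always-active path is caught by an override.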

Grounded in recent MoE-quantization literature: MoQE (arXiv:2310.02410), Examining MoE quantization (arXiv:2406.08155), QMoE (arXiv:2310.16795), MxMoE (arXiv:2505.05799), EAQuant (arXiv:2506.13329, router fragility → validate with KL-divergence).

Limitations (honest)

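1-bit variants (IQ1_*) show a sharp quality cliff (mean KL-divergence 1.347 for Piscina-IQ1 and 2.454 for IQ1_M). They are published to map where Laguna breaks, not for production use.
KL-divergence and top-1 agreement were measured on a wiki corpus at n_ctx=512; treat them as directional.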
KV cache / long-context memory is separate from weight size; budget VRAM accordingly.
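
As a rough guide for that budgeting, the sketch below uses the standard KV cache size formula for a transformer with full attention in every layer. The layer count, KV head count, and head dimension in the example are hypothetical placeholders, not Laguna XS.2's real configuration, and the formula overestimates if the model uses sliding-window or other reduced-cache attention.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Bytes needed to cache keys and values for n_ctx tokens.

    The leading factor 2 covers K and V; bytes_per_elem=2 assumes an f16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical shape for illustration only (NOT Laguna XS.2's actual config).
example = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=8192)
print(f"KV cache: {example / 2**30:.2f} GiB")  # 1.25 GiB for this made-up shape
```

Whatever the real figure is, it sits on top of the weight file size, so a model that just fits in VRAM at a short context may not fit at a long one.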

Functional code quality — HumanEval pass@1

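Directional subset of 20 problems, greedy decoding, served through the laguna-aware llama.cpp llama-server with the model's chat template. With n=20, each problem is worth 5 points, so the gap between Piscina-IQ2 and IQ2_M is two problems.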

Variant	HumanEval pass@1 ↑
Q8_0 (reference)	95.0%
Piscina-IQ2 (active-path)	90.0%
IQ2_M (generic)	80.0%
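
To show why a 20-problem subset is only directional, the sketch below converts each pass rate back to a problem count and attaches a 95% Wilson score interval. The interval is an added illustration, not something computed for this card, and `wilson_interval` is a hypothetical helper written for this example.

```python
import math


def wilson_interval(passed, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


N = 20  # size of the HumanEval subset
for name, rate in [("Q8_0", 0.95), ("Piscina-IQ2", 0.90), ("IQ2_M", 0.80)]:
    passed = round(rate * N)
    lo, hi = wilson_interval(passed, N)
    print(f"{name:12s} {passed}/{N} passed, 95% interval {lo:.2f} to {hi:.2f}")
```

The intervals for the three variants overlap substantially, so the ordering is suggestive rather than statistically established.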

The Laguna XS.2 model card reports SWE-bench Verified/Multilingual/Pro and Terminal-Bench 2.0. Those require full Dockerized agentic harnesses (hours of compute) and are out of scope for this hackathon's time/compute budget. HumanEval pass@1 is included as a lightweight functional proxy for how well code-generation quality survives quantization.

Credits

Base model: poolside/Laguna-XS.2 (Apache 2.0).
Source f16 GGUF: linuxid10t/Laguna-XS.2-GGUF.
Runtime: laguna-aware llama.cpp fork linuxid10t/llama.cpp-add-laguna (mainline llama.cpp does not yet support the laguna architecture).
imatrix: recomputed from the f16 GGUF on wiki calibration text.
Built for the Poolside Research Hackathon.