---
library_name: vllm
inference: false
extra_gated_description: >-
To learn more about how we process your personal data, please read our Privacy Policy.
tags:
- laguna-xs.2
license: apache-2.0
pipeline_tag: text-generation
base_model:
- poolside/Laguna-XS.2
---
# Laguna XS.2-NVFP4
Laguna XS.2-NVFP4 is a Mixture-of-Experts model with 33B total parameters and 3B activated per token, designed for agentic coding and long-horizon work on a local machine. It uses Sliding Window Attention with per-head gating in 30 out of 40 layers for fast inference and low KV cache requirements.
> [!NOTE]
> This is the NVFP4 variant with an FP8-quantized KV cache. The [BF16](https://huggingface.co/poolside/Laguna-XS.2), [FP8](https://huggingface.co/poolside/Laguna-XS.2-FP8) and [INT4](https://huggingface.co/poolside/Laguna-XS.2-INT4) variants are also available on Hugging Face.
## Highlights
- **Mixed SWA and global attention layout**: Laguna XS.2 uses sigmoid gating with per-layer rotary scales, enabling mixed SWA (Sliding Window Attention) and global attention layers in a 3:1 ratio (across 40 total layers)
- **KV cache in FP8**: KV cache quantized to FP8, reducing memory per token
- **Native reasoning support**: Interleaved thinking between tool calls with support for enabling and disabling thinking per-request
- **Local-ready**: At 33B total parameters and 3B activated, Laguna XS.2 is compact enough to run on a Mac with 36 GB of RAM. [Available on Ollama](https://ollama.com/library/laguna-xs.2)
- **Apache 2.0 license**: Use and modify freely for commercial and non-commercial purposes
---
## Model overview
- Training: pre-training, post-training and reinforcement learning stages
- Number of parameters: 33B total with 3B activated per token
- Optimizer: Muon
- Layers: 40 layers (10 layers with global attention, 30 layers with sliding window attention)
- Experts: 256 experts with 1 shared expert
- Sliding Window: 512 tokens
- Modality: text-to-text
- Context window: 262,144 tokens
- Reasoning support: interleaved thinking with preserved thinking
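
To illustrate why the 3:1 layout keeps KV cache requirements low at long context, here is a back-of-the-envelope sketch that counts cached token positions per layer using the figures listed above. It assumes each sliding-window layer retains at most the last 512 tokens and each global layer retains the full context. It counts token slots only and ignores per-layer head counts and dtype, so it is not a memory estimate in bytes. The function name `cached_positions` is illustrative.

```python
# Figures taken from the model overview above.
GLOBAL_LAYERS = 10
SWA_LAYERS = 30
WINDOW = 512
CONTEXT = 262_144


def cached_positions(context_len: int) -> int:
    """Token positions held across all layers.

    Assumption: SWA layers keep only the last WINDOW tokens,
    global layers keep every token in the context.
    """
    global_slots = GLOBAL_LAYERS * context_len
    swa_slots = SWA_LAYERS * min(context_len, WINDOW)
    return global_slots + swa_slots


mixed = cached_positions(CONTEXT)
# Hypothetical baseline: the same 40 layers, all using global attention.
all_global = (GLOBAL_LAYERS + SWA_LAYERS) * CONTEXT

print(f"mixed layout: {mixed:,} slots")
print(f"all-global:   {all_global:,} slots")
print(f"ratio:        {mixed / all_global:.4f}")
```

Under these assumptions, at the full 262,144-token context the mixed layout holds roughly a quarter of the positions that an all-global stack would.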
## Benchmark results
| Model | Size (total params.) | SWE-bench Verified | SWE-bench Multilingual | SWE-bench Pro (Public Dataset) | Terminal-Bench 2.0 |
|---------------------------|----------------------|--------------------|------------------------|--------------------------------|--------------------|
| **Laguna XS.2 (BF16)** | 33B | 69.9% | 57.7% | 46.3% | 35.7% |
| Devstral Small 2 | 24B dense | 68.0% | 55.7% | - | 22.5% |
| Gemma 4 31B IT | 31B dense | 52.0% | 51.7% | 35.7% | 42.9% |
| Qwen3.5-35B-A3B | 35B | 69.2% | 60.3% | 44.6% | 40.5% |
| Qwen3.6-35B-A3B | 35B | 73.4% | 67.2% | 49.5% | 51.5% |
| Claude Haiku 4.5 | - | 73.3% | - | 39.5% | 29.8% |
| GPT-5.4 Nano | - | - | - | 52.4% | 46.3% |
*We used the highest publicly-referenced scores for all comparison models across each benchmark. In almost all cases these were official scores published in release blog posts or equivalent, with the exception of Gemma 4 31B IT where the highest published scores were [reported by the Qwen team](https://qwen.ai/blog?id=qwen3.6-35b-a3b) and Claude Haiku 4.5 where the highest published (verified) scores for SWE-bench Pro and Terminal-Bench 2.0 are from their respective official leaderboards.*
### Benchmarking methodology
All benchmarking for Laguna XS.2 was completed using the Laude Institute’s Harbor Framework with our [agent harness](https://github.com/poolsideai/pool), using a maximum of 500 steps and sandboxed execution using 8 GB RAM/2 CPUs (with the exception of Terminal-Bench 2.0; see below). The same sampling parameters were used for all benchmarking: temperature=0.7 and top_k=20. Some base task images and verifiers were patched to fix infrastructure reliability issues inherent in task setup, such as rate limits on third-party dependencies in external registries used by the verifier. More details outlining these updates and other findings will follow in a future technical blog post.
- SWE-bench Verified: mean pass@1 averaged over 4 runs.
- SWE-bench Multilingual: mean pass@1 averaged over 7 runs.
- SWE-bench Pro: mean pass@1 averaged over 3 runs.
- Terminal-Bench 2.0: mean pass@1 averaged over 5 runs, with sandboxed execution using 48 GB RAM/32 CPUs.
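
As a rough illustration of the metric reported above, the sketch below computes pass@1 for each run as the fraction of tasks resolved in a single attempt, then averages across runs. The helper names and the task outcomes are made up for illustration; this is not the harness's actual scoring code.

```python
from statistics import mean


def pass_at_1(results: list[bool]) -> float:
    """Fraction of tasks solved in one run (one attempt per task)."""
    return sum(results) / len(results)


def mean_pass_at_1(runs: list[list[bool]]) -> float:
    """Average the per-run pass@1 over independent runs."""
    return mean(pass_at_1(run) for run in runs)


# Made-up outcomes for 4 tasks over 3 runs (True = task resolved).
runs = [
    [True, False, True, True],
    [True, True, False, True],
    [False, True, True, True],
]
print(f"mean pass@1: {mean_pass_at_1(runs):.1%}")
```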
## Usage
Laguna XS.2 has launch-day support in vLLM and Transformers, as well as in TRT-LLM thanks to the team at NVIDIA.
The fastest way to get started is with our API, directly or using OpenRouter.
> [!NOTE]
> For complete usage instructions, see the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2).
### Local deployment
For the best experience on your local machine, use Laguna XS.2 with Ollama (with MLX support) or the mlx-lm framework. The sections below cover vLLM, Transformers and TRT-LLM for this NVFP4 checkpoint.
#### vLLM
The full vLLM recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2) and on the [vLLM recipes page](https://recipes.vllm.ai/poolside/Laguna-XS.2). Quantization is detected automatically from `quantization_config` in this checkpoint, so the same command works with `poolside/Laguna-XS.2-NVFP4` substituted for the model ID, with no extra flags required.
> [!NOTE]
> The FP8-quantized KV cache requires vLLM >= 0.22.0. Earlier versions produce scrambled output on non-Hopper GPUs because of a per-layer attention-head count bug, fixed in [vllm#42650](https://github.com/vllm-project/vllm/pull/42650). On older vLLM, disable the FP8 KV cache by adding `--kv-cache-dtype-skip-layers $(seq 0 39)`.
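
The version rule in the note above can be sketched as a small helper that adds the skip-layers flag only for vLLM versions older than 0.22.0. This is a minimal sketch: `parse_version` and `serve_args` are hypothetical helper names, and the base command is reduced to `vllm serve <model>`, so any further flags from the full recipe on the main model card are omitted here. Pre-release suffixes such as `rc1` are ignored by the version parsing.

```python
import re
import shlex

MODEL_ID = "poolside/Laguna-XS.2-NVFP4"
NUM_LAYERS = 40          # from the model overview
FIXED_IN = (0, 22, 0)    # first vLLM version with the FP8 KV cache fix, per the note


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from strings such as '0.22.0' or '0.21.1+cu128'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def serve_args(vllm_version: str) -> list:
    """Build a minimal `vllm serve` argument list (other recipe flags omitted)."""
    args = ["vllm", "serve", MODEL_ID]
    if parse_version(vllm_version) < FIXED_IN:
        # Same effect as `--kv-cache-dtype-skip-layers $(seq 0 39)` in a shell.
        args += ["--kv-cache-dtype-skip-layers", *map(str, range(NUM_LAYERS))]
    return args


print(shlex.join(serve_args("0.21.0")))
print(shlex.join(serve_args("0.22.0")))
```

In practice the installed version can be read with `importlib.metadata.version("vllm")` and passed to `serve_args`.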
#### Transformers
The full Transformers recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2). Substitute `poolside/Laguna-XS.2-NVFP4` for the model ID; quantization is detected automatically from `quantization_config`.
#### TRT-LLM
> [!NOTE]
> Requires building TensorRT-LLM from the upstream PR that adds Laguna XS.2 support
> ([NVIDIA/TensorRT-LLM#13559](https://github.com/NVIDIA/TensorRT-LLM/pull/13559)).
> Once that PR merges, the same code will work on a released `tensorrt-llm` wheel.
The full TRT-LLM recipe, including the `laguna_minimal_overlay.sh` step needed for `transformers 4.57` compatibility, is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2).
Quantization is detected automatically from `quantization_config` in this checkpoint, so no extra flags are required:
```python
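from tensorrt_llm import LLM

# OVERLAY is the overlay built from poolside/Laguna-XS.2-NVFP4 via laguna_minimal_overlay.sh.
# The path below is a placeholder; replace it with the location the script produced.
OVERLAY = "/path/to/overlay"
llm = LLM(model=OVERLAY, trust_remote_code=True)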
```
#### Ollama
Visit [Ollama's model library](https://ollama.com/library/laguna-xs.2) to pull the model to your local machine.
## Controlling reasoning
Laguna XS.2 has native reasoning support and is designed to work best with *preserved thinking*, where `reasoning` content from prior assistant messages is preserved in the message history. This model will generally reason before calling tools and between tool calls.
```python
import json

from openai import OpenAI

client = OpenAI(
    base_url="https://inference.poolside.ai/v1",
    api_key="...",
)
model = "poolside/laguna-xs.2"

tools = [{"type": "function", "function": {
    "name": "shell",
    "description": "Execute a bash command and return the output.",
    "parameters": {"type": "object", "properties": {"cmd": {"type": "string"}}, "required": ["cmd"]},
}}]

messages = [
    {"role": "system", "content": "You are a coding agent with access to a shell tool."},
    {"role": "user", "content": "Run uname -a"},
]

# Thinking is enabled by default when the server sets --default-chat-template-kwargs '{"enable_thinking": true}'
# When using the Poolside API (https://inference.poolside.ai/v1), this flag is set by default
response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    stream=True,
)

# Accumulate streamed reasoning, content and tool-call fragments
reasoning, content, tool_calls = "", "", []
for chunk in response:
    if not chunk.choices:  # skip chunks that carry no choices
        continue
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        reasoning += delta.reasoning_content
    if hasattr(delta, "content") and delta.content:
        content += delta.content
    if hasattr(delta, "tool_calls") and delta.tool_calls:
        for tc in delta.tool_calls:
            # The first fragment of each call carries its id; later ones extend name/arguments
            if tc.index >= len(tool_calls):
                tool_calls.append({"id": tc.id, "function": {"name": "", "arguments": ""}})
            if tc.function.name:
                tool_calls[tc.index]["function"]["name"] = tc.function.name
            if tc.function.arguments:
                tool_calls[tc.index]["function"]["arguments"] += tc.function.arguments

print(f"Reasoning: {reasoning}\nContent: {content}\nTool calls: {tool_calls}\n")

# Return reasoning in the next request for best performance
assistant_message = {
    "role": "assistant",
    "content": content,
    "reasoning_content": reasoning,
}
if tool_calls:
    assistant_message["tool_calls"] = [
        {"id": tc["id"], "type": "function", "function": tc["function"]} for tc in tool_calls
    ]
messages.append(assistant_message)

# Answer every tool call (a canned result stands in for real execution here)
for tc in tool_calls:
    messages.append({
        "role": "tool",
        "tool_call_id": tc["id"],
        "content": json.dumps({"stdout": "Darwin arm64", "exit_code": "0"}),
    })

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    stream=True,
)

reasoning, content = "", ""
for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        reasoning += delta.reasoning_content
    if hasattr(delta, "content") and delta.content:
        content += delta.content

print(f"Reasoning: {reasoning}\nContent: {content}")
```
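The example above returns a canned tool result. As a minimal sketch of what a real tool step might look like, the hypothetical helper `run_tool_calls` below executes each accumulated `shell` call with `subprocess.run` and returns one `tool` message per call, in the same message shape used above. It runs arbitrary commands through the shell, so it is only suitable inside a sandbox. The result fields (`stdout`, `stderr`, `exit_code`, `error`) are an assumption rather than a format the model requires.

```python
import json
import subprocess


def run_tool_calls(tool_calls: list, timeout: float = 30.0) -> list:
    """Run each accumulated `shell` tool call and return one `tool` message per call."""
    messages = []
    for call in tool_calls:
        try:
            cmd = json.loads(call["function"]["arguments"])["cmd"]
            proc = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=timeout
            )
            result = {
                "stdout": proc.stdout,
                "stderr": proc.stderr,
                "exit_code": proc.returncode,
            }
        except (ValueError, KeyError, TypeError, subprocess.TimeoutExpired) as exc:
            # Report malformed arguments or timeouts back to the model instead of raising.
            result = {"error": repr(exc)}
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages


# Same shape as the `tool_calls` list accumulated from the stream above.
example = [{"id": "call_0", "function": {"name": "shell", "arguments": '{"cmd": "echo hello"}'}}]
print(run_tool_calls(example))
```

The returned list can be appended to `messages` with `messages.extend(...)` after the assistant message, so that every tool call in the turn receives a reply.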
### Disabling reasoning
You can disable thinking for a single request by setting `enable_thinking` to `False` in `chat_template_kwargs`, or for the whole server by omitting `--default-chat-template-kwargs '{"enable_thinking": true}'` (or its equivalent) when starting it.
```python
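from openai import OpenAI

# Same endpoint as the example above; point base_url at your own server if self-hosting
client = OpenAI(
    base_url="https://inference.poolside.ai/v1",
    api_key="...",
)

completion = client.chat.completions.create(
    model="poolside/laguna-xs.2",
    messages=[
        {"role": "user", "content": "Write a retry wrapper with exponential backoff."}
    ],
    # Disable thinking for this request only
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},
    },
    stream=True,
)
for chunk in completion:
    if chunk.choices:  # skip chunks that carry no choices
        print(chunk.choices[0].delta)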
```
For agentic coding use cases, we recommend enabling thinking and preserving reasoning in message history as outlined in the [Controlling reasoning](#controlling-reasoning) section.
## License
This model is licensed under the [Apache 2.0 License](https://huggingface.co/poolside/Laguna-XS.2-NVFP4/blob/main/LICENSE.md).
## Intended and Responsible Use
Laguna XS.2-NVFP4 is designed for software engineering and agentic coding use cases, and you are responsible for confirming that it is appropriate for your intended application. Laguna XS.2-NVFP4 is subject to the [Apache 2.0 License](https://huggingface.co/poolside/Laguna-XS.2-NVFP4/blob/main/LICENSE.md), and should be used consistently with Poolside's [Acceptable Use Policy](https://poolside.ai/legal/acceptable-use-policy). We advise against circumventing Laguna XS.2-NVFP4 safety guardrails without implementing substantially equivalent mitigations appropriate for your use case.
Please report security vulnerabilities or safety concerns to [security@poolside.ai](mailto:security@poolside.ai).