	---
	library_name: vllm
	inference: false
	extra_gated_description: >-
	To learn more about how we process your personal data, please read our <a
	href="https://poolside.ai/legal/privacy">Privacy Policy</a>.
	tags:
	- laguna-xs.2
	license: apache-2.0
	pipeline_tag: text-generation
	base_model:
	- poolside/Laguna-XS.2
	---

	<p align="center">
	<img alt="poolside-banner" src="https://poolside.ai/assets/laguna/laguna-xs2-banner.svg" width="800px">
	</p>

	<p align="center">
	<a href="https://shimmer.poolside.ai"><strong>Try Laguna XS.2 in Shimmer</strong></a> ·
	<a href="https://platform.poolside.ai"><strong>Get an API key</strong></a> ·
	<a href="https://poolside.ai/blog/laguna-a-deeper-dive"><strong>Release blog post</strong></a>
	</p>

	<br>

	# Laguna XS.2-NVFP4
	Laguna XS.2-NVFP4 is a 33B total parameter Mixture-of-Experts model with 3B activated parameters per token designed for agentic coding and long-horizon work on a local machine. It uses Sliding Window Attention with per-head gating in 30 out of 40 layers for fast inference and low KV cache requirements.

	> [!NOTE]
	> This is the NVFP4 variant with an FP8-quantized KV cache. The [BF16](https://huggingface.co/poolside/Laguna-XS.2), [FP8](https://huggingface.co/poolside/Laguna-XS.2-FP8) and [INT4](https://huggingface.co/poolside/Laguna-XS.2-INT4) variants are also available on Hugging Face.

	## Highlights
	- Mixed SWA and global attention layout: Laguna XS.2 uses sigmoid gating with per-layer rotary scales, enabling mixed SWA (Sliding Window Attention) and global attention layers in a 3:1 ratio (across 40 total layers)
	- KV cache in FP8: KV cache quantized to FP8, reducing memory per token
	- Native reasoning support: Interleaved thinking between tool calls with support for enabling and disabling thinking per-request
	- Local-ready: At 33B total parameters and 3B activated, Laguna XS.2 is compact enough to run on a Mac with 36 GB of RAM. [Available on Ollama](https://ollama.com/library/laguna-xs.2)
	- Apache 2.0 license: Use and modify freely for commercial and non-commercial purposes

	---

	## Model overview

	- Training: pre-training, post-training and reinforcement learning stages
	- Number of parameters: 33B total with 3B activated per token
	- Optimizer: Muon
	- Layers: 40 layers (10 layers with global attention, 30 layers with sliding window attention)
	- Experts: 256 experts with 1 shared expert
	- Sliding Window: 512 tokens
	- Modality: text-to-text
	- Context window: 262,144 tokens
	- Reasoning support: interleaved thinking with preserved thinking
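
To make the effect of the 10/30 layer split on KV cache size concrete, here is a rough sketch that counts cached token positions across layers. It assumes a sliding-window layer keeps only the most recent 512 positions while a global layer keeps the full context. It counts positions rather than bytes, because per-layer head counts and head dimensions are not listed in this card. The function names are illustrative only.

```python
GLOBAL_LAYERS = 10        # layers with global attention (from the overview above)
SWA_LAYERS = 30           # layers with sliding window attention
WINDOW = 512              # sliding window size in tokens
CONTEXT_WINDOW = 262_144  # maximum context length in tokens


def cached_positions(context_len: int) -> int:
    """Token positions held in the KV cache, summed over all layers.

    Assumption: an SWA layer keeps at most WINDOW positions,
    while a global layer keeps every position in the context.
    """
    return GLOBAL_LAYERS * context_len + SWA_LAYERS * min(context_len, WINDOW)


def fraction_of_all_global(context_len: int) -> float:
    """Cached positions relative to a hypothetical layout where all layers are global."""
    total_layers = GLOBAL_LAYERS + SWA_LAYERS
    return cached_positions(context_len) / (total_layers * context_len)


full = cached_positions(CONTEXT_WINDOW)
ratio = fraction_of_all_global(CONTEXT_WINDOW)
print(f"{full:,} cached positions, {ratio:.1%} of an all-global layout")
```

Under these assumptions, the mixed layout caches roughly a quarter of the positions an all-global layout would at full context. Storing the cache in FP8 (one byte per element) then halves the bytes per cached element relative to BF16 (two bytes per element).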

	## Benchmark results

	<p align="center">
	<img alt="benchmarks" src="https://poolside.ai/assets/laguna/laguna-xs2-chart.svg" width="800px">
	</p>

	| Model | Size (total params.) | SWE-bench Verified | SWE-bench Multilingual | SWE-bench Pro (Public Dataset) | Terminal-Bench 2.0 |
	|---------------------------|----------------------|--------------------|------------------------|--------------------------------|--------------------|
	| Laguna XS.2 (BF16) | 33B | 69.9% | 57.7% | 46.3% | 35.7% |
	| Devstral Small 2 | 24B dense | 68.0% | 55.7% | - | 22.5% |
	| Gemma 4 31B IT | 31B dense | 52.0% | 51.7% | 35.7% | 42.9% |
	| Qwen3.5-35B-A3B | 35B | 69.2% | 60.3% | 44.6% | 40.5% |
	| Qwen3.6-35B-A3B | 35B | 73.4% | 67.2% | 49.5% | 51.5% |
	| Claude Haiku 4.5 | - | 73.3% | - | 39.5% | 29.8% |
	| GPT-5.4 Nano | - | - | - | 52.4% | 46.3% |

	We used the highest publicly referenced scores for all comparison models on each benchmark. In almost all cases these were official scores published in release blog posts or equivalent. The exceptions are Gemma 4 31B IT, where the highest published scores were [reported by the Qwen team](https://qwen.ai/blog?id=qwen3.6-35b-a3b), and Claude Haiku 4.5, where the highest published (verified) scores for SWE-bench Pro and Terminal-Bench 2.0 are from their respective official leaderboards.

	<details>
	<summary>Expand for benchmarking methodology</summary>

	All benchmarking for Laguna XS.2 was completed using the Laude Institute’s Harbor Framework with our [agent harness](https://github.com/poolsideai/pool), using a maximum of 500 steps and sandboxed execution using 8 GB RAM/2 CPUs (with the exception of Terminal-Bench 2.0; see below). The same sampling parameters were used for all benchmarking: temperature=0.7 and top_k=20. Some base task images and verifiers were patched to fix infrastructure reliability issues inherent in task setup, such as rate limits on third-party dependencies in external registries used by the verifier. More details outlining these updates and other findings will follow in a future technical blog post.

	- SWE-bench Verified: mean pass@1 averaged over 4 runs.
	- SWE-bench Multilingual: mean pass@1 averaged over 7 runs.
	- SWE-bench Pro: mean pass@1 averaged over 3 runs.
	- Terminal-Bench 2.0: mean pass@1 averaged over 5 runs, with sandboxes using 48 GB RAM/32 CPUs.
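
As a reading aid for the "mean pass@1 averaged over N runs" figures, the sketch below shows one plausible way to compute such a number. It assumes that pass@1 for a run is the fraction of tasks resolved with a single attempt per task, and that the reported score is the unweighted mean across runs. The helper names and the toy outcomes are made up for illustration and are not benchmark data.

```python
from statistics import mean


def pass_at_1(outcomes: list[bool]) -> float:
    """Fraction of tasks resolved in one run (one attempt per task)."""
    if not outcomes:
        raise ValueError("a run must contain at least one task outcome")
    return sum(outcomes) / len(outcomes)


def mean_pass_at_1(runs: list[list[bool]]) -> float:
    """Unweighted mean of per-run pass@1 across independent runs."""
    return mean(pass_at_1(run) for run in runs)


# Toy outcomes for 3 runs over the same 4 tasks (illustrative only).
toy_runs = [
    [True, False, True, True],
    [True, True, False, False],
    [True, False, True, False],
]
score = mean_pass_at_1(toy_runs)
print(f"mean pass@1 = {score:.1%}")
```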

	</details>

	## Usage

	Laguna XS.2 has launch-day support in vLLM and Transformers, and in TRT-LLM thanks to the team at NVIDIA.

	The fastest way to get started is with our API, directly or using OpenRouter.

	> [!NOTE]
	> For complete usage instructions, see the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2).

	### Local deployment

	Laguna XS.2 is supported in vLLM and Transformers, and in TRT-LLM thanks to the team at NVIDIA. For the best experience on your local machine, use Laguna XS.2 with Ollama (with MLX support) and the mlx-lm framework.

	#### vLLM

	The full vLLM recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2) and on the [vLLM recipes page](https://recipes.vllm.ai/poolside/Laguna-XS.2). Quantization is detected automatically from `quantization_config` in this checkpoint, so the same command works with `poolside/Laguna-XS.2-NVFP4` substituted for the model ID, and no extra flags are required.

	> [!NOTE]
	> The FP8-quantized KV cache requires vLLM >= 0.22.0. Earlier versions produce scrambled output on non-Hopper GPUs because of a per-layer attention-head count bug, fixed in [vllm#42650](https://github.com/vllm-project/vllm/pull/42650). On older vLLM, disable the FP8 KV cache by adding `--kv-cache-dtype-skip-layers $(seq 0 39)`.
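
If you launch vLLM from a script, the version rule in the note above can be applied programmatically. The following is a minimal sketch, not part of the official recipe: the helper names are hypothetical, the version parsing is deliberately simple (a pre-release such as `0.22.0rc1` is treated as `0.22.0`), and the only flag used is the one quoted in the note, with `$(seq 0 39)` expanded to the layer indices 0 through 39.

```python
import re
from importlib import metadata

MIN_VLLM_FOR_FP8_KV = (0, 22, 0)
NUM_LAYERS = 40  # layer indices 0..39, matching `$(seq 0 39)` in the note


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release, e.g. '0.21.3+cu128' -> (0, 21, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group().split(".")]
    parts += [0] * (3 - len(parts))  # pad '0.22' to (0, 22, 0)
    return tuple(parts)


def extra_vllm_args(vllm_version: str) -> list[str]:
    """Extra CLI arguments needed for this checkpoint on the given vLLM version."""
    if parse_release(vllm_version) >= MIN_VLLM_FOR_FP8_KV:
        return []  # FP8 KV cache works as-is
    # Older vLLM: skip FP8 KV cache quantization on every layer.
    return ["--kv-cache-dtype-skip-layers", *(str(i) for i in range(NUM_LAYERS))]


def installed_vllm_version() -> str:
    """Version of the vLLM package in the current environment."""
    return metadata.version("vllm")


# Example with explicit version strings (no vLLM install needed to run this).
for version in ("0.21.3", "0.22.0"):
    command = ["vllm", "serve", "poolside/Laguna-XS.2-NVFP4", *extra_vllm_args(version)]
    print(version, "->", len(command) - 3, "extra argument(s)")
```

In a real launcher you would pass `installed_vllm_version()` to `extra_vllm_args` instead of a literal string.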

	#### Transformers

	The full Transformers recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2). Substitute `poolside/Laguna-XS.2-NVFP4` for the model ID; quantization is detected automatically from `quantization_config`.

	#### TRT-LLM

	> [!NOTE]
	> Requires building TensorRT-LLM from the upstream PR that adds Laguna XS.2 support
	> ([NVIDIA/TensorRT-LLM#13559](https://github.com/NVIDIA/TensorRT-LLM/pull/13559)).
	> Once that PR merges, the same code will work on a released `tensorrt-llm` wheel.

	The full TRT-LLM recipe, including the `laguna_minimal_overlay.sh` step needed for `transformers 4.57` compatibility, is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2).
	Quantization is detected automatically from `quantization_config` in this checkpoint, so no extra flags are required:

	```python
	from tensorrt_llm import LLM

	# OVERLAY built from poolside/Laguna-XS.2-NVFP4 via laguna_minimal_overlay.sh
	llm = LLM(model=OVERLAY, trust_remote_code=True)
	```

	#### Ollama

	Visit [Ollama's model library](https://ollama.com/library/laguna-xs.2) to pull to your local machine.

	## Controlling reasoning

	Laguna XS.2 has native reasoning support and is designed to work best with preserved thinking, where `reasoning` content from prior assistant messages is preserved in the message history. This model will generally reason before calling tools and between tool calls.

	<details>
	<summary>Expand for example</summary>

	```python
	import json
	from openai import OpenAI

	client = OpenAI(
	    base_url="https://inference.poolside.ai/v1",
	    api_key="...",
	)

	model = "poolside/laguna-xs.2"

	tools = [{"type": "function", "function": {
	    "name": "shell",
	    "description": "Execute a bash command and return the output.",
	    "parameters": {"type": "object", "properties": {"cmd": {"type": "string"}}, "required": ["cmd"]},
	}}]

	messages = [
	    {"role": "system", "content": "You are a coding agent with access to a shell tool."},
	    {"role": "user", "content": "Run uname -a"},
	]

	# Thinking is enabled by default when the server sets --default-chat-template-kwargs {"enable_thinking": True}
	# When using the Poolside API (https://inference.poolside.ai/v1), this flag is set by default
	response = client.chat.completions.create(
	    model=model,
	    messages=messages,
	    tools=tools,
	    stream=True,
	)

	reasoning, content, tool_calls = "", "", []
	for chunk in response:
	    delta = chunk.choices[0].delta
	    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
	        reasoning += delta.reasoning_content
	    if hasattr(delta, "content") and delta.content:
	        content += delta.content
	    if hasattr(delta, "tool_calls") and delta.tool_calls:
	        for tc in delta.tool_calls:
	            if tc.index >= len(tool_calls):
	                tool_calls.append({"id": tc.id, "function": {"name": "", "arguments": ""}})
	            if tc.function.name:
	                tool_calls[tc.index]["function"]["name"] = tc.function.name
	            if tc.function.arguments:
	                tool_calls[tc.index]["function"]["arguments"] += tc.function.arguments

	print(f"Reasoning: {reasoning}\nContent: {content}\nTool calls: {tool_calls}\n")

	# Return reasoning in the next request for best performance
	messages.append({
	    "role": "assistant",
	    "content": content,
	    "reasoning_content": reasoning,
	    "tool_calls": [{"id": tc["id"], "type": "function", "function": tc["function"]} for tc in tool_calls],
	})

	messages.append({
	    "role": "tool",
	    "tool_call_id": tool_calls[0]["id"],
	    "content": json.dumps({"stdout": "Darwin arm64", "exit_code": "0"}),
	})

	response = client.chat.completions.create(
	    model=model,
	    messages=messages,
	    tools=tools,
	    stream=True,
	)

	reasoning, content = "", ""
	for chunk in response:
	    delta = chunk.choices[0].delta
	    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
	        reasoning += delta.reasoning_content
	    if hasattr(delta, "content") and delta.content:
	        content += delta.content

	print(f"Reasoning: {reasoning}\nContent: {content}")
	```

	</details>

	### Disabling reasoning

	You can disable thinking by setting `enable_thinking` to `False` in a request, or by omitting `--default-chat-template-kwargs {"enable_thinking": True}` (or its equivalent) when starting the server.

	<details>
	<summary>Expand for example</summary>

	```python
	from openai import OpenAI

	# Assumes OPENAI_BASE_URL and OPENAI_API_KEY point at a server hosting the model
	client = OpenAI()

	completion = client.chat.completions.create(
	    model="poolside/laguna-xs.2",
	    messages=[
	        {"role": "user", "content": "Write a retry wrapper with exponential backoff."}
	    ],
	    # chat_template_kwargs is forwarded to the chat template for this request only
	    extra_body={
	        "chat_template_kwargs": {"enable_thinking": False},
	    },
	    stream=True,
	)

	for chunk in completion:
	    print(chunk.choices[0].delta)
	```

	</details>

	For agentic coding use cases, we recommend enabling thinking and preserving reasoning in message history as outlined in the [Controlling reasoning](#controlling-reasoning) section.

	## License

	This model is licensed under the [Apache 2.0 License](https://huggingface.co/poolside/Laguna-XS.2-NVFP4/blob/main/LICENSE.md).

	## Intended and Responsible Use

	Laguna XS.2-NVFP4 is designed for software engineering and agentic coding use cases, and you are responsible for confirming that it is appropriate for your intended application. Laguna XS.2-NVFP4 is subject to the [Apache 2.0 License](https://huggingface.co/poolside/Laguna-XS.2-NVFP4/blob/main/LICENSE.md), and should be used consistently with Poolside's [Acceptable Use Policy](https://poolside.ai/legal/acceptable-use-policy). We advise against circumventing Laguna XS.2-NVFP4 safety guardrails without implementing substantially equivalent mitigations appropriate for your use case.

	Please report security vulnerabilities or safety concerns to [security@poolside.ai](mailto:security@poolside.ai).