---
library_name: vllm
inference: false
extra_gated_description: >-
  To learn more about how we process your personal data, please read our
  Privacy Policy.
tags:
  - laguna-xs.2
license: apache-2.0
pipeline_tag: text-generation
base_model:
  - poolside/Laguna-XS.2
---


Try Laguna XS.2 in Shimmer · Get an API key · Release blog post


# Laguna XS.2-NVFP4

Laguna XS.2-NVFP4 is a 33B total parameter Mixture-of-Experts model with 3B activated parameters per token, designed for agentic coding and long-horizon work on a local machine. It uses Sliding Window Attention with per-head gating in 30 out of 40 layers for fast inference and low KV cache requirements.

> [!NOTE]
> This is the NVFP4 variant with an FP8-quantized KV cache. The [BF16](https://huggingface.co/poolside/Laguna-XS.2), [FP8](https://huggingface.co/poolside/Laguna-XS.2-FP8) and [INT4](https://huggingface.co/poolside/Laguna-XS.2-INT4) variants are also available on Hugging Face.

## Highlights

- **Mixed SWA and global attention layout**: Laguna XS.2 uses sigmoid gating with per-layer rotary scales, enabling mixed SWA (Sliding Window Attention) and global attention layers in a 3:1 ratio (across 40 total layers)
- **KV cache in FP8**: KV cache quantized to FP8, reducing memory per token
- **Native reasoning support**: Interleaved thinking between tool calls with support for enabling and disabling thinking per-request
- **Local-ready**: At 33B total parameters and 3B activated, Laguna XS.2 is compact enough to run on a Mac with 36 GB of RAM. [Available on Ollama](https://ollama.com/library/laguna-xs.2)
- **Apache 2.0 license**: Use and modify freely for commercial and non-commercial purposes

---

## Model overview

- Training: pre-training, post-training and reinforcement learning stages
- Number of parameters: 33B total with 3B activated per token
- Optimizer: Muon
- Layers: 40 layers (10 layers with global attention, 30 layers with sliding window attention)
- Experts: 256 experts with 1 shared expert
- Sliding Window: 512 tokens
- Modality: text-to-text
- Context window: 262,144 tokens
- Reasoning support: interleaved thinking with preserved thinking

## Benchmark results


| Model | Size (total params.) | SWE-bench Verified | SWE-bench Multilingual | SWE-bench Pro (Public Dataset) | Terminal-Bench 2.0 |
|---------------------------|----------------------|--------------------|------------------------|--------------------------------|--------------------|
| **Laguna XS.2 (BF16)** | 33B | 69.9% | 57.7% | 46.3% | 35.7% |
| Devstral Small 2 | 24B dense | 68.0% | 55.7% | - | 22.5% |
| Gemma 4 31B IT | 31B dense | 52.0% | 51.7% | 35.7% | 42.9% |
| Qwen3.5-35B-A3B | 35B | 69.2% | 60.3% | 44.6% | 40.5% |
| Qwen3.6-35B-A3B | 35B | 73.4% | 67.2% | 49.5% | 51.5% |
| Claude Haiku 4.5 | - | 73.3% | - | 39.5% | 29.8% |
| GPT-5.4 Nano | - | - | - | 52.4% | 46.3% |

*We used the highest publicly-referenced scores for all comparison models across each benchmark. In almost all cases these were official scores published in release blog posts or equivalent, with the exception of Gemma 4 31B IT where the highest published scores were [reported by the Qwen team](https://qwen.ai/blog?id=qwen3.6-35b-a3b) and Claude Haiku 4.5 where the highest published (verified) scores for SWE-bench Pro and Terminal-Bench 2.0 are from their respective official leaderboards.*
### Benchmarking methodology

All benchmarking for Laguna XS.2 was completed using the Laude Institute's Harbor Framework with our [agent harness](https://github.com/poolsideai/pool), using a maximum of 500 steps and sandboxed execution with 8 GB RAM/2 CPUs (with the exception of Terminal-Bench 2.0; see below). The same sampling parameters were used for all benchmarking: temperature=0.7 and top_k=20.

Some base task images and verifiers were patched to fix infrastructure reliability issues inherent in task setup, such as rate limits on third-party dependencies in external registries used by the verifier. More details outlining these updates and other findings will follow in a future technical blog post.

- SWE-bench Verified: mean pass@1 averaged over 4 runs.
- SWE-bench Multilingual: mean pass@1 averaged over 7 runs.
- SWE-bench Pro: mean pass@1 averaged over 3 runs.
- Terminal-Bench 2.0: mean pass@1 averaged over 5 runs, using 48 GB RAM/32 CPUs.
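To make the reported metric concrete, here is a minimal sketch of "mean pass@1 averaged over N runs". It assumes each run produces a single pass/fail outcome per task; the function name `mean_pass_at_1` and the toy outcomes are illustrative only and are not taken from the Harbor Framework or the agent harness.

```python
from statistics import mean


def mean_pass_at_1(runs: list[list[bool]]) -> float:
    """Average the per-run pass@1 (fraction of tasks solved in one attempt) across runs."""
    if not runs or any(len(outcomes) == 0 for outcomes in runs):
        raise ValueError("every run needs at least one task outcome")
    per_run = [sum(outcomes) / len(outcomes) for outcomes in runs]
    return mean(per_run)


# Toy data (not real benchmark results): 3 runs over the same 4 tasks.
toy_runs = [
    [True, False, True, True],   # pass@1 = 0.75
    [True, False, False, True],  # pass@1 = 0.50
    [True, True, True, False],   # pass@1 = 0.75
]
score = mean_pass_at_1(toy_runs)
print(f"mean pass@1: {score:.1%}")
```

Averaging per-run scores rather than pooling all attempts keeps each run's contribution equal, which matters only if runs cover different numbers of tasks.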
## Usage

Laguna XS.2 has launch-day support in vLLM and Transformers, as well as in TRT-LLM thanks to the support of the team at NVIDIA. The fastest way to get started is with our API, directly or using OpenRouter.

> [!NOTE]
> For complete usage instructions, see the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2).

### Local deployment

Use Laguna-XS.2 with Ollama (with MLX support) and the mlx-lm framework for the best experience on your local machine.

#### vLLM

The full vLLM recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2) and on the [vLLM recipes page](https://recipes.vllm.ai/poolside/Laguna-XS.2). Quantization is detected automatically from `quantization_config` in this checkpoint, so the same command works with `poolside/Laguna-XS.2-NVFP4` substituted for the model ID. No extra flags are required.

> [!NOTE]
> The FP8-quantized KV cache requires vLLM >= 0.22.0. Earlier versions produce scrambled output on non-Hopper GPUs because of a per-layer attention-head count bug, fixed in [vllm#42650](https://github.com/vllm-project/vllm/pull/42650). On older vLLM, disable the FP8 KV cache by adding `--kv-cache-dtype-skip-layers $(seq 0 39)`.

#### Transformers

The full Transformers recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2). Substitute `poolside/Laguna-XS.2-NVFP4` for the model ID; quantization is detected automatically from `quantization_config`.

#### TRT-LLM

> [!NOTE]
> Requires building TensorRT-LLM from the upstream PR that adds Laguna XS.2 support
> ([NVIDIA/TensorRT-LLM#13559](https://github.com/NVIDIA/TensorRT-LLM/pull/13559)).
> Once that PR merges, the same code will work on a released `tensorrt-llm` wheel.

The full TRT-LLM recipe, including the `laguna_minimal_overlay.sh` step needed for `transformers 4.57` compatibility, is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2). Quantization is detected automatically from `quantization_config` in this checkpoint, so no extra flags are required:

```python
from tensorrt_llm import LLM

# OVERLAY built from poolside/Laguna-XS.2-NVFP4 via laguna_minimal_overlay.sh
llm = LLM(model=OVERLAY, trust_remote_code=True)
```

#### Ollama

Visit [Ollama's model library](https://ollama.com/library/laguna-xs.2) to pull the model to your local machine.

## Controlling reasoning

Laguna XS.2 has native reasoning support and is designed to work best with *preserved thinking*, where `reasoning` content from prior assistant messages is preserved in the message history. This model will generally reason before calling tools and between tool calls.
Example:

```python
import json

from openai import OpenAI

client = OpenAI(
    base_url="https://inference.poolside.ai/v1",
    api_key="...",
)
model = "poolside/laguna-xs.2"

tools = [{
    "type": "function",
    "function": {
        "name": "shell",
        "description": "Execute a bash command and return the output.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a coding agent with access to a shell tool."},
    {"role": "user", "content": "Run uname -a"},
]

# Thinking is enabled by default when the server sets --default-chat-template-kwargs {"enable_thinking": True}
# When using the Poolside API (https://inference.poolside.ai/v1), this flag is set by default
response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    stream=True,
)

# Accumulate streamed reasoning, content and tool-call fragments
reasoning, content, tool_calls = "", "", []
for chunk in response:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        reasoning += delta.reasoning_content
    if hasattr(delta, "content") and delta.content:
        content += delta.content
    if hasattr(delta, "tool_calls") and delta.tool_calls:
        for tc in delta.tool_calls:
            if tc.index >= len(tool_calls):
                tool_calls.append({"id": tc.id, "function": {"name": "", "arguments": ""}})
            if tc.function.name:
                tool_calls[tc.index]["function"]["name"] = tc.function.name
            if tc.function.arguments:
                tool_calls[tc.index]["function"]["arguments"] += tc.function.arguments

print(f"Reasoning: {reasoning}\nContent: {content}\nTool calls: {tool_calls}\n")

# Return reasoning in the next request for best performance
messages.append({
    "role": "assistant",
    "content": content,
    "reasoning_content": reasoning,
    "tool_calls": [
        {"id": tc["id"], "type": "function", "function": tc["function"]}
        for tc in tool_calls
    ],
})
messages.append({
    "role": "tool",
    "tool_call_id": tool_calls[0]["id"],
    "content": json.dumps({"stdout": "Darwin arm64", "exit_code": "0"}),
})

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    stream=True,
)

reasoning, content = "", ""
for chunk in response:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        reasoning += delta.reasoning_content
    if hasattr(delta, "content") and delta.content:
        content += delta.content

print(f"Reasoning: {reasoning}\nContent: {content}")
```
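The example above answers only the first tool call. When a turn contains several tool calls, the same preserved-thinking pattern can be factored into a small helper. This is a minimal sketch and not part of any Poolside SDK: the function name `extend_history` and the `results` mapping (tool call ID to result payload) are hypothetical, and the message fields mirror those used in the example above.

```python
import json


def extend_history(messages, content, reasoning, tool_calls, results):
    """Append the assistant turn (with its reasoning) and one tool message per call.

    tool_calls: list of {"id": ..., "function": {"name": ..., "arguments": ...}}
    results: dict mapping tool call id -> JSON-serializable result (hypothetical shape)
    """
    messages.append({
        "role": "assistant",
        "content": content,
        "reasoning_content": reasoning,  # preserved thinking from this turn
        "tool_calls": [
            {"id": tc["id"], "type": "function", "function": tc["function"]}
            for tc in tool_calls
        ],
    })
    # Reply to every tool call, not just the first one
    for tc in tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "content": json.dumps(results[tc["id"]]),
        })
    return messages


history = [{"role": "user", "content": "Run uname -a"}]
calls = [{"id": "call_1", "function": {"name": "shell", "arguments": '{"cmd": "uname -a"}'}}]
extend_history(
    history,
    content="",
    reasoning="I should run the command.",
    tool_calls=calls,
    results={"call_1": {"stdout": "Darwin arm64", "exit_code": "0"}},
)
print(json.dumps(history, indent=2))
```

The returned list can be passed as `messages` in the next request, so the reasoning from the previous turn stays in the history.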
### Disabling reasoning

You can disable thinking by setting `enable_thinking` to `False` in a request or by not providing `--default-chat-template-kwargs {"enable_thinking": True}` or equivalent when starting the server.
Example:

```python
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="poolside/laguna-xs.2",
    messages=[
        {"role": "user", "content": "Write a retry wrapper with exponential backoff."}
    ],
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": False
        },
    },
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta)
```
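Because the toggle is a per-request field, a thin wrapper can keep it out of individual call sites. The sketch below only builds keyword arguments and makes no network call; `chat_request` is a hypothetical helper name, and the `chat_template_kwargs` payload is the one shown in the example above.

```python
def chat_request(model, messages, thinking=True, **extra):
    """Build kwargs for client.chat.completions.create with thinking toggled per request."""
    kwargs = {"model": model, "messages": messages, **extra}
    # Overrides any extra_body passed via **extra; merge by hand if you need both.
    kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": thinking}}
    return kwargs


request = chat_request(
    "poolside/laguna-xs.2",
    [{"role": "user", "content": "Write a retry wrapper with exponential backoff."}],
    thinking=False,
    stream=True,
)
print(request["extra_body"])
```

The result can be unpacked into the client call, for example `client.chat.completions.create(**request)`.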
For agentic coding use cases, we recommend enabling thinking and preserving reasoning in message history as outlined in the [Controlling reasoning](#controlling-reasoning) section.

## License

This model is licensed under the [Apache 2.0 License](https://huggingface.co/poolside/Laguna-XS.2-NVFP4/blob/main/LICENSE.md).

## Intended and Responsible Use

Laguna XS.2-NVFP4 is designed for software engineering and agentic coding use cases, and you are responsible for confirming that it is appropriate for your intended application. Laguna XS.2-NVFP4 is subject to the [Apache 2.0 License](https://huggingface.co/poolside/Laguna-XS.2-NVFP4/blob/main/LICENSE.md), and should be used consistently with Poolside's [Acceptable Use Policy](https://poolside.ai/legal/acceptable-use-policy). We advise against circumventing Laguna XS.2-NVFP4 safety guardrails without implementing substantially equivalent mitigations appropriate for your use case.

Please report security vulnerabilities or safety concerns to [security@poolside.ai](mailto:security@poolside.ai).