---
license: apache-2.0
base_model: Qwen/Qwen3.5-9B-Base
tags:
- code
- reasoning
- distillation
- reinforcement-learning
- long-context
- claude-code
- openai-codex
- quantum-entropy
- merlin-research
language:
- en
pipeline_tag: text-generation
---

# Pluto

![Pluto](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/yEhR_aUdMvbHKMuhiXvB7.jpeg)

[![License](https://img.shields.io/badge/License-Apache_2.0-green?style=for-the-badge)](https://www.apache.org/licenses/LICENSE-2.0)
[![IBM Quantum](https://img.shields.io/badge/IBM_Quantum-Kingston_156Q-7c3aed?style=for-the-badge)](https://quantum.ibm.com)
[![Training Hardware](https://img.shields.io/badge/Training_HW-Google_TPU_TRC-dc2626?style=for-the-badge)](https://sites.research.google/trc/)

**Pluto** is a 9B-parameter coding and reasoning model developed by [Merlin Research](https://huggingface.co/MerlinSafety), built for precision, robustness, and seamless deployment in agentic coding environments, including Claude Code, OpenAI Codex, and local large-codebase workflows.

---

## Model Summary

![benchmarks](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/rduiP2UeMrpMgcIfTIEm6.png)

| Property | Value |
|---|---|
| **Developer** | Merlin Research |
| **Base Model** | Qwen/Qwen3.5-9B-Base |
| **Parameters** | 9B |
| **Context Length** | 1,000,000 tokens |
| **Training** | SFT + RL with Adaptive Entropy Regularization |
| **Distillation** | Frontier coding models |
| **Compute** | Google Cloud (TPU/GPU via Google TRC Research Grant) |
| **Quantum** | IBM Quantum Kingston (Heron r2), entropy noise injection |
| **License** | Apache 2.0 |

---

## Key Features

### 🎯 Precision-First Design

Pluto is trained to minimize errors rather than maximize fluency. Every training signal, from distillation targets to RL reward shaping, is oriented around correctness rather than surface-level coherence.
This makes Pluto particularly effective for tasks where a single wrong line of code has downstream consequences.

### 🔭 1M Token Context

Pluto supports up to **1,000,000 tokens** of context, enabling operation on large codebases without chunking or retrieval workarounds. Feed it an entire repository, a multi-file diff, or a long conversation history; Pluto maintains coherent reasoning across the full window.

### 🤖 Agentic Deployment Ready

Pluto is fine-tuned specifically for deployment in:

- **Claude Code**: system prompt formatting, tool-call patterns, multi-turn agentic loops
- **OpenAI Codex / Assistants API**: compatible message structure and function-calling behavior
- **Local deployment**: GGUF and quantized variants available for running against large local codebases without API latency

### ⚛️ Quantum Entropy Regularization (AER)

During RL training, Pluto used **Adaptive Entropy Regularization (AER)** with quantum noise sourced from the **IBM Quantum Kingston** processor (Heron r2, 156 qubits). Bitstring measurements from entangled quantum states were used to modulate the per-token entropy coefficient λ(t) during GRPO training, providing:

- Resistance to entropy collapse and reward hacking
- Improved robustness on out-of-distribution inputs
- More stable training dynamics across long RL runs

To our knowledge, this makes Pluto the first production coding model trained with quantum hardware-sourced entropy regularization.

### 📚 Distillation from Frontier Models

Pluto was trained using knowledge distillation from multiple frontier coding models, combined with a curated private dataset of advanced reasoning traces. The distillation pipeline transfers deep reasoning chains from teacher models while keeping inference cost at the 9B scale.
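The distillation approach described above is, in standard form, a temperature-scaled KL divergence between the teacher's and student's next-token distributions. The following framework-agnostic sketch illustrates that objective; it is a minimal illustration, not Pluto's actual (unpublished) training recipe, and the temperature value is an assumption.

```python
import math

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence KL(teacher || student), a standard
    knowledge-distillation objective. Illustrative sketch only; the exact
    recipe used for Pluto is not published."""
    t = temperature

    def softmax(xs):
        m = max(xs)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    p = softmax([x / t for x in teacher_logits])  # softened teacher distribution
    q = softmax([x / t for x in student_logits])  # softened student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * t * t  # scale by t^2 so gradient magnitudes stay comparable
```

When student and teacher logits agree, the loss is zero; any mismatch yields a positive penalty, pushing the 9B student toward the teacher's distribution.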
---

## Quickstart

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "MerlinSafety/Pluto"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that parses a JWT token without external libraries and validates the expiry timestamp."
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.1,
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

### With Unsloth (faster inference, 4-bit)

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MerlinSafety/Pluto",
    max_seq_length=131072,  # adjust as needed
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Refactor this function to be async and add proper error handling:\n\ndef fetch_data(url):\n    import requests\n    return requests.get(url).json()"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,
    temperature=0.6,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

### GGUF / llama.cpp (local deployment)

```bash
# Download Q4_K_M (recommended, ~5.4GB)
huggingface-cli download MerlinSafety/Pluto \
  Pluto-Q4_K_M.gguf \
  --local-dir ./pluto

# Download Q8_0 (higher quality, ~9.4GB)
huggingface-cli download MerlinSafety/Pluto \
  Pluto-Q8_0.gguf \
  --local-dir ./pluto

# Run with llama.cpp
./llama-cli \
  -m ./pluto/Pluto-Q4_K_M.gguf \
  -p "Explain the time complexity of this algorithm and suggest optimizations:\n[your code here]" \
  -n 1024 \
  --temp 0.6 \
  --top-p 0.95 \
  -c 8192
```

### Ollama

```bash
cat > Modelfile << 'EOF'
FROM ./Pluto-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
EOF

ollama create pluto -f Modelfile
ollama run pluto "Write a thread-safe singleton implementation in Python"
```

---

## Claude Code Integration

Pluto can serve as a local backend for Claude Code by pointing the client at a local OpenAI-compatible server. The invocation below is illustrative; check your Claude Code version's documentation for the exact way to configure a custom endpoint:

```bash
# Start a local server (example with llama.cpp server)
./llama-server \
  -m ./pluto/Pluto-Q4_K_M.gguf \
  --port 8080 \
  -c 32768 \
  --chat-template qwen

# Use with Claude Code
claude --model http://localhost:8080 "Review this PR and identify potential bugs"
```

---

## OpenAI Codex / Assistants API Integration

Pluto's instruction format is compatible with the OpenAI Chat Completions API when served through a compatible endpoint:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your local Pluto server
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="pluto",
    messages=[
        {
            "role": "user",
            "content": "Write a SQL query to find the top 5 customers by revenue in the last 30 days, handling NULL values correctly."
        }
    ],
    max_tokens=1024,
    temperature=0.6,
)

print(response.choices[0].message.content)
```

---

## Training Details

### Pipeline Overview

```
Qwen/Qwen3.5-9B-Base
        │
        ▼
SFT on curated advanced reasoning + coding dataset
(private dataset, distillation from frontier models)
        │
        ▼
GRPO Reinforcement Learning
with Adaptive Entropy Regularization (AER)
+ IBM Quantum Kingston entropy noise injection
        │
        ▼
Long-context fine-tuning (1M token extension)
        │
        ▼
Agentic deployment fine-tuning
(Claude Code + Codex format alignment)
        │
        ▼
Pluto 9B
```

### Adaptive Entropy Regularization (AER)

During RL training, the loss function was modified as:

```
L_total = L_RL + λ(t) · L_entropy
```

where `λ(t)` is a dynamic coefficient modulated by quantum bitstring measurements from the IBM Quantum Kingston (Heron r2) processor. GHZ-state measurements provided true quantum randomness that guided the per-token entropy targets, preventing entropy collapse and improving robustness.

### Compute

Training was conducted on Google Cloud TPU/GPU infrastructure supported by a **Google TPU Research Cloud (TRC) grant** awarded to Merlin Research.
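The AER objective above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the bitstring-to-λ(t) mapping, the λ range, and the simulated bitstrings are all hypothetical; the actual schedule used in Pluto's GRPO runs (and the hardware sampling pipeline) is not published.

```python
def aer_coefficient(bitstring, lam_min=0.001, lam_max=0.01):
    """Map a measured bitstring (e.g. from a GHZ-state readout) to an
    entropy coefficient lambda(t). The linear mapping and the lambda
    range here are illustrative assumptions, not Pluto's actual values."""
    frac = int(bitstring, 2) / (2 ** len(bitstring) - 1)  # normalize to [0, 1]
    return lam_min + frac * (lam_max - lam_min)

def aer_total_loss(rl_loss, entropy_loss, bitstring):
    """L_total = L_RL + lambda(t) * L_entropy, with lambda(t) driven by
    hardware-sourced randomness (here supplied as a plain bitstring)."""
    return rl_loss + aer_coefficient(bitstring) * entropy_loss
```

Because λ(t) jitters between λ_min and λ_max rather than staying fixed, the policy cannot settle into a degenerate low-entropy regime that exploits a constant regularizer, which is the mechanism behind the entropy-collapse resistance claimed above.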
---

## Intended Use

- Complex code generation and refactoring
- Multi-file codebase analysis
- Agentic coding pipelines (Claude Code, Codex)
- Code review and bug detection
- Architecture planning and technical reasoning
- Local deployment with large private codebases

---

## Limitations

- Pluto is optimized for coding and technical reasoning; general conversation and creative tasks are outside its primary design goal
- Like all LLMs, Pluto can produce incorrect code; always review generated output before deploying to production
- Performance on very niche frameworks or proprietary APIs may be limited by training data coverage
- The quantum entropy component provides training-time benefits only; inference behavior is classical

---

## Citation

```bibtex
@misc{pluto-2026,
  title={Pluto: Precision Coding and Reasoning Model with Quantum Entropy Regularization},
  author={Merlin Research},
  year={2026},
  publisher={Merlin Research},
  url={https://huggingface.co/MerlinSafety/Pluto}
}
```

---

## About Merlin Research

[Merlin Research](https://huggingface.co/MerlinSafety) is an independent AI safety laboratory based in Stockholm, Sweden, focused on open-source model development, adaptive entropy regularization, and practical AI alignment. Our models are released publicly to advance accessible, safe, and high-quality AI for the research community.

**HuggingFace:** [huggingface.co/MerlinSafety](https://huggingface.co/MerlinSafety)
**Contact:** MerlinResearch@protonmail.com