---
license: apache-2.0
base_model: Qwen/Qwen3.5-9B-Base
tags:
- code
- reasoning
- distillation
- reinforcement-learning
- long-context
- claude-code
- openai-codex
- quantum-entropy
- merlin-research
language:
- en
pipeline_tag: text-generation
---

# Pluto

![IMAGE 2026-03-22 02:04:31](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/yEhR_aUdMvbHKMuhiXvB7.jpeg)

[![License](https://img.shields.io/badge/License-Apache_2.0-green?style=for-the-badge)](https://www.apache.org/licenses/LICENSE-2.0)

[![IBM Quantum](https://img.shields.io/badge/IBM_Quantum-Kingston_156Q-7c3aed?style=for-the-badge)](https://quantum.ibm.com)

[![Training Hardware](https://img.shields.io/badge/Training_HW-Google_TPU_TRC-dc2626?style=for-the-badge)](https://sites.research.google/trc/)

**Pluto** is a 9B parameter coding and reasoning model developed by [Merlin Research](https://huggingface.co/MerlinSafety), built for precision, robustness, and seamless deployment in agentic coding environments including Claude Code, OpenAI Codex, and local large-codebase workflows.

---

## Model Summary

![benchmarks](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/rduiP2UeMrpMgcIfTIEm6.png)

| Property | Value |
|---|---|
| **Developer** | Merlin Research |
| **Base Model** | Qwen/Qwen3.5-9B-Base |
| **Parameters** | 9B |
| **Context Length** | 1,000,000 tokens |
| **Training** | SFT + RL with Adaptive Entropy Regularization |
| **Distillation** | Frontier coding models |
| **Compute** | Google Cloud (TPU/GPU via Google TRC Research Grant) |
| **Quantum** | IBM Quantum Kingston (Heron r2), entropy noise injection |
| **License** | Apache 2.0 |

---

## Key Features

### 🎯 Precision-First Design
Pluto is trained to minimize errors rather than maximize fluency. Every training signal, from distillation targets to RL reward shaping, is oriented around correctness rather than surface-level coherence. This makes Pluto particularly effective for tasks where a single wrong line of code has downstream consequences.

### 🔭 1M Token Context
Pluto supports up to **1,000,000 tokens** of context, enabling operation on large codebases without chunking or retrieval hacks. Feed it an entire repository, a multi-file diff, or a long conversation history; Pluto maintains coherent reasoning across the full window.
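As a minimal sketch of chunk-free usage, the helper below concatenates a repository's source files into a single prompt with per-file headers. The extension filter and the rough 4-characters-per-token budget are illustrative assumptions, not part of Pluto's tooling.

```python
from pathlib import Path

MAX_CHARS = 4_000_000  # rough proxy for ~1M tokens at ~4 chars/token


def pack_repo(root, exts=(".py", ".md", ".toml")):
    """Concatenate matching files under `root` into one prompt string,
    adding a per-file header and stopping at the character budget."""
    parts, total = [], 0
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in exts or not path.is_file():
            continue
        text = path.read_text(errors="replace")
        chunk = f"### FILE: {path}\n{text}\n"
        if total + len(chunk) > MAX_CHARS:
            break  # stop before overflowing the context budget
        parts.append(chunk)
        total += len(chunk)
    return "".join(parts)
```

The resulting string can be passed directly as the user message in any of the Quickstart examples below.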

### 🤖 Agentic Deployment Ready
Pluto is fine-tuned specifically for deployment in:
- **Claude Code**: system prompt formatting, tool-call patterns, multi-turn agentic loops
- **OpenAI Codex / Assistants API**: compatible message structure and function-calling behavior
- **Local deployment**: GGUF and quantized variants available for running against large local codebases without API latency

### βš›οΈ Quantum Entropy Regularization (AER)
During RL training, Pluto used **Adaptive Entropy Regularization (AER)** with quantum noise sourced from the **IBM Quantum Kingston** processor (Heron r2, 156 qubits). Bitstring measurements from entangled quantum states were used to modulate the per-token entropy coefficient Ξ»(t) during GRPO training, providing:
- Resistance to entropy collapse and reward hacking
- Improved robustness on out-of-distribution inputs
- More stable training dynamics across long RL runs

To our knowledge, this makes Pluto the first production coding model trained with quantum hardware-sourced entropy regularization.

### 📚 Distillation from Frontier Models
Pluto was trained using knowledge distillation from multiple frontier coding models, combined with a curated private dataset of advanced reasoning traces. The distillation pipeline transfers deep reasoning chains from teacher models while keeping inference cost at the 9B scale.
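The distillation pipeline itself is private. As a generic illustration, token-level distillation typically minimizes the KL divergence between the teacher's and student's softened next-token distributions; the temperature and toy shapes below are assumptions, not Pluto's actual recipe.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Stable softmax over the last axis at a given temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean token-level KL(teacher || student) at a softening temperature.

    A standard knowledge-distillation objective; Pluto's exact setup
    (teacher models, weighting, trace filtering) is not public.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(kl.mean())


rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 64))                   # (tokens, vocab)
student = teacher + 0.1 * rng.normal(size=(8, 64))   # a nearly-converged student
loss_close = distill_loss(student, teacher)
loss_far = distill_loss(rng.normal(size=(8, 64)), teacher)
```

A student whose logits track the teacher's incurs a much smaller loss than an unrelated one, which is the signal that transfers the teacher's reasoning distribution at 9B scale.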

---

## Quickstart

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "MerlinSafety/Pluto"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that parses a JWT token without external libraries and validates the expiry timestamp."
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.1,
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

### With Unsloth (faster inference, 4-bit)

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MerlinSafety/Pluto",
    max_seq_length=131072,  # adjust as needed
    dtype=None,
    load_in_4bit=True,
)

FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Refactor this function to be async and add proper error handling:\n\ndef fetch_data(url):\n    import requests\n    return requests.get(url).json()"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,
    temperature=0.6,
    do_sample=True,
)

print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

### GGUF / llama.cpp (local deployment)

```bash
# Download Q4_K_M (recommended, ~5.4GB)
huggingface-cli download MerlinSafety/Pluto \
    Pluto-Q4_K_M.gguf \
    --local-dir ./pluto

# Download Q8_0 (higher quality, ~9.4GB)
huggingface-cli download MerlinSafety/Pluto \
    Pluto-Q8_0.gguf \
    --local-dir ./pluto

# Run with llama.cpp
./llama-cli \
    -m ./pluto/Pluto-Q4_K_M.gguf \
    -p "Explain the time complexity of this algorithm and suggest optimizations:\n[your code here]" \
    -n 1024 \
    --temp 0.6 \
    --top-p 0.95 \
    -c 8192
```

### Ollama

```bash
cat > Modelfile << 'EOF'
FROM ./Pluto-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
EOF

ollama create pluto -f Modelfile
ollama run pluto "Write a thread-safe singleton implementation in Python"
```

---

## Claude Code Integration

Pluto can serve as a local backend for Claude Code by pointing the client at a local OpenAI-compatible server:

```bash
# Start local server (example with llama.cpp server)
./llama-server \
    -m ./pluto/Pluto-Q4_K_M.gguf \
    --port 8080 \
    -c 32768 \
    --chat-template qwen

# Use with Claude Code
claude --model http://localhost:8080 "Review this PR and identify potential bugs"
```

---

## OpenAI Codex / Assistants API Integration

Pluto's instruction format is compatible with the OpenAI Chat Completions API when served through a compatible endpoint:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your local Pluto server
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="pluto",
    messages=[
        {
            "role": "user",
            "content": "Write a SQL query to find the top 5 customers by revenue in the last 30 days, handling NULL values correctly."
        }
    ],
    max_tokens=1024,
    temperature=0.6,
)

print(response.choices[0].message.content)
```

---

## Training Details

### Pipeline Overview

```
Qwen/Qwen3.5-9B-Base
    │
    ▼
SFT on curated advanced reasoning + coding dataset
(private dataset, distillation from frontier models)
    │
    ▼
GRPO Reinforcement Learning
with Adaptive Entropy Regularization (AER)
+ IBM Quantum Kingston entropy noise injection
    │
    ▼
Long-context fine-tuning (1M token extension)
    │
    ▼
Agentic deployment fine-tuning
(Claude Code + Codex format alignment)
    │
    ▼
Pluto 9B
```

### Adaptive Entropy Regularization (AER)

During RL training, the loss function was modified as:

```
L_total = L_RL + λ(t) · L_entropy
```

where `λ(t)` is a dynamic coefficient modulated by quantum bitstring measurements from the IBM Quantum Kingston (Heron r2) processor. GHZ-state measurements provided true quantum randomness that guided the per-token entropy targets, preventing entropy collapse and improving robustness.
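The AER implementation is not published; the sketch below illustrates the general shape of the objective under stated assumptions. The 16-bit string, the λ range [0.001, 0.01], and the classical random stand-in for the quantum measurement are all illustrative; the actual training reportedly drew bits from GHZ-state measurements on IBM Quantum Kingston.

```python
import numpy as np


def softmax(logits):
    """Stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def entropy_bonus(logits):
    """Mean per-token entropy of the policy distribution."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())


def lambda_from_bits(bits, lam_min=0.001, lam_max=0.01):
    """Map a measured bitstring to a coefficient in [lam_min, lam_max].

    Treats the bits as an opaque uniform random source; the bounds are
    illustrative, not Pluto's actual schedule.
    """
    u = int("".join(map(str, bits)), 2) / (2 ** len(bits) - 1)
    return lam_min + u * (lam_max - lam_min)


# Classical stand-in for one quantum bitstring measurement.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=16)

logits = rng.normal(size=(4, 32))  # (tokens, vocab) toy policy logits
l_rl = 1.7                         # placeholder RL loss value
lam = lambda_from_bits(bits)
l_entropy = -entropy_bonus(logits)  # negative entropy: minimizing it raises entropy
l_total = l_rl + lam * l_entropy    # L_total = L_RL + λ(t) · L_entropy
```

Resampling `bits` at each step makes λ(t) fluctuate within its bounds, which is the mechanism the card credits with resisting entropy collapse.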

### Compute
Training was conducted on Google Cloud TPU/GPU infrastructure supported by a **Google TPU Research Cloud (TRC) grant** awarded to Merlin Research.

---

## Intended Use

- Complex code generation and refactoring  
- Multi-file codebase analysis  
- Agentic coding pipelines (Claude Code, Codex)  
- Code review and bug detection  
- Architecture planning and technical reasoning  
- Local deployment with large private codebases  

---

## Limitations

- Pluto is optimized for coding and technical reasoning; general conversation and creative tasks are outside its primary design goal
- Like all LLMs, Pluto can produce incorrect code; always review generated output before deploying to production
- Performance on very niche frameworks or proprietary APIs may be limited by training data coverage
- The quantum entropy component provides training-time benefits only; inference behavior is fully classical

---

## Citation

```bibtex
@misc{pluto-2026,
  title={Pluto: Precision Coding and Reasoning Model with Quantum Entropy Regularization},
  author={Merlin Research},
  year={2026},
  publisher={Merlin Research},
  url={https://huggingface.co/MerlinSafety/Pluto}
}
```

---

## About Merlin Research

[Merlin Research](https://huggingface.co/MerlinSafety) is an independent AI safety laboratory based in Stockholm, Sweden, focused on open-source model development, adaptive entropy regularization, and practical AI alignment. Our models are released publicly to advance accessible, safe, and high-quality AI for the research community.

**HuggingFace:** [huggingface.co/MerlinSafety](https://huggingface.co/MerlinSafety)  
**Contact:** MerlinResearch@protonmail.com