---
license: apache-2.0
base_model: Qwen/Qwen3.5-9B-Base
tags:
- code
- reasoning
- distillation
- reinforcement-learning
- long-context
- claude-code
- openai-codex
- quantum-entropy
- merlin-research
language:
- en
pipeline_tag: text-generation
---
# Pluto
![Pluto](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/yEhR_aUdMvbHKMuhiXvB7.jpeg)
[![License](https://img.shields.io/badge/License-Apache_2.0-green?style=for-the-badge)](https://www.apache.org/licenses/LICENSE-2.0)
[![IBM Quantum](https://img.shields.io/badge/IBM_Quantum-Kingston_156Q-7c3aed?style=for-the-badge)](https://quantum.ibm.com)
[![Training Hardware](https://img.shields.io/badge/Training_HW-Google_TPU_TRC-dc2626?style=for-the-badge)](https://sites.research.google/trc/)
**Pluto** is a 9B parameter coding and reasoning model developed by [Merlin Research](https://huggingface.co/MerlinSafety), built for precision, robustness, and seamless deployment in agentic coding environments including Claude Code, OpenAI Codex, and local large-codebase workflows.
---
## Model Summary
![benchmarks](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/rduiP2UeMrpMgcIfTIEm6.png)
| Property | Value |
|---|---|
| **Developer** | Merlin Research |
| **Base Model** | Qwen/Qwen3.5-9B-Base |
| **Parameters** | 9B |
| **Context Length** | 1,000,000 tokens |
| **Training** | SFT + RL with Adaptive Entropy Regularization |
| **Distillation** | Frontier coding models |
| **Compute** | Google Cloud (TPU/GPU via Google TRC Research Grant) |
| **Quantum** | IBM Quantum Kingston (Heron r2) — entropy noise injection |
| **License** | Apache 2.0 |
---
## Key Features
### 🎯 Precision-First Design
Pluto is trained to minimize errors rather than maximize fluency. Every training signal — from distillation targets to RL reward shaping — is oriented around correctness, not surface-level coherence. This makes Pluto particularly effective for tasks where a single wrong line of code has downstream consequences.
### 🔭 1M Token Context
Pluto supports up to **1,000,000 tokens** of context, enabling operation on large codebases without chunking or retrieval hacks. Feed it an entire repository, a multi-file diff, or a long conversation history — Pluto maintains coherent reasoning across the full window.
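With a window this large, an entire repository can be serialized into a single prompt rather than chunked for retrieval. Below is a minimal sketch of such a packer; the function name, extension filter, and character budget are illustrative choices, not part of Pluto's tooling:

```python
from pathlib import Path


def repo_to_prompt(root, exts=(".py", ".md"), max_chars=4_000_000):
    """Concatenate matching source files under `root` into one prompt string,
    prefixing each file with a path header so the model can track file boundaries."""
    parts = []
    total = 0
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            chunk = f"### FILE: {path}\n{path.read_text(errors='replace')}\n"
            if total + len(chunk) > max_chars:
                break  # stay under the character budget
            parts.append(chunk)
            total += len(chunk)
    return "".join(parts)
```

The resulting string can be placed directly in a user message; a rough rule of thumb is 3–4 characters per token, so the default budget stays comfortably inside the 1M-token window.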
### 🤖 Agentic Deployment Ready
Pluto is fine-tuned specifically for deployment in:
- **Claude Code** — system prompt formatting, tool call patterns, multi-turn agentic loops
- **OpenAI Codex / Assistants API** — compatible message structure and function calling behavior
- **Local deployment** — GGUF and quantized variants available for running against large local codebases without API latency
### ⚛️ Quantum Entropy Regularization (AER)
During RL training, Pluto used **Adaptive Entropy Regularization (AER)** with quantum noise sourced from the **IBM Quantum Kingston** processor (Heron r2, 156 qubits). Bitstring measurements from entangled quantum states were used to modulate the per-token entropy coefficient λ(t) during GRPO training, providing:
- Resistance to entropy collapse and reward hacking
- Improved robustness on out-of-distribution inputs
- More stable training dynamics across long RL runs
To our knowledge, this makes Pluto the first production coding model trained with quantum-hardware-sourced entropy regularization.
### 📚 Distillation from Frontier Models
Pluto was trained using knowledge distillation from multiple frontier coding models, combined with a curated private dataset of advanced reasoning traces. The distillation pipeline transfers deep reasoning chains from teacher models while keeping inference cost at the 9B scale.
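At the token level, distillation of this kind typically minimizes the KL divergence between teacher and student next-token distributions. The sketch below shows that generic objective in pure Python; it is illustrative only, since Pluto's actual teachers and dataset are private:

```python
import math


def kl_distill_loss(teacher_probs, student_probs, eps=1e-12):
    """Mean forward KL(teacher || student) over a sequence of
    next-token probability distributions (lists of per-vocab probs)."""
    total = 0.0
    for t_dist, s_dist in zip(teacher_probs, student_probs):
        # KL is summed over the vocabulary; terms with zero teacher mass vanish
        total += sum(
            t * math.log(t / max(s, eps))
            for t, s in zip(t_dist, s_dist)
            if t > 0
        )
    return total / len(teacher_probs)
```

Minimizing this loss pulls the student's distribution toward the teacher's at every position, which is how deep reasoning chains transfer while inference cost stays at the 9B scale.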
---
## Quickstart
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "MerlinSafety/Pluto"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that parses a JWT token without external libraries and validates the expiry timestamp."
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.1,
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
### With Unsloth (faster inference, 4-bit)
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MerlinSafety/Pluto",
    max_seq_length=131072,  # adjust as needed
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Refactor this function to be async and add proper error handling:\n\ndef fetch_data(url):\n import requests\n return requests.get(url).json()"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,
    temperature=0.6,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
### GGUF / llama.cpp (local deployment)
```bash
# Download Q4_K_M (recommended, ~5.4GB)
huggingface-cli download MerlinSafety/Pluto \
Pluto-Q4_K_M.gguf \
--local-dir ./pluto
# Download Q8_0 (higher quality, ~9.4GB)
huggingface-cli download MerlinSafety/Pluto \
Pluto-Q8_0.gguf \
--local-dir ./pluto
# Run with llama.cpp
./llama-cli \
-m ./pluto/Pluto-Q4_K_M.gguf \
-p "Explain the time complexity of this algorithm and suggest optimizations:\n[your code here]" \
-n 1024 \
--temp 0.6 \
--top-p 0.95 \
-c 8192
```
### Ollama
```bash
cat > Modelfile << 'EOF'
FROM ./Pluto-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
EOF
ollama create pluto -f Modelfile
ollama run pluto "Write a thread-safe singleton implementation in Python"
```
---
## Claude Code Integration
Pluto can serve as a local backend for Claude Code by pointing the `--model` flag at a local OpenAI-compatible server:
```bash
# Start local server (example with llama.cpp server)
./llama-server \
  -m ./pluto/Pluto-Q4_K_M.gguf \
--port 8080 \
-c 32768 \
--chat-template qwen
# Use with Claude Code
claude --model http://localhost:8080 "Review this PR and identify potential bugs"
```
---
## OpenAI Codex / Assistants API Integration
Pluto's instruction format is compatible with the OpenAI Chat Completions API when served through a compatible endpoint:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your local Pluto server
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="pluto",
    messages=[
        {
            "role": "user",
            "content": "Write a SQL query to find the top 5 customers by revenue in the last 30 days, handling NULL values correctly."
        }
    ],
    max_tokens=1024,
    temperature=0.6,
)
print(response.choices[0].message.content)
```
---
## Training Details
### Pipeline Overview
```
Qwen/Qwen3.5-9B-Base
        │
        ▼
SFT on curated advanced reasoning + coding dataset
(private dataset, distillation from frontier models)
        │
        ▼
GRPO Reinforcement Learning
with Adaptive Entropy Regularization (AER)
+ IBM Quantum Kingston entropy noise injection
        │
        ▼
Long-context fine-tuning (1M token extension)
        │
        ▼
Agentic deployment fine-tuning
(Claude Code + Codex format alignment)
        │
        ▼
Pluto 9B
```
### Adaptive Entropy Regularization (AER)
During RL training, the loss function was modified as:
```
L_total = L_RL + λ(t) · L_entropy
```
where `λ(t)` is a dynamic coefficient modulated by quantum bitstring measurements from the IBM Quantum Kingston (Heron r2) processor. GHZ-state measurements provided true quantum randomness that guided the per-token entropy targets, preventing entropy collapse and improving robustness.
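The role of λ(t) can be illustrated with a toy version of the entropy term. In this pure-Python sketch the hardware bitstream is replaced by an arbitrary 0/1 sequence, and the base coefficient and modulation scale are made-up values, since the actual schedule is not public:

```python
import math


def aer_entropy_bonus(token_probs, quantum_bits, lam_base=0.01, lam_scale=0.005):
    """Toy sketch of the AER term: per-token Shannon entropy weighted by a
    coefficient lambda(t) that an external random bitstream nudges up or down."""
    assert len(token_probs) == len(quantum_bits)
    total = 0.0
    for probs, bit in zip(token_probs, quantum_bits):
        # Shannon entropy of this token's predicted distribution
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        lam = lam_base + lam_scale * bit  # bit in {0, 1} modulates lambda(t)
        total += lam * entropy
    return total / len(token_probs)
```

Because the coefficient varies per token rather than being a fixed hyperparameter, the policy cannot settle into a single low-entropy mode that games the reward, which is the entropy-collapse failure AER is designed to prevent.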
### Compute
Training was conducted on Google Cloud TPU/GPU infrastructure supported by a **Google TPU Research Cloud (TRC) grant** awarded to Merlin Research.
---
## Intended Use
- Complex code generation and refactoring
- Multi-file codebase analysis
- Agentic coding pipelines (Claude Code, Codex)
- Code review and bug detection
- Architecture planning and technical reasoning
- Local deployment with large private codebases
---
## Limitations
- Pluto is optimized for coding and technical reasoning — general conversation and creative tasks are outside its primary design goal
- Like all LLMs, Pluto can produce incorrect code; always review generated output before deploying to production
- Performance on very niche frameworks or proprietary APIs may be limited by training data coverage
- The quantum entropy component provides training-time benefits only; inference behavior is fully classical
---
## Citation
```bibtex
@misc{pluto-2026,
title={Pluto: Precision Coding and Reasoning Model with Quantum Entropy Regularization},
author={Merlin Research},
year={2026},
publisher={Merlin Research},
url={https://huggingface.co/MerlinSafety/Pluto}
}
```
---
## About Merlin Research
[Merlin Research](https://huggingface.co/MerlinSafety) is an independent AI safety laboratory based in Stockholm, Sweden, focused on open-source model development, adaptive entropy regularization, and practical AI alignment. Our models are released publicly to advance accessible, safe, and high-quality AI for the research community.
**HuggingFace:** [huggingface.co/MerlinSafety](https://huggingface.co/MerlinSafety)
**Contact:** MerlinResearch@protonmail.com