---
license: apache-2.0
base_model: Qwen/Qwen3.5-9B-Base
tags:
- code
- reasoning
- distillation
- reinforcement-learning
- long-context
- claude-code
- openai-codex
- quantum-entropy
- merlin-research
language:
- en
pipeline_tag: text-generation
---

# Pluto

[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
[IBM Quantum](https://quantum.ibm.com)
[Google TPU Research Cloud](https://sites.research.google/trc/)

**Pluto** is a 9B-parameter coding and reasoning model developed by [Merlin Research](https://huggingface.co/MerlinSafety), built for precision, robustness, and seamless deployment in agentic coding environments, including Claude Code, OpenAI Codex, and local large-codebase workflows.

---

## Model Summary

| Property | Value |
|---|---|
| **Developer** | Merlin Research |
| **Base Model** | Qwen/Qwen3.5-9B-Base |
| **Parameters** | 9B |
| **Context Length** | 1,000,000 tokens |
| **Training** | SFT + RL with Adaptive Entropy Regularization |
| **Distillation** | Frontier coding models |
| **Compute** | Google Cloud (TPU/GPU via Google TRC Research Grant) |
| **Quantum** | IBM Quantum Kingston (Heron r2) — entropy noise injection |
| **License** | Apache 2.0 |

---

## Key Features

### 🎯 Precision-First Design
Pluto is trained to minimize errors rather than maximize fluency. Every training signal — from distillation targets to RL reward shaping — is oriented around correctness rather than surface-level coherence. This makes Pluto particularly effective for tasks where a single wrong line of code has downstream consequences.

### 🔭 1M Token Context
Pluto supports up to **1,000,000 tokens** of context, enabling operation on large codebases without chunking or retrieval hacks. Feed it an entire repository, a multi-file diff, or a long conversation history — Pluto maintains coherent reasoning across the full window.

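As an illustration of chunk-free prompting, an entire repository can be concatenated into a single prompt before tokenization. This is a minimal sketch under our own assumptions; the `### FILE:` delimiter and extension filter are illustrative conventions, not part of Pluto's training format.

```python
from pathlib import Path

def pack_repo(root: str, exts=(".py", ".md")) -> str:
    """Concatenate every matching file under `root` into one prompt string."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            # Delimit each file so the model can attribute code to its source.
            parts.append(f"### FILE: {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)
```

The resulting string can be passed directly to the tokenizer as the user message, staying well within the 1M-token window for most repositories.
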
### 🤖 Agentic Deployment Ready
Pluto is fine-tuned specifically for deployment in:
- **Claude Code** — system prompt formatting, tool call patterns, multi-turn agentic loops
- **OpenAI Codex / Assistants API** — compatible message structure and function calling behavior
- **Local deployment** — GGUF and quantized variants available for running against large local codebases without API latency

### ⚛️ Quantum Entropy Regularization (AER)
During RL training, Pluto used **Adaptive Entropy Regularization (AER)** with quantum noise sourced from the **IBM Quantum Kingston** processor (Heron r2, 156 qubits). Bitstring measurements from entangled quantum states were used to modulate the per-token entropy coefficient λ(t) during GRPO training, providing:
- Resistance to entropy collapse and reward hacking
- Improved robustness on out-of-distribution inputs
- More stable training dynamics across long RL runs

To our knowledge, this makes Pluto the first production coding model trained with quantum hardware-sourced entropy regularization.

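To illustrate the idea, a measured bitstring can be mapped to a coefficient value in a bounded range. This is a minimal sketch under our own assumptions; the actual AER schedule, bounds, and mapping used in training are not published, and `lam_min`/`lam_max` below are illustrative.

```python
def bits_to_lambda(bitstring: str, lam_min: float = 1e-3, lam_max: float = 1e-2) -> float:
    """Map a measured bitstring to an entropy coefficient in [lam_min, lam_max]."""
    # Interpret the bits as an integer and normalize to [0, 1].
    u = int(bitstring, 2) / (2 ** len(bitstring) - 1)
    return lam_min + u * (lam_max - lam_min)
```

Each new measurement shot then yields a fresh λ(t), so the entropy pressure fluctuates over training steps instead of collapsing to a fixed value.
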
### 📚 Distillation from Frontier Models
Pluto was trained using knowledge distillation from multiple frontier coding models, combined with a curated private dataset of advanced reasoning traces. The distillation pipeline transfers deep reasoning chains from teacher models while keeping inference cost at the 9B scale.

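The card does not specify the distillation objective; a common choice is a temperature-scaled KL divergence between the teacher's and student's next-token distributions. The sketch below is generic and illustrative, not Merlin Research's actual pipeline.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """Forward KL(teacher || student) at temperature T, scaled by T^2
    (the scaling keeps gradient magnitudes comparable across temperatures)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

The loss is zero when the student matches the teacher exactly and grows as the distributions diverge, which is what lets a 9B student absorb a larger teacher's token-level preferences.
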
---

## Quickstart

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "MerlinSafety/Pluto"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that parses a JWT token without external libraries and validates the expiry timestamp."
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.1,
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

### With Unsloth (faster inference, 4-bit)

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MerlinSafety/Pluto",
    max_seq_length=131072,  # adjust as needed
    dtype=None,
    load_in_4bit=True,
)

FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Refactor this function to be async and add proper error handling:\n\ndef fetch_data(url):\n    import requests\n    return requests.get(url).json()"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,
    temperature=0.6,
    do_sample=True,
)

print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

### GGUF / llama.cpp (local deployment)

```bash
# Download Q4_K_M (recommended, ~5.4 GB)
huggingface-cli download MerlinSafety/Pluto \
  Pluto-Q4_K_M.gguf \
  --local-dir ./pluto

# Download Q8_0 (higher quality, ~9.4 GB)
huggingface-cli download MerlinSafety/Pluto \
  Pluto-Q8_0.gguf \
  --local-dir ./pluto

# Run with llama.cpp
./llama-cli \
  -m ./pluto/Pluto-Q4_K_M.gguf \
  -p "Explain the time complexity of this algorithm and suggest optimizations:\n[your code here]" \
  -n 1024 \
  --temp 0.6 \
  --top-p 0.95 \
  -c 8192
```

### Ollama

```bash
cat > Modelfile << 'EOF'
FROM ./Pluto-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
EOF

ollama create pluto -f Modelfile
ollama run pluto "Write a thread-safe singleton implementation in Python"
```

---

## Claude Code Integration

Pluto is optimized for use as a local backend in Claude Code via the `--model` flag when pointing to a local OpenAI-compatible server:

```bash
# Start a local server (example with llama.cpp server)
./llama-server \
  -m ./pluto/Pluto-Q4_K_M.gguf \
  --port 8080 \
  -c 32768 \
  --chat-template chatml

# Use with Claude Code
claude --model http://localhost:8080 "Review this PR and identify potential bugs"
```

---

## OpenAI Codex / Assistants API Integration

Pluto's instruction format is compatible with the OpenAI Chat Completions API when served through a compatible endpoint:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your local Pluto server
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="pluto",
    messages=[
        {
            "role": "user",
            "content": "Write a SQL query to find the top 5 customers by revenue in the last 30 days, handling NULL values correctly."
        }
    ],
    max_tokens=1024,
    temperature=0.6,
)

print(response.choices[0].message.content)
```

---

## Training Details

### Pipeline Overview

```
Qwen/Qwen3.5-9B-Base
        │
        ▼
SFT on curated advanced reasoning + coding dataset
(private dataset, distillation from frontier models)
        │
        ▼
GRPO Reinforcement Learning
with Adaptive Entropy Regularization (AER)
+ IBM Quantum Kingston entropy noise injection
        │
        ▼
Long-context fine-tuning (1M token extension)
        │
        ▼
Agentic deployment fine-tuning
(Claude Code + Codex format alignment)
        │
        ▼
Pluto 9B
```

### Adaptive Entropy Regularization (AER)

During RL training, the loss function was modified as:

```
L_total = L_RL + λ(t) · L_entropy
```

where `λ(t)` is a dynamic coefficient modulated by quantum bitstring measurements from the IBM Quantum Kingston (Heron r2) processor. GHZ-state measurements provided true quantum randomness that guided the per-token entropy targets, preventing entropy collapse and improving robustness.

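As a concrete but purely illustrative reading of this formula, the sketch below treats `L_entropy` as the negative mean per-token entropy, so a positive λ(t) penalizes low-entropy (collapsed) policies. The actual GRPO objective and entropy term used in training are not published.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def aer_total_loss(rl_loss, token_probs, lam):
    """L_total = L_RL + lam * L_entropy, with L_entropy = -mean entropy,
    so minimizing L_total discourages entropy collapse."""
    mean_h = sum(entropy(p) for p in token_probs) / len(token_probs)
    return rl_loss + lam * (-mean_h)
```

Under this reading, a one-hot (fully collapsed) policy incurs a strictly higher total loss than a higher-entropy one at the same RL loss, which is the stabilizing pressure AER is meant to provide.
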
### Compute
Training was conducted on Google Cloud TPU/GPU infrastructure supported by a **Google TPU Research Cloud (TRC) grant** awarded to Merlin Research.

---

## Intended Use

- Complex code generation and refactoring
- Multi-file codebase analysis
- Agentic coding pipelines (Claude Code, Codex)
- Code review and bug detection
- Architecture planning and technical reasoning
- Local deployment with large private codebases

---

## Limitations

- Pluto is optimized for coding and technical reasoning — general conversation and creative tasks are outside its primary design goals
- Like all LLMs, Pluto can produce incorrect code; always review generated output before deploying it to production
- Performance on very niche frameworks or proprietary APIs may be limited by training data coverage
- The quantum entropy component provides training-time benefits only; inference behavior is fully classical

---

## Citation

```bibtex
@misc{pluto-2026,
  title={Pluto: Precision Coding and Reasoning Model with Quantum Entropy Regularization},
  author={Merlin Research},
  year={2026},
  publisher={Merlin Research},
  url={https://huggingface.co/MerlinSafety/Pluto}
}
```

---

## About Merlin Research

[Merlin Research](https://huggingface.co/MerlinSafety) is an independent AI safety laboratory based in Stockholm, Sweden, focused on open-source model development, adaptive entropy regularization, and practical AI alignment. Our models are released publicly to advance accessible, safe, and high-quality AI for the research community.

**HuggingFace:** [huggingface.co/MerlinSafety](https://huggingface.co/MerlinSafety)
**Contact:** MerlinResearch@protonmail.com