---
license: apache-2.0
base_model: Qwen/Qwen3.5-9B-Base
tags:
- code
- reasoning
- distillation
- reinforcement-learning
- long-context
- claude-code
- openai-codex
- quantum-entropy
- merlin-research
language:
- en
pipeline_tag: text-generation
---

# Pluto

![Pluto](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/yEhR_aUdMvbHKMuhiXvB7.jpeg)

[![License](https://img.shields.io/badge/License-Apache_2.0-green?style=for-the-badge)](https://www.apache.org/licenses/LICENSE-2.0)
[![IBM Quantum](https://img.shields.io/badge/IBM_Quantum-Kingston_156Q-7c3aed?style=for-the-badge)](https://quantum.ibm.com)
[![Training Hardware](https://img.shields.io/badge/Training_HW-Google_TPU_TRC-dc2626?style=for-the-badge)](https://sites.research.google/trc/)

**Pluto** is a 9B-parameter coding and reasoning model developed by [Merlin Research](https://huggingface.co/MerlinSafety), built for precision, robustness, and seamless deployment in agentic coding environments, including Claude Code, OpenAI Codex, and local large-codebase workflows.

---

## Model Summary

![benchmarks](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/rduiP2UeMrpMgcIfTIEm6.png)

| Property | Value |
|---|---|
| **Developer** | Merlin Research |
| **Base Model** | Qwen/Qwen3.5-9B-Base |
| **Parameters** | 9B |
| **Context Length** | 1,000,000 tokens |
| **Training** | SFT + RL with Adaptive Entropy Regularization |
| **Distillation** | Frontier coding models |
| **Compute** | Google Cloud (TPU/GPU via Google TRC Research Grant) |
| **Quantum** | IBM Quantum Kingston (Heron r2), entropy noise injection |
| **License** | Apache 2.0 |

---

## Key Features

### 🎯 Precision-First Design

Pluto is trained to minimize errors rather than maximize fluency. Every training signal, from distillation targets to RL reward shaping, is oriented around correctness rather than surface-level coherence.
This makes Pluto particularly effective for tasks where a single wrong line of code has downstream consequences.

### 🔭 1M Token Context

Pluto supports up to **1,000,000 tokens** of context, enabling operation on large codebases without chunking or retrieval workarounds. Feed it an entire repository, a multi-file diff, or a long conversation history; Pluto maintains coherent reasoning across the full window.

### 🤖 Agentic Deployment Ready

Pluto is fine-tuned specifically for deployment in:

- **Claude Code**: system prompt formatting, tool-call patterns, multi-turn agentic loops
- **OpenAI Codex / Assistants API**: compatible message structure and function-calling behavior
- **Local deployment**: GGUF and quantized variants available for running against large local codebases without API latency

### ⚛️ Quantum Entropy Regularization (AER)

During RL training, Pluto used **Adaptive Entropy Regularization (AER)** with quantum noise sourced from the **IBM Quantum Kingston** processor (Heron r2, 156 qubits). Bitstring measurements from entangled quantum states were used to modulate the per-token entropy coefficient λ(t) during GRPO training, providing:

- Resistance to entropy collapse and reward hacking
- Improved robustness on out-of-distribution inputs
- More stable training dynamics across long RL runs

To our knowledge, this makes Pluto the first production coding model trained with quantum hardware-sourced entropy regularization.

### 📚 Distillation from Frontier Models

Pluto was trained using knowledge distillation from multiple frontier coding models, combined with a curated private dataset of advanced reasoning traces. The distillation pipeline transfers deep reasoning chains from teacher models while keeping inference cost at the 9B scale.
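The distillation approach described above is, in standard form, a temperature-scaled KL divergence between the teacher's and student's next-token distributions. The following framework-agnostic sketch illustrates that objective; it is a minimal illustration, not Pluto's actual (unpublished) training recipe, and the temperature value is an assumption.

```python
import math

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence KL(teacher || student), a standard
    knowledge-distillation objective. Illustrative sketch only; the exact
    recipe used for Pluto is not published."""
    t = temperature

    def softmax(xs):
        m = max(xs)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    p = softmax([x / t for x in teacher_logits])  # softened teacher distribution
    q = softmax([x / t for x in student_logits])  # softened student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * t * t  # scale by t^2 so gradient magnitudes stay comparable
```

When student and teacher logits agree, the loss is zero; any mismatch yields a positive penalty, pushing the 9B student toward the teacher's distribution.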
---

## Quickstart

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "MerlinSafety/Pluto"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that parses a JWT token without external libraries and validates the expiry timestamp."
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.1,
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

### With Unsloth (faster inference, 4-bit)

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MerlinSafety/Pluto",
    max_seq_length=131072,  # adjust as needed
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Refactor this function to be async and add proper error handling:\n\ndef fetch_data(url):\n    import requests\n    return requests.get(url).json()"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,
    temperature=0.6,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

### GGUF / llama.cpp (local deployment)

```bash
# Download Q4_K_M (recommended, ~5.4GB)
huggingface-cli download MerlinSafety/Pluto \
  Pluto-Q4_K_M.gguf \
  --local-dir ./pluto

# Download Q8_0 (higher quality, ~9.4GB)
huggingface-cli download MerlinSafety/Pluto \
  Pluto-Q8_0.gguf \
  --local-dir ./pluto

# Run with llama.cpp
./llama-cli \
  -m ./pluto/Pluto-Q4_K_M.gguf \
  -p "Explain the time complexity of this algorithm and suggest optimizations:\n[your code here]" \
  -n 1024 \
  --temp 0.6 \
  --top-p 0.95 \
  -c 8192
```

### Ollama

```bash
cat > Modelfile << 'EOF'
FROM ./Pluto-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
EOF

ollama create pluto -f Modelfile
ollama run pluto "Write a thread-safe singleton implementation in Python"
```

---

## Claude Code Integration

Pluto can serve as a local backend for Claude Code by pointing the client at a local OpenAI-compatible server. The invocation below is illustrative; check your Claude Code version's documentation for the exact way to configure a custom endpoint:

```bash
# Start a local server (example with llama.cpp server)
./llama-server \
  -m ./pluto/Pluto-Q4_K_M.gguf \
  --port 8080 \
  -c 32768 \
  --chat-template qwen

# Use with Claude Code
claude --model http://localhost:8080 "Review this PR and identify potential bugs"
```

---

## OpenAI Codex / Assistants API Integration

Pluto's instruction format is compatible with the OpenAI Chat Completions API when served through a compatible endpoint:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your local Pluto server
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="pluto",
    messages=[
        {
            "role": "user",
            "content": "Write a SQL query to find the top 5 customers by revenue in the last 30 days, handling NULL values correctly."
        }
    ],
    max_tokens=1024,
    temperature=0.6,
)

print(response.choices[0].message.content)
```

---

## Training Details

### Pipeline Overview

```
Qwen/Qwen3.5-9B-Base
        │
        ▼
SFT on curated advanced reasoning + coding dataset
(private dataset, distillation from frontier models)
        │
        ▼
GRPO Reinforcement Learning
with Adaptive Entropy Regularization (AER)
+ IBM Quantum Kingston entropy noise injection
        │
        ▼
Long-context fine-tuning (1M token extension)
        │
        ▼
Agentic deployment fine-tuning
(Claude Code + Codex format alignment)
        │
        ▼
Pluto 9B
```

### Adaptive Entropy Regularization (AER)

During RL training, the loss function was modified as:

```
L_total = L_RL + λ(t) · L_entropy
```

where `λ(t)` is a dynamic coefficient modulated by quantum bitstring measurements from the IBM Quantum Kingston (Heron r2) processor. GHZ-state measurements provided true quantum randomness that guided the per-token entropy targets, preventing entropy collapse and improving robustness.

### Compute

Training was conducted on Google Cloud TPU/GPU infrastructure supported by a **Google TPU Research Cloud (TRC) grant** awarded to Merlin Research.
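The AER objective above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the bitstring-to-λ(t) mapping, the λ range, and the simulated bitstrings are all hypothetical; the actual schedule used in Pluto's GRPO runs (and the hardware sampling pipeline) is not published.

```python
def aer_coefficient(bitstring, lam_min=0.001, lam_max=0.01):
    """Map a measured bitstring (e.g. from a GHZ-state readout) to an
    entropy coefficient lambda(t). The linear mapping and the lambda
    range here are illustrative assumptions, not Pluto's actual values."""
    frac = int(bitstring, 2) / (2 ** len(bitstring) - 1)  # normalize to [0, 1]
    return lam_min + frac * (lam_max - lam_min)

def aer_total_loss(rl_loss, entropy_loss, bitstring):
    """L_total = L_RL + lambda(t) * L_entropy, with lambda(t) driven by
    hardware-sourced randomness (here supplied as a plain bitstring)."""
    return rl_loss + aer_coefficient(bitstring) * entropy_loss
```

Because λ(t) jitters between λ_min and λ_max rather than staying fixed, the policy cannot settle into a degenerate low-entropy regime that exploits a constant regularizer, which is the mechanism behind the entropy-collapse resistance claimed above.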
---

## Intended Use

- Complex code generation and refactoring
- Multi-file codebase analysis
- Agentic coding pipelines (Claude Code, Codex)
- Code review and bug detection
- Architecture planning and technical reasoning
- Local deployment with large private codebases

---

## Limitations

- Pluto is optimized for coding and technical reasoning; general conversation and creative tasks are outside its primary design goal
- Like all LLMs, Pluto can produce incorrect code; always review generated output before deploying to production
- Performance on very niche frameworks or proprietary APIs may be limited by training data coverage
- The quantum entropy component provides training-time benefits only; inference behavior is classical

---

## Citation

```bibtex
@misc{pluto-2026,
  title={Pluto: Precision Coding and Reasoning Model with Quantum Entropy Regularization},
  author={Merlin Research},
  year={2026},
  publisher={Merlin Research},
  url={https://huggingface.co/MerlinSafety/Pluto}
}
```

---

## About Merlin Research

[Merlin Research](https://huggingface.co/MerlinSafety) is an independent AI safety laboratory based in Stockholm, Sweden, focused on open-source model development, adaptive entropy regularization, and practical AI alignment. Our models are released publicly to advance accessible, safe, and high-quality AI for the research community.

**HuggingFace:** [huggingface.co/MerlinSafety](https://huggingface.co/MerlinSafety)
**Contact:** MerlinResearch@protonmail.com