| license: apache-2.0 | |
| base_model: Qwen/Qwen3.5-9B-Base | |
| tags: | |
| - code | |
| - reasoning | |
| - distillation | |
| - reinforcement-learning | |
| - long-context | |
| - claude-code | |
| - openai-codex | |
| - quantum-entropy | |
| - merlin-research | |
| language: | |
| - en | |
| pipeline_tag: image-text-to-text | |
| # Pluto | |
| [License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | |
| [IBM Quantum](https://quantum.ibm.com) | |
| [Google TPU Research Cloud](https://sites.research.google/trc/) | |
| **Pluto** is a 9B-parameter coding and reasoning model developed by [Merlin Research](https://huggingface.co/MerlinSafety), built for precision, robustness, and deployment in agentic coding environments including Claude Code, OpenAI Codex, and local large-codebase workflows. | |
| --- | |
| ## Model Summary | |
| | Property | Value | | |
| |---|---| | |
| | **Developer** | Merlin Research | | |
| | **Base Model** | Qwen/Qwen3.5-9B-Base | | |
| | **Parameters** | 9B | | |
| | **Context Length** | 1,000,000 tokens | | |
| | **Training** | SFT + RL with Adaptive Entropy Regularization | | |
| | **Distillation** | Frontier coding models | | |
| | **Compute** | Google Cloud (TPU/GPU via Google TRC Research Grant) | | |
| | **Quantum** | IBM Quantum Kingston (Heron r2) — entropy noise injection | | |
| | **License** | Apache 2.0 | | |
| --- | |
| ## Key Features | |
| ### 🎯 Precision-First Design | |
| Pluto is trained to minimize errors rather than maximize fluency. Every training signal — from distillation targets to RL reward shaping — is oriented around correctness, not surface-level coherence. This makes Pluto particularly effective for tasks where a single wrong line of code has downstream consequences. | |
| ### 🔭 1M Token Context | |
| Pluto supports up to **1,000,000 tokens** of context, enabling operation on large codebases without chunking or retrieval hacks. Feed it an entire repository, a multi-file diff, or a long conversation history — Pluto maintains coherent reasoning across the full window. | |
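Before sending a whole repository, it can help to check roughly whether it fits in the window. The sketch below is an illustration only: the helper names `estimate_repo_tokens` and `fits_in_context` are hypothetical, and the 4-characters-per-token ratio is a crude heuristic, not a measurement from Pluto's tokenizer. For an exact count, tokenize the files with the model's own tokenizer instead.

```python
import os

CONTEXT_LIMIT = 1_000_000  # context length stated in the model summary
CHARS_PER_TOKEN = 4        # rough heuristic (assumption), NOT Pluto's tokenizer


def estimate_repo_tokens(root, extensions=(".py", ".md", ".toml")):
    """Return {relative_path: estimated_tokens} for matching files under root."""
    estimates = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories such as .git in place
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    n_chars = len(fh.read())
                # Ceiling division so short files never count as zero tokens
                estimates[os.path.relpath(path, root)] = -(-n_chars // CHARS_PER_TOKEN)
    return estimates


def fits_in_context(estimates, reserve_for_output=8_192):
    """True if the estimated prompt plus an output reserve fits in the window."""
    return sum(estimates.values()) + reserve_for_output <= CONTEXT_LIMIT
```

Calling `fits_in_context(estimate_repo_tokens("."))` from a project root gives a quick yes/no answer; the per-file dictionary also shows which files dominate the budget.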
| ### 🤖 Agentic Deployment Ready | |
| Pluto is fine-tuned specifically for deployment in: | |
| - **Claude Code** — system prompt formatting, tool call patterns, multi-turn agentic loops | |
| - **OpenAI Codex / Assistants API** — compatible message structure and function calling behavior | |
| - **Local deployment** — GGUF and quantized variants available for running against large local codebases without API latency | |
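As an illustration of the function-calling message structure mentioned above, here is a minimal sketch of a Chat Completions request body that exposes one tool, plus a helper that reads tool calls back out of an assistant message. The tool name `read_file`, the model name `pluto`, and both helper functions are hypothetical; whether tool calls are actually emitted depends on the server and chat template used to host the model.

```python
import json


def build_tool_request(user_prompt, model="pluto"):
    """Build a Chat Completions request body that exposes one tool."""
    read_file_tool = {
        "type": "function",
        "function": {
            "name": "read_file",  # hypothetical tool, for illustration only
            "description": "Return the contents of a file in the workspace.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [read_file_tool],
    }


def parse_tool_calls(message):
    """Extract (name, arguments_dict) pairs from an assistant message dict."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In this format, arguments arrive as a JSON-encoded string
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls
```

The request body can be posted to any OpenAI-compatible `/v1/chat/completions` endpoint; the agent loop then runs each parsed call and appends the result as a `tool` message before asking the model to continue.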
| ### ⚛️ Quantum Entropy Regularization (AER) | |
| During RL training, Pluto used **Adaptive Entropy Regularization (AER)** with quantum noise sourced from the **IBM Quantum Kingston** processor (Heron r2, 156 qubits). Bitstring measurements from entangled quantum states were used to modulate the per-token entropy coefficient λ(t) during GRPO training, providing: | |
| - Resistance to entropy collapse and reward hacking | |
| - Improved robustness on out-of-distribution inputs | |
| - More stable training dynamics across long RL runs | |
| To our knowledge, this makes Pluto the first production coding model trained with entropy regularization sourced from quantum hardware. | |
| ### 📚 Distillation from Frontier Models | |
| Pluto was trained using knowledge distillation from multiple frontier coding models, combined with a curated private dataset of advanced reasoning traces. The distillation pipeline transfers deep reasoning chains from teacher models while keeping inference cost at the 9B scale. | |
| --- | |
| ## Quickstart | |
| ### Transformers | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| import torch | |
| model_id = "MerlinSafety/Pluto" | |
| # Load the tokenizer and the bf16 weights, placed automatically on available devices | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     model_id, | |
|     torch_dtype=torch.bfloat16, | |
|     device_map="auto", | |
| ) | |
| messages = [ | |
|     { | |
|         "role": "user", | |
|         "content": "Write a Python function that parses a JWT token without external libraries and validates the expiry timestamp." | |
|     } | |
| ] | |
| # Render the conversation with the model's chat template | |
| text = tokenizer.apply_chat_template( | |
|     messages, | |
|     tokenize=False, | |
|     add_generation_prompt=True, | |
| ) | |
| inputs = tokenizer(text, return_tensors="pt").to(model.device) | |
| with torch.no_grad(): | |
|     outputs = model.generate( | |
|         **inputs, | |
|         max_new_tokens=2048, | |
|         temperature=0.6, | |
|         top_p=0.95, | |
|         do_sample=True, | |
|         repetition_penalty=1.1, | |
|     ) | |
| # Decode only the newly generated tokens, skipping the prompt | |
| response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True) | |
| print(response) | |
| ``` | |
| ### With Unsloth (faster inference, 4-bit) | |
| ```python | |
| from unsloth import FastLanguageModel | |
| import torch | |
| model, tokenizer = FastLanguageModel.from_pretrained( | |
|     model_name="MerlinSafety/Pluto", | |
|     max_seq_length=131072,  # adjust as needed | |
|     dtype=None, | |
|     load_in_4bit=True, | |
| ) | |
| FastLanguageModel.for_inference(model) | |
| messages = [ | |
|     {"role": "user", "content": "Refactor this function to be async and add proper error handling:\n\ndef fetch_data(url):\n import requests\n return requests.get(url).json()"} | |
| ] | |
| inputs = tokenizer.apply_chat_template( | |
|     messages, | |
|     tokenize=True, | |
|     add_generation_prompt=True, | |
|     return_tensors="pt", | |
| ).to("cuda") | |
| outputs = model.generate( | |
|     input_ids=inputs, | |
|     max_new_tokens=1024, | |
|     temperature=0.6, | |
|     do_sample=True, | |
| ) | |
| print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)) | |
| ``` | |
| ### GGUF / llama.cpp (local deployment) | |
| ```bash | |
| # Download Q4_K_M (recommended, ~5.4GB) | |
| huggingface-cli download MerlinSafety/Pluto \ | |
| Pluto-Q4_K_M.gguf \ | |
| --local-dir ./pluto | |
| # Download Q8_0 (higher quality, ~9.4GB) | |
| huggingface-cli download MerlinSafety/Pluto \ | |
| Pluto-Q8_0.gguf \ | |
| --local-dir ./pluto | |
| # Run with llama.cpp | |
| ./llama-cli \ | |
| -m ./pluto/Pluto-Q4_K_M.gguf \ | |
| -p "Explain the time complexity of this algorithm and suggest optimizations:\n[your code here]" \ | |
| -n 1024 \ | |
| --temp 0.6 \ | |
| --top-p 0.95 \ | |
| -c 8192 | |
| ``` | |
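The download sizes quoted above can be sanity-checked with simple arithmetic: file size is roughly parameter count times average bits per weight, divided by 8. The sketch below assumes a nominal 9e9 parameters and an average of about 4.85 bits per weight for Q4_K_M, which is a commonly quoted ballpark rather than a figure from this model card. The 8.5 bits for Q8_0 follows from its block layout (32 int8 weights plus one fp16 scale per block).

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Approximate file size in decimal gigabytes for an average bit width."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 9e9  # nominal "9B"; the exact parameter count is an assumption

# Q8_0: 34 bytes per block of 32 weights = 8.5 bits per weight
Q8_0_BPW = 8.5
# Q4_K_M mixes block types; ~4.85 bits per weight is an approximate average (assumption)
Q4_K_M_BPW = 4.85

for name, bpw in [("Q4_K_M", Q4_K_M_BPW), ("Q8_0", Q8_0_BPW)]:
    print(f"{name}: ~{gguf_size_gb(N_PARAMS, bpw):.1f} GB")
```

Under these assumptions the estimates land near 5.5 GB and 9.6 GB, the same ballpark as the listed ~5.4GB and ~9.4GB. The gap is expected, since the true parameter count and the per-tensor quantization mix both differ from the nominal values used here.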
| ### Ollama | |
| ```bash | |
| cat > Modelfile << 'EOF' | |
| FROM ./Pluto-Q4_K_M.gguf | |
| PARAMETER temperature 0.6 | |
| PARAMETER top_p 0.95 | |
| PARAMETER num_ctx 8192 | |
| EOF | |
| ollama create pluto -f Modelfile | |
| ollama run pluto "Write a thread-safe singleton implementation in Python" | |
| ``` | |
| --- | |
| ## Claude Code Integration | |
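| Pluto is optimized for use as a local backend in Claude Code. Claude Code's `--model` flag takes a model name rather than a URL, so the endpoint is set through the `ANTHROPIC_BASE_URL` environment variable. The server behind that URL must accept Anthropic Messages API requests, either natively or through a translation proxy in front of an OpenAI-compatible server: | |
| ```bash | |
| # Start local server (example with llama.cpp server) | |
| # --jinja uses the chat template stored in the GGUF metadata | |
| ./llama-server \ | |
|   -m ./pluto/Pluto-Q4_K_M.gguf \ | |
|   --port 8080 \ | |
|   -c 32768 \ | |
|   --jinja | |
| # Use with Claude Code (adjust the URL if a translation proxy sits in front of the server; | |
| # the token and model name are placeholders for a local setup) | |
| export ANTHROPIC_BASE_URL=http://localhost:8080 | |
| export ANTHROPIC_AUTH_TOKEN=not-needed | |
| claude --model pluto "Review this PR and identify potential bugs" | |
| ``` | |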
| --- | |
| ## OpenAI Codex / Assistants API Integration | |
| Pluto's instruction format is compatible with the OpenAI Chat Completions API when served through a compatible endpoint: | |
| ```python | |
| from openai import OpenAI | |
| client = OpenAI( | |
| base_url="http://localhost:8080/v1", # your local Pluto server | |
| api_key="not-needed" | |
| ) | |
| response = client.chat.completions.create( | |
| model="pluto", | |
| messages=[ | |
| { | |
| "role": "user", | |
| "content": "Write a SQL query to find the top 5 customers by revenue in the last 30 days, handling NULL values correctly." | |
| } | |
| ], | |
| max_tokens=1024, | |
| temperature=0.6, | |
| ) | |
| print(response.choices[0].message.content) | |
| ``` | |
| --- | |
| ## Training Details | |
| ### Pipeline Overview | |
| ``` | |
| Qwen/Qwen3.5-9B-Base | |
| │ | |
| ▼ | |
| SFT on curated advanced reasoning + coding dataset | |
| (private dataset, distillation from frontier models) | |
| │ | |
| ▼ | |
| GRPO Reinforcement Learning | |
| with Adaptive Entropy Regularization (AER) | |
| + IBM Quantum Kingston entropy noise injection | |
| │ | |
| ▼ | |
| Long-context fine-tuning (1M token extension) | |
| │ | |
| ▼ | |
| Agentic deployment fine-tuning | |
| (Claude Code + Codex format alignment) | |
| │ | |
| ▼ | |
| Pluto 9B | |
| ``` | |
| ### Adaptive Entropy Regularization (AER) | |
| During RL training, the loss function was modified as: | |
| ``` | |
| L_total = L_RL + λ(t) · L_entropy | |
| ``` | |
| where `λ(t)` is a dynamic coefficient modulated by quantum bitstring measurements from the IBM Quantum Kingston (Heron r2) processor. GHZ-state measurements provided true quantum randomness that guided the per-token entropy targets, preventing entropy collapse and improving robustness. | |
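To make the formula above concrete, here is a minimal sketch of an entropy-regularized loss with a bit-modulated coefficient. It is an illustration under stated assumptions, not the training code: the mapping from bitstring to `λ(t)` (`lambda_from_bits`, with hypothetical `base` and `spread` parameters) is invented for the example, `L_entropy` is taken to be the negative mean token entropy so that a positive coefficient rewards higher entropy, and a pseudo-random generator stands in for the quantum measurements.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def lambda_from_bits(bits, base=0.01, spread=0.5):
    """Map a measured bitstring to a coefficient around `base`.

    Hypothetical mapping: the fraction of ones shifts the coefficient
    linearly within [base * (1 - spread), base * (1 + spread)].
    """
    frac_ones = sum(bits) / len(bits)
    return base * (1.0 + spread * (2.0 * frac_ones - 1.0))


def total_loss(rl_loss, token_probs, bits):
    """L_total = L_RL + lambda(t) * L_entropy, with L_entropy = -mean entropy."""
    mean_entropy = sum(token_entropy(p) for p in token_probs) / len(token_probs)
    entropy_loss = -mean_entropy  # minimizing this term pushes entropy up
    return rl_loss + lambda_from_bits(bits) * entropy_loss


# Stand-in for hardware bitstrings (assumption): pseudo-random bits
rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(16)]
probs = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
print(total_loss(rl_loss=1.0, token_probs=probs, bits=bits))
```

Because the coefficient stays positive in this sketch, a collapsing (near-deterministic) policy always pays a penalty relative to a more exploratory one; the bitstring only changes how strongly that penalty is weighted at a given step.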
| ### Compute | |
| Training was conducted on Google Cloud TPU/GPU infrastructure supported by a **Google TPU Research Cloud (TRC) grant** awarded to Merlin Research. | |
| --- | |
| ## Intended Use | |
| - Complex code generation and refactoring | |
| - Multi-file codebase analysis | |
| - Agentic coding pipelines (Claude Code, Codex) | |
| - Code review and bug detection | |
| - Architecture planning and technical reasoning | |
| - Local deployment with large private codebases | |
| --- | |
| ## Limitations | |
| - Pluto is optimized for coding and technical reasoning — general conversation and creative tasks are outside its primary design goal | |
| - Like all LLMs, Pluto can produce incorrect code; always review generated output before deploying to production | |
| - Performance on very niche frameworks or proprietary APIs may be limited by training data coverage | |
| - Quantum entropy component provides training-time benefits; inference behavior is classical | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @misc{pluto-2026, | |
| title={Pluto: Precision Coding and Reasoning Model with Quantum Entropy Regularization}, | |
| author={Merlin Research}, | |
| year={2026}, | |
| publisher={Merlin Research}, | |
| url={https://huggingface.co/MerlinSafety/Pluto} | |
| } | |
| ``` | |
| --- | |
| ## About Merlin Research | |
| [Merlin Research](https://huggingface.co/MerlinSafety) is an independent AI safety laboratory based in Stockholm, Sweden, focused on open-source model development, adaptive entropy regularization, and practical AI alignment. Our models are released publicly to advance accessible, safe, and high-quality AI for the research community. | |
| **HuggingFace:** [huggingface.co/MerlinSafety](https://huggingface.co/MerlinSafety) | |
| **Contact:** MerlinResearch@protonmail.com |