Instructions for using Drissman/hermythos-rdt with libraries, inference servers, and local apps.

Libraries

How to use Drissman/hermythos-rdt with llama-cpp-python:

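# pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF file from the Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="Drissman/hermythos-rdt",
    filename="bonsai-rdt-q4_k_m.gguf",
)

# Run a single-turn chat completion and print the assistant's reply.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
print(response["choices"][0]["message"]["content"])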

Local Apps

llama.cpp

How to use Drissman/hermythos-rdt with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Drissman/hermythos-rdt:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf Drissman/hermythos-rdt:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Drissman/hermythos-rdt:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf Drissman/hermythos-rdt:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Drissman/hermythos-rdt:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Drissman/hermythos-rdt:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Drissman/hermythos-rdt:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Drissman/hermythos-rdt:Q4_K_M

Use Docker

docker model run hf.co/Drissman/hermythos-rdt:Q4_K_M

vLLM

How to use Drissman/hermythos-rdt with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Drissman/hermythos-rdt"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Drissman/hermythos-rdt",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
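
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the vLLM server from the step above is listening on localhost:8000; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumed defaults, mirroring the curl call above.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "Drissman/hermythos-rdt"


def build_chat_request(prompt, model=MODEL_ID):
    """Return the JSON body for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, base_url=BASE_URL, model=MODEL_ID, timeout=120):
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("What is the capital of France?"))
```

Pointing `base_url` at http://localhost:8080/v1 should work the same way against the llama.cpp server, since both expose an OpenAI-compatible endpoint.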

Use Docker

docker model run hf.co/Drissman/hermythos-rdt:Q4_K_M

Ollama
How to use Drissman/hermythos-rdt with Ollama:
```
ollama run hf.co/Drissman/hermythos-rdt:Q4_K_M
```

Unsloth Studio

How to use Drissman/hermythos-rdt with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Drissman/hermythos-rdt to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Drissman/hermythos-rdt to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Drissman/hermythos-rdt to start chatting

Pi

How to use Drissman/hermythos-rdt with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf Drissman/hermythos-rdt:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Drissman/hermythos-rdt:Q4_K_M"
        }
      ]
    }
  }
}
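
If ~/.pi/agent/models.json already lists other providers, pasting the block above over it would drop them. The sketch below shows one way to merge the entry instead. It only builds and prints the merged JSON, and the helper names are hypothetical, not part of Pi.

```python
import json


def llama_cpp_provider(base_url, model_id):
    """Build the provider entry shown above as a Python dict."""
    return {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }


def merge_provider(existing, provider_id, entry):
    """Return a copy of `existing` with one provider added or replaced."""
    merged = dict(existing)
    providers = dict(merged.get("providers", {}))
    providers[provider_id] = entry
    merged["providers"] = providers
    return merged


# Stand-in for the parsed contents of an existing models.json.
current = {"providers": {"other": {"baseUrl": "http://example.invalid/v1"}}}
entry = llama_cpp_provider(
    "http://localhost:8080/v1", "Drissman/hermythos-rdt:Q4_K_M"
)
updated = merge_provider(current, "llama-cpp", entry)
print(json.dumps(updated, indent=2))
```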

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Drissman/hermythos-rdt with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf Drissman/hermythos-rdt:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Drissman/hermythos-rdt:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use Drissman/hermythos-rdt with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf Drissman/hermythos-rdt:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "Drissman/hermythos-rdt:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use Drissman/hermythos-rdt with Docker Model Runner:
```
docker model run hf.co/Drissman/hermythos-rdt:Q4_K_M
```

Lemonade

How to use Drissman/hermythos-rdt with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Drissman/hermythos-rdt:Q4_K_M

Run and chat with the model

lemonade run user.hermythos-rdt-Q4_K_M

List all available models

lemonade list

HERMYTHOS — Agentic AI Engine + Ternary RDT Model

HERMES (Rust agentic engine) + MYTHOS (1.58-bit ternary RDT model). One sovereign binary. Zero Big Tech dependence.

Qwen3-8B base → Ternary Bonsai weights → QAT fine-tuned → Q4_K_M quantized
4.68 GB · 4.90 BPW · 399 tensors · 73.5% QAT accuracy · CPU-native

Why HERMYTHOS exists

Every frontier model runs on someone else's cloud. Every agent platform phones home. Every fine-tune assumes NVIDIA.

HERMYTHOS breaks all three assumptions.

Sovereign inference — runs CPU-only on a laptop (5 tok/s on Intel Core Ultra, Q4_K_M). No GPU required. No API key needed.
Open weights — 1.58-bit ternary model derived from Qwen3-8B via QAT on H100, quantized to Q4_K_M (4.7 GB). You own every parameter.
Agent-native engine — 22 Rust crates (29 total), cybernetic loop with 8 tools, 3 frontends (TUI / Flutter Web / Open WebUI). Zero Python at runtime.

The model itself is only half the equation — the Rust engine (hermythos-server) provides the agentic scaffolding: tool execution, FSM-based cybernetic loops, memory persistence, and multi-agent LLM debate via RecursiveMAS.
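
To make "FSM-based loop with tool execution" concrete, here is a minimal sketch of the idea. It is an illustration only: the real engine is written in Rust, and the state names, the `run_agent` function, and the scripted stand-in for the model are all hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    THINK = auto()
    ACT = auto()
    DONE = auto()


def run_agent(model, tools, task, max_steps=8):
    """Drive a think/act loop until the model answers or steps run out.

    `model` is any callable that takes the transcript (a list of strings)
    and returns either ("tool", name, argument) or ("answer", text).
    """
    transcript = [f"task: {task}"]  # doubles as the loop's memory
    state = State.THINK
    pending = None
    for _ in range(max_steps):
        if state is State.THINK:
            decision = model(transcript)
            if decision[0] == "answer":
                transcript.append(f"answer: {decision[1]}")
                state = State.DONE
            else:
                pending = decision[1:]
                state = State.ACT
        elif state is State.ACT:
            name, argument = pending
            result = tools[name](argument)
            transcript.append(f"{name}({argument}) -> {result}")
            state = State.THINK
        if state is State.DONE:
            break
    return state, transcript


def scripted_model(transcript):
    # Stand-in for an LLM call: request the tool once, then answer with its result.
    if len(transcript) == 1:
        return ("tool", "add", "2 3")
    return ("answer", transcript[-1].split("-> ")[1])


def add(argument):
    a, b = argument.split()
    return str(int(a) + int(b))


final_state, log = run_agent(scripted_model, {"add": add}, "What is 2 + 3?")
print(final_state, log)
```

Replacing `scripted_model` with a call to the OpenAI-compatible endpoint would turn this into a working, if very small, agent.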

Quick Install

# One command. Downloads the model + engine.
curl -fsSL https://raw.githubusercontent.com/drissman/hermythos/main/scripts/install.sh | bash

Or manually:

# 1. Get the model
hf download Drissman/hermythos-rdt bonsai-rdt-q4_k_m.gguf --local-dir ./models

# 2. Clone the engine
git clone https://github.com/drissman/hermythos
cd hermythos
# 3. Build and run the server (the model was downloaded one directory up)
cargo run -p hermythos-server --release -- --model ../models/bonsai-rdt-q4_k_m.gguf

Technical Specs


Architecture	Qwen3-8B → Ternary Bonsai (BitLinear 1.58-bit)
Base model	`prism-ml/Ternary-Bonsai-8B-unpacked`
Training	QAT LoRA (rank 128), 3 epochs, 150 ChatML examples
Loss	4.94 → 3.43 (QAT on H100 GPU)
Accuracy	21.8% → 73.5% (ternary fidelity)
Layers patched	252 (all Linear → BitLinear {-1, 0, +1})
Quantization	Q4_K_M via llama.cpp (399 tensors, 642 s)
Final size	4.68 GB (down from 15.6 GB FP16)
BPW	4.90 bits per weight
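
The size figures in the table can be cross-checked with a few lines of arithmetic. This sketch assumes "GB" here means GiB (2^30 bytes), which is an assumption on our part and not stated in the table; the function names are illustrative.

```python
GIB = 1024 ** 3  # assumption: the table's "GB" is really GiB


def compression_ratio(original_size, quantized_size):
    """How many times smaller the quantized file is."""
    return original_size / quantized_size


def implied_weight_count(file_size_bytes, bits_per_weight):
    """Number of weights a file of this size holds at the given BPW."""
    return file_size_bytes * 8 / bits_per_weight


ratio = compression_ratio(15.6, 4.68)
weights = implied_weight_count(4.68 * GIB, 4.90)

print(f"size reduction vs FP16: {ratio:.2f}x")
print(f"implied weight count:   {weights / 1e9:.2f} billion")
```

Under that assumption the file size and BPW imply a little over 8 billion weights, consistent with an 8B-parameter model.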

Why Ternary Matters

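Standard LLMs store each weight as a 16-bit float, which is about 16 GB for an 8B model. Ternary weights take only the values {-1, 0, +1}, so each carries log2(3) ≈ 1.58 bits of information (roughly 10× denser than FP16 in theory, 8× with simple 2-bit packing), and the weight matrix products no longer need multiplications.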

FP16 matmul:  multiply-add-multiply-add...  (expensive)
Ternary matmul: add-skip-subtract...        (just additions)

This means:

Runs on CPU — no GPU required. Laptop-grade Core Ultra gets 5 tok/s.
Runs on RISC-V — no CUDA dependency. No NVIDIA lock-in.
Small footprint: the 4.7 GB Q4_K_M file (about 3.3× smaller than the 15.6 GB FP16 weights) fits in the RAM and disk of most machines built after 2015.
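
The "just additions" claim can be checked on a toy example. The sketch below compares an ordinary matrix-vector product with one that only adds, subtracts, or skips. It illustrates the arithmetic only and is not how the llama.cpp kernels process the shipped Q4_K_M file; in a real ternary layer a per-tensor scale would still be applied once per output.

```python
def dense_matvec(weights, x):
    """Reference product: one multiply per weight."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def ternary_matvec(weights, x):
    """Same product for weights in {-1, 0, +1}, using only add and subtract."""
    out = []
    for row in weights:
        acc = 0.0
        for w, v in zip(row, x):
            if w == 1:
                acc += v
            elif w == -1:
                acc -= v
            # w == 0: the input is skipped entirely
        out.append(acc)
    return out


W = [[1, 0, -1], [-1, 1, 1]]
x = [0.5, 2.0, -1.5]

print(dense_matvec(W, x))    # [2.0, 0.0]
print(ternary_matvec(W, x))  # [2.0, 0.0]
```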

The trade-off is training complexity: ternary quantization requires QAT (Quantization-Aware Training) with a straight-through estimator (STE), because rounding to {-1, 0, +1} has zero gradient almost everywhere. This model was fine-tuned on an H100, but inference needs only a CPU.
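
A minimal sketch of what QAT with a straight-through estimator looks like for a single linear unit. It assumes absmean scaling in the style of BitNet b1.58; the actual training pipeline for this model is not documented here, so treat the quantizer, the loss, and the function names as illustrative.

```python
def ternarize(weights, eps=1e-8):
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ste_step(latent, x, target, lr=0.1):
    """One SGD step on latent weights for the loss (y - target)^2 / 2.

    The forward pass uses the ternarized weights. The backward pass treats
    quantization as the identity, so the gradient with respect to the
    quantized weights is applied directly to the full-precision latent ones.
    """
    codes, scale = ternarize(latent)
    y = scale * sum(c * v for c, v in zip(codes, x))
    error = y - target
    grads = [error * v for v in x]  # dL/dw_q, passed straight through
    updated = [w - lr * g for w, g in zip(latent, grads)]
    return updated, y


latent = [0.9, -0.05, -0.4, 0.3]
codes, scale = ternarize(latent)
print(codes)  # [1, 0, -1, 1]

updated, y = ste_step(latent, [1.0, 1.0, 1.0, 1.0], target=0.0)
print(y, updated)
```

The latent weights stay in full precision during training and move a little each step; only the forward pass ever sees the ternary codes.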

Architecture — 3-Layer Stack

┌─────────────────────────────────────────┐
│  UI LAYER — 3 Frontends                  │
│  TUI (ratatui) · Flutter Web · Open WebUI│
└──────────────┬──────────────────────────┘
               │ WebSocket / OpenAI API
┌──────────────▼──────────────────────────┐
│  ORCHESTRATION — Rust/Tokio (22 crates)  │
│  Agent Core · Tools · Memory · Skills    │
│  RecursiveMAS · MDASH · Cluster · Faber  │
└──────────────┬──────────────────────────┘
               │ GGUF · llama.cpp
┌──────────────▼──────────────────────────┐
│  MODEL — BonsaiRDT Ternary               │
│  Qwen3-8B base · BitLinear · 252 layers  │
│  1.58-bit weights · 4.68 GB Q4_K_M       │
└─────────────────────────────────────────┘

The engine (22 crates):

hermythos-server — WebSocket + OpenAI-compatible API backend
hermythos-mas — RecursiveMAS multi-agent topologies
hermythos-cluster — Distributed GRPO + debate orchestration
hermythos-compute — TERNARY format, CPU backend (AVX2)
faber-* — Sovereign data platform (Faber Foundry, MIT)

270+ tests; `cargo test --workspace` passes in full.

Performance

Backend	Hardware	tok/s
llama.cpp CPU	Intel Core Ultra 7 165U (10-core)	5.09
llama.cpp GPU (Intel Arc OpenCL)	WSL2 D3D12 translation	1.80 (slower than CPU)
llama.cpp CPU	AMD Ryzen 9	~12 (estimated)

Rule: CPU-only on WSL2. GPU path is slower due to D3D12 overhead.
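
To reproduce a tok/s figure on your own hardware, a timing helper along these lines is enough. It is a sketch: `generate` stands for any callable you supply that runs one generation and returns the number of tokens produced, and it is not part of llama.cpp.

```python
import time


def tokens_per_second(token_count, elapsed_seconds):
    """Throughput for one generation run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds


def benchmark(generate, prompt):
    """Time one call to `generate`, which must return the token count."""
    start = time.perf_counter()
    token_count = generate(prompt)
    return tokens_per_second(token_count, time.perf_counter() - start)


# Ratio of the two measured rows in the table above.
cpu_vs_gpu = 5.09 / 1.80
print(f"CPU is {cpu_vs_gpu:.2f}x faster than the WSL2 GPU path")
```

With llama-cpp-python, `generate` could wrap `create_chat_completion` and return `response["usage"]["completion_tokens"]`.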

Roadmap — 100 Days

✅ Distribution  →  ⬜ Documentation  →  ⬜ External Testing  →  ⬜ Auto-Evolution

Day	Phase	Status
1-10	Distribution (HF repo, install script, QAT)	✅ DONE
11-25	Documentation (README, quickstart, architecture)	🔄 NOW
26-50	External testing + feedback loop	⬜
51-80	Continual Self-Evolution (IT16)	⬜
81-100	Sovereignty (offline mode, zero cloud, Faber)	⬜

Persona-Driven Design

HERMYTHOS ships with 6 interaction modes tuned by BMAD (Brain-inspired Multi-Agent Distillation):

Mode	Style	Use
TDA	Dense, matrices, sharp decisions	Technical architecture
Bilan	KPIs, gap analysis, retrospectives	Project review
Coaching	Mentorship, concrete plans	Onboarding
Personnel	Individual optimization	Self-improvement
Général	Conceptual explanations	Discovery
Cybernétique	2nd-order systemic analysis	Meta-cognition

Credits

Architecture & Training: Driss NAAMANE (Senior Cloud Architect / TDA)
Base Model: Qwen3-8B (Alibaba) + Ternary Bonsai (prism-ml)
QAT Pipeline: H100 RunPod, TRL + PEFT
Quantization: llama.cpp Q4_K_M
Engine: 22 crates Rust, 270+ tests, Apache 2.0 / MIT dual-licensed

License

Model weights: Apache 2.0 (inherited from Qwen3)
Engine: MIT

"A model you own. An engine you control. A platform no one can take from you."

Format: GGUF
Model size: 8B params
Architecture: qwen3
Quantization: 4-bit
Base model: prism-ml/Ternary-Bonsai-8B-unpacked
Quantized variants of the base model: 18 (including this model)