How to use from the llama-cpp-python library
# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Drissman/hermythos-rdt",
	filename="bonsai-rdt-q4_k_m.gguf",
)
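response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# The result follows the OpenAI chat-completion schema.
print(response["choices"][0]["message"]["content"])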

HERMYTHOS: Agentic AI Engine + Ternary RDT Model

HERMES (Rust agentic engine) + MYTHOS (1.58-bit ternary RDT model). One sovereign binary. Zero Big Tech dependence.

Qwen3-8B base → Ternary Bonsai weights → QAT fine-tuned → Q4_K_M quantized
4.68 GB · 4.90 BPW · 399 layers · 73.5% QAT accuracy · CPU-native


Why HERMYTHOS exists

Every frontier model runs on someone else's cloud. Every agent platform phones home. Every fine-tune assumes NVIDIA.

HERMYTHOS breaks all three assumptions.

  • Sovereign inference: runs CPU-only on a laptop (5 tok/s on Intel Core Ultra, Q4_K_M). No GPU required. No API key needed.
  • Open weights: 1.58-bit ternary model derived from Qwen3-8B via QAT on H100, quantized to Q4_K_M (4.7 GB). You own every parameter.
  • Agent-native engine: 22 Rust crates (29 total), cybernetic loop with 8 tools, 3 frontends (TUI / Flutter Web / Open WebUI). Zero Python at runtime.

The model itself is only half the equation. The Rust engine (hermythos-server) provides the agentic scaffolding: tool execution, FSM-based cybernetic loops, memory persistence, and multi-agent LLM debate via RecursiveMAS.
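To make the idea of an FSM-based loop with tool execution and memory concrete, here is a minimal sketch. It is written in Python for brevity (the real engine is Rust), and every name in it (`State`, `TOOLS`, `run_agent`, `toy_plan`, the `calculator` tool) is hypothetical and not taken from hermythos-server.

```python
from enum import Enum, auto
from typing import Callable, Dict, List, Optional, Tuple


class State(Enum):
    PLAN = auto()     # decide the next action
    ACT = auto()      # execute the chosen tool
    OBSERVE = auto()  # store the tool result in memory
    DONE = auto()     # a final answer is ready


# Hypothetical tool registry: tool name -> function of one string argument.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}

# A planner returns ("tool", name, argument) or ("answer", text, "").
Planner = Callable[[str, List[str]], Tuple[str, str, str]]


def run_agent(task: str, plan: Planner, max_steps: int = 8) -> str:
    """Drive the PLAN -> ACT -> OBSERVE cycle until DONE or the step budget ends."""
    memory: List[str] = []
    state = State.PLAN
    pending: Optional[Tuple[str, str]] = None
    result = ""
    answer = ""
    for _ in range(max_steps):
        if state is State.PLAN:
            kind, first, second = plan(task, memory)
            if kind == "answer":
                answer, state = first, State.DONE
            else:
                pending, state = (first, second), State.ACT
        elif state is State.ACT:
            name, arg = pending
            result = TOOLS[name](arg)
            state = State.OBSERVE
        elif state is State.OBSERVE:
            memory.append(result)
            state = State.PLAN
        if state is State.DONE:
            return answer
    return "step budget exhausted"


def toy_plan(task: str, memory: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: call the calculator once, then answer from memory."""
    if not memory:
        return ("tool", "calculator", task)
    return ("answer", memory[-1], "")


print(run_agent("2+3", toy_plan))
```

In a real deployment the planner would be a call to the model; the fixed step budget is one simple way to keep such a loop from running forever.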


Quick Install

# One command. Downloads the model + engine.
curl -fsSL https://raw.githubusercontent.com/drissman/hermythos/main/scripts/install.sh | bash

Or manually:

# 1. Get the model
hf download Drissman/hermythos-rdt bonsai-rdt-q4_k_m.gguf --local-dir ./models

# 2. Clone the engine
git clone https://github.com/drissman/hermythos
cd hermythos
# The model was downloaded next to the clone, so it lives one directory up.
cargo run -p hermythos-server --release -- --model ../models/bonsai-rdt-q4_k_m.gguf

Technical Specs

| Property | Value |
|---|---|
| Architecture | Qwen3-8B → Ternary Bonsai (BitLinear 1.58-bit) |
| Base model | prism-ml/Ternary-Bonsai-8B-unpacked |
| Training | QAT LoRA (rank 128), 3 epochs, 150 ChatML examples |
| Loss | 4.94 → 3.43 (QAT on H100 GPU) |
| Accuracy | 21.8% → 73.5% (ternary fidelity) |
| Layers patched | 252 (all Linear → BitLinear {-1, 0, +1}) |
| Quantization | Q4_K_M via llama.cpp (399 blocks, 642 s) |
| Final size | 4.68 GB (down from 15.6 GB FP16) |
| BPW | 4.90 bits per weight |
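The size and BPW figures are linked by a simple relation: file size in bytes is roughly weight count × BPW / 8. The sketch below reruns that arithmetic using only numbers from the specs. Whether "GB" means 10^9 or 2^30 bytes is not stated, so both readings are computed; the function names are illustrative.

```python
FP16_GB = 15.6  # FP16 size quoted in the specs
Q4_GB = 4.68    # Q4_K_M size quoted in the specs
BPW = 4.90      # bits per weight quoted in the specs


def compression_ratio(before_gb: float, after_gb: float) -> float:
    """How many times smaller the quantized file is."""
    return before_gb / after_gb


def implied_weights(size_bytes: float, bits_per_weight: float) -> float:
    """Weight count implied by a file size and an average bits-per-weight."""
    return size_bytes * 8 / bits_per_weight


ratio = compression_ratio(FP16_GB, Q4_GB)
weights_if_gb = implied_weights(Q4_GB * 1e9, BPW)     # GB read as 10^9 bytes
weights_if_gib = implied_weights(Q4_GB * 2**30, BPW)  # GB read as 2^30 bytes

print(f"{ratio:.2f}x smaller than FP16")
print(f"{weights_if_gb / 1e9:.2f}B weights if GB = 10^9 bytes")
print(f"{weights_if_gib / 1e9:.2f}B weights if GB = 2^30 bytes")
```

The 2^30 reading lands closer to an 8B-parameter model, which suggests the sizes may be reported in GiB.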

Why Ternary Matters

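Standard LLMs store each weight as a 16-bit float, which is about 16 GB for an 8B model. Ternary restricts every weight to {-1, 0, +1}, which carries log2(3) ≈ 1.58 bits, roughly 10× denser than FP16 in ideally packed form, and it replaces the multiplications in weight matmuls with additions and subtractions.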

FP16 matmul:  multiply-add-multiply-add...  (expensive)
Ternary matmul: add-skip-subtract...        (just additions)

This means:

  • Runs on CPU: no GPU required. A laptop-grade Core Ultra gets about 5 tok/s.
  • Runs on RISC-V: no CUDA dependency. No NVIDIA lock-in.
  • About 3.3× smaller than FP16 as shipped: the 4.68 GB Q4_K_M file (down from 15.6 GB) fits in the RAM and disk of most machines built after 2015.
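The add/skip/subtract pattern can be checked in a few lines. This is an illustrative pure-Python sketch, not the engine's AVX2 kernel: it multiplies a ternary matrix by a vector using no multiplications and compares the result with an ordinary dot product. Real BitLinear layers typically also apply a scale factor to the output, which is left out here.

```python
from typing import List


def ternary_matvec(w: List[List[int]], x: List[float]) -> List[float]:
    """Matrix-vector product for weights in {-1, 0, +1} using only add/subtract."""
    out = []
    for row in w:
        acc = 0.0
        for weight, value in zip(row, x):
            if weight == 1:
                acc += value   # +1: add the activation
            elif weight == -1:
                acc -= value   # -1: subtract the activation
            # 0: skip the activation entirely
        out.append(acc)
    return out


def dense_matvec(w: List[List[int]], x: List[float]) -> List[float]:
    """Reference implementation with ordinary multiply-adds."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]

print(ternary_matvec(W, x))  # [2.0, 0.0]
print(ternary_matvec(W, x) == dense_matvec(W, x))  # True
```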

The trade-off is training complexity: ternary quantization requires QAT (Quantization-Aware Training) with a straight-through estimator (STE), because rounding to {-1, 0, +1} has zero gradient almost everywhere. This model was fine-tuned on an H100, but inference runs anywhere.
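As a rough illustration of QAT with a straight-through estimator, the sketch below ternarizes a weight vector and takes one gradient step on the latent full-precision weights. The README does not say which quantizer was used; the absmean rule here (scale by the mean absolute weight, round, clip) is the one described for BitNet b1.58 and is only an assumption about this model. The function names and the toy data are also illustrative.

```python
from typing import List, Tuple


def absmean_ternarize(w: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Quantize to {-1, 0, +1}: divide by mean |w|, round, then clip to [-1, 1]."""
    scale = sum(abs(v) for v in w) / len(w) + eps
    q = [max(-1, min(1, round(v / scale))) for v in w]
    return q, scale


def qat_step(w: List[float], x: List[float], target: float,
             lr: float = 0.1) -> Tuple[List[float], float]:
    """One SGD step on a single linear neuron with squared-error loss.

    Forward pass: uses the ternary weights (scale * q).
    Backward pass (STE): treats quantization as the identity, so the gradient
    computed for the quantized weights is applied to the latent weights w.
    """
    q, scale = absmean_ternarize(w)
    pred = scale * sum(qi * xi for qi, xi in zip(q, x))
    err = pred - target
    loss = 0.5 * err * err
    new_w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return new_w, loss


latent = [0.9, -0.05, -1.2, 0.3]
q, scale = absmean_ternarize(latent)
print(q, round(scale, 4))  # [1, 0, -1, 0] 0.6125

updated, loss = qat_step(latent, x=[1.0, 0.5, -1.0, 2.0], target=2.0)
print(absmean_ternarize(updated)[0])  # [1, 0, -1, 1]
```

The last weight flips from 0 to +1 after the update: the latent weights drift continuously while the ternary values change in discrete jumps, which is the behavior the STE makes trainable.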


Architecture: 3-Layer Stack

┌──────────────────────────────────────────┐
│  UI LAYER: 3 Frontends                   │
│  TUI (ratatui) · Flutter Web · Open WebUI│
└──────────────┬───────────────────────────┘
               │ WebSocket / OpenAI API
┌──────────────▼───────────────────────────┐
│  ORCHESTRATION: Rust/Tokio (22 crates)   │
│  Agent Core · Tools · Memory · Skills    │
│  RecursiveMAS · MDASH · Cluster · Faber  │
└──────────────┬───────────────────────────┘
               │ GGUF · llama.cpp
┌──────────────▼───────────────────────────┐
│  MODEL: BonsaiRDT Ternary                │
│  Qwen3-8B base · BitLinear · 252 layers  │
│  1.58-bit weights · 4.68 GB Q4_K_M       │
└──────────────────────────────────────────┘

The engine (22 crates):

  • hermythos-server: WebSocket + OpenAI-compatible API backend
  • hermythos-mas: RecursiveMAS multi-agent topologies
  • hermythos-cluster: Distributed GRPO + debate orchestration
  • hermythos-compute: TERNARY format, CPU backend (AVX2)
  • faber-*: Sovereign data platform (Faber Foundry, MIT)

270+ tests. cargo test --workspace passes with every test green.
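Because hermythos-server exposes an OpenAI-compatible API, any plain HTTP client should be able to talk to it. The sketch below builds such a request with the Python standard library. The base URL, port, model name, and the `/chat/completions` path are assumptions based on the usual OpenAI convention; the README does not specify them, so adjust them to whatever the server reports at startup.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # hypothetical host, port and prefix
MODEL_NAME = "bonsai-rdt-q4_k_m"       # hypothetical model identifier


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL_NAME) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` would return the reply text, assuming the response follows the OpenAI chat-completion schema.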


Performance

| Backend | Hardware | tok/s |
|---|---|---|
| llama.cpp CPU | Intel Core Ultra 7 165U (10-core) | 5.09 |
| llama.cpp GPU (Intel Arc OpenCL) | WSL2 D3D12 translation | 1.80 (useless) |
| llama.cpp CPU | AMD Ryzen 9 | ~12 (estimated) |

Rule: CPU-only on WSL2. GPU path is slower due to D3D12 overhead.
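To reproduce a tok/s figure on your own hardware, one option is to time a streamed generation. The helper below is a minimal sketch that works on any iterable of chunks. With llama-cpp-python you could pass the iterator returned by `create_chat_completion(..., stream=True)`, where each chunk roughly corresponds to one token, so the result is an approximation. The `clock` parameter is an illustrative addition that makes the function easy to test.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(stream: Iterable,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Consume a stream of chunks and return chunks per second of wall time."""
    start = clock()
    count = sum(1 for _ in stream)  # drains the stream, counting chunks
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")


def seconds_for(n_tokens: int, tok_per_s: float) -> float:
    """Estimated wall time to generate n_tokens at a given throughput."""
    return n_tokens / tok_per_s


# At the measured 5.09 tok/s, a 256-token reply takes roughly 50 seconds.
print(f"{seconds_for(256, 5.09):.1f} s")
```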


Roadmap: 100 Days

✅ Distribution → ⬜ Documentation → ⬜ External Testing → ⬜ Auto-Evolution

| Day | Phase | Status |
|---|---|---|
| 1-10 | Distribution (HF repo, install script, QAT) | ✅ DONE |
| 11-25 | Documentation (README, quickstart, architecture) | 🔄 NOW |
| 26-50 | External testing + feedback loop | ⬜ |
| 51-80 | Continual Self-Evolution (IT16) | ⬜ |
| 81-100 | Sovereignty (offline mode, zero cloud, Faber) | ⬜ |

Persona-Driven Design

HERMYTHOS ships with 6 interaction modes tuned by BMAD (Brain-inspired Multi-Agent Distillation):

| Mode | Style | Use |
|---|---|---|
| TDA | Dense, matrices, sharp decisions | Technical architecture |
| Bilan | KPIs, gap analysis, retrospectives | Project review |
| Coaching | Mentorship, concrete plans | Onboarding |
| Personnel | Individual optimization | Self-improvement |
| Général | Conceptual explanations | Discovery |
| Cybernétique | 2nd-order systemic analysis | Meta-cognition |

Credits

  • Architecture & Training: Driss NAAMANE (Senior Cloud Architect / TDA)
  • Base Model: Qwen3-8B (Alibaba) + Ternary Bonsai (prism-ml)
  • QAT Pipeline: H100 RunPod, TRL + PEFT
  • Quantization: llama.cpp Q4_K_M
  • Engine: 22 Rust crates, 270+ tests, Apache 2.0 / MIT dual-licensed

License

Model weights: Apache 2.0 (inherited from Qwen3)
Engine: MIT


"A model you own. An engine you control. A platform no one can take away from you."
