Chimère: A Self-Improving MoE Inference System for Consumer Hardware

35B parameters. 80 tokens/second on the production HTTP path. One GPU. $0.10/day. The model improves while you sleep.

🆕 Latest update — April 2026: Step 7 multi-architecture dispatch. The same chimere-server runtime now also runs Mamba-2 / Nemotron-H MoE hybrid SSM models end-to-end, on top of a custom backport of upstream llama.cpp's Mamba-2 work into our ik_llama.cpp fork (offered upstream as PR #1593). NVIDIA Nemotron-3-Nano-30B-A3B Q4_0 measured at ~45 tok/s on RTX 5060 Ti (sm_120, NCMOE=30, ctx 2048). Qwen3.5 production path is byte-for-byte unchanged. See Multi-architecture support below.


What is Chimère?

Chimère is a complete inference system — not just a model or a runtime, but an integrated stack where every component feeds the others. It runs Qwen3.5-35B-A3B (35B total parameters, 3.5B active per token, 256 experts) on a single RTX 5060 Ti (16 GB VRAM) at **80 tok/s on the chimere-server HTTP production path** (the bare ik_llama backend reaches ~93 tok/s; the Rust HTTP / sampling layer adds the difference). A nightly quality loop further improves the system from production traffic.

This is the kind of system NVIDIA builds for enterprise deployments, except it runs on a desktop in the south of France.

User request
     │
     ▼
┌─────────────────────────────────────────────────────────┐
│  ODO — Unified Orchestrator (port 8084)                 │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Intent      │  │ Entropy      │  │ Confidence    │  │
│  │ Classifier  │  │ Router       │  │ RAG Trigger   │  │
│  │ (3-cascade) │  │ (fast/qual/  │  │ (logprob      │  │
│  │             │  │  ultra)      │  │  probe)       │  │
│  └──────┬──────┘  └──────┬───────┘  └──────┬────────┘  │
│         │                │                  │           │
│  ┌──────▼──────────────────────────────────▼────────┐  │
│  │ Enrichment Pipeline                               │  │
│  │ • Web Search SOTA (8-stage: expand→search→RRF→    │  │
│  │   fetch→chunk→rerank→CRAG→synthesize)             │  │
│  │ • ChromaDB RAG (dense + BM25 + RRF + cross-enc)  │  │
│  │ • FAISS Semantic Few-shot (per domain)            │  │
│  │ • Dynamic Engram (web → n-gram logit bias)        │  │
│  │ • Tool Injection (auto, from pipeline YAML)       │  │
│  └──────┬────────────────────────────────────────────┘  │
│         │                                               │
│  ┌──────▼────────┐  ┌────────────────┐                  │
│  │ DVTS Tree     │  │ ABF + CGRS     │                  │
│  │ Search (K=2,  │  │ (thinking      │                  │
│  │ ThinkPRM)     │  │  budget mgmt)  │                  │
│  └──────┬────────┘  └──────┬─────────┘                  │
└─────────┼──────────────────┼────────────────────────────┘
          │                  │
          ▼                  ▼
┌─────────────────────────────────────────────────────────┐
│  chimere-server (port 8081) — Rust Runtime              │
│  • ik_llama FFI backend (93 tok/s)                      │
│  • Multi-tier Engram (Cuckoo <10ns → hash O(1) → FAISS)│
│  • Logprobs (top-5 log-softmax, real values)            │
│  • ABF token 248069 forcing at budget threshold         │
│  • IQ3_S custom-mix / RAMP-v2 (15.2 GB, 3.78 BPW)     │
│  • KV cache: q8_0 keys + q4_0 values (sweet spot)      │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│  Quality & Self-Improvement Loop                        │
│  • ThinkPRM-1.5B (CPU): step-level verification         │
│  • quality_scores.jsonl (104 scores, mean 3.04/5)       │
│  • training_pairs.jsonl (68 pairs, score ≥ 4)           │
│  • 03:00 — Nightly LoRA (MeZO, stops GGUF→train→restart)│
│  • 04:00 — Engram WRITE (quality-gated, decay >30d)     │
│  • Mon 02:00 — DSPy MIPROv2 prompt optimization         │
│  • 6h — ChromaDB RAG reindex                            │
└─────────────────────────────────────────────────────────┘

Why This Is SOTA for Consumer Hardware (March 2026)

vs. Existing Solutions

| System | What it does | What Chimère adds |
|---|---|---|
| llama.cpp / ik_llama | Serve GGUF models | We use ik_llama as backend (+23% vs stock), add Engram, ABF, quality loop |
| KTransformers | CPU/GPU co-serving for MoE | We go further: per-tensor mixed precision (RAMP), self-improving quality |
| Ollama / LM Studio | Easy local LLM UI | No orchestration, no quality loop, no domain memory, no nightly improvement |
| vLLM / TensorRT-LLM | High-throughput serving | Requires A100+, no consumer-GPU support for 35B MoE |
| OpenRouter / Together AI | API access to MoE models | Cloud, not local. $0.60/M tokens vs $0.10/day |
| Autoresearch (Karpathy) | Self-improving research agent | Concept paper, not a deployed system. No runtime, no quantization |
| DeepSeek Engram | Conditional memory for MoE | Paper only. We implemented a multi-tier version with quality-gated writes |

What No One Else Has Combined

  1. Rust runtime + custom CUDA sm_120 kernels for Blackwell consumer GPUs — 56K lines, the only MoE runtime in Rust
  2. RAMP data-free quantization — per-tensor mixed-precision without calibration data, 15.2 GB GGUF
  3. Multi-tier Engram — Cuckoo filter (<10ns) → N-gram hash (O(1)) → FAISS semantic (~5ms), with quality-gated nightly writes
  4. Adaptive Budget Forcing — thinking budget management for quantized reasoning (IQ3_S produces less coherent thinking than BF16, so budget must be shorter)
  5. The quality loop — ThinkPRM scores every response, high-quality pairs feed nightly LoRA + Engram + DSPy. The system improves from production traffic.
  6. Honest negative results — we tried and documented why speculative decoding (DFlash τ=6.06, wall-clock 0.73×), MTP (84.8% acceptance, 0.51×), and expert prefetch (86.65% hit, +1.1%) don't help on this hardware.
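The Adaptive Budget Forcing idea (item 4) reduces to a simple rule: once the thinking phase reaches its token budget, the sampler is only allowed to emit the close-of-thinking token. A minimal sketch, assuming a mutable logits list and an illustrative helper name (the production runtime forces token 248069 inside chimere-server; this is not its actual code):

```python
# Minimal sketch of Adaptive Budget Forcing (ABF).
# 248069 is the end-of-thinking token id forced by the production runtime;
# apply_abf and its signature are illustrative.
THINK_END_ID = 248069

def apply_abf(generated_ids, thinking_budget, in_thinking_phase, next_logits,
              think_end_id=THINK_END_ID):
    """Force the end-of-thinking token once the budget is exhausted.

    generated_ids: token ids emitted so far in the thinking phase
    thinking_budget: max thinking tokens (e.g. 2048 on the quality route)
    next_logits: per-vocab-entry logits, mutated in place
    """
    if in_thinking_phase and len(generated_ids) >= thinking_budget:
        # Collapse the distribution so the sampler can only pick think_end_id.
        for i in range(len(next_logits)):
            next_logits[i] = float("-inf")
        next_logits[think_end_id] = 0.0
    return next_logits
```

Because quantized thinking degrades faster than BF16 thinking, the budget passed in is deliberately shorter than what the unquantized model would get.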

The Components

1. ODO — Unified Orchestrator

chimere-odo | 17K lines Python

A single proxy between the user and the model that adds intelligence:

  • Intent Classification: 3-strategy cascade (regex 99% <1ms → filetype → LLM GBNF <50ms). Routes to: code, kine, cyber, research, default.
  • Entropy Router: Measures query complexity → fast (no-think, 0.7 temp), quality (think, 2048 budget, ABF 0.55), ultra (DVTS K=2, ThinkPRM).
  • Enrichment: Web search (8-stage SOTA pipeline), ChromaDB RAG (dense+BM25+RRF+cross-encoder), semantic few-shot, dynamic Engram.
  • Quality Gate: ThinkPRM-1.5B verifies step-level reasoning. Score ≤ 2 → retry with reflection.
  • Pipeline YAML: Define multi-step agent workflows with hot-reload. 5 pipelines shipped (code: architect→coder, cyber: triage→correlate→remediate).
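The fast/quality/ultra split can be sketched with a cheap complexity proxy. The thresholds, the entropy-over-tokens proxy, and the function names below are illustrative assumptions, not ODO's actual router:

```python
# Hypothetical sketch of entropy-based route selection. Thresholds and the
# whitespace-token entropy proxy are illustrative, not ODO's real values.
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy of the token distribution: a cheap complexity proxy."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def pick_route(query):
    h = token_entropy(query.lower().split())
    if h < 2.0:        # short / repetitive query
        return {"route": "fast", "think": False, "temp": 0.7}
    if h < 4.5:        # typical question
        return {"route": "quality", "think": True, "budget": 2048, "abf": 0.55}
    return {"route": "ultra", "dvts_k": 2, "verifier": "ThinkPRM"}
```

The design point is that routing must cost microseconds, since it sits in front of every request; anything heavier would erase the latency win of the fast path.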

Why not LangChain/LlamaIndex? Too heavy, too many abstractions, designed for cloud APIs. ODO is 1,525 lines doing exactly what we need with zero external framework dependencies.

2. Engram — Multi-Tier Domain Memory

chimere-engram-tables

Inspired by DeepSeek Engram but implemented as a lightweight lookup system:

  • Tier 0 — Cuckoo filter (<10ns): skips 97% of lookups for tokens not in any table
  • Tier 1 — FNV-1a hash tables (O(1)): core N-gram matching, binary format compatible with Rust and Python
  • Tier 2 — FAISS semantic (~5ms): embedding-based few-shot example retrieval
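The three tiers compose into a short-circuiting lookup. A minimal sketch, with a plain Python set standing in for the Cuckoo filter, a dict for the FNV-1a tables, and the semantic tier stubbed out (class and method names are illustrative):

```python
# Sketch of the three-tier Engram lookup. A set stands in for the Cuckoo
# filter, a dict for the FNV-1a binary tables; tier 2 is an optional stub.

class EngramTables:
    def __init__(self, ngram_bias, semantic_index=None):
        self.ngram_bias = ngram_bias          # tier 1: ngram -> {token: bias}
        self.membership = set(ngram_bias)     # tier 0: fast membership reject
        self.semantic_index = semantic_index  # tier 2: FAISS-like fallback

    def lookup(self, context_ngram):
        # Tier 0: reject ~97% of contexts in O(1) before any table probe.
        if context_ngram not in self.membership:
            return None
        # Tier 1: exact n-gram hit returns a logit-bias dict.
        hit = self.ngram_bias.get(context_ngram)
        if hit is not None:
            return hit
        # Tier 2: fall back to semantic retrieval (~5 ms) when available.
        if self.semantic_index is not None:
            return self.semantic_index.nearest(context_ngram)
        return None
```

The ordering matters: the common case (no table entry) never touches a hash table, which is what keeps per-token overhead in the nanosecond range.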

Ablation results (measured on 10-question benchmark):

Engram v1 (α=0.35, think+response):  77%  ← biases thinking, DEGRADES
Engram OFF (α=0):                    85%  ← baseline
Engram v2 (α=0.1, response-only):   88%  ← PRODUCTION

Key insight: applying Engram bias during the thinking phase constrains reasoning with domain patterns. Response-only bias with low α is the sweet spot.

Why not RAG alone? RAG injects knowledge via context (expensive, limited by context window). Engram injects at the logit level (zero context cost, unlimited knowledge). They're complementary — RAG for long-form retrieval, Engram for factual bias.
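The production v2 policy above (response-only bias, low α) can be sketched as a single gate in the sampling path. The function name and the shape of the bias table are assumptions for illustration:

```python
# Sketch of the Engram v2 policy: alpha-scaled n-gram bias added to logits,
# but never during the thinking phase. ALPHA = 0.1 is the production value
# from the ablation; bias_table maps token_id -> weight.
ALPHA = 0.1

def bias_logits(logits, bias_table, in_thinking_phase, alpha=ALPHA):
    """Add alpha-scaled Engram bias to response logits only."""
    if in_thinking_phase or not bias_table:
        return logits          # v1 lesson: biasing the thinking phase degrades
    out = list(logits)
    for token_id, weight in bias_table.items():
        out[token_id] += alpha * weight
    return out
```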

3. RAMP — Data-Free Quantization Pipeline

ramp-quant | 9K lines Python + C

Produces hardware-optimized GGUF without calibration data:

  1. NSDS sensitivity — kurtosis + SVD rank per tensor, data-free
  2. Proxy model — round-trip quantization error × sensitivity = instant loss estimate
  3. Evolutionary search — 128 population, 200 generations, under VRAM budget
  4. Build — generates llama-quantize --custom-q command
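Steps 1 and 2 combine into a per-tensor proxy objective: a data-free sensitivity score times the round-trip quantization error. The sketch below uses excess kurtosis as a stand-in for the full NSDS score and a toy uniform quantizer; all names and the exact combination are illustrative, not the ramp-quant implementation:

```python
# Illustrative RAMP-style proxy loss: sensitivity (excess kurtosis stand-in
# for NSDS) x round-trip quantization error, with no calibration data.
import statistics

def excess_kurtosis(values):
    mean = statistics.fmean(values)
    var = statistics.fmean((v - mean) ** 2 for v in values)
    if var == 0:
        return 0.0
    m4 = statistics.fmean((v - mean) ** 4 for v in values)
    return m4 / var ** 2 - 3.0     # heavy tails -> harder to quantize

def roundtrip_error(values, bits):
    """Mean squared error of a toy symmetric uniform quantizer."""
    scale = max(abs(v) for v in values) / (2 ** (bits - 1) - 1)
    if scale == 0:
        return 0.0
    return statistics.fmean((v - round(v / scale) * scale) ** 2 for v in values)

def proxy_loss(tensor_values, bits):
    sensitivity = max(excess_kurtosis(tensor_values), 0.0) + 1.0
    return sensitivity * roundtrip_error(tensor_values, bits)
```

The evolutionary search then only has to minimize the sum of these instant estimates across tensors, subject to the VRAM budget, instead of running the model.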

GDN sensitivity hierarchy discovered:

SSM gates (α, β)  >  Attention Q/K  >  Shared experts  >  Routed experts
     Q8_0               Q5_K/Q6_K         Q5_K              IQ3_S

What we tried that failed:

| Method | Result | Why |
|---|---|---|
| QuaRot (Hadamard rotation) | PPL = 49,524 | Incompatible with GDN recurrent state |
| OptRot (Givens rotation) | PPL = 49,524 | No cross-layer absorption |
| ParoQuant (pairwise rotation) | OOM | 256 experts × grouped_mm > 16 GB |
| EvoPress (KL fitness) | OOM | Full model needed in RAM (70 GB) |
| GPTQ/AWQ layer-wise | Complex | MoE expert handling not supported |

4. chimere-server — Rust Inference Runtime

chimere | 56K lines Rust + 2.6K CUDA

The only MoE inference runtime written in Rust, with:

  • Custom CUDA kernels for sm_120 (IQ3_S dequant, Q8_0+dp4a GEMV, flash attention, fused MoE)
  • Three backends: libllama FFI (93 tok/s), cudarc (57 tok/s), Candle (18 tok/s)
  • GDN state save/restore (impossible in llama.cpp — this enables speculative decoding on hybrid architectures)
  • OpenAI-compatible /v1/chat/completions API with streaming logprobs
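Since the endpoint follows the standard /v1/chat/completions schema, any OpenAI-style client works. A minimal stdlib example (the base URL, model name, and the streaming loop are illustrative; logprobs/top_logprobs follow the standard schema the runtime implements):

```python
# Example request against chimere-server's OpenAI-compatible endpoint.
# Base URL and model name are illustrative; adjust to your deployment.
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8081"):
    payload = {
        "model": "qwen3.5-35b-a3b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,       # server streams SSE chunks
        "logprobs": True,     # real top-5 log-softmax values per token
        "top_logprobs": 5,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# req = build_request("Explain the GDN State Barrier in two sentences.")
# with urllib.request.urlopen(req) as resp:   # requires a running server
#     for line in resp:
#         print(line.decode().rstrip())
```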

Performance journey (5 days, March 14-19):

9.1 → 21 → 30.5 → 42.5 → 57 → 93 tok/s

Through 9 Candle optimizations, Q8_1+dp4a kernels, fused operations, and finally the libllama FFI breakthrough.

5. Self-Improvement Loop

The system improves while idle:

| Timer | What | How |
|---|---|---|
| 03:00 daily | LoRA training | MeZO zeroth-order on quality-filtered pairs (score ≥ 4). Stops GGUF server → trains → restarts (try/finally). |
| 04:00 daily | Engram WRITE | Add validated responses to domain tables. Decay: halve weight >30d, delete >90d. |
| Mon 02:00 | DSPy MIPROv2 | Bayesian prompt optimization per domain. Tested: code +8% on benchmark. |
| Every 6h | RAG reindex | ChromaDB re-ingestion of the knowledge base. |
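The Engram decay rule from the nightly WRITE job is simple enough to state exactly: entries older than 30 days have their weight halved, and entries older than 90 days are deleted. A sketch, with an assumed entry shape (the real tables are the FNV-1a binary format, not Python dicts):

```python
# Sketch of the nightly Engram decay pass: halve weight past 30 days,
# delete past 90. The {"weight", "written_day"} entry shape is illustrative.
def decay_entries(entries, now_days):
    kept = []
    for e in entries:
        age = now_days - e["written_day"]
        if age > 90:
            continue                              # delete stale entries
        if age > 30:
            e = dict(e, weight=e["weight"] / 2)   # halve aging entries
        kept.append(e)
    return kept
```

Decay keeps the tables from accumulating biases learned from traffic that is no longer representative, without any manual curation.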

Quality scoring: ThinkPRM-1.5B runs on CPU, scores every response 1-5 with step-level chain-of-thought verification. 104 scores logged (mean 3.04/5), 68 training pairs generated, 72 SPIN DPO pairs accumulated.

A 9B scorer model on CPU (qwen9b-scorer, port 8085) runs alongside the production 35B with zero VRAM impact. The previous 27B scorer required stopping production; switching to the 9B CPU scorer eliminated all nightly downtime.

Why not RLHF/DPO on cloud? We can't afford cloud GPU time. MeZO trains at inference cost — the script stops the GGUF server, trains for ~1 min, restarts. Quality is lower than full DPO but it's free and runs every night.

6. DFlash — Block Diffusion Drafter

Paper + code

8 architectures over 27 days. Best holdout result: τ=6.06 (comparable to original DFlash paper). But wall-clock = 0.73× (slowdown) because the target model is too fast (93 tok/s) for speculative decoding to help.

The GDN State Barrier: GDN recurrent layers cannot be rolled back after draft rejection. This is a structural incompatibility affecting all hybrid SSM-attention models (Jamba, RWKV, Qwen3.5). No amount of drafter improvement fixes this — it requires a new runtime (chimere-server provides one).
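The barrier is structural: attention KV caches can simply be truncated after a rejected draft, but a recurrent state mutates in place at every token, so the only correct recovery is snapshot-before-draft and restore-on-reject, which is exactly the GDN state save/restore chimere-server adds. A conceptual sketch with a toy state (all classes and names here are illustrative, not the Rust runtime):

```python
# Conceptual sketch of draft verification on a hybrid SSM model.
# RecurrentState stands in for per-layer GDN state; accept() for the
# verifier. The snapshot/restore is the part llama.cpp cannot do.
import copy

class RecurrentState:
    def __init__(self):
        self.h = [0.0]                         # toy per-layer recurrent state

    def step(self, token):
        self.h = [v + token for v in self.h]   # state mutates every token

def verify_draft(state, draft_tokens, accept):
    """Returns accepted tokens; leaves `state` consistent with them."""
    snapshot = copy.deepcopy(state)            # save recurrent state up front
    kept = []
    for t in draft_tokens:
        if not accept(t):
            break
        state.step(t)
        kept.append(t)
    if len(kept) < len(draft_tokens):
        # Rejection: restore the snapshot, then replay only accepted tokens.
        state.h = copy.deepcopy(snapshot.h)
        for t in kept:
            state.step(t)
    return kept
```

Without the snapshot there is no way to undo the steps taken on later-rejected draft tokens, which is why drafter quality alone can never fix the problem.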

7. MTP — Multi-Token Prediction (Negative Result)

Model | Patches

First MTP implementation for Qwen3.5 MoE in ik_llama.cpp (5 patches, 8 bugs fixed). 84.8% acceptance but 0.51× speedup — the MTP layer is itself MoE (256 experts on CPU), costing as much as a main forward.

8. Expert Prefetch (Negative Result)

Models

An MLP predictor reaches 86.65% hit@8 accuracy, yet end-to-end throughput improves by only +1.1%: ggml's multi-threaded CPU loop forces GPU prefetch to serialize work that was previously parallel.


Quantified Results

| Metric | Value |
|---|---|
| Generation throughput (Qwen3.5-35B-A3B prod path) | 80 tok/s chimere-server HTTP; ~93 tok/s bare ik_llama backend |
| Generation throughput (Nemotron-3-Nano-30B-A3B, NEW) | ~45 tok/s chimere-server HTTP, NCMOE=30, ctx 2048 |
| Model | Qwen3.5-35B-A3B, RAMP-v2, 15.2 GB (3.78 BPW) |
| VRAM usage | ~14 GB / 16 GB |
| Benchmark | 10/10 (code, math, tools, domain) |
| Engram ablation | v1: 77% → OFF: 85% → v2: 88% |
| Quality scores | 104 entries, mean 3.04/5 |
| DFlash τ (holdout) | 6.06 (comparable to paper's 6.4) |
| DFlash wall-clock | 0.73× (negative, honest result) |
| MTP acceptance | 84.8% (but 0.51× throughput) |
| Expert prefetch | 86.65% hit@8 (but only +1.1% throughput) |
| Code size | 121K lines (Rust + Python + CUDA + C) |
| Cost | $0.10/day electricity |
| Hardware | RTX 5060 Ti 16 GB, i5-14600KF, 32 GB DDR5 |

Multi-architecture support

As of April 2026 (Step 7 of the chimere-server multi-arch refactor), the same chimere-server runtime dispatches between two code paths based on the GGUF's general.architecture metadata:

| Path | Architectures | Features |
|---|---|---|
| Qwen3.5 (prod) | qwen35moe | Full stack: MTP, MRoPE, Engram, multi-agent, cudarc / Candle / libllama backends, fast C++ sampler |
| Generic (libllama) | mamba2, nemotron_h_moe, mamba | libllama-only: forward via LlamaForward FFI, no MTP, no Engram, single-agent at Step 7 |
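The dispatch itself is a lookup on the GGUF's general.architecture key. A sketch of the selection logic, with the metadata reader stubbed out as a plain dict (a real reader parses the GGUF KV header; the function name is illustrative):

```python
# Sketch of Step 7 dispatch on the GGUF general.architecture metadata key.
# Architecture strings match the table above; select_path is illustrative.
QWEN_PATH_ARCHS = {"qwen35moe"}
GENERIC_PATH_ARCHS = {"mamba2", "nemotron_h_moe", "mamba"}

def select_path(metadata):
    arch = metadata.get("general.architecture")
    if arch in QWEN_PATH_ARCHS:
        return "qwen35"    # full stack: MTP, MRoPE, Engram, fast sampler
    if arch in GENERIC_PATH_ARCHS:
        return "generic"   # libllama-only forward, no MTP/Engram at Step 7
    raise ValueError(f"unsupported architecture: {arch!r}")
```

Keeping the dispatch keyed on metadata rather than file names is what lets the Qwen3.5 production path stay byte-for-byte unchanged while new architectures are added to the Generic set.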

The Generic path was unblocked by a 12-commit Phase 3.x backport of upstream llama.cpp's Mamba-2 / Nemotron-H MoE support into our ik_llama.cpp fork, offered upstream as PR #1593. Validated end-to-end on:

  • unsloth/Nemotron-3-Nano-30B-A3B-GGUF Q4_0: 45 tok/s on RTX 5060 Ti, NCMOE=30, ctx 2048, via bin/test-nemotron and through HTTP /v1/chat/completions
  • unsloth/Nemotron-3-Nano-30B-A3B-GGUF UD-IQ3_XXS: same path, coherent text on CPU

Models that should run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, state-spaces/mamba2-*, mistralai/Mamba-Codestral-7B-v0.1, AI21-Jamba-Reasoning-3B, Hymba-1.5B-Base, Zamba2-7B.

The technical doc lives at chimere-server/docs/STEP7_MULTI_ARCH.md.


All Repositories

Code

| Repository | Lines | What |
|---|---|---|
| chimere | 96K | Rust runtime + DFlash + MTP patches + 5 papers + Step 7 multi-arch dispatch |
| chimere-odo | 17K | Orchestrator + Engram + search + quality loop |
| ik_llama.cpp (fork) | | C++ backend fork — branch mamba2-nemotron-h-backport + PR #1593 |
| ramp-quant | 9K | Quantization pipeline |

Models

| Model | Size | What |
|---|---|---|
| RAMP-v2-15G | 15.2 GB | Production GGUF (automated pipeline) |
| IQ3_S-custom-mix | 14.7 GB | Hand-crafted 317-override GGUF |
| IQ3_S-MTP | 18.4 GB | First MTP-enabled GGUF for Qwen3.5 MoE |
| MeZO LoRA | 340 KB | Zeroth-order LoRA proof-of-concept |

Data

| Dataset | What |
|---|---|
| chimere-dflash-data | DFlash training prompts (3,927) |
| chimere-quality-scores | Quality scores + training pairs |
| chimere-engram-tables | N-gram domain tables |
| chimere-expert-predictor | 4 MLP predictor variants |
| chimere-calibration | imatrix calibration corpus |

Papers (5 drafts, arXiv pending endorsement)

  1. Block Diffusion Drafting for Hybrid MoE Models — 8 architectures, GDN State Barrier, wall-clock 0.73×
  2. Chimère System Paper — the complete self-improving stack
  3. RAMP: Data-Free Mixed-Precision Quantization — 7 builds, QuaRot failure, RAMP-v2
  4. MTP on Qwen3.5 MoE — 84.8% acceptance, 0.51× (negative result)
  5. Expert Prefetch — 86.65% hit@8, zero speedup (negative result)

All LaTeX sources: chimere/paper/latex/


Author

Kévin Rémondière — Independent ML researcher, Oloron-Sainte-Marie, France

ORCID: 0009-0008-2443-7166

Built in 7 weeks on a desktop. Everything open-source. The model improves in its sleep.
