Chimera 47B

Klyrone F.Z.E. · March 2026 · Apache 2.0

Chimera 47B is a 46.7B-parameter Mixture-of-Experts (MoE) language model built with Klyrone's MoE assembly framework. It is constructed from Mixtral-8x7B-v0.1 and Mixtral-8x7B-Instruct-v0.1, combining the base model's knowledge with the instruct model's capabilities, without any additional training. With 8 experts and top-2 routing, only 12.9B parameters are active per token, enabling fast inference at 154 tokens/second on H200 hardware.
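The active-parameter figure follows from top-2 routing: only 2 of the 8 expert FFNs run per token, while attention, embedding, and router weights are always active. A back-of-envelope sketch (the per-expert and shared splits below are derived from the published totals, not official figures):

```python
# Back-of-envelope split of 46.7B total vs 12.9B active parameters.
# With top-k routing over n experts:
#   total  = shared + n * per_expert
#   active = shared + k * per_expert
total_b, active_b = 46.7, 12.9   # billions of parameters (from the model card)
n_experts, top_k = 8, 2

# Solving the two equations above for the per-expert and shared sizes:
per_expert_b = (total_b - active_b) / (n_experts - top_k)
shared_b = total_b - n_experts * per_expert_b

print(f"per expert: ~{per_expert_b:.1f}B, shared: ~{shared_b:.1f}B")
```

This puts each expert at roughly 5.6B parameters with about 1.6B always-on shared weights, consistent with the Mixtral architecture the model is assembled from.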

A technical paper detailing the methodology is forthcoming.


Key Numbers

Total Parameters 46.7 B
Active / Token 12.9 B
Architecture MoE · 8 experts · top-2 routing
Context Length 32,768 tokens
Generation Speed 154 t/s · H200
Prompt Processing 878 t/s · H200
Quantization Q5_K_M · 5.69 BPW
File Size 30.95 GB GGUF
License Apache 2.0
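The quoted file size is consistent with the bits-per-weight figure; a quick sanity check (assuming GGUF sizes are reported in GiB, as is conventional):

```python
# File size implied by 5.69 bits per weight over 46.7B parameters.
params = 46.7e9
bpw = 5.69

size_bytes = params * bpw / 8   # 8 bits per byte
size_gib = size_bytes / 2**30   # binary gigabytes

print(f"~{size_gib:.2f} GiB")   # close to the listed 30.95 GB
```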

Capabilities

  • ✅ Instruction following — multi-turn conversational coherence
  • ✅ Code generation — correct, edge-case-aware output
  • ✅ Creative writing — long-form prose and poetry
  • ✅ Factual reasoning — physics, mathematics, general knowledge
  • ✅ Consumer-grade deployment — fits accessible GPU budgets at Q5_K_M

Formal benchmark results (MMLU, HellaSwag, ARC-Challenge, GSM8K) are in progress.


About the Approach

Klyrone's MoE assembly framework constructs high-performance models by composing expert sub-networks from compatible source models — without full retraining. The expert FFN weights (w1, w2, w3) from Mixtral-8x7B-Instruct are transplanted into the Mixtral-8x7B-v0.1 base, preserving routing coherence while inheriting the instruction-tuned capabilities of the donor model.
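A minimal sketch of the transplant step, using toy state dicts in place of real checkpoints. The key layout mirrors the Hugging Face Mixtral naming (`block_sparse_moe.experts.{i}.{w1,w2,w3}`); the actual assembly framework is not public, so this is illustrative only:

```python
# Toy state dicts standing in for the base and instruct checkpoints.
# Values are strings for illustration; real entries would be tensors.
base = {
    "layers.0.self_attn.q_proj.weight": "base-attn",
    "layers.0.block_sparse_moe.gate.weight": "base-router",
    "layers.0.block_sparse_moe.experts.0.w1.weight": "base-expert",
}
instruct = {k: v.replace("base", "instruct") for k, v in base.items()}

def transplant_experts(base_sd, donor_sd):
    """Copy expert FFN weights (w1, w2, w3) from the donor into the base,
    leaving attention and router (gate) weights untouched."""
    merged = dict(base_sd)
    for key, tensor in donor_sd.items():
        if ".experts." in key and key.rsplit(".", 2)[-2] in ("w1", "w2", "w3"):
            merged[key] = tensor
    return merged

merged = transplant_experts(base, instruct)
# Experts now come from the instruct donor; routing stays with the base.
```

Because the router (gate) weights are left as-is, token-to-expert assignments remain those learned by the base model, which is what "preserving routing coherence" refers to above.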

For enterprise licensing or research collaboration, contact research@klyrone.com.


Usage

llama.cpp

./llama-server \
  -m Chimera-47B-Q5_K_M.gguf \
  -ngl 99 \
  --ctx-size 32768 \
  --port 8080
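Once running, llama-server exposes an OpenAI-compatible REST API. A minimal client using only the standard library; the endpoint path and payload shape follow llama-server's `/v1/chat/completions` API, and the request itself is left commented so the snippet runs without a live server:

```python
import json
import urllib.request

# Chat-completion payload in the OpenAI-compatible format llama-server expects.
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize mixture-of-experts routing."},
    ],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the server running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```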

Or for direct CLI inference:

./llama-cli \
  -m Chimera-47B-Q5_K_M.gguf \
  -p "You are a helpful assistant." \
  --ctx-size 32768 \
  -ngl 99 \
  -n 512

llama-cpp-python

from llama_cpp import Llama

# Download the GGUF from the Hub and load it; n_gpu_layers=99 offloads
# all layers to the GPU. n_ctx is kept below the model's 32,768-token
# maximum to reduce KV-cache memory.
llm = Llama.from_pretrained(
    repo_id="klyrone/Chimera",
    filename="Chimera-47B-Q5_K_M.gguf",
    n_gpu_layers=99,
    n_ctx=4096,
    verbose=False,
)

# create_chat_completion applies the chat template embedded in the GGUF,
# so the instruct prompt formatting is handled automatically.
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between supervised and unsupervised learning."},
    ],
    max_tokens=512,
)
print(output["choices"][0]["message"]["content"])

Ollama

ollama run hf.co/klyrone/Chimera

Note: This model is distributed as a GGUF file. Native Transformers loading (AutoModelForCausalLM) is not supported directly — use llama.cpp, llama-cpp-python, or Ollama for inference.


Hardware Requirements

Quantization         VRAM Required   Recommended Hardware
Q5_K_M (this file)   ~34 GB          A40, A100, 2× 3090/4090
Q4_K_M               ~27 GB          3090/4090, A6000
Q3_K_M               ~22 GB          24 GB consumer GPU
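The gap between the 30.95 GB file and the ~34 GB VRAM figure is mostly KV cache. A rough estimate using Mixtral-8x7B's dimensions (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 cache), assuming these carry over to this model:

```python
# Per-token KV-cache footprint: K and V, per layer, per KV head.
layers, kv_heads, head_dim = 32, 8, 128
bytes_fp16 = 2

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # K + V
ctx = 32_768

kv_total_gib = kv_per_token * ctx / 2**30
print(f"{kv_per_token / 1024:.0f} KiB per token, "
      f"~{kv_total_gib:.1f} GiB at {ctx} tokens")
```

At roughly 4 GiB of cache on top of the weights, the table's figures line up; lowering `--ctx-size` shrinks the cache linearly.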

Limitations

  • Router fine-tuning not yet applied — a short gate re-alignment is expected to yield marginal quality gains
  • No independent safety evaluation conducted — not recommended for unsupervised public-facing deployment
  • Benchmark results pending publication
  • STEM-heavy benchmarks (abstract algebra, HS math) may underperform relative to general capability, as mathematical knowledge is distributed across attention layers rather than expert FFNs

Citation

@misc{chimera47b2026,
  title        = {Chimera 47B},
  author       = {{Klyrone F.Z.E.}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/klyrone/Chimera}}
}

