Instructions to use klyrone/Chimera with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use klyrone/Chimera with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="klyrone/Chimera",
	filename="Chimera-47B-Q5_K_M.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use klyrone/Chimera with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf klyrone/Chimera:Q5_K_M
# Run inference directly in the terminal:
llama cli -hf klyrone/Chimera:Q5_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf klyrone/Chimera:Q5_K_M
# Run inference directly in the terminal:
llama cli -hf klyrone/Chimera:Q5_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf klyrone/Chimera:Q5_K_M
# Run inference directly in the terminal:
./llama-cli -hf klyrone/Chimera:Q5_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf klyrone/Chimera:Q5_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf klyrone/Chimera:Q5_K_M

Use Docker

docker model run hf.co/klyrone/Chimera:Q5_K_M

LM Studio
Jan

vLLM

How to use klyrone/Chimera with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "klyrone/Chimera"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "klyrone/Chimera",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/klyrone/Chimera:Q5_K_M

Ollama
How to use klyrone/Chimera with Ollama:
```
ollama run hf.co/klyrone/Chimera:Q5_K_M
```

Unsloth Studio

How to use klyrone/Chimera with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for klyrone/Chimera to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for klyrone/Chimera to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for klyrone/Chimera to start chatting

Atomic Chat new
Docker Model Runner
How to use klyrone/Chimera with Docker Model Runner:
```
docker model run hf.co/klyrone/Chimera:Q5_K_M
```

Lemonade

How to use klyrone/Chimera with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull klyrone/Chimera:Q5_K_M

Run and chat with the model

lemonade run user.Chimera-Q5_K_M

List all available models

lemonade list

Chimera 47B

Klyrone F.Z.E. · March 2026 · Apache 2.0

Modular Expert Assembly (MEA) is a zero-compute framework that surgically grafts instruct-tuned MoE experts into base attention layers, achieving polymathic synthesis without backpropagation fine-tuning.

Chimera 47B is a 46.7B parameter Mixture-of-Experts language model built using Klyrone's MoE assembly framework. It is constructed from Mixtral-8x7B-v0.1 and Mixtral-8x7B-Instruct-v0.1 — combining the base model's knowledge with the instruct model's capabilities — without any additional training. With 8 experts and top-2 routing, only 12.9B parameters are active per token, enabling fast inference at 154 tokens/second on H200 hardware.

A technical paper detailing the methodology is forthcoming.

Key Numbers


Total Parameters	46.7 B
Active / Token	12.9 B
Architecture	MoE · 8 experts · top-2 routing
Context Length	32,768 tokens
Generation Speed	154 t/s · H200
Prompt Processing	878 t/s · H200
Quantization	Q5_K_M · 5.69 BPW
File Size	30.95 GB GGUF
License	Apache 2.0

Capabilities

✅ Instruction following — multi-turn conversational coherence
✅ Code generation — correct, edge-case-aware output
✅ Creative writing — long-form prose and poetry
✅ Factual reasoning — physics, mathematics, general knowledge
✅ Consumer-grade deployment — fits accessible GPU budgets at Q5_K_M

Quantitative Benchmarks: Running zero-shot evaluations across diverse logical and reasoning paradigms yielded the following results for the Chimera-47B model (Q5_K_M):

Benchmark	Chimera 47B Q5 (0-shot)	Mixtral 8x7B Base (FP16)	Mixtral Instruct (FP16)	Target Baseline Context
WinoGrande	75.14%	77.2% (0-shot)	~77.0% (0-shot)	Exact matching logic retained
MMLU	67.80%	70.6% (5-shot)	71.4% (5-shot)	Near-native retrieval despite 0-shot/Q5
HellaSwag	In-Progress	84.4% (0-shot)	~84.0% (0-shot)	Pending final score
ARC-C	85.41%	85.8% (25-shot)	~85.0% (25-shot)	Exceptional 0-shot match versus 25-shot
GSM8K	73.69%	58.4% (8-shot)	74.4% (8-shot)	Flawless Instruct-math transfer

This confirms that MEA perfectly preserves basal logic without catastrophic forgetting.

Modular Expert Assembly (MEA) Framework

1. The Death of Catastrophic Forgetting & Model Training Paradigms

The open-source AI community often faces a financial barrier when scaling capabilities. While sparse Mixture-of-Experts (MoE) architectures (e.g., Mixtral 8x7B) have significantly reduced inference costs, fine-tuning them on instructions usually destroys their base logical perplexity (Catastrophic Forgetting).

This technical report introduces the solution: Modular Expert Assembly (MEA). Because an MoE model isolates domain-specific knowledge into discrete sub-networks governed by a frozen gate/router layer, we prove that these sub-networks can be treated as swappable logic units. By perfectly retaining base attention logic and surgically grafting instruct capabilities, MEA eliminates catastrophic forgetting entirely.

2. The MEA Framework

The MEA methodology enables "brain transplants" between two models that share an identical structural skeleton (layer count, hidden dimensions, expert count).

2.1 Structural Isolation

The foundational layers of the model—specifically the Multi-Head Attention (MHA), token embeddings, layer normalization, and the router mechanism—are extracted strictly from the Base Model. These layers hold foundational grammar and routing intuition established during extreme-scale pre-training.

2.2 Expert Swapping & Interpolation

We target strictly the routed experts (e.g., .block_sparse_moe.experts.N in Mixtral). An interpolation factor $\alpha \in [0, 1]$ dictates the degree of the swap: $W_{MEA} = (1 - \alpha) W_{base} + \alpha W_{donor}$ At $\alpha=1.0$, the donor's specialized experts entirely overwrite the base experts.

2.3 Compute Economics & Hardware Efficiency

To bypass VRAM constraints entirely, the MEA script performs this interpolation utilizing safetensors over asynchronous ThreadPool execution. This memory mapping reduces a 270GB+ operation footprint to roughly 30GB of system RAM, executing perfectly on a standard desktop CPU in less than 20 minutes, costing $0 in GPU compute.

2.4 Future Capability: Plug-and-Play Experts

The validity of MEA opens the door for cross-model modularity. A mathematically specialized expert from a donor model (e.g., DeepSeek-Math) could be surgically extracted and placed into a generalized Base MoE. This "brain transplant" transfers state-of-the-art calculus parameters entirely offline without anyone running a single epoch of GPU backpropagation.

For enterprise licensing or research collaboration, contact research@klyrone.com

🧪 Zero-Compute Capability Evaluation

Prompt: Design a renewable energy generation system utilizing the temperature differential between the ocean's surface and deep ocean. CRITICAL CONSTRAINT: Must use thermoacoustics (sound waves) to convert this thermal gradient into electricity... Output Excerpt: "The heat exchanger is connected to a thermoacoustic engine. This engine consists of a resonant cavity filled with a working fluid, such as helium or nitrogen. One end of the cavity is connected to the warm section of the heat exchanger, while the other end is connected to the cold..." Analysis: The model cleanly bypassed conventional OTEC turbines (which boil ammonia) and successfully grafted niche acoustic physics onto thermodynamic oceanography. It effortlessly retrieved precise hardware constraints (e.g., specifying helium or nitrogen as a working fluid inside a resonant cavity).

Prompt: Write a Python script that calculates the exact Hertz frequencies of a C-Major scale in Equal Temperament. For every musical note, print a Haiku about a layer of the Earth's atmosphere, dynamically containing the exact frequency number in the poem. Output Excerpt:

frequency_ratio = 2 ** (1 / 12)
# ... mathematically loops 12 times per octave ...
atmospheric_layers = { 0: "Troposphere", 1: "Stratosphere", 2: "Mesosphere" ... }
haiku = f"{frequency:.2f} Hz hums, \n{layer.split()[0]} whispers, \nmelodies of the spheres."

Analysis: While the literal syllable count of the dynamically evaluated float number disrupted the strict 5-7-5 constraint (an anticipated Tokenizer-level limitation), the model beautifully retrieved the 2 ** (1/12) Equal Temperament formula, mapped the Earth's atmospheric layers in exact scientific order, and fused them into a functionally flawless Python execution loop.

Usage

llama.cpp

./llama-server \
  -m Chimera-47B-Q5_K_M.gguf \
  -ngl 99 \
  --ctx-size 32768 \
  --port 8080

Or for direct CLI inference:

./llama-cli \
  -m Chimera-47B-Q5_K_M.gguf \
  -p "You are a helpful assistant." \
  --ctx-size 32768 \
  -ngl 99 \
  -n 512

llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="klyrone/Chimera",
    filename="Chimera-47B-Q5_K_M.gguf",
    n_gpu_layers=99,
    n_ctx=4096,
    verbose=False
)

output = llm(
    "You are a helpful assistant.\n\nExplain the difference between supervised and unsupervised learning.",
    max_tokens=512,
    stop=["</s>"]
)
print(output["choices"][0]["text"])

Ollama

ollama run hf.co/klyrone/Chimera

Note: This model is distributed as a GGUF file. Native Transformers loading (AutoModelForCausalLM) is not supported directly — use llama.cpp, llama-cpp-python, or Ollama for inference.

Hardware Requirements

Quantization	VRAM Required	Recommended Hardware
Q5_K_M (this file)	~34 GB	A40, A100, 2× 3090/4090
Q4_K_M	~27 GB	3090/4090, A6000
Q3_K_M	~22 GB	24 GB consumer GPU

Limitations

Router fine-tuning not yet applied — a short gate re-alignment is expected to yield marginal quality gains
No independent safety evaluation conducted — not recommended for unsupervised public-facing deployment
Benchmark results pending publication
STEM-heavy benchmarks (abstract algebra, HS math) may underperform relative to general capability, as mathematical knowledge is distributed across attention layers rather than expert FFNs.
Pattern Entrenchment (Adversarial Traps): Extensive testing indicates that grafting text-experts onto text-attention layers does not spontaneously generate a deterministic 'World Model'. The model remains highly vulnerable to out-of-distribution math/logic traps (e.g., Anti-Pattern spatial puzzles) where the Base Model's semantic rote-memorization overpowers the logical reasoning of the Instruct Experts.

Citation

@misc{chimera47b2026,
  title        = {Chimera 47B},
  author       = {{Klyrone F.Z.E.}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/klyrone/Chimera}}
}

Chimera 47B · Klyrone F.Z.E. · Apache 2.0 · A technical report on the MoE assembly technique is available in the files section.

Downloads last month: 57

GGUF

Model size

47B params

Architecture

llama

Hardware compatibility

5-bit

Model tree for klyrone/Chimera

mistralai/Mixtral-8x7B-Instruct-v0.1

mistralai/Mixtral-8x7B-v0.1

Merge model

this model