"The Child That Surpassed Both Parents Through MRI-Guided Evolutionary Merge"

Community Article Published March 31, 2026

Darwin-35B-A3B-Opus


"The child surpassed both parents — that is evolution."

TL;DR: 35B MoE (3B active) | GPQA Diamond 90.0% (vs Father 84.2% & Mother 85.0%) | MMMLU 85.0% | Multimodal ✅ | 201 Languages | 262K Context | 147.8 tok/s | Apache 2.0


Table of Contents

  1. Why Darwin — The Child That Surpassed Both Parents
  2. Model Overview
  3. Parent Models
  4. Darwin V5 — Beyond Simple Merging
  5. Model MRI Scans — Parent Neural Anatomy
  6. Child Model Health Check — MRI Verification
  7. Inherited Capabilities
  8. Father's Official Benchmarks (Reference)
  9. Performance & Hardware Requirements
  10. Model Specifications
  11. Usage
  12. Built By
  13. FAQ

1. Why Darwin — The Child That Surpassed Both Parents

There is a fundamental question at the heart of AI model merging: If the parent models already exist, why crossbreed at all?

This model is the answer.

Benchmark Results

GPQA Diamond (198 Questions, Graduate-Level Reasoning)

| Model | Accuracy | Multimodal | Benchmark Published |
|---|---|---|---|
| 🧬 Darwin-35B-A3B-Opus (Child) | 90.0% | ✅ Image/Video | ✅ Fully Open |
| 👩 Mother — Jackrong Claude 4.6 Opus Distilled | 85.0% | ❌ Text-only | ❌ Not Published |
| 👨 Father — Qwen3.5-35B-A3B (Official) | 84.2% | ✅ Image/Video | ✅ Official |

Evaluation: SGLang, context 32768, temperature 0, greedy decoding, official GPQA prompt format ("ANSWER: LETTER")

MMMLU (Multilingual Knowledge, 29 Languages)

| Model | Accuracy |
|---|---|
| 🧬 Darwin-35B-A3B-Opus (Child) | 85.0% |
| 👨 Father — Qwen3.5-35B-A3B (Official) | 85.2% |

Darwin preserves Father-level multilingual knowledge while achieving decisively superior reasoning.

The child outperformed both parents in reasoning and matched the Father in multilingual knowledge.

  • GPQA vs Father: +6.9% relative improvement ((90.0−84.2)/84.2)
  • GPQA vs Mother: +5.9% relative improvement ((90.0−85.0)/85.0)
  • MMMLU: 85.0% — Father-level (85.2%) multilingual knowledge preserved
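The relative-improvement figures above follow from a one-line calculation:

```python
def relative_improvement(child: float, parent: float) -> float:
    """Relative gain of the child's score over a parent's, in percent."""
    return (child - parent) / parent * 100

# GPQA Diamond accuracies from the tables above
print(round(relative_improvement(90.0, 84.2), 1))  # vs Father → 6.9
print(round(relative_improvement(90.0, 85.0), 1))  # vs Mother → 5.9
```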

Why Not Simply Use the Mother?

| | Mother (Claude Distilled) | Darwin (Child) |
|---|---|---|
| Reasoning | Strong (85.0%) | Stronger (90.0%) |
| Image/Video | ❌ Lost during text-only fine-tuning | ✅ Inherited from Father |
| 201 Languages | ❌ Potentially degraded | ✅ Inherited from Father |
| 262K Context | Unverified | ✅ Father's architecture preserved |
| Benchmark Transparency | ❌ No scores published | ✅ Fully open |

Why Not Simply Use the Father?

The Father (Qwen3.5-35B-A3B) excels in versatility but plateaus at 84.2% on hard reasoning tasks. Darwin pushes reasoning to 90.0% while retaining Father-level multilingual knowledge (MMMLU 85.0% vs 85.2%) along with all general-purpose capabilities.

Bottom line: Darwin is the only model that exceeds the Mother's reasoning, preserves the Father's multilingual knowledge, and retains full multimodal capability — all at once.


2. Model Overview

Darwin-35B-A3B-Opus is a next-generation reasoning-enhanced language model produced by VIDRAFT's Darwin V5 evolution engine.

Darwin V5 fuses two key innovations:

  1. Evolutionary Merge — Applies natural selection to automatically discover optimal weight combinations across generations of candidates
  2. Model MRI Integration — CT-scans each parent model layer by layer before merging, steering the evolutionary process with structural insight

If conventional merging is "mixing ingredients blindfolded," Darwin V5 is "precision surgery under X-ray guidance."


3. Parent Models

| Role | Model | Strengths |
|---|---|---|
| 👨 Father | Qwen/Qwen3.5-35B-A3B | General knowledge, multimodal (image/video), coding, agents, 201 languages, 262K context |
| 👩 Mother | Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled | Claude 4.6 Opus CoT distillation, structured step-by-step reasoning, coding agent compatibility |

4. Darwin V5 — Beyond Simple Merging

The Limitations of Conventional Merging

Traditional model merging requires humans to set hyperparameters — ratio, density, and the like — by intuition. You pick ratio=0.5, density=0.9, run the merge once, and hope for the best. The outcome hinges on luck, and applying a single ratio uniformly across billions of parameters ignores the distinct role each layer plays.

Darwin V4's Breakthrough

Darwin V4 addressed this with evolutionary algorithms — automatically exploring hundreds of parameter combinations and selecting survivors based on real benchmark scores. Yet V4 was still blind evolution: it had no understanding of what each layer actually does.
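The V4 engine itself is not published; the loop described above can be sketched as a toy genetic search over the merge hyperparameters, with a mock fitness function standing in for the real "run mergekit, then score a benchmark" step. All names and settings here are illustrative assumptions:

```python
import random

# Genome: the merge hyperparameters the text says V4 searches over
SPACE = {"ratio": (0.0, 1.0), "density_a": (0.5, 1.0),
         "density_b": (0.5, 1.0), "attn": (0.0, 1.0), "ffn": (0.0, 1.0)}

def random_genome(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in SPACE.items()}

def mutate(genome, rng, sigma=0.08):
    # Gaussian perturbation, clipped to each parameter's valid range
    return {k: min(hi, max(lo, genome[k] + rng.gauss(0, sigma)))
            for k, (lo, hi) in SPACE.items()}

def evolve(fitness, generations=40, population=8, survivors=2, seed=0):
    """Keep the fittest genomes each generation; in the real system the
    fitness call would merge with mergekit and run a benchmark."""
    rng = random.Random(seed)
    pop = [random_genome(rng) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:survivors]
        pop = elite + [mutate(rng.choice(elite), rng)
                       for _ in range(population - survivors)]
    return max(pop, key=fitness)

# Mock fitness peaking at the asymmetric optimum V4 reportedly found
def mock_fitness(g):
    return -((g["attn"] - 0.168) ** 2 + (g["ffn"] - 0.841) ** 2)

best = evolve(mock_fitness)
```

Because fitness is measured rather than guessed, the search can land on combinations (like the extreme attn/ffn asymmetry below) that no human would set by hand; what it cannot do, blind, is know *why* a combination works.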

Darwin V5: Model MRI Opens the Eyes

V5 integrates Model MRI — a neural anatomy analyzer — to give the evolutionary process "sight":

[Phase 0] Model MRI — CT-scan both parents, layer by layer
    ↓  "Father's layers 15–25 concentrate multilingual knowledge"
    ↓  "Mother's layers 30–40 concentrate reasoning patterns"
    ↓
[Phase 1] MRI-Guided Evolution — Begin from a scan-informed initial genome
    ↓  Not random, but "initialized from CT findings"
    ↓
[Phase 2] mergekit real merge + benchmark-driven fitness selection
    ↓  Faster convergence within the MRI-narrowed search space
    ↓
[Phase 3] MRI Health Check — CT-scan the child model
    ↓  Detect interference and function loss
    ↓  Prescribe layer-specific ratio adjustments
    ↓
[Final] Darwin-35B-A3B-Opus

V4 vs V5 at a Glance

| | Darwin V4 | Darwin V5 |
|---|---|---|
| Analogy | Mixing ingredients blindfolded | Precision surgery under X-ray |
| Initial genome | Random | MRI-guided |
| Layer control | 2 ratios (attn/ffn) | 40 layers independently |
| Pre-diagnosis | ❌ None | ✅ Phase 0 MRI scan |
| Post-verification | Benchmark only | ✅ Phase 3 health check |
| Search efficiency | Broad, unfocused | Narrowed, guided search |
| Failure diagnosis | Unknown "why" | Pinpoints the failing layer |

Darwin V4: Discovered Parameters (Blind Evolution)

| Parameter | Value | Interpretation |
|---|---|---|
| ratio | 0.481 | Father 52% : Mother 48% — asymmetric blend |
| density_a | 0.855 | 85.5% of Father's weights selected |
| density_b | 0.971 | 97.1% of Mother's weights adopted |
| attn | 0.168 | Only 16.8% modification in attention layers |
| ffn | 0.841 | 84.1% modification in FFN layers |

What this means: Attention patterns (determining what to focus on) are almost entirely preserved from the Father, while FFN layers (the knowledge store) are largely overwritten with the Mother's reasoning patterns.

Discovering attn=0.168 alongside ffn=0.841 — this extreme asymmetry — is virtually impossible to arrive at through human intuition.
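The actual merge runs through mergekit; as a rough sketch of what a split attention/FFN interpolation means, the following assumes linear interpolation and routes parameters by name (both assumptions), with scalars standing in for the real weight tensors:

```python
def asymmetric_merge(father, mother, attn_t=0.168, ffn_t=0.841):
    """Interpolate two state dicts with separate ratios for attention and
    FFN parameters. t is the Mother's share: (1 - t) * father + t * mother,
    so attention stays close to the Father and FFN close to the Mother."""
    merged = {}
    for name, w_f in father.items():
        # Routing by parameter name is an illustrative assumption
        t = attn_t if ".attn." in name else ffn_t
        merged[name] = (1.0 - t) * w_f + t * mother[name]
    return merged

# Scalars stand in for billion-parameter weight tensors
father = {"layers.0.attn.q_proj": 0.0, "layers.0.mlp.up_proj": 0.0}
mother = {"layers.0.attn.q_proj": 1.0, "layers.0.mlp.up_proj": 1.0}
out = asymmetric_merge(father, mother)
```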

Darwin V5: The MRI-Guided Merge Recipe

After scanning both parents, Model MRI prescribed a fundamentally different recipe:

MRI-Guided Genome

| Parameter | V4 (Blind) | V5 (MRI) | Shift |
|---|---|---|---|
| global_ratio | 0.481 | 0.800 | Mother weight ↑↑ |
| attn_ratio | 0.168 | 0.320 | Attention also shifts toward Mother |
| ffn_ratio | 0.841 | 0.590 | FFN becomes more conservative |
| density_a | 0.855 | 0.799 | Similar |
| density_b | 0.971 | 0.799 | Mother density ↓ (Dead Expert compensation) |

The key insight: MRI prescribed "draw more heavily from the Mother (ratio 0.8), but reduce density (0.799) because 50–65% of her experts are dead." V4, searching blindly, landed on ratio=0.481 — the opposite direction entirely.

Layer-Wise Merge Strategy (3 Surgical Blocks)

MRI did not prescribe uniform ratios. Instead, it partitioned all 40 layers into 3 distinct blocks:

Merge Ratio + Parent Importance + MoE Health per Layer

| Block | Layers | t (Mother %) | Router Source | Rationale |
|---|---|---|---|---|
| Block 1 | L0–L37 | 59.9% | Mother | Reasoning pattern injection across the bulk of the network |
| Block 2 | L38 | 90.0% | Mother | Golden Layer — the Mother's core reasoning engine |
| Block 3 | L39 | 53.4% | Father | Output layer — Father's router preserves multimodal routing |

L38 is the "Golden Layer": The Mother's MRI revealed peak cosine distance at L34–L38 (see Mother MRI below). Darwin V5 responded by assigning t=0.9 to L38 — transplanting the Mother's reasoning engine nearly in its entirety.
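The 3-block schedule amounts to a simple per-layer lookup (t values taken from the table above; the function name is illustrative):

```python
def mother_share(layer: int) -> float:
    """Interpolation weight t (the Mother's share) for each of the 40 layers."""
    if layer == 38:        # Block 2, the "Golden Layer": Mother's reasoning core
        return 0.9000
    if layer == 39:        # Block 3, output layer: leans back toward the Father
        return 0.5336
    return 0.5988          # Block 1, the bulk of the network (L0-L37)

schedule = [mother_share(i) for i in range(40)]
```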


5. Model MRI Scans — Parent Neural Anatomy

Mother MRI: Claude 4.6 Opus Distilled

Mother Probe Cosine Distance

Probe-wise Layer Importance: Layers L34–L38 light up in intense red (high cosine distance) across the REASONING, CODE, and LOGIC probes — this is the Mother's reasoning engine.

Mother MoE Health

| Metric | Status | Interpretation |
|---|---|---|
| Router Entropy | ✅ ~1.0 across all layers | Healthy — experts are evenly distributed |
| Dead Expert % | 🔴 50–65% | Critical — Claude distillation killed half the experts |
| Expert Similarity | ✅ 0.001–0.008 | Healthy — surviving experts remain diverse |

A Dead Expert rate of 50–65% is the telltale fingerprint of Claude's text-only distillation. The fine-tuning process silenced multimodal and multilingual experts that were never activated during text-only training.
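The post does not specify how Model MRI defines a dead expert; one plausible sketch counts experts whose routing utilization stays below a small threshold (the threshold and the utilization statistic are both assumptions):

```python
def dead_expert_fraction(utilization, threshold=1e-4):
    """Fraction of experts whose routing utilization is effectively zero,
    i.e. experts the router never selects on the probe traffic."""
    dead = sum(1 for u in utilization if u < threshold)
    return dead / len(utilization)

# Illustrative layer: 8 experts, half silent after text-only distillation
util = [0.30, 0.25, 0.0, 0.0, 0.20, 0.0, 0.25, 0.0]
print(dead_expert_fraction(util))  # → 0.5
```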

Mother Expert Utilization Heatmap

Expert Utilization Heatmap: The map is predominantly dark (inactive), with only sparse bright activations — the Claude reasoning pattern is concentrated in a small cluster of specialized experts.

Father MRI: A Healthy Generalist (The Organ Donor)

Father MoE Health

Father Expert Utilization Heatmap

Father Layer Importance by Probe

The Father (Qwen3.5-35B-A3B) exhibits healthy, uniform expert activation across all 40 layers — a well-balanced generalist with every expert alive and contributing. He serves as the "organ donor" whose living experts replace the 50–65% of the Mother's experts that died during distillation.

Parent Comparison: Layer Advantage Map

Parent A vs B Layer Advantage

  • Above zero (↑ A): Father is stronger — primarily L0–L5 (embedding and early layers)
  • Below zero (↓ B): Mother is stronger — scattered but consistent from L5 through L35
  • L34–L38: Mother shows her strongest advantage on the REASONING and CODE probes
  • L39: Father recovers — the output layer favors Father's multimodal routing

This advantage map directly informed the 3-block merge recipe: Mother dominates L0–L38, Father reclaims L39.

How GPQA 90% Was Achieved

Mother L34–L38: reasoning engine (MRI red zone)
    ↓ t=0.9 — transplanted nearly in full
    +
Father L39: output router (multimodal/multilingual expert activation)
    ↓ t=0.53 — Father's routing preserved
    +
Dead Expert replacement → Father's living experts fill the Mother's dead slots
    ↓
= GPQA 90.0% (surpassing both parents)

The Mother's "reasoning brain" was transplanted while her dead experts were replaced with the Father's living counterparts. Reasoning went up; versatility stayed intact.

Evolution History

  • Phase 1 → Phase 2 evolution complete
  • Final real_score: 0.8405
  • Merge time: 181.6 seconds
  • Merge commit: 109838c2

6. Child Model Health Check — MRI Verification

Darwin Health Check — Child vs Parents

✅ Verdict: Healthy — No issues detected.

The chart above plots the layer-by-layer importance of the child (Darwin, green bars) against both parents (Father = blue dashed, Mother = red dashed). Key findings:

Layer 0 (Embedding): The child's importance spikes to 0.42 — both parents exhibit similar peaks (~0.35–0.50). The child has successfully inherited the critical embedding layer from both parents with no interference.

Layers 1–33 (Middle): Near-zero importance across all three models. This is expected — middle layers in MoE architectures process information incrementally, with no single layer acting as a bottleneck. The child tracks both parents precisely, confirming zero function loss across the bulk of the network.

Layers 34–39 (Reasoning Engine): Importance rises sharply. This is the exact region where the Mother's MRI revealed intense reasoning activity (cosine distance > 0.6). The child's green bars match or exceed both parents — demonstrating that the Mother's reasoning patterns were successfully transplanted while the Father's output routing was preserved.

Layer 39 (Output): The child peaks at ~0.48, closely tracking both parents. The final output layer is intact.

Why This Matters

The MRI health check confirms three critical outcomes:

  1. No interference — There is no layer where the child's importance abnormally exceeds the parents' (which would signal weight conflict)
  2. No function loss — There is no layer where the parents had high importance but the child collapsed to zero
  3. Successful transplant — The L34–L39 reasoning engine from the Mother is fully operational in the child
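These three checks can be sketched as a per-layer comparison of importance profiles; the tolerance values here are illustrative, not Model MRI's actual thresholds:

```python
def health_check(child, father, mother, tol=1.5, floor=0.05):
    """Compare per-layer importance profiles. Flags 'interference' where the
    child abnormally exceeds both parents, and 'function_loss' where a layer
    that mattered to a parent collapses to near zero in the child."""
    issues = []
    for i, (c, f, m) in enumerate(zip(child, father, mother)):
        if c > tol * max(f, m, floor):
            issues.append(f"L{i}: interference")
        if max(f, m) > floor and c < floor:
            issues.append(f"L{i}: function_loss")
    return issues

# Healthy child: tracks both parents at every layer
print(health_check([0.42, 0.01, 0.48], [0.35, 0.0, 0.50], [0.50, 0.01, 0.45]))  # → []
```

An empty issue list corresponds to the "Healthy — No issues detected" verdict above; a non-empty list would name the exact failing layer, which is what enables the layer-specific ratio adjustments of Phase 3.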

Darwin V5 MRI-Guided Merge Recipe

# MRI-guided layer-wise merge (3 blocks)
# Genome: ratio=0.800 attn=0.320 ffn=0.590 density=0.799

L0–L37:  t=0.5988 (Mother 60%)  router from Mother
L38:     t=0.9000 (Mother 90%)  "Golden Layer" reasoning core
L39:     t=0.5336 (Father 47%)  router from Father (output routing)

| Insight | Detail |
|---|---|
| L38 = "Golden Layer" | MRI identified L34–L38 as the Mother's reasoning core. Darwin assigned t=0.9 (90% Mother) to L38 specifically |
| Router Strategy: B→B→A | Mother's router for the reasoning layers, Father's router for the final output — preserving both the reasoning pathways and multimodal routing |
| Dead Expert Revival | The Mother's 50–65% dead experts (killed during text-only fine-tuning) were replaced with the Father's living experts — restoring multimodal and multilingual capabilities |

📄 The full algorithm and technical details of the Darwin V5 evolution engine will be released alongside an upcoming paper.


7. Inherited Capabilities

From the Father (Qwen3.5-35B-A3B)

  • Multimodal: Image and video understanding
  • 201 Languages: Global linguistic coverage
  • 262K Context: Native long-context support (extendable to 1M via YaRN)
  • Gated DeltaNet + MoE: Efficient hybrid architecture
  • Multi-Token Prediction: Improved inference throughput

From the Mother (Claude 4.6 Opus Distilled)

  • Structured Thinking: Systematic step-by-step reasoning within <think> tags
  • Efficient Reasoning: "Let me analyze this request carefully: 1… 2… 3…" pattern
  • Coding Agent Compatibility: Native "developer" role support for Claude Code and OpenCode
  • Tool Calling Stability: Consistent performance in tool-use scenarios
  • Autonomous Execution: Extended autonomous operation in agentic environments

8. Father's Official Benchmarks (Reference)

Darwin is built on this architecture with enhanced reasoning:

| Category | Benchmark | Father Official |
|---|---|---|
| Knowledge | MMLU-Pro | 85.3 |
| Knowledge | MMLU-Redux | 93.3 |
| Reasoning | GPQA Diamond | 84.2 |
| Reasoning | HLE w/ CoT | 22.4 |
| Math | HMMT Feb 2025 | 89.0 |
| Coding | SWE-bench Verified | 69.2 |
| Coding | LiveCodeBench v6 | 74.6 |
| Agent | TAU2-Bench | 81.2 |
| Agent | BFCL-V4 (Tool Use) | 67.3 |
| Instruction | IFEval | 91.9 |
| Multilingual | MMMLU | 85.2 |
| Agentic Search | BrowseComp | 61.0 |

9. Performance & Hardware Requirements

Inference Speed

| Metric | Value |
|---|---|
| Generation Speed | 147.8 tok/s |
| Environment | Single NVIDIA H100 93GB NVL, SGLang, BF16 |
| Qwen Official API | 162.8 tok/s (Alibaba Cloud) |

Hardware Requirements

| Setup | VRAM | Status |
|---|---|---|
| **BF16 (Full Precision)** | 65.5 GiB | |
| Single H100 93GB NVL | 93 GB | ✅ Comfortable |
| Single A100 80GB | 80 GB | ⚠️ Tight |
| Single A100 40GB | 40 GB | ❌ Insufficient |
| **Q8 Quantized** | ~35 GiB | |
| Single A100 40GB | 40 GB | ✅ Feasible |
| **Q4_K_M Quantized** | ~18 GiB | |
| Single RTX 4090 24GB | 24 GB | ✅ Comfortable |
| 2× RTX 4090 (tp=2) | 48 GB | ✅ BF16 feasible |

As a Mixture-of-Experts model, only 3B parameters are active per token despite loading the full 35B. This sparsity means quantization has minimal impact on output quality.
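The VRAM figures in the table are consistent with a back-of-the-envelope weight-memory estimate (using ~4.5 bits/weight as an approximation for Q4_K_M; KV cache, activations, and framework overhead are excluded):

```python
def weight_vram_gib(params_b: float, bits_per_weight: float) -> float:
    """Weight-memory footprint in GiB for loading all parameters
    (KV cache, activations, and framework overhead excluded)."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

print(round(weight_vram_gib(35, 16), 1))   # BF16   → 65.2 (table: 65.5 GiB)
print(round(weight_vram_gib(35, 4.5), 1))  # Q4_K_M → 18.3 (table: ~18 GiB)
```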


10. Model Specifications

Architecture Qwen3.5 MoE (Gated DeltaNet + MoE)
Total Parameters 35B
Active Parameters 3B per forward pass
Hidden Dimension 2,048
Layers 40
Layer Layout 10 × (3 × GDN→MoE + 1 × Attention→MoE)
Experts 256 (8 routed + 1 shared active)
Expert Intermediate Dim 512
Context Length 262,144 native (up to 1,010,000 via YaRN)
Languages 201
Multimodal ✅ Image & Video input
License Apache 2.0
Engine Darwin V5 (Evolutionary Merge + Model MRI)
Evolution Phase Phase 2, real_score 0.8405
Merge Commit 109838c2

11. Usage

SGLang (Recommended)

python -m sglang.launch_server \
  --model-path FINAL-Bench/Darwin-35B-A3B-Opus \
  --tp 1 \
  --mem-fraction-static 0.90 \
  --context-length 32768 \
  --trust-remote-code

vLLM

vllm serve FINAL-Bench/Darwin-35B-A3B-Opus \
  --trust-remote-code \
  --enforce-eager

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-35B-A3B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-35B-A3B-Opus",
    dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)

Best Practices

  • Use context ≥ 32K for reasoning tasks — the model leverages extended thinking
  • For maximum reasoning quality, use thinking mode (default) with generous max_tokens (≥ 16384)
  • The model generates <think> blocks for internal reasoning; extract the final answer after </think>
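That extraction step can be sketched as follows (assuming the standard `<think>…</think>` tag format described above):

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split a completion into (reasoning, final_answer). If the closing
    </think> tag is missing (e.g. generation was truncated), everything is
    treated as reasoning and the answer comes back empty."""
    thinking, sep, answer = text.partition("</think>")
    if not sep:
        return text.replace("<think>", "").strip(), ""
    return thinking.replace("<think>", "").strip(), answer.strip()

out = "<think>Step 1: restate. Step 2: derive.</think>The answer is 42."
reasoning, answer = split_thinking(out)
```

The truncation fallback matters in practice: with max_tokens set too low, the closing tag may never be emitted, and treating the partial text as a final answer would surface raw reasoning to the user.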

12. Built By

Developer VIDRAFT
Evolution Engine Darwin V5 (Evolutionary Merge + Model MRI)
Infrastructure 4 × NVIDIA H100 93GB NVL GPU
Merge Time 181.6 seconds
Shard Distribution 14 shards → GPU [1, 2, 3] round-robin

Acknowledgements

  • Korean Government — This research was supported by the Korean Government's 'GPU Support Program' research grant
  • Qwen Team — Qwen3.5-35B-A3B base architecture
  • Jackrong — Claude 4.6 Opus Reasoning Distilled model
  • nohurry, TeichAI — Distillation datasets

Citation

@misc{vidraft_darwin_35b_opus,
  title        = {Darwin-35B-A3B-Opus: MRI-Guided Evolutionary Merge Beyond Both Parents},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus}}
}

Contact

📧 kkms1116@koreacu.ac.kr


13. FAQ

**What is Darwin-35B-A3B-Opus?**
Darwin-35B-A3B-Opus is a 35-billion-parameter Mixture-of-Experts language model with 3B active parameters per token. It was created using evolutionary merge techniques that combine Qwen3.5-35B-A3B's multimodal versatility with Claude 4.6 Opus reasoning distillation, achieving 90.0% on GPQA Diamond — surpassing both parent models.

**How does Darwin V5 differ from simple model merging?**
Traditional merging applies uniform ratios based on guesswork. Darwin V5 pairs evolutionary algorithms (natural selection) with Model MRI (neural CT-scanning) to automatically discover optimal, layer-specific merge ratios. For instance, it uncovered attn=0.168 and ffn=0.841 — an extreme asymmetry that would be virtually impossible to find by human intuition.

**What GPU do I need to run this model?**
For BF16 full precision: an A100 80GB (tight fit) or H100 93GB (comfortable). For Q4 quantization: a single RTX 4090 (24GB) is sufficient. The model loads 35B parameters but activates only 3B per token thanks to its MoE architecture.

**Does it support multimodal inputs (images/video)?**
Yes. Darwin inherits the Father model's (Qwen3.5-35B-A3B) full multimodal capabilities, including image and video understanding — unlike the Mother model, which lost these abilities during text-only fine-tuning.

**What languages does it support?**
201 languages and dialects, inherited from Qwen3.5's multilingual training. The MMMLU benchmark confirms 85.0% multilingual knowledge retention across 29 evaluated languages.

**What is Model MRI?**
Model MRI is a neural anatomy analysis tool that CT-scans each layer of a language model to determine its functional role. When integrated with Darwin, it guides the evolutionary merge — telling the algorithm which layers to preserve from each parent and which to replace. In this model, MRI identified L38 as the Mother's "Golden Layer" (core reasoning engine) and prescribed 90% Mother weight for that specific layer.

**What are "Dead Experts" and why do they matter?**
In Mixture-of-Experts (MoE) models, each layer contains hundreds of specialist sub-networks (experts). The Mother model's Claude distillation killed 50–65% of these experts, because text-only fine-tuning never activated the multimodal and multilingual specialists. Darwin's MRI detected this and prescribed replacing the dead experts with the Father's living ones — reviving the capabilities the Mother had lost.

**Is this model open source?**
Yes. Darwin-35B-A3B-Opus is released under the Apache 2.0 license, fully open for both commercial and research use.

#DarwinAI #EvolutionaryMerge #ModelMRI #DarwinV5 #GPQA90 #Qwen35 #MoE3B #Reasoning #Multimodal #201Languages #OpenSource #Apache2 #VIDRAFT #NaturalSelection #LayerWiseMerge #ClaudeOpus #ThinkingModel #CodingAgent #LongContext262K #BestOpenSourceLLM2026 #DeadExpertRevival #GoldenLayer #MoEMerge #NeuralAnatomy
