Darwin-35B-A3B-Opus

"The child surpassed both parents โ€” that is evolution."

TL;DR: 35B MoE (3B active) | GPQA Diamond 90.0% (beats Father 84.2% & Mother 85.0%) | MMMLU 85.0% | Multimodal ✅ | 201 Languages | 262K Context | 147.8 tok/s | Apache 2.0

#Darwin #EvolutionaryMerge #ModelMRI #Qwen3.5 #MoE #Reasoning #GPQA90 #Multimodal #OpenSource #Apache2 #DarwinV5 #VIDRAFT


Why Darwin? The Child That Surpassed Both Parents

The fundamental question of AI model merging: If parent models already exist, why crossbreed?

This model is the answer.

Benchmark Results

GPQA Diamond (198 Questions, Graduate-Level Reasoning)

| Model | Accuracy | Multimodal | Benchmark Published |
|---|---|---|---|
| 🧬 Darwin-35B-A3B-Opus (Child) | 90.0% | ✅ Image/Video | ✅ Fully open |
| 👩 Mother: Jackrong Claude 4.6 Opus Distilled | 85.0% | ❌ Text-only | ❌ Not published |
| 👨 Father: Qwen3.5-35B-A3B (Official) | 84.2% | ✅ Image/Video | ✅ Official |

Evaluation: SGLang, context 32768, temperature 0, greedy decoding, official GPQA prompt format ("ANSWER: LETTER")

MMMLU (Multilingual Knowledge, 29 Languages)

| Model | Accuracy |
|---|---|
| 🧬 Darwin-35B-A3B-Opus (Child) | 85.0% |
| 👨 Father: Qwen3.5-35B-A3B (Official) | 85.2% |

Darwin maintains the Father's multilingual knowledge while gaining superior reasoning: the child surpassed both parents in reasoning and matched the Father in multilingual knowledge.

  • GPQA vs Father: +6.9% relative improvement ((90.0 - 84.2) / 84.2)
  • GPQA vs Mother: +5.9% relative improvement ((90.0 - 85.0) / 85.0)
  • MMMLU: 85.0%, preserving the Father's 85.2% multilingual knowledge
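The relative-improvement figures follow directly from the benchmark scores:

```python
def relative_improvement(child: float, parent: float) -> float:
    """Relative gain of the child over a parent, in percent."""
    return (child - parent) / parent * 100.0

vs_father = relative_improvement(90.0, 84.2)  # ~6.9
vs_mother = relative_improvement(90.0, 85.0)  # ~5.9
```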

Why Not Just Use the Mother?

| | Mother (Claude Distilled) | Darwin (Child) |
|---|---|---|
| Reasoning | Strong (85.0%) | Stronger (90.0%) |
| Image/Video | ❌ Lost (text-only fine-tune) | ✅ Inherited from Father |
| 201 Languages | ❌ Potentially degraded | ✅ Inherited from Father |
| 262K Context | Unverified | ✅ Father's architecture preserved |
| Benchmark Transparency | ❌ No scores published | ✅ Fully open |

Why Not Just Use the Father?

The Father (Qwen3.5-35B-A3B) excels in versatility but scores 84.2% on hard reasoning. Darwin pushes reasoning to 90.0% while maintaining Father-level multilingual knowledge (MMMLU 85.0% vs 85.2%) and all general capabilities.

Conclusion: Darwin is the only one of the three that surpasses the Mother's reasoning, preserves the Father's multilingual knowledge, and retains full multimodal capabilities.


Model Overview

Darwin-35B-A3B-Opus is a next-generation reasoning-enhanced language model created by VIDRAFT's Darwin V5 evolution engine.

Darwin V5 combines two innovations:

  1. Evolutionary Merge: applies natural selection to automatically find optimal weight combinations
  2. Model MRI Integration: CT-scans parent models layer by layer before merging, guiding evolution with structural insight

If conventional merging is "mixing recipes blindfolded," Darwin V5 is "precision surgery with X-ray guidance."


Parent Models

| Role | Model | Strengths |
|---|---|---|
| 👨 Father | Qwen/Qwen3.5-35B-A3B | General knowledge, multimodal (image/video), coding, agents, 201 languages, 262K context |
| 👩 Mother | Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled | Claude 4.6 Opus CoT distillation, structured step-by-step reasoning, coding-agent compatibility |

Darwin V5 โ€” Beyond Simple Merge

Limitations of Conventional Merging

Traditional model merging relies on humans setting hyperparameters like ratio and density by intuition. Set ratio=0.5, density=0.9, run once, and hope for the best. The result depends on luck, and applying the same ratio uniformly across billions of parameters ignores each layer's unique role.

Darwin V4's Advance

Darwin V4 solved this with evolutionary algorithms: automatically searching hundreds of parameter combinations and selecting survivors by real benchmark scores. But V4 was still blind evolution: it didn't know what each layer does.

Darwin V5: Model MRI Opens the Eyes

V5 integrates Model MRI (neural anatomy analyzer) to give evolution "sight":

[Phase 0] Model MRI: CT-scan both parents layer by layer
    ↓  "Father's layers 15-25 concentrate multilingual knowledge"
    ↓  "Mother's layers 30-40 concentrate reasoning patterns"
    ↓
[Phase 1] MRI-Guided Evolution: start from a scan-informed initial genome
    ↓  Not random, but informed by the CT results
    ↓
[Phase 2] mergekit real merge + benchmark fitness selection
    ↓  Faster convergence in the MRI-narrowed search space
    ↓
[Phase 3] MRI Health Check: CT-scan the child model
    ↓  Detect interference and function loss
    ↓  Prescribe layer-specific ratio adjustments
    ↓
[Final] Darwin-35B-A3B-Opus

V4 vs V5

| | Darwin V4 | Darwin V5 |
|---|---|---|
| Analogy | Mixing recipes blindfolded | Precision surgery with X-ray |
| Initial genome | Random | MRI-guided |
| Layer control | 2 ratios (attn/ffn) | 40 layers independently |
| Pre-diagnosis | ❌ None | ✅ Phase 0 MRI scan |
| Post-verification | Benchmark only | ✅ Phase 3 health check |
| Search efficiency | Wide space | Narrowed, guided search |
| Failure diagnosis | Unknown "why" | Pinpoints which layer failed |

Darwin V4: Discovered Optimal Parameters (Blind Evolution)

| Parameter | Value | Meaning |
|---|---|---|
| ratio | 0.481 | Father 52% : Mother 48% asymmetric blend |
| density_a | 0.855 | Kept 85.5% of Father's weights |
| density_b | 0.971 | Kept 97.1% of Mother's weights |
| attn | 0.168 | Only 16.8% change in attention layers |
| ffn | 0.841 | 84.1% change in FFN layers |

Interpretation: Attention patterns (what to focus on) are almost entirely preserved from the Father, while FFN layers (knowledge storage) are largely replaced with the Mother's reasoning patterns.

Discovering attn=0.168 and ffn=0.841 (an extreme asymmetry) is virtually impossible by human intuition alone.
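Conceptually, a ratio/density merge interpolates between the parents while sparsifying each parent's contribution, in the spirit of DARE/TIES. The sketch below illustrates the idea only; the actual Darwin recipe is executed by mergekit with the per-module ratios above:

```python
import numpy as np

def masked_blend(w_a: np.ndarray, w_b: np.ndarray,
                 ratio: float, density_a: float, density_b: float,
                 seed: int = 0) -> np.ndarray:
    """Density-masked weighted merge of two parent tensors (conceptual).

    ratio    : interpolation weight toward parent B (the Mother)
    density_*: fraction of each parent's weights kept; the rest are
               zeroed, DARE/TIES-style.
    """
    rng = np.random.default_rng(seed)
    mask_a = rng.random(w_a.shape) < density_a
    mask_b = rng.random(w_b.shape) < density_b
    # Rescale surviving weights so the expected magnitude is preserved.
    a = np.where(mask_a, w_a / density_a, 0.0)
    b = np.where(mask_b, w_b / density_b, 0.0)
    return (1.0 - ratio) * a + ratio * b
```

Under the V4 genome, an attention tensor would be merged with ratio=0.168 and an FFN tensor with ratio=0.841, which is exactly the asymmetry the evolution discovered.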

Darwin V5: MRI-Guided Merge Recipe

After scanning both parents, Model MRI generated a fundamentally different prescription:

MRI-Guided Genome

| Parameter | V4 (Blind) | V5 (MRI) | Change |
|---|---|---|---|
| global_ratio | 0.481 | 0.800 | Mother weight ↑↑ |
| attn_ratio | 0.168 | 0.320 | Attention also shifts toward Mother |
| ffn_ratio | 0.841 | 0.590 | FFN becomes more conservative |
| density_a | 0.855 | 0.799 | Similar |
| density_b | 0.971 | 0.799 | Mother density ↓ (dead-expert compensation) |
Key insight: MRI prescribed "use more of the Mother (ratio 0.8), but reduce density (0.799) because 50-65% of her experts are dead." V4 had found ratio=0.481 blindly, in the opposite direction.

Layer-Wise Merge Strategy (3 Blocks)

MRI didn't apply uniform ratios. It split 40 layers into 3 blocks:

Merge Ratio + Parent Importance + MoE Health per Layer

| Block | Layers | t (Mother %) | Router Source | Rationale |
|---|---|---|---|---|
| Block 1 | L0-L37 | 59.9% | Mother | Reasoning-pattern injection across most layers |
| Block 2 | L38 | 90.0% | Mother | Golden Layer: core of the Mother's reasoning engine |
| Block 3 | L39 | 53.4% | Father | Output layer: the Father's router preserves multimodal routing |

L38 is the "Golden Layer": the Mother's MRI showed peak cosine distance at L34-L38 (see Mother MRI below). Darwin V5 responded by assigning t=0.9 to L38, transplanting the Mother's reasoning engine almost entirely.
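The three blocks translate into a simple per-layer schedule of Mother-interpolation weights, sketched here using the t values from the table:

```python
def layer_t_schedule(num_layers: int = 40) -> list[float]:
    """Per-layer Mother weight t for the 3-block merge recipe."""
    t = [0.5988] * (num_layers - 2)  # Block 1: L0-L37, ~60% Mother
    t.append(0.9000)                 # Block 2: L38, the "Golden Layer"
    t.append(0.5336)                 # Block 3: L39, Father-leaning output
    return t
```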


Model MRI Scans โ€” Parent Neural Anatomy

Mother MRI: Claude 4.6 Opus Distilled

Mother Probe Cosine Distance

Probe-wise Layer Importance: L34-L38 shows intense red (high cosine distance) across the REASONING, CODE, and LOGIC probes. This is the Mother's reasoning engine.

Mother MoE Health

| Metric | Status | Interpretation |
|---|---|---|
| Router Entropy | ✅ ~1.0 across all layers | Healthy: experts evenly distributed |
| Dead Expert % | 🔴 50-65% | Critical: Claude distillation killed half the experts |
| Expert Similarity | ✅ 0.001-0.008 | Healthy: surviving experts remain diverse |

A dead-expert rate of 50-65% is the fingerprint of text-only Claude distillation: the fine-tuning killed the multimodal and multilingual experts that were no longer activated during text-only training.
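Router entropy and the dead-expert rate can be estimated from a layer's router probabilities. The definitions below are a plausible reconstruction for illustration; Model MRI's exact formulas are not published:

```python
import numpy as np

def moe_health(router_probs: np.ndarray, dead_threshold: float = 1e-4):
    """Estimate two MoE health metrics for one layer.

    router_probs: (tokens, experts) softmax outputs of the router.
    Returns (normalized entropy, dead-expert percentage), where
    normalized entropy 1.0 means perfectly uniform expert usage.
    """
    mean_load = router_probs.mean(axis=0)          # expected mass per expert
    mean_load = mean_load / mean_load.sum()
    entropy = -(mean_load * np.log(mean_load + 1e-12)).sum()
    norm_entropy = entropy / np.log(len(mean_load))  # 1.0 = uniform
    dead_pct = float((mean_load < dead_threshold).mean()) * 100.0
    return norm_entropy, dead_pct
```

On the Mother, a scan like this would report entropy near 1.0 per layer but a dead-expert percentage in the 50-65% range.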

Mother Expert Utilization

Expert Utilization Heatmap: mostly dark (inactive) with sparse bright activations. The Claude reasoning pattern is concentrated in a small number of specialized experts.

Father MRI: Healthy Generalist (Organ Donor)

Father MoE Health

Father Expert Utilization

Father Layer Importance by Probe

The Father (Qwen3.5-35B-A3B) shows healthy, uniform expert activation across all 40 layers: a well-balanced generalist with all experts alive. This is the "organ donor" that revives the Mother's dead 50-65% of experts.

Parent Comparison: Layer Advantage Map

Parent A vs B Layer Advantage

  • Above zero (↑ A): Father stronger, primarily L0-L5 (embedding/early layers)
  • Below zero (↓ B): Mother stronger, scattered but consistent across L5-L35
  • L34-L38: Mother shows her strongest advantage in the REASONING and CODE probes
  • L39: Father recovers; the output layer favors the Father's multimodal routing

This advantage map directly informed the 3-block merge recipe: the Mother dominates L0-L38, the Father retakes L39.

How GPQA 90% Was Achieved

Mother L34-L38 reasoning engine (MRI red zone)
    ↓ t=0.9: transplanted almost entirely
    +
Father L39 output router (multimodal/multilingual expert activation)
    ↓ t=0.53: Father's routing preserved
    +
Dead-expert replacement → Father's living experts fill the Mother's dead slots
    ↓
= GPQA 90.0% (surpassed both parents)

The Mother's "reasoning brain" was transplanted while her dead experts were replaced with the Father's living ones. Reasoning went up, versatility was preserved.
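The replacement step itself is conceptually simple. A hypothetical sketch, where `mean_load` is the per-expert routing mass from the MRI scan (names and threshold are illustrative, not the pipeline's actual API):

```python
import numpy as np

def revive_dead_experts(child_experts, father_experts, mean_load,
                        threshold: float = 1e-4):
    """Overwrite experts the Mother never routes to with the Father's.

    child_experts / father_experts: lists of per-expert weight arrays.
    mean_load: average routing mass per expert from the MRI scan.
    """
    revived = []
    for i, w in enumerate(child_experts):
        if mean_load[i] < threshold:                  # dead in the Mother
            revived.append(father_experts[i].copy())  # transplant donor organ
        else:
            revived.append(w)
    return revived
```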

Evolution History

  • Phase 1 → Phase 2 evolution complete
  • Final real_score: 0.8405
  • Merge time: 181.6 seconds
  • Merge commit: 109838c2

Model MRI Health Check โ€” Child vs Parents

Darwin Health Check โ€” Child vs Parents

✅ Health: Healthy. No issues detected.

The chart above shows the layer-by-layer importance of the child (Darwin, green bars) compared to both parents (Father = blue dashed, Mother = red dashed). Key findings:

Layer 0 (Embedding): Child importance spikes to 0.42, and both parents show similar peaks (~0.35-0.50). The child successfully inherited the critical embedding layer from both parents without interference.

Layers 1-33 (Middle): Near-zero importance across all three models. This is normal: middle layers in MoE models process information incrementally, with no single layer being critical. The child tracks both parents closely, confirming no function loss in the bulk of the network.

Layers 34-39 (Reasoning Engine): Importance rises sharply. This is the region where the Mother's MRI showed intense reasoning activity (cosine distance > 0.6). The child's green bars match or exceed both parents, indicating that the Mother's reasoning patterns were successfully transplanted while the Father's output routing was preserved.

Layer 39 (Output): Child peaks at ~0.48, closely matching both parents. The final output layer is intact.

Why This Matters

The MRI health check confirms three things:

  1. No interference: no layer where child importance abnormally exceeds the parents' (which would indicate weight conflict)
  2. No function loss: no layer where the parents had high importance but the child dropped to zero
  3. Successful transplant: the L34-L39 reasoning engine from the Mother is fully operational in the child
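These three checks can be expressed as simple rules over per-layer importance scores. The thresholds here are illustrative, not Model MRI's actual values:

```python
def health_check(child, father, mother, spike: float = 1.5, floor: float = 0.1):
    """Flag interference and function loss from per-layer importance.

    child/father/mother: equal-length sequences of importance scores.
    Returns a list of (layer, issue) findings; an empty list = healthy.
    """
    issues = []
    for i, (c, f, m) in enumerate(zip(child, father, mother)):
        parent_max = max(f, m)
        if c > spike * parent_max + 1e-9:
            issues.append((i, "interference"))    # child abnormally high
        elif parent_max > floor and c < 0.2 * parent_max:
            issues.append((i, "function_loss"))   # parents active, child flat
    return issues
```

Run against the importance curves above (0.42 at L0, near-zero middles, ~0.48 at L39 for all three models), such a check returns no findings, matching the "Healthy" verdict.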

Darwin V5 MRI-Guided Merge Recipe

# MRI-guided layer-wise merge (3 blocks)
# Genome: ratio=0.800 attn=0.320 ffn=0.590 density=0.799

L0-L37:  t=0.5988 (Mother 60%), router from Mother
L38:     t=0.9000 (Mother 90%), "Golden Layer" reasoning core
L39:     t=0.5336 (Father 47%), router from Father (output routing)
| Insight | Detail |
|---|---|
| L38 = "Golden Layer" | MRI identified L34-L38 as the Mother's reasoning core; Darwin assigned t=0.9 (90% Mother) to L38 specifically |
| Router strategy: B→B→A | Mother's router for reasoning layers, Father's router for the final output; preserves both reasoning paths and multimodal routing |
| Dead-expert revival | The Mother's 50-65% dead experts (killed by text-only fine-tuning) were replaced with the Father's live experts, restoring multimodal and multilingual capabilities |

Inherited Capabilities

From Father (Qwen3.5-35B-A3B)

  • Multimodal: Image and video understanding
  • 201 Languages: Global linguistic coverage
  • 262K Context: Native long-context (extendable to 1M via YaRN)
  • Gated DeltaNet + MoE: Efficient hybrid architecture
  • Multi-Token Prediction: Improved inference throughput

From Mother (Claude 4.6 Opus Distilled)

  • Structured Thinking: Systematic step-by-step reasoning within <think> tags
  • Efficient Reasoning: "Let me analyze this request carefully: 1..2..3..." pattern
  • Coding Agent Compatibility: Native "developer" role support for Claude Code, OpenCode
  • Tool Calling Stability: Consistent performance in tool-use scenarios
  • Autonomous Execution: Extended autonomous operation in agentic environments

Father's Official Benchmarks (Reference)

Darwin is built on this architecture with enhanced reasoning:

| Category | Benchmark | Father Official |
|---|---|---|
| Knowledge | MMLU-Pro | 85.3 |
| Knowledge | MMLU-Redux | 93.3 |
| Reasoning | GPQA Diamond | 84.2 |
| Reasoning | HLE w/ CoT | 22.4 |
| Math | HMMT Feb 2025 | 89.0 |
| Coding | SWE-bench Verified | 69.2 |
| Coding | LiveCodeBench v6 | 74.6 |
| Agent | TAU2-Bench | 81.2 |
| Agent | BFCL-V4 (Tool Use) | 67.3 |
| Instruction | IFEval | 91.9 |
| Multilingual | MMMLU | 85.2 |
| Agentic Search | BrowseComp | 61.0 |

Performance

Inference Speed

| Metric | Value |
|---|---|
| Generation Speed | 147.8 tok/s |
| Environment | Single NVIDIA H100 93GB NVL, SGLang, BF16 |
| Qwen Official API | 162.8 tok/s (Alibaba Cloud) |

Hardware Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 (full precision) | 65.5 GiB weights | |
| Single H100 93GB NVL | 93 GB | ✅ Comfortable |
| Single A100 80GB | 80 GB | ⚠️ Tight |
| Single A100 40GB | 40 GB | ❌ Insufficient |
| Q8 quantized | ~35 GiB weights | |
| Single A100 40GB | 40 GB | ✅ Possible |
| Q4_K_M quantized | ~18 GiB weights | |
| Single RTX 4090 24GB | 24 GB | ✅ Comfortable |
| 2× RTX 4090 (tp=2) | 48 GB | ✅ BF16 possible |

As a Mixture-of-Experts model, only 3B parameters are active per token despite loading the full 35B. Quantization has minimal impact due to this sparsity.
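The VRAM figures in the table are consistent with a rough weight-only estimate (KV cache and activations add overhead on top of this):

```python
def weight_vram_gib(total_params_b: float, bytes_per_param: float) -> float:
    """Weight-only memory footprint in GiB (no KV cache/activations)."""
    return total_params_b * 1e9 * bytes_per_param / 2**30

bf16 = weight_vram_gib(35, 2.0)      # ~65 GiB, matching the BF16 row
q4km = weight_vram_gib(35, 0.5625)   # ~4.5 bits/param average -> ~18 GiB
```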


Model Specifications

| Architecture | Qwen3.5 MoE (Gated DeltaNet + MoE) |
|---|---|
| Total Parameters | 35B |
| Active Parameters | 3B per forward pass |
| Hidden Dimension | 2,048 |
| Layers | 40 |
| Layer Layout | 10 × (3 × GDN→MoE + 1 × Attention→MoE) |
| Experts | 256 (8 routed + 1 shared active) |
| Expert Intermediate Dim | 512 |
| Context Length | 262,144 native (up to 1,010,000 via YaRN) |
| Languages | 201 |
| Multimodal | ✅ Image & video input |
| License | Apache 2.0 |
| Engine | Darwin V5 (Evolutionary Merge + Model MRI) |
| Evolution Phase | Phase 2, real_score 0.8405 |
| Merge Commit | 109838c2 |

Usage

SGLang (Recommended)

python -m sglang.launch_server \
  --model-path FINAL-Bench/Darwin-35B-A3B-Opus \
  --tp 1 \
  --mem-fraction-static 0.90 \
  --context-length 32768 \
  --trust-remote-code

vLLM

vllm serve FINAL-Bench/Darwin-35B-A3B-Opus \
  --trust-remote-code \
  --enforce-eager

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-35B-A3B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-35B-A3B-Opus",
    dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)

Best Practices

  • Use context ≥ 32K for reasoning tasks; the model leverages extended thinking
  • For maximum reasoning quality, use thinking mode (default) with sufficient max_tokens (≥ 16384)
  • The model generates <think> blocks for internal reasoning; extract the final answer after </think>
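Stripping the thinking block is a one-line post-processing step (a minimal sketch; skip it if your serving stack already separates reasoning content):

```python
import re

def final_answer(text: str) -> str:
    """Drop <think>...</think> blocks and return the visible answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```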

Built By

| Developer | VIDRAFT |
|---|---|
| Evolution Engine | Darwin V5 (Evolutionary Merge + Model MRI) |
| Infrastructure | 4 × NVIDIA H100 93GB NVL GPUs |
| Merge Time | 181.6 seconds |
| Shard Distribution | 14 shards → GPUs [1, 2, 3], round-robin |

Acknowledgements

  • Korean Government: this research was supported by the Korean Government's 'GPU Support Program' research grant
  • Qwen Team: Qwen3.5-35B-A3B base architecture
  • Jackrong: Claude 4.6 Opus Reasoning Distilled model
  • nohurry, TeichAI: distillation datasets

Citation

@misc{vidraft_darwin_35b_opus,
  title        = {Darwin-35B-A3B-Opus: MRI-Guided Evolutionary Merge Beyond Both Parents},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus}}
}

Contact

📧 kkms1116@koreacu.ac.kr

FAQ (Frequently Asked Questions)

What is Darwin-35B-A3B-Opus? Darwin-35B-A3B-Opus is a 35 billion parameter Mixture-of-Experts language model (3B active per token) created using evolutionary merge techniques. It combines Qwen3.5-35B-A3B's multimodal versatility with Claude 4.6 Opus reasoning distillation, achieving 90.0% on GPQA Diamond and surpassing both parent models.
How does Darwin V5 differ from simple model merging? Traditional merging applies uniform ratios by guesswork. Darwin V5 uses evolutionary algorithms (natural selection) combined with Model MRI (neural CT-scanning) to automatically discover optimal layer-specific merge ratios. For example, it found attn=0.168 and ffn=0.841, an extreme asymmetry impossible to find by intuition.
What GPU do I need to run this model? For BF16 full precision: A100 80GB (tight) or H100 93GB (comfortable). For Q4 quantization: a single RTX 4090 (24GB) is sufficient. The model loads 35B parameters but only activates 3B per token due to its MoE architecture.
Does it support multimodal (images/video)? Yes. Darwin inherits the Father model's (Qwen3.5-35B-A3B) full multimodal capabilities including image and video understanding, unlike the Mother model which lost this during text-only fine-tuning.
What languages does it support? 201 languages and dialects, inherited from Qwen3.5's multilingual training. MMMLU benchmark confirms 85.0% multilingual knowledge retention across 29 evaluated languages.
What is Model MRI? Model MRI is a neural anatomy analysis tool that CT-scans each layer of a language model to understand what functions it performs. When integrated with Darwin, it guides the evolutionary merge process, telling the algorithm which layers to preserve from each parent and which to replace. In this model, MRI identified L38 as the Mother's "golden layer" (core reasoning engine) and prescribed 90% Mother weight for that specific layer.
What are "Dead Experts" and why do they matter? In Mixture-of-Experts (MoE) models, each layer contains hundreds of specialist sub-networks (experts). The Mother model's Claude distillation killed 50-65% of these experts because text-only fine-tuning didn't activate multimodal/multilingual specialists. Darwin's MRI detected this and prescribed replacing dead experts with the Father's living ones, reviving capabilities the Mother lost.
Is this model open source? Yes. Darwin-35B-A3B-Opus is released under the Apache 2.0 license, fully open for commercial and research use.

#DarwinAI #EvolutionaryMerge #ModelMRI #DarwinV5 #GPQA90 #Qwen35 #MoE3B #Reasoning #Multimodal #201Languages #OpenSource #Apache2 #VIDRAFT #NaturalSelection #LayerWiseMerge #ClaudeOpus #ThinkingModel #CodingAgent #LongContext262K #BestOpenSourceLLM2026 #DeadExpertRevival #GoldenLayer #MoEMerge #NeuralAnatomy
