---
license: apache-2.0
base_model:
- Qwen/Qwen3.5-35B-A3B
- Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
tags:
- merge
- evolutionary-merge
- darwin
- darwin-v5
- model-mri
- reasoning
- advanced-reasoning
- chain-of-thought
- thinking
- qwen3.5
- qwen
- moe
- mixture-of-experts
- claude-opus
- distillation
- multimodal
- vision-language
- multilingual
- 201-languages
- gpqa
- benchmark
- open-source
- apache-2.0
- natural-selection
- layer-wise-merge
- coding-agent
- tool-calling
- long-context
- 262k-context
language:
- en
- zh
- ko
- ja
- de
- fr
- es
- ru
- ar
- multilingual
pipeline_tag: text-generation
library_name: transformers
model-index:
- name: Darwin-35B-A3B-Opus
  results:
  - task:
      type: text-generation
      name: Graduate-Level Reasoning
    dataset:
      type: Idavidrein/gpqa
      name: GPQA Diamond
      config: gpqa_diamond
      split: train
    metrics:
    - type: accuracy
      value: 90.0
      name: Accuracy
      verified: false
  - task:
      type: text-generation
      name: Multilingual Knowledge
    dataset:
      type: openai/MMMLU
      name: MMMLU
    metrics:
    - type: accuracy
      value: 85.0
      name: Accuracy
      verified: false
---

# Darwin-35B-A3B-Opus

<p align="center">
  <em>"The child surpassed both parents — that is evolution."</em>
</p>

<!-- SEO: Structured Summary for Search Engines & AI Answer Engines -->
<!--
Darwin-35B-A3B-Opus is a 35B parameter Mixture-of-Experts (MoE) language model with 3B active parameters,
created by VIDRAFT using the Darwin V5 evolutionary merge engine with Model MRI integration.
It achieves 90.0% on GPQA Diamond (vs Father Qwen3.5-35B-A3B at 84.2%) and 85.0% on MMMLU,
while preserving multimodal capabilities (image/video), 201 language support, and 262K context length.
Licensed under Apache 2.0.
-->

> **TL;DR**: 35B MoE (3B active) | **GPQA Diamond 90.0%** (beats Father 84.2% & Mother 85.0%) | **MMMLU 85.0%** | Multimodal ✅ | 201 Languages | 262K Context | 147.8 tok/s | Apache 2.0
>
> `#Darwin` `#EvolutionaryMerge` `#ModelMRI` `#Qwen3.5` `#MoE` `#Reasoning` `#GPQA90` `#Multimodal` `#OpenSource` `#Apache2` `#DarwinV5` `#VIDRAFT`

---

## Why Darwin? — The Child That Surpassed Both Parents

The fundamental question of AI model merging: **if the parent models already exist, why crossbreed?**

This model is the answer.

### Benchmark Results

**GPQA Diamond (198 Questions, Graduate-Level Reasoning)**

| Model | Accuracy | Multimodal | Benchmark Published |
|---|---|---|---|
| 🧬 **Darwin-35B-A3B-Opus (Child)** | **90.0%** | ✅ Image/Video | ✅ Fully Open |
| 👩 Mother — Jackrong Claude 4.6 Opus Distilled | 85.0% | ❌ Text-only | ❌ Not Published |
| 👨 Father — Qwen3.5-35B-A3B (Official) | 84.2% | ✅ Image/Video | ✅ Official |

> *Evaluation: SGLang, context 32768, temperature 0, greedy decoding, official GPQA prompt format ("ANSWER: LETTER")*
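Because the evaluation forces the completion to end with `ANSWER: <LETTER>`, scoring can use a strict extractor. A minimal sketch — this is illustrative, not the official GPQA harness:

```python
import re

# Pull the final choice out of a completion that ends with "ANSWER: <LETTER>".
# GPQA is four-choice, so only A-D are accepted; anything else scores as a miss.
def extract_choice(completion: str):
    match = re.search(r"ANSWER:\s*([A-D])\b", completion)
    return match.group(1) if match else None

print(extract_choice("The half-life argument rules out B.\nANSWER: C"))  # C
```

Under greedy decoding (temperature 0), this keeps scoring fully deterministic.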
**MMMLU (Multilingual Knowledge, 29 Languages)**

| Model | Accuracy |
|---|---|
| 🧬 **Darwin-35B-A3B-Opus (Child)** | **85.0%** |
| 👨 Father — Qwen3.5-35B-A3B (Official) | 85.2% |

> *Darwin maintains Father-level multilingual knowledge while gaining superior reasoning.*

**The child surpassed both parents in reasoning and matched the Father in multilingual knowledge.**

- GPQA vs Father: **+6.9% relative improvement** ((90.0 − 84.2) / 84.2)
- GPQA vs Mother: **+5.9% relative improvement** ((90.0 − 85.0) / 85.0)
- MMMLU: **85.0%** — Father-level (85.2%) multilingual knowledge preserved
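The relative-improvement figures above are straightforward to verify:

```python
# Relative improvement = (child − parent) / parent, reported as a percentage.
def rel_improvement(child: float, parent: float) -> float:
    return round((child - parent) / parent * 100, 1)

print(rel_improvement(90.0, 84.2))  # 6.9 (vs Father)
print(rel_improvement(90.0, 85.0))  # 5.9 (vs Mother)
```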
### Why Not Just Use the Mother?

| | Mother (Claude Distilled) | Darwin (Child) |
|---|---|---|
| Reasoning | Strong (85.0%) | **Stronger (90.0%)** |
| Image/Video | ❌ Lost (text-only fine-tune) | ✅ Inherited from Father |
| 201 Languages | ❌ Potentially degraded | ✅ Inherited from Father |
| 262K Context | Unverified | ✅ Father's architecture preserved |
| Benchmark Transparency | ❌ No scores published | ✅ Fully open |

### Why Not Just Use the Father?

The Father (Qwen3.5-35B-A3B) excels in versatility but scores 84.2% on hard reasoning. Darwin **pushes reasoning to 90.0%** while maintaining Father-level multilingual knowledge (MMMLU 85.0% vs 85.2%) and all general capabilities.

**Conclusion: the only model that surpasses the Mother's reasoning, preserves the Father's multilingual knowledge, and retains full multimodal capabilities.**

---

## Model Overview

**Darwin-35B-A3B-Opus** is a next-generation reasoning-enhanced language model created by VIDRAFT's **Darwin V5** evolution engine.

Darwin V5 combines two innovations:

1. **Evolutionary Merge** — applies natural selection to automatically find optimal weight combinations
2. **Model MRI Integration** — CT-scans parent models layer by layer before merging, guiding evolution with structural insight

If conventional merging is "mixing recipes blindfolded," Darwin V5 is **"precision surgery with X-ray guidance."**

---

## Parent Models

| Role | Model | Strengths |
|---|---|---|
| 👨 Father | [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) | General knowledge, multimodal (image/video), coding, agents, 201 languages, 262K context |
| 👩 Mother | [Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled) | Claude 4.6 Opus CoT distillation, structured step-by-step reasoning, coding agent compatibility |

---

## Darwin V5 — Beyond Simple Merge

### Limitations of Conventional Merging

Traditional model merging relies on humans setting hyperparameters like ratio and density **by intuition**. Set ratio=0.5, density=0.9, run once, and hope for the best. The result depends on luck, and applying the same ratio uniformly across billions of parameters ignores each layer's unique role.

### Darwin V4's Advance

Darwin V4 solved this with **evolutionary algorithms** — automatically searching hundreds of parameter combinations and selecting survivors by real benchmark scores. But V4 was still **blind evolution**: it didn't know what each layer does.

### Darwin V5: Model MRI Opens the Eyes

V5 integrates **Model MRI** (a neural anatomy analyzer) to give evolution "sight":

```
[Phase 0] Model MRI — CT-scan both parents layer by layer
    ↓ "Father's layers 15-25 concentrate multilingual knowledge"
    ↓ "Mother's layers 30-40 concentrate reasoning patterns"
    ↓
[Phase 1] MRI-Guided Evolution — start from a scan-informed initial genome
    ↓ Not random, but "informed by CT results"
    ↓
[Phase 2] mergekit real merge + benchmark fitness selection
    ↓ Faster convergence in the MRI-narrowed search space
    ↓
[Phase 3] MRI Health Check — CT-scan the child model
    ↓ Detect interference and function loss
    ↓ Prescribe layer-specific ratio adjustments
    ↓
[Final] Darwin-35B-A3B-Opus
```

### V4 vs V5

| | Darwin V4 | Darwin V5 |
|---|---|---|
| Analogy | Mixing recipes blindfolded | **Precision surgery with X-ray** |
| Initial genome | Random | **MRI-guided** |
| Layer control | 2 ratios (attn/ffn) | **40 layers independently** |
| Pre-diagnosis | ❌ None | ✅ Phase 0 MRI scan |
| Post-verification | Benchmark only | ✅ Phase 3 health check |
| Search efficiency | Wide space | **Narrowed, guided search** |
| Failure diagnosis | Unknown "why" | **Pinpoint which layer failed** |
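The selection loop described above can be caricatured in a few lines. This is a toy sketch only: a genome is a hypothetical (ratio, attn, ffn) triple, and the quadratic `fitness` function stands in for merging a real checkpoint and scoring it on benchmarks:

```python
import random

random.seed(0)

# Stand-in fitness landscape peaking near the values Darwin V5 reportedly found;
# in the real engine this is a full mergekit merge plus a benchmark run.
def fitness(genome):
    ratio, attn, ffn = genome
    return -((ratio - 0.481) ** 2 + (attn - 0.168) ** 2 + (ffn - 0.841) ** 2)

def mutate(genome, sigma=0.05):
    # Gaussian perturbation, clipped to the valid [0, 1] range.
    return tuple(min(1.0, max(0.0, g + random.gauss(0, sigma))) for g in genome)

population = [tuple(random.random() for _ in range(3)) for _ in range(16)]
for _ in range(200):
    population.sort(key=fitness, reverse=True)
    elites = population[:4]                                  # survivors of selection
    population = elites + [mutate(random.choice(elites)) for _ in range(12)]

best = max(population, key=fitness)
```

Elitism (carrying the top genomes forward unchanged) guarantees the best fitness never regresses between generations.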
---

### Discovered Optimal Parameters

| Parameter | Value | Meaning |
|---|---|---|
| ratio | 0.481 | Father 52% : Mother 48% asymmetric blend |
| density_a | 0.855 | Selected 85.5% of Father's weights |
| density_b | 0.971 | Adopted 97.1% of Mother's weights |
| attn | 0.168 | Only 16.8% change in attention layers |
| ffn | 0.841 | 84.1% change in FFN layers |

**Interpretation:** Attention patterns (what to focus on) are **almost entirely preserved** from the Father, while FFN layers (knowledge storage) are **largely replaced** with the Mother's reasoning patterns.

Discovering attn=0.168 and ffn=0.841 — this extreme asymmetry — is **virtually impossible by human intuition**.
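The asymmetry can be pictured as per-layer-type interpolation. A toy sketch with short lists standing in for full weight tensors (the real merge runs through mergekit on entire checkpoints):

```python
# Dummy "weights" for two layer types; real tensors have millions of entries.
father = {"attn.weight": [1.0, 2.0], "ffn.weight": [3.0, 4.0]}
mother = {"attn.weight": [5.0, 6.0], "ffn.weight": [7.0, 8.0]}

# How far each layer type moves toward the Mother (values from the table above).
LAMBDA = {"attn": 0.168, "ffn": 0.841}

def merge_tensor(name, a, b):
    lam = LAMBDA[name.split(".")[0]]
    return [(1 - lam) * x + lam * y for x, y in zip(a, b)]

child = {name: merge_tensor(name, father[name], mother[name]) for name in father}
# Attention stays close to the Father's values; FFN lands close to the Mother's.
```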
### Evolution History

- Phase 1 → Phase 2 evolution complete
- Final real_score: **0.8405**
- Merge time: 181.6 seconds
- Merge commit: `109838c2`

---

## Inherited Capabilities

### From Father (Qwen3.5-35B-A3B)

- **Multimodal**: Image and video understanding
- **201 Languages**: Global linguistic coverage
- **262K Context**: Native long-context (extendable to 1M via YaRN)
- **Gated DeltaNet + MoE**: Efficient hybrid architecture
- **Multi-Token Prediction**: Improved inference throughput
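Context extension beyond the native window would follow the Qwen-style YaRN recipe. A hedged sketch of the relevant `config.json` fragment, assuming the `rope_scaling` field names carry over unchanged from Qwen3 — verify against the official model card before relying on it:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144
  }
}
```

A factor of 4.0 scales 262,144 positions up to ~1,048,576, covering the ~1M extension mentioned above.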
### From Mother (Claude 4.6 Opus Distilled)

- **Structured Thinking**: Systematic step-by-step reasoning within `<think>` tags
- **Efficient Reasoning**: "Let me analyze this request carefully: 1..2..3..." pattern
- **Coding Agent Compatibility**: Native "developer" role support for Claude Code, OpenCode
- **Tool Calling Stability**: Consistent performance in tool-use scenarios
- **Autonomous Execution**: Extended autonomous operation in agentic environments

---

## Father's Official Benchmarks (Reference)

Darwin is built on this architecture with enhanced reasoning:

| Category | Benchmark | Father Official |
|---|---|---|
| Knowledge | MMLU-Pro | 85.3 |
| Knowledge | MMLU-Redux | 93.3 |
| Reasoning | GPQA Diamond | 84.2 |
| Reasoning | HLE w/ CoT | 22.4 |
| Math | HMMT Feb 2025 | 89.0 |
| Coding | SWE-bench Verified | 69.2 |
| Coding | LiveCodeBench v6 | 74.6 |
| Agent | TAU2-Bench | 81.2 |
| Agent | BFCL-V4 (Tool Use) | 67.3 |
| Instruction | IFEval | 91.9 |
| Multilingual | MMMLU | 85.2 |
| Agentic Search | BrowseComp | 61.0 |

---

## Performance

### Inference Speed

| Metric | Value |
|---|---|
| **Generation Speed** | **147.8 tok/s** |
| Environment | Single NVIDIA H100 93GB NVL, SGLang, BF16 |
| Qwen Official API | 162.8 tok/s (Alibaba Cloud) |

### Hardware Requirements

| Setup | VRAM | Status |
|---|---|---|
| **BF16 (Full Precision)** | **65.5 GiB** | |
| Single H100 93GB NVL | 93 GB | ✅ Comfortable |
| Single A100 80GB | 80 GB | ⚠️ Tight |
| Single A100 40GB | 40 GB | ❌ Insufficient |
| **Q8 Quantized** | **~35 GiB** | |
| Single A100 40GB | 40 GB | ✅ Possible |
| **Q4_K_M Quantized** | **~18 GiB** | |
| Single RTX 4090 24GB | 24 GB | ✅ Comfortable |
| 2× RTX 4090 (tp=2) | 48 GB | ✅ BF16 possible |

> As a Mixture-of-Experts model, only 3B parameters are active per token despite loading the full 35B. Quantization has minimal impact due to this sparsity.
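The footprints above follow from a weights-only back-of-envelope. The bytes-per-parameter figures for the quantized rows are approximations (Q4_K_M's ~0.55 B/param includes scale metadata), and KV cache plus runtime overhead come on top:

```python
# All 35B parameters must be resident even though only 3B are active per token.
PARAMS = 35e9
GiB = 1024 ** 3

for label, bytes_per_param in [("BF16", 2.0), ("Q8", 1.0), ("Q4_K_M", 0.55)]:
    print(f"{label}: ~{PARAMS * bytes_per_param / GiB:.1f} GiB")
```

This reproduces the ~65 GiB BF16 and ~18 GiB Q4_K_M figures; the Q8 table entry (~35 GiB) includes additional overhead beyond raw weights.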
---

## Model Specifications

| | |
|---|---|
| Architecture | Qwen3.5 MoE (Gated DeltaNet + MoE) |
| Total Parameters | 35B |
| Active Parameters | 3B per forward pass |
| Hidden Dimension | 2,048 |
| Layers | 40 |
| Layer Layout | 10 × (3 × GDN→MoE + 1 × Attention→MoE) |
| Experts | 256 (8 routed + 1 shared active) |
| Expert Intermediate Dim | 512 |
| Context Length | 262,144 native (up to 1,010,000 via YaRN) |
| Languages | 201 |
| Multimodal | ✅ Image & Video input |
| License | Apache 2.0 |
| Engine | Darwin V5 (Evolutionary Merge + Model MRI) |
| Evolution Phase | Phase 2, real_score 0.8405 |
| Merge Commit | 109838c2 |

---

## Usage

### SGLang (Recommended)

```bash
python -m sglang.launch_server \
  --model-path FINAL-Bench/Darwin-35B-A3B-Opus \
  --tp 1 \
  --mem-fraction-static 0.90 \
  --context-length 32768 \
  --trust-remote-code
```

### vLLM

```bash
vllm serve FINAL-Bench/Darwin-35B-A3B-Opus \
  --trust-remote-code \
  --enforce-eager
```

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-35B-A3B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-35B-A3B-Opus",
    dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)

# Chat-template generation; leave thinking mode on (default) for best reasoning.
messages = [{"role": "user", "content": "Explain why the sky is blue."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=16384)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

### Best Practices

- Use **context ≥ 32K** for reasoning tasks — the model leverages extended thinking
- For maximum reasoning quality, use **thinking mode (default)** with sufficient `max_tokens` (≥ 16384)
- The model generates `<think>` blocks for internal reasoning; extract the final answer after `</think>`
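Extracting the final answer can be as simple as splitting on the closing tag. A minimal sketch, assuming the standard `<think>…</think>` output shape:

```python
def split_thinking(text: str):
    """Return (reasoning, answer); reasoning is empty when no </think> tag is present."""
    head, sep, tail = text.partition("</think>")
    if not sep:                         # model answered without a thinking block
        return "", text.strip()
    return head.replace("<think>", "").strip(), tail.strip()

reasoning, answer = split_thinking("<think>Compare 90.0 with 84.2.</think>The child wins.")
print(answer)  # The child wins.
```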
---

## Built By

| | |
|---|---|
| Developer | **VIDRAFT** |
| Evolution Engine | Darwin V5 (Evolutionary Merge + Model MRI) |
| Infrastructure | 4 × NVIDIA H100 93GB NVL GPUs |
| Merge Time | 181.6 seconds |
| Shard Distribution | 14 shards → GPU [1, 2, 3] round-robin |

---

## Acknowledgements

- **Korean Government** — this research was supported by the Korean Government's 'GPU Support Program' research grant
- [Qwen Team](https://huggingface.co/Qwen) — Qwen3.5-35B-A3B base architecture
- [Jackrong](https://huggingface.co/Jackrong) — Claude 4.6 Opus Reasoning Distilled model
- [nohurry](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered), [TeichAI](https://huggingface.co/datasets/TeichAI/claude-4.5-opus-high-reasoning-250x) — distillation datasets

---

## Citation

```bibtex
@misc{vidraft_darwin_35b_opus,
  title        = {Darwin-35B-A3B-Opus: MRI-Guided Evolutionary Merge Beyond Both Parents},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus}}
}
```

---

## FAQ

<details>
<summary><b>What is Darwin-35B-A3B-Opus?</b></summary>
Darwin-35B-A3B-Opus is a 35-billion-parameter Mixture-of-Experts language model (3B active per token) created using evolutionary merge techniques. It combines Qwen3.5-35B-A3B's multimodal versatility with Claude 4.6 Opus reasoning distillation, achieving 90.0% on GPQA Diamond — surpassing both parent models.
</details>

<details>
<summary><b>How does Darwin V5 differ from simple model merging?</b></summary>
Traditional merging applies uniform ratios by guesswork. Darwin V5 uses evolutionary algorithms (natural selection) combined with Model MRI (neural CT-scanning) to automatically discover optimal layer-specific merge ratios. For example, it found attn=0.168 and ffn=0.841 — an extreme asymmetry practically impossible to find by intuition.
</details>

<details>
<summary><b>What GPU do I need to run this model?</b></summary>
For BF16 full precision: an A100 80GB (tight) or H100 93GB (comfortable). For Q4 quantization: a single RTX 4090 (24GB) is sufficient. The model loads 35B parameters but only activates 3B per token due to its MoE architecture.
</details>

<details>
<summary><b>Does it support multimodal (images/video)?</b></summary>
Yes. Darwin inherits the Father model's (Qwen3.5-35B-A3B) full multimodal capabilities, including image and video understanding, unlike the Mother model, which lost them during text-only fine-tuning.
</details>

<details>
<summary><b>What languages does it support?</b></summary>
201 languages and dialects, inherited from Qwen3.5's multilingual training. The MMMLU benchmark confirms 85.0% multilingual knowledge retention across 29 evaluated languages.
</details>

<details>
<summary><b>What is Model MRI?</b></summary>
Model MRI is a neural anatomy analysis tool that CT-scans each layer of a language model to understand what functions it performs. Integrated with Darwin, it guides the evolutionary merge process — telling the algorithm which layers to preserve from each parent and which to replace.
</details>

<details>
<summary><b>Is this model open source?</b></summary>
Yes. Darwin-35B-A3B-Opus is released under the Apache 2.0 license, fully open for commercial and research use.
</details>

---

<!-- AEO: Keywords for AI Answer Engines -->
<!--
Keywords: Darwin-35B-A3B-Opus, evolutionary merge, model merging, Darwin V5, Model MRI,
GPQA Diamond 90%, Qwen3.5-35B-A3B, Claude 4.6 Opus, reasoning model, mixture of experts,
MoE 3B active, 35B parameters, multimodal LLM, 201 languages, 262K context,
open source AI model, Apache 2.0, VIDRAFT, natural selection AI,
layer-wise merge ratio, attention preservation, FFN replacement,
best open source reasoning model 2026, Qwen merge, coding agent compatible
-->

`#DarwinAI` `#EvolutionaryMerge` `#ModelMRI` `#DarwinV5` `#GPQA90` `#Qwen35` `#MoE3B` `#Reasoning` `#Multimodal` `#201Languages` `#OpenSource` `#Apache2` `#VIDRAFT` `#NaturalSelection` `#LayerWiseMerge` `#ClaudeOpus` `#ThinkingModel` `#CodingAgent` `#LongContext262K` `#BestOpenSourceLLM2026`