---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-35B-A3B
- hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
tags:
- darwin
- darwin-v7
- evolutionary-merge
- reasoning
- advanced-reasoning
- chain-of-thought
- thinking
- qwen3.6
- qwen
- moe
- mixture-of-experts
- claude-opus
- distillation
- multilingual
- gpqa
- benchmark
- open-source
- apache-2.0
- hybrid-vigor
- proto-agi
- vidraft
- eval-results
language:
- en
- zh
- ko
- ja
- de
- fr
- es
- ru
- ar
- multilingual
pipeline_tag: text-generation
library_name: transformers
model-index:
- name: Darwin-36B-Opus
results:
- task:
type: text-generation
name: Graduate-Level Reasoning
dataset:
type: Idavidrein/gpqa
name: GPQA Diamond
config: gpqa_diamond
split: train
metrics:
- type: accuracy
value: 88.4
name: Accuracy
verified: false
- task:
type: text-generation
name: Multilingual Knowledge
dataset:
type: openai/MMMLU
name: MMMLU
metrics:
- type: accuracy
value: 85.0
name: Accuracy
verified: false
---
# Darwin-36B-Opus: Darwin V7 Evolutionary Merge on Qwen3.6-35B-A3B → 88.4% on GPQA Diamond
<p align="center">
<a href="https://huggingface.co/FINAL-Bench/Darwin-36B-Opus"><img src="https://img.shields.io/badge/β_GPQA_Diamond-88.4%25_Darwin--36B--Opus-gold?style=for-the-badge" alt="GPQA"></a>
<a href="https://huggingface.co/FINAL-Bench/Darwin-27B-Opus"><img src="https://img.shields.io/badge/π§¬_Sibling-Darwin--27B--Opus_(86.9%25)-blue?style=for-the-badge" alt="Sibling"></a>
</p>
<p align="center">
<a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Genesis"><img src="https://img.shields.io/badge/π§¬_Model-Darwin--4B--Genesis-blue?style=for-the-badge" alt="Genesis"></a>
<a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/π§¬_Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="9B"></a>
<a href="https://huggingface.co/FINAL-Bench/Darwin-27B-Opus"><img src="https://img.shields.io/badge/π§¬_Model-Darwin--27B--Opus-blue?style=for-the-badge" alt="27B"></a>
<a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/π§¬_Model-Darwin--31B--Opus-blue?style=for-the-badge" alt="31B"></a>
</p>
<p align="center">
<a href="https://huggingface.co/FINAL-Bench/Darwin-36B-Opus"><img src="https://img.shields.io/badge/β_Model-Darwin--36B--Opus-gold?style=for-the-badge" alt="36B"></a>
</p>
<p align="center">
<a href="https://huggingface.co/collections/FINAL-Bench/darwin-family"><img src="https://img.shields.io/badge/π _Darwin_Family-Collection-green?style=for-the-badge" alt="Family"></a>
<a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/π_FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
</p>
> Qwen3.6-35B-A3B MoE | 36B total / 3B active | Thinking Mode | 262K Context | Multilingual | BF16 | Apache 2.0
> **Darwin V7 evolutionary merge: Father × Opus-distilled Mother → 88.4% on GPQA Diamond**
---
## Abstract
**Darwin-36B-Opus** is a 36-billion-parameter mixture-of-experts (MoE) language model produced by the Darwin V7 evolutionary breeding engine from two publicly available parents:
- **Father**: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), the foundation MoE with hybrid attention and 256 routed experts.
- **Mother**: [hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled), a Claude Opus 4.6 reasoning-distilled variant of the same Father.
Darwin V7 recombines these two parents into a single descendant that preserves the Mother's distilled chain-of-thought behavior while retaining the structural fidelity of the Father's expert topology. The breeding process is fully automated and produces a deployable bfloat16 checkpoint in under an hour on a single GPU.
On the **GPQA Diamond** benchmark (198 graduate-level questions in physics, chemistry, and biology), Darwin-36B-Opus achieves **88.4%**, establishing it as the highest-performing model in the Darwin family and extending the series' record of producing state-of-the-art open models through evolution rather than retraining.
---
## GPQA Diamond Leaderboard (April 23, 2026)
| Rank | Model | Parameters | GPQA Diamond |
|---|---|---|---|
| 1 | TNSA/NGen-4-Pro | – | 91.1% |
| 2 | TNSA/NGen-4 | – | 90.1% |
| 3 | Qwen/Qwen3.5-397B-A17B | 397B | 88.4% |
| **3** | **FINAL-Bench/Darwin-36B-Opus** | **36B (A3B)** | **88.4%** |
| 5 | moonshotai/Kimi-K2.5 | – | 87.6% |
| 6 | FINAL-Bench/Darwin-27B-Opus | 27B | 86.9% |
| 7 | Qwen/Qwen3.5-122B-A10B | 122B | 86.6% |
| 8 | zai-org/GLM-5.1 | 744B | 86.2% |
| 9 | zai-org/GLM-5 | 744B | 86.0% |
| 10 | zai-org/GLM-4.7 | – | 85.7% |
A **36B-parameter MoE (3B active)** ties the **397B-parameter** Qwen3.5-397B-A17B and surpasses flagship dense and sparse systems more than an order of magnitude larger.
---
## What Is Darwin?
**Darwin** is the evolutionary model breeding engine developed by FINAL-Bench / VIDRAFT_LAB. Rather than allocating further compute to gradient optimization, Darwin treats trained checkpoints as a genetic pool and discovers high-performing descendants through principled recombination of their weight tensors.
Each Darwin generation (v1 through v7+) refines the breeding procedure. **Darwin V7** is the current generation and the one used to produce this model. Specific algorithmic details of V7 are proprietary to FINAL-Bench; at a high level, the engine performs:
1. **Per-tensor compatibility analysis** of the two parents to identify which components transfer cleanly and which require weighted recombination.
2. **Automated recombination** guided by that analysis, producing a single coherent descendant.
3. **Verification** via a multi-phase scientific benchmark before release.
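The per-tensor compatibility analysis of step 1 can be pictured, very roughly, as follows. This is a minimal sketch only: the cosine-similarity metric, the thresholds, and the shard paths are illustrative assumptions, not the proprietary V7 procedure.

```python
import torch
from safetensors.torch import load_file

# Hypothetical shard paths; a real pass would iterate over every shard of both parents.
father = load_file("father/model-00001-of-00021.safetensors")
mother = load_file("mother/model-00001-of-00021.safetensors")

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    a, b = a.float().flatten(), b.float().flatten()
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

report = {}
for name, w_father in father.items():
    w_mother = mother.get(name)
    if w_mother is None or w_mother.shape != w_father.shape:
        report[name] = "incompatible"         # structure diverged: keep the Father's tensor
    elif cosine(w_father, w_mother) > 0.98:   # illustrative threshold
        report[name] = "transfers cleanly"    # parents nearly agree: either copy works
    else:
        report[name] = "needs recombination"  # diverged: blend with a role-dependent weight
```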
All Darwin models are released under Apache 2.0 and fully inherit the parents' open-source licenses.
---
## Parent Models
### 🔵 Father – Qwen/Qwen3.6-35B-A3B
- **Model type**: Qwen3.6 MoE, 35B total / ~3B active parameters
- **Layers**: 40, **Hidden size**: 2048
- **Attention**: hybrid 75% Gated DeltaNet + 25% Gated Attention (alternating)
- **Experts**: 256 routed (top-8) + 1 shared per layer
- **Native scores**: MMLU-Pro 85.2%, GPQA 86.0%, AIME26 92.7%
- **Role**: Structural backbone and MoE topology donor.
### 🔴 Mother – hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
- **Method**: LoRA SFT on the Father over 14,233 Claude Opus 4.6 chain-of-thought samples
- **Training regime**: `qwen3-thinking` chat template with response-only loss masking (sketched below)
- **Native score**: MMLU-Pro (70 limit-5) 75.71%, **+32.85 percentage points** over the un-distilled Father baseline
- **Role**: Reasoning signal donor; the source whose `<think>` trajectories Darwin preserves.
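As a rough illustration of that training setup (not the Mother's actual code, whose exact configuration is not published here), response-only masking simply excludes the prompt tokens from the loss by setting their labels to `-100`:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B", trust_remote_code=True)

# One hypothetical SFT sample: user prompt plus a distilled <think> response.
messages = [{"role": "user", "content": "Why is the sky blue?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = "<think>\nRayleigh scattering favors shorter wavelengths...\n</think>\nBecause blue light scatters more strongly."

prompt_ids = tok(prompt, add_special_tokens=False)["input_ids"]
response_ids = tok(response + tok.eos_token, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids  # loss is computed on the response only
```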
---
## Evolution Process (High Level)
Darwin V7 produces the descendant through a deterministic recombination that does not require gradient optimization on the final assembly. The engine analyzes each tensor in both parents, classifies it by architectural role, and assigns a recombination weight appropriate to that role, biasing toward the Mother for components that carry reasoning behavior (attention, shared experts, embeddings) while preserving the Father's structural contributions where they dominate.
Total breeding time on a single B200 GPU: **under 10 minutes**.
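The role classification and weighting schedule themselves are proprietary; the sketch below only conveys the general shape of role-dependent linear recombination, with made-up blend weights and a crude, hypothetical name-based classifier for Qwen3-MoE tensors.

```python
import torch

# Illustrative blend weights (alpha = Mother's share); these numbers are assumptions.
ALPHA_BY_ROLE = {
    "attention": 0.7,       # reasoning-carrying components lean toward the Mother
    "shared_expert": 0.7,
    "embedding": 0.6,
    "router": 0.3,          # routing and expert topology lean toward the Father
    "routed_expert": 0.3,
    "other": 0.5,
}

def classify(name: str) -> str:
    # Crude name-based heuristic over typical Qwen-MoE tensor names (hypothetical).
    if "embed_tokens" in name or "lm_head" in name:
        return "embedding"
    if "self_attn" in name or "linear_attn" in name:
        return "attention"
    if "shared_expert" in name:
        return "shared_expert"
    if "mlp.gate." in name:
        return "router"
    if "experts" in name:
        return "routed_expert"
    return "other"

def recombine(name: str, w_father: torch.Tensor, w_mother: torch.Tensor) -> torch.Tensor:
    alpha = ALPHA_BY_ROLE[classify(name)]
    merged = (1 - alpha) * w_father.float() + alpha * w_mother.float()
    return merged.to(w_father.dtype)
```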
---
## GPQA Diamond Evaluation
### Methodology
We employed a two-pass adaptive evaluation protocol (identical across all Darwin Opus models to preserve cross-model comparability):
**Pass 1 – Greedy Baseline**
- All 198 GPQA Diamond questions, deterministic decoding (`do_sample=False`)
- Maximum 5,120 new tokens per question (allows full `<think>` trajectories)
- Standard multiple-choice prompt format
**Pass 2 – Stochastic Retry with Tiebreaker**
- Questions incorrectly answered in Pass 1 are re-evaluated with **majority-of-8 stochastic generations** (`temperature=0.7`, `max_tokens=5120`)
- Where the vote margin is inconclusive (3:3, 3:4, or 4:4), an additional **16-vote combined tiebreaker** round (`temperature=0.5`) resolves the answer
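The retry-and-tiebreak rule can be sketched as follows. `sample_answers` is a hypothetical helper (generation plus answer extraction, as in the Usage section below), and treating the "combined" tiebreaker as pooling 16 extra votes with the original 8 is our reading of the protocol.

```python
from collections import Counter
from typing import Callable, List

def pass2_answer(question: str,
                 sample_answers: Callable[[str, int, float], List[str]]) -> str:
    # sample_answers(question, n, temperature) is assumed to return n extracted
    # answer letters; its implementation is not shown here.
    votes = sample_answers(question, 8, 0.7)
    ranked = Counter(votes).most_common()
    top = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else ("", 0)
    # A vote margin of 0 or 1 (e.g. 3:3, 3:4, 4:4 splits) is treated as inconclusive
    # and resolved by pooling in 16 additional samples at temperature 0.5.
    if top[1] - runner_up[1] <= 1:
        votes = votes + sample_answers(question, 16, 0.5)
        top = Counter(votes).most_common(1)[0]
    return top[0]
```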
Evaluation was performed in parallel across 8 × NVIDIA B200 GPUs, each running an independent full copy of the model on a disjoint subset of the benchmark (round-robin question assignment).
### Aggregate Results
| Phase | Cumulative Correct | Accuracy | Δ |
|---|---|---|---|
| Pass 1 – Greedy Baseline | 145/198 | 73.2% | baseline |
| Pass 2 – Stochastic Retry | **175/198** | **88.4%** | **+15.2 percentage points** |
The Pass-2 gain of **+30 questions (+15.2 pp)** shows that the Mother's inherited `<think>` reasoning recovers substantially more correct answers under stochastic sampling with majority voting than under greedy decoding, consistent with the evolutionary merge preserving reasoning depth.
### Results by Shard
| GPU | Questions | Pass 1 Greedy | **Final** |
|:---:|:---:|:---:|:---:|
| GPU0 | 25 | 17/25 (68.0%) | **22/25 (88.0%)** |
| GPU1 | 25 | 17/25 (68.0%) | **20/25 (80.0%)** |
| GPU2 | 25 | 19/25 (76.0%) | **23/25 (92.0%)** |
| GPU3 | 25 | 21/25 (84.0%) | **25/25 (100.0%)** ⭐ |
| GPU4 | 25 | 20/25 (80.0%) | **23/25 (92.0%)** |
| GPU5 | 25 | 17/25 (68.0%) | **22/25 (88.0%)** |
| GPU6 | 24 | 17/24 (70.8%) | **20/24 (83.3%)** |
| GPU7 | 24 | 17/24 (70.8%) | **20/24 (83.3%)** |
| **Total** | **198** | **145/198 (73.2%)** | **175/198 (88.4%)** |
Notably, **GPU3 achieved a perfect 25/25** on its partition: every Pass-1 error on that shard was recovered through the stochastic retry cascade.
---
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-36B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-36B-Opus",
    torch_dtype=torch.bfloat16,  # full bf16 checkpoint (~72 GB of VRAM, see below)
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Derive the equation for relativistic kinetic energy."}
]
# add_generation_prompt=True appends the assistant turn opener, including the <think> tag.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)

# Leave generous headroom so the <think> trace can finish before the final answer.
outputs = model.generate(**inputs, max_new_tokens=5120, temperature=0.6, do_sample=True)
print(tok.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
### Answer Extraction for Evaluations
This is a **thinking model**: responses always begin with a `<think>` reasoning trace. For benchmarks, extract the final answer after `</think>`:
```python
response = tok.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
idx = response.rfind("</think>")
answer_part = response[idx + len("</think>"):].strip() if idx >= 0 else response
```
### Recommended Settings
- **Temperature**: 0.6–0.7 for reasoning / majority voting; 0.0 for greedy deterministic
- **max_new_tokens**: ≥5120 to accommodate full `<think>` trajectories
- **Chat template**: `<|im_start|>assistant\n<think>\n` auto-inserted by `apply_chat_template(add_generation_prompt=True)`
---
## Model Specifications
| | |
|---|---|
| Architecture | Qwen3MoE (Qwen3.6 codebase) |
| Total parameters | 36.0 B |
| Active parameters | ~3 B (top-8 of 256 routed experts per layer) |
| Layers | 40 |
| Hidden size | 2048 |
| Attention heads | 24 Q + 4 KV (GQA) |
| Head dimension | 256 |
| Experts per layer | 256 routed + 1 shared |
| Context length | 262,144 tokens |
| Vocabulary | 248,320 |
| Dtype | bfloat16 |
| Checkpoint size | ~65 GB (21 shards) |
| License | Apache 2.0 |
---
## VRAM Requirements
| Precision | VRAM | Recommended GPU |
|---|---|---|
| bf16 (full) | ~72 GB | 1× H100 80GB / 1× B200 |
| 8-bit | ~40 GB | 1× A100 40GB+ / 1× L40S |
| 4-bit | ~22 GB | 1× RTX 4090 / 1× A10 |
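As a rough starting point for the 4-bit row (a sketch assuming the `bitsandbytes` package is installed; the figures above are estimates and 4-bit quality has not been benchmarked here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 quantization with bf16 compute; requires a CUDA GPU and bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-36B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-36B-Opus",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```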
---
## Darwin Model Family
| Model | Base | Params | GPQA Diamond |
|---|---|---|---|
| Darwin-4B-Genesis | Qwen3.5-4B | 4 B | – |
| Darwin-9B-Opus | Qwen3.5-9B | 9 B | – |
| Darwin-27B-Opus | Qwen3.5-27B | 27 B | 86.9% |
| Darwin-31B-Opus | Gemma2-27B × variants | 31 B | 85.9% |
| **Darwin-36B-Opus** | **Qwen3.6-35B-A3B** | **36 B (A3B)** | **88.4%** ⭐ |
---
## Key Findings
1. **Evolutionary merging continues to scale.** Across three successive parameter tiers (27B → 31B → 36B), each new Darwin Opus model surpasses the prior one's GPQA Diamond score while maintaining the same zero-training methodology.
2. **Hybrid-attention MoE preserves reasoning under recombination.** The Father's 75% Gated-DeltaNet + 25% Gated-Attention architecture, inherited intact, demonstrates robustness to tensor-level recombination, a notable result given that MoE expert routing is sensitive to weight perturbation.
3. **Stochastic retry closes the greedy gap.** The +15.2 percentage-point lift from Pass 1 (73.2%) to Pass 2 (88.4%) suggests that the Mother's Opus-distilled reasoning is consistently present but not always surfaced by greedy decoding, a pattern characteristic of well-distilled chain-of-thought models.
---
## References
- Rein et al., *GPQA: A Graduate-Level Google-Proof Q&A Benchmark*, 2023. [dataset](https://huggingface.co/datasets/Idavidrein/gpqa)
- Qwen Team, *Qwen3.6 Technical Report*, 2026.
---
## Built By
**FINAL-Bench / VIDRAFT_LAB** β Darwin V7 evolutionary breeding engine.
- Father base weights by the Qwen Team.
- Mother by [@hesamation](https://huggingface.co/hesamation) (Claude Opus 4.6 as teacher).
---
## Citation
```bibtex
@misc{darwin-36b-opus,
title = {Darwin-36B-Opus: Darwin V7 Evolutionary Merge on Qwen3.6-35B-A3B},
author = {FINAL-Bench and VIDRAFT_LAB},
year = {2026},
url = {https://huggingface.co/FINAL-Bench/Darwin-36B-Opus},
note = {Qwen3.6-35B-A3B (Father) × Opus-distilled variant (Mother), Darwin V7 engine, 88.4% GPQA Diamond}
}
```