---
license: apache-2.0
base_model:
  - Qwen/Qwen3.6-35B-A3B
  - hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
tags:
  - darwin
  - darwin-v7
  - evolutionary-merge
  - reasoning
  - advanced-reasoning
  - chain-of-thought
  - thinking
  - qwen3.6
  - qwen
  - moe
  - mixture-of-experts
  - claude-opus
  - distillation
  - multilingual
  - gpqa
  - benchmark
  - open-source
  - apache-2.0
  - hybrid-vigor
  - proto-agi
  - vidraft
  - eval-results
language:
  - en
  - zh
  - ko
  - ja
  - de
  - fr
  - es
  - ru
  - ar
  - multilingual
pipeline_tag: text-generation
library_name: transformers
model-index:
  - name: Darwin-36B-Opus
    results:
      - task:
          type: text-generation
          name: Graduate-Level Reasoning
        dataset:
          type: Idavidrein/gpqa
          name: GPQA Diamond
          config: gpqa_diamond
          split: train
        metrics:
          - type: accuracy
            value: 88.4
            name: Accuracy
            verified: false
      - task:
          type: text-generation
          name: Multilingual Knowledge
        dataset:
          type: openai/MMMLU
          name: MMMLU
        metrics:
          - type: accuracy
            value: 85.0
            name: Accuracy
            verified: false
---

# Darwin-36B-Opus: Darwin V7 Evolutionary Merge on Qwen3.6-35B-A3B – 88.4% on GPQA Diamond

<p align="center">
  <a href="https://huggingface.co/FINAL-Bench/Darwin-36B-Opus"><img src="https://img.shields.io/badge/⭐_GPQA_Diamond-88.4%25_Darwin--36B--Opus-gold?style=for-the-badge" alt="GPQA"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-27B-Opus"><img src="https://img.shields.io/badge/🧬_Sibling-Darwin--27B--Opus_(86.9%25)-blue?style=for-the-badge" alt="Sibling"></a>
</p>

<p align="center">
  <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Genesis"><img src="https://img.shields.io/badge/🧬_Model-Darwin--4B--Genesis-blue?style=for-the-badge" alt="Genesis"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="9B"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-27B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--27B--Opus-blue?style=for-the-badge" alt="27B"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--31B--Opus-blue?style=for-the-badge" alt="31B"></a>
</p>

<p align="center">
  <a href="https://huggingface.co/FINAL-Bench/Darwin-36B-Opus"><img src="https://img.shields.io/badge/⭐_Model-Darwin--36B--Opus-gold?style=for-the-badge" alt="36B"></a>
</p>

<p align="center">
  <a href="https://huggingface.co/collections/FINAL-Bench/darwin-family"><img src="https://img.shields.io/badge/🏠_Darwin_Family-Collection-green?style=for-the-badge" alt="Family"></a>
  <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/πŸ†_FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
</p>

> Qwen3.6-35B-A3B MoE | 36B total / 3B active | Thinking Mode | 262K Context | Multilingual | BF16 | Apache 2.0
> **Darwin V7 evolutionary merge: Father × Opus-distilled Mother → 88.4% on GPQA Diamond**

---

## Abstract

**Darwin-36B-Opus** is a 36-billion-parameter mixture-of-experts (MoE) language model produced by the Darwin V7 evolutionary breeding engine from two publicly available parents:

- **Father**: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) – the foundation MoE with hybrid attention and 256 routed experts.
- **Mother**: [hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled) – a Claude Opus 4.6 reasoning-distilled variant of the same Father.

Darwin V7 recombines these two parents into a single descendant that preserves the Mother's distilled chain-of-thought behavior while retaining the structural fidelity of the Father's expert topology. The breeding process is fully automated and produces a deployable bfloat16 checkpoint in under an hour on a single GPU.

On the **GPQA Diamond** benchmark (198 graduate-level questions in physics, chemistry, and biology), Darwin-36B-Opus achieves **88.4%**, establishing it as the highest-performing model in the Darwin family and extending the series' record of producing state-of-the-art open models through evolution rather than retraining.

---

## GPQA Diamond Leaderboard (April 23, 2026)

| Rank | Model | Parameters | GPQA Diamond |
|---|---|---|---|
| 1 | TNSA/NGen-4-Pro | – | 91.1% |
| 2 | TNSA/NGen-4 | – | 90.1% |
| 3 | Qwen/Qwen3.5-397B-A17B | 397B | 88.4% |
| **3** | **FINAL-Bench/Darwin-36B-Opus** | **36B (A3B)** | **88.4%** |
| 5 | moonshotai/Kimi-K2.5 | – | 87.6% |
| 6 | FINAL-Bench/Darwin-27B-Opus | 27B | 86.9% |
| 7 | Qwen/Qwen3.5-122B-A10B | 122B | 86.6% |
| 8 | zai-org/GLM-5.1 | 744B | 86.2% |
| 9 | zai-org/GLM-5 | 744B | 86.0% |
| 10 | zai-org/GLM-4.7 | – | 85.7% |

With only **3B active parameters out of 36B total**, Darwin-36B-Opus ties the **397B-parameter** Qwen3.5-397B-A17B and surpasses flagship dense and sparse systems an order of magnitude larger.

---

## What Is Darwin?

**Darwin** is the evolutionary model breeding engine developed by FINAL-Bench / VIDRAFT_LAB. Rather than allocating further compute to gradient optimization, Darwin treats trained checkpoints as a genetic pool and discovers high-performing descendants through principled recombination of their weight tensors.

Each Darwin generation (v1 through v7+) refines the breeding procedure. **Darwin V7** is the current generation and the one used to produce this model. Specific algorithmic details of V7 are proprietary to FINAL-Bench; at a high level, the engine performs:

1. **Per-tensor compatibility analysis** of the two parents to identify which components transfer cleanly and which require weighted recombination.
2. **Automated recombination** guided by that analysis, producing a single coherent descendant.
3. **Verification** via a multi-phase scientific benchmark before release.

All Darwin models are released under Apache 2.0 and inherit fully from the parents' open-source licenses.

---

## Parent Models

### 🔵 Father – Qwen/Qwen3.6-35B-A3B

- **Model type**: Qwen3.6 MoE, 35B total / ~3B active parameters
- **Layers**: 40, **Hidden size**: 2048
- **Attention**: hybrid 75% Gated DeltaNet + 25% Gated Attention (alternating)
- **Experts**: 256 routed (top-8) + 1 shared per layer
- **Native scores**: MMLU-Pro 85.2%, GPQA 86.0%, AIME26 92.7%
- **Role**: Structural backbone and MoE topology donor.

### 🔴 Mother – hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled

- **Method**: LoRA SFT on the Father over 14,233 Claude Opus 4.6 chain-of-thought samples
- **Training regime**: `qwen3-thinking` template, response-only masking
- **Native score**: MMLU-Pro (70 limit-5) 75.71%, **+32.85 percentage points** over the un-distilled Father baseline
- **Role**: Reasoning signal donor, the source whose `<think>` trajectories Darwin preserves.

---

## Evolution Process (High Level)

Darwin V7 produces the descendant through a deterministic recombination that does not require gradient optimization on the final assembly. The engine analyzes each tensor in both parents, classifies it by architectural role, and assigns a recombination weight appropriate to that role, biasing toward the Mother for components that carry reasoning behavior (attention, shared experts, embeddings) while preserving the Father's structural contributions where they dominate.

Total breeding time on a single B200 GPU: **under 10 minutes**.
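The specifics of the V7 analysis and its weights are proprietary, so the sketch below is only a generic illustration of role-weighted linear interpolation between two same-architecture checkpoints. The `ROLE_ALPHA` map and the name-matching heuristic are hypothetical values chosen for illustration, not the values Darwin V7 actually uses.

```python
import torch

# Hypothetical role -> blend-weight map: fraction of each tensor taken from the
# Mother (reasoning donor). Everything else stays closer to the Father backbone.
ROLE_ALPHA = {
    "self_attn": 0.7,
    "shared_expert": 0.7,
    "embed_tokens": 0.6,
    "lm_head": 0.6,
}
DEFAULT_ALPHA = 0.3  # e.g. routed experts and remaining structural tensors

def blend_weight(name: str) -> float:
    """Pick the Mother fraction from the tensor's architectural role (inferred from its name)."""
    for key, alpha in ROLE_ALPHA.items():
        if key in name:
            return alpha
    return DEFAULT_ALPHA

def merge_state_dicts(father: dict, mother: dict) -> dict:
    """Linearly interpolate every tensor: child = (1 - a) * father + a * mother."""
    child = {}
    for name, f_tensor in father.items():
        a = blend_weight(name)
        merged = (1.0 - a) * f_tensor.float() + a * mother[name].float()
        child[name] = merged.to(torch.bfloat16)
    return child
```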

---

## GPQA Diamond Evaluation

### Methodology

We employed a two-pass adaptive evaluation protocol (identical across all Darwin Opus models to preserve cross-model comparability):

**Pass 1 – Greedy Baseline**

- All 198 GPQA Diamond questions, deterministic decoding (`do_sample=False`)
- Maximum 5,120 new tokens per question (allows full `<think>` trajectories)
- Standard multiple-choice prompt format

**Pass 2 – Stochastic Retry with Tiebreaker**

- Questions incorrectly answered in Pass 1 are re-evaluated with **majority-of-8 stochastic generations** (`temperature=0.7`, `max_tokens=5120`)
- Where the vote margin is inconclusive (3:3, 3:4, or 4:4), an additional **16-vote combined tiebreaker** round (`temperature=0.5`) resolves the answer

Evaluation was performed in parallel across 8 × NVIDIA B200 GPUs, each running an independent full copy of the model on a disjoint subset of the benchmark (round-robin question assignment).
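
For concreteness, the combined logic of the two passes can be sketched as follows. This is an illustrative reconstruction of the protocol described above, not the released harness; `generate_answer(question, temperature)` is a hypothetical helper that runs one generation, parses the text after `</think>`, and returns the chosen letter.

```python
from collections import Counter

def majority_vote(question, n_votes, temperature, generate_answer):
    """Tally the choice letters from n_votes independent stochastic generations."""
    return Counter(generate_answer(question, temperature=temperature) for _ in range(n_votes))

def evaluate_question(question, gold, generate_answer):
    # Pass 1: one greedy (deterministic) generation.
    if generate_answer(question, temperature=0.0) == gold:
        return True

    # Pass 2: majority of 8 stochastic generations at temperature 0.7.
    votes = majority_vote(question, n_votes=8, temperature=0.7, generate_answer=generate_answer)
    ranked = votes.most_common()
    top, top_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0

    # Tiebreaker: if the vote margin is inconclusive (e.g. 4:4, 4:3, or 3:3 splits),
    # add a 16-vote round at temperature 0.5 and combine the tallies before deciding.
    if top_count - runner_up_count <= 1:
        votes += majority_vote(question, n_votes=16, temperature=0.5, generate_answer=generate_answer)
        top, _ = votes.most_common(1)[0]

    return top == gold
```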

### Aggregate Results

| Phase | Cumulative Correct | Accuracy | Δ |
|---|---|---|---|
| Pass 1 – Greedy Baseline | 145/198 | 73.2% | baseline |
| Pass 2 – Stochastic Retry | **175/198** | **88.4%** | **+15.2 percentage points** |

The Pass-2 gain of **+30 questions (+15.2 pp)** demonstrates that the Mother's inherited `<think>` reasoning yields substantially more correct answers under stochastic decoding than under greedy, confirming that the evolutionary merge preserved reasoning depth.

### Results by Shard

| GPU | Questions | Pass 1 Greedy | **Final** |
|:---:|:---:|:---:|:---:|
| GPU0 | 25 | 17/25 (68.0%) | **22/25 (88.0%)** |
| GPU1 | 25 | 17/25 (68.0%) | **20/25 (80.0%)** |
| GPU2 | 25 | 19/25 (76.0%) | **23/25 (92.0%)** |
| GPU3 | 25 | 21/25 (84.0%) | **25/25 (100.0%)** ⭐ |
| GPU4 | 25 | 20/25 (80.0%) | **23/25 (92.0%)** |
| GPU5 | 25 | 17/25 (68.0%) | **22/25 (88.0%)** |
| GPU6 | 24 | 17/24 (70.8%) | **20/24 (83.3%)** |
| GPU7 | 24 | 17/24 (70.8%) | **20/24 (83.3%)** |
| **Total** | **198** | **145/198 (73.2%)** | **175/198 (88.4%)** |

Notably, **GPU3 achieved a perfect 25/25 score** on its 25-question partition: every Pass-1 error on that shard was successfully recovered through the stochastic retry cascade.

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-36B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-36B-Opus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Derive the equation for relativistic kinetic energy."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5120, temperature=0.6, do_sample=True)
print(tok.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

### Answer Extraction for Evaluations

This is a **thinking model**: responses always begin with a `<think>` reasoning trace. For benchmarks, extract the final answer after `</think>`:

```python
response = tok.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
idx = response.rfind("</think>")
answer_part = response[idx + len("</think>"):].strip() if idx >= 0 else response
```
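
If you then need a single choice letter (e.g. for GPQA-style scoring), a simple heuristic over `answer_part` might look like the following. The pattern assumes the prompt asks for an "Answer: <letter>" style response; that format is an assumption for illustration, not part of the released evaluation code.

```python
import re

# Hedged sketch: pull a multiple-choice letter out of the post-</think> answer text.
# Adjust the pattern to whatever answer format your evaluation prompt requests.
m = re.search(r"answer\s*(?:is|:)?\s*\(?\s*([A-D])\b", answer_part, flags=re.IGNORECASE)
choice = m.group(1).upper() if m else None
print(choice)
```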

### Recommended Settings

- **Temperature**: 0.6–0.7 for reasoning / majority voting; 0.0 for greedy deterministic
- **max_new_tokens**: ≥ 5120 to accommodate full `<think>` trajectories
- **Chat template**: `<|im_start|>assistant\n<think>\n` auto-inserted by `apply_chat_template(add_generation_prompt=True)` (a quick check is sketched below)
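
One quick way to confirm that the generation prompt really opens the assistant turn with `<think>` (the expected string is taken from this card; the exact template ships with the tokenizer) reuses the `tok` object from the usage example above:

```python
# Illustrative check, assuming the bundled chat template behaves as described above.
probe = tok.apply_chat_template(
    [{"role": "user", "content": "ping"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(probe.endswith("<|im_start|>assistant\n<think>\n"))  # expected: True
```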

---

## Model Specifications

| | |
|---|---|
| Architecture | Qwen3MoE (Qwen3.6 codebase) |
| Total parameters | 36.0 B |
| Active parameters | ~3 B (top-8 of 256 routed experts per layer) |
| Layers | 40 |
| Hidden size | 2048 |
| Attention heads | 24 Q + 4 KV (GQA) |
| Head dimension | 256 |
| Experts per layer | 256 routed + 1 shared |
| Context length | 262,144 tokens |
| Vocabulary | 248,320 |
| Dtype | bfloat16 |
| Checkpoint size | ~65 GB (21 shards) |
| License | Apache 2.0 |

---

## VRAM Requirements

| Precision | VRAM | Recommended GPU |
|---|---|---|
| bf16 (full) | ~72 GB | 1× H100 80GB / 1× B200 |
| 8-bit | ~40 GB | 1× A100 40GB+ / 1× L40S |
| 4-bit | ~22 GB | 1× RTX 4090 / 1× A10 |
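
For the 4-bit row, a minimal loading sketch using bitsandbytes through `transformers` is shown below. It assumes `bitsandbytes` is installed and supports this MoE architecture; the ~22 GB figure above is the authors' estimate and actual memory use also depends on context length.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 weight-only quantization with bf16 compute for activations.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-36B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-36B-Opus",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```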

---

## Darwin Model Family

| Model | Base | Params | GPQA Diamond |
|---|---|---|---|
| Darwin-4B-Genesis | Qwen3.5-4B | 4 B | – |
| Darwin-9B-Opus | Qwen3.5-9B | 9 B | – |
| Darwin-27B-Opus | Qwen3.5-27B | 27 B | 86.9% |
| Darwin-31B-Opus | Gemma2-27B × variants | 31 B | 85.9% |
| **Darwin-36B-Opus** | **Qwen3.6-35B-A3B** | **36 B (A3B)** | **88.4%** ⭐ |

---

## Key Findings

1. **Evolutionary merging continues to scale.** Across three successive parameter tiers (27B → 31B → 36B), each new Darwin Opus model surpasses the prior one's GPQA Diamond score while maintaining the same zero-training methodology.

2. **Hybrid-attention MoE preserves reasoning under recombination.** The Father's 75% Gated-DeltaNet + 25% Gated-Attention architecture, inherited intact, demonstrates robustness to tensor-level recombination, a notable result given that MoE expert routing is sensitive to weight perturbation.

3. **Stochastic retry closes the greedy gap.** The +15.2 percentage-point lift from Pass 1 (73.2%) to Pass 2 (88.4%) suggests that the Mother's Opus-distilled reasoning is consistently present but not always dominant under greedy decoding, a pattern characteristic of well-distilled chain-of-thought models.

---

## References

- Rein et al., *GPQA: A Graduate-Level Google-Proof Q&A Benchmark*, 2023. [dataset](https://huggingface.co/datasets/Idavidrein/gpqa)
- Qwen Team, *Qwen3.6 Technical Report*, 2026.

---

## Built By

**FINAL-Bench / VIDRAFT_LAB** – Darwin V7 evolutionary breeding engine.

- Father base weights by the Qwen Team.
- Mother by [@hesamation](https://huggingface.co/hesamation) (Claude Opus 4.6 as teacher).

---

## Citation

```bibtex
@misc{darwin-36b-opus,
  title   = {Darwin-36B-Opus: Darwin V7 Evolutionary Merge on Qwen3.6-35B-A3B},
  author  = {FINAL-Bench and VIDRAFT_LAB},
  year    = {2026},
  url     = {https://huggingface.co/FINAL-Bench/Darwin-36B-Opus},
  note    = {Qwen3.6-35B-A3B (Father) × Opus-distilled variant (Mother), Darwin V7 engine, 88.4% GPQA Diamond}
}
```