---
license: apache-2.0
language:
  - en
  - zh
  - ko
  - ja
  - multilingual
library_name: transformers
pipeline_tag: text-generation
tags:
  - darwin
  - darwin-reason
  - reasoning
  - advanced-reasoning
  - chain-of-thought
  - thinking
  - reasoning-trace-distillation
  - rtd
  - darwin-delphi
  - test-time-compute
  - qwen3.6
  - qwen
  - gpqa
  - benchmark
  - open-source
  - apache-2.0
  - proto-agi
  - vidraft
  - eval-results
base_model:
  - FINAL-Bench/Darwin-28B-Opus
base_model_relation: finetune
model-index:
  - name: Darwin-28B-REASON
    results:
      - task:
          type: text-generation
          name: Graduate-Level Reasoning
        dataset:
          type: Idavidrein/gpqa
          name: GPQA Diamond
          config: gpqa_diamond
          split: train
        metrics:
          - type: accuracy
            value: 89.39
            name: Accuracy (with Darwin-DELPHI)
            verified: false
---

# Darwin-28B-REASON: Reasoning-Trace Distilled, Darwin-DELPHI Enhanced

<p align="center">
  <a href="https://huggingface.co/FINAL-Bench/Darwin-28B-REASON"><img src="https://img.shields.io/badge/⭐_GPQA_Diamond-89.39%25_Darwin--28B--REASON-gold?style=for-the-badge" alt="GPQA"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-28B-Opus"><img src="https://img.shields.io/badge/🧬_Base-Darwin--28B--Opus_(88.89%25)-blue?style=for-the-badge" alt="Opus"></a>
</p>

<p align="center">
  <a href="https://huggingface.co/FINAL-Bench/Darwin-36B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--36B--Opus_(88.4%25)-blue?style=for-the-badge" alt="36B"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-27B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--27B--Opus_(86.9%25)-blue?style=for-the-badge" alt="27B"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-NEG"><img src="https://img.shields.io/badge/⚑_Model-Darwin--9B--NEG_(84.3%25)-purple?style=for-the-badge" alt="NEG"></a>
</p>

<p align="center">
  <a href="https://huggingface.co/collections/FINAL-Bench/darwin-family"><img src="https://img.shields.io/badge/🏠_Darwin_Family-Collection-green?style=for-the-badge" alt="Family"></a>
  <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/🏆_FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
</p>

> Full standalone reasoning model derived from Darwin-28B-Opus · Reasoning-Trace Distillation (RTD) · Darwin-DELPHI test-time engine · 27.6 B · BF16 · Apache 2.0
> **GPQA Diamond: 89.39 % with Darwin-DELPHI**

---

## Overview

**Darwin-28B-REASON** is a reasoning-enhanced **standalone model** derived from **[Darwin-28B-Opus](https://huggingface.co/FINAL-Bench/Darwin-28B-Opus)**. It combines two components:

1. **Reasoning-Trace Distillation (RTD)**: a distillation stage applied on top of the Darwin-28B-Opus base, producing this fully self-contained model (full weights, no external adapter required).
2. **Darwin-DELPHI**: a proprietary test-time reasoning engine.

Together they push graduate-level scientific reasoning to the top tier of the Darwin family: **89.39 %** on GPQA Diamond with Darwin-DELPHI. The model is released under **Apache-2.0**.

---

## 🧬 Darwin Platform & Research

**Darwin** is VIDRAFT's measuring-result-driven Korean reasoning model family: approximately **20 official models** plus **400+ community derivatives**, ranking **#3 globally on GPQA** among open models. The base model, **Darwin-28B-Opus**, is the HuggingFace-official **GPQA #3 (88.89 %)** model.

- **Platform technique**: MRI trust-weighted Evolutionary Merge ([arXiv:2605.14386](https://arxiv.org/abs/2605.14386)).
- **FINAL Bench**: VIDRAFT's evaluation framework (SSRN): MetaCognition **+14.05**, MA-ER Gap **0.392**.
- **4-layer Pre-AGI roadmap**: Darwin → AETHER → PROMETHEUS → HEPHAESTUS.

---

## 🧬 Model Lineage

| Role | Model | Contribution |
|:---:|:---|:---|
| **Base** | [`FINAL-Bench/Darwin-28B-Opus`](https://huggingface.co/FINAL-Bench/Darwin-28B-Opus) | GPQA #3 (88.89 %) Qwen3.6-generation reasoning backbone. |
| **RTD training** | reasoning-trace distillation | Distills complete reasoning chains into the model on top of the Opus base. |
| **Test-time engine** | Darwin-DELPHI | Proprietary inference-time consensus engine (not stored in weights). |
| **Result** | **`Darwin-28B-REASON`** (this model) | Full standalone RTD model + Darwin-DELPHI → **89.39 %** GPQA Diamond. |

---

## ⚙️ Technical Specifications

| Component | Value |
|:---|:---|
| Architecture | `Qwen3_5ForConditionalGeneration` (Qwen3.6 generation, hybrid linear + full attention; text path, `language_model_only`) |
| Parameters | **27.6 B** (BF16), full standalone weights |
| Layers | 64 (3 linear : 1 full attention, `full_attention_interval = 4`) |
| Vocab size | 248 320 |
| Context length | 262 144 (long-chain reasoning supported) |
| Delivery | Full self-contained model; no external base or adapter required |
| Precision | bfloat16 |
| License | Apache 2.0 |
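
The 3 : 1 layer ratio in the table can be made concrete with a short sketch. It assumes that `full_attention_interval = 4` means every fourth layer (counting from 1) uses full attention and the rest use linear attention; the exact placement is an assumption, since the card only states the ratio. The function and label names are illustrative.

```python
# Illustrative sketch: assumes every 4th layer (1-indexed) is a full-attention
# layer. The model card states only the 3:1 ratio, not the exact placement.
NUM_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4


def layer_types(num_layers: int, interval: int) -> list[str]:
    """Label each layer as linear or full attention under the assumed pattern."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = layer_types(NUM_LAYERS, FULL_ATTENTION_INTERVAL)
num_full = types.count("full_attention")
num_linear = types.count("linear_attention")
print(num_linear, num_full)  # 48 16
```

Under this assumption, 48 of the 64 layers are linear-attention layers and 16 are full-attention layers.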

---

## 🔬 Core Techniques

### ① RTD: Reasoning-Trace Distillation

RTD distills **complete reasoning chains** from a publicly available mathematical corpus (Apache-2.0 source) on top of the Darwin-28B-Opus base, producing this standalone model. It strengthens long-form, multi-step scientific reasoning while preserving the base model's bilingual capability.

> The full RTD recipe (curation, trace selection, training schedule) is **proprietary** and is not disclosed.

### ② Darwin-DELPHI: Test-Time Reasoning Engine

**Darwin-DELPHI** is a proprietary test-time engine applied at inference. It performs **multi-sample cross-validation**, **re-examination of uncertain responses**, and **iterative self-critique**, converging to a **consensus** answer through a single-agent Delphi-method procedure.

> Darwin-DELPHI is **not stored in the model weights**. Its internal parameters (sampling counts, stage transitions, and decision thresholds) are a **trade secret** and are not published.
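
Because Darwin-DELPHI itself is undisclosed, the sketch below is **not** its implementation. It only illustrates the general idea of reaching a consensus answer from several samples, using plain majority voting (often called self-consistency). The function name and the sampled answers are hypothetical.

```python
# Generic majority-vote sketch. This is NOT Darwin-DELPHI, whose procedure is
# proprietary; it only illustrates multi-sample consensus in its simplest form.
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent answer and the fraction of samples agreeing."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical multiple-choice answers from five independent samples.
samples = ["B", "C", "B", "B", "A"]
answer, agreement = majority_vote(samples)
print(answer, agreement)  # B 0.6
```

A low agreement fraction is one plausible signal for re-examining a response, but how Darwin-DELPHI decides this is not published.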

---

## 🏆 Benchmark: GPQA Diamond (198 questions)

GPQA Diamond is a 198-question, PhD-level graduate science reasoning benchmark.

| Model | Engine | **Accuracy** |
|:---|:---|:---:|
| Darwin-28B-Opus (base) | Standard | 88.89 % (176 / 198) |
| **Darwin-28B-REASON** | **Darwin-DELPHI** | **🥇 89.39 % (177 / 198)** |

The evaluation methodology for the Darwin-DELPHI result is **protected**; sample counts, staging, and thresholds are a **trade secret**.
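
The percentages in the table follow directly from the correct-answer counts. A minimal check, using only the figures reported above (the helper name is illustrative):

```python
# Recompute the reported accuracies from the counts in the table above.
TOTAL_QUESTIONS = 198


def accuracy_pct(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


base_acc = accuracy_pct(176)    # Darwin-28B-Opus
reason_acc = accuracy_pct(177)  # Darwin-28B-REASON with Darwin-DELPHI
print(base_acc, reason_acc)     # 88.89 89.39
```

The two results differ by one question out of 198.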

---

## 🚀 Usage

Darwin-28B-REASON is a **full standalone model**: load it directly, no base model or adapter merge required.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL = "FINAL-Bench/Darwin-28B-REASON"

# Load the tokenizer and the full standalone weights in bfloat16.
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # place the weights on the available device(s)
    trust_remote_code=True,
)
model.eval()

messages = [
    {"role": "user",
     "content": "A particle moves along x(t) = t³ − 6t² + 9t. Find when it is at rest and classify the motion."}
]
# Render the chat template to text, then tokenize it.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```

> The 89.39 % GPQA Diamond result is produced with the Darwin-DELPHI test-time engine applied on top of this model. Darwin-DELPHI is provided through the Darwin-series evaluation harness.

---

## 🎯 Recommended Use-Cases

- **Graduate-level STEM reasoning** (GPQA / science qualifying exams)
- **Mathematical problem solving** (MATH, AIME-style problems)
- **Complex multi-step chain-of-thought tasks**
- **Code generation and debugging**
- **Bilingual reasoning** (strong English + Korean; also Chinese / Japanese)

## ⚠️ Limitations

- The 27.6 B model in bfloat16 requires ≈ 55 GB of VRAM (a single A100-80GB or B200 is sufficient).
- The 89.39 % result depends on the Darwin-DELPHI test-time engine; the model on its own delivers strong but lower single-model accuracy.
- Optimised for English first, with secondary support for Korean, Chinese, and Japanese.
- Reasoning traces tend to be verbose; control with `max_new_tokens` as needed.
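
The ≈ 55 GB figure in the first bullet matches a back-of-the-envelope estimate of the weight footprint alone, at 2 bytes per bfloat16 parameter. The sketch below covers weights only; the KV cache and activations add to this, by an amount that depends on context length and batch size.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Covers weights only; KV cache and activations are not included.
PARAMS = 27.6e9       # parameter count from the specification table
BYTES_PER_PARAM = 2   # bfloat16 uses 2 bytes per value

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9       # decimal gigabytes
weights_gib = PARAMS * BYTES_PER_PARAM / 2**30    # binary gibibytes
print(f"{weights_gb:.1f} GB, {weights_gib:.1f} GiB")  # 55.2 GB, 51.4 GiB
```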

---

## 📚 Citation

```bibtex
@misc{darwin28b_reason_2026,
  title  = {Darwin-28B-REASON: Reasoning-Trace Distillation and Darwin-DELPHI Test-Time Reasoning on Darwin-28B-Opus},
  author = {FINAL-Bench / Darwin Research Team},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-28B-REASON}},
  note   = {RTD + Darwin-DELPHI · 89.39 % GPQA Diamond}
}

@misc{darwin_family_2026,
  title  = {Darwin Family: MRI Trust-Weighted Evolutionary Merging for Reasoning Models},
  author = {VIDRAFT / FINAL-Bench},
  year   = {2026},
  howpublished = {\url{https://arxiv.org/abs/2605.14386}}
}

@misc{final_bench_2026,
  title  = {FINAL Bench: A Measuring-Result-Driven Evaluation Framework for Reasoning Models},
  author = {VIDRAFT / FINAL-Bench},
  year   = {2026},
  howpublished = {SSRN}
}
```

---

## 🔗 Related Darwin Models

- **Darwin-28B-Opus**: base model, Qwen3.6-27B × Opus distilled, GPQA 88.89 %
- **Darwin-36B-Opus**: MoE 36B, GPQA 88.4 %
- **Darwin-27B-Opus**: 27B dense (Qwen3.5 generation), GPQA 86.9 %
- **Darwin-9B-NEG**: 9B with Negentropy distillation, GPQA 84.3 %
- **Darwin-4B-Genesis**: smallest Darwin member

---

This model is introduced in [Darwin Family](https://arxiv.org/abs/2605.14386).

*Darwin-28B-REASON · RTD + Darwin-DELPHI · 89.39 % GPQA Diamond · FINAL-Bench*