SeaWolf-AI committed
Commit f0bccf5 · verified · 1 Parent(s): ae4ecca

Modernize card to Darwin family standard: canonical GPQA 85.9 (Darwin-DELPHI), add model-index, remove stale 66pct/50Q contradiction, trade-secret-safe; merge/MDS/genome preserved

Files changed (1)
  1. README.md +284 -256
README.md CHANGED
@@ -1,257 +1,285 @@
1
- ---
2
- license: apache-2.0
3
- base_model:
4
- - google/gemma-4-31B-it
5
- - TeichAI/gemma-4-31B-it-Claude-Opus-Distill
6
- tags:
7
- - darwin-v6
8
- - evolutionary-merge
9
- - mri-guided
10
- - dare-ties
11
- - gemma4
12
- - reasoning
13
- - thinking
14
- - proto-agi
15
- - vidraft
16
- language:
17
- - en
18
- - ko
19
- - ja
20
- - zh
21
- - multilingual
22
- pipeline_tag: text-generation
23
- library_name: transformers
24
- ---
25
-
26
- # Darwin-31B-Opus
27
-
28
- <p align="center">
29
- <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Opus"><img src="https://img.shields.io/badge/🧬_Gen1-Darwin--4B--Opus-blue?style=for-the-badge" alt="Gen1"></a>
30
- <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-David"><img src="https://img.shields.io/badge/🧬_Gen2-Darwin--4B--David-blue?style=for-the-badge" alt="Gen2"></a>
31
- <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Genesis"><img src="https://img.shields.io/badge/⭐_Gen3-Darwin--4B--Genesis-gold?style=for-the-badge" alt="Gen3"></a>
32
- </p>
33
-
34
- <p align="center">
35
- <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="9B"></a>
36
- <a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/🚀_Space-9B_Demo-purple?style=for-the-badge" alt="9B Space"></a>
37
- <a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--31B--Opus-blue?style=for-the-badge" alt="31B"></a>
38
- <a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/🚀_Space-31B_Demo-purple?style=for-the-badge" alt="31B Space"></a>
39
- </p>
40
-
41
- <p align="center">
42
- <a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--35B--A3B--Opus-blue?style=for-the-badge" alt="35B"></a>
43
- <a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/🚀_Space-35B_Demo-purple?style=for-the-badge" alt="35B Space"></a>
44
- <a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF"><img src="https://img.shields.io/badge/📦_GGUF-Q8--Official-yellow?style=for-the-badge" alt="Q8 GGUF"></a>
45
- <a href="https://huggingface.co/bartowski/FINAL-Bench_Darwin-35B-A3B-Opus-GGUF"><img src="https://img.shields.io/badge/📦_GGUF-bartowski-yellow?style=for-the-badge" alt="bartowski GGUF"></a>
46
- </p>
47
-
48
- <p align="center">
49
- <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/🏆_FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
50
- <a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard"><img src="https://img.shields.io/badge/📊_ALL_Bench-Leaderboard-orange?style=for-the-badge" alt="ALL Bench"></a>
51
- </p>
52
-
53
- > Gemma 4 Dense 31B | Thinking Mode | 256K Context | 140+ Languages | BF16 | Apache 2.0
54
-
55
- ---
56
-
57
- ## Overview
58
-
59
- Darwin-31B-Opus is a reasoning-enhanced model created by merging google/gemma-4-31B-it (Father) and TeichAI/gemma-4-31B-it-Claude-Opus-Distill (Mother) using the Darwin V6 engine.
60
-
61
- Darwin V6 diagnoses both parent models at the tensor level before merging, assigning an independent optimal ratio to each of the 1,188 tensors. This is fundamentally different from conventional merging tools that apply a single uniform ratio across all tensors.
62
-
63
- ---
64
-
65
- ## Parent Models
66
-
67
- | Role | Model | Characteristics |
68
- |---|---|---|
69
- | Father | google/gemma-4-31B-it | Gemma 4 Dense 31B, multimodal, 256K context, LMArena 1452 (open model #3) |
70
- | Mother | TeichAI/gemma-4-31B-it-Claude-Opus-Distill | Claude 4.6 Opus high-effort reasoning distillation, code/science/analysis |
71
-
72
- ### Model Diagnostic Scan (MDS)
73
-
74
- <p align="center">
75
- <img src="s1.png" alt="Father (gemma-4-31B-it) MDS Scan" width="48%">
76
- <img src="s2.png" alt="Mother (Claude-Opus-Distill) MDS Scan" width="48%">
77
- </p>
78
-
79
- Left: Father (gemma-4-31B-it), a balanced generalist with low activation across most probes. Right: Mother (Claude-Opus-Distill), with strong REASONING concentration in L50-L60, CODE activation in late layers, and KOREAN at the start and end. The Mother shows significantly more specialized layer patterns from Claude Opus distillation.
80
-
81
- ---
82
-
83
- ## Benchmarks
84
-
85
- | Benchmark | Darwin-31B-Opus | Father (gemma-4-31B-it) | Condition |
86
- |---|---|---|---|
87
- | ARC-Challenge | 82.89% | - | loglikelihood, zero-shot, 200Q |
88
- | GPQA Diamond | 66.0% | 60.0% | generative thinking mode, greedy, 50Q |
89
-
90
- GPQA Diamond was evaluated under identical conditions for both models: same 50 questions, same seed (i+42), same prompt template, greedy decoding (do_sample=False), max_new_tokens=2048, enable_thinking=True. Darwin-31B-Opus achieved a 10% relative improvement over the Father model.
91
-
92
- Note: Gemma 4 architecture (Gemma4ForConditionalGeneration) has limited compatibility with lm-eval's loglikelihood method due to its multimodal wrapper structure. Only generative evaluation produces valid results for Gemma 4 based models. Full 198-question evaluation with Majority Voting is planned.
93
-
94
- ---
95
-
96
- ## Darwin V6 vs Conventional Merging
97
-
98
- | Capability | mergekit (DARE-TIES) | Darwin V6 |
99
- |---|---|---|
100
- | Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
101
- | Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (1,188 independent ratios) |
102
- | Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
103
- | Ratio formula | Human-set or grid search | combined = static × 0.4 + probe × 0.6, then evolutionary optimization |
104
- | Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
105
- | Post-merge validation | Benchmark score only | Layer-by-layer Health Check: child vs both parents, interference and function loss detection |
106
- | Search method | Manual tuning | CMA-ES evolution with adaptive 14-dimensional genome |
107
- | Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
108
- | GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds) → Phase 2 real merge (top-k only evaluated) |
109
-
110
- ---
111
-
112
- ## How Darwin V6 Works
113
-
114
- Darwin V6 does not use mergekit or any external merge library. It re-implements DARE-TIES (Yadav et al., 2023) directly via PyTorch tensor operations with per-tensor diagnostic ratios.
115
-
116
- Before merging, Darwin performs a Model Diagnostic Scan (MDS) on both parents. For every tensor, it measures Shannon entropy (information density), standard deviation (activation spread), and L2 norm (energy). Additionally, 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) are passed through the model, measuring cosine distance when each layer is skipped to determine functional importance.
117
-
118
- The final merge ratio for each tensor:
119
-
120
- ```
121
- static_score = entropy × 0.3 + std × 0.2 + clamp(norm, 100) × 0.002
122
- probe_score = Σ(cosine_distance[probe_i] × weight_i)
123
- combined = static × 0.4 + probe × 0.6
124
- mri_ratio = combined_b / (combined_a + combined_b)
125
- final_ratio = mri_ratio × mri_trust + genome_ratio × (1 - mri_trust)
126
- ```
127
-
128
- The mri_trust parameter itself is optimized by the CMA-ES evolutionary algorithm, allowing the system to automatically determine the optimal balance between diagnostic prescription and evolutionary search for each model pair.
129
-
130
- After merging, a Health Check compares the child model against both parents layer-by-layer, detecting interference (child importance >> parent max) or function loss (parent importance high but child dropped).
131
-
132
- ### Parent Comparison (MDS Result)
133
-
134
- <p align="center">
135
- <img src="parent_comparison.png" alt="Parent Comparison Layer-wise Importance" width="100%">
136
- </p>
137
-
138
- ---
139
-
140
- ## Evolution Result
141
-
142
- | | |
143
- |---|---|
144
- | Best Score (ARC-Challenge) | 0.8289 |
145
- | Merge Method | DARE-TIES (direct PyTorch) |
146
- | Tensors Merged | 1,188 |
147
- | Health Check | healthy |
148
- | Phase 2 Steps | 4 (early stop, patience=5) |
149
- | Total Time | 134 min |
150
- | Infrastructure | 4 x NVIDIA H100 NVL (100GB) |
151
-
152
- Optimal Genome (14-dimensional adaptive):
153
-
154
- ```
155
- global_ratio: 0.5147 (overall merge ratio)
156
- attn_ratio: 0.3169 (Attention layers, Father dominant)
157
- ffn_ratio: 0.9316 (FFN layers — Mother dominant)
158
- embed_ratio: 0.7748 (Embedding)
159
- density_a: 0.8997 (Father DARE density)
160
- density_b: 0.9539 (Mother DARE density)
161
- block_0_ratio: 0.6628 (L0-L9)
162
- block_1_ratio: 0.6431 (L10-L19)
163
- block_2_ratio: 0.5146 (L20-L29, balanced)
164
- block_3_ratio: 0.5971 (L30-L39)
165
- block_4_ratio: 0.6339 (L40-L49)
166
- block_5_ratio: 0.8583 (L50-L59, reasoning core — Mother dominant)
167
- mri_trust: 0.3631 (MDS 36% + Genome 64%)
168
- merge_method_weight: 0.6897
169
- ```
170
-
171
- Key observations from the genome: ffn_ratio=0.93 indicates the FFN layers strongly favor the Mother (Claude Opus Distill), and block_5 (L50-L59)=0.86 shows the reasoning core layers also favor Mother. This aligns with the MDS heatmap pattern where Mother's reasoning capability concentrated in the final layers. Meanwhile, attn_ratio=0.32 preserves Father's attention structure, maintaining the original Gemma 4 multimodal and long-context capabilities.
172
-
173
- ---
174
-
175
- ## Model Specifications
176
-
177
- | | |
178
- |---|---|
179
- | Architecture | Gemma 4 Dense (Hybrid Attention: Sliding Window + Global) |
180
- | Parameters | 31B |
181
- | Precision | BF16 |
182
- | Context | 256,072 |
183
- | Languages | 140+ |
184
- | Thinking | enable_thinking=True chain-of-thought |
185
- | License | Apache 2.0 |
186
-
187
- ---
188
-
189
- ## Usage
190
-
191
- ### Transformers
192
-
193
- ```python
194
- from transformers import AutoTokenizer, AutoModelForCausalLM
195
- import torch
196
-
197
- tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-31B-Opus", trust_remote_code=True)
198
- model = AutoModelForCausalLM.from_pretrained(
199
- "FINAL-Bench/Darwin-31B-Opus",
200
- torch_dtype=torch.bfloat16,
201
- device_map="auto",
202
- trust_remote_code=True,
203
- )
204
-
205
- messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
206
- text = tokenizer.apply_chat_template(
207
- messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
208
- )
209
- inputs = tokenizer(text, return_tensors="pt").to(model.device)
210
- outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
211
- print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
212
- ```
213
-
214
- ---
215
-
216
- ## VRAM Requirements
217
-
218
- | Setup | VRAM | Status |
219
- |---|---|---|
220
- | BF16 Full Precision | ~62 GB | |
221
- | NVIDIA H100 80GB | 80 GB | Single GPU |
222
- | NVIDIA A100 80GB x 2 | 160 GB | Comfortable |
223
- | NVIDIA RTX 4090 24GB x 4 | 96 GB | device_map=auto |
224
-
225
- ---
226
-
227
- ## References
228
-
229
- - DARE-TIES: Yadav et al., 2023 (https://arxiv.org/abs/2311.03099) — re-implemented, not library-dependent
230
- - Darwin V6 Engine: https://huggingface.co/spaces/ginigen-ai/DARWIN-V5-BACKUP
231
- - FINAL Bench: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
232
-
233
- ---
234
-
235
- ## Built By
236
-
237
- | | |
238
- |---|---|
239
- | Developer | VIDRAFT |
240
- | Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
241
- | Architecture | Gemma-4-31B |
242
- | License | Apache 2.0 |
243
-
244
- ---
245
-
246
- ## Citation
247
-
248
- ```bibtex
249
- @misc{vidraft_darwin_31b_opus,
250
- title = {Darwin-31B-Opus: Diagnostic-Guided Evolutionary Merge on Gemma 4},
251
- author = {VIDRAFT},
252
- year = {2026},
253
- publisher = {Hugging Face},
254
- howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-31B-Opus}}
255
- }
256
- ```
257
  This model is introduced in [Darwin Family](https://arxiv.org/abs/2605.14386).
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model:
4
+ - google/gemma-4-31B-it
5
+ - TeichAI/gemma-4-31B-it-Claude-Opus-Distill
6
+ tags:
7
+ - darwin
8
+ - darwin-v6
9
+ - evolutionary-merge
10
+ - mri-guided
11
+ - dare-ties
12
+ - gemma4
13
+ - reasoning
14
+ - thinking
15
+ - darwin-delphi
16
+ - gpqa
17
+ - benchmark
18
+ - eval-results
19
+ - apache-2.0
20
+ - proto-agi
21
+ - vidraft
22
+ language:
23
+ - en
24
+ - ko
25
+ - ja
26
+ - zh
27
+ - multilingual
28
+ pipeline_tag: text-generation
29
+ library_name: transformers
30
+ model-index:
31
+ - name: Darwin-31B-Opus
32
+ results:
33
+ - task:
34
+ type: text-generation
35
+ name: Graduate-Level Reasoning
36
+ dataset:
37
+ type: Idavidrein/gpqa
38
+ name: GPQA Diamond
39
+ config: gpqa_diamond
40
+ split: train
41
+ metrics:
42
+ - type: accuracy
43
+ value: 85.9
44
+ name: Accuracy (with Darwin-DELPHI)
45
+ verified: false
46
+ ---
47
+
48
+ # Darwin-31B-Opus
49
+
50
+ <p align="center">
51
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/⭐_GPQA_Diamond-85.9%25_with_Darwin--DELPHI-gold?style=for-the-badge" alt="GPQA"></a>
52
+ </p>
53
+
54
+ <p align="center">
55
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Opus"><img src="https://img.shields.io/badge/🧬_Gen1-Darwin--4B--Opus-blue?style=for-the-badge" alt="Gen1"></a>
56
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-David"><img src="https://img.shields.io/badge/🧬_Gen2-Darwin--4B--David-blue?style=for-the-badge" alt="Gen2"></a>
57
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Genesis"><img src="https://img.shields.io/badge/⭐_Gen3-Darwin--4B--Genesis-gold?style=for-the-badge" alt="Gen3"></a>
58
+ </p>
59
+
60
+ <p align="center">
61
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="9B"></a>
62
+ <a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/🚀_Space-9B_Demo-purple?style=for-the-badge" alt="9B Space"></a>
63
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--31B--Opus-blue?style=for-the-badge" alt="31B"></a>
64
+ <a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/🚀_Space-31B_Demo-purple?style=for-the-badge" alt="31B Space"></a>
65
+ </p>
66
+
67
+ <p align="center">
68
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--35B--A3B--Opus-blue?style=for-the-badge" alt="35B"></a>
69
+ <a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/🚀_Space-35B_Demo-purple?style=for-the-badge" alt="35B Space"></a>
70
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF"><img src="https://img.shields.io/badge/📦_GGUF-Q8--Official-yellow?style=for-the-badge" alt="Q8 GGUF"></a>
71
+ <a href="https://huggingface.co/bartowski/FINAL-Bench_Darwin-35B-A3B-Opus-GGUF"><img src="https://img.shields.io/badge/📦_GGUF-bartowski-yellow?style=for-the-badge" alt="bartowski GGUF"></a>
72
+ </p>
73
+
74
+ <p align="center">
75
+ <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/🏆_FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
76
+ <a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard"><img src="https://img.shields.io/badge/📊_ALL_Bench-Leaderboard-orange?style=for-the-badge" alt="ALL Bench"></a>
77
+ </p>
78
+
79
+ > Gemma 4 Dense 31B | Thinking Mode | 256K Context | 140+ Languages | BF16 | Apache 2.0
80
+
81
+ ---
82
+
83
+ ## Overview
84
+
85
+ Darwin-31B-Opus is a reasoning-enhanced model created by merging google/gemma-4-31B-it (Father) and TeichAI/gemma-4-31B-it-Claude-Opus-Distill (Mother) using the Darwin V6 engine.
86
+
87
+ Darwin V6 diagnoses both parent models at the tensor level before merging, assigning an independent optimal ratio to each of the 1,188 tensors. This is fundamentally different from conventional merging tools that apply a single uniform ratio across all tensors.
88
+
89
+ ---
90
+
91
+ ## Parent Models
92
+
93
+ | Role | Model | Characteristics |
94
+ |---|---|---|
95
+ | Father | google/gemma-4-31B-it | Gemma 4 Dense 31B, multimodal, 256K context, LMArena 1452 (open model #3) |
96
+ | Mother | TeichAI/gemma-4-31B-it-Claude-Opus-Distill | Claude 4.6 Opus high-effort reasoning distillation, code/science/analysis |
97
+
98
+ ### Model Diagnostic Scan (MDS)
99
+
100
+ <p align="center">
101
+ <img src="s1.png" alt="Father (gemma-4-31B-it) MDS Scan" width="48%">
102
+ <img src="s2.png" alt="Mother (Claude-Opus-Distill) MDS Scan" width="48%">
103
+ </p>
104
+
105
+ Left: Father (gemma-4-31B-it), a balanced generalist with low activation across most probes. Right: Mother (Claude-Opus-Distill), with strong REASONING concentration in L50-L60, CODE activation in late layers, and KOREAN at the start and end. The Mother shows significantly more specialized layer patterns from Claude Opus distillation.
106
+
107
+ ---
108
+
109
+ ## 🏆 Benchmark — GPQA Diamond (198 questions)
110
+
111
+ GPQA Diamond is a 198-question, PhD-level graduate science reasoning benchmark.
112
+
113
+ | Benchmark | Darwin-31B-Opus | Engine |
114
+ |---|---|---|
115
+ | **GPQA Diamond** | **🥇 85.9%** | Darwin-DELPHI test-time engine |
116
+ | ARC-Challenge | 82.89% | evolutionary-selection metric (loglikelihood, 0-shot, 200Q) |
117
+
118
+ The 85.9% GPQA Diamond result is produced with the **Darwin-DELPHI** test-time reasoning engine applied on top of this model. The evaluation methodology is **protected**; sample counts, staging, and thresholds are a **trade secret**. The ARC-Challenge score of 82.89% is the internal evolutionary-selection score used during the Darwin V6 merge search.
119
+
120
+ > Note: the Gemma 4 architecture (`Gemma4ForConditionalGeneration`) has a multimodal wrapper that limits `lm-eval` loglikelihood compatibility; generative evaluation is the valid path for Gemma 4 based models, and Darwin-DELPHI evaluates generatively accordingly.
121
+
122
+ ---
123
+
124
+ ## Darwin V6 vs Conventional Merging
125
+
126
+ | Capability | mergekit (DARE-TIES) | Darwin V6 |
127
+ |---|---|---|
128
+ | Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
129
+ | Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (1,188 independent ratios) |
130
+ | Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
131
+ | Ratio formula | Human-set or grid search | combined = static × 0.4 + probe × 0.6, then evolutionary optimization |
132
+ | Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
133
+ | Post-merge validation | Benchmark score only | Layer-by-layer Health Check: child vs both parents, interference and function loss detection |
134
+ | Search method | Manual tuning | CMA-ES evolution with adaptive 14-dimensional genome |
135
+ | Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
136
+ | GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds) → Phase 2 real merge (top-k only evaluated) |
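
The Transplant row above can be read as a threshold rule on the per-tensor ratio. The following is a minimal sketch, assuming the ratio is the Mother's weight and that the middle range can be illustrated with plain linear interpolation (the card describes a DARE-TIES merge there, which this does not reproduce). The function `merge_values` and its parameter names are hypothetical and only illustrate the thresholds quoted in the table.

```python
def merge_values(father, mother, ratio, low=0.15, high=0.85):
    """Illustrative transplant rule on flat lists of weights.

    ratio is assumed to be the Mother's share of the merged tensor.
    Below `low` the Father tensor is copied unchanged, above `high`
    the Mother tensor is copied unchanged, so no interpolation occurs.
    """
    if len(father) != len(mother):
        raise ValueError("parent tensors must have the same length")
    if ratio < low:
        return list(father)   # Father 100%
    if ratio > high:
        return list(mother)   # Mother 100%
    # Placeholder for the real merge: simple linear interpolation.
    return [(1.0 - ratio) * f + ratio * m for f, m in zip(father, mother)]


father = [1.0, 2.0]
mother = [3.0, 4.0]
print(merge_values(father, mother, 0.10))  # copies Father
print(merge_values(father, mother, 0.90))  # copies Mother
print(merge_values(father, mother, 0.50))  # blends both parents
```

Copying a parent outright at the extremes avoids blending a tensor that the diagnostic already attributes almost entirely to one side, which is what the table calls zero interpolation noise.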
137
+
138
+ ---
139
+
140
+ ## How Darwin V6 Works
141
+
142
+ Darwin V6 does not use mergekit or any external merge library. It re-implements DARE-TIES (DARE: Yu et al., 2023; TIES: Yadav et al., 2023) directly via PyTorch tensor operations with per-tensor diagnostic ratios.
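
For readers unfamiliar with the DARE half of DARE-TIES, the drop-and-rescale step can be sketched as below. This is a generic illustration in plain Python rather than the Darwin V6 code: it assumes a density value is the probability of keeping each delta entry (delta meaning fine-tuned weight minus base weight), and it omits the TIES sign-election step entirely. The name `dare` is hypothetical.

```python
import random


def dare(delta, density, rng):
    """Drop-and-rescale on a flat list of delta weights.

    Each entry is kept with probability `density`; survivors are
    divided by `density` so the expected value of every entry is
    unchanged. Dropped entries become 0.0.
    """
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    return [d / density if rng.random() < density else 0.0 for d in delta]


rng = random.Random(42)          # fixed seed so the mask is repeatable
delta = [0.2, -0.1, 0.4, 0.05]
print(dare(delta, 0.9, rng))     # most entries survive, scaled by 1/0.9
```

Seeding the random generator makes the drop mask repeatable, which is the property a seed-based reproducibility claim would rely on.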
143
+
144
+ Before merging, Darwin performs a Model Diagnostic Scan (MDS) on both parents. For every tensor, it measures Shannon entropy (information density), standard deviation (activation spread), and L2 norm (energy). Additionally, 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) are passed through the model, measuring cosine distance when each layer is skipped to determine functional importance.
145
+
146
+ The final merge ratio for each tensor:
147
+
148
+ ```
149
+ static_score = entropy × 0.3 + std × 0.2 + clamp(norm, 100) × 0.002
150
+ probe_score = Σ(cosine_distance[probe_i] × weight_i)
151
+ combined = static × 0.4 + probe × 0.6
152
+ mri_ratio = combined_b / (combined_a + combined_b)
153
+ final_ratio = mri_ratio × mri_trust + genome_ratio × (1 - mri_trust)
154
+ ```
155
+
156
+ The mri_trust parameter itself is optimized by the CMA-ES evolutionary algorithm, allowing the system to automatically determine the optimal balance between diagnostic prescription and evolutionary search for each model pair.
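
The formulas above can be transcribed into a few lines of Python. This is a sketch under stated assumptions: `clamp(norm, 100)` is read as capping the norm at 100, the probe weights are supplied by the caller, and which genome entry serves as `genome_ratio` for a given tensor is not specified in the card, so it is passed in explicitly. All function names are hypothetical.

```python
def static_score(entropy, std, norm):
    # clamp(norm, 100) is assumed to mean "cap the norm at 100".
    return entropy * 0.3 + std * 0.2 + min(norm, 100.0) * 0.002


def probe_score(cosine_distances, weights):
    # Weighted sum of per-probe cosine distances.
    return sum(d * w for d, w in zip(cosine_distances, weights))


def combined_score(static, probe):
    return static * 0.4 + probe * 0.6


def final_ratio(combined_a, combined_b, genome_ratio, mri_trust):
    # mri_ratio is the Mother's (b) share of the combined importance.
    mri_ratio = combined_b / (combined_a + combined_b)
    return mri_ratio * mri_trust + genome_ratio * (1.0 - mri_trust)


# Hypothetical tensor where both parents score equally, blended with the
# published global_ratio (0.5147) and mri_trust (0.3631).
print(final_ratio(1.0, 1.0, genome_ratio=0.5147, mri_trust=0.3631))
```

With `mri_trust` near 0 the result collapses to the genome value, and near 1 it follows the diagnostic alone, so the parameter acts as a single dial between the two sources.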
157
+
158
+ After merging, a Health Check compares the child model against both parents layer-by-layer, detecting interference (child importance >> parent max) or function loss (parent importance high but child dropped).
159
+
160
+ ### Parent Comparison (MDS Result)
161
+
162
+ <p align="center">
163
+ <img src="parent_comparison.png" alt="Parent Comparison — Layer-wise Importance" width="100%">
164
+ </p>
165
+
166
+ ---
167
+
168
+ ## Evolution Result
169
+
170
+ | | |
171
+ |---|---|
172
+ | Best Score (ARC-Challenge) | 0.8289 |
173
+ | Merge Method | DARE-TIES (direct PyTorch) |
174
+ | Tensors Merged | 1,188 |
175
+ | Health Check | healthy |
176
+ | Phase 2 Steps | 4 (early stop, patience=5) |
177
+ | Total Time | 134 min |
178
+ | Infrastructure | 4 x NVIDIA H100 NVL (100GB) |
179
+
180
+ Optimal Genome (14-dimensional adaptive):
181
+
182
+ ```
183
+ global_ratio: 0.5147 (overall merge ratio)
184
+ attn_ratio: 0.3169 (Attention layers, Father dominant)
185
+ ffn_ratio: 0.9316 (FFN layers, Mother dominant)
186
+ embed_ratio: 0.7748 (Embedding)
187
+ density_a: 0.8997 (Father DARE density)
188
+ density_b: 0.9539 (Mother DARE density)
189
+ block_0_ratio: 0.6628 (L0-L9)
190
+ block_1_ratio: 0.6431 (L10-L19)
191
+ block_2_ratio: 0.5146 (L20-L29, balanced)
192
+ block_3_ratio: 0.5971 (L30-L39)
193
+ block_4_ratio: 0.6339 (L40-L49)
194
+ block_5_ratio: 0.8583 (L50-L59, reasoning core, Mother dominant)
195
+ mri_trust: 0.3631 (MDS 36% + Genome 64%)
196
+ merge_method_weight: 0.6897
197
+ ```
198
+
199
+ Key observations from the genome: ffn_ratio=0.93 indicates the FFN layers strongly favor the Mother (Claude Opus Distill), and block_5 (L50-L59)=0.86 shows the reasoning core layers also favor Mother. This aligns with the MDS heatmap pattern where Mother's reasoning capability concentrated in the final layers. Meanwhile, attn_ratio=0.32 preserves Father's attention structure, maintaining the original Gemma 4 multimodal and long-context capabilities.
200
+
201
+ ---
202
+
203
+ ## Model Specifications
204
+
205
+ | | |
206
+ |---|---|
207
+ | Architecture | Gemma 4 Dense (Hybrid Attention: Sliding Window + Global) |
208
+ | Parameters | 31B |
209
+ | Precision | BF16 |
210
+ | Context | 256,072 |
211
+ | Languages | 140+ |
212
+ | Thinking | enable_thinking=True chain-of-thought |
213
+ | License | Apache 2.0 |
214
+
215
+ ---
216
+
217
+ ## Usage
218
+
219
+ ### Transformers
220
+
221
+ ```python
222
+ from transformers import AutoTokenizer, AutoModelForCausalLM
223
+ import torch
224
+
225
+ tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-31B-Opus", trust_remote_code=True)
226
+ model = AutoModelForCausalLM.from_pretrained(
227
+ "FINAL-Bench/Darwin-31B-Opus",
228
+ torch_dtype=torch.bfloat16,
229
+ device_map="auto",
230
+ trust_remote_code=True,
231
+ )
232
+
233
+ messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
234
+ text = tokenizer.apply_chat_template(
235
+ messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
236
+ )
237
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
238
+ outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
239
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
240
+ ```
241
+
242
+ ---
243
+
244
+ ## VRAM Requirements
245
+
246
+ | Setup | VRAM | Status |
247
+ |---|---|---|
248
+ | BF16 Full Precision | ~62 GB | |
249
+ | NVIDIA H100 80GB | 80 GB | Single GPU |
250
+ | NVIDIA A100 80GB x 2 | 160 GB | Comfortable |
251
+ | NVIDIA RTX 4090 24GB x 4 | 96 GB | device_map=auto |
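
The ~62 GB figure in the table follows from the parameter count and the BF16 width of 2 bytes per parameter. A small sketch of that arithmetic, counting weights only (activations and the KV cache need additional memory, which grows with context length); the helper name `weight_memory_gb` is hypothetical:

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


bf16_gb = weight_memory_gb(31e9)   # 31B parameters at 2 bytes each
headroom_h100 = 80 - bf16_gb       # what a single 80 GB card has left
print(bf16_gb, headroom_h100)
```

The remaining headroom on a single 80 GB card is what has to hold activations and the KV cache, so long-context workloads may still call for one of the multi-GPU setups.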
252
+
253
+ ---
254
+
255
+ ## References
256
+
257
+ - DARE: Yu et al., 2023 (https://arxiv.org/abs/2311.03099); TIES: Yadav et al., 2023 (https://arxiv.org/abs/2306.01708); both re-implemented, not library-dependent
258
+ - Darwin V6 Engine: https://huggingface.co/spaces/ginigen-ai/DARWIN-V5-BACKUP
259
+ - FINAL Bench: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
260
+
261
+ ---
262
+
263
+ ## Built By
264
+
265
+ | | |
266
+ |---|---|
267
+ | Developer | VIDRAFT |
268
+ | Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
269
+ | Architecture | Gemma-4-31B |
270
+ | License | Apache 2.0 |
271
+
272
+ ---
273
+
274
+ ## Citation
275
+
276
+ ```bibtex
277
+ @misc{vidraft_darwin_31b_opus,
278
+ title = {Darwin-31B-Opus: Diagnostic-Guided Evolutionary Merge on Gemma 4},
279
+ author = {VIDRAFT},
280
+ year = {2026},
281
+ publisher = {Hugging Face},
282
+ howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-31B-Opus}}
283
+ }
284
+ ```
285
  This model is introduced in [Darwin Family](https://arxiv.org/abs/2605.14386).