SeaWolf-AI committed
Commit ef598d3 · verified · 1 Parent(s): dbdd6bc

Update README.md

Files changed (1):
  1. README.md +95 -52

README.md CHANGED
@@ -26,10 +26,19 @@ library_name: transformers
 # Darwin-31B-Opus
 
 <p align="center">
- <a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/Model-Darwin--31B--Opus-blue?style=for-the-badge" alt="Model"></a>
- <a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/Model-Darwin--35B--A3B--Opus-blue?style=for-the-badge" alt="35B Model"></a>
- <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
- <a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard"><img src="https://img.shields.io/badge/ALL_Bench-Leaderboard-orange?style=for-the-badge" alt="ALL Bench"></a>
 </p>
 
 > Gemma 4 Dense 31B | Thinking Mode | 256K Context | 140+ Languages | BF16 | Apache 2.0
@@ -38,9 +47,9 @@ library_name: transformers
 ## Overview
 
- Darwin-31B-Opus is a reasoning-enhanced model created by the Darwin V6 engine, using Google's Gemma-4-31B-it as Father and TeichAI's Claude Opus Distill as Mother.
 
- Darwin V6 diagnoses both parent models at the tensor level and computes an independent optimal merge ratio for each tensor. Unlike conventional merging methods that apply a uniform ratio across all tensors, Darwin V6 assigns a unique ratio to each of the 1,188 tensors, determined by the combination of MRI diagnostic results and evolutionary algorithm optimization.
 
 ---
 
@@ -49,53 +58,73 @@ Darwin V6 diagnoses both parent models at the tensor level and computes an indep
 | Role | Model | Characteristics |
 |---|---|---|
 | Father | google/gemma-4-31B-it | Gemma 4 Dense 31B, multimodal, 256K context, LMArena 1452 (open model #3) |
- | Mother | TeichAI/gemma-4-31B-it-Claude-Opus-Distill | Claude 4.6 Opus high-effort reasoning distillation, coding/science/analysis |
 
 ---
 
- ## Benchmark
 
 | Benchmark | Darwin-31B-Opus | Father (gemma-4-31B-it) | Condition |
 |---|---|---|---|
- | ARC-Challenge | 82.89% | - | loglikelihood, zero-shot, 200 questions |
 
- Note: Gemma 4 architecture (Gemma4ForConditionalGeneration) is a multimodal wrapper structure with limited compatibility with lm-eval's loglikelihood method. In generative evaluation (greedy, thinking mode), Darwin showed improvement over Father under identical conditions. Full GPQA Diamond 198-question evaluation with Majority Voting is scheduled.
 
 ---
 
- ## Model Specifications
 
- | | |
- |---|---|
- | Architecture | Gemma 4 Dense (Hybrid Attention: Sliding Window + Global) |
- | Total Parameters | 31B |
- | Precision | BF16 |
- | Context Length | 256,072 |
- | Languages | 140+ |
- | Thinking | enable_thinking=True chain-of-thought reasoning |
- | License | Apache 2.0 |
 
 ---
 
- ## How Darwin V6 Merges
 
- Darwin V6 does not use any external merge library such as mergekit. It re-implements the DARE-TIES algorithm (Yadav et al., 2023) directly via PyTorch tensor operations, with per-tensor diagnostic ratios as the key differentiator.
 
- Before merging, Darwin performs an MRI diagnostic on both parent models. For every tensor, it measures Shannon entropy (information density), standard deviation (activation spread), and L2 norm (energy). Additionally, 5 probing prompts (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) are passed through the model to measure each layer's functional importance via cosine distance when that layer is skipped.
 
- The final merge ratio for each tensor is determined by:
 
 ```
- static_score = entropy * 0.3 + std * 0.2 + clamp(norm, 100) * 0.002
- probe_score = sum(cosine_distance[probe_i] * weight_i)
- combined = static * 0.4 + probe * 0.6
 mri_ratio = combined_b / (combined_a + combined_b)
- final_ratio = mri_ratio * mri_trust + genome_ratio * (1 - mri_trust)
 ```
 
- mri_trust itself is optimized by the CMA-ES evolutionary algorithm. When the ratio is extreme (< 0.15 or > 0.85), the tensor is transplanted entirely from one parent without interpolation, preventing noise injection.
 
- After merging, a Health Check compares the child model against both parents layer by layer, automatically detecting interference or function loss.
 
 ---
 
 
@@ -103,34 +132,48 @@ After merging, a Health Check compares the child model against both parents laye
103
 
104
  | | |
105
  |---|---|
106
- | ARC-Challenge Best Score | 0.8289 |
107
- | Merge Method | DARE-TIES (direct PyTorch implementation) |
108
  | Tensors Merged | 1,188 |
109
  | Health Check | healthy |
110
  | Phase 2 Steps | 4 (early stop, patience=5) |
111
  | Total Time | 134 min |
112
  | Infrastructure | 4 x NVIDIA H100 NVL (100GB) |
113
 
114
- Optimal genome (14-dimensional adaptive):
 
 ```
- global_ratio: 0.5147 (overall merge ratio)
- attn_ratio: 0.3169 (Attention layers)
- ffn_ratio: 0.9316 (FFN layers, Mother dominant)
- embed_ratio: 0.7748 (Embedding)
- density_a: 0.8997 (Father DARE density)
- density_b: 0.9539 (Mother DARE density)
- block_0_ratio: 0.6628 (L0-L9)
- block_1_ratio: 0.6431 (L10-L19)
- block_2_ratio: 0.5146 (L20-L29)
- block_3_ratio: 0.5971 (L30-L39)
- block_4_ratio: 0.6339 (L40-L49)
- block_5_ratio: 0.8583 (L50-L59, reasoning core, Mother dominant)
- mri_trust: 0.3631 (MRI 36% + Genome 64%)
 merge_method_weight: 0.6897
 ```
 
- Notable: ffn_ratio=0.93 indicates FFN layers strongly favor the Mother (Claude Opus Distill), and block_5 (L50-L59) at 0.86 also favors the Mother. This is consistent with the MRI heatmap pattern showing that the Mother's reasoning capabilities are concentrated in the later layers.
 
 ---
 
@@ -168,14 +211,14 @@ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_
 | BF16 Full Precision | ~62 GB | |
 | NVIDIA H100 80GB | 80 GB | Single GPU |
 | NVIDIA A100 80GB x 2 | 160 GB | Comfortable |
- | NVIDIA RTX 4090 24GB x 4 | 96 GB | Possible (device_map=auto) |
 
 ---
 
 ## References
 
- - DARE-TIES algorithm: Yadav et al., 2023 (https://arxiv.org/abs/2311.03099); re-implemented, not library-dependent
- - Darwin V6 engine: https://huggingface.co/spaces/ginigen-ai/DARWIN-V5-BACKUP
 - FINAL Bench: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
 
 ---
@@ -185,8 +228,8 @@ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_
 | | |
 |---|---|
 | Developer | VIDRAFT |
- | Engine | Darwin V6 (Diagnostic-Guided Evolutionary Model Merge) |
- | Base Architecture | Gemma-4-31B |
 | License | Apache 2.0 |
 
 ---

 # Darwin-31B-Opus
 
 <p align="center">
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--31B--Opus-blue?style=for-the-badge" alt="31B Model"></a>
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--35B--A3B--Opus-blue?style=for-the-badge" alt="35B Model"></a>
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="9B Model"></a>
+ <a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF"><img src="https://img.shields.io/badge/📦_GGUF-Q8--Official-yellow?style=for-the-badge" alt="Q8 GGUF"></a>
+ <a href="https://huggingface.co/bartowski/FINAL-Bench_Darwin-35B-A3B-Opus-GGUF"><img src="https://img.shields.io/badge/📦_GGUF-bartowski-yellow?style=for-the-badge" alt="bartowski GGUF"></a>
+ <a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/🚀_Space-9B_Demo-purple?style=for-the-badge" alt="9B Space"></a>
+ <a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/🚀_Space-35B_Demo-purple?style=for-the-badge" alt="35B Space"></a>
+ <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/🏆_FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
+ <a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard"><img src="https://img.shields.io/badge/📊_ALL_Bench-Leaderboard-orange?style=for-the-badge" alt="ALL Bench"></a>
+ </p>
+
+ <p align="center">
+ <img src="info.png" alt="Darwin-31B-Opus" width="100%">
 </p>
 
 > Gemma 4 Dense 31B | Thinking Mode | 256K Context | 140+ Languages | BF16 | Apache 2.0
 
 
 ## Overview
 
+ Darwin-31B-Opus is a reasoning-enhanced model created by merging google/gemma-4-31B-it (Father) and TeichAI/gemma-4-31B-it-Claude-Opus-Distill (Mother) using the Darwin V6 engine.
 
+ Darwin V6 diagnoses both parent models at the tensor level before merging, assigning an independent optimal ratio to each of the 1,188 tensors. This is fundamentally different from conventional merging tools that apply a single uniform ratio across all tensors.
 
 ---
 
 | Role | Model | Characteristics |
 |---|---|---|
 | Father | google/gemma-4-31B-it | Gemma 4 Dense 31B, multimodal, 256K context, LMArena 1452 (open model #3) |
+ | Mother | TeichAI/gemma-4-31B-it-Claude-Opus-Distill | Claude 4.6 Opus high-effort reasoning distillation, code/science/analysis |
+
+ ### Model Diagnostic Scan (MDS)
+
+ <p align="center">
+ <img src="s1.png" alt="Father (gemma-4-31B-it) MDS Scan" width="48%">
+ <img src="s2.png" alt="Mother (Claude-Opus-Distill) MDS Scan" width="48%">
+ </p>
+
+ Left: Father (gemma-4-31B-it), a balanced generalist with low activation across most probes. Right: Mother (Claude-Opus-Distill), showing strong REASONING concentration in L50-L60, CODE activation in the late layers, and KOREAN at the start and end. The Mother's layer patterns are significantly more specialized, a result of the Claude Opus distillation.
 
 ---
 
+ ## Benchmarks
 
 | Benchmark | Darwin-31B-Opus | Father (gemma-4-31B-it) | Condition |
 |---|---|---|---|
+ | ARC-Challenge | 82.89% | - | loglikelihood, zero-shot, 200Q |
+ | GPQA Diamond | 66.0% | 60.0% | generative thinking mode, greedy, 50Q |
 
+ GPQA Diamond was evaluated under identical conditions for both models: the same 50 questions, the same seed (i+42), the same prompt template, greedy decoding (do_sample=False), max_new_tokens=2048, and enable_thinking=True. Darwin-31B-Opus achieved a 10% relative improvement over the Father model.
+
+ Note: The Gemma 4 architecture (Gemma4ForConditionalGeneration) has limited compatibility with lm-eval's loglikelihood method because of its multimodal wrapper structure; only generative evaluation produces valid results for Gemma 4-based models. A full 198-question evaluation with Majority Voting is planned.
 
 ---
 
+ ## Darwin V6 vs Conventional Merging
 
+ | Capability | mergekit (DARE-TIES) | Darwin V6 |
+ |---|---|---|
+ | Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
+ | Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (1,188 independent ratios) |
+ | Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
+ | Ratio formula | Human-set or grid search | combined = static × 0.4 + probe × 0.6, then evolutionary optimization |
+ | Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
+ | Post-merge validation | Benchmark score only | Layer-by-layer Health Check: child vs both parents, interference and function loss detection |
+ | Search method | Manual tuning | CMA-ES evolution with adaptive 14-dimensional genome |
+ | Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
+ | GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds) → Phase 2 real merge (top-k only evaluated) |
 
 ---
 
+ ## How Darwin V6 Works
 
+ Darwin V6 does not use mergekit or any external merge library. It re-implements DARE-TIES (Yadav et al., 2023) directly via PyTorch tensor operations with per-tensor diagnostic ratios.
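The DARE step being re-implemented can be illustrated with a minimal NumPy sketch (the engine itself reportedly operates on PyTorch tensors; the function names, the simplified sign election, and the element-wise form here are assumptions for illustration, not Darwin V6's actual code):

```python
import numpy as np

def dare_sparsify(delta, density, rng):
    """DARE: randomly drop (1 - density) of the task-vector entries,
    then rescale survivors by 1/density to preserve the expectation."""
    mask = rng.random(delta.shape) < density
    return np.where(mask, delta / density, 0.0)

def dare_ties_merge(base, a, b, density_a, density_b, ratio, seed=0):
    """Merge two fine-tuned tensors onto a shared base tensor.
    ratio is the per-tensor Mother weight (0 = all Father, 1 = all Mother)."""
    rng = np.random.default_rng(seed)
    delta_a = dare_sparsify(a - base, density_a, rng)
    delta_b = dare_sparsify(b - base, density_b, rng)
    merged = (1 - ratio) * delta_a + ratio * delta_b
    # Simplified TIES-style sign election: zero out elements whose merged
    # direction disagrees with the summed parent direction.
    elected = np.sign(delta_a + delta_b)
    merged = np.where(np.sign(merged) == elected, merged, 0.0)
    return base + merged
```

With density 1.0 (no dropping) and agreeing signs, this reduces to plain linear interpolation of the two deltas, which makes the role of the per-tensor ratio easy to see.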
 
+ Before merging, Darwin performs a Model Diagnostic Scan (MDS) on both parents. For every tensor, it measures Shannon entropy (information density), standard deviation (activation spread), and L2 norm (energy). Additionally, 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) are passed through the model, measuring cosine distance when each layer is skipped to determine functional importance.
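The static half of the scan can be sketched as follows (an illustrative NumPy version; the histogram bin count is an assumption, while the metric weights and the clamp at 100 come from the ratio formula below):

```python
import numpy as np

def static_profile(tensor, bins=256):
    """Static MDS metrics for one tensor: Shannon entropy over a value
    histogram (information density), standard deviation (activation
    spread), and L2 norm (energy)."""
    flat = tensor.ravel()
    hist, _ = np.histogram(flat, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                       # drop empty bins before the log
    entropy = -np.sum(p * np.log2(p))  # Shannon entropy in bits
    return {
        "entropy": float(entropy),
        "std": float(flat.std()),
        "norm": float(np.linalg.norm(flat)),
    }

def static_score(m):
    """Weighted combination used in the merge-ratio formula;
    the norm is clamped at 100 before its small weight is applied."""
    return m["entropy"] * 0.3 + m["std"] * 0.2 + min(m["norm"], 100) * 0.002
```

A constant tensor collapses into a single histogram bin and scores zero entropy, which is the "low information density" case the scan is designed to detect.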
 
+ The final merge ratio for each tensor:
 
 ```
+ static_score = entropy × 0.3 + std × 0.2 + clamp(norm, 100) × 0.002
+ probe_score = Σ(cosine_distance[probe_i] × weight_i)
+ combined = static × 0.4 + probe × 0.6
 mri_ratio = combined_b / (combined_a + combined_b)
+ final_ratio = mri_ratio × mri_trust + genome_ratio × (1 - mri_trust)
 ```
 
+ The mri_trust parameter itself is optimized by the CMA-ES evolutionary algorithm, allowing the system to automatically determine the optimal balance between diagnostic prescription and evolutionary search for each model pair.
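Putting the formula together, the per-tensor decision might look like this (a sketch; the score arguments stand for the static and probe scores described above, and the 0.15/0.85 transplant thresholds are the ones stated elsewhere in this README):

```python
def final_tensor_ratio(static_a, probe_a, static_b, probe_b,
                       genome_ratio, mri_trust):
    """Blend the diagnostic (MDS) ratio with the genome ratio for one
    tensor. Returns the Mother weight; extreme ratios are snapped to a
    full transplant from one parent to avoid interpolation noise."""
    combined_a = static_a * 0.4 + probe_a * 0.6
    combined_b = static_b * 0.4 + probe_b * 0.6
    mri_ratio = combined_b / (combined_a + combined_b)
    ratio = mri_ratio * mri_trust + genome_ratio * (1 - mri_trust)
    if ratio < 0.15:   # transplant entirely from Father
        return 0.0
    if ratio > 0.85:   # transplant entirely from Mother
        return 1.0
    return ratio
```

When both parents score identically, mri_ratio is 0.5 and the result is just the mri_trust-weighted blend of 0.5 and the genome ratio, which shows how mri_trust arbitrates between diagnosis and evolution.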
+
+ After merging, a Health Check compares the child model against both parents layer-by-layer, detecting interference (child importance >> parent max) or function loss (parent importance high but child dropped).
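The two failure modes can be expressed directly (an illustrative sketch; the thresholds 2.0 and 0.5 are assumptions, since the README does not publish the actual cutoffs):

```python
def health_check(child_imp, father_imp, mother_imp,
                 interference_factor=2.0, loss_factor=0.5):
    """Flag each layer by comparing the child's probe importance with
    both parents: interference when the child far exceeds the parent
    maximum, function loss when it drops well below the parent maximum."""
    report = []
    for i, (c, f, m) in enumerate(zip(child_imp, father_imp, mother_imp)):
        parent_max = max(f, m)
        if c > parent_max * interference_factor:
            report.append((i, "interference"))
        elif parent_max > 0 and c < parent_max * loss_factor:
            report.append((i, "function_loss"))
        else:
            report.append((i, "healthy"))
    return report
```

A merge is reported "healthy" (as in the results table below) when no layer trips either condition.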
 
+ ### Parent Comparison (MDS Result)
+
+ <p align="center">
+ <img src="parent_comparison.png" alt="Parent Comparison - Layer-wise Importance" width="100%">
+ </p>
 
  ---
 
 | | |
 |---|---|
+ | Best Score (ARC-Challenge) | 0.8289 |
+ | Merge Method | DARE-TIES (direct PyTorch) |
 | Tensors Merged | 1,188 |
 | Health Check | healthy |
 | Phase 2 Steps | 4 (early stop, patience=5) |
 | Total Time | 134 min |
 | Infrastructure | 4 x NVIDIA H100 NVL (100GB) |
 
+ Optimal Genome (14-dimensional adaptive):
 
 ```
+ global_ratio: 0.5147 (overall merge ratio)
+ attn_ratio: 0.3169 (Attention layers, Father dominant)
+ ffn_ratio: 0.9316 (FFN layers, Mother dominant)
+ embed_ratio: 0.7748 (Embedding)
+ density_a: 0.8997 (Father DARE density)
+ density_b: 0.9539 (Mother DARE density)
+ block_0_ratio: 0.6628 (L0-L9)
+ block_1_ratio: 0.6431 (L10-L19)
+ block_2_ratio: 0.5146 (L20-L29, balanced)
+ block_3_ratio: 0.5971 (L30-L39)
+ block_4_ratio: 0.6339 (L40-L49)
+ block_5_ratio: 0.8583 (L50-L59, reasoning core, Mother dominant)
+ mri_trust: 0.3631 (MDS 36% + Genome 64%)
 merge_method_weight: 0.6897
 ```
 
+ Key observations from the genome: ffn_ratio=0.93 indicates the FFN layers strongly favor the Mother (Claude Opus Distill), and block_5 (L50-L59)=0.86 shows the reasoning-core layers also favor the Mother. This aligns with the MDS heatmap, where the Mother's reasoning capability is concentrated in the final layers. Meanwhile, attn_ratio=0.32 preserves the Father's attention structure, maintaining the original Gemma 4 multimodal and long-context capabilities.
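How a 14-dimensional genome fans out to 1,188 per-tensor ratios can be sketched as a lookup on the tensor's name (illustrative only: the tensor naming follows the usual Hugging Face convention, but Darwin V6's actual rule for combining the type ratio with the block ratio is not published, so the simple average here is an assumption):

```python
import re

GENOME = {  # values from the optimal genome above
    "attn_ratio": 0.3169, "ffn_ratio": 0.9316, "embed_ratio": 0.7748,
    "block_ratios": [0.6628, 0.6431, 0.5146, 0.5971, 0.6339, 0.8583],
}

def genome_ratio(tensor_name):
    """Pick the genome-side ratio for one tensor: embeddings get
    embed_ratio; attention/FFN weights combine their type ratio with
    the ratio of their 10-layer block."""
    if "embed" in tensor_name:
        return GENOME["embed_ratio"]
    layer = int(re.search(r"layers\.(\d+)\.", tensor_name).group(1))
    block = GENOME["block_ratios"][min(layer // 10, 5)]
    kind = GENOME["attn_ratio"] if "self_attn" in tensor_name else GENOME["ffn_ratio"]
    return (kind + block) / 2  # simple average; the real combination rule is unknown
```

Under this sketch, an FFN weight in layer 55 lands near 0.89 (strongly Mother), while an attention weight in layer 3 lands near 0.49, mirroring the pattern described above.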
+
+ ---
+
+ ## Model Specifications
+
+ | | |
+ |---|---|
+ | Architecture | Gemma 4 Dense (Hybrid Attention: Sliding Window + Global) |
+ | Parameters | 31B |
+ | Precision | BF16 |
+ | Context | 256,072 |
+ | Languages | 140+ |
+ | Thinking | enable_thinking=True chain-of-thought |
+ | License | Apache 2.0 |
 
  ---
 
 | BF16 Full Precision | ~62 GB | |
 | NVIDIA H100 80GB | 80 GB | Single GPU |
 | NVIDIA A100 80GB x 2 | 160 GB | Comfortable |
+ | NVIDIA RTX 4090 24GB x 4 | 96 GB | device_map=auto |
 
 ---
 
 ## References
 
+ - DARE-TIES: Yadav et al., 2023 (https://arxiv.org/abs/2311.03099); re-implemented, not library-dependent
+ - Darwin V6 Engine: https://huggingface.co/spaces/ginigen-ai/DARWIN-V5-BACKUP
 - FINAL Bench: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
 
 ---
 
 | | |
 |---|---|
 | Developer | VIDRAFT |
+ | Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
+ | Architecture | Gemma-4-31B |
 | License | Apache 2.0 |
 
 ---