Darwin-31B-Opus


Gemma 4 Dense 31B | Thinking Mode | 256K Context | 140+ Languages | BF16 | Apache 2.0


Overview

Darwin-31B-Opus is a reasoning-enhanced model created by merging google/gemma-4-31B-it (Father) and TeichAI/gemma-4-31B-it-Claude-Opus-Distill (Mother) using the Darwin V6 engine.

Darwin V6 diagnoses both parent models at the tensor level before merging, assigning an independent optimal ratio to each of the 1,188 tensors. This is fundamentally different from conventional merging tools that apply a single uniform ratio across all tensors.
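
To make the contrast concrete, here is a minimal sketch that compares a uniform merge with a per-tensor merge. It uses plain linear interpolation on tiny Python lists as a stand-in; Darwin V6 itself applies DARE-TIES, and the tensor names, values, and helper functions below are hypothetical.

```python
def lerp(a, b, ratio):
    """Blend two weight lists: ratio=0 keeps a (Father), ratio=1 keeps b (Mother)."""
    return [(1.0 - ratio) * x + ratio * y for x, y in zip(a, b)]

def merge_uniform(father, mother, ratio):
    """Conventional approach: one ratio shared by every tensor."""
    return {name: lerp(father[name], mother[name], ratio) for name in father}

def merge_per_tensor(father, mother, ratios, default=0.5):
    """Per-tensor approach: each tensor looks up its own ratio."""
    return {
        name: lerp(father[name], mother[name], ratios.get(name, default))
        for name in father
    }

# Toy "models" with two tensors each (hypothetical names and values).
father = {"attn.q_proj": [1.0, 1.0], "mlp.up_proj": [1.0, 1.0]}
mother = {"attn.q_proj": [3.0, 3.0], "mlp.up_proj": [3.0, 3.0]}
ratios = {"attn.q_proj": 0.25, "mlp.up_proj": 0.75}

uniform = merge_uniform(father, mother, 0.5)
per_tensor = merge_per_tensor(father, mother, ratios)
print(uniform)
print(per_tensor)
```

With a uniform ratio both tensors land at the midpoint, while the per-tensor ratios pull the attention tensor toward the Father and the FFN tensor toward the Mother.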


Parent Models

| Role | Model | Characteristics |
|---|---|---|
| Father | google/gemma-4-31B-it | Gemma 4 Dense 31B, multimodal, 256K context, LMArena 1452 (open model #3) |
| Mother | TeichAI/gemma-4-31B-it-Claude-Opus-Distill | Claude 4.6 Opus high-effort reasoning distillation, code/science/analysis |

Model Diagnostic Scan (MDS)

Figure: MDS scans of the Father (gemma-4-31B-it, left) and the Mother (Claude-Opus-Distill, right).

Left: Father (gemma-4-31B-it) — balanced generalist with low activation across most probes. Right: Mother (Claude-Opus-Distill) — strong REASONING concentration in L50-L60, CODE activation in late layers, KOREAN at start and end. The Mother shows significantly more specialized layer patterns from Claude Opus distillation.


Benchmarks

| Benchmark | Darwin-31B-Opus | Father (gemma-4-31B-it) | Condition |
|---|---|---|---|
| ARC-Challenge | 82.89% | - | loglikelihood, zero-shot, 200Q |
| GPQA Diamond | 66.0% | 60.0% | generative thinking mode, greedy, 50Q |

GPQA Diamond was evaluated under identical conditions for both models: same 50 questions, same seed (i+42), same prompt template, greedy decoding (do_sample=False), max_new_tokens=2048, and enable_thinking=True. Darwin-31B-Opus scored 6 percentage points higher than the Father model (66.0% vs 60.0%), a 10% relative improvement.
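
The arithmetic behind the GPQA Diamond comparison can be checked with a short script. The helper names are illustrative; the inputs are the two reported scores and the 50-question sample size.

```python
def correct_count(accuracy_pct, n_questions):
    """Number of correct answers implied by an accuracy percentage."""
    return round(accuracy_pct / 100 * n_questions)

def relative_gain_pct(new_pct, base_pct):
    """Relative improvement of new_pct over base_pct, in percent."""
    return (new_pct - base_pct) / base_pct * 100

darwin_pct, father_pct, n_questions = 66.0, 60.0, 50

darwin_correct = correct_count(darwin_pct, n_questions)
father_correct = correct_count(father_pct, n_questions)
absolute_gain = darwin_pct - father_pct
relative_gain = relative_gain_pct(darwin_pct, father_pct)

print(darwin_correct, father_correct)   # questions answered correctly
print(absolute_gain, round(relative_gain, 1))
```

On a 50-question sample the gap corresponds to three additional correct answers, which is one reason a larger evaluation is useful.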

Note: The Gemma 4 architecture (Gemma4ForConditionalGeneration) has limited compatibility with lm-eval's loglikelihood method because of its multimodal wrapper structure. Only generative evaluation produces valid results for Gemma 4 based models. A full 198-question GPQA Diamond evaluation with majority voting is planned.


Darwin V6 vs Conventional Merging

| Capability | mergekit (DARE-TIES) | Darwin V6 |
|---|---|---|
| Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
| Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (1,188 independent ratios) |
| Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
| Ratio formula | Human-set or grid search | combined = static × 0.4 + probe × 0.6, then evolutionary optimization |
| Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
| Post-merge validation | Benchmark score only | Layer-by-layer Health Check: child vs both parents, interference and function loss detection |
| Search method | Manual tuning | CMA-ES evolution with adaptive 14-dimensional genome |
| Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
| GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds) → Phase 2 real merge (top-k only evaluated) |
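
The reproducibility row relies on deriving a seed from the genome. How Darwin V6 computes genome_hash is not documented here, so the following is only a sketch of one plausible scheme: hash a canonical JSON encoding of the genome with SHA-256 and use part of the digest as the random seed. The function names and the choice of hash are assumptions.

```python
import hashlib
import json
import random

def genome_hash(genome):
    """Hash a genome dict via canonical JSON (assumed scheme, not Darwin's own)."""
    payload = json.dumps(genome, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def seed_from_genome(genome):
    """Use the first 32 bits of the hash as an integer seed."""
    return int(genome_hash(genome)[:8], 16)

genome = {"global_ratio": 0.5147, "attn_ratio": 0.3169, "ffn_ratio": 0.9316}
# Same values, different key order: the canonical encoding is unchanged.
reordered = dict(reversed(list(genome.items())))

rng_a = random.Random(seed_from_genome(genome))
rng_b = random.Random(seed_from_genome(reordered))
draws_a = [rng_a.random() for _ in range(3)]
draws_b = [rng_b.random() for _ in range(3)]
print(draws_a == draws_b)
```

Sorting the keys before hashing means two genomes with the same values always map to the same seed, so any random step driven by that seed (such as a DARE drop mask) repeats exactly.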

How Darwin V6 Works

Darwin V6 does not use mergekit or any external merge library. It re-implements DARE-TIES (DARE: Yu et al., 2023; TIES: Yadav et al., 2023) directly via PyTorch tensor operations with per-tensor diagnostic ratios.
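
For readers unfamiliar with DARE-TIES, the sketch below shows the two ideas on plain Python lists: DARE randomly drops entries of each task delta and rescales the survivors, and TIES-style sign election keeps only the deltas that agree with the dominant sign. This is a simplified illustration, not Darwin V6's code; the `base` weights, function names, and the weighting of the two parents by `ratio` are assumptions.

```python
import random

def dare(delta, density, rng):
    """DARE: keep each entry with probability `density`, rescale kept entries by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]

def sign_elect_merge(deltas, weights):
    """TIES-style merge: per entry, average only the deltas matching the dominant sign."""
    merged = []
    for entries in zip(*deltas):
        weighted_sum = sum(w * e for w, e in zip(weights, entries))
        sign = 1.0 if weighted_sum >= 0 else -1.0
        agree = [(w, e) for w, e in zip(weights, entries) if e * sign > 0]
        total_w = sum(w for w, _ in agree)
        merged.append(sum(w * e for w, e in agree) / total_w if total_w else 0.0)
    return merged

def dare_ties(base, a, b, ratio, density_a, density_b, seed=0):
    """Merge parents a and b relative to `base`; ratio is the weight given to b."""
    rng = random.Random(seed)
    delta_a = dare([x - p for x, p in zip(a, base)], density_a, rng)
    delta_b = dare([x - p for x, p in zip(b, base)], density_b, rng)
    merged = sign_elect_merge([delta_a, delta_b], [1.0 - ratio, ratio])
    return [p + m for p, m in zip(base, merged)]

# With density 1.0 nothing is dropped, so the result is deterministic.
result = dare_ties([0.0, 0.0], [1.0, -2.0], [3.0, 4.0], 0.5, 1.0, 1.0)
print(result)
```

In the second entry the two parents disagree in sign, so only the delta matching the dominant sign survives instead of the two cancelling each other.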

Before merging, Darwin performs a Model Diagnostic Scan (MDS) on both parents. For every tensor, it measures Shannon entropy (information density), standard deviation (activation spread), and L2 norm (energy). Additionally, 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) are passed through the model, measuring cosine distance when each layer is skipped to determine functional importance.
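
The quantities named above can be sketched in pure Python. How Darwin V6 discretizes weights for the entropy estimate is not stated, so the fixed-width histogram with 16 bins is an assumption, as are the function names.

```python
import math

def shannon_entropy(values, bins=16):
    """Entropy in bits of a fixed-width histogram over the values (assumed binning)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def std(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def l2_norm(values):
    """Euclidean norm of the flattened tensor."""
    return math.sqrt(sum(v * v for v in values))

def cosine_distance(u, v):
    """1 - cosine similarity, e.g. between outputs with and without a layer."""
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (l2_norm(u) * l2_norm(v))

print(shannon_entropy([0.0, 1.0], bins=2), std([1.0, 3.0]), l2_norm([3.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

A layer whose removal leaves the output direction unchanged yields a cosine distance near 0, while a layer the probe depends on yields a larger distance.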

The final merge ratio for each tensor (subscript a is the Father, subscript b is the Mother):

static_score = entropy × 0.3 + std × 0.2 + clamp(norm, 100) × 0.002
probe_score  = Σ(cosine_distance[probe_i] × weight_i)
combined     = static_score × 0.4 + probe_score × 0.6
mri_ratio    = combined_b / (combined_a + combined_b)
final_ratio  = mri_ratio × mri_trust + genome_ratio × (1 - mri_trust)

The mri_trust parameter itself is optimized by the CMA-ES evolutionary algorithm, allowing the system to automatically determine the optimal balance between diagnostic prescription and evolutionary search for each model pair.
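
The ratio formulas translate into a few small functions. This sketch makes three assumptions that the text does not confirm: clamp(norm, 100) is read as min(norm, 100), the transplant thresholds (0.15 and 0.85) are applied to the final ratio, and a neutral 0.5 is returned when both combined scores are zero.

```python
def static_score(entropy, std, norm):
    """Static tensor profile; clamp(norm, 100) is interpreted as min(norm, 100)."""
    return entropy * 0.3 + std * 0.2 + min(norm, 100.0) * 0.002

def probe_score(cosine_distances, weights):
    """Weighted sum of per-probe cosine distances."""
    return sum(d * w for d, w in zip(cosine_distances, weights))

def combined_score(static, probe):
    """Blend of static (40%) and probe-based (60%) importance."""
    return static * 0.4 + probe * 0.6

def final_ratio(combined_a, combined_b, genome_ratio, mri_trust):
    """Share given to parent b (Mother), blending diagnostic and genome ratios."""
    total = combined_a + combined_b
    mri_ratio = combined_b / total if total else 0.5  # assumed neutral fallback
    ratio = mri_ratio * mri_trust + genome_ratio * (1.0 - mri_trust)
    # Transplant rule (assumed to act on the final ratio): copy one parent outright.
    if ratio < 0.15:
        return 0.0
    if ratio > 0.85:
        return 1.0
    return ratio

print(final_ratio(1.0, 3.0, genome_ratio=0.25, mri_trust=0.5))
print(final_ratio(1.0, 9.0, genome_ratio=1.0, mri_trust=0.5))
```

In the first call the diagnostic ratio (0.75) and the genome ratio (0.25) are trusted equally and meet in the middle; in the second the blended ratio exceeds 0.85, so the tensor is taken from the Mother without interpolation.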

After merging, a Health Check compares the child model against both parents layer-by-layer, detecting interference (child importance >> parent max) or function loss (parent importance high but child dropped).
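
A layer-by-layer check of this kind could look like the sketch below. The thresholds (`interference_factor`, `loss_factor`, `high`) and the function name are hypothetical; the text only states the qualitative conditions.

```python
def health_check(child, father, mother,
                 interference_factor=1.5, loss_factor=0.5, high=0.1):
    """Flag layers where the child's importance departs from both parents.

    Each argument is a list of per-layer importance scores. All thresholds
    are illustrative assumptions, not Darwin V6's actual values.
    """
    findings = []
    for layer, (c, f, m) in enumerate(zip(child, father, mother)):
        parent_max = max(f, m)
        if c > parent_max * interference_factor:
            # Child importance far above both parents.
            findings.append((layer, "interference"))
        elif parent_max >= high and c < parent_max * loss_factor:
            # A parent relied on this layer but the child no longer does.
            findings.append((layer, "function_loss"))
    return findings

child = [0.2, 0.9, 0.05]
father = [0.2, 0.3, 0.4]
mother = [0.25, 0.2, 0.1]
findings = health_check(child, father, mother)
print(findings or "healthy")
```

An empty list of findings corresponds to the "healthy" status reported for this merge.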

Parent Comparison (MDS Result)

Figure: Parent Comparison, Layer-wise Importance.


Evolution Result

| Item | Value |
|---|---|
| Best Score (ARC-Challenge) | 0.8289 |
| Merge Method | DARE-TIES (direct PyTorch) |
| Tensors Merged | 1,188 |
| Health Check | healthy |
| Phase 2 Steps | 4 (early stop, patience=5) |
| Total Time | 134 min |
| Infrastructure | 4 x NVIDIA H100 NVL (100GB) |

Optimal Genome (14-dimensional adaptive):

global_ratio:        0.5147   (overall merge ratio)
attn_ratio:          0.3169   (Attention layers — Father dominant)
ffn_ratio:           0.9316   (FFN layers — Mother dominant)
embed_ratio:         0.7748   (Embedding)
density_a:           0.8997   (Father DARE density)
density_b:           0.9539   (Mother DARE density)
block_0_ratio:       0.6628   (L0-L9)
block_1_ratio:       0.6431   (L10-L19)
block_2_ratio:       0.5146   (L20-L29, balanced)
block_3_ratio:       0.5971   (L30-L39)
block_4_ratio:       0.6339   (L40-L49)
block_5_ratio:       0.8583   (L50-L59, reasoning core — Mother dominant)
mri_trust:           0.3631   (MDS 36% + Genome 64%)
merge_method_weight: 0.6897

Key observations from the genome: ffn_ratio=0.93 indicates that the FFN layers strongly favor the Mother (Claude Opus Distill), and block_5_ratio=0.86 (L50-L59) shows that the reasoning core layers also favor the Mother. This matches the MDS heatmap, where the Mother's reasoning capability is concentrated in the final layers. Meanwhile, attn_ratio=0.32 preserves the Father's attention structure, maintaining the original Gemma 4 multimodal and long-context capabilities.
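
As a reading aid, the sketch below looks up which genome entries apply to a given tensor. It covers only the lookup: the text does not specify how Darwin V6 combines the block, module, and global ratios, and the tensor-name patterns ("self_attn", "mlp", "embed") are assumptions about the checkpoint's naming.

```python
GENOME = {
    "global_ratio": 0.5147,
    "attn_ratio": 0.3169,
    "ffn_ratio": 0.9316,
    "embed_ratio": 0.7748,
    # block_0_ratio .. block_5_ratio, ten layers per block (L0-L59)
    "block_ratios": [0.6628, 0.6431, 0.5146, 0.5971, 0.6339, 0.8583],
}

def block_ratio(layer_index, block_size=10):
    """Ratio of the block containing this layer; later layers reuse the last block (assumed)."""
    blocks = GENOME["block_ratios"]
    return blocks[min(layer_index // block_size, len(blocks) - 1)]

def module_ratio(tensor_name):
    """Module-type ratio chosen by (assumed) substrings of the tensor name."""
    if "self_attn" in tensor_name:
        return GENOME["attn_ratio"]
    if "mlp" in tensor_name:
        return GENOME["ffn_ratio"]
    if "embed" in tensor_name:
        return GENOME["embed_ratio"]
    return GENOME["global_ratio"]

name = "model.layers.55.mlp.up_proj.weight"  # hypothetical tensor name
print(block_ratio(55), module_ratio(name))
```

For an FFN tensor in layer 55, both the block ratio and the module ratio point toward the Mother, which is the pattern described in the observations above.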


Model Specifications

| Item | Value |
|---|---|
| Architecture | Gemma 4 Dense (Hybrid Attention: Sliding Window + Global) |
| Parameters | 31B |
| Precision | BF16 |
| Context | 256,072 |
| Languages | 140+ |
| Thinking | enable_thinking=True chain-of-thought |
| License | Apache 2.0 |

Usage

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the BF16 weights; device_map="auto" places layers on the available GPUs.
tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-31B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-31B-Opus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Build the chat prompt with thinking mode enabled.
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Generate greedily, then decode only the newly generated tokens (the prompt is sliced off).
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

VRAM Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 Full Precision | ~62 GB | |
| NVIDIA H100 80GB | 80 GB | Single GPU |
| NVIDIA A100 80GB x 2 | 160 GB | Comfortable |
| NVIDIA RTX 4090 24GB x 4 | 96 GB | device_map=auto |
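
The ~62 GB figure follows from the parameter count: BF16 stores each parameter in 2 bytes. The sketch below reproduces that estimate for the weights only; the KV cache and activations need additional memory that depends on context length and batch size, and the helper names are illustrative.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (10^9 bytes); BF16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

def weights_fit(n_params, total_vram_gb, bytes_per_param=2):
    """True if the weights alone fit in the given VRAM (ignores cache and activations)."""
    return weight_memory_gb(n_params, bytes_per_param) <= total_vram_gb

params = 31e9  # 31B parameters
print(weight_memory_gb(params))       # weights-only estimate in GB
print(weights_fit(params, 80))        # single 80 GB GPU
print(weights_fit(params, 4 * 24))    # four 24 GB GPUs
print(weights_fit(params, 24))        # single 24 GB GPU
```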


Built By

| Item | Value |
|---|---|
| Developer | VIDRAFT |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Architecture | Gemma-4-31B |
| License | Apache 2.0 |

Citation

@misc{vidraft_darwin_31b_opus,
  title        = {Darwin-31B-Opus: Diagnostic-Guided Evolutionary Merge on Gemma 4},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-31B-Opus}}
}