---
license: apache-2.0
base_model:
- FINAL-Bench/Darwin-4B-Opus
- DavidAU/gemma-4-E4B-it-The-DECKARD-Expresso-Universe-HERETIC-UNCENSORED-Thinking
tags:
- darwin-v6
- generation-2
- evolutionary-merge
- mri-guided
- dare-ties
- gemma4
- reasoning
- thinking
- proto-agi
- vidraft
language:
- en
- ko
- ja
- zh
- multilingual
pipeline_tag: text-generation
library_name: transformers
---
# Darwin-4B-David — The First Second-Generation Darwin Model
> Gemma 4 E4B Dense | 4.5B Params | Thinking Mode | 128K Context | 140+ Languages | BF16 | Apache 2.0
> **The first-ever second-generation Darwin model — "Evolution of Evolution"**
---
## Overview
Darwin-4B-David is the first second-generation (Generation 2) model in Darwin history — **a model evolved from an already-evolved model.**
The first-generation Darwin-4B-Opus (Father) was evolved from the original gemma-4-E4B-it using the Darwin V6 engine. Darwin-4B-David was born by crossbreeding this first-generation evolved model with DavidAU's DECKARD-Expresso-Universe (Mother). This is the first realization of Darwin's core concept: **"Merge = Evolve"** applied recursively.
The name **"David"** pays tribute to the Mother model's creator DavidAU, while evoking the biblical David who defeated Goliath — symbolizing how a **4.5B small model challenges models many times its size.**
---
## Family Tree
### Generation Comparison
| | Gen 0 (Original) | Gen 1 (Opus) | Gen 2 (David) |
|---|---|---|---|
| Model | gemma-4-E4B-it | Darwin-4B-Opus | **Darwin-4B-David** |
| Parents | Google training | Original + Claude distill | **Evolved model + DECKARD** |
| GPQA Diamond | 58.6% | — | **85.0% (+26.4%p)** |
| Recursive evolution | None | 1× | **2× (evolution of evolution)** |
| Core genes | General-purpose | Claude reasoning | **Reasoning + Creativity + Thinking** |
---
## Parent Models
| Role | Model | Characteristics |
|---|---|---|
| Father (Gen-1 Evolved) | [FINAL-Bench/Darwin-4B-Opus](https://huggingface.co/FINAL-Bench/Darwin-4B-Opus) | Darwin V6 Gen-1, ARC-C 82.92%, Claude Opus reasoning distillation |
| Mother | [DavidAU/DECKARD-Expresso-Universe](https://huggingface.co/DavidAU/gemma-4-E4B-it-The-DECKARD-Expresso-Universe-HERETIC-UNCENSORED-Thinking) | BF16, Unsloth deep tuning (5 in-house datasets), Universe logic/insight enhancement, Thinking mode default |
### Model Diagnostic Scan (MDS)
**Left: Father (Darwin-4B-Opus)** — REASONING concentration in later layers (dist 0.4), MATH activation throughout. Already optimized through Gen-1 evolution.
**Right: Mother (DECKARD-Expresso-Universe)** — Strong KOREAN hotspot (dist 1.5), signature of Unsloth deep tuning. Remaining regions show uniform distribution.
---
## Benchmarks
### Key Results
| Benchmark | gemma-4-E4B-it (Original) | Darwin-4B-David (Gen-2) | Improvement | Conditions |
|---|---|---|---|---|
| **GPQA Diamond** | 58.6% | **85.0%** | **+26.4%p** | Generative, maj@8, 50Q sampling |
| ARC-Challenge | 64.93% | 64.93% | ±0 | 25-shot, chat template, BF16, loglikelihood |
| KMMLU | 48.47% | 48.46% | ±0 | 5-shot, 225Q, loglikelihood |
### GPQA Diamond Evaluation Details
GPQA Diamond (graduate-level scientific reasoning) was evaluated using **generative (thinking mode) evaluation**.
| Setting | Value |
|---|---|
| Dataset | Idavidrein/gpqa, gpqa_diamond split |
| Questions | **50** (sampled from 198 total) |
| Evaluation method | **maj@8** (8 independent generations per question, majority vote determines final answer) |
| Prompt format | Epoch AI standard (`ANSWER: LETTER`) |
| Thinking mode | Enabled (chat_template, enable_thinking) |
| max_new_tokens | 4,096 |
| temperature | 1.0 |
| top_p / top_k | 0.95 / 64 |
| Precision | BF16 |
| Choice shuffling | Fixed seed per question (MD5 hash) |
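The fixed-seed choice shuffling in the table can be sketched as below. This is an illustrative reconstruction: the card only states that an MD5 hash seeds each question's shuffle, so the exact seed derivation and the `shuffle_choices` helper are assumptions:

```python
import hashlib
import random

def shuffle_choices(question: str, choices: list) -> list:
    # Derive a deterministic 32-bit seed from the question text via MD5,
    # so the same question always gets the same choice order across runs.
    seed = int(hashlib.md5(question.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    shuffled = choices[:]
    rng.shuffle(shuffled)
    return shuffled

question = "Which particle mediates the strong force?"
options = ["photon", "gluon", "W boson", "graviton"]
# Reproducible: calling twice yields the identical permutation.
assert shuffle_choices(question, options) == shuffle_choices(question, options)
```

Per-question seeding keeps the evaluation deterministic while still removing any positional bias baked into the dataset's original answer ordering.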
**Why maj@8:**
- Single-sample (greedy/pass@1) is vulnerable to stochastic variation with do_sample
- 8 independent generations with majority voting reflects the model's **stable reasoning capability**
- maj@k is standard practice in frontier model benchmarks (AIME, MATH, etc.)
**Note on 50-question sampling:**
- GPQA Diamond contains 198 questions total; 50 questions represent 25.3% of the full set
- 50 questions × 8 samples = 400 total generations; note that at 50 questions, the 95% binomial confidence interval around an 85% score is roughly ±10%p
- Full 198-question evaluation is planned
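The maj@8 aggregation described above boils down to extracting one answer letter per generation and taking a majority vote. A minimal sketch, assuming the `ANSWER: LETTER` prompt format; the regex and helper names are illustrative, not the actual harness code:

```python
import re
from collections import Counter

def extract_answer(generation: str):
    # Take the last "ANSWER: X" occurrence so letters mentioned earlier in
    # the thinking trace do not override the final answer.
    matches = re.findall(r"ANSWER:\s*([A-D])", generation)
    return matches[-1] if matches else None

def majority_vote(generations: list):
    # maj@k: the most common extracted letter across k samples wins.
    letters = [a for a in map(extract_answer, generations) if a is not None]
    return Counter(letters).most_common(1)[0][0] if letters else None

samples = ["... ANSWER: B", "thinking ... ANSWER: B", "... ANSWER: C", "ANSWER: B"]
print(majority_vote(samples))  # → B
```

With k = 8, a question is only credited when the model's modal answer is correct, which dampens single-sample temperature noise.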
### Note on lm-eval Loglikelihood Results
ARC-Challenge and KMMLU show identical scores to the original model. This is expected when evaluating a DARE-TIES merge with loglikelihood scoring: that method only compares token probabilities across fixed answer choices, so it cannot capture differences in **generation quality, reasoning chains, or creativity**. The evolution effect shows up clearly in generative evaluation (GPQA Diamond), where the difference emerges during step-by-step thinking-mode reasoning.
---
## MRI-Guided Evolution Recipe
Darwin V6's Model MRI scanned weight divergence across all 42 layers and automatically assigned an independent merge weight to each layer range.
| Layer Range | Mother Weight | Strategy |
|---|---|---|
| Layer 0-3 | 0.81 | Absorb Mother's embedding-adjacent layers |
| Layer 15-16 | 0.91 | Maximum Mother creativity/character layer reinforcement |
| Layer 22-25 | **0.95** | **Maximum absorption of Mother's KOREAN hotspot** |
| Layer 26-27 | 0.40 | Father priority preservation zone |
| Layer 30-40 | 0.48 | Father REASONING/MATH preservation |
| Layer 40-42 | 0.62 | Output layer balance |
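As a rough illustration of how such a per-layer schedule could drive the merge, here is a NumPy sketch of DARE delta dropping with a layer-dependent Mother weight. It is a simplification under stated assumptions: with a single task vector the TIES sign-election step is trivial and omitted, and the function names, default density, and half-open layer ranges are illustrative, not the actual Darwin V6 code:

```python
import numpy as np

# Mother weights per layer range, following the recipe table above
# (half-open ranges; layers not listed fall back to a default).
LAYER_WEIGHTS = {
    range(0, 4): 0.81, range(15, 17): 0.91, range(22, 26): 0.95,
    range(26, 28): 0.40, range(30, 40): 0.48, range(40, 42): 0.62,
}

def weight_for_layer(idx: int, default: float = 0.5) -> float:
    for layers, w in LAYER_WEIGHTS.items():
        if idx in layers:
            return w
    return default

def dare_merge(father: np.ndarray, mother: np.ndarray, weight: float,
               density: float = 0.8, seed: int = 0) -> np.ndarray:
    # DARE: randomly drop a (1 - density) fraction of the delta entries,
    # rescale survivors by 1/density, then add the weighted delta to Father.
    rng = np.random.default_rng(seed)
    delta = mother - father
    keep = rng.random(delta.shape) < density
    delta = np.where(keep, delta / density, 0.0)
    return father + weight * delta

father_t, mother_t = np.zeros((4, 4)), np.ones((4, 4))
child_t = dare_merge(father_t, mother_t, weight_for_layer(23))  # layer 23 → 0.95
```

Fixing the RNG seed per tensor is what makes a merge reproducible from its genome, as the reproducibility row in the comparison table below notes.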
### Evolution Parameters
| Setting | Value |
|---|---|
| Merge method | DARE-TIES (direct PyTorch, no mergekit dependency) |
| Density | 0.800 ~ 0.850 |
| Normalization | normalize: true |
| Evolution method | Darwin V6 evolutionary search (MRI-guided) |
| Population size | 20 |
| Phase 1 (proxy search) | 200 steps |
| Phase 2 (real merge) | 10 steps, top 5 elite |
| Fitness function | kmmlu_lite (Korean knowledge) |
| Best fitness | **0.8412 (84.12%)** |
| Total time | 45.3 minutes (H100 ×1) |
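The two-phase search summarized above can be caricatured as a generic elitist evolutionary loop. This toy sketch only shows the shape of the procedure: the real engine runs CMA-ES over per-layer merge ratios with kmmlu_lite as fitness, while the genome length, mutation scale, and quadratic toy objective here are all assumptions:

```python
import random

def evolve(fitness, genome_len=6, pop_size=20, steps=200, elite=5, seed=0):
    # Elitist (mu + lambda) search: keep the top `elite` genomes each step
    # and refill the population with Gaussian-mutated copies of them.
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(steps):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:elite]
        pop = parents + [
            [min(1.0, max(0.0, g + rng.gauss(0, 0.05)))
             for g in rng.choice(parents)]
            for _ in range(pop_size - elite)
        ]
    return max(pop, key=fitness)

# Toy objective: recover a hidden target ratio schedule (stand-in for the
# real kmmlu_lite fitness, which requires evaluating an actual merged model).
target = [0.81, 0.91, 0.95, 0.40, 0.48, 0.62]
fitness = lambda g: -sum((a - b) ** 2 for a, b in zip(g, target))
best = evolve(fitness)
```

In Phase 1 the fitness is a cheap proxy, so hundreds of steps take seconds; only the top genomes graduate to real merges in Phase 2.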
---
## Darwin V6 vs Conventional Merging
| Capability | mergekit (DARE-TIES) | Darwin V6 |
|---|---|---|
| Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
| Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (independent ratios per tensor) |
| Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
| Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
| Post-merge validation | Benchmark score only | Layer-by-layer Health Check: child vs both parents, interference and function loss detection |
| Search method | Manual tuning | CMA-ES evolution with adaptive genome |
| Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
| GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds) → Phase 2 real merge (top-k only evaluated) |
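The transplant rule from the table (copy a parent's tensor verbatim at extreme ratios instead of interpolating) can be sketched as follows. The 0.15 / 0.85 thresholds come from the table; the middle-zone linear interpolation is a placeholder where the real engine applies DARE-TIES:

```python
import numpy as np

FATHER_CUTOFF, MOTHER_CUTOFF = 0.15, 0.85  # transplant thresholds from the table

def merge_tensor(father: np.ndarray, mother: np.ndarray, ratio: float) -> np.ndarray:
    # At extreme ratios, copy one parent verbatim: a tensor dominated by a
    # single parent gains nothing from interpolation noise.
    if ratio < FATHER_CUTOFF:
        return father.copy()                        # Father 100%
    if ratio > MOTHER_CUTOFF:
        return mother.copy()                        # Mother 100%
    return (1.0 - ratio) * father + ratio * mother  # placeholder for DARE-TIES

f, m = np.zeros(3), np.ones(3)
assert np.array_equal(merge_tensor(f, m, 0.10), f)  # transplanted from Father
assert np.array_equal(merge_tensor(f, m, 0.90), m)  # transplanted from Mother
```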
---
## Significance of Second-Generation Evolution
1. **Proof of "Evolution of Evolution"**: The first systematic case of recursive evolution (2+ generations) in the open-source model merging community. Darwin V6 + MRI automates the entire process.
2. **85% GPQA Diamond at 4.5B parameters**: +26.4%p over the original 58.6%. This **surpasses the 31B-class gemma-4-31B (84.3%) with only 4.5B parameters** — an exceptional result in parameter efficiency.
3. **Apache 2.0 + Edge deployment**: Preserves the Gemma 4 E4B architecture, enabling deployment on Jetson Orin NX 16GB and consumer GPUs with no commercial restrictions.
4. **Multimodal preservation**: Father's vision encoder (~150M) and audio encoder (~300M) are frozen during evolution, maintaining image/video/audio input capabilities.
5. **Community synergy**: Mother model creator DavidAU is an active contributor on HuggingFace. Darwin-4B-David symbolizes collaborative evolution within the open-source ecosystem.
---
## Model Specifications
| | |
|---|---|
| Architecture | Gemma 4 E4B Dense |
| Effective Parameters | 4.5B (8B total with embeddings) |
| Layers | 42 |
| Sliding Window | 512 tokens |
| Precision | BF16 |
| Context | 128K |
| Vocabulary | 262K |
| Languages | 140+ |
| Thinking | enable_thinking=True chain-of-thought |
| Vision Encoder | ~150M (image, video) |
| Audio Encoder | ~300M (speech recognition) |
| License | Apache 2.0 |
---
## Usage
### Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-4B-David", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-4B-David",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
### Disable Thinking Mode
```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```
---
## VRAM Requirements
| Setup | VRAM | Status |
|---|---|---|
| BF16 Full Precision | ~16 GB | Baseline requirement |
| NVIDIA RTX 4090 24GB | 24 GB | Single GPU, very comfortable |
| NVIDIA RTX 3090 24GB | 24 GB | Single GPU, comfortable |
| NVIDIA RTX 4080 16GB | 16 GB | Single GPU |
| NVIDIA T4 16GB | 16 GB | Cloud/Colab friendly |
| Jetson Orin NX 16GB | 16 GB | Edge deployment ready |
---
## Darwin Opus Family
| Model | Gen | Architecture | Parameters | Context | Base | GPQA Diamond |
|---|---|---|---|---|---|---|
| **Darwin-4B-David** | **🥈 Gen 2** | **Dense (E4B)** | **4.5B** | **128K** | **Darwin-4B-Opus × DECKARD** | **85.0%** |
| Darwin-4B-Opus | Gen 1 | Dense (E4B) | 4.5B | 128K | gemma-4-E4B-it | — |
| Darwin-9B-Opus | Gen 1 | Dense | 9B | 131K | Qwen3.5-9B | — |
| Darwin-31B-Opus | Gen 1 | Dense | 31B | 256K | gemma-4-31B-it | — |
| Darwin-35B-A3B-Opus | Gen 1 | MoE | 35B (3B active) | 256K | Qwen3.5-35B-A3B | 90.0% |
---
## Roadmap
- Full 198-question GPQA Diamond evaluation (maj@8)
- MTI (Minimal Test-Time Intervention) serving — expected additional +9-11% reasoning accuracy
- GRPO + TinyLoRA reinforcement learning
- SSD self-distillation
- Cross-architecture breeding research (Transformer × Mamba FFN transplantation)
---
## References
- DARE: Yu et al., 2023 (https://arxiv.org/abs/2311.03099); TIES: Yadav et al., 2023 (https://arxiv.org/abs/2306.01708) — re-implemented, not library-dependent
- Darwin V6 Engine: https://huggingface.co/spaces/ginigen-ai/DARWIN-V5-BACKUP
- FINAL Bench: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
- DavidAU DECKARD Series: https://huggingface.co/DavidAU
- MTI: Minimal Test-Time Intervention (arXiv:2510.13940)
---
## Built By
| | |
|---|---|
| Developer | VIDRAFT |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Generation | **Generation 2** — First in Darwin history |
| Architecture | Gemma-4-E4B Dense |
| License | Apache 2.0 |
---
## Citation
```bibtex
@misc{vidraft_darwin_4b_david_2026,
  title = {Darwin-4B-David: First Second-Generation Evolutionary Merge Model},
  author = {VIDRAFT},
  year = {2026},
  publisher = {Hugging Face},
  note = {Recursive evolution achieves 85\% GPQA Diamond with 4.5B parameters},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-4B-David}}
}
```