---
license: apache-2.0
base_model:
- FINAL-Bench/Darwin-4B-Opus
- DavidAU/gemma-4-E4B-it-The-DECKARD-Expresso-Universe-HERETIC-UNCENSORED-Thinking
tags:
- darwin-v6
- generation-2
- evolutionary-merge
- mri-guided
- dare-ties
- gemma4
- reasoning
- thinking
- proto-agi
- vidraft
language:
- en
- ko
- ja
- zh
- multilingual
pipeline_tag: text-generation
library_name: transformers
---

# Darwin-4B-David: The First Second-Generation Darwin Model

<p align="center">
<a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Opus"><img src="https://img.shields.io/badge/🧬_Gen1-Darwin--4B--Opus-blue?style=for-the-badge" alt="Gen1"></a>
<a href="https://huggingface.co/FINAL-Bench/Darwin-4B-David"><img src="https://img.shields.io/badge/🧬_Gen2-Darwin--4B--David-blue?style=for-the-badge" alt="Gen2"></a>
<a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Genesis"><img src="https://img.shields.io/badge/⭐_Gen3-Darwin--4B--Genesis-gold?style=for-the-badge" alt="Gen3"></a>
</p>

<p align="center">
<a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="9B"></a>
<a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/🚀_Space-9B_Demo-purple?style=for-the-badge" alt="9B Space"></a>
<a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--31B--Opus-blue?style=for-the-badge" alt="31B"></a>
<a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/🚀_Space-31B_Demo-purple?style=for-the-badge" alt="31B Space"></a>
</p>

<p align="center">
<a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/🧬_Model-Darwin--35B--A3B--Opus-blue?style=for-the-badge" alt="35B"></a>
<a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/🚀_Space-35B_Demo-purple?style=for-the-badge" alt="35B Space"></a>
<a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF"><img src="https://img.shields.io/badge/📦_GGUF-Q8--Official-yellow?style=for-the-badge" alt="Q8 GGUF"></a>
<a href="https://huggingface.co/bartowski/FINAL-Bench_Darwin-35B-A3B-Opus-GGUF"><img src="https://img.shields.io/badge/📦_GGUF-bartowski-yellow?style=for-the-badge" alt="bartowski GGUF"></a>
</p>

<p align="center">
<a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/🏆_FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
<a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard"><img src="https://img.shields.io/badge/🏆_ALL_Bench-Leaderboard-orange?style=for-the-badge" alt="ALL Bench"></a>
</p>

> Gemma 4 E4B Dense | 4.5B Params | Thinking Mode | 128K Context | 140+ Languages | BF16 | Apache 2.0
> **The first-ever second-generation Darwin model: "Evolution of Evolution"**

---

## Overview

Darwin-4B-David is the first second-generation (Generation 2) model in Darwin history: **a model evolved from an already-evolved model.**

The first-generation Darwin-4B-Opus (Father) was evolved from the original gemma-4-E4B-it using the Darwin V6 engine. Darwin-4B-David was created by crossbreeding this first-generation evolved model with DavidAU's DECKARD-Expresso-Universe (Mother). This is the first realization of Darwin's core concept, **"Merge = Evolve,"** applied recursively.

The name **"David"** pays tribute to the Mother model's creator, DavidAU, while evoking the biblical David who defeated Goliath, symbolizing how a **small 4.5B model challenges models many times its size.**

---

## Family Tree

<p align="center">
<img src="family.png" alt="Darwin-4B-David" width="100%">
</p>

### Generation Comparison

| | Gen 0 (Original) | Gen 1 (Opus) | Gen 2 (David) |
|---|---|---|---|
| Model | gemma-4-E4B-it | Darwin-4B-Opus | **Darwin-4B-David** |
| Parents | Google training | Original + Claude distill | **Evolved model + DECKARD** |
| GPQA Diamond | 58.6% | – | **85.0% (+26.4%p)** |
| Recursive evolution | None | 1× | **2× (evolution of evolution)** |
| Core genes | General-purpose | Claude reasoning | **Reasoning + Creativity + Thinking** |

---

## Parent Models

| Role | Model | Characteristics |
|---|---|---|
| Father (Gen-1 Evolved) | [FINAL-Bench/Darwin-4B-Opus](https://huggingface.co/FINAL-Bench/Darwin-4B-Opus) | Darwin V6 Gen-1, ARC-C 82.92%, Claude Opus reasoning distillation |
| Mother | [DavidAU/DECKARD-Expresso-Universe](https://huggingface.co/DavidAU/gemma-4-E4B-it-The-DECKARD-Expresso-Universe-HERETIC-UNCENSORED-Thinking) | BF16, Unsloth deep tuning (5 in-house datasets), Universe logic/insight enhancement, Thinking mode on by default |

### Model Diagnostic Scan (MDS)

<p align="center">
<img src="s1.png" alt="Father (Darwin-4B-Opus) MDS Scan" width="48%">
<img src="s2.png" alt="Mother (DECKARD-Expresso-Universe) MDS Scan" width="48%">
</p>

**Left: Father (Darwin-4B-Opus).** REASONING concentration in the later layers (dist 0.4), MATH activation throughout; already optimized through Gen-1 evolution.
**Right: Mother (DECKARD-Expresso-Universe).** Strong KOREAN hotspot (dist 1.5), the signature of Unsloth deep tuning; the remaining regions show a uniform distribution.

---

## Benchmarks

### Key Results

| Benchmark | gemma-4-E4B-it (Original) | Darwin-4B-David (Gen-2) | Improvement | Conditions |
|---|---|---|---|---|
| **GPQA Diamond** | 58.6% | **85.0%** | **+26.4%p** | Generative, maj@8, 50-question sample |
| ARC-Challenge | 64.93% | 64.93% | ±0 | 25-shot, chat template, BF16, loglikelihood |
| KMMLU | 48.47% | 48.46% | ±0 | 5-shot, 225Q, loglikelihood |

### GPQA Diamond Evaluation Details

GPQA Diamond (graduate-level scientific reasoning) was evaluated using **generative (thinking-mode) evaluation**.

| Setting | Value |
|---|---|
| Dataset | Idavidrein/gpqa, gpqa_diamond split |
| Questions | **50** (sampled from 198 total) |
| Evaluation method | **maj@8** (8 independent generations per question; majority vote determines the final answer) |
| Prompt format | Epoch AI standard (`ANSWER: LETTER`) |
| Thinking mode | Enabled (chat_template, enable_thinking) |
| max_new_tokens | 4,096 |
| temperature | 1.0 |
| top_p / top_k | 0.95 / 64 |
| Precision | BF16 |
| Choice shuffling | Fixed per-question seed (MD5 hash; see the sketch below) |
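
The deterministic shuffle in the last row takes only a few lines to reproduce. A minimal sketch, assuming each question carries a stable id string (the helper and field names are illustrative, not the evaluation harness's actual API):

```python
import hashlib
import random

def shuffle_choices(question_id: str, choices: list[str]) -> list[tuple[str, str]]:
    """Deterministically shuffle answer choices with an MD5-derived seed,
    so all 8 maj@8 samples of a question see the same ordering."""
    seed = int(hashlib.md5(question_id.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    shuffled = list(choices)
    rng.shuffle(shuffled)
    return list(zip("ABCD", shuffled))  # letters are assigned after shuffling
```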

**Why maj@8** (a minimal vote-scoring sketch follows this list):
- A single sampled generation (pass@1 with do_sample) is vulnerable to stochastic variation
- 8 independent generations with majority voting reflect the model's **stable reasoning capability**
- maj@k is standard practice in frontier model benchmarks (AIME, MATH, etc.)
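
A minimal sketch of the vote, assuming completions end with an `ANSWER: LETTER` line as in the Epoch AI prompt format (the parsing regex and the `generate()` helper are illustrative):

```python
import re
from collections import Counter

ANSWER_RE = re.compile(r"ANSWER:\s*([A-D])", re.IGNORECASE)

def parse_letter(completion: str) -> str | None:
    """Pull the final `ANSWER: LETTER` occurrence out of one generation."""
    found = ANSWER_RE.findall(completion)
    return found[-1].upper() if found else None

def majority_vote(completions: list[str]) -> str | None:
    """maj@k: the most common parsed letter across k sampled generations wins."""
    letters = [letter for c in completions if (letter := parse_letter(c)) is not None]
    return Counter(letters).most_common(1)[0][0] if letters else None

# 8 sampled generations per question (temperature 1.0, top_p 0.95, top_k 64):
# completions = [generate(prompt) for _ in range(8)]   # hypothetical generate()
# final_answer = majority_vote(completions)
```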

**Note on 50-question sampling:**
- GPQA Diamond contains 198 questions in total; the 50-question sample covers 25.3% of the full set
- 50 questions × 8 samples = 400 generations in total, which reduces per-question sampling noise
- A full 198-question evaluation is planned

### Note on lm-eval Loglikelihood Results

ARC-Challenge and KMMLU scores are essentially identical to the original model's. This is expected when a DARE-TIES merge is evaluated with loglikelihood scoring: that method only compares token probabilities across the answer choices and does not capture differences in **generation quality, reasoning chains, or creativity**. The evolution effect shows up clearly in generative evaluation (GPQA Diamond), where the difference emerges during step-by-step thinking-mode reasoning.
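
For contrast, here is a minimal sketch of how a loglikelihood harness scores a multiple-choice item: each choice's tokens are scored under the model and the highest total log-probability wins, with no text generated at all. It assumes the question tokenization is a prefix of the joint tokenization, and `model`/`tokenizer` are loaded as in the Usage section below:

```python
import torch

@torch.no_grad()
def loglikelihood_pick(model, tokenizer, question: str, choices: list[str]) -> int:
    """Return the index of the choice with the highest summed log-probability.
    No generation happens, so thinking-mode reasoning never runs."""
    scores = []
    for choice in choices:
        ctx = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
        full = tokenizer(question + " " + choice, return_tensors="pt").input_ids.to(model.device)
        logprobs = model(full).logits.log_softmax(-1)
        targets = full[0, ctx.shape[1]:]                               # the choice tokens
        positions = torch.arange(ctx.shape[1] - 1, full.shape[1] - 1)  # positions predicting them
        scores.append(logprobs[0, positions].gather(-1, targets.unsqueeze(-1)).sum().item())
    return max(range(len(choices)), key=scores.__getitem__)
```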

---

## MRI-Guided Evolution Recipe

Darwin V6's Model MRI scanned the weight divergence between the two parents across all 42 layers and automatically assigned an independent merge ratio to each layer.
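
The card does not publish the MRI internals, but one plausible ingredient is a per-layer divergence scan over the two parents' state dicts. A minimal sketch under that assumption (grouping by the standard `layers.<n>.` parameter naming; all helper names are illustrative):

```python
import re
from collections import defaultdict
import torch

def layer_divergence(father_sd: dict, mother_sd: dict) -> dict[int, float]:
    """Mean relative L2 distance between parent tensors, grouped per layer.
    Layers where the parents diverge most are candidates for custom ratios."""
    sums: dict[int, float] = defaultdict(float)
    counts: dict[int, int] = defaultdict(int)
    for name, f in father_sd.items():
        m = mother_sd.get(name)
        hit = re.search(r"layers\.(\d+)\.", name)
        if m is None or hit is None:
            continue
        idx = int(hit.group(1))
        f32, m32 = f.float(), m.float()
        sums[idx] += ((f32 - m32).norm() / (f32.norm() + 1e-8)).item()
        counts[idx] += 1
    return {i: sums[i] / counts[i] for i in sorted(sums)}
```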

| Layer Range | Weight (Mother share) | Strategy |
|---|---|---|
| Layer 0-3 | 0.81 | Absorb Mother's embedding-adjacent layers |
| Layer 15-16 | 0.91 | Maximum reinforcement of Mother's creativity/character layers |
| Layer 22-25 | **0.95** | **Maximum absorption of Mother's KOREAN hotspot** |
| Layer 26-27 | 0.40 | Father-priority preservation zone |
| Layer 30-40 | 0.48 | Preserve Father's REASONING/MATH layers |
| Layer 40-42 | 0.62 | Output-layer balance |
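
A minimal sketch of how such a table can drive a per-layer DARE-TIES merge, assuming task vectors taken against the shared gemma-4-E4B-it base; the ranges and the 0.5 default below mirror the recipe above, while every helper name is illustrative rather than the engine's actual code:

```python
import re
import torch

# Mother-side ratio per layer range, transcribed from the table above.
MOTHER_W = [(range(0, 4), 0.81), (range(15, 17), 0.91), (range(22, 26), 0.95),
            (range(26, 28), 0.40), (range(30, 40), 0.48), (range(40, 42), 0.62)]

def mother_weight(param_name: str, default: float = 0.5) -> float:
    hit = re.search(r"layers\.(\d+)\.", param_name)
    if hit:
        idx = int(hit.group(1))
        for rng, w in MOTHER_W:
            if idx in rng:
                return w
    return default

def dare_ties(base, father, mother, w_mother, density=0.8, gen=None):
    """DARE: randomly drop task-vector entries and rescale by 1/density.
    TIES: elect a majority sign per entry and zero out disagreeing deltas."""
    deltas = []
    for parent, w in ((father, 1.0 - w_mother), (mother, w_mother)):
        delta = (parent - base).float()
        keep = torch.rand(delta.shape, generator=gen) < density
        deltas.append(w * delta * keep / density)
    stacked = torch.stack(deltas)
    sign = stacked.sum(dim=0).sign()                   # elected sign per entry
    merged = (stacked * (stacked.sign() == sign)).sum(dim=0)
    return (base.float() + merged).to(base.dtype)
```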

### Parent Comparison

<p align="center">
<img src="parent_comparison.png" alt="Father vs Mother layer-wise importance comparison" width="100%">
</p>

### Evolution Parameters

| Setting | Value |
|---|---|
| Merge method | DARE-TIES (direct PyTorch, no mergekit dependency) |
| Density | 0.800–0.850 |
| Normalization | normalize: true |
| Evolution method | Darwin V6 engine (MRI-guided) |
| Population size | 20 |
| Phase 1 (proxy search) | 200 steps |
| Phase 2 (real merge) | 10 steps, top-5 elite |
| Fitness function | kmmlu_lite (Korean knowledge) |
| Best fitness | **0.8412 (84.12%)** |
| Total time | 45.3 minutes (1× H100) |
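
These settings map naturally onto a CMA-ES loop. A minimal sketch using the `cma` package, where `proxy_fitness` is a dummy stand-in for the fast Phase-1 evaluator (the real engine presumably merges probe tensors and scores a kmmlu_lite slice):

```python
import cma
import numpy as np

def proxy_fitness(genome: np.ndarray) -> float:
    """Dummy stand-in for the Phase-1 proxy evaluator."""
    return float(-np.sum((genome - 0.7) ** 2))

# Genome: one Mother-side ratio per layer, searched inside [0, 1].
es = cma.CMAEvolutionStrategy(42 * [0.5], 0.2,
                              {"popsize": 20, "bounds": [0.0, 1.0]})
for _ in range(200):                                  # Phase 1: proxy search
    genomes = es.ask()
    losses = [-proxy_fitness(g) for g in genomes]     # CMA-ES minimizes
    es.tell(genomes, losses)

# Phase 2: only the top-5 elite genomes get a real merge + benchmark run.
elite = [g for _, g in sorted(zip(losses, genomes), key=lambda t: t[0])][:5]
```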

---

## Darwin V6 vs Conventional Merging

| Capability | mergekit (DARE-TIES) | Darwin V6 |
|---|---|---|
| Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
| Ratio selection | Uniform ratio across all tensors | Independent per-tensor ratios from the MDS diagnostic |
| Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
| Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
| Post-merge validation | Benchmark score only | Layer-by-layer health check of the child against both parents, detecting interference and function loss |
| Search method | Manual tuning | CMA-ES evolution with an adaptive genome |
| Reproducibility | Config file | genome_hash seed guarantees identical output for an identical genome |
| GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds each) → Phase 2 real merge (only the top-k evaluated) |
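
The transplant row deserves a concrete reading: at extreme ratios the losing parent contributes nothing, so the tensor is copied verbatim instead of interpolated. A minimal sketch with the thresholds from the table (`merge_fn` stands for whatever interpolating merge is in use; the function name is illustrative):

```python
def merge_or_transplant(base, father, mother, ratio, merge_fn):
    """ratio is the Mother share: below 0.15 transplant Father's tensor,
    above 0.85 transplant Mother's, otherwise interpolate as usual."""
    if ratio < 0.15:
        return father.clone()    # pure Father: zero interpolation noise
    if ratio > 0.85:
        return mother.clone()    # pure Mother
    return merge_fn(base, father, mother, ratio)
```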

---

## Significance of Second-Generation Evolution

1. **Proof of "Evolution of Evolution"**: The first systematic case of recursive evolution (two or more generations) in the open-source model-merging community. Darwin V6 + MRI automate the entire process.

2. **85% GPQA Diamond at 4.5B parameters**: +26.4%p over the original's 58.6%. This **surpasses the 31B-class gemma-4-31B (84.3%) with only 4.5B parameters**, an exceptional result in parameter efficiency.

3. **Apache 2.0 + edge deployment**: Preserves the Gemma 4 E4B architecture, enabling deployment on the Jetson Orin NX 16GB and on consumer GPUs with no commercial restrictions.

4. **Multimodal preservation**: Father's vision encoder (~150M) and audio encoder (~300M) are frozen during evolution, so image/video/audio input capabilities are maintained.

5. **Community synergy**: The Mother model's creator, DavidAU, is an active contributor on Hugging Face. Darwin-4B-David symbolizes collaborative evolution within the open-source ecosystem.

---

## Model Specifications

| | |
|---|---|
| Architecture | Gemma 4 E4B Dense |
| Effective Parameters | 4.5B (8B total with embeddings) |
| Layers | 42 |
| Sliding Window | 512 tokens |
| Precision | BF16 |
| Context | 128K |
| Vocabulary | 262K |
| Languages | 140+ |
| Thinking | Chain-of-thought via `enable_thinking=True` |
| Vision Encoder | ~150M (image, video) |
| Audio Encoder | ~300M (speech recognition) |
| License | Apache 2.0 |

---

## Usage

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-4B-David", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-4B-David",
    torch_dtype=torch.bfloat16,   # native precision of the checkpoint
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
# enable_thinking=True turns on the chain-of-thought (thinking) mode.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

### Disable Thinking Mode

```python
# Same call as above; enable_thinking=False yields direct answers without a reasoning trace.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```

---

## VRAM Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 full precision | ~16 GB | Baseline requirement |
| NVIDIA RTX 4090 24GB | 24 GB | Single GPU, ample headroom |
| NVIDIA RTX 3090 24GB | 24 GB | Single GPU, comfortable |
| NVIDIA RTX 4080 16GB | 16 GB | Single GPU |
| NVIDIA T4 16GB | 16 GB | Cloud/Colab friendly |
| Jetson Orin NX 16GB | 16 GB | Edge deployment ready |

---

## Darwin Opus Family

| Model | Gen | Architecture | Parameters | Context | Base | GPQA Diamond |
|---|---|---|---|---|---|---|
| **Darwin-4B-David** | **Gen 2** | **Dense (E4B)** | **4.5B** | **128K** | **Darwin-4B-Opus × DECKARD** | **85.0%** |
| Darwin-4B-Opus | Gen 1 | Dense (E4B) | 4.5B | 128K | gemma-4-E4B-it | – |
| Darwin-9B-Opus | Gen 1 | Dense | 9B | 131K | Qwen3.5-9B | – |
| Darwin-31B-Opus | Gen 1 | Dense | 31B | 256K | gemma-4-31B-it | – |
| Darwin-35B-A3B-Opus | Gen 1 | MoE | 35B (3B active) | 256K | Qwen3.5-35B-A3B | 90.0% |

---

## Roadmap

- Full 198-question GPQA Diamond evaluation (maj@8)
- MTI (Minimal Test-Time Intervention) serving, expected additional +9-11% reasoning accuracy
- GRPO + TinyLoRA reinforcement learning
- SSD self-distillation
- Cross-architecture breeding research (Transformer × Mamba FFN transplantation)

---

## References

- DARE: Yu et al., 2023 (https://arxiv.org/abs/2311.03099); TIES: Yadav et al., 2023 (https://arxiv.org/abs/2306.01708). Both re-implemented directly, not library-dependent.
- Darwin V6 Engine: https://huggingface.co/spaces/ginigen-ai/DARWIN-V5-BACKUP
- FINAL Bench: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
- DavidAU DECKARD Series: https://huggingface.co/DavidAU
- MTI: Minimal Test-Time Intervention (arXiv:2510.13940)

---

## Built By

| | |
|---|---|
| Developer | VIDRAFT |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Generation | **Generation 2**, the first in Darwin history |
| Architecture | Gemma-4-E4B Dense |
| License | Apache 2.0 |

---

## Citation

```bibtex
@misc{vidraft_darwin_4b_david_2026,
  title        = {Darwin-4B-David: First Second-Generation Evolutionary Merge Model},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-4B-David}},
  note         = {Recursive evolution achieves 85\% GPQA Diamond with 4.5B parameters}
}
```