Training-Free Reasoning at 88.89% on GPQA Diamond: How Darwin Family Hit Frontier Scores Without a Single Gradient Step
TL;DR — VIDRAFT's Darwin Family evolves frontier-level reasoning LLMs by recombining the weight spaces of existing checkpoints, with zero gradient-based training. Our flagship, Darwin-28B-Opus, reaches 88.89% on GPQA Diamond. The paper went live on HuggingFace Daily Papers yesterday and is currently #3.
🔗 Paper: https://huggingface.co/papers/2605.14386
🔗 arXiv: https://arxiv.org/abs/2605.14386
🔗 Model: https://huggingface.co/FINAL-Bench/Darwin-28B-Opus
1. The Problem: Post-Training Has Become a Tax
Since 2024, frontier LLM capability has been decided less by pretraining and more by post-training: RLHF, DPO, GRPO, synthetic-data SFT, reasoning distillation, the works. Every quarter brings a better recipe.
Every recipe has the same problem: the bill.
A single B200 node running for a month rivals an entire research lab's annual compute budget. Which prompts a natural question:
Hundreds of strong open-source LLMs already exist. What if the capabilities we want are already encoded in their weights — just waiting to be recombined?
Darwin Family is our answer.
2. The Core Idea: Don't Train. Diagnose and Breed.
Prior model merging research mostly falls into two camps:
- Heuristic-based (TIES, DARE, Model Soups) — simple but expressively limited (a minimal sketch of this camp follows the list)
- Search-based (Sakana's EvoMerge and descendants) — powerful but the search space is huge and expensive to explore
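To make the contrast concrete, here is a minimal sketch of the heuristic camp: a uniform "model soup" that averages checkpoints key by key. This is illustrative baseline code of ours, not anything from the Darwin paper, and it assumes all checkpoints share the same architecture and parameter names.

```python
# Minimal sketch of the heuristic camp: a uniform "model soup".
# Illustrative only; not code from the Darwin paper.
import torch

def uniform_soup(state_dicts):
    """Average every parameter tensor across a list of checkpoint state_dicts."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Usage (paths are placeholders):
# soup = uniform_soup([torch.load(p, map_location="cpu") for p in ["a.pt", "b.pt"]])
```

One global recipe applied uniformly to every parameter is exactly the ceiling Darwin is trying to break.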
Darwin Family breaks both ceilings simultaneously with three mechanisms.
① 14-Dimensional Adaptive Merge Genome
Prior evolutionary merging typically operated on a narrow space — usually "layer-wise mixing ratios." Darwin defines a 14-dimensional adaptive genome that allows recombination at the level of individual components (Attention / FFN / MLP / LayerNorm / Embedding) and blocks.
The expressive jump is large enough to capture capabilities that uniform layer-wise mixing simply cannot reach.
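As a rough illustration of what a component-level genome buys over a single per-layer ratio, here is a hedged sketch. The 14 gene names and the name-based routing below are placeholders of ours, not the paper's actual parameterization; the point is only that mixing ratios attach to component types and block regions rather than to whole layers.

```python
# Hedged sketch of a component-level merge genome. The 14 gene names and the
# crude name-based routing are illustrative placeholders, not the paper's design.
from dataclasses import dataclass, field

GENES = ["attn.q", "attn.k", "attn.v", "attn.o",
         "ffn.up", "ffn.gate", "ffn.down",
         "norm.attn", "norm.ffn",
         "embed", "lm_head",
         "block.early", "block.mid", "block.late"]  # 14 illustrative dimensions

@dataclass
class MergeGenome:
    # One mixing ratio per gene: 0.0 = keep parent A, 1.0 = keep parent B.
    genes: dict = field(default_factory=lambda: {g: 0.5 for g in GENES})

def merge_param(param_name, a, b, genome):
    """Interpolate one parameter tensor using the first gene whose prefix it matches."""
    for gene, ratio in genome.genes.items():
        if gene.split(".")[0] in param_name:
            return (1.0 - ratio) * a + ratio * b
    return 0.5 * (a + b)  # fallback: plain average for unmatched parameters
```

With a genome like this, a child can, for example, inherit attention projections mostly from one parent and FFN blocks mostly from the other within the same layer, which uniform layer-wise mixing cannot express.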
② MRI-Trust Fusion
The most important contribution of the paper.
We compute an MRI (Model Reasoning Importance) signal that diagnoses how much each layer contributes to a given reasoning capability. We then fuse this diagnostic with evolutionary search through a learnable trust parameter.
The intuition: trust the diagnostic too much and you collapse the search space; ignore it and evolution wastes generations on bad candidates. Darwin learns this balance directly from data, eliminating the manual tuning that plagued prior merge methods.
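The post describes the mechanism at the level of intuition, so the following is only a sketch of that intuition: treat the per-layer MRI scores as a prior over mixing ratios and let the trust parameter interpolate between following that prior and exploring randomly. The function name, the probe-set framing, and the blending rule are our assumptions, not the paper's implementation.

```python
# Hedged sketch of the MRI-Trust intuition, not the paper's implementation.
# `mri` stands in for per-layer reasoning-importance scores (e.g. measured on a
# small probe set); `tau` is the learnable trust parameter in [0, 1].
import numpy as np

def propose_layer_ratios(mri, tau, rng):
    """Blend diagnostic-guided ratios with random exploration.

    tau -> 1.0: trust the MRI diagnostic (narrow, guided search)
    tau -> 0.0: ignore it (broad, purely evolutionary search)
    """
    mri = np.asarray(mri, dtype=float)
    guided = mri / mri.max()            # scale so the most important layer gets ratio 1.0
    explore = rng.random(mri.shape[0])  # unguided random candidate in [0, 1)
    return tau * guided + (1.0 - tau) * explore

rng = np.random.default_rng(0)
candidate = propose_layer_ratios(mri=[0.2, 0.9, 0.4], tau=0.7, rng=rng)
```

The key difference from a hand-tuned prior is that tau itself is learned, so the framework decides how much to lean on the diagnostic rather than leaving that dial to the practitioner.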
③ Architecture Mapper
The most ambitious piece. Darwin proposes a mapping module that lets heterogeneous architectures breed with each other — including Transformer attention layers and Mamba-style SSM layers.
Attention × SSM crossover. It actually works.
This opens a path most existing model-merge tools simply do not have: combining the strengths of two architectural families that previously lived in separate ecosystems.
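The mapper is described here only at a high level, so the code below is just one plausible way to picture the alignment step: fit a linear map between the two parents' hidden states on shared calibration inputs, then use it to express one parent's features in the other's basis before breeding. This is our assumption about the mechanism, not the paper's actual module.

```python
# Hedged sketch: one plausible alignment step for heterogeneous parents.
# An assumption about how an architecture mapper *could* work, not the paper's module.
import torch

def fit_linear_map(h_src, h_tgt):
    """Least-squares W with h_src @ W ≈ h_tgt.

    h_src: [tokens, d_src] hidden states from, e.g., a Mamba-style SSM block
    h_tgt: [tokens, d_tgt] hidden states from, e.g., a Transformer attention block
    """
    return torch.linalg.lstsq(h_src, h_tgt).solution  # shape [d_src, d_tgt]

# Toy example with random activations standing in for real calibration data.
h_ssm = torch.randn(4096, 1024)    # source parent's residual stream
h_attn = torch.randn(4096, 1280)   # target parent's residual stream
W = fit_linear_map(h_ssm, h_attn)
mapped = h_ssm @ W                 # SSM features expressed in the attention parent's basis
```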
3. Results: Zero Training, Frontier Reasoning
Darwin-28B-Opus, our flagship, scores 88.89% on GPQA Diamond — a benchmark explicitly designed to be difficult to game with surface pattern matching.
What makes the number meaningful:
- Zero gradient-based training steps. Not pretraining, not SFT, not RL.
- The model outperforms its own foundation parent, which was fully trained.
- The gain is not a one-off. Across scales from 4B to 35B, Darwin variants consistently improve over their parents.
- Recursive multi-generation evolution is stable — children become parents of the next generation without collapse.
4. Why This Matters
Three angles worth flagging.
Cost
A team without access to a large B200/H200 cluster can now produce a SOTA reasoning model. The capital barrier to producing — not just using — frontier-quality reasoning LLMs drops by orders of magnitude.
Scientific
Darwin Family makes a quantitative case for a hypothesis the field has flirted with for a while: open LLM weight spaces contain substantially more latent capability than we are currently extracting. Recombination, not just training, is now a credible lever.
Ecosystem
Every additional strong open-source parent expands Darwin's breeding space multiplicatively. The value of the framework compounds with the open-source ecosystem itself.
5. What's Next
Darwin Family is the LLM-layer foundation of VIDRAFT's broader Proto-AGI research direction. Active follow-ups include:
- NEG (Native Entropy Gating) — pushing the same training-free philosophy into the token-generation loop
- MoE → Dense SVD transplantation — moving capability between architectural families at the weight level
- MTQ + MFPx quantization — preserving attention while compressing FFN/MLP
All of them rest on the same thesis: extract more from what already exists, before you train more.
6. Links
- 📄 HuggingFace Paper: https://huggingface.co/papers/2605.14386
- 📄 arXiv: https://arxiv.org/abs/2605.14386
- 🤗 Model: https://huggingface.co/FINAL-Bench/Darwin-28B-Opus
- 🏢 Organization: VIDRAFT
Authors: Taebong Kim · Youngsik Hong · Minsik Kim · Sunyoung Choi · Jaewon Jang · Junghoon Shin · Minseo Kim
Thanks to everyone in the Daily Papers community who upvoted, commented, and challenged us. The next generation of Darwins will be better because of you.