Training-Free Reasoning at 88.89% on GPQA Diamond: How Darwin Family Hit Frontier Scores Without a Single Gradient Step
TL;DR — VIDRAFT's Darwin Family evolves frontier-level reasoning LLMs by recombining the weight spaces of existing checkpoints, with zero gradient-based training. Our flagship, Darwin-28B-Opus, reaches 88.89% on GPQA Diamond. The paper went live on HuggingFace Daily Papers yesterday and is currently #3.
🔗 Paper: https://huggingface.co/papers/2605.14386
🔗 arXiv: https://arxiv.org/abs/2605.14386
🔗 Model: https://huggingface.co/FINAL-Bench/Darwin-28B-Opus
1. The Problem: Post-Training Has Become a Tax
Since 2024, frontier LLM capability has been decided less by pretraining and more by post-training: RLHF, DPO, GRPO, synthetic-data SFT, reasoning distillation, the works. Every quarter brings a better recipe.
Every recipe has the same problem: the bill.
A single B200 node running for a month rivals an entire research lab's annual compute budget. Which prompts a natural question:
Hundreds of strong open-source LLMs already exist. What if the capabilities we want are already encoded in their weights — just waiting to be recombined?
Darwin Family is our answer.
2. The Core Idea: Don't Train. Diagnose and Breed.
Prior model merging research mostly falls into two camps:
- Heuristic-based (TIES, DARE, Model Soups) — simple but expressively limited (a minimal sketch of this camp follows the list)
- Search-based (Sakana's EvoMerge and descendants) — powerful but the search space is huge and expensive to explore
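To make the contrast concrete, here is a minimal sketch of the heuristic camp: a uniform "model soup" that averages checkpoints key by key. This is illustrative baseline code of ours, not anything from the Darwin paper, and it assumes all checkpoints share the same architecture and parameter names.

```python
# Minimal sketch of the heuristic camp: a uniform "model soup".
# Illustrative only; not code from the Darwin paper.
import torch

def uniform_soup(state_dicts):
    """Average every parameter tensor across a list of checkpoint state_dicts."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Usage (paths are placeholders):
# soup = uniform_soup([torch.load(p, map_location="cpu") for p in ["a.pt", "b.pt"]])
```

One global recipe applied uniformly to every parameter is exactly the ceiling Darwin is trying to break.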
Darwin Family breaks both ceilings simultaneously with three mechanisms.
① 14-Dimensional Adaptive Merge Genome
Prior evolutionary merging typically operated on a narrow space — usually "layer-wise mixing ratios." Darwin defines a 14-dimensional adaptive genome that allows recombination at the level of individual components (Attention / FFN / MLP / LayerNorm / Embedding) and blocks.
The expressive jump is large enough to capture capabilities that uniform layer-wise mixing simply cannot reach.
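As a rough illustration of what a component-level genome buys over a single per-layer ratio, here is a hedged sketch. The 14 gene names and the name-based routing below are placeholders of ours, not the paper's actual parameterization; the point is only that mixing ratios attach to component types and block regions rather than to whole layers.

```python
# Hedged sketch of a component-level merge genome. The 14 gene names and the
# crude name-based routing are illustrative placeholders, not the paper's design.
from dataclasses import dataclass, field

GENES = ["attn.q", "attn.k", "attn.v", "attn.o",
         "ffn.up", "ffn.gate", "ffn.down",
         "norm.attn", "norm.ffn",
         "embed", "lm_head",
         "block.early", "block.mid", "block.late"]  # 14 illustrative dimensions

@dataclass
class MergeGenome:
    # One mixing ratio per gene: 0.0 = keep parent A, 1.0 = keep parent B.
    genes: dict = field(default_factory=lambda: {g: 0.5 for g in GENES})

def merge_param(param_name, a, b, genome):
    """Interpolate one parameter tensor using the first gene whose prefix it matches."""
    for gene, ratio in genome.genes.items():
        if gene.split(".")[0] in param_name:
            return (1.0 - ratio) * a + ratio * b
    return 0.5 * (a + b)  # fallback: plain average for unmatched parameters
```

With a genome like this, a child can, for example, inherit attention projections mostly from one parent and FFN blocks mostly from the other within the same layer, which uniform layer-wise mixing cannot express.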
② MRI-Trust Fusion
The most important contribution of the paper.
We compute an MRI (Model Reasoning Importance) signal that diagnoses how much each layer contributes to a given reasoning capability. We then fuse this diagnostic with evolutionary search through a learnable trust parameter.
The intuition: trust the diagnostic too much and you collapse the search space; ignore it and evolution wastes generations on bad candidates. Darwin learns this balance directly from data, eliminating the manual tuning that plagued prior merge methods.
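The post describes the mechanism at the level of intuition, so the following is only a sketch of that intuition: treat the per-layer MRI scores as a prior over mixing ratios and let the trust parameter interpolate between following that prior and exploring randomly. The function name, the probe-set framing, and the blending rule are our assumptions, not the paper's implementation.

```python
# Hedged sketch of the MRI-Trust intuition, not the paper's implementation.
# `mri` stands in for per-layer reasoning-importance scores (e.g. measured on a
# small probe set); `tau` is the learnable trust parameter in [0, 1].
import numpy as np

def propose_layer_ratios(mri, tau, rng):
    """Blend diagnostic-guided ratios with random exploration.

    tau -> 1.0: trust the MRI diagnostic (narrow, guided search)
    tau -> 0.0: ignore it (broad, purely evolutionary search)
    """
    mri = np.asarray(mri, dtype=float)
    guided = mri / mri.max()            # scale so the most important layer gets ratio 1.0
    explore = rng.random(mri.shape[0])  # unguided random candidate in [0, 1)
    return tau * guided + (1.0 - tau) * explore

rng = np.random.default_rng(0)
candidate = propose_layer_ratios(mri=[0.2, 0.9, 0.4], tau=0.7, rng=rng)
```

The key difference from a hand-tuned prior is that tau itself is learned, so the framework decides how much to lean on the diagnostic rather than leaving that dial to the practitioner.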
③ Architecture Mapper
The most ambitious piece. Darwin proposes a mapping module that lets heterogeneous architectures breed with each other — including Transformer attention layers and Mamba-style SSM layers.
Attention × SSM crossover. It actually works.
This opens a path most existing model-merge tools simply do not have: combining the strengths of two architectural families that previously lived in separate ecosystems.
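The mapper is described here only at a high level, so the code below is just one plausible way to picture the alignment step: fit a linear map between the two parents' hidden states on shared calibration inputs, then use it to express one parent's features in the other's basis before breeding. This is our assumption about the mechanism, not the paper's actual module.

```python
# Hedged sketch: one plausible alignment step for heterogeneous parents.
# An assumption about how an architecture mapper *could* work, not the paper's module.
import torch

def fit_linear_map(h_src, h_tgt):
    """Least-squares W with h_src @ W ≈ h_tgt.

    h_src: [tokens, d_src] hidden states from, e.g., a Mamba-style SSM block
    h_tgt: [tokens, d_tgt] hidden states from, e.g., a Transformer attention block
    """
    return torch.linalg.lstsq(h_src, h_tgt).solution  # shape [d_src, d_tgt]

# Toy example with random activations standing in for real calibration data.
h_ssm = torch.randn(4096, 1024)    # source parent's residual stream
h_attn = torch.randn(4096, 1280)   # target parent's residual stream
W = fit_linear_map(h_ssm, h_attn)
mapped = h_ssm @ W                 # SSM features expressed in the attention parent's basis
```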
3. Results: Zero Training, Frontier Reasoning
Darwin-28B-Opus, our flagship, scores 88.89% on GPQA Diamond — a benchmark explicitly designed to be difficult to game with surface pattern matching.
What makes the number meaningful:
- Zero gradient-based training steps. Not pretraining, not SFT, not RL.
- The model outperforms its own foundation parent, which was fully trained.
- The gain is not a one-off. Across scales from 4B to 35B, Darwin variants consistently improve over their parents.
- Recursive multi-generation evolution is stable — children become parents of the next generation without collapse.
4. Why This Matters
Three angles worth flagging.
Cost
A team without access to a large B200/H200 cluster can now produce a SOTA reasoning model. The capital barrier to producing — not just using — frontier-quality reasoning LLMs drops by orders of magnitude.
Scientific
Darwin Family makes a quantitative case for a hypothesis the field has flirted with for a while: open LLM weight spaces contain substantially more latent capability than we are currently extracting. Recombination, not just training, is now a credible lever.
Ecosystem
Every additional strong open-source parent expands Darwin's breeding space multiplicatively. The value of the framework compounds with the open-source ecosystem itself.
5. What's Next
Darwin Family is the LLM-layer foundation of VIDRAFT's broader Proto-AGI research direction. Active follow-ups include:
- NEG (Native Entropy Gating) — pushing the same training-free philosophy into the token-generation loop
- MoE → Dense SVD transplantation — moving capability between architectural families at the weight level
- MTQ + MFPx quantization — preserving attention while compressing FFN/MLP
All of them rest on the same thesis: extract more from what already exists, before you train more.
6. Links
- 📄 HuggingFace Paper: https://huggingface.co/papers/2605.14386
- 📄 arXiv: https://arxiv.org/abs/2605.14386
- 🤗 Model: https://huggingface.co/FINAL-Bench/Darwin-28B-Opus
- 🏢 Organization: VIDRAFT
Authors: Taebong Kim · Youngsik Hong · Minsik Kim · Sunyoung Choi · Jaewon Jang · Junghoon Shin · Minseo Kim
Thanks to everyone in the Daily Papers community who upvoted, commented, and challenged us. The next generation of Darwins will be better because of you.