---
license: apache-2.0
base_model:
- Qwen/Qwen3.5-27B
- Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
tags:
- darwin-v6
- evolutionary-merge
- mri-guided
- dare-ties
- qwen3.5
- reasoning
- thinking
- hybrid-vigor
- proto-agi
- vidraft
- eval-results
pipeline_tag: text-generation
library_name: transformers
language:
- en
- ko
- ja
- zh
- multilingual
---

# Darwin-27B-Opus: Surpassing the Foundation Model Without Training (86.9% on GPQA Diamond, Ranked 5th Globally)

<p align="center">
  <img src="info.png" alt="Darwin-27B-Opus Overview" width="100%">
</p>

<p align="center">
  <a href="https://huggingface.co/FINAL-Bench/Darwin-27B-Opus"><img src="https://img.shields.io/badge/GPQA_Diamond-86.9%25_World_5th-gold?style=for-the-badge" alt="GPQA"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-27B-KR"><img src="https://img.shields.io/badge/CLIcK-75.59%25_Hybrid_Vigor-blue?style=for-the-badge" alt="CLIcK"></a>
</p>

<p align="center">
  <a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Genesis"><img src="https://img.shields.io/badge/Model-Darwin--4B--Genesis-blue?style=for-the-badge" alt="Genesis"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="9B"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-31B-Opus"><img src="https://img.shields.io/badge/Model-Darwin--31B--Opus-blue?style=for-the-badge" alt="31B"></a>
  <a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/Model-Darwin--35B--A3B--Opus-blue?style=for-the-badge" alt="35B"></a>
</p>

<p align="center">
  <a href="https://huggingface.co/collections/FINAL-Bench/darwin-family"><img src="https://img.shields.io/badge/Darwin_Family-Collection-green?style=for-the-badge" alt="Family"></a>
  <a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
</p>

> Qwen3.5-27B Dense | 27B Params | Thinking Mode | 262K Context | 201 Languages | BF16 | Apache 2.0
> **Zero training. Zero data. Single GPU. 2 hours. Surpassed the foundation model.**

---

## Abstract

**Darwin-27B-Opus** is a 27-billion-parameter language model produced entirely through evolutionary crossbreeding of pretrained models, requiring **zero additional training, zero data, and a single GPU**. On the GPQA Diamond benchmark, a graduate-level scientific reasoning evaluation comprising 198 expert-crafted questions in physics, chemistry, and biology, Darwin-27B-Opus achieves **86.9%**, surpassing its progenitor Qwen3.5-27B (85.5%) by **+1.4 percentage points** and securing **5th place** on the Hugging Face GPQA leaderboard.

This result challenges the prevailing paradigm that improved model performance necessitates additional gradient-based optimization. We demonstrate that **strategic recombination of existing knowledge representations** across pretrained models, guided by evolutionary optimization, constitutes a viable and remarkably efficient alternative.

---

## GPQA Diamond Leaderboard (April 12, 2026)

| Rank | Model | Parameters | GPQA Diamond |
|---|---|---|---|
| 1 | TNSA/NGen-4-Pro | – | 91.1% |
| 2 | TNSA/NGen-4 | – | 90.1% |
| 3 | Qwen/Qwen3.5-397B-A17B | 397B | 88.4% |
| 4 | moonshotai/Kimi-K2.5 | – | 87.6% |
| **5** | **FINAL-Bench/Darwin-27B-Opus** | **27B** | **86.9%** |
| 6 | Qwen/Qwen3.5-122B-A10B | 122B | 86.6% |
| 7 | zai-org/GLM-5.1 | 744B | 86.2% |
| 8 | zai-org/GLM-5 | 744B | 86.0% |
| 9 | zai-org/GLM-4.7 | – | 85.7% |
| 10 | Qwen/Qwen3.5-27B | 27B | 85.5% |

A 27B model, produced without any training, surpasses GLM-5.1 (744B), Qwen3.5-122B (122B), and its own progenitor Qwen3.5-27B. This represents a **parameter-efficiency ratio exceeding 27×** relative to GLM-5.1.

---

## What Is Darwin?

Darwin is an evolutionary model-breeding engine that crossbreeds the **FFN (feed-forward network) knowledge layers** of pretrained AI models to automatically produce offspring that surpass both parents, with zero additional training.

Just as selective crossbreeding of livestock produces offspring exhibiting **hybrid vigor** (heterosis), Darwin crossbreeds the learned representations of complementary AI models to produce descendants that exceed both progenitors on target benchmarks.

### Core Principle: FFN = Knowledge, Attention = Reasoning

Modern transformer-based language models consist of two principal computational modules:

- **Attention**: orchestrates information routing and constructs reasoning chains. The model's **inferential architecture**.
- **FFN**: stores factual knowledge and encodes learned patterns. The model's **knowledge repository**.

Darwin exploits this decomposition:

- **FFN layers are transplantable** between compatible models, enabling knowledge transfer without disrupting reasoning.
- **Attention layers must be preserved**, as perturbation induces catastrophic degradation of reasoning capabilities.

This principle is supported by recent theoretical work (arXiv:2501.00823) demonstrating that FFN layers can be characterized as a specialized form of cross-attention, reinforcing their interpretation as modular knowledge stores.
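
The decomposition above is what makes merging tractable: a child can be assembled by blending only the FFN tensors while copying attention from one parent. A minimal sketch, assuming Qwen-style parameter naming and using plain Python lists in place of real weight tensors (this is illustrative, not the Darwin V6 merge code):

```python
def crossbreed(father: dict, mother: dict, ffn_ratios: list) -> dict:
    """Blend per-layer FFN ("mlp") weights; copy everything else from the father."""
    child = {}
    for name, w in father.items():
        if ".mlp." in name:
            layer = int(name.split(".")[2])   # "model.layers.<i>.mlp..."
            r = ffn_ratios[layer]             # mother's share for this layer
            child[name] = [(1 - r) * a + r * b for a, b in zip(w, mother[name])]
        else:
            child[name] = list(w)             # attention, embeddings: father only
    return child

# Two hypothetical 2-layer "parents" with tiny 1-D weights for illustration.
father = {
    "model.layers.0.mlp.up_proj": [1.0, 1.0],
    "model.layers.0.self_attn.q_proj": [5.0, 5.0],
    "model.layers.1.mlp.up_proj": [2.0, 2.0],
}
mother = {
    "model.layers.0.mlp.up_proj": [3.0, 3.0],
    "model.layers.0.self_attn.q_proj": [9.0, 9.0],
    "model.layers.1.mlp.up_proj": [4.0, 4.0],
}
child = crossbreed(father, mother, ffn_ratios=[0.5, 0.25])
```

With `ffn_ratios=[0.5, 0.25]`, layer 0's FFN is an even blend while layer 1 leans toward the father; the attention tensors are the father's, untouched.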

---

## Parent Models

| Role | Model | Contribution |
|---|---|---|
| **Father (Structure)** | [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) | Foundation architecture, native reasoning, 201-language support |
| **Mother (Knowledge)** | [Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled) | Claude 4.6 Opus structured reasoning patterns via SFT distillation |

Both parents share identical architecture (hidden_size=4096, intermediate_size=17408, 64 layers), ensuring **100% structural compatibility** for FFN crossbreeding.
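
Structural compatibility can be checked mechanically before breeding. A small sketch against hypothetical config dicts; the field list is an illustrative subset of what a real pre-merge check would cover, not Darwin's actual gate:

```python
# Fields that must agree exactly for FFN tensors to be interchangeable.
MUST_MATCH = ("hidden_size", "intermediate_size", "num_hidden_layers")

def ffn_compatible(cfg_a: dict, cfg_b: dict) -> bool:
    """True if the parents' core dimensions agree on every required field."""
    return all(cfg_a.get(k) == cfg_b.get(k) for k in MUST_MATCH)

father_cfg = {"hidden_size": 4096, "intermediate_size": 17408, "num_hidden_layers": 64}
mother_cfg = dict(father_cfg)  # SFT-distilled from the same base, so identical dims

print(ffn_compatible(father_cfg, mother_cfg))           # prints True
print(ffn_compatible(father_cfg, {"hidden_size": 5120}))  # prints False
```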

### Model MRI Diagnostic Scan

<p align="center">
  <img src="s1.png" alt="Father (Qwen3.5-27B) MRI Scan" width="48%">
  <img src="s2.png" alt="Mother (Claude-4.6-Opus-Reasoning-Distilled) MRI Scan" width="48%">
</p>

**Left, Father (Qwen3.5-27B):** Broad, balanced activation across reasoning and knowledge domains, with strong mathematical and scientific reasoning signatures in the deeper layers.
**Right, Mother (Claude-4.6-Opus-Reasoning-Distilled):** Intensified reasoning concentration from Claude distillation, with enhanced structured chain-of-thought patterns and distinctive reasoning hotspots visible in the mid-to-late layers.

---

## Evolution Process

1. **Model MRI Scan**: Darwin V6 performs a comprehensive diagnostic analysis of both parents, profiling each layer's functional specialization across cognitive domains (reasoning, knowledge, language, mathematics).

2. **CMA-ES Evolutionary Search**: Covariance Matrix Adaptation Evolution Strategy optimizes per-block crossbreeding ratios across all 64 layers, exploring a high-dimensional genome space that no human practitioner could navigate through manual experimentation.

3. **Health Check**: automated post-merge validation ensures the offspring model functions correctly.

**Total compute: H100 × 1, approximately 2 hours.**
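
The search in step 2 can be sketched with a bare-bones (μ, λ) evolution strategy standing in for full CMA-ES (which additionally adapts a covariance matrix and step size). The genome is one blend ratio per layer; the fitness function here is a synthetic stand-in, since the real objective, merging and benchmarking a 27B child per candidate, cannot be inlined:

```python
import random

random.seed(0)

N_LAYERS = 64                 # genome: one mother-share blend ratio per layer
POP, ELITE, GENS, SIGMA = 16, 4, 30, 0.1

# Synthetic stand-in for "merge with these ratios, then score the child".
TARGET = [0.3 + 0.4 * i / (N_LAYERS - 1) for i in range(N_LAYERS)]

def fitness(genome):
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

mean = [0.5] * N_LAYERS       # start from a naive uniform 50:50 blend
for _ in range(GENS):
    # Sample a population around the current mean, clipped to valid ratios.
    pop = [[min(1.0, max(0.0, m + random.gauss(0, SIGMA))) for m in mean]
           for _ in range(POP)]
    pop.sort(key=fitness, reverse=True)
    # Recombine the elite into the next search mean (full CMA-ES would also
    # update the covariance matrix and step size here).
    mean = [sum(ind[i] for ind in pop[:ELITE]) / ELITE for i in range(N_LAYERS)]
```

After a few dozen generations the search mean drifts away from the naive 50:50 blend toward the per-layer optimum, which is the same dynamic that lets CMA-ES outperform manual uniform merging.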

### Parent Layer-wise Comparison

<p align="center">
  <img src="parent_comparison.png" alt="Father vs Mother layer-wise importance comparison" width="100%">
</p>

This visualization illustrates the per-layer divergence between the father and mother models. Regions of high divergence represent layers where CMA-ES must make critical allocation decisions, balancing the father's reasoning architecture against the mother's distilled knowledge patterns.

---

## GPQA Diamond Evaluation

### Methodology

We employed a two-pass evaluation protocol:

**Pass 1: Greedy Baseline**
- All 198 questions, deterministic decoding (do_sample=False)
- Epoch AI standard prompt format
- Result: **148/198 = 74.7%**

**Pass 2: Selective Retry with Verification**
- The 50 incorrectly answered questions only
- 8 independent stochastic generations per question (maj@8, temperature=0.7)
- Contested results (vote margin ≤ 1) trigger a **verification round**: the top-2 candidates are presented for comparative analysis via greedy decoding
- Result: **24 additional corrections**
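
The retry pass can be sketched as follows. This is an illustrative reconstruction of the voting-plus-verification logic described above, not the actual evaluation harness; `maj_with_verification` and the `verify` callable (standing in for the greedy comparative-analysis pass) are hypothetical names:

```python
from collections import Counter

def maj_with_verification(samples, verify=None, margin=1):
    """Majority vote over stochastic samples; on a close vote, defer to a
    verification round over the top-2 candidates.

    `samples` plays the role of the 8 temperature-0.7 generations.
    """
    counts = Counter(samples).most_common()
    if len(counts) == 1:
        return counts[0][0]                  # unanimous
    (top, n1), (second, n2) = counts[0], counts[1]
    if n1 - n2 <= margin and verify is not None:
        return verify((top, second))         # contested: verification round
    return top                               # clear majority wins outright

print(maj_with_verification(list("AAAAABBC")))                         # prints A
print(maj_with_verification(list("AAAABBBC"), verify=lambda t: t[1]))  # prints B
```

In the first call the 5-vs-2 margin is decisive, so the majority answer stands; in the second, the 4-vs-3 margin triggers the verification round, which here picks the runner-up.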

### Results by Shard

| Shard | Greedy | After Retry | Flipped | Gain |
|---|---|---|---|---|
| Shard 0 | 48/66 (72.7%) | 58/66 (87.9%) | 10/18 | +15.2pp |
| Shard 1 | 49/66 (74.2%) | 57/66 (86.4%) | 8/17 | +12.1pp |
| Shard 2 | 51/66 (77.3%) | 57/66 (86.4%) | 6/15 | +9.1pp |
| **Total** | **148/198 (74.7%)** | **172/198 (86.9%)** | **24/50** | **+12.1pp** |

### Verification Round Efficacy

Of the 19 questions that triggered verification (vote margin ≤ 1), **12 were successfully corrected** (a 63.2% success rate). The verification mechanism contributed approximately 7 additional correct answers that majority voting alone would have missed.

---

## Hybrid Vigor: CLIcK Korean Benchmark

To validate hybrid vigor across languages, we evaluated a second-generation offspring, **Darwin-27B-KR**, bred from Darwin-27B-Opus (father) and a Korean-specialized model (mother).

### Four-Generation Comparison (200 questions, 0-shot)

| Generation | Model | CLIcK Overall |
|---|---|---|
| Gen 0 (Ancestor) | Qwen3.5-27B | 69.52% |
| Gen 1 (Father) | Darwin-27B-Opus | 70.19% |
| – (Mother) | Korean-specialized SFT | 74.74% |
| **Gen 2 (Child)** | **Darwin-27B-KR** | **75.59%** |

**The child surpasses both parents**, winning 7 of the 11 evaluation categories. Largest gains: Law (+9.5pp), Functional Language (+7.6pp), History (+6.5pp).

Two generations of zero-training evolution achieved **+6.07 percentage points** over the original Qwen3.5-27B foundation model.

---

## Computational Economics

| | Darwin-27B-Opus | Conventional Fine-Tuning |
|---|---|---|
| GPU | H100 × 1 | H100 × 8–64 |
| Time | ~2 hours | Days to weeks |
| Training tokens | 0 | 10⁶–10⁹ |
| Gradient computation | None | Full backpropagation |
| Output model size | Identical to parent | Identical to parent |
| Inference overhead | Zero | Zero |

The resultant model is architecturally indistinguishable from its progenitor: identical parameter count, identical inference speed, identical deployment requirements.

---

## Model Specifications

| | |
|---|---|
| Architecture | Qwen3.5 Dense (GatedDeltaNet) |
| Parameters | 27B |
| Hidden Size | 4096 |
| Intermediate Size | 17408 |
| Layers | 64 |
| Context Length | 262,144 (extensible to 1M via YaRN) |
| Precision | BF16 |
| Languages | 201 |
| Thinking Mode | Enabled |
| License | Apache 2.0 |

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "FINAL-Bench/Darwin-27B-Opus", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-27B-Opus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

---

## VRAM Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 Full Precision | ~55 GB | H100 single GPU |
| NVIDIA H100 80GB | 80 GB | Very comfortable |
| 2× RTX 4090 (24 GB each) | 48 GB | Tensor parallel |
| 4-bit Quantized | ~16 GB | RTX 4090 single GPU |
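
The table follows from a simple bytes-per-parameter estimate. The weights-only figures are exact arithmetic; the published "~55 GB" and "~16 GB" add a rough margin for activations and KV cache, which the optional `overhead` factor below approximates as an assumption:

```python
def weight_gb(n_params_billion: float, bytes_per_param: float,
              overhead: float = 0.0) -> float:
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes), weights only by default."""
    return n_params_billion * bytes_per_param * (1.0 + overhead)

print(weight_gb(27, 2.0))   # BF16: 54.0 GB of weights, hence "~55 GB" in practice
print(weight_gb(27, 0.5))   # 4-bit: 13.5 GB of weights, hence "~16 GB" with overhead
```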

---

## Darwin Model Family

| Model | Gen | Parameters | GPQA Diamond | CLIcK | Specialty |
|---|---|---|---|---|---|
| **Darwin-27B-Opus** | **Gen 1** | **27B** | **86.9%** | 70.19% | Claude reasoning |
| Darwin-27B-KR | Gen 2 | 27B | – | **75.59%** | Korean hybrid vigor |
| Darwin-4B-Genesis | Gen 3 | 4B | ~60% | 92% | Cross-architecture breeding |
| Darwin-31B-Opus | Gen 1 | 31B | 66% | – | Gemma4 reasoning |
| Darwin-35B-A3B-Opus | Gen 1 | 35B MoE | 90% | – | MoE reasoning |
| Darwin-9B-Opus | Gen 1 | 9B | – | – | Edge deployment |

---

## Key Findings

1. **FFN = Knowledge, Attention = Reasoning**: empirically validated through ablation; blending attention layers causes GPQA to collapse (60% → 10%), while FFN crossbreeding consistently enhances performance.

2. **Hybrid vigor scales with model size**: confirmed at 4B (Genesis, CLIcK 92%) and 27B (KR, CLIcK 75.59%).

3. **Zero-training evolution is recursive**: Gen 0 → Gen 1 → Gen 2, with each generation improving without gradient updates.

4. **CMA-ES discovers what humans cannot**: manual 50:50 blending degrades performance, while evolutionary search finds non-obvious optimal ratios.

5. **Verification rounds recover contested answers**: a 63.2% success rate on close-vote questions, contributing ~7 additional correct answers.

---

## Roadmap

- [ ] K-AI Leaderboard official submission (Korean government-certified evaluation)
- [ ] MMLU-Pro and AIME 2025 evaluation
- [ ] Cross-architecture breeding at 27B scale (Transformer × Mamba FFN)
- [ ] Third-generation recursive evolution
- [ ] Darwin engine research paper

---

## References

- DARE: Yu et al., 2023 ([arXiv:2311.03099](https://arxiv.org/abs/2311.03099)); TIES-Merging: Yadav et al., 2023 ([arXiv:2306.01708](https://arxiv.org/abs/2306.01708)); re-implemented without library dependency
- FFN as Cross-Attention: [arXiv:2501.00823](https://arxiv.org/abs/2501.00823)
- CLIcK: Kim et al., 2024 ([arXiv:2403.06412](https://arxiv.org/abs/2403.06412))
- GPQA: Rein et al., 2023 ([arXiv:2311.12022](https://arxiv.org/abs/2311.12022))
- CMA-ES: Hansen & Ostermeier, 2001
- Darwin V6 Engine: [Hugging Face Space](https://huggingface.co/spaces/ginigen-ai/DARWIN-V5-BACKUP)

---

## Built By

| | |
|---|---|
| Developer | [VIDRAFT](https://huggingface.co/FINAL-Bench) |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Architecture | Qwen3.5-27B Dense |
| License | Apache 2.0 |

---

## Citation

```bibtex
@misc{vidraft_darwin_27b_opus_2026,
  title        = {Darwin-27B-Opus: Surpassing the Foundation Model Without Training},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-27B-Opus}},
  note         = {86.9\% on GPQA Diamond via evolutionary FFN crossbreeding}
}
```