@SeaWolf-AI on Hugging Face: "🧬 Darwin-35B-A3B-Opus — The Child That Surpassed Both Parents What if a…"

posted an update Mar 31

🧬 Darwin-35B-A3B-Opus — The Child That Surpassed Both Parents

What if a merged model could beat both its parents? We proved it can.
Darwin-35B-A3B-Opus is a 35B MoE model (3B active) built with our Darwin V5 engine — the first evolution system that CT-scans parent models before merging them.
🤗 Model: FINAL-Bench/Darwin-35B-A3B-Opus

The result speaks for itself: GPQA Diamond 90.0%, versus Father (Qwen3.5-35B-A3B) at 84.2% and Mother (Claude 4.6 Opus Distilled) at 85.0%. That's a relative gain of 6.9% over Father (+5.8 points) and 5.9% over Mother (+5.0 points). Not a tradeoff: a genuine leap. Meanwhile, MMMLU sits at 85.0% (Father: 85.2%), multimodal is fully intact, and all 201 languages are preserved.
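The gain figures quoted for GPQA Diamond are relative improvements rather than percentage-point differences. A small sketch of the arithmetic, using only the three scores stated above (the helper name `gains` is illustrative):

```python
def gains(child: float, parent: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = child - parent
    relative = (child / parent - 1.0) * 100.0
    return round(absolute, 1), round(relative, 1)

child = 90.0                  # Darwin-35B-A3B-Opus, GPQA Diamond
father, mother = 84.2, 85.0   # parent scores as stated in the post

print("vs Father:", gains(child, father))
print("vs Mother:", gains(child, mother))
```

The first number in each pair is the point difference, the second is the relative improvement that the post reports.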

How? Model MRI changed everything. Traditional merging is guesswork. Darwin V4 added evolution. Darwin V5 added X-ray vision. Model MRI scans each parent layer by layer and discovers: Mother's L34–L38 is the reasoning engine (peak cosine distance), 50–65% of Mother's experts are dead (killed by text-only distillation), and Father is a healthy generalist with every expert alive. The prescription: transplant Mother's reasoning brain at L38 (90% weight), replace her dead experts with Father's living ones, and let Father's router handle the output layer. Reasoning went up. Versatility stayed intact. No tradeoff — just evolution.

35B total, 3B active (MoE) · GPQA Diamond 90.0% · MMMLU 85.0% (201 languages) · Multimodal Image & Video · 262K native context · 147.8 tok/s on H100 · Runs on a single RTX 4090 (Q4) · Apache 2.0
Darwin V5's full algorithm and technical details will be released alongside an upcoming paper.

🚀 Live Demo: FINAL-Bench/Darwin-35B-A3B-Opus

🏆 FINAL Bench Leaderboard: FINAL-Bench/Leaderboard

📊 ALL Bench Leaderboard: FINAL-Bench/all-bench-leaderboard

Built by VIDRAFT · Supported by the Korean Government GPU Support Program

sthenno

Apr 2

Hello @SeaWolf-AI ,

This is a truly impressive piece of work. If you’re open to it, I would appreciate the opportunity to connect and explore potential collaboration.

You can find my email address in my profile—feel free to reach out. I look forward to hearing from you.

inflatebot

Apr 2

do you explain what you mean by these medical terms that you're using in an AI context anywhere, or what?

inflatebot

Apr 2

ok this is a scam but i'm on the phone so i'll go over why in a minute

inflatebot

Apr 2

ok i'm back so like not a single word of this is meaningful in any way and is riddled with factual errors in the claims it does make

first off, evolutionary merging as a concept isn't new, mergekit can already do this

There is no way you merged two models of different architectures and got a positive result. If the "mother" were "text only", it would definitionally have to also be multimodal. Otherwise it's not Qwen3.5. And there is no architecture that is even remotely compatible with Qwen3.5's; the DeltaNet attention heads see to that. If what you're saying is true, then you're splicing a cat's brain into a dog's (or, to be somewhat reasonable, a cat's brain into an ocelot's.) This is all I needed but there's more actually.

201 Languages ❌ Potentially degraded ✅ Inherited from Father

You presume. Any amount of finetuning is going to result in specialization in the target area. 201 languages are going to necessarily be represented across the model in a way that layer-wise merging can't preserve without other techniques that are already implemented in a mathematical way (tl;dr Task Arithmetic Works.)
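For readers unfamiliar with the technique named here, task arithmetic adds the difference between each finetune and the shared base back onto the base. A minimal sketch, assuming models are plain dicts mapping a parameter name to a list of floats (a stand-in for real tensors) and a hypothetical `scale` knob; it is not the procedure used by any model in this thread:

```python
def task_arithmetic(base, finetunes, scale=1.0):
    """merged = base + scale * sum(finetuned - base), computed per parameter."""
    merged = {}
    for name, base_w in base.items():
        delta = [0.0] * len(base_w)
        for ft in finetunes:
            # Task vector for this finetune: how far it moved from the base.
            for i, (f, b) in enumerate(zip(ft[name], base_w)):
                delta[i] += f - b
        merged[name] = [b + scale * d for b, d in zip(base_w, delta)]
    return merged

# Toy example with made-up weights.
base = {"w": [1.0, 2.0]}
ft_a = {"w": [1.5, 2.0]}   # moved the first weight
ft_b = {"w": [1.0, 1.0]}   # moved the second weight
print(task_arithmetic(base, [ft_a, ft_b]))
```

Because each finetune contributes only its own delta, changes that touch different parameters can coexist in the merged model.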

Benchmark Transparency ❌ No scores published ✅ Fully open

Your "mother" is a community finetune. This is vexatious language that degrades a hobbyist project as being "opaque." Not that a human wrote this model card and I don't even need to ask Pangram about that.

Model MRI Integration — CT-scans parent models layer by layer before merging, guiding evolution with structural insight
If conventional merging is "mixing recipes blindfolded," Darwin V5 is "precision surgery with X-ray guidance."

Extremely waffly use of medical terminology with no technical definition whatsoever in this context, neither extant nor provided.

Traditional model merging relies on humans setting hyperparameters like ratio and density by intuition. Set ratio=0.5, density=0.9, run once, and hope for the best. The result depends on luck, and applying the same ratio uniformly across billions of parameters ignores each layer's unique role.

Laughably wrong. The entire point of merging is that iteration is fast. "By intuition" is not a meaningful critique because intuition is the only way that humans do this. "and applying the same ratio uniformly across billions of parameters ignores each layer's unique role" presumes A) that layers have a "unique role" (if it were that simple, mechanistic interpretability would be solved), and B) that we have to use the same ratios, essentially "model merging hasn't evolved since 2023", which I can literally prove wrong by Just Look At It.

Darwin V4's Advance
Darwin V4 solved this with evolutionary algorithms — automatically searching hundreds of parameter combinations and selecting survivors by real benchmark scores.

You didn't invent that. See above.
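As a rough illustration of what "searching parameter combinations and selecting survivors by score" means in general, here is a toy evolutionary loop. Everything in it is an assumption for illustration: `toy_fitness` stands in for a real benchmark run, and the population size, mutation width, and survivor fraction are arbitrary. It is neither Darwin's algorithm nor mergekit's implementation.

```python
import random

def evolve(fitness, n_params, pop_size=20, generations=30, sigma=0.1, seed=0):
    """Keep the top quarter each generation and refill with mutated copies."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 4]
        children = []
        while len(survivors) + len(children) < pop_size:
            parent = rng.choice(survivors)
            # Gaussian mutation, clamped to the valid ratio range [0, 1].
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in parent]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

def toy_fitness(ratios):
    """Hypothetical stand-in for a benchmark score: peak at an arbitrary target."""
    target = [0.2, 0.8]
    return -sum((r - t) ** 2 for r, t in zip(ratios, target))

best = evolve(toy_fitness, n_params=2)
print(best, toy_fitness(best))
```

In a real merge each fitness call means building and benchmarking a model, which is why this kind of search is expensive compared with a single hand-tuned merge.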

You need more than this, my guy. You can't just expect us to take your word for this. Give some actual theory or get lost.

Discovering attn=0.168 and ffn=0.841 — this extreme asymmetry — is virtually impossible by human intuition.

Perhaps not those precise numbers, but people literally already do layerwise merge ratios, and this is literally already what my friends in Allura have found in their experiments with finetuning; changing attention vs. feed-forward layers provides drastically different results. We're a bunch of gooner dorks in our bedrooms, you've rediscovered this as a government-funded AI lab. What's going on here exactly?
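To make "different ratios for attention vs. feed-forward" concrete, here is a minimal sketch of a linear merge whose ratio depends on the module type. The parameter names, the substring matching, and the convention that the ratio is the weight on the second model are all assumptions; the model card does not say which parent the 0.168 and 0.841 values weight.

```python
def merge_by_module(model_a, model_b, ratios, default=0.5):
    """Linear merge; `ratios` maps a name substring to the weight on model_b."""
    merged = {}
    for name, a_w in model_a.items():
        b_w = model_b[name]
        # Use the first ratio whose key appears in the parameter name.
        t = next((r for key, r in ratios.items() if key in name), default)
        merged[name] = [(1.0 - t) * a + t * b for a, b in zip(a_w, b_w)]
    return merged

# Toy weights with hypothetical parameter names.
model_a = {"layers.0.attn.q": [0.0, 0.0], "layers.0.ffn.up": [0.0]}
model_b = {"layers.0.attn.q": [1.0, 1.0], "layers.0.ffn.up": [1.0]}
print(merge_by_module(model_a, model_b, {"attn": 0.168, "ffn": 0.841}))
```

With all-zero and all-one toy weights the merged values simply echo the ratios, which makes the asymmetry easy to see.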

No rigorous definition of "dead" is provided anywhere in this model card. From what I can tell it means "inactive to a higher degree."

MRI didn't apply uniform ratios. It split 40 layers into 3 blocks:

Thanks GPT-4o.

But again, these terms are meaningless. We don't know what "MRI" means, we have no way to verify that your process actually results in the numbers you're providing.

Dead Expert 50~65% is the fingerprint of Claude text-only distillation. The fine-tuning killed multimodal and multilingual experts that were no longer activated during text-only training.

Didn't you say at the top that the Claude distill is a text-only model?? Why would you expect layers with connections to the multimodal tower to be activated?????? Are we for real??????????

Father MRI: Healthy Generalist (Organ Donor)

Yet another metaphor with no technical definition extant or provided.

The Father (Qwen3.5-35B-A3B) shows healthy, uniform expert activation across all 40 layers — a well-balanced generalist with all experts alive. This is the "organ donor" that revives the Mother's dead 50–65% experts.

Of course. It's the base model. You would expect that.

Why This Matters

Thanks GPT-4o.

I can't critique this section, but I don't think I have to: the reason I can't is that it's unfalsifiable, on account of the blatant and egregious lack of any kind of technical direction in this model card. There is nothing to critique. This is Ancient Aliens tier. This is a wall made of saltine crackers.

So what do we have here?

A layer-wise merge of a Claude 4.6 Opus distillation onto the Qwen 3.5 base, improving degradation caused by what might have been an underdeveloped finetune methodology, that results in better performance, because model merging is a validated technique that works well. The layerwise ratios were discovered with an evolutionary process, a thing that already exists, but isn't often done because it's more expensive.

That alone is interesting enough to promote. It's good PR for evolutionary merging, which I think more people should be focusing on.

But what's stapled on top is a cheap facade of irrelevant jargon from medicine that communicates nothing of value to anything that might have changed about the process, along with false claims about merging that demonstrate that nobody involved with this project respects it as a method, with a model card shat out by a free-tier LLM that understands what it's saying perhaps less than the humans who could have conceivably produced the graphs.

I am insulted having spent my time reading this. There is so much more I could go into but I just keep repeating myself over and over and I only have so many hours in a day.

Come back with a paper with some actual math on it, and I'll change my tone. Until then, stay off of our HF feed, please. This crap makes us all look bad.

Get that government bag tho I guess.

SeaWolf-AI

Apr 2

edited Apr 2

Thanks for the detailed critique. Genuinely appreciate someone reading the model card this closely.

Some points are fair, some are misunderstandings. Let me address them.

Evolutionary merging isn't new:
Correct. mergekit already has an evolve feature, and we never claimed to have invented evolutionary merging. What Darwin adds is layer-level profiling of parent models before merging to reduce the search space. Whether this warrants a separate name is debatable, and we should have been clearer about what's new vs. existing methodology.

Architecture mismatch:
This is a misunderstanding. Both parents are Qwen3.5-35B-A3B — identical architecture. The Mother (Jackrong's Claude distill) is a LoRA SFT on the same base, not a different architecture. 'Text-only' refers to the training data (text reasoning chains only), not the model structure. The model card was unclear on this.

No technical definitions for medical terms:
Fair point. Here are the actual definitions:

Model MRI = Layer-level profiling. We run a 1K-sample calibration dataset across 40 layers x 256 experts and measure expert activation frequency.
Dead Expert = An expert with activation frequency below 5% on the calibration set. The Mother model showed 50-65% dead experts in middle layers, consistent with LoRA SFT updating only a subset of parameters.
Routing Entropy = Shannon entropy of the router's softmax distribution. Healthy range for top-8-of-256 is 3.0-4.5 bits. Low entropy means the router is collapsing to a few experts.
MRI-guided merging = Layers with high dead-expert ratios get higher Father (base model) weight. Healthy layers keep more Mother weight. This is the per-block ratio constraint.
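The first three definitions can be sketched in a few lines. This is an illustration under stated assumptions, not Darwin's code: activation frequency is taken to be the fraction of calibration tokens routed to an expert, the input format (one list of selected expert ids per token) is hypothetical, and the dead-expert threshold is left as a parameter because the post does not say how the 5% figure is normalized.

```python
import math
from collections import Counter

def activation_frequency(routed, num_experts):
    """routed: for each token, the list of expert ids picked by top-k routing."""
    counts = Counter(e for token in routed for e in token)
    n_tokens = len(routed)
    return [counts[e] / n_tokens for e in range(num_experts)]

def dead_experts(freqs, threshold):
    """Experts whose activation frequency falls below the threshold."""
    return [e for e, f in enumerate(freqs) if f < threshold]

def routing_entropy_bits(probs):
    """Shannon entropy (base 2) of one router softmax distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# Toy layer: 4 experts, top-2 routing, 4 calibration tokens.
routed = [[0, 1], [0, 2], [0, 1], [0, 1]]
freqs = activation_frequency(routed, num_experts=4)
print(freqs, dead_experts(freqs, threshold=0.3))
print(routing_entropy_bits([0.25, 0.25, 0.25, 0.25]))
```

For scale, a perfectly uniform distribution over 256 experts has an entropy of 8 bits, and under perfectly uniform top-8-of-256 routing each expert would receive 8/256 (about 3.1%) of tokens, so the exact normalization behind the 5% threshold matters when reproducing the dead-expert counts.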

We should have led with these definitions instead of metaphors.
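The "MRI-guided merging" rule above only says that layers with more dead experts get more Father weight. One way to read that, purely as an assumption, is a clamped linear map from a layer's dead-expert ratio to its Father weight; the bounds `lo` and `hi` below are hypothetical and the actual mapping has not been published.

```python
def father_weight(dead_ratio, lo=0.25, hi=0.75):
    """Map a dead-expert ratio in [0, 1] linearly to a Father weight in [lo, hi]."""
    dead_ratio = min(1.0, max(0.0, dead_ratio))  # clamp out-of-range inputs
    return lo + (hi - lo) * dead_ratio

# Hypothetical per-layer dead-expert ratios from a profiling pass.
dead_ratios = {"layer_10": 0.05, "layer_20": 0.60, "layer_38": 0.10}
for layer, ratio in dead_ratios.items():
    w = father_weight(ratio)
    print(layer, "father:", round(w, 3), "mother:", round(1.0 - w, 3))
```

Any monotone mapping would satisfy the stated rule; the linear form is chosen here only because it is the simplest.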

'Impossible by human intuition' is overstated:
Agreed. Communities like Allura already do layerwise ratio tuning. What we meant was that automated search across 40 layers x multiple parameters is faster than manual tuning — not that humans can't get there.

Model card quality:
Fair. We'll rewrite it with formal definitions, reproducible methodology, and ablation results.

On using mergekit:
Darwin uses mergekit as its merge backend. Intentionally. mergekit is excellent infrastructure, and Darwin adds the evolutionary loop and diagnostic layer on top. Using good tools is not a contradiction.

We're currently running MMLU-Pro benchmarks and will publish full results when complete. Based on this feedback, we'll overhaul the model card to be technically grounded.

Appreciate the sharp critique. It helps make the next version better.

inflatebot

Apr 2

Good stuff. That actually sounds kind of interesting.

Just the least you could do is write the model card yourselves such that you catch these things. I only continued reading because I was like "wait I think I might see the vision" but nothing came of it. That's why I got so frustrated. The least we can do for each other is read the things we post. I'd be pretty disappointed if the model card sold my work short so hard.

I acknowledge I went a little far, it's just tiring to be in this space sometimes.

Looking forward to the detailed report.
