	---
	license: apache-2.0
	base_model:
	- Qwen/Qwen3.5-9B
	- Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled
	tags:
	- merge
	- evolutionary-merge
	- darwin
	- darwin-v5
	- model-mri
	- reasoning
	- advanced-reasoning
	- chain-of-thought
	- thinking
	- qwen3.5
	- qwen
	- claude-opus
	- distillation
	- multilingual
	- benchmark
	- open-source
	- apache-2.0
	- layer-wise-merge
	- coding-agent
	- tool-calling
	- long-context
	language:
	- en
	- zh
	- ko
	- ja
	- de
	- fr
	- es
	- ru
	- ar
	- multilingual
	pipeline_tag: text-generation
	library_name: transformers
	model-index:
	- name: Darwin-9B-Opus
	  results:
	  - task:
	      type: text-generation
	      name: Graduate-Level Reasoning
	    dataset:
	      type: Idavidrein/gpqa
	      name: GPQA Diamond
	      config: gpqa_diamond
	      split: train
	    metrics:
	    - type: accuracy
	      value: 90.0
	      name: Accuracy
	      verified: false
	---

	# Darwin-9B-Opus

	<p align="center">
	<a href="https://huggingface.co/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/Model-Darwin--9B--Opus-blue?style=for-the-badge" alt="Model"></a>
	<a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-9B-Opus"><img src="https://img.shields.io/badge/Space-9B_Live_Demo-purple?style=for-the-badge" alt="Space"></a>
	<a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/Model-Darwin--35B--A3B--Opus-blue?style=for-the-badge" alt="35B Model"></a>
	<a href="https://huggingface.co/spaces/FINAL-Bench/Darwin-35B-A3B-Opus"><img src="https://img.shields.io/badge/Space-35B_Live_Demo-purple?style=for-the-badge" alt="35B Space"></a>
	<a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard"><img src="https://img.shields.io/badge/FINAL_Bench-Leaderboard-green?style=for-the-badge" alt="FINAL Bench"></a>
	<a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard"><img src="https://img.shields.io/badge/ALL_Bench-Leaderboard-orange?style=for-the-badge" alt="ALL Bench"></a>
	</p>

	<p align="center">
	<img src="info.png" alt="Darwin-9B-Opus" width="100%">
	</p>

	> Qwen3.5 Dense 9B | Reasoning | Chain-of-Thought | 131K Context | 201 Languages | BF16 | Apache 2.0

	---

	## Technical Definitions

	| Term | Definition | Measurement |
	|---|---|---|
	| Model MRI | Layer-level profiling of tensor health indicators | L2 norm, Shannon entropy, and standard deviation per tensor across all layers |
	| LayerMRI.compare_layers | Per-tensor A vs B quality comparison yielding the ratio `ratio_b` | Per model: `score = entropy * 0.5 + std * 0.3 + min(norm, 100) * 0.002`; then `ratio_b = score_b / (score_a + score_b)` |
	| MRI-Guided Merge | Per-tensor merge ratios derived from parent diagnostics (70% MRI + 30% genome) | `final_ratio = mri_ratio * 0.7 + genome_ratio * 0.3` |
	| DARE-TIES | Merge algorithm: random binary mask on the delta, then weighted addition | `merged = A + (B - A) * random_mask(density) * ratio` |
	| Transplant A / B | When the MRI ratio falls below 0.05, parent A is used entirely; above 0.95, parent B is used entirely | No interpolation; direct tensor copy |
	| Evolutionary Search | CMA-ES population evolution over genome space (ratio, attn, ffn, embed, density_a, density_b) | Phase 1: 200 steps with a heuristic proxy; Phase 2: 10 steps with a real benchmark |
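
The scoring formulas in the table can be tried out on small inputs. The following is a minimal sketch, not the Darwin V5 code: the README does not say how Shannon entropy is computed for a weight tensor, so the 16-bin histogram estimate here is an assumption, as are the function names `diagnose`, `quality_score`, `ratio_b`, and `final_ratio`.

```python
import math
from collections import Counter


def diagnose(values, bins=16):
    """Return L2 norm, histogram entropy (bits), and population std of a flat weight list.

    ASSUMPTION: entropy is estimated from an equal-width histogram; the README
    does not specify the estimator used by LayerMRI.
    """
    n = len(values)
    norm = math.sqrt(sum(v * v for v in values))
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant tensors
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return {"norm": norm, "entropy": entropy, "std": std}


def quality_score(diag):
    """score = entropy * 0.5 + std * 0.3 + min(norm, 100) * 0.002 (from the table)."""
    return diag["entropy"] * 0.5 + diag["std"] * 0.3 + min(diag["norm"], 100) * 0.002


def ratio_b(score_a, score_b):
    """Share of the merge assigned to parent B; higher means B scored better."""
    return score_b / (score_a + score_b)


def final_ratio(mri_ratio, genome_ratio):
    """Blend the MRI prescription (70%) with the evolved genome ratio (30%)."""
    return mri_ratio * 0.7 + genome_ratio * 0.3
```

Because the norm term is capped at 100 and weighted by 0.002, it contributes at most 0.2 to a score, so entropy and standard deviation dominate the comparison.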

	---

	## Overview

	Darwin-9B-Opus is a 9B-parameter dense reasoning model created with Darwin V5. Both parents share the same Qwen3.5-9B architecture: the Mother is a LoRA SFT of the Father's base model, not a different architecture.

	| Role | Model | Training |
	|---|---|---|
	| Father | [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) | Original pre-training + RLHF |
	| Mother | [Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled) | LoRA SFT on text-only Claude 4.6 Opus reasoning chains |

	---

	## How Darwin V5 Works

	Darwin V5 does not use mergekit or any other external merge library. It implements a merge inspired by DARE-TIES from scratch in PyTorch tensor operations, so that every tensor can receive its own ratio guided by the MRI diagnostics.

	### Merge Implementation (actual code logic)

	```python
	# For each tensor pair (A, B) across all safetensor shards:
	ta = model_a[key] # Father tensor
	tb = model_b[key] # Mother tensor

	# 1. MRI diagnoses both tensors
	diag_a = LayerMRI.diagnose_tensor(ta) # {norm, entropy, std}
	diag_b = LayerMRI.diagnose_tensor(tb) # {norm, entropy, std}

	# 2. Quality score comparison determines ratio_b
	score_a = diag_a["entropy"] * 0.5 + diag_a["std"] * 0.3 + min(diag_a["norm"], 100) * 0.002
	score_b = diag_b["entropy"] * 0.5 + diag_b["std"] * 0.3 + min(diag_b["norm"], 100) * 0.002
	mri_ratio = score_b / (score_a + score_b) # Higher = Mother is better

	# 3. Final ratio = MRI 70% + evolutionary genome 30%
	final_ratio = mri_ratio * 0.7 + genome_type_ratio * 0.3

	# 4. DARE-TIES merge with per-tensor ratio
	mask = torch.rand_like(tb) < density_b
	delta = (tb - ta) * mask
	merged = (ta + delta * final_ratio).bfloat16()
	```
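
To experiment with step 4 in isolation, the masked merge can be reproduced on plain Python lists. This is a minimal sketch under stated assumptions, not the Darwin V5 implementation: `masked_merge`, its `seed` argument, and the optional `rescale` flag are hypothetical names introduced here. The `rescale` option reflects the original DARE formulation, which divides the kept deltas by the density; the README formula does not show that step, so it is off by default.

```python
import random


def masked_merge(ta, tb, ratio, density, seed=0, rescale=False):
    """Compute merged = A + (B - A) * mask * ratio element-wise.

    mask keeps each delta with probability `density`.
    `rescale=True` divides kept deltas by `density` (as in the DARE paper);
    this is NOT part of the README formula and is only offered for comparison.
    """
    rng = random.Random(seed)  # seeded so the mask is reproducible
    merged = []
    for a, b in zip(ta, tb):
        keep = rng.random() < density
        delta = (b - a) if keep else 0.0
        if keep and rescale:
            delta /= density
        merged.append(a + delta * ratio)
    return merged


father = [1.0, 2.0, 3.0, 4.0]
mother = [2.0, 4.0, 6.0, 8.0]
print(masked_merge(father, mother, ratio=0.5, density=1.0))
```

With `density=1.0` every delta is kept, so the call above reduces to a plain linear interpolation halfway between the two parents. With `density=0.0` the result is the Father tensor unchanged.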

	### Pipeline

	```
	Phase 0: Model MRI
	  For every tensor in both parents, measure:
	    - L2 norm (layer energy)
	    - Shannon entropy (weight distribution uniformity)
	    - Standard deviation (weight spread)
	  Compare A vs B quality scores -> per-tensor ratio prescription

	Phase 1: Evolutionary Search (200 steps, heuristic proxy)
	  Population of 20 genomes (ratio, attn, ffn, embed, density_a, density_b)
	  Fitness: heuristic score based on genome balance + differentiation
	  Selection -> SLERP crossover -> Gaussian mutation

	Phase 2: Real Merge + Benchmark (10 steps)
	  Top genomes from Phase 1 undergo actual tensor merge
	  Each merge: MRI prescription (70%) + genome ratio (30%)
	  Fitness: real benchmark score (ARC-Challenge)
	  Best model selected and auto-uploaded

	Phase 3: Health Check
	  Layer-by-layer importance comparison: child vs both parents
	  Detect interference (child >> parents) or function loss (parents >> child)
	```
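
The Phase 3 comparison can be illustrated with a small classifier over per-layer importance values. The README does not define the importance metric or what counts as "much greater", so this is only a sketch: the function name `health_check` and the threshold `factor=2.0` are assumptions chosen for illustration.

```python
def health_check(child, father, mother, factor=2.0):
    """Flag layers where the child's importance diverges from both parents.

    ASSUMPTIONS: inputs are per-layer importance scores (metric unspecified in
    the README); `factor` is a hypothetical threshold for ">>".
    """
    report = []
    for layer, (c, f, m) in enumerate(zip(child, father, mother)):
        if c > factor * max(f, m):
            status = "interference"   # child far above both parents
        elif c * factor < min(f, m):
            status = "function_loss"  # child far below both parents
        else:
            status = "ok"
        report.append((layer, status))
    return report


report = health_check(
    child=[1.0, 5.0, 0.2],
    father=[1.0, 1.0, 1.0],
    mother=[1.2, 2.0, 1.0],
)
print(report)
```

Comparing against both parents, rather than their average, means a layer is flagged only when the child falls outside the range the parents span by the chosen factor.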

	### What Makes This Different from Standard Merging

	| Capability | Standard DARE-TIES | Darwin V5 |
	|---|---|---|
	| Implementation | mergekit library call | Direct PyTorch tensor operations |
	| Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MRI diagnosis |
	| Pre-merge analysis | None | Tensor-level norm/entropy/std profiling |
	| Ratio determination | Human-set or grid search | MRI 70% + evolutionary genome 30% |
	| Post-merge validation | Benchmark score only | Layer-by-layer child vs parents comparison |
	| Transplant support | No | ratio < 0.05 -> use A entirely, ratio > 0.95 -> use B entirely |
	| Failure diagnosis | "Score went down" | Per-tensor quality delta identifies problematic layers |
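
The transplant rule in the table amounts to a three-way decision per tensor. A minimal sketch follows, assuming the thresholds apply to the MRI ratio as stated in the Technical Definitions; the names `choose_action` and `plan_merge` and the example tensor keys are hypothetical.

```python
def choose_action(mri_ratio, low=0.05, high=0.95):
    """Map a per-tensor MRI ratio to a merge action.

    Thresholds 0.05 / 0.95 come from the README; applying them to the MRI
    ratio (rather than the final blended ratio) is an assumption.
    """
    if mri_ratio < low:
        return "transplant_a"  # copy the Father tensor unchanged
    if mri_ratio > high:
        return "transplant_b"  # copy the Mother tensor unchanged
    return "merge"             # fall through to the masked merge


def plan_merge(mri_ratios):
    """Build a {tensor_name: action} plan from {tensor_name: mri_ratio}."""
    return {name: choose_action(r) for name, r in mri_ratios.items()}


# Hypothetical tensor names and ratios, for illustration only.
plan = plan_merge({
    "layers.0.attn.q_proj.weight": 0.02,
    "layers.0.mlp.up_proj.weight": 0.51,
    "lm_head.weight": 0.97,
})
print(plan)
```

Producing the plan before touching any weights makes it easy to count how many tensors would be transplanted versus merged for a given pair of parents.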

	---

	## Model Specifications

	| | |
	|---|---|
	| Architecture | Qwen3.5 Dense (Gated DeltaNet hybrid) |
	| Total Parameters | 9B |
	| Precision | BF16 |
	| Context Length | 131,072 native |
	| Languages | 201 |
	| Thinking | `<think>` tag chain-of-thought reasoning |
	| License | Apache 2.0 |

	---

	## Hardware Requirements

	| Setup | VRAM | Status |
	|---|---|---|
	| BF16 Full Precision | ~20 GB | |
	| NVIDIA RTX 4090 24GB | 24 GB | Comfortable |
	| NVIDIA A100 40GB | 40 GB | Very comfortable |
	| NVIDIA T4 16GB | 16 GB | Requires quantization |
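
The figures in the table follow from simple arithmetic on the weight size. The sketch below assumes a nominal 9 billion parameters and counts weights only; the KV cache and activations come on top and grow with context length, which is why the table lists roughly 20 GB rather than 18 GB for BF16. The function name `weight_memory_gb` is introduced here for illustration.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 9e9  # nominal parameter count (assumption: exactly 9B)

for label, bytes_per_param in [("BF16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bytes_per_param):.1f} GB")
```

At 2 bytes per parameter the weights alone take 18 GB, which already exceeds a 16 GB T4, so that card needs 8-bit or 4-bit quantization.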

	---

	## Usage

	### Transformers

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	tokenizer = AutoTokenizer.from_pretrained(
	    "FINAL-Bench/Darwin-9B-Opus",
	    trust_remote_code=True,
	)
	model = AutoModelForCausalLM.from_pretrained(
	    "FINAL-Bench/Darwin-9B-Opus",
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)

	messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer(text, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=4096)
	print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
	```
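
Since the model reasons inside `<think>` tags, it is often useful to separate the reasoning from the final answer after decoding. The helper below is a sketch that assumes the decoded text contains a literal `</think>` closing tag; whether the tags survive `skip_special_tokens=True` depends on the tokenizer configuration, which the README does not state. `split_reasoning` is a hypothetical name.

```python
def split_reasoning(text, close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    If no closing tag is present, the whole text is treated as the answer.
    A leading "<think>" is stripped when present; some chat templates place
    the opening tag in the prompt, so it may be absent from the output.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


decoded = "<think>\nAssume sqrt(2) = p/q in lowest terms...\n</think>\n\nsqrt(2) is irrational."
reasoning, answer = split_reasoning(decoded)
print(answer)
```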

	### SGLang

	```bash
	python -m sglang.launch_server \
	  --model-path FINAL-Bench/Darwin-9B-Opus \
	  --tp 1 \
	  --mem-fraction-static 0.90 \
	  --context-length 32768 \
	  --trust-remote-code
	```

	### vLLM

	```bash
	vllm serve FINAL-Bench/Darwin-9B-Opus \
	  --trust-remote-code \
	  --enforce-eager
	```
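
Once `vllm serve` is running, it exposes an OpenAI-compatible HTTP API. The sketch below builds a chat completion request using only the standard library. It assumes the server listens on vLLM's default `http://localhost:8000` and was started without `--served-model-name`, so the model is addressed by its repository id; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, model="FINAL-Bench/Darwin-9B-Opus",
                       base_url="http://localhost:8000/v1", max_tokens=1024):
    """Build a POST request for the OpenAI-compatible /chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running server; adjust base_url if the port differs.
    print(chat("Prove that sqrt(2) is irrational."))
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.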

	---

	## Evolution Details

	| | |
	|---|---|
	| Engine | Darwin V5 (Evolutionary Merge + Layer-Level Diagnostics) |
	| Merge Method | DARE-TIES (direct PyTorch implementation, no external library) |
	| MRI Integration | Per-tensor diagnosis: norm, entropy, std -> ratio prescription |
	| Ratio Formula | `final_ratio = mri_ratio * 0.7 + genome_ratio * 0.3` |
	| Evolution | Phase 1: 200 steps proxy + Phase 2: 10 steps real benchmark |
	| Best Score | 0.8508 (ARC-Challenge) |
	| Infrastructure | 4 x NVIDIA H100 NVL (100GB each) |

	---

	## Acknowledgements

	- Korean Government: GPU Support Program research grant
	- [Qwen Team](https://huggingface.co/Qwen): Qwen3.5 base architecture
	- [Jackrong](https://huggingface.co/Jackrong): Claude 4.6 Opus Reasoning Distilled model
	- DARE-TIES algorithm: DARE by [Yu et al., 2023](https://arxiv.org/abs/2311.03099) and TIES-Merging by [Yadav et al., 2023](https://arxiv.org/abs/2306.01708) (re-implemented, not library-dependent)

	---

	## Built By

	| | |
	|---|---|
	| Developer | VIDRAFT |
	| Engine | Darwin V5 |
	| Base Architecture | Qwen3.5-9B |

	---

	## Citation

	```bibtex
	@misc{vidraft_darwin_9b_opus,
	  title = {Darwin-9B-Opus: Diagnostic-Guided Evolutionary Merge},
	  author = {VIDRAFT},
	  year = {2026},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-9B-Opus}}
	}
	```