Instructions for using FINAL-Bench/Darwin-36B-Opus with libraries, inference providers, notebooks, and local apps.

Libraries

How to use FINAL-Bench/Darwin-36B-Opus with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="FINAL-Bench/Darwin-36B-Opus")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-36B-Opus")
model = AutoModelForCausalLM.from_pretrained("FINAL-Bench/Darwin-36B-Opus", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use FINAL-Bench/Darwin-36B-Opus with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FINAL-Bench/Darwin-36B-Opus"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FINAL-Bench/Darwin-36B-Opus",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
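
The same call can be made from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on localhost:8000, and the helper names `build_chat_request` and `ask` are hypothetical, chosen for illustration.

```python
import json
import urllib.request

# Assumption: the server started with `vllm serve` is reachable at this address.
BASE_URL = "http://localhost:8000/v1"
MODEL = "FINAL-Bench/Darwin-36B-Opus"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's reply. Building the request separately from sending it makes the payload easy to inspect before any network call is made.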

Use Docker

docker model run hf.co/FINAL-Bench/Darwin-36B-Opus

SGLang

How to use FINAL-Bench/Darwin-36B-Opus with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "FINAL-Bench/Darwin-36B-Opus" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FINAL-Bench/Darwin-36B-Opus",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "FINAL-Bench/Darwin-36B-Opus" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FINAL-Bench/Darwin-36B-Opus",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use FINAL-Bench/Darwin-36B-Opus with Docker Model Runner:

```
docker model run hf.co/FINAL-Bench/Darwin-36B-Opus
```

fyi

by jukefr - opened May 21

Discussion

jukefr

May 21 (edited May 21)

A sample on this subset of term-bench2.0 tasks was already enough for me; feel free to bench more if you want. Tested with pi-agent.

  ┌───────────────────────────┬─────────────┬────────────┬───────┬───────┬───────┬───────┐
  │           Task            │   Qwen3.6   │   Darwin   │ Q dur │ D dur │ Q out │ D out │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ fix-git                   │         3/3 │        2/3 │   41s │   31s │  2.3K │  1.6K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ prove-plus-comm           │         2/3 │        2/3 │  377s │   36s │   11K │  1.9K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ cobol-modernization       │         1/3 │        2/3 │  439s │  215s │   26K │   13K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ overfull-hbox             │         0/3 │        0/3 │  484s │  103s │   29K │  5.9K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ break-filter-js-from-html │         0/3 │        0/3 │  297s │  275s │   18K │   16K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ filter-js-from-html       │         0/3 │        0/3 │   80s │  671s │  4.6K │   33K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ kv-store-grpc             │         2/3 │        0/3 │   34s │   42s │  1.3K │  1.8K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ multi-source-data-merger  │         3/3 │        1/3 │   64s │   98s │  3.5K │  5.8K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ regex-log                 │         1/3 │        0/3 │  461s │  580s │   28K │   34K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ git-leak-recovery         │         2/3 │        1/3 │   35s │   39s │  1.8K │  1.9K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ pypi-server               │         0/3 │        0/3 │   23s │   46s │  0.9K │  2.5K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ TOTAL                     │ 14/33 (42%) │ 8/33 (24%) │       │       │       │       │
  └───────────────────────────┴─────────────┴────────────┴───────┴───────┴───────┴───────┘
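
The TOTAL row can be rechecked from the per-task scores. This is a small sketch that copies the pass counts from the table above (Qwen3.6 first, Darwin second) and assumes three attempts per task, as the "x/3" cells suggest; the names `RESULTS`, `totals`, and `head_to_head` are hypothetical.

```python
# Pass counts per task, copied from the table: (Qwen3.6, Darwin), each out of 3.
RESULTS = {
    "fix-git": (3, 2),
    "prove-plus-comm": (2, 2),
    "cobol-modernization": (1, 2),
    "overfull-hbox": (0, 0),
    "break-filter-js-from-html": (0, 0),
    "filter-js-from-html": (0, 0),
    "kv-store-grpc": (2, 0),
    "multi-source-data-merger": (3, 1),
    "regex-log": (1, 0),
    "git-leak-recovery": (2, 1),
    "pypi-server": (0, 0),
}
ATTEMPTS_PER_TASK = 3  # assumption read off the "x/3" cells


def totals(results, attempts=ATTEMPTS_PER_TASK):
    """Return {model: (passes, total_attempts)} summed over all tasks."""
    total_attempts = len(results) * attempts
    qwen = sum(q for q, _ in results.values())
    darwin = sum(d for _, d in results.values())
    return {"Qwen3.6": (qwen, total_attempts), "Darwin": (darwin, total_attempts)}


def head_to_head(results):
    """Count tasks where each model scored strictly higher, plus ties."""
    qwen_wins = sum(1 for q, d in results.values() if q > d)
    darwin_wins = sum(1 for q, d in results.values() if d > q)
    ties = len(results) - qwen_wins - darwin_wins
    return qwen_wins, darwin_wins, ties


for name, (passed, n) in totals(RESULTS).items():
    print(f"{name}: {passed}/{n} ({passed / n:.0%})")
# Qwen3.6: 14/33 (42%)
# Darwin: 8/33 (24%)
```

The sums agree with the TOTAL row. With only three attempts per task, the per-task differences are small samples, so the aggregate is the more meaningful figure.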

SeaWolf-AI

FINAL_Bench org May 21

Thanks for running the benchmark and sharing the numbers.

Quick note on positioning: Darwin-36B-Opus is published as a reasoning-focused evolutionary merge (GPQA Diamond 88.4%, tying Qwen3.5-397B-A17B), not as an agentic coder. The Darwin Opus line is bred for graduate-level scientific reasoning (physics, chemistry, and biology Q&A in the GPQA style) and is not tuned for terminal/agent workflows. For agent and coding tasks we'd recommend the Qwen Coder line.

Two observations on your runs that may explain part of the gap:

System prompt: Darwin needs enable_thinking=true via the Qwen chat template, and the agent harness needs to leave room for the ... block before tool calls. If pi-agent strips or truncates the thinking trace, Darwin loses most of its reasoning lift. You can confirm in the output: if you don't see a thinking block, the harness is filtering it.

Output token compactness is by design: Darwin Opus inherits a Father with 75% Gated-DeltaNet + 25% Gated-Attention. Post-thinking responses are deliberately compressed (FFN α asymmetry from the merge genome), which is the opposite of what agent benchmarks reward: verbose step-by-step tool chains. That's a known trade-off for this checkpoint, not a regression.

We'd be very interested to see your numbers on the same subset with (a) enable_thinking=true set in the request, and (b) the agent template that preserves the thinking trace. Happy to help if there's a specific task where you'd like to dig in.

For full context: the Darwin Family methodology is currently under peer review at ARR May 2026 (training-free reasoning scaling); coding/agent performance is explicitly out of scope of that submission.
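
For anyone rerunning with suggestion (a), here is a hedged sketch of what setting enable_thinking in the request might look like against an OpenAI-compatible server. It assumes the server forwards a `chat_template_kwargs` field to the chat template, which vLLM and SGLang accept. The thread does not name the thinking tags, so `<think>` and `</think>` below are an assumption based on common Qwen chat templates, and `build_payload` and `has_thinking_block` are hypothetical helper names.

```python
import json

MODEL = "FINAL-Bench/Darwin-36B-Opus"

# Assumption: the thread does not spell out the tag names; these are the ones
# commonly used by Qwen chat templates and may differ for this checkpoint.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def build_payload(prompt, enable_thinking=True, model=MODEL):
    """Build a chat completions payload that asks the template to enable thinking."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server passes these kwargs through to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def has_thinking_block(text, open_tag=THINK_OPEN, close_tag=THINK_CLOSE):
    """Return True if text contains an opening tag followed later by a closing tag."""
    start = text.find(open_tag)
    if start == -1:
        return False
    return text.find(close_tag, start + len(open_tag)) != -1


# Print the JSON body, e.g. to paste into a curl --data argument.
print(json.dumps(build_payload("Who are you?"), indent=2))
```

Running `has_thinking_block` on the raw model output before the agent harness processes it gives a rough check on whether the trace is being filtered. Some server configurations return the trace in a separate response field, or place the opening tag in the prompt, in which case a check on the message content alone would report a false negative.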
