APEX Quant Request + Real World Performance

#3
by el4 - opened

Love this model.

@mudler please consider this model for an APEX quant.

It has denser reasoning, feels more fluid, and simply performs better than the Opus 4.7 finetune I just tested.

Since they both share the same MoE base model, speeds should be very similar if APEX-quantized.

Visual comparison (half of the Opus 4.7 finetune's internal reasoning is shown in this image):
[image]


Darwin-36B-Opus vs. Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled

Hardware / Config

| Spec | Value |
|---|---|
| OS | Linux 7.0.1-1-cachyos-rt-bore-lto |
| GPU | RTX 4080 Max-Q 12 GB @ 60 W TDP |
| CPU | Intel Ultra 9 185H |
| RAM | 32 GB LPDDR5x |
| Backend | ik_llama.cpp (main) |
| Context | 65k max |
| Darwin quant | bartowski IQ4_XS imatrix |
| 4.7 quant | APEX-I-Compact |

Performance Eval

(Tested comparable sizes: one quant that fits in VRAM, and one closest in size to IQ4_XS)

| Model | Quant | Size (on disk) | Prefill (16k) | Generate |
|---|---|---|---|---|
| Darwin-36B-Opus | bartowski IQ4_XS | ~17.5 GB | 293 tps | 51.0 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Compact | 16.1 GB | 313 tps | 46.6 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Nano | 10.8 GB | 1047 tps | 67.2 tps |

Reasoning Samples (Physics Task)

Darwin-36B-Opus thinking trace (~800 tokens):

Physics: WKB approximation for V(x) = αx⁴. The quantization condition is ∮p(x)dx=(n+1/2)2πℏ. For V(x)=αx⁴, the turning points are at x=±(E/α)^(1/4). The integral gives Eₙ = [derivation]... Let me be more precise.

The WKB quantization: ∫₋ₐᵃ √(2m(E−αx⁴)) dx = (n+1/2)πℏ where a=(E/α)^(1/4).

Using substitution x=a·sin^(1/2)θ... Actually, let me use the standard result. The integral ∫₋ₐᵃ √(a⁴−x⁴) dx relates to the Beta function.

The result: Eₙ = C·ℏ^(4/3)α^(1/3)(n+1/2)^(4/3)/(2m)^(2/3) where C is a numerical constant.

Qwen-4.7 Fine-Tune thinking trace (~1200 tokens):

The user wants me to solve 4 specific tasks and then write a Python script that saves the solutions to a markdown file. Let me work through each task:

1. Physics: WKB approximation for V(x) = αx⁴, derive E_n
2. Code: parallel_map_with_retry (asyncio) + fix fetch_with_timeout bug
3. Language: Attention analogy in Spanish, translate to Japanese
4. Meta: One genuine limitation and how it affects Task 1

Let me solve each one properly, then write the Python script.

Task 1: WKB for V(x) = αx⁴

The WKB quantization condition:
∮ p(x) dx = (n + 1/2) h

where p(x) = √(2m(E - V(x)))

For V(x) = αx⁴, the turning points are at x = ±(E/α)^(1/4)

The integral:
∮ p(x) dx = 2 ∫_{-a}^{a} √(2m(E - αx⁴)) dx = 4 ∫_0^a √(2m(E - αx⁴)) dx

where a = (E/α)^(1/4)
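For reference, here is how the derivation both traces start completes (this is the standard textbook WKB result for the quartic oscillator, written out by me, not part of either model's output):

```latex
% Substitute x = a u with a = (E/\alpha)^{1/4}, so E - \alpha x^4 = E(1 - u^4):
\oint p\,dx
  = 2\int_{-a}^{a}\sqrt{2m\,(E-\alpha x^{4})}\,dx
  = 2a\sqrt{2mE}\int_{-1}^{1}\sqrt{1-u^{4}}\,du
  = \left(n+\tfrac{1}{2}\right)h
% The dimensionless integral is a Beta function:
\int_{-1}^{1}\sqrt{1-u^{4}}\,du = \tfrac{1}{2}\,B\!\left(\tfrac{1}{4},\tfrac{3}{2}\right)
% The left side scales as E^{3/4}\alpha^{-1/4}, so solving for E:
E_{n} = \left[\frac{2\pi\hbar}{\sqrt{2m}\;B\!\left(\tfrac{1}{4},\tfrac{3}{2}\right)}\right]^{4/3}
        \alpha^{1/3}\left(n+\tfrac{1}{2}\right)^{4/3}
```

The E^(3/4) scaling on the left is what produces the (n+1/2)^(4/3) exponent both models report below.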

Output Quality (Final Answers)

  • Both models produced identical final physics derivation: Eₙ ∝ (n+1/2)^(4/3)
  • Both produced identical async code with retry logic + bug fix
  • Both produced identical Spanish→Japanese attention analogy
  • Both acknowledged quantization-induced numerical instability
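For context on what Task 2 asked for, here is a minimal sketch of what a `parallel_map_with_retry` could look like. This is my own illustration (semaphore-capped concurrency plus exponential backoff), not either model's actual output, and the function name is the only thing taken from the prompt:

```python
import asyncio

async def parallel_map_with_retry(func, items, max_concurrency=8,
                                  retries=3, base_delay=0.1):
    """Apply async `func` to every item concurrently, retrying failures
    with exponential backoff and a semaphore cap on concurrency."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with sem:
            for attempt in range(retries):
                try:
                    return await func(item)
                except Exception:
                    if attempt == retries - 1:
                        raise  # out of retries: propagate the last error
                    await asyncio.sleep(base_delay * 2 ** attempt)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(i) for i in items))

# Usage: square items, with one flaky call that fails once before succeeding.
fail_once = {"done": False}

async def flaky_square(x):
    if x == 2 and not fail_once["done"]:
        fail_once["done"] = True
        raise RuntimeError("transient")
    return x * x

results = asyncio.run(parallel_map_with_retry(flaky_square, [1, 2, 3]))
print(results)  # [1, 4, 9]
```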

Key Observation

Comparable raw TPS, but Darwin uses ~33% fewer tokens (~800 vs ~1200) to reach the same answer. Result: lower end-to-end latency, less scrolling, and a denser thinking trace.

The "thinking density" difference is the real win: Darwin's concise <think> traces reduce cognitive load more than raw TPS gains from aggressive quantization.
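A back-of-the-envelope check using the numbers above (sample trace lengths are approximate):

```python
# Approximate time to emit each thinking trace, using the measured
# generation speeds and the ~token counts of the two sample traces.
darwin_s = 800 / 51.0   # Darwin-36B-Opus: ~800 tokens at 51.0 tps
qwen_s = 1200 / 46.6    # Qwen-4.7 fine-tune: ~1200 tokens at 46.6 tps

print(f"Darwin: {darwin_s:.1f}s, Qwen-4.7: {qwen_s:.1f}s")  # Darwin: 15.7s, Qwen-4.7: 25.8s
print(f"Token savings: {1 - 800 / 1200:.0%}")               # Token savings: 33%
```

So despite a slightly lower prefill speed, the shorter trace gets Darwin to the final answer roughly ten seconds sooner on this task.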

Model Links

el4 changed discussion title from APEX Quant Request to APEX Quant Request + Real World Performance
FINAL_Bench org

Thank you @el4 for the detailed benchmark and the kind words! 🙏

You're right about the denser reasoning — Darwin-36B-Opus inherits Claude
Opus reasoning patterns through our Darwin V7 evolutionary merge, which
tends to produce more compact thinking traces compared to standard
fine-tunes.

@mudler an APEX quant would be wonderful — happy to coordinate if needed.

In the meantime, our team is also working on:

  • NVFP4 native quantization (Blackwell-optimized)
  • FP8 build for vLLM serving

Stay tuned, and feel free to ping us with any feedback!
