APEX Quant Request + Real-World Performance
Love this model.
@mudler please consider this model for an APEX quant.
It has denser reasoning, feels more fluid, and simply performs better than the Opus 4.7 finetune I just tested.
Since they both share the same MoE base model, speeds should be very similar if APEX-quantized.
Visual comparison (half of the Opus 4.7 finetune's internal reasoning is shown in this image):
Darwin-36B-Opus vs. Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
Hardware / Config
| Spec | Value |
|---|---|
| OS | Linux 7.0.1-1-cachyos-rt-bore-lto |
| GPU | RTX 4080 Max-Q 12GB @ 60 W TDP |
| CPU | Intel Ultra 9 185H |
| RAM | 32GB LPDDR5x |
| Backend | ik_llama.cpp (main) |
| Context | 65k max |
| Darwin Quant | bartowski IQ4_XS imatrix |
| 4.7 Quant | APEX-I-Compact |
Performance Eval
(Two APEX sizes tested for comparison: one that fits entirely in VRAM, and one closest to the IQ4_XS size on disk)
| Model | Quant | Size (on disk) | Prefill (16k) | Generate |
|---|---|---|---|---|
| Darwin-36B-Opus | bartowski IQ4_XS | ~17.5 GB | 293 tps | 51.0 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Compact | 16.1 GB | 313 tps | 46.6 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Nano | 10.8 GB | 1047 tps | 67.2 tps |
Reasoning Samples (Physics Task)
Darwin-36B-Opus thinking trace (~800 tokens):
Physics: WKB approximation for V(x) = αx⁴. The quantization condition is ∮p(x)dx=(n+1/2)2πℏ. For V(x)=αx⁴, the turning points are at x=±(E/α)^(1/4). The integral gives Eₙ = [derivation]... Let me be more precise.
The WKB quantization: ∫₋ₐᵃ √(2m(E−αx⁴)) dx = (n+1/2)πℏ where a=(E/α)^(1/4).
Using substitution x=a·sin^(1/2)θ... Actually, let me use the standard result. The integral ∫₋ₐᵃ √(a²−x⁴) dx relates to the Beta function.
The result: Eₙ = C·ℏ^(2/3)α^(1/3)(n+1/2)^(2/3)/(2m)^(1/3) where C is a numerical constant.
Qwen-4.7 Fine-Tune thinking trace (~1200 tokens):
The user wants me to solve 4 specific tasks and then write a Python script that saves the solutions to a markdown file. Let me work through each task:
1. Physics: WKB approximation for V(x) = αx⁴, derive E_n
2. Code: parallel_map_with_retry (asyncio) + fix fetch_with_timeout bug
3. Language: Attention analogy in Spanish, translate to Japanese
4. Meta: One genuine limitation and how it affects Task 1
Let me solve each one properly, then write the Python script.
Task 1: WKB for V(x) = αx⁴
The WKB quantization condition:
∮ p(x) dx = (n + 1/2) h
where p(x) = √(2m(E - V(x)))
For V(x) = αx⁴, the turning points are at x = ±(E/α)^(1/4)
The integral:
∮ p(x) dx = 2 ∫_{-a}^{a} √(2m(E - αx⁴)) dx = 4 ∫_0^a √(2m(E - αx⁴)) dx
where a = (E/α)^(1/4)
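Aside (not from either trace): the scaling both models are working toward can be checked numerically. The action integral ∮p dx for V(x) = αx⁴ grows as E^(3/4), so the quantization condition gives Eₙ ∝ (n+1/2)^(4/3). A minimal sketch, assuming ℏ = m = α = 1:

```python
import numpy as np
from scipy.integrate import quad

# WKB action for V(x) = alpha * x^4 with hbar = m = alpha = 1:
#   I(E) = ∮ p dx = 2 ∫_{-a}^{a} sqrt(2 (E - x^4)) dx,   a = E^(1/4)
def action(E):
    a = E ** 0.25
    val, _ = quad(lambda x: np.sqrt(2.0 * (E - x ** 4)), -a, a)
    return val

# I(E) scales as E^(3/4), so I(E) = (n + 1/2) h  =>  E_n ∝ (n + 1/2)^(4/3)
for E in (1.0, 16.0, 256.0):
    print(f"E = {E:6.1f}   I(E)/E^0.75 = {action(E) / E ** 0.75:.6f}")  # ~constant
```

The printed ratio stays flat across E, consistent with the (n+1/2)^(4/3) exponent both models land on in their final answers.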
Output Quality (Final Answers)
- Both models produced identical final physics derivation: Eₙ ∝ (n+1/2)^(4/3)
- Both produced identical async code with retry logic + bug fix (see the sketch after this list)
- Both produced identical Spanish→Japanese attention analogy
- Both acknowledged quantization-induced numerical instability
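For reference, a rough sketch of the Task 2 shape. The names parallel_map_with_retry and fetch_with_timeout come from the benchmark prompt; this is a minimal illustration, not either model's actual output:

```python
import asyncio
import random

async def fetch_with_timeout(item, timeout=2.0):
    """Toy fetch that sometimes fails; asyncio.wait_for enforces the deadline."""
    async def _fetch():
        await asyncio.sleep(random.uniform(0.0, 0.1))
        if random.random() < 0.2:
            raise RuntimeError(f"transient failure for {item}")
        return f"result:{item}"
    return await asyncio.wait_for(_fetch(), timeout=timeout)

async def parallel_map_with_retry(func, items, max_retries=3, concurrency=8):
    """Apply an async func to items concurrently, retrying transient failures."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(item):
        async with sem:
            for attempt in range(max_retries):
                try:
                    return await func(item)
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    await asyncio.sleep(0.1 * 2 ** attempt)  # exponential backoff

    return await asyncio.gather(*(worker(i) for i in items))

if __name__ == "__main__":
    print(asyncio.run(parallel_map_with_retry(fetch_with_timeout, range(5))))
```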
Key Observation
Comparable raw TPS (51.0 vs 46.6 tps generate), but Darwin uses ~33% fewer thinking tokens (~800 vs ~1200) to reach the same answer. Result: lower end-to-end latency, less scrolling, and a denser thinking trace.
The "thinking density" difference is the real win — Darwin's concise <think> traces reduce cognitive load more than raw TPS gains from aggressive quantization.
Model Links
- Darwin-36B-Opus: https://huggingface.co/FINAL-Bench/Darwin-36B-Opus
- Darwin GGUF (bartowski): https://huggingface.co/bartowski/FINAL-Bench_Darwin-36B-Opus-GGUF
- License: Apache 2.0
Thank you @el4 for the detailed benchmark and the kind words! 🙏
You're right about the denser reasoning — Darwin-36B-Opus inherits Claude
Opus reasoning patterns through our Darwin V7 evolutionary merge, which
tends to produce more compact thinking traces compared to standard
fine-tunes.
@mudler an APEX quant would be wonderful — happy to coordinate if needed.
In the meantime, our team is also working on:
- NVFP4 native quantization (Blackwell-optimized)
- FP8 build for vLLM serving
Stay tuned, and feel free to ping us with any feedback!