APEX Quant Request + Real World Performance

#3
by el4 - opened

Love this model.

@mudler please consider this model for an APEX quant.

It has denser reasoning, feels more fluid, and simply performs better than the Opus 4.7 finetune I just tested.

Since they both share the same MoE base model, speeds should be very similar if APEX-quantized.

Visual comparison (half of the Opus 4.7 finetune's internal reasoning is shown in this image):
[image]


Darwin-36B-Opus vs. Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled

Hardware / Config

| Spec | Value |
|---|---|
| OS | Linux 7.0.1-1-cachyos-rt-bore-lto |
| GPU | RTX 4080 Max-Q 12 GB @ 60 W TDP |
| CPU | Intel Ultra 9 185H |
| RAM | 32 GB LPDDR5x |
| Backend | ik_llama.cpp (main) |
| Context | 65k max |
| Darwin quant | bartowski IQ4_XS imatrix |
| 4.7 quant | APEX-I-Compact |

Performance Eval

(Tested comparable sizes: one quant that fits in VRAM, and one closest in size to IQ4_XS)

| Model | Quant | Size (on disk) | Prefill (16k) | Generate |
|---|---|---|---|---|
| Darwin-36B-Opus | bartowski IQ4_XS | ~17.5 GB | 293 tps | 51.0 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Compact | 16.1 GB | 313 tps | 46.6 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Nano | 10.8 GB | 1047 tps | 67.2 tps |

Reasoning Samples (Physics Task)

Darwin-36B-Opus thinking trace (~800 tokens):

Physics: WKB approximation for V(x) = αx⁴. The quantization condition is ∮p(x)dx=(n+1/2)2πℏ. For V(x)=αx⁴, the turning points are at x=±(E/α)^(1/4). The integral gives Eₙ = [derivation]... Let me be more precise.

The WKB quantization: ∫₋ₐᵃ √(2m(E−αx⁴)) dx = (n+1/2)πℏ where a=(E/α)^(1/4).

Using substitution x=a·sin^(1/2)θ... Actually, let me use the standard result. The integral ∫₋ₐᵃ √(a⁴−x⁴) dx relates to the Beta function.

The result: Eₙ = C·ℏ^(4/3)α^(1/3)(n+1/2)^(4/3)/(2m)^(2/3) where C is a numerical constant.

Qwen-4.7 Fine-Tune thinking trace (~1200 tokens):

The user wants me to solve 4 specific tasks and then write a Python script that saves the solutions to a markdown file. Let me work through each task:

1. Physics: WKB approximation for V(x) = αx⁴, derive E_n
2. Code: parallel_map_with_retry (asyncio) + fix fetch_with_timeout bug
3. Language: Attention analogy in Spanish, translate to Japanese
4. Meta: One genuine limitation and how it affects Task 1

Let me solve each one properly, then write the Python script.

Task 1: WKB for V(x) = αx⁴

The WKB quantization condition:
∮ p(x) dx = (n + 1/2) h

where p(x) = √(2m(E - V(x)))

For V(x) = αx⁴, the turning points are at x = ±(E/α)^(1/4)

The integral:
∮ p(x) dx = 2 ∫_{-a}^{a} √(2m(E - αx⁴)) dx = 4 ∫_0^a √(2m(E - αx⁴)) dx

where a = (E/α)^(1/4)
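For reference, here is how the derivation both traces start completes (this is the standard textbook WKB result for the quartic oscillator, written out by me, not part of either model's output):

```latex
% Substitute x = a u with a = (E/\alpha)^{1/4}, so E - \alpha x^4 = E(1 - u^4):
\oint p\,dx
  = 2\int_{-a}^{a}\sqrt{2m\,(E-\alpha x^{4})}\,dx
  = 2a\sqrt{2mE}\int_{-1}^{1}\sqrt{1-u^{4}}\,du
  = \left(n+\tfrac{1}{2}\right)h
% The dimensionless integral is a Beta function:
\int_{-1}^{1}\sqrt{1-u^{4}}\,du = \tfrac{1}{2}\,B\!\left(\tfrac{1}{4},\tfrac{3}{2}\right)
% The left side scales as E^{3/4}\alpha^{-1/4}, so solving for E:
E_{n} = \left[\frac{2\pi\hbar}{\sqrt{2m}\;B\!\left(\tfrac{1}{4},\tfrac{3}{2}\right)}\right]^{4/3}
        \alpha^{1/3}\left(n+\tfrac{1}{2}\right)^{4/3}
```

The E^(3/4) scaling on the left is what produces the (n+1/2)^(4/3) exponent both models report below.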

Output Quality (Final Answers)

  • Both models produced identical final physics derivation: Eₙ ∝ (n+1/2)^(4/3)
  • Both produced identical async code with retry logic + bug fix
  • Both produced identical Spanish→Japanese attention analogy
  • Both acknowledged quantization-induced numerical instability
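For context on what Task 2 asked for, here is a minimal sketch of what a `parallel_map_with_retry` could look like. This is my own illustration (semaphore-capped concurrency plus exponential backoff), not either model's actual output, and the function name is the only thing taken from the prompt:

```python
import asyncio

async def parallel_map_with_retry(func, items, max_concurrency=8,
                                  retries=3, base_delay=0.1):
    """Apply async `func` to every item concurrently, retrying failures
    with exponential backoff and a semaphore cap on concurrency."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with sem:
            for attempt in range(retries):
                try:
                    return await func(item)
                except Exception:
                    if attempt == retries - 1:
                        raise  # out of retries: propagate the last error
                    await asyncio.sleep(base_delay * 2 ** attempt)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(i) for i in items))

# Usage: square items, with one flaky call that fails once before succeeding.
fail_once = {"done": False}

async def flaky_square(x):
    if x == 2 and not fail_once["done"]:
        fail_once["done"] = True
        raise RuntimeError("transient")
    return x * x

results = asyncio.run(parallel_map_with_retry(flaky_square, [1, 2, 3]))
print(results)  # [1, 4, 9]
```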

Key Observation

Comparable raw TPS, but Darwin uses ~33% fewer tokens (~800 vs ~1200) to reach the same answer. Result: lower end-to-end latency, less scrolling, and a denser thinking trace.

The "thinking density" difference is the real win: Darwin's concise <think> traces reduce cognitive load more than raw TPS gains from aggressive quantization.
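A back-of-the-envelope check using the numbers above (sample trace lengths are approximate):

```python
# Approximate time to emit each thinking trace, using the measured
# generation speeds and the ~token counts of the two sample traces.
darwin_s = 800 / 51.0   # Darwin-36B-Opus: ~800 tokens at 51.0 tps
qwen_s = 1200 / 46.6    # Qwen-4.7 fine-tune: ~1200 tokens at 46.6 tps

print(f"Darwin: {darwin_s:.1f}s, Qwen-4.7: {qwen_s:.1f}s")  # Darwin: 15.7s, Qwen-4.7: 25.8s
print(f"Token savings: {1 - 800 / 1200:.0%}")               # Token savings: 33%
```

So despite a slightly lower prefill speed, the shorter trace gets Darwin to the final answer roughly ten seconds sooner on this task.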

Model Links

el4 changed discussion title from APEX Quant Request to APEX Quant Request + Real World Performance
FINAL_Bench org

Thank you @el4 for the detailed benchmark and the kind words! 🙏

You're right about the denser reasoning — Darwin-36B-Opus inherits Claude
Opus reasoning patterns through our Darwin V7 evolutionary merge, which
tends to produce more compact thinking traces compared to standard
fine-tunes.

@mudler an APEX quant would be wonderful — happy to coordinate if needed.

In the meantime, our team is also working on:

  • NVFP4 native quantization (Blackwell-optimized)
  • FP8 build for vLLM serving

Stay tuned, and feel free to ping us with any feedback!
