Qwen3-Coder-30B-A3B APEX GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of Qwen3-Coder-30B-A3B-Instruct.

Brought to you by the LocalAI team | APEX Project | Technical Report

Benchmark Results

All measurements were taken on an NVIDIA DGX Spark (GB10, 128 GB VRAM). Perplexity was measured on wikitext-2-raw at context 2048; accuracy benchmarks were run via llama.cpp (400 tasks each).

| Configuration | Size (GB) | Perplexity | KL mean | HellaSwag | Winogrande | MMLU | ARC | TruthfulQA | tg128 (t/s) |
|---|---|---|---|---|---|---|---|---|---|
| Q8_0 | 30.3 | 9.537 | 0.0031 | 75.8% | 68.0% | 39.6% | 45.8% | 30.0% | 57.1 |
| APEX I-Balanced | 20.8 | 9.516 | 0.0074 | 76.5% | 68.3% | 40.2% | 46.2% | 30.4% | 68.5 |
| APEX I-Quality | 18.1 | 9.535 | 0.0108 | 75.3% | 68.5% | 39.8% | 44.8% | 30.5% | 74.1 |
| APEX Quality | 18.1 | 9.560 | 0.0117 | 75.5% | 68.0% | 40.1% | 44.5% | 31.8% | 73.7 |
| APEX Balanced | 20.5 | 9.563 | 0.0083 | 75.5% | 68.5% | 39.6% | 45.2% | 30.5% | 68.1 |
| Unsloth Q5_K_S | 19.6 | 9.513 | 0.0119 | 75.3% | 68.5% | 39.8% | 45.2% | 30.2% | 72.2 |
| Unsloth UD-Q4_K_XL | 16.5 | 9.676 | 0.0246 | 76.3% | 67.0% | 39.7% | 47.5% | 30.5% | 82.3 |
| APEX I-Compact | 13.8 | 9.667 | 0.0418 | 76.3% | 68.8% | 39.0% | 44.1% | 29.0% | 84.5 |
| APEX Compact | 13.8 | 9.765 | 0.0492 | 75.0% | 67.0% | 39.1% | 45.8% | 30.4% | 83.8 |
| APEX Mini | 11.3 | 9.838 | 0.0862 | 73.5% | 68.8% | 39.0% | 44.1% | 31.0% | 91.4 |
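The KL mean column reports the average Kullback-Leibler divergence between the full-precision model's next-token distribution and the quantized model's, so lower values mean the quantized model tracks the original more closely. A minimal sketch of that statistic (toy distributions for illustration, not the llama.cpp implementation):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(reference_dists, quantized_dists):
    """Average per-token KL between reference and quantized model outputs."""
    kls = [kl_divergence(p, q) for p, q in zip(reference_dists, quantized_dists)]
    return sum(kls) / len(kls)

# Toy example: two token positions, vocabulary of 3 tokens.
ref = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
quant = [[0.68, 0.21, 0.11], [0.52, 0.29, 0.19]]
print(round(mean_kl(ref, quant), 5))
```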

Highlights

  • APEX I-Balanced beats Q8_0 in PPL (9.516 vs 9.537), HellaSwag (76.5% vs 75.8%), MMLU (40.2% vs 39.6%), and ARC (46.2% vs 45.8%) while being 31% smaller and 20% faster.
  • APEX I-Compact matches UD-Q4_K_XL quality at 16% less size (13.8 vs 16.5 GB) with higher Winogrande (68.8% vs 67.0%).
  • APEX Mini (11.3 GB) delivers 91.4 t/s -- fastest of any configuration -- while maintaining viable quality for coding tasks.
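The size and speed deltas in the first highlight follow directly from the benchmark table:

```python
# Figures taken from the benchmark table above.
q8_size, apex_size = 30.3, 20.8      # GB on disk
q8_speed, apex_speed = 57.1, 68.5    # tg128 tokens/s

size_reduction = (q8_size - apex_size) / q8_size
speedup = apex_speed / q8_speed - 1

print(f"{size_reduction:.0%} smaller, {speedup:.0%} faster")
```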

Available Files

| File | Profile | Size | Best For |
|---|---|---|---|
| Qwen3-Coder-30B-APEX-I-Balanced.gguf | I-Balanced | 20.8 GB | Best overall -- beats Q8_0 quality |
| Qwen3-Coder-30B-APEX-I-Quality.gguf | I-Quality | 18.1 GB | Best accuracy with imatrix |
| Qwen3-Coder-30B-APEX-Quality.gguf | Quality | 18.1 GB | Lowest perplexity at this size |
| Qwen3-Coder-30B-APEX-Balanced.gguf | Balanced | 20.5 GB | General purpose, low KL |
| Qwen3-Coder-30B-APEX-I-Compact.gguf | I-Compact | 13.8 GB | Consumer GPUs, best quality at size |
| Qwen3-Coder-30B-APEX-Compact.gguf | Compact | 13.8 GB | Consumer 24 GB GPUs |
| Qwen3-Coder-30B-APEX-Mini.gguf | Mini | 11.3 GB | 16 GB VRAM, fastest inference |
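As a rule of thumb, pick the largest file that fits your VRAM while leaving headroom for the KV cache and activations. A hypothetical helper illustrating that selection (the 20% headroom factor is an assumption, not a measured requirement):

```python
# (file, size in GB) pairs from the table above
FILES = [
    ("Qwen3-Coder-30B-APEX-I-Balanced.gguf", 20.8),
    ("Qwen3-Coder-30B-APEX-I-Quality.gguf", 18.1),
    ("Qwen3-Coder-30B-APEX-Quality.gguf", 18.1),
    ("Qwen3-Coder-30B-APEX-Balanced.gguf", 20.5),
    ("Qwen3-Coder-30B-APEX-I-Compact.gguf", 13.8),
    ("Qwen3-Coder-30B-APEX-Compact.gguf", 13.8),
    ("Qwen3-Coder-30B-APEX-Mini.gguf", 11.3),
]

def pick_file(vram_gb, headroom=0.2):
    """Largest file fitting vram_gb after reserving headroom for KV cache."""
    budget = vram_gb * (1 - headroom)
    candidates = [f for f in FILES if f[1] <= budget]
    return max(candidates, key=lambda f: f[1])[0] if candidates else None

print(pick_file(16))   # 16 GB GPU
print(pick_file(24))   # 24 GB GPU
```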

What is APEX?

APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient -- edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling -- no Wikipedia).
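The role classification and layer-wise gradient can be sketched as follows (the quant types, role names, and layer thresholds here are illustrative assumptions, not the actual APEX recipe):

```python
def apex_quant_type(layer_idx, n_layers, role):
    """Illustrative APEX-style assignment: higher precision at edge layers
    and for attention/shared-expert tensors, more aggressive compression
    for routed experts in the middle of the stack."""
    edge = layer_idx < 2 or layer_idx >= n_layers - 2
    if role in ("attention", "shared_expert"):
        return "Q6_K" if edge else "Q5_K"
    if role == "routed_expert":
        return "Q5_K" if edge else "Q3_K"
    return "Q8_0"  # embeddings, output head, norms

for layer in (0, 24, 47):
    print(layer, apex_quant_type(layer, 48, "routed_expert"))
```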

See the APEX project for full details, technical report, and scripts.

Run with LocalAI

local-ai run mudler/Qwen3-Coder-30B-APEX-GGUF@Qwen3-Coder-30B-APEX-I-Balanced.gguf
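The same files can also be run with llama.cpp directly. A sketch, assuming a recent build with Hugging Face repo support (`-hf`/`--hf-file`); flag availability depends on your llama.cpp version:

```shell
# Download from the Hugging Face repo and start an interactive session
llama-cli -hf mudler/Qwen3-Coder-30B-APEX-GGUF \
  --hf-file Qwen3-Coder-30B-APEX-I-Balanced.gguf \
  -p "Write a hello world HTTP server in Go."
```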

Credits

APEX is brought to you by the LocalAI team. Developed through human-driven, AI-assisted research. Built on llama.cpp.

Model Details

31B parameters, qwen3moe architecture, GGUF format.