---
license: apache-2.0
base_model: FINAL-Bench/Darwin-28B-Coder
base_model_relation: quantized
library_name: gguf
pipeline_tag: text-generation
tags:
- gguf
- llama.cpp
- mtp
- multi-token-prediction
- speculative-decoding
- code
language:
- en
- ko
---

# Darwin-28B-Coder GGUF (MTP-enabled)

GGUF builds of [**FINAL-Bench/Darwin-28B-Coder**](https://huggingface.co/FINAL-Bench/Darwin-28B-Coder) with the native **Multi-Token Prediction (MTP)** head preserved, for **self-speculative decoding in llama.cpp**. Requested in the [base model discussion](https://huggingface.co/FINAL-Bench/Darwin-28B-Coder/discussions/1).

## Files

| File | Quant | Size | Notes |
|------|-------|------|-------|
| `Darwin-28B-Coder-Q4_K_M.gguf` | Q4_K_M | 16.8 GB | recommended for most GPUs |
| `Darwin-28B-Coder-Q8_0.gguf` | Q8_0 | 29.0 GB | near-lossless |
| `Darwin-28B-Coder-F16.gguf` | F16 | 54.7 GB | full precision |

All files include the MTP layer, as verified in the metadata: `general.architecture = qwen35`, `qwen35.nextn_predict_layers = 1`, tensors `blk.64.nextn.*`.

## Multi-Token Prediction (MTP)

This model ships with a trained MTP head (1 prediction layer). With a recent **llama.cpp** build that includes MTP support (merged in [PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673)), the `nextn` layer is used for **self-speculative decoding**: typically **~1.5–2× faster generation with identical output** (the main model verifies every drafted token, so quality is unchanged).

> A standard (non-MTP) GGUF does **not** contain the prediction head. You need these MTP-enabled files to benefit from the speedup.
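
As a rough sanity check on the ~1.5–2× figure, the following is a minimal back-of-the-envelope sketch, not a benchmark. It assumes exactly one drafted token per verification pass, a constant acceptance rate, and that verifying two tokens costs about the same as a single decode pass. The function name `mtp_speedup`, the acceptance rates, and the draft-cost fraction are illustrative assumptions; they are not measurements from this model and do not describe how llama.cpp schedules drafts.

```bash
#!/usr/bin/env bash
# Idealized speedup for one-token-ahead self-speculative decoding.
# Assumed model (hypothetical, for illustration only):
#   - each main-model pass always yields 1 token, plus 1 more if the draft is accepted
#   - accept     = probability that the drafted token is accepted (0..1)
#   - draft_cost = cost of the MTP head as a fraction of one main-model pass
#   => speedup = (1 + accept) / (1 + draft_cost)
mtp_speedup() {
  local accept="$1" draft_cost="${2:-0}"
  LC_ALL=C awk -v p="$accept" -v c="$draft_cost" \
    'BEGIN { printf "%.2f\n", (1 + p) / (1 + c) }'
}

mtp_speedup 0.5        # prints 1.50 (half of the drafts accepted, free draft)
mtp_speedup 0.8 0.05   # prints 1.71 (80% accepted, draft costs 5% of a pass)
```

Under these assumptions the speedup is bounded by 2×, which is reached only when every drafted token is accepted and the draft head is free. Actual throughput depends on the prompt, the sampling settings, and the hardware.
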

## Usage

```bash
# 1) Build a recent llama.cpp (MTP support is in mainline since PR #22673)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release

# 2) Run: the nextn (MTP) layer enables self-speculative decoding
./build/bin/llama-cli \
  -m Darwin-28B-Coder-Q4_K_M.gguf \
  -ngl 99 -c 8192 \
  -p "Write a quicksort in Python."
```

For the exact MTP/speculative flags and the latest behaviour, see the llama.cpp MTP documentation / PR #22673. Works with `llama-cli` and `llama-server`.

## Model spec (public)

| Property | Value |
|---|---|
| Architecture | `qwen35` (hybrid attention) |
| Layers | 64 + 1 MTP |
| Hidden size | 5120 |
| Attention heads | 24 (KV 4) |
| Context length | 262,144 |
| Vocab | 248,320 |
| Precision (source) | bfloat16 |

## License & attribution

License and usage follow the base model [FINAL-Bench/Darwin-28B-Coder](https://huggingface.co/FINAL-Bench/Darwin-28B-Coder). These are GGUF conversions only; refer to the base model card for model details, intended use, and limitations.

GGUF conversion + quantization by the FINAL-Bench team using `llama.cpp/convert_hf_to_gguf.py`.