Use with the llama-cpp-python library:
```python
# !pip install llama-cpp-python

from llama_cpp import Llama

# download the GGUF from the Hub and load it
llm = Llama.from_pretrained(
	repo_id="Ohjaaja/Qwen3-Coder-Next-Q3mix",
	filename="Qwen3-Coder-Next-Q3mix.gguf",
)

# no canonical input example is defined for this model; an illustrative prompt:
llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Write a Python function that reverses a string."}
	]
)
```

Base model: Qwen/Qwen3-Coder-Next

Perplexity (WikiText-2 test set, ctx=512, 584 chunks):

| Model | PPL | ± |
|---|---|---|
| Qwen3-Coder-Next-Q3mix (this) | 8.4837 | 0.06573 |
| ubergarm smol-IQ3_KS (reference) | 8.4649 | 0.06633 |

The model is compatible with mainline llama.cpp. However, the test above was run with the [llama.cpp TurboQuant build](https://github.com/TheTom/llama-cpp-turboquant). Note that the benchmark was not run on mainline llama.cpp, and the ubergarm model was benchmarked with ik_llama.cpp, so the results are not directly comparable.
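For reference, numbers like these come from llama.cpp's perplexity tool. A minimal sketch of an equivalent run, assuming a local copy of the WikiText-2 raw test set (the exact invocation used here isn't recorded):

```bash
# illustrative perplexity run; the WikiText-2 path is a placeholder
llama-perplexity -m Qwen3-Coder-Next-Q3mix.gguf \
	-f wikitext-2-raw/wiki.test.raw -c 512
```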

This quant is inspired by ubergarm/Qwen3-Coder-Next-GGUF smol-IQ3_KS (30.728 GiB, 3.313 BPW).

Qwen3-Coder-Next-Q3mix was quantized using:

```bash
#!/usr/bin/env bash

cat > /tmp/qwen3_tensors.txt << 'EOF'
attn_gate.weight=q6_k
attn_qkv.weight=q6_k
attn_output.weight=q6_k
attn_q.weight=q6_k
attn_k.weight=q6_k
attn_v.weight=q6_k
ssm_ba.weight=q6_k
ssm_out.weight=q6_k
ffn_down_shexp.weight=q6_k
ffn_gate_shexp.weight=q6_k
ffn_up_shexp.weight=q6_k
ffn_gate_inp.weight=q8_0
ffn_gate_inp_shexp.weight=q8_0
ffn_down_exps.weight=iq3_s
ffn_gate_exps.weight=iq3_s
ffn_up_exps.weight=iq3_s
token_embd.weight=iq4_nl
output.weight=q6_k
EOF

~/Documents/llama-cpp-turboquant/build/bin/llama-quantize \
	--tensor-type-file /tmp/qwen3_tensors.txt \
	--imatrix ~/imatrix-Qwen3-Coder-Next-BF16.dat \
	~/Qwen3-Coder-Next-f16.gguf \
	~/Qwen3-Coder-Next-Q3mix.gguf \
	IQ3_S \
	$(nproc)
```
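The imatrix file referenced above is not produced by this script. For completeness, a sketch of how such an importance matrix is typically generated with llama.cpp's llama-imatrix tool; calibration.txt is a placeholder for whatever calibration data was actually used:

```bash
# hypothetical: generate an importance matrix from the full-precision model
llama-imatrix -m Qwen3-Coder-Next-BF16.gguf \
	-f calibration.txt \
	-o imatrix-Qwen3-Coder-Next-BF16.dat
```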

Huge thanks to ubergarm for the inspiration and expertise. This quant is not intended to rival ubergarm's releases; it exists out of a personal need to run the model on mainline llama.cpp with 12 GB VRAM and 32 GB RAM.
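As a rough sketch of that setup (not a tested recipe): recent mainline llama.cpp builds can offload all layers to the GPU while keeping the routed-expert tensors in system RAM, which is the usual way to fit a large MoE into 12 GB of VRAM. Context size and layer count below are illustrative:

```bash
# assumption: a recent mainline build with --n-cpu-moe support
llama-server -m Qwen3-Coder-Next-Q3mix.gguf \
	-c 8192 -ngl 99 --n-cpu-moe 48
```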

Best practices from the base model's card:

To achieve optimal performance, we recommend the following sampling parameters: temperature=1.0, top_p=0.95, top_k=40.
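Applied to mainline llama.cpp's CLI, those settings look like this (prompt and context size are placeholders):

```bash
# recommended sampling settings from the base model card
llama-cli -m Qwen3-Coder-Next-Q3mix.gguf \
	--temp 1.0 --top-p 0.95 --top-k 40 \
	-c 8192 -p "Write a quicksort in Python."
```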
