How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dinerburger/Qwen3-Coder-Next-GGUF:IQ3_S
# Run inference directly in the terminal:
llama-cli -hf dinerburger/Qwen3-Coder-Next-GGUF:IQ3_S
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dinerburger/Qwen3-Coder-Next-GGUF:IQ3_S
# Run inference directly in the terminal:
llama-cli -hf dinerburger/Qwen3-Coder-Next-GGUF:IQ3_S
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf dinerburger/Qwen3-Coder-Next-GGUF:IQ3_S
# Run inference directly in the terminal:
./llama-cli -hf dinerburger/Qwen3-Coder-Next-GGUF:IQ3_S
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf dinerburger/Qwen3-Coder-Next-GGUF:IQ3_S
# Run inference directly in the terminal:
./build/bin/llama-cli -hf dinerburger/Qwen3-Coder-Next-GGUF:IQ3_S
Use Docker
docker model run hf.co/dinerburger/Qwen3-Coder-Next-GGUF:IQ3_S
Quick Links

This is a custom GGUF quantization of Qwen3-Coder-Next, using the unsloth imatrix data with specific focus on retaining quality in embedding, output and attention tensors.

IQ4_XS quantization script:

QUANT="IQ4_XS"
llama-quantize \
  --output-tensor-type q8_0 \
  --token-embedding-type q8_0 \
  --tensor-type attn_qkv=bf16 \
  --tensor-type attn_v=bf16 \
  --tensor-type attn_q=bf16 \
  --tensor-type attn_k=bf16 \
  --tensor-type attn_gate=bf16 \
  --tensor-type attn_output=bf16 \
  --tensor-type ssm_ba=bf16 \
  --tensor-type ssm_beta=bf16 \
  --tensor-type ssm_alpha=bf16 \
  --tensor-type ssm_out=bf16 \
  --tensor-type ffn_down_shexp=bf16 \
  --tensor-type ffn_gate_shexp=bf16 \
  --tensor-type ffn_up_shexp=bf16 \
  --tensor-type ffn_down_exps=iq4_nl \
  --imatrix Qwen-Coder-Next-imatrix.gguf_file \
  BF16/Qwen3-Coder-Next-BF16-00001-of-00004.gguf \
  Qwen3-Coder-Next.${QUANT}.gguf \
  ${QUANT}

IQ3_S quantization script:

QUANT="IQ3_S"
llama-quantize \
  --output-tensor-type q6_k \
  --token-embedding-type q6_k \
  --tensor-type attn_qkv=bf16 \
  --tensor-type attn_v=bf16 \
  --tensor-type attn_q=bf16 \
  --tensor-type attn_k=bf16 \
  --tensor-type attn_gate=bf16 \
  --tensor-type attn_output=bf16 \
  --tensor-type ssm_ba=bf16 \
  --tensor-type ssm_beta=bf16 \
  --tensor-type ssm_alpha=bf16 \
  --tensor-type ssm_out=bf16 \
  --tensor-type ffn_down_shexp=bf16 \
  --tensor-type ffn_gate_shexp=bf16 \
  --tensor-type ffn_up_shexp=bf16 \
  --tensor-type ffn_down_exps=iq4_xs \
  --imatrix Qwen-Coder-Next-imatrix.gguf_file \
  BF16/Qwen3-Coder-Next-BF16-00001-of-00004.gguf \
  Qwen3-Coder-Next.${QUANT}.gguf \
  ${QUANT}
Downloads last month
83
GGUF
Model size
80B params
Architecture
qwen3next
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for dinerburger/Qwen3-Coder-Next-GGUF

Quantized
(97)
this model