Qwen3-Coder-Next-UD-TQ3_0 (GGUF)

This repository contains the TQ3_0 quantized version of the Qwen3-Coder-Next model, produced and sized with recent 32 GB-class NVIDIA GPUs in mind.

🚀 Model Highlights

  • Quantization Method: TurboQuant (TQ3_0), tuned to retain model quality at low bitrates.
  • Target Bitrate: 3.25 bits per weight (bpw), a sweet spot between 3-bit and 4-bit quantization.
  • Hardware Used: Quantized on a dedicated NVIDIA GeForce RTX 5090.
  • Optimization: Built with a custom-patched llama.cpp (llama-turbo) that adds support for the TQ3_0 algorithm.
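The target bitrate, together with the 80B parameter count listed in the GGUF metadata, largely determines the file size; the small remainder is metadata and embedding overhead. A quick back-of-the-envelope check:

```python
# Rough file-size estimate from bits-per-weight (ignores GGUF metadata overhead).
params = 80e9          # parameter count of Qwen3-Coder-Next
bpw = 3.25             # target bits per weight for TQ3_0
size_bytes = params * bpw / 8
size_gib = size_bytes / 1024**3
print(f"{size_gib:.1f} GiB")   # roughly 30.3 GiB, in line with the ~30.4 GB file
```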

πŸ› οΈ Quantization Details

The TQ3_0 format combines Lloyd-Max quantization with a Walsh-Hadamard Transform (WHT) rotation to minimize information loss. This build is calibrated to 3.25 bpw.

  • BPW (Bits Per Weight): 3.25
  • Size: Approximately 30.4 GB (fits on 32 GB VRAM GPUs such as the RTX 5090)
  • Efficiency: Tuned for high throughput while preserving coding quality.
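The two building blocks named above can be sketched in a few lines. This is only an illustration of the general idea, not the actual TQ3_0 implementation: `fwht` and `lloyd_max` are hypothetical helper names, and real TurboQuant works per-block with packed integer codes.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2).
    With 1/sqrt(n) scaling, the transform is its own inverse."""
    a = x.astype(np.float64).copy()
    n, h = len(a), 1
    while h < n:
        for i in range(0, n, h * 2):
            u, v = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = u + v, u - v
        h *= 2
    return a / np.sqrt(n)

def lloyd_max(x, levels=8, iters=25):
    """Lloyd's algorithm: alternate nearest-codeword assignment and
    centroid refitting. 8 levels = 3 bits per code before overhead."""
    codebook = np.quantile(x, np.linspace(0, 1, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                codebook[k] = x[idx == k].mean()
    return codebook, idx

rng = np.random.default_rng(0)
w = rng.normal(size=256)            # a toy weight block
rotated = fwht(w)                   # WHT spreads outliers across the block
codebook, idx = lloyd_max(rotated)  # quantize to 3-bit codes
restored = fwht(codebook[idx])      # dequantize, then undo the rotation
print(np.mean((w - restored) ** 2)) # small reconstruction error
```

The rotation matters because Lloyd-Max is a scalar quantizer: spreading outlier weights across the block makes the value distribution easier to cover with few levels.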

💻 How to Use

To run this model, you need an inference engine with TurboQuant (TQ3_0) support, such as the llama-turbo fork mentioned above.

Using llama-server (Example)

```bash
# -ngl 99: offload all layers to the GPU; -c 32768: 32K-token context window
./llama-server \
    -m Qwen3-Coder-Next-UD-TQ3_0.gguf \
    -ngl 99 \
    -c 32768 \
    --port 8080
```
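Once running, llama-server exposes an OpenAI-compatible HTTP API. The sketch below only builds a request body (the prompt and sampling parameters are illustrative assumptions); POST it to http://localhost:8080/v1/chat/completions with any HTTP client:

```python
import json

# Request body for llama-server's OpenAI-compatible chat endpoint.
payload = {
    "model": "Qwen3-Coder-Next-UD-TQ3_0",  # informational; the server serves the -m file
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "temperature": 0.2,
    "max_tokens": 256,
}
print(json.dumps(payload, indent=2))
# e.g.: curl http://localhost:8080/v1/chat/completions \
#         -H "Content-Type: application/json" -d "$(python this_script.py)"
```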
Model Info

  • Format: GGUF
  • Model size: 80B params
  • Architecture: qwen3next
Model tree: edwardyoon79/Qwen3-Coder-Next-TQ3_0 (quantized)