Qwen3-Coder-Next-UD-TQ3_0 (GGUF)

This repository contains the TQ3_0 quantized version of the Qwen3-Coder-Next model, produced and sized with recent 32 GB-class NVIDIA GPUs in mind.

🚀 Model Highlights

  • Quantization Method: TurboQuant (TQ3_0), tuned to retain model quality at low bitrates.
  • Target Bitrate: 3.25 bits per weight (bpw), a sweet spot between 3-bit and 4-bit quantization.
  • Hardware Used: Quantized on a dedicated NVIDIA GeForce RTX 5090.
  • Optimization: Built with a custom-patched llama.cpp (llama-turbo) that adds support for the TQ3_0 algorithm.
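The target bitrate, together with the 80B parameter count listed in the GGUF metadata, largely determines the file size; the small remainder is metadata and embedding overhead. A quick back-of-the-envelope check:

```python
# Rough file-size estimate from bits-per-weight (ignores GGUF metadata overhead).
params = 80e9          # parameter count of Qwen3-Coder-Next
bpw = 3.25             # target bits per weight for TQ3_0
size_bytes = params * bpw / 8
size_gib = size_bytes / 1024**3
print(f"{size_gib:.1f} GiB")   # roughly 30.3 GiB, in line with the ~30.4 GB file
```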

πŸ› οΈ Quantization Details

The TQ3_0 format combines Lloyd-Max quantization with a Walsh-Hadamard Transform (WHT) rotation to minimize information loss. This build is calibrated to 3.25 bpw.

  • BPW (Bits Per Weight): 3.25
  • Size: Approximately 30.4 GB (fits on 32 GB VRAM GPUs such as the RTX 5090)
  • Efficiency: Tuned for high throughput while preserving coding quality.
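The two building blocks named above can be sketched in a few lines. This is only an illustration of the general idea, not the actual TQ3_0 implementation: `fwht` and `lloyd_max` are hypothetical helper names, and real TurboQuant works per-block with packed integer codes.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2).
    With 1/sqrt(n) scaling, the transform is its own inverse."""
    a = x.astype(np.float64).copy()
    n, h = len(a), 1
    while h < n:
        for i in range(0, n, h * 2):
            u, v = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = u + v, u - v
        h *= 2
    return a / np.sqrt(n)

def lloyd_max(x, levels=8, iters=25):
    """Lloyd's algorithm: alternate nearest-codeword assignment and
    centroid refitting. 8 levels = 3 bits per code before overhead."""
    codebook = np.quantile(x, np.linspace(0, 1, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                codebook[k] = x[idx == k].mean()
    return codebook, idx

rng = np.random.default_rng(0)
w = rng.normal(size=256)            # a toy weight block
rotated = fwht(w)                   # WHT spreads outliers across the block
codebook, idx = lloyd_max(rotated)  # quantize to 3-bit codes
restored = fwht(codebook[idx])      # dequantize, then undo the rotation
print(np.mean((w - restored) ** 2)) # small reconstruction error
```

The rotation matters because Lloyd-Max is a scalar quantizer: spreading outlier weights across the block makes the value distribution easier to cover with few levels.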

💻 How to Use

To run this model, you need an inference engine with TurboQuant (TQ3_0) support, such as the llama-turbo fork mentioned above.

Using llama-server (Example)

```bash
# -ngl 99: offload all layers to the GPU; -c 32768: 32K-token context window
./llama-server \
    -m Qwen3-Coder-Next-UD-TQ3_0.gguf \
    -ngl 99 \
    -c 32768 \
    --port 8080
```
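Once running, llama-server exposes an OpenAI-compatible HTTP API. The sketch below only builds a request body (the prompt and sampling parameters are illustrative assumptions); POST it to http://localhost:8080/v1/chat/completions with any HTTP client:

```python
import json

# Request body for llama-server's OpenAI-compatible chat endpoint.
payload = {
    "model": "Qwen3-Coder-Next-UD-TQ3_0",  # informational; the server serves the -m file
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "temperature": 0.2,
    "max_tokens": 256,
}
print(json.dumps(payload, indent=2))
# e.g.: curl http://localhost:8080/v1/chat/completions \
#         -H "Content-Type: application/json" -d "$(python this_script.py)"
```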
Model Info

  • Format: GGUF
  • Model size: 80B params
  • Architecture: qwen3next
Model tree: edwardyoon79/Qwen3-Coder-Next-TQ3_0 (quantized)