Qwen3-8B-Instruct: GGUF Quants

Quantized GGUF versions of Qwen/Qwen3-8B-Instruct, Alibaba's Qwen3 instruction-tuned 8B model. Fine-tuned for chat and instruction following, with improved reasoning and tool use over Qwen2.5-7B-Instruct.

Available Files

| File | Quant | Size | Use Case |
|------|-------|------|----------|
| Qwen3-8B-Instruct-Q8_0.gguf | Q8_0 | ~8.6 GB | Maximum quality |
| Qwen3-8B-Instruct-Q6_K.gguf | Q6_K | ~6.6 GB | Near-lossless |
| Qwen3-8B-Instruct-Q5_K_M.gguf | Q5_K_M | ~5.7 GB | High quality |
| Qwen3-8B-Instruct-Q4_K_M.gguf | Q4_K_M | ~4.9 GB | Recommended default |
| Qwen3-8B-Instruct-Q3_K_M.gguf | Q3_K_M | ~3.9 GB | Low VRAM |
| Qwen3-8B-Instruct-IQ4_XS.gguf | IQ4_XS | ~4.3 GB | Imatrix 4-bit |
| Qwen3-8B-Instruct-IQ3_XXS.gguf | IQ3_XXS | ~3.2 GB | Imatrix 3-bit |
| Qwen3-8B-Instruct-IQ2_M.gguf | IQ2_M | ~2.8 GB | Imatrix 2-bit |
| Qwen3-8B-Instruct-IQ1_S.gguf | IQ1_S | ~2.0 GB | Extreme compression |
| Qwen3-8B-Instruct-fp16.gguf | FP16 | ~16.0 GB | Full precision |
| imatrix.dat | - | - | Importance matrix |
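
One way to choose among these files is to take the largest quant whose file size fits a given memory budget. The sketch below hard-codes the approximate sizes listed above, in tenths of a GB so that plain integer arithmetic works. The function name `pick_quant` and the 1.5 GB headroom reserved for the KV cache and runtime buffers are illustrative assumptions, not measured values.

```shell
# Hypothetical helper: pick the largest quant listed above that fits a memory budget.
# Sizes are the approximate file sizes from the table, in tenths of a GB.
# HEADROOM is an assumed allowance for KV cache and buffers, not a measured value.
HEADROOM=15

pick_quant() {
  # $1: available memory in tenths of a GB (e.g. 80 for 8.0 GB)
  pq_budget=$(( $1 - HEADROOM ))
  # Entries are ordered from largest to smallest file.
  for pq_entry in \
    "160:fp16" "86:Q8_0" "66:Q6_K" "57:Q5_K_M" "49:Q4_K_M" \
    "43:IQ4_XS" "39:Q3_K_M" "32:IQ3_XXS" "28:IQ2_M" "20:IQ1_S"; do
    pq_size=${pq_entry%%:*}   # text before the colon
    pq_name=${pq_entry#*:}    # text after the colon
    if [ "$pq_size" -le "$pq_budget" ]; then
      echo "Qwen3-8B-Instruct-${pq_name}.gguf"
      return 0
    fi
  done
  echo "no quant fits in the given budget" >&2
  return 1
}

# Example: an 8 GB budget leaves 6.5 GB after the assumed headroom.
pick_quant 80
```

Actual memory use depends on context length and backend, so the headroom should be adjusted to the setup at hand.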

Usage

./llama-cli -m Qwen3-8B-Instruct-Q4_K_M.gguf --ctx-size 8192 -n 512 \
  -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"

# Ollama
ollama run hf.co/DuoNeural/Qwen3-8B-Instruct-GGUF:Q4_K_M

• Parameters: 8B | License: Apache 2.0 | Context: 32K tokens
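
The `-p` string in the llama-cli command above follows the ChatML layout: each turn is wrapped in `<|im_start|>role` and `<|im_end|>`, and the prompt ends with an open assistant turn. The following is a minimal sketch for assembling that string from a system message and a user message. The function name `chatml_prompt` is hypothetical, and it emits real newlines rather than the literal `\n` escapes used above.

```shell
# Hypothetical helper: build a single-turn ChatML prompt like the one passed to -p above.
# $1: system message, $2: user message
chatml_prompt() {
  printf '<|im_start|>system\n%s<|im_end|>\n<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n' "$1" "$2"
}

# Example use with the same flags as the command above.
# Note: $(...) strips the trailing newline after "assistant".
# ./llama-cli -m Qwen3-8B-Instruct-Q4_K_M.gguf --ctx-size 8192 -n 512 \
#   -p "$(chatml_prompt 'You are a helpful assistant.' 'Hello!')"
```

Keeping the template in one function avoids retyping the special tokens by hand when only the messages change.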

Quantized by DuoNeural using llama.cpp on an RTX 5090.


DuoNeural

DuoNeural is an open AI research lab: human + AI in collaboration.

DuoNeural Research Publications

Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, Aura (DuoNeural).
