Qwen2-VL-2B-ternary

Qwen2-VL-2B-Instruct with the text/language backbone quantized to ternary weights. The vision backbone is kept at FP16 for full image quality. 196 transformer layers are quantized; weights are stored as 2-bit packed ternary codes with per-group float16 scales. Load with ternary-quant for inference on GPU, CPU (auto BF16), or Apple Silicon.
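To make the scheme concrete, here is a minimal sketch of per-group ternary quantization with float16 scales. The helper names and the threshold rule are illustrative assumptions, not ternary-quant's actual implementation:

```python
import numpy as np

def ternary_quantize(w, group_size=32):
    """Map each group of weights to {-1, 0, +1} plus one float16 scale per group."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).mean(axis=1, keepdims=True).astype(np.float16)
    # keep a weight only if it is large relative to its group's mean magnitude
    q = np.where(np.abs(groups) > 0.5 * scale, np.sign(groups), 0.0).astype(np.int8)
    return q, scale

def ternary_dequantize(q, scale):
    """Reconstruct approximate FP32 weights from codes and per-group scales."""
    return (q * scale.astype(np.float32)).reshape(-1)
```

The released model additionally stores a per-group offset and uses the tritplane3 scheme (a sum of ternary planes), so the real codec is richer than this single-plane sketch.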

Compression: 1.8× (text backbone only; vision stays FP16) vs FP16 | Quality: coherent output ✓ (image + text verified)
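The compression ratio follows from the storage format: each quantized weight costs 2 bits plus its share of the per-group float16 scale and offset. A back-of-the-envelope check (my arithmetic from the card's numbers, not an official figure):

```python
# 2-bit code + (fp16 scale + fp16 offset) amortized over a 32-weight group
bits_per_weight = 2 + (16 + 16) / 32      # 3.0 bits vs 16 bits for FP16
text_layer_ratio = 16 / bits_per_weight   # ~5.3x on the quantized text layers
overall_ratio = 2.6 / 1.5                 # from the card's VRAM numbers, ~1.7x
```

The overall ratio is far below the per-layer ratio because the vision backbone stays at FP16.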

Usage

pip install ternary-quant

from ternary_quant.inference import load_ternary_model

# GPU: fastest (dequantize once, use cuBLAS)
model, tokenizer = load_ternary_model(
    "AsadIsmail/Qwen2-VL-2B-ternary", device="cuda", runtime_mode="cached"
)

# GPU: lowest VRAM (~1.5 GB)
model, tokenizer = load_ternary_model(
    "AsadIsmail/Qwen2-VL-2B-ternary", device="cuda", runtime_mode="gemlite"
)

# CPU: auto BF16 on Intel AVX2+
model, tokenizer = load_ternary_model(
    "AsadIsmail/Qwen2-VL-2B-ternary", device="cpu", runtime_mode="cached"
)

# Apple Silicon
model, tokenizer = load_ternary_model(
    "AsadIsmail/Qwen2-VL-2B-ternary", device="mps", runtime_mode="metal"
)

# Generate
from ternary_quant.inference import generate_text
output = generate_text(model, tokenizer, "Ternary quantization works by", max_new_tokens=100)
print(output)

Performance

| Hardware | Speed | Mode |
|---|---|---|
| NVIDIA GPU | 29.7 tok/s (CUDA) | runtime_mode='cached' (dequantize to FP16 once) |
| Intel CPU | 7.6 tok/s (BF16 auto-detected) | runtime_mode='cached' (auto BF16) |
| Apple Silicon | not benchmarked | runtime_mode='metal' |

Quantization details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2-VL-2B-Instruct |
| Quantized components | text_backbone |
| Scheme | tritplane3 (sum of ternary planes) |
| Weight values | {-1, 0, +1}, packed 2-bit (4 values per byte) |
| Group size | 32 (per-group float16 scale + offset) |
| FP16 VRAM | 2.6 GB |
| Packed VRAM | ~1.5 GB (use runtime_mode='gemlite' or 'triton') |
| Compression | 1.8× (text backbone only; vision stays FP16) |
| Quality | coherent output ✓ (image + text verified) |
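The "4 values per byte" packing in the table can be sketched as follows. This is illustrative only; the actual on-disk layout used by ternary-quant may differ:

```python
import numpy as np

def pack_ternary(q):
    """Pack int8 values in {-1, 0, +1} as 2-bit codes, four per byte."""
    codes = (q.astype(np.int16) + 1).astype(np.uint8).reshape(-1, 4)  # {-1,0,1} -> {0,1,2}
    return (codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed):
    """Invert pack_ternary: recover int8 values in {-1, 0, +1}."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return (codes.astype(np.int8) - 1).reshape(-1)
```

Four 2-bit codes per byte gives the 8x raw-weight reduction vs FP16 before the per-group scale/offset overhead is counted.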

Quantized with ternary-quant, a PTQ tool that supports quantizing VLMs and seq2seq models.
