ternary-models: VLMs, Multimodal & Audio
Collection
Ternary-quantized models for architectures GGUF can't handle. tritplane3 scheme. • 8 items • Updated
Qwen3-1.7B quantized to ternary weights using ternary-quant broad track. Transformer layers quantized to ~4 effective bits per weight; embeddings stay FP16. Measured 97.5% average benchmark retention across ARC-Easy, ARC-Challenge, HellaSwag, and WinoGrande (0-shot lm-eval).
Compression: 2.7× vs FP16 | Quality: 97.5% avg lm-eval retention (ARC/HellaSwag/WinoGrande)
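To make the idea concrete, here is a minimal sketch of per-group ternarization. This is an illustration of the general technique (absmean threshold with one scale per group), not the exact tritplane3 algorithm, and `ternarize_group` is a hypothetical helper, not part of the `ternary-quant` API:

```python
import numpy as np

def ternarize_group(w, eps=1e-8):
    """Illustrative per-group ternarization (absmean scaling).

    Maps each weight to {-1, 0, +1} plus one float scale per group --
    the core idea behind ternary schemes, not the tritplane3 layout.
    """
    scale = np.abs(w).mean() + eps            # one scale per group
    q = np.clip(np.round(w / scale), -1, 1)   # ternary codes
    return q.astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)    # one group of 32 weights
q, s = ternarize_group(w)
w_hat = s * q                                 # dequantized approximation
```

Each group of 32 weights then costs 2 bits per code plus the group's float metadata, which is where the "~4 effective bits per weight" figure in this card comes from.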
```bash
pip install ternary-quant
```
```python
from ternary_quant.inference import load_ternary_model, generate_text

# GPU — fastest (dequantize once, use cuBLAS)
model, tokenizer = load_ternary_model(
    "AsadIsmail/Qwen3-1.7B-ternary", device="cuda", runtime_mode="cached"
)

# GPU — lowest VRAM (~1.0 GB; keeps weights packed)
model, tokenizer = load_ternary_model(
    "AsadIsmail/Qwen3-1.7B-ternary", device="cuda", runtime_mode="gemlite"
)

# CPU — auto BF16 on Intel AVX2+
model, tokenizer = load_ternary_model(
    "AsadIsmail/Qwen3-1.7B-ternary", device="cpu", runtime_mode="cached"
)

# Apple Silicon
model, tokenizer = load_ternary_model(
    "AsadIsmail/Qwen3-1.7B-ternary", device="mps", runtime_mode="metal"
)

# Generate
output = generate_text(model, tokenizer, "Ternary quantization works by", max_new_tokens=100)
print(output)
```
| Hardware | Speed | Mode |
|---|---|---|
| NVIDIA GPU | 28.8 tok/s | `runtime_mode='cached'` (dequantize to FP16 once) |
| Intel CPU | 11.3 tok/s | `runtime_mode='cached'` (auto BF16) |
| Apple Silicon | — | `runtime_mode='metal'` |

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-1.7B |
| Quantized components | text_backbone |
| Scheme | tritplane3 (sum of ternary planes) |
| Weight values | {-1, 0, +1} packed 2-bit (4 values per byte) |
| Group size | 32 (per-group float16 scale + offset) |
| FP16 VRAM | 2.8 GB |
| Packed VRAM | ~1.0 GB (use `runtime_mode='gemlite'` or `'triton'`) |
| Compression | 2.7× |
| Quality | 97.5% avg lm-eval retention (ARC/HellaSwag/WinoGrande) |
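The table's storage layout can be sketched as follows. The 2-bit code mapping (0 → -1, 1 → 0, 2 → +1) and the helper names are assumptions for illustration; the actual on-disk format of tritplane3 may differ:

```python
import numpy as np

def pack_ternary(q):
    """Pack ternary codes {-1, 0, +1} into bytes, 4 values per byte."""
    codes = (q + 1).astype(np.uint8)          # {-1,0,+1} -> {0,1,2}
    codes = codes.reshape(-1, 4)
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_dequant(packed, scale, offset):
    """Unpack packed codes and apply the group's float16 scale + offset."""
    b = packed[:, None] >> np.array([0, 2, 4, 6])    # 4 codes per byte
    q = (b & 0b11).astype(np.int8).reshape(-1) - 1   # back to {-1,0,+1}
    return scale * q.astype(np.float32) + offset

q = np.array([-1, 0, 1, 1, 0, -1, 1, 0] * 4, dtype=np.int8)  # group of 32
packed = pack_ternary(q)                              # 8 bytes for 32 weights
w = unpack_dequant(packed, scale=np.float16(0.05), offset=np.float16(0.0))
print(packed.nbytes)                                  # 8
```

At 2 bits per code plus two float16 values per 32-weight group, raw storage is about 3 bits per weight; the "~4 effective bits" figure presumably also accounts for additional planes and the FP16 embeddings left unquantized.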
Quantized with ternary-quant — the only PTQ tool that quantizes VLMs and seq2seq models.