This model is part of the "ternary-models: VLMs, Multimodal & Audio" collection: ternary-quantized models for architectures GGUF can't handle, built with the tritplane3 scheme.
Ternary-quantized version of Qwen/Qwen2.5-VL-7B-Instruct using ternary-quant.
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-VL-7B-Instruct |
| Parameters | 7.6B |
| Architecture | VLM (image + text input, text output) |
| Quantization | tritplane3 (196 layers quantized) |
| Vision Encoder | FP16 (preserved) |
| License | Apache 2.0 |
| Method | Size | VLM Support |
|---|---|---|
| FP16 (original) | 12.7 GB | Yes |
| Ternary tritplane3 | 7.2 GB | Yes (vision + text) |

Compression: 1.8x (12.7 GB → 7.2 GB).
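Ternary weights are what make this footprint possible: each weight takes one of three values, so it carries only log2(3) ≈ 1.58 bits of information. Below is a minimal sketch of one common packing, five trits per byte (3^5 = 243 ≤ 256). This is purely illustrative; it is not necessarily how the tritplane3 format lays out its planes:

```python
import numpy as np

def pack_trits(trits):
    """Pack {-1, 0, +1} values into bytes, 5 trits per byte (base-3 digits)."""
    t = (np.asarray(trits, dtype=np.int8) + 1).astype(np.uint8)  # {-1,0,1} -> {0,1,2}
    n = len(t)
    pad = (-n) % 5
    t = np.concatenate([t, np.zeros(pad, dtype=np.uint8)])
    powers = 3 ** np.arange(5, dtype=np.uint32)  # 3^5 = 243 states fit in one byte
    packed = (t.reshape(-1, 5).astype(np.uint32) @ powers).astype(np.uint8)
    return packed, n

def unpack_trits(packed, n):
    """Invert pack_trits: decode base-3 digits, then shift back to {-1, 0, +1}."""
    vals = packed.astype(np.uint32)
    digits = np.empty((len(packed), 5), dtype=np.uint8)
    for i in range(5):
        digits[:, i] = vals % 3
        vals //= 3
    return digits.reshape(-1)[:n].astype(np.int8) - 1

w = np.random.default_rng(0).integers(-1, 2, size=1000).astype(np.int8)
packed, n = pack_trits(w)
assert np.array_equal(unpack_trits(packed, n), w)
# 1000 ternary weights -> 200 bytes, vs 2000 bytes in FP16 for the same tensor.
```

The observed whole-model ratio is lower than the per-tensor ratio because the vision encoder, scales, and embeddings stay in FP16.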
Few quantized alternatives exist for Qwen2.5-VL, since GGUF does not support this VLM architecture.
Tested with text generation; produces correct, detailed output:
Prompt: "What are the three laws of thermodynamics?"
Output: The three laws of thermodynamics are fundamental principles... 1. First Law (Conservation of Energy): Energy cannot be created or destroyed... ΔU = Q + W...
Produces structured, accurate, well-formatted responses matching FP16 quality.
| Runtime | Min Memory | Hardware |
|---|---|---|
| cached (CPU) | ~10 GB RAM | Any |
| metal (Apple Silicon) | ~8 GB unified | M1+ |
| triton_memory (CUDA) | ~6 GB VRAM | Any NVIDIA GPU |
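The runtime modes trade memory for speed differently; "cached" dequantizes each layer's ternary weights to floating point once and reuses the dense matrix, which is why it needs the most memory. A rough NumPy sketch of that idea, using BitNet-style absmean ternarization as an assumed stand-in (illustrative only, not the actual ternary-quant implementation):

```python
import numpy as np

class CachedTernaryLinear:
    """Toy 'cached'-style layer: store {-1, 0, +1} weights plus a per-row
    scale, dequantize to float32 on first use, then reuse the dense matrix."""

    def __init__(self, weight):
        # Absmean ternarization (assumed scheme, for illustration):
        # scale by mean |w| per output channel, round to {-1, 0, +1}.
        self.scale = np.abs(weight).mean(axis=1, keepdims=True)
        self.trits = np.clip(np.round(weight / self.scale), -1, 1).astype(np.int8)
        self._dense = None  # filled lazily

    def dequantize(self):
        if self._dense is None:  # cached mode: pay the dequant cost once
            self._dense = self.trits.astype(np.float32) * self.scale
        return self._dense

    def forward(self, x):
        return x @ self.dequantize().T  # standard linear: x @ W^T

rng = np.random.default_rng(0)
layer = CachedTernaryLinear(rng.normal(size=(8, 16)).astype(np.float32))
y = layer.forward(rng.normal(size=(2, 16)).astype(np.float32))
```

A memory-saving runtime would instead keep only the packed trits resident and dequantize per layer on the fly, trading compute for the smaller footprints in the table above.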
Install the library:

```
pip install ternary-quant
```

Then load and run the model (the example below is text-only, so the prompt asks a text question):

```python
from ternary_quant.inference import load_ternary_model

model, processor = load_ternary_model(
    "AsadIsmail/Qwen2.5-VL-7B-Instruct-ternary",
    runtime_mode="cached",  # or "metal" / "triton_memory", per the table above
    device="auto",
)

inputs = processor(
    text="What are the three laws of thermodynamics?", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
Part of ternary-models.
GitHub: github.com/Asad-Ismail/ternary-models | Library: github.com/Asad-Ismail/ternary-quant