ternary-models: VLMs, Multimodal & Audio
Collection
Ternary-quantized models for architectures GGUF can't handle, using the tritplane3 scheme. • 8 items
Ternary-quantized version of google/gemma-4-E4B-it using ternary-quant.
| Property | Value |
|---|---|
| Base Model | google/gemma-4-E4B-it |
| Parameters | ~8B |
| Architecture | Dense transformer, multimodal (image + text) |
| Quantization | tritplane3 (3-plane progressive ternary) |
| Quantized Components | text_backbone + multimodal_connector (342 layers) |
| Vision Encoder | FP16 (preserved) |
| License | Gemma |
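The tritplane3 format itself isn't documented here, but the underlying idea of ternary quantization is easy to sketch. The toy below (a hypothetical `ternary_quantize` helper, not the library's API) shows the single-plane case: weights are mapped to {-1, 0, +1} plus one per-tensor scale, which is what makes the ~10x-smaller codebook possible before packing overhead.

```python
import numpy as np

def ternary_quantize(w, threshold_ratio=0.7):
    """Map a weight tensor to {-1, 0, +1} with one per-tensor scale.

    Illustrative single-plane ternary quantization; the actual
    tritplane3 scheme uses three progressive ternary planes.
    """
    delta = threshold_ratio * np.mean(np.abs(w))  # sparsity threshold
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    mask = t != 0
    # Scale that minimizes L2 error over the surviving (non-zero) weights
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return t, alpha

w = np.array([0.9, -0.05, 0.4, -0.8, 0.02, -0.3])
t, alpha = ternary_quantize(w)
print(t)      # ternary codes in {-1, 0, +1}
print(alpha)  # per-tensor scale
```

Dequantization is then just `alpha * t`; the three "planes" in tritplane3 refine this approximation progressively.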
| Method | Size | Bits/Weight | VLM Support |
|---|---|---|---|
| FP16 (original) | 16 GB | 16 | Yes |
| Ternary tritplane3 | 4.2 GB | ~8-10 | Yes (vision+text) |
Compression: 3.8x
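The compression figure follows directly from the checkpoint sizes in the table; a quick sanity check (illustrative arithmetic only):

```python
fp16_gb = 16.0     # original FP16 checkpoint
ternary_gb = 4.2   # tritplane3 checkpoint
compression = fp16_gb / ternary_gb
print(f"{compression:.1f}x")  # 3.8x
```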
Few quantized alternatives exist for this model. GGUF variants typically don't support the E4B multimodal architecture.
| Prompt | FP16 Original | Ternary (ours) |
|---|---|---|
| "Capital of France?" | Paris | Paris |
| "Photosynthesis" | ...convert light energy into chemical energy...releases oxygen | ...convert light energy into chemical energy...releases oxygen, essential for life on Earth |
| "Python reverse string" | The Pythonic Way (Slicing) - Recommended | Using Slicing (The most Pythonic way) |
Near-identical quality: the ternary model preserves the same facts and reasoning, with only minor phrasing differences.
| Runtime | Min Memory | Hardware |
|---|---|---|
| cached (CPU) | ~8 GB RAM | Any |
| metal (Apple Silicon) | ~6 GB unified | M1+ |
| triton_memory (CUDA) | ~5 GB VRAM | Any NVIDIA GPU |
```bash
pip install ternary-quant
```
```python
from ternary_quant.inference import load_ternary_model

model, processor = load_ternary_model(
    "AsadIsmail/gemma-4-E4B-it-ternary",
    runtime_mode="metal",  # "cached" for CPU/NVIDIA
    device="auto"
)

messages = [{"role": "user", "content": [{"type": "text", "text": "Describe this image"}]}]
formatted = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
Part of ternary-models: ternary-quantized VLMs, multimodal, and audio models.
GitHub: github.com/Asad-Ismail/ternary-models | Library: github.com/Asad-Ismail/ternary-quant