🚀 Optimized Models: torchao & Pruna Quantization
Quantized models using torchao & Pruna for efficient inference and deployment.
This is a quantized version of GLM‑4.1V‑9B‑Thinking, a powerful 9B‑parameter vision‑language model built around the "thinking paradigm" with reinforcement‑learned reasoning. Quantization substantially reduces memory usage and speeds up inference on consumer‑grade GPUs while preserving the model's strong performance on multimodal reasoning tasks.
- Method: torchao quantization
- Weight precision: int8
- Activation precision: int8 dynamic
- Technique: symmetric mapping
- Impact: significant reduction in model size with minimal loss in reasoning, coding, and general instruction-following capabilities.
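To make the "symmetric mapping" concrete, here is a minimal toy sketch in plain Python. It is illustrative only, not the torchao implementation (which quantizes whole tensors, typically per channel, with optimized kernels): a single scale maps the largest-magnitude weight to the int8 range, and the zero point is fixed at 0.

```python
def quantize_symmetric_int8(weights):
    """Toy symmetric int8 quantization of a list of floats.

    Symmetric mapping: the zero point is fixed at 0, and one scale maps
    the largest absolute weight to 127. Assumes at least one non-zero weight.
    """
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_symmetric_int8(weights)   # q = [50, -127, 2, 100]
restored = dequantize(q, scale)               # close to the original weights
```

"int8 dynamic" activation quantization means the activation scales are computed on the fly at inference time from each batch's actual value range, rather than being fixed ahead of time as the weight scales are.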
Perfect for: