AINovice2005
/

quantized-SmolVLM2-2.2B-Base

Image-Text-to-Text


AINovice2005 committed on Jul 17, 2025

Commit

a113a67

·

verified ·

1 Parent(s): 8d80bbd

Update README.md

Files changed (1)

README.md +49 -1


@@ -4,4 +4,52 @@ base_model:
 - HuggingFaceTB/SmolVLM2-2.2B-Base
 pipeline_tag: image-text-to-text
 library_name: transformers
----

+---
+**SmolVLM2‑2.2B‑Base Quantized**
+---
+### 🚀 Model Description
+This is a **quantized version** of **SmolVLM2‑2.2B‑Base**, a compact vision-language model from Hugging Face. It is designed for **multimodal understanding** of images, multi‑image inputs, and videos. Quantization reduces its memory footprint and can make inference **faster and more efficient**, which suits on-device and resource-constrained deployments.
+
+---
+### 🔧 Base Model Summary
+* **Name**: SmolVLM2‑2.2B‑Base
+* **Publisher**: HuggingFaceTB (Hugging Face)
+* **Architecture**: Idefics3-based architecture (vision encoder + SmolLM2‑1.7B text decoder)
+* **Modalities**: image, multi-image, video, text
+* **Capabilities**: captioning, VQA, video analysis, diagram understanding, text-in-image reading
+---
+### 📏 Quantization Details
+* **Method**: torchao quantization
+* **Weight Precision**: int8
+* **Activation Precision**: int8 dynamic
+* **Technique**: Symmetric mapping
+* **Impact**: Significant reduction in model size with minimal loss in reasoning, coding, and general instruction-following capabilities.
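
The symmetric int8 mapping named above can be illustrated with a minimal pure-Python sketch. It is not torchao's implementation: the function names, the example values, and the choice of a single shared scale with codes clamped to [-127, 127] are assumptions made for illustration, and torchao may use a different granularity or range. "Dynamic" activation quantization is taken here to mean that the scale is computed from each input at inference time rather than fixed in advance.

```python
def quantize_symmetric_int8(values):
    """Map floats to int8 codes with one shared scale and a zero-point of 0.

    Hypothetical helper for illustration; not part of torchao.
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


# Weights: quantized once, ahead of time (illustrative values).
weights = [0.50, -1.27, 0.02, 0.90]
codes, scale = quantize_symmetric_int8(weights)
restored = dequantize(codes, scale)
print(codes)  # [50, -127, 2, 90]

# Activations: under dynamic quantization the scale is recomputed
# from each incoming tensor at inference time.
activations = [3.0, -0.5, 1.5]
act_codes, act_scale = quantize_symmetric_int8(activations)
```

Because the mapping is symmetric, zero maps exactly to code 0, and the round-trip error of each unclamped value is at most half the scale.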
+---
+### 🎯 Intended Use
+* On-device or low-VRAM systems (edge, mobile, small GPUs)
+* Multimodal tasks: VQA, captioning, comparing images, video transcription
+* Research on quantized multimodal models
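
For the low-VRAM use case above, a back-of-the-envelope estimate of weight memory can help with sizing. The sketch below assumes the nominal 2.2 billion parameters implied by the model name, a 16-bit full-precision baseline, and that every parameter is stored as int8. None of these is confirmed by this card: quantization often covers only some layers, and the estimate ignores scales, activations, and any cache.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Rough weight-only memory estimate in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: parameter count taken from "2.2B" in the model name.
NOMINAL_PARAMS = 2.2e9

fp16_gb = weight_memory_gb(NOMINAL_PARAMS, 2)  # 2 bytes per 16-bit weight
int8_gb = weight_memory_gb(NOMINAL_PARAMS, 1)  # 1 byte per int8 weight

print(f"16-bit weights: ~{fp16_gb:.1f} GB, int8 weights: ~{int8_gb:.1f} GB")
```

Under these assumptions the weights shrink by roughly half. The actual checkpoint size should be checked against the files in the repository.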
+---
+### ⚠️ Limitations & Considerations
+* May underperform compared to the full-precision version
+* Only supports the modalities supported by the base model
+---