---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
---
|
|
|
|
|
**Introduction** |
|
|
|
|
|
**nanoVLM** is a minimal, lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. Built using pure PyTorch, the entire model architecture and training logic fit within ~750 lines of code. It combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 222M-parameter model. For more information, check out the base model at https://huggingface.co/lusxvr/nanoVLM-222M.

You can find the history behind this work in this blog post: https://www.gradients.zone/blog/a-super-small-vision-language-model/
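As a quick sanity check on the size claim, the components add up to roughly 85M (SigLIP vision encoder) + 135M (SmolLM2 language model) ≈ 222M parameters, with the small remainder presumably in the connector between the two. Once the repository is installed (see Usage below), plain PyTorch can confirm the total; the snippet is a minimal sketch using only standard `torch.nn.Module` parameter counting:

```python
from models.vision_language_model import VisionLanguageModel  # from the nanoVLM repository

model = VisionLanguageModel.from_pretrained("sbrzz/nanoVLM")

# Sum parameter counts across the vision encoder, connector, and language model
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # expected to print roughly 222M
```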
|
|
|
|
|
**Datasets**
|
|
|
|
|
- the `localized_narratives` subset of the_cauldron (200k items)
- a private dataset (30k items)
|
|
|
|
|
|
|
|
|
|
**Usage**
|
|
|
|
|
Clone the nanoVLM repository: https://github.com/huggingface/nanoVLM. |
|
|
Follow the install instructions and run the following code: |
|
|
|
|
|
```python
from models.vision_language_model import VisionLanguageModel

# Download the checkpoint from the Hugging Face Hub and instantiate the model
model = VisionLanguageModel.from_pretrained("sbrzz/nanoVLM")
```
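To actually run inference, the repository's `generate.py` script shows the full pipeline. The sketch below follows its conventions at the time of writing; the processor helpers (`get_tokenizer`, `get_image_processor` from `data.processors`), the config field names, and the `generate` signature are assumptions that may differ in newer revisions of the repo, so consult `generate.py` for the exact API:

```python
import torch
from PIL import Image

from models.vision_language_model import VisionLanguageModel
# Processor helpers as used by the repository's generate.py (assumed names)
from data.processors import get_tokenizer, get_image_processor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("sbrzz/nanoVLM").to(device)
model.eval()

# Both processors are derived from the model's own config (assumed field names)
tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
image_processor = get_image_processor(model.cfg.vit_img_size)

# Encode a prompt and an image; "example.jpg" is a placeholder path
prompt = "Question: Describe the image. Answer:"
input_ids = tokenizer.batch_encode_plus([prompt], return_tensors="pt")["input_ids"].to(device)
image = image_processor(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)

# Autoregressive decoding conditioned on the image (assumed signature)
generated = model.generate(input_ids, image, max_new_tokens=40)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```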
|
|
|