---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
---

**nanoVLM** is a minimal and lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. Built in pure PyTorch, the entire model architecture and training logic fit within ~750 lines of code. It combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 222M-parameter model.
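The encoder–projector–LM layout described above can be sketched as a toy PyTorch module. The class name, module names (`ToyVLM`, `modality_projector`), and all dimensions below are illustrative stand-ins, not nanoVLM's actual code or configuration:

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy sketch of a VLM: vision encoder -> modality projector -> causal LM.
    All modules and sizes are placeholders, not the real SigLIP/SmolLM2 stacks."""

    def __init__(self, vision_dim=64, lm_dim=32, vocab_size=100):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stands in for the ViT
        self.modality_projector = nn.Linear(vision_dim, lm_dim)  # maps image features into LM space
        self.token_embedding = nn.Embedding(vocab_size, lm_dim)
        self.lm_head = nn.Linear(lm_dim, vocab_size)             # stands in for the causal LM

    def forward(self, image_patches, input_ids):
        img = self.modality_projector(self.vision_encoder(image_patches))
        txt = self.token_embedding(input_ids)
        hidden = torch.cat([img, txt], dim=1)  # image tokens prefix the text tokens
        return self.lm_head(hidden)

model = ToyVLM()
# 196 patch embeddings (a 14x14 grid) plus 8 text tokens
logits = model(torch.randn(1, 196, 64), torch.randint(0, 100, (1, 8)))
print(logits.shape)  # torch.Size([1, 204, 100])
```

The real model replaces each placeholder with a full SigLIP encoder and SmolLM2 decoder; only the projector is trained from scratch to align the two embedding spaces.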
For more information, check out the base model at https://huggingface.co/lusxvr/nanoVLM-222M.

## Training procedure

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/dweam/nanovlm)

**Usage:**

Clone the nanoVLM repository (https://github.com/huggingface/nanoVLM), follow its installation instructions, and run the following code:

```python
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("theoden8/nanoVLM")
```