---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
---

**nanoVLM** is a minimal and lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. Built in pure PyTorch, the entire model architecture and training logic fit within ~750 lines of code. It combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 222M-parameter model.
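The encoder–projector–LM layout described above can be sketched as a toy PyTorch module. The class name, module names (`ToyVLM`, `modality_projector`), and all dimensions below are illustrative stand-ins, not nanoVLM's actual code or configuration:

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy sketch of a VLM: vision encoder -> modality projector -> causal LM.
    All modules and sizes are placeholders, not the real SigLIP/SmolLM2 stacks."""

    def __init__(self, vision_dim=64, lm_dim=32, vocab_size=100):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stands in for the ViT
        self.modality_projector = nn.Linear(vision_dim, lm_dim)  # maps image features into LM space
        self.token_embedding = nn.Embedding(vocab_size, lm_dim)
        self.lm_head = nn.Linear(lm_dim, vocab_size)             # stands in for the causal LM

    def forward(self, image_patches, input_ids):
        img = self.modality_projector(self.vision_encoder(image_patches))
        txt = self.token_embedding(input_ids)
        hidden = torch.cat([img, txt], dim=1)  # image tokens prefix the text tokens
        return self.lm_head(hidden)

model = ToyVLM()
# 196 patch embeddings (a 14x14 grid) plus 8 text tokens
logits = model(torch.randn(1, 196, 64), torch.randint(0, 100, (1, 8)))
print(logits.shape)  # torch.Size([1, 204, 100])
```

The real model replaces each placeholder with a full SigLIP encoder and SmolLM2 decoder; only the projector is trained from scratch to align the two embedding spaces.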
For more information, check out the base model at https://huggingface.co/lusxvr/nanoVLM-222M.

## Training procedure

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/dweam/nanovlm)

**Usage:**

Clone the nanoVLM repository (https://github.com/huggingface/nanoVLM), follow its installation instructions, and run the following code:

```python
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("theoden8/nanoVLM")
```