sbrzz committed
Commit 947fc33 · verified · 1 Parent(s): dc64e73

Update README.md

Files changed (1)
  1. README.md +36 -27
README.md CHANGED
@@ -1,27 +1,36 @@
-
- ---
- # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
- # Doc / guide: https://huggingface.co/docs/hub/model-cards
- library_name: nanovlm
- license: mit
- pipeline_tag: image-text-to-text
- tags:
- - vision-language
- - multimodal
- - research
- ---
-
- **nanoVLM** is a minimal and lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. Built using pure PyTorch, the entire model architecture and training logic fits within ~750 lines of code. It combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 222M parameter model.
-
- For more information, check out the base model on https://huggingface.co/lusxvr/nanoVLM-222M.
-
- **Usage:**
-
- Clone the nanoVLM repository: https://github.com/huggingface/nanoVLM.
- Follow the install instructions and run the following code:
-
- ```python
- from models.vision_language_model import VisionLanguageModel
-
- model = VisionLanguageModel.from_pretrained("sbrzz/nanoVLM")
- ```
+
+ ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ library_name: nanovlm
+ license: mit
+ pipeline_tag: image-text-to-text
+ tags:
+ - vision-language
+ - multimodal
+ - research
+ ---
+
+ **Introduction**
+
+ You can find the history behind this work in this blog post:
+
+ **Datasets**
+
+ - the "localized_narratives" subset of the_cauldron (200k items); see the loading sketch below
+ - a private dataset (30k items)
+
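+ A minimal sketch for loading the public subset with the `datasets` library (the Hub id and config name are assumptions inferred from the bullet above, not confirmed by this card):
+
+ ```python
+ from datasets import load_dataset
+
+ # Assumed Hub location of the_cauldron and its localized_narratives config
+ ds = load_dataset("HuggingFaceM4/the_cauldron", "localized_narratives", split="train")
+ print(ds[0])  # each item pairs images with their narrative texts
+ ```
+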
+ **nanoVLM** is a minimal, lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. Built in pure PyTorch, the entire model architecture and training logic fit within ~750 lines of code. It combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 222M-parameter model.
+
+ For more information, check out the base model at https://huggingface.co/lusxvr/nanoVLM-222M.
+
+ **Usage:**
+
+ Clone the nanoVLM repository (https://github.com/huggingface/nanoVLM), follow its install instructions, and run the following code:
+
+ ```python
+ from models.vision_language_model import VisionLanguageModel
+
+ # Downloads the checkpoint from the Hub and instantiates the model
+ model = VisionLanguageModel.from_pretrained("sbrzz/nanoVLM")
+ ```
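+
+ The snippet above only builds the model. A minimal generation sketch, assuming the tokenizer/image-processor helpers and `generate` signature from the repository's generate.py (these helper names are assumptions, not confirmed by this card):
+
+ ```python
+ import torch
+ from PIL import Image
+
+ from models.vision_language_model import VisionLanguageModel
+ from data.processors import get_tokenizer, get_image_processor  # assumed repo helpers
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = VisionLanguageModel.from_pretrained("sbrzz/nanoVLM").to(device).eval()
+
+ # Rebuild the text tokenizer and image processor from the model config
+ tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
+ image_processor = get_image_processor(model.cfg.vit_img_size)
+
+ prompt = "Question: What is in this image? Answer:"
+ tokens = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
+ image = image_processor(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)
+
+ # Autoregressive decoding conditioned on the image
+ out = model.generate(tokens, image, max_new_tokens=40)
+ print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
+ ```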