huyquoctrinh committed (verified)
Commit f35843f · 1 Parent(s): 4bd3438

Update readme to use

Files changed (4)
  1. .gitattributes +2 -0
  2. README.md +237 -1
  3. viper-l1.png +3 -0
  4. viper-l1_represent.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ viper-l1_represent.png filter=lfs diff=lfs merge=lfs -text
+ viper-l1.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,239 @@
- license: apache-2.0
# 🐍 **VIPER-L1: A Family of Small Multimodal-LLMs**

<div align="center">
  <a href="./">
    <img src="viper-l1_represent.png" width="80%" alt="Viper-L1 Logo"/>
  </a>
  <br/>
  <i>“Fast. Compact. Vision-Language Intelligence.”</i>
</div>

---

## 🌟 Overview

**Viper-L1** is an open-source **small multimodal large language model (Multimodal-LLM)** designed for efficient multimodal reasoning and deployment on consumer GPUs.
It is built upon the [**Liquid Model**](https://huggingface.co/LiquidAI/LFM2-350M) architecture (≈1.2B parameters in total), enabling a powerful yet lightweight foundation for **personal research, on-device applications, and internal experimentation**.

---

## 🧠 Key Features

* ⚡ **Efficient Training & Inference**
  Trained on **2× H100 GPUs** in **~2 days**, thanks to our lightweight multimodal fusion and liquid transformer design.
  Inference runs smoothly even on an **RTX 4070** GPU.

* 🔗 **Multimodal Connector (Sense Integration Module)**
  Inspired by human perception, Viper-L1 introduces a *connector* that fuses signals from different sensory encoders (vision, audio, etc.), enabling deeper **cross-modal alignment** and improved reasoning (see the sketch after this list).

* 🧩 **Hybrid Architecture**
  Combines the **semantic strength of Transformers** with the **efficiency of Liquid Neural Networks**, resulting in a compact yet expressive multimodal model.
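
The exact connector lives in the released checkpoint; as a rough mental model, here is a minimal sketch assuming a simple MLP projector that maps vision-encoder patch embeddings into the language model's embedding space (the class name and dimensions below are illustrative, not the actual Viper-L1 code):

```python
import torch
import torch.nn as nn

class ConnectorSketch(nn.Module):
    """Hypothetical connector: projects vision features into the LLM
    embedding space so they can stand in for <image> placeholder tokens."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)
```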

---

## 🚀 Progress

* ✅ **Released** — Viper-L1 model checkpoint
* 🧩 **Coming Soon** — Fully documented training and inference scripts

Stay tuned for our next updates on model fine-tuning and multimodal reasoning enhancements.

---

## 🏗️ Architecture

The overall architecture is shown below:

<div align="center">
  <a href="./">
    <img src="viper-l1.png" width="80%" alt="Viper-L1 Architecture"/>
  </a>
</div>

**Main Components:**

1. 🎨 **Vision Encoder** – Extracts compact visual embeddings
2. 🔗 **Multimodal Connector** – Fuses sensory inputs efficiently (a minimal splice sketch follows below)
3. 🧠 **Language Backbone (LFM2-350M-based)** – Performs semantic reasoning and response generation

> 🧪 *The current Viper-L1 (1.2B parameters) was trained on ~4 million images using 2× H100 GPUs for 2 days.*
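
The inference code below passes both `image_inputs` and an `image_token_id` to the model, which suggests the forward pass splices projected image features in at the position of the `<image>` placeholder. A minimal sketch of that splice, assuming a single image per prompt (the helper name is illustrative, not the actual Viper-L1 implementation):

```python
import torch

def splice_image_embeddings(
    text_embeds: torch.Tensor,   # (1, seq_len, llm_dim): embedded text tokens
    image_embeds: torch.Tensor,  # (1, num_patches, llm_dim): connector output
    input_ids: torch.Tensor,     # (1, seq_len): token ids containing one <image> id
    image_token_id: int,
) -> torch.Tensor:
    """Replace the single <image> placeholder embedding with the projected
    patch embeddings (illustrative sketch, one image per prompt)."""
    pos = (input_ids[0] == image_token_id).nonzero(as_tuple=True)[0].item()
    return torch.cat(
        [text_embeds[:, :pos], image_embeds, text_embeds[:, pos + 1:]],
        dim=1,
    )
```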

---

## 🧩 Usage

To get started with **inference**, follow the setup in the main repository:

🔗 [**Viper-VLM Repository**](https://github.com/huyquoctrinh/Viper-LM)
📜 Example inference script: [`infer_viper.sh`](https://github.com/huyquoctrinh/Viper-LM/blob/feat/viper-vlm_cot/infer_viper.sh)

Alternatively, you can use the following function for inference:
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor
from model import ViperLMForCausalLM  # your local class

# <image> placeholder id used during training; used as a fallback when neither
# the model config nor the tokenizer exposes one.
IMAGE_TOKEN_ID = 64400


def build_messages(question: str, include_image: bool = True):
    # Mirror CCDataset._format_prompt()
    user_content = ("<image> " if include_image else "") + (question or "")
    return [
        {"role": "user", "content": user_content},
        # The assistant turn is left empty; apply_chat_template(add_generation_prompt=True)
        # will add the assistant prefix.
    ]


@torch.inference_mode()
def generate_answer(
    ckpt_dir: str,
    tokenizer_path: str,
    processor_path: str,
    image_path: str,
    question: str,
    device: str = "cuda",
    dtype: str = "bf16",
    max_new_tokens: int = 128,
    temperature: float = 0.2,
    top_p: float = 0.9,
    repetition_penalty: float = 1.05,
):
    # --- device / dtype ---
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    use_bf16 = (dtype.lower() == "bf16")
    use_fp16 = (dtype.lower() == "fp16")
    amp_dtype = torch.bfloat16 if use_bf16 else (torch.float16 if use_fp16 else torch.float32)

    # --- tokenizer / processor ---
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=True)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    # Optional, but common for generation with left context.
    tokenizer.padding_side = "left"

    processor = AutoProcessor.from_pretrained(processor_path)

    # --- model ---
    model = ViperLMForCausalLM.from_pretrained(
        ckpt_dir,
        torch_dtype=amp_dtype if device.type == "cuda" else torch.float32,
    ).to(device)
    model.eval()
    if getattr(model.config, "pad_token_id", None) is None:
        model.config.pad_token_id = tokenizer.pad_token_id

    # Expose the image token id if your forward expects it; keep it consistent with training.
    image_token_id = getattr(model.config, "image_token_id", None)
    if image_token_id is None and "<image>" in tokenizer.get_vocab():
        image_token_id = tokenizer.convert_tokens_to_ids("<image>")
    if image_token_id is None:
        image_token_id = IMAGE_TOKEN_ID

    # --- text input with the SAME chat template as training ---
    messages = build_messages(question=question, include_image=True)
    enc = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # adds the assistant header the model expects before generation
        tokenize=True,
        return_tensors="pt",
    )
    if isinstance(enc, torch.Tensor):
        input_ids = enc
        attention_mask = torch.ones_like(enc, dtype=torch.long)
    else:
        input_ids = enc["input_ids"]
        attention_mask = enc.get("attention_mask")
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids, dtype=torch.long)

    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)

    # --- image preprocessing (match training) ---
    img = Image.open(image_path).convert("RGB")
    proc = processor(images=[img], return_tensors="pt")  # pass a list, like training
    pixel_values = proc.get("pixel_values", None)
    if pixel_values is None:
        raise ValueError("Processor did not return 'pixel_values'. Check processor_path.")
    pixel_values = pixel_values.to(device)  # (1, 3, H, W)

    # --- generate ---
    gen_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": temperature > 0.0,
        "temperature": max(temperature, 1e-6),
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
        "eos_token_id": tokenizer.eos_token_id,
        "pad_token_id": tokenizer.pad_token_id,
        # IMPORTANT: pass the image tensors under the same argument names
        # your model.forward saw in training.
        "image_inputs": pixel_values,
        "image_token_id": image_token_id,  # if your forward uses it
        "use_cache": False,
    }

    if device.type == "cuda" and (use_bf16 or use_fp16):
        with torch.autocast(device_type="cuda", dtype=amp_dtype):
            out = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                **gen_kwargs,
            )
    else:
        out = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            **gen_kwargs,
        )

    # --- decode only the newly generated tokens ---
    generated = out[0]
    prompt_len = input_ids.size(1)
    new_tokens = generated[prompt_len:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return answer.strip()


if __name__ == "__main__":
    # Fill in your local paths before running.
    ckpt_dir = ""
    tokenizer_path = ""
    processor_path = ""
    image_path = ""
    question = ""
    ans = generate_answer(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        processor_path=processor_path,
        image_path=image_path,
        question=question,
        device="cuda",
        dtype="bf16",  # the switch above recognizes "bf16"/"fp16"; anything else falls back to fp32
        max_new_tokens=128,
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.0,
    )
    print("\n====== Answer ======\n")
    print(ans)
```
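
If you want the local paths above to point at this model's files, here is a minimal sketch using `huggingface_hub` (the repo id `viperlm/Viper-L1` is inferred from this page; verify it against the actual repository):

```python
from huggingface_hub import snapshot_download

# Download the checkpoint, tokenizer, and processor files locally.
local_dir = snapshot_download(repo_id="viperlm/Viper-L1")  # assumed repo id

ans = generate_answer(
    ckpt_dir=local_dir,
    tokenizer_path=local_dir,
    processor_path=local_dir,
    image_path="example.jpg",
    question="What is in this image?",
)
print(ans)
```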

---

## 🙏 Acknowledgements

We gratefully thank the following foundational projects for inspiring and enabling our research:

* [**Liquid Model**](https://huggingface.co/LiquidAI/LFM2-350M) – Base architecture for dynamic neural computation
* [**SigLIP**](https://huggingface.co/google/siglip2-base-patch16-naflex) – Vision encoder powering multimodal understanding

Their open-source contributions have made **Viper-L1** possible. 💚

---

## 📫 Contact

If you're interested in collaboration or research discussions:
👉 [**Contact us**](https://github.com/huyquoctrinh) or open an issue in the repository.

---
viper-l1.png ADDED

Git LFS Details

  • SHA256: b02fe89898e599f43daf47729655a1bcf28de08ae13ee4a7ba68c1a33b7e73ba
  • Pointer size: 131 Bytes
  • Size of remote file: 188 kB
viper-l1_represent.png ADDED

Git LFS Details

  • SHA256: 446ce7021b47348676ed01364e61bb68ca9aa6390459853b57da3a571979688f
  • Pointer size: 131 Bytes
  • Size of remote file: 309 kB