Files changed (1)
  1. README.md +210 -93
README.md CHANGED
@@ -4,105 +4,99 @@ license: other
license_name: lfm1.0
license_link: LICENSE
language:
- - en
- pipeline_tag: image-text-to-text
tags:
- liquid
- lfm2
- lfm2-vl
- - edge
---

- <center>
- <div style="text-align: center;">
-   <img
-     src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/7_6D7rWrLxp2hb6OHSV1p.png"
-     alt="Liquid AI"
-     style="width: 100%; max-width: 66%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;"
-   />
- </div>
- </center>

- # LFM2-VL

- LFM2-VL is [Liquid AI](https://www.liquid.ai/)'s first series of multimodal models, designed to process text and images with variable resolutions.
- Built on the [LFM2](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38) backbone, it is optimized for low-latency and edge AI applications.

- We're releasing the weights of two post-trained checkpoints with [450M](https://huggingface.co/LiquidAI/LFM2-VL-450M) (for highly constrained devices) and [1.6B](https://huggingface.co/LiquidAI/LFM2-VL-1.6B) (more capable yet still lightweight) parameters.

- * **2× faster inference speed** on GPUs compared to existing VLMs while maintaining competitive accuracy
- * **Flexible architecture** with user-tunable speed-quality tradeoffs at inference time
- * **Native resolution processing** up to 512×512, with intelligent patch-based handling for larger images that avoids upscaling and distortion

- Find more about our vision-language model in the [LFM2-VL post](https://www.liquid.ai/blog/lfm2-vl-efficient-vision-language-models) and its language backbone in the [LFM2 blog post](https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models).

- ## 📄 Model details

- Due to their small size, **we recommend fine-tuning LFM2-VL models on narrow use cases** to maximize performance.
- They were trained for instruction following and lightweight agentic flows. They are not intended for safety-critical decisions.

- | Property | [**LFM2-VL-450M**](https://huggingface.co/LiquidAI/LFM2-VL-450M) | [**LFM2-VL-1.6B**](https://huggingface.co/LiquidAI/LFM2-VL-1.6B) |
- |---|---:|---:|
- | **Parameters (LM only)** | 350M | 1.2B |
- | **Vision encoder** | SigLIP2 NaFlex base (86M) | SigLIP2 NaFlex shape-optimized (400M) |
- | **Backbone layers** | hybrid conv+attention | hybrid conv+attention |
- | **Context (text)** | 32,768 tokens | 32,768 tokens |
- | **Image tokens** | dynamic, user-tunable | dynamic, user-tunable |
- | **Vocab size** | 65,536 | 65,536 |
- | **Precision** | bfloat16 | bfloat16 |
- | **License** | LFM Open License v1.0 | LFM Open License v1.0 |

- **Supported languages:** English

- **Generation parameters**: We recommend the following settings:
- - Text: `temperature=0.1`, `min_p=0.15`, `repetition_penalty=1.05`
- - Vision: `min_image_tokens=64`, `max_image_tokens=256`, `do_image_splitting=True`

- **Chat template**: LFM2-VL uses a ChatML-like chat template as follows:
-
- ```
- <|startoftext|><|im_start|>system
- You are a helpful multimodal assistant by Liquid AI.<|im_end|>
- <|im_start|>user
- <image>Describe this image.<|im_end|>
- <|im_start|>assistant
- This image shows a Caenorhabditis elegans (C. elegans) nematode.<|im_end|>
- ```

- Images are referenced with a sentinel (`<image>`), which is automatically replaced with the image tokens by the processor.

- You can apply it using the dedicated [`.apply_chat_template()`](https://huggingface.co/docs/transformers/en/chat_templating#applychattemplate) function from Hugging Face transformers.

- **Architecture**
- - **Hybrid backbone**: Language model tower (LFM2-1.2B or LFM2-350M) paired with SigLIP2 NaFlex vision encoders (400M shape-optimized or 86M base variant)
- - **Native resolution processing**: Handles images up to 512×512 pixels without upscaling and preserves non-standard aspect ratios without distortion
- - **Tiling strategy**: Splits large images into non-overlapping 512×512 patches and includes thumbnail encoding for global context (in the 1.6B model)
- - **Efficient token mapping**: 2-layer MLP connector with pixel unshuffle reduces image tokens (e.g., a 256×384 image → 96 tokens, 1000×3000 → 1,020 tokens)
- - **Inference-time flexibility**: User-tunable maximum image tokens and patch count for a speed/quality tradeoff without retraining

- **Training approach**
- - Builds on the LFM2 base model with joint mid-training that fuses vision and language capabilities using a gradually adjusted text-to-image ratio
- - Applies joint SFT with an emphasis on image understanding and vision tasks
- - Leverages large-scale open-source datasets combined with in-house synthetic vision data, selected for balanced task coverage
- - Follows a progressive training strategy: base model → joint mid-training → supervised fine-tuning

- ## 🏃 How to run LFM2-VL

- You can run LFM2-VL with Hugging Face [`transformers`](https://github.com/huggingface/transformers) v4.55 or more recent as follows:

```bash
pip install -U transformers pillow
```

- Here is an example of how to generate an answer with transformers in Python:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
- model_id = "LiquidAI/LFM2-VL-1.6B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
@@ -111,20 +105,19 @@ model = AutoModelForImageTextToText.from_pretrained(
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

- # Load image and create conversation
- url = "https://www.ilankelman.org/stopsigns/australia.jpg"
- image = load_image(url)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
-             {"type": "text", "text": "What is in this image?"},
        ],
    },
]

- # Generate Answer
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
@@ -132,40 +125,164 @@ inputs = processor.apply_chat_template(
    return_dict=True,
    tokenize=True,
).to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=64)
- processor.batch_decode(outputs, skip_special_tokens=True)[0]

- # This image depicts a vibrant street scene in what appears to be a Chinatown or similar cultural area. The focal point is a large red stop sign with white lettering, mounted on a pole.
```

- You can directly run and test the model with this [Colab notebook](https://colab.research.google.com/drive/11EMJhcVB6OTEuv--OePyGK86k-38WU3q?usp=sharing).

- ## 🔧 How to fine-tune

- We recommend fine-tuning LFM2-VL models on your use cases to maximize performance.

- | Notebook | Description | Link |
- |-----------|----------------------------------------------------------------------|------|
- | SFT (TRL) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using TRL. | <a href="https://colab.research.google.com/drive/1csXCLwJx7wI7aruudBp6ZIcnqfv8EMYN?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |

- ## 📈 Performance

- | Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
- |-------------------|-------------|-----------|---------------|----------|-------|--------|------------|-----------|---------------|-------|----------|-------|
- | InternVL3-2B | 65.10 | 38.49 | 66.10 | 831 | 53.10 | 61.10 | 48.70 | 57.60 | 75.00 | 67.00 | 2186.40 | 64.80 |
- | InternVL3-1B | 57.00 | 31.14 | 54.94 | 798 | 43.00 | 52.30 | 43.20 | 46.90 | 71.20 | 58.70 | 1912.40 | 49.80 |
- | SmolVLM2-2.2B | 57.50 | 19.42 | 37.75 | 725 | 42.30 | 46.00 | 41.60 | 51.50 | 71.30 | 34.90 | 1792.50 | - |
- | LFM2-VL-1.6B | 65.23 | 37.66 | 58.68 | 742 | 44.40 | 49.53 | 38.44 | 51.10 | 71.97 | 48.07 | 1753.04 | 50.99 |

- | Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
- |-------------------|-------------|-----------|---------------|----------|-------|--------|------------|-----------|---------------|-------|----------|-------|
- | SmolVLM2-500M | 49.90 | 11.27 | 24.64 | 609 | 40.70 | 38.20 | 34.10 | 37.50 | 62.20 | 29.90 | 1448.30 | - |
- | LFM2-VL-450M | 52.29 | 26.18 | 46.51 | 655 | 41.98 | 40.87 | 33.11 | 44.70 | 63.50 | 33.76 | 1239.06 | 40.16 |

- We obtained MM-IFEval and InfoVQA (Val) scores for the InternVL3 and SmolVLM2 models using VLMEvalKit.

- ## 📬 Contact

- If you are interested in custom solutions with edge deployment, please contact [our sales team](https://www.liquid.ai/contact).

license_name: lfm1.0
license_link: LICENSE
language:
+ - ja
+ base_model: LiquidAI/LFM2-VL-1.6B
tags:
- liquid
- lfm2
- lfm2-vl
+ - vision-language
+ - japanese
+ - multimodal
+ - trl
+ - sft
+ pipeline_tag: image-text-to-text
---

+ # LFM2-VL-1.6B-jp (Japanese)

+ ## Model Description

+ **LFM2-VL-1.6B-jp** is a Japanese fine-tuned variant of [LiquidAI/LFM2-VL-1.6B](https://huggingface.co/LiquidAI/LFM2-VL-1.6B), optimized for Japanese vision-language tasks. It retains the efficiency and performance characteristics of the original LFM2-VL-1.6B architecture while specializing in Japanese language understanding and image description. With 1.6B parameters, it offers stronger capabilities than the 450M variant while remaining lightweight and suitable for edge deployment.

+ - **Developed by:** Alfaxad
+ - **Base Model:** [LiquidAI/LFM2-VL-1.6B](https://huggingface.co/LiquidAI/LFM2-VL-1.6B)
+ - **Model type:** Vision-Language Model (Multimodal)
+ - **Language:** Japanese (日本語)
+ - **License:** LFM Open License v1.0
+ - **Finetuned from:** LiquidAI/LFM2-VL-1.6B (1.6B parameters)

+ ## Key Features

+ - **Japanese Language Support:** Specialized for Japanese image understanding and description tasks
+ - **Enhanced Capabilities:** 1.6B parameters provide improved reasoning and generation quality
+ - **Advanced Vision Encoder:** SigLIP2 NaFlex shape-optimized (400M) for better visual understanding
+ - **Low Latency:** 2× faster inference speed on GPUs compared to similar-sized VLMs
+ - **Multi-turn Conversations:** Trained on conversational data for interactive vision-language tasks
+ - **Native Resolution Processing:** Handles images up to 512×512 pixels without upscaling, with intelligent tiling for larger images

+ ## Model Details

+ | Property | Value |
+ |---|---:|
+ | **Parameters (LM only)** | 1.2B |
+ | **Vision encoder** | SigLIP2 NaFlex shape-optimized (400M) |
+ | **Total parameters** | ~1.6B |
+ | **Backbone layers** | hybrid conv+attention |
+ | **Context (text)** | 32,768 tokens |
+ | **Image tokens** | dynamic, user-tunable |
+ | **Vocab size** | 65,536 |
+ | **Precision** | bfloat16 |

+ ## Training Data

+ The model was fine-tuned on approximately **98,000 multi-turn conversational samples** from:

+ - **Dataset:** [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)
+ - **Content:** Japanese visual question-answering conversations
+ - **Format:** Multi-turn dialogues with image context
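
For a quick look at what these samples contain, the dataset can be loaded directly with the `datasets` library; a minimal sketch, where the `train` split name and the exact column layout are assumptions rather than guarantees:

```python
# Minimal sketch: inspect the fine-tuning data.
# The split name and column names are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("llm-jp/ja-vg-vqa-conversation", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one multi-turn Japanese VQA conversation and its image reference
```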

+ ## Intended Use

+ ### Primary Use Cases

+ - Japanese image captioning and detailed description
+ - Visual question answering in Japanese with enhanced reasoning
+ - Multi-turn conversations about images in Japanese
+ - Japanese document understanding and OCR tasks
+ - Complex visual reasoning tasks in Japanese
+ - Edge AI applications requiring Japanese language support

+ ### Recommended Applications

+ - Japanese e-commerce product analysis and description
+ - Japanese accessibility tools for visual content
+ - Japanese educational applications requiring visual understanding
+ - Japanese content moderation and detailed analysis
+ - Japanese chatbots with advanced visual understanding
+ - Japanese document processing and information extraction

+ ## How to Use

+ ### Installation

```bash
pip install -U transformers pillow
```

+ ### Basic Usage

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
+ model_id = "Alfaxad/LFM2-VL-1.6B-jp"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

+ # Load image and create conversation in Japanese
+ image = load_image("your_image_url_or_path.jpg")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
+             {"type": "text", "text": "この画像について詳しく説明してください。"},
        ],
    },
]

+ # Generate response
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_dict=True,
    tokenize=True,
).to(model.device)

+ outputs = model.generate(**inputs, max_new_tokens=256)
+ response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
+ print(response)
```

+ ### Multi-turn Conversation Example

+ ```python
+ # Multi-turn conversation
+ conversation = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": "この画像には何が写っていますか?"},
+         ],
+     },
+     {
+         "role": "assistant",
+         "content": [
+             {"type": "text", "text": "この画像には赤い車が道路に駐車されています。"},
+         ],
+     },
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "車のメーカーはわかりますか?"},
+         ],
+     },
+ ]
+
+ inputs = processor.apply_chat_template(
+     conversation,
+     add_generation_prompt=True,
+     return_tensors="pt",
+     return_dict=True,
+     tokenize=True,
+ ).to(model.device)
+
+ outputs = model.generate(**inputs, max_new_tokens=128)
+ response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
+ ```

+ ### Recommended Generation Parameters

+ - **temperature:** 0.1
+ - **min_p:** 0.15
+ - **repetition_penalty:** 1.05
+ - **min_image_tokens:** 64
+ - **max_image_tokens:** 256
+ - **do_image_splitting:** True
+ - **max_new_tokens:** 128-512 (depending on task complexity)
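
The text-side settings map directly onto `model.generate`; the vision-side settings (`min_image_tokens`, `max_image_tokens`, `do_image_splitting`) are processor-level options per the base model card. A minimal sketch, reusing `model`, `processor`, and `inputs` from the usage example above, with sampling enabled as an assumption so that `temperature` and `min_p` take effect:

```python
# Minimal sketch: apply the recommended decoding settings.
# do_sample=True is assumed; temperature and min_p only apply when sampling.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=256,
)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```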

+ ### Chat Template

+ The model uses a ChatML-like format:

+ ```
+ <|startoftext|><|im_start|>system
+ あなたはLiquid AIによる有用なマルチモーダルアシスタントです。<|im_end|>
+ <|im_start|>user
+ <image>この画像を詳しく説明してください。<|im_end|>
+ <|im_start|>assistant
+ この画像には...<|im_end|>
+ ```
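
To see this markup rendered for an actual conversation, including the `<image>` sentinel that the processor later expands into image tokens, the chat template can be applied without tokenization; a small sketch reusing the `processor` and `conversation` objects from the examples above:

```python
# Minimal sketch: render the prompt as text instead of token IDs.
prompt = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # ChatML-style markup with the <image> sentinel
```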

+ ## Architecture Highlights

+ - **Hybrid backbone:** LFM2-1.2B language model paired with the SigLIP2 NaFlex shape-optimized vision encoder (400M)
+ - **Native resolution processing:** Handles images up to 512×512 pixels without upscaling
+ - **Tiling strategy:** Splits large images into non-overlapping 512×512 patches with thumbnail encoding for global context
+ - **Efficient token mapping:** 2-layer MLP connector with pixel unshuffle keeps the image-token count low
+ - **Inference-time flexibility:** User-tunable maximum image tokens and patch count for a speed/quality tradeoff
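
To get a feel for the token mapping, the base LFM2-VL card quotes a 256×384 image mapping to 96 image tokens. Assuming 16×16 vision patches and a 2×2 pixel unshuffle (both assumptions, not figures stated on this card), that number can be reproduced with simple arithmetic:

```python
# Back-of-the-envelope image-token count; patch size and unshuffle factor are assumed.
width, height = 384, 256
patch = 16       # assumed SigLIP2 NaFlex patch size
unshuffle = 2    # assumed pixel-unshuffle factor
patches = (width // patch) * (height // patch)  # 24 * 16 = 384 patches
image_tokens = patches // (unshuffle ** 2)      # 384 / 4  = 96 tokens
print(image_tokens)                             # 96, matching the base card's example
```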

+ ## Training Details

+ ### Training Procedure

+ - **Base Model:** LiquidAI/LFM2-VL-1.6B
+ - **Fine-tuning Method:** Supervised Fine-Tuning (SFT) with LoRA adapters
+ - **Framework:** Hugging Face TRL (Transformer Reinforcement Learning)
+ - **Training Data:** ~98,000 multi-turn conversations
+ - **Training Regime:** bfloat16 mixed precision

+ ### Training Hyperparameters

+ - **Training approach:** LoRA (Low-Rank Adaptation) fine-tuning
+ - **Dataset size:** ~98,000 samples
+ - **Data format:** Multi-turn conversational VQA
+ - **Language focus:** Japanese
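
A minimal sketch of that recipe (SFT with a LoRA adapter via TRL). The dataset column names in the collator, the LoRA targets, and every hyperparameter below are illustrative assumptions, not the configuration actually used for this checkpoint:

```python
# Minimal LoRA SFT sketch with TRL; columns and hyperparameters are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import SFTConfig, SFTTrainer

base_id = "LiquidAI/LFM2-VL-1.6B"
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

train_ds = load_dataset("llm-jp/ja-vg-vqa-conversation", split="train")

def collate_fn(examples):
    # Assumes each sample provides a chat-style "messages" list and an "image".
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [[ex["image"]] for ex in examples]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    batch["labels"] = labels
    return batch

peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="lfm2-vl-1.6b-jp",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
        remove_unused_columns=False,
        dataset_kwargs={"skip_prepare_dataset": True},  # the collator above prepares batches
    ),
    train_dataset=train_ds,
    data_collator=collate_fn,
    peft_config=peft_config,
)
trainer.train()
```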

+ ## Performance Considerations

+ As a fine-tuned variant of LFM2-VL-1.6B:
+ - **Enhanced Capabilities:** The 1.6B model offers improved reasoning, more detailed descriptions, and better handling of complex visual scenarios compared to the 450M variant
+ - **Optimized for Japanese:** Best performance on Japanese-language tasks
+ - **Resource Efficient:** Still lightweight enough for edge devices while providing enhanced capabilities
+ - **Speed vs. Quality:** Offers a better balance between inference speed and output quality than the smaller variant
+ - **Recommended Use:** Can be used out of the box for many Japanese VLM tasks, though further fine-tuning on specific use cases will maximize performance

+ ## Comparison with 450M Variant

+ | Aspect | LFM2-VL-450M-jp | LFM2-VL-1.6B-jp |
+ |--------|-----------------|-----------------|
+ | **Parameters** | 450M total | 1.6B total |
+ | **Vision Encoder** | SigLIP2 NaFlex base (86M) | SigLIP2 NaFlex shape-optimized (400M) |
+ | **Use Case** | Highly constrained devices | More capable while still lightweight |
+ | **Output Quality** | Good for simple tasks | Better for complex reasoning |
+ | **Inference Speed** | Faster | Still fast, slightly slower |
+ | **Memory Usage** | Lower | Higher but manageable |

+ ## Limitations

+ - **Language Specialization:** Primarily designed for Japanese; performance on other languages may be limited
+ - **Domain Specificity:** Performance is optimized for the types of conversations present in the training data
+ - **Safety:** Not intended for safety-critical decisions without additional validation
+ - **Complex Reasoning:** While improved over the 450M variant, the model may still struggle with highly complex multi-step reasoning compared to much larger models
+ - **Cultural Context:** Trained on Japanese data; cultural nuances should be considered

+ ## Citation

+ If you use this model, please cite both the original LFM2-VL model and this fine-tuned variant:

+ ```bibtex
+ @misc{lfm2-vl-1.6b-jp,
+   author = {Alfaxad},
+   title = {LFM2-VL-1.6B-jp: Japanese Fine-tuned Vision-Language Model},
+   year = {2025},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/Alfaxad/LFM2-VL-1.6B-jp}
+ }
+
+ @misc{liquid-lfm2-vl,
+   author = {Liquid AI},
+   title = {LFM2-VL: Efficient Vision-Language Models},
+   year = {2025},
+   url = {https://huggingface.co/LiquidAI/LFM2-VL-1.6B}
+ }
+ ```
271
 
272
+ ## Acknowledgments
273
 
274
+ - **Base Model:** [Liquid AI](https://www.liquid.ai/) for the LFM2-VL architecture
275
+ - **Training Data:** [llm-jp](https://huggingface.co/llm-jp) for the ja-vg-vqa-conversation dataset
276
+ - **Framework:** Hugging Face for transformers and TRL libraries
 
 
 
277
 
278
+ ## Contact
 
 
 
279
 
280
+ For questions or issues regarding this model, please open an issue on the model's Hugging Face page or contact the model developer.
281
 
282
+ ## Additional Resources
283
 
284
+ - **Original Model:** [LiquidAI/LFM2-VL-1.6B](https://huggingface.co/LiquidAI/LFM2-VL-1.6B)
285
+ - **Smaller Variant:** [Alfaxad/LFM2-VL-450M-jp](https://huggingface.co/Alfaxad/LFM2-VL-450M-jp)
286
+ - **Training Dataset:** [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)
287
+ - **LFM2-VL Blog Post:** [Liquid AI Blog](https://www.liquid.ai/blog/lfm2-vl-efficient-vision-language-models)
288
+ - **Original Paper/Documentation:** [LFM2 Blog Post](https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models)