Update README.md #7
by Alfaxad - opened

README.md CHANGED
*(This update replaces the stock LFM2-VL model card that previously lived here: the Liquid AI banner image, the LFM2-VL-450M / LFM2-VL-1.6B spec table, the English chat template and usage example, and the benchmark tables. The Japanese fine-tune card follows.)*

---
license: other
license_name: lfm1.0
license_link: LICENSE
language:
- ja
base_model: LiquidAI/LFM2-VL-1.6B
tags:
- liquid
- lfm2
- lfm2-vl
- vision-language
- japanese
- multimodal
- trl
- sft
pipeline_tag: image-text-to-text
---

# LFM2-VL-1.6B-jp (Japanese)

## Model Description

**LFM2-VL-1.6B-jp** is a Japanese fine-tuned variant of [LiquidAI/LFM2-VL-1.6B](https://huggingface.co/LiquidAI/LFM2-VL-1.6B), optimized for Japanese vision-language tasks. It retains the efficiency and performance characteristics of the LFM2-VL-1.6B architecture while specializing in Japanese language understanding and image description. At 1.6B parameters, it offers stronger capabilities than the 450M variant while remaining lightweight enough for edge deployment.

- **Developed by:** Alfaxad
- **Base Model:** [LiquidAI/LFM2-VL-1.6B](https://huggingface.co/LiquidAI/LFM2-VL-1.6B)
- **Model type:** Vision-Language Model (Multimodal)
- **Language:** Japanese (日本語)
- **License:** LFM Open License v1.0
- **Finetuned from:** LiquidAI/LFM2-VL-1.6B (1.6B parameters)

## Key Features

- **Japanese Language Support:** Specialized for Japanese image understanding and description tasks
- **Enhanced Capabilities:** 1.6B parameters provide improved reasoning and generation quality
- **Advanced Vision Encoder:** SigLIP2 NaFlex shape-optimized (400M) for better visual understanding
- **Low Latency:** 2× faster inference speed on GPUs compared to similar-sized VLMs
- **Multi-turn Conversations:** Trained on conversational data for interactive vision-language tasks
- **Native Resolution Processing:** Handles images up to 512×512 pixels without upscaling, with intelligent tiling for larger images

## Model Details

| Property | Value |
|---|---:|
| **Parameters (LM only)** | 1.2B |
| **Vision encoder** | SigLIP2 NaFlex shape-optimized (400M) |
| **Total parameters** | ~1.6B |
| **Backbone layers** | hybrid conv+attention |
| **Context (text)** | 32,768 tokens |
| **Image tokens** | dynamic, user-tunable |
| **Vocab size** | 65,536 |
| **Precision** | bfloat16 |

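For a rough sense of the deployment footprint (illustrative arithmetic, not a measured figure): bfloat16 stores two bytes per parameter, so the weights alone come to roughly 3 GiB.

```python
# Back-of-envelope weight memory for ~1.6B parameters stored in bfloat16
params = 1.6e9
bytes_per_param = 2  # bfloat16 = 16 bits
print(f"~{params * bytes_per_param / 1024**3:.1f} GiB")  # -> ~3.0 GiB, weights only
```
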
## Training Data

The model was fine-tuned on approximately **98,000 multi-turn conversational samples** from:

- **Dataset:** [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)
- **Content:** Japanese visual question-answering conversations
- **Format:** Multi-turn dialogues with image context

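A minimal snippet for eyeballing this corpus before adapting or evaluating the model (the `train` split name and the record schema are assumptions; check the dataset card for the actual fields):

```python
from datasets import load_dataset

# Stream one record without downloading the whole corpus
ds = load_dataset("llm-jp/ja-vg-vqa-conversation", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # shows the conversation/image fields as actually named
```
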
## Intended Use

### Primary Use Cases

- Japanese image captioning and detailed description
- Visual question answering in Japanese with enhanced reasoning
- Multi-turn conversations about images in Japanese
- Japanese document understanding and OCR tasks
- Complex visual reasoning tasks in Japanese
- Edge AI applications requiring Japanese language support

### Recommended Applications

- Japanese e-commerce product analysis and description
- Japanese accessibility tools for visual content
- Japanese educational applications requiring visual understanding
- Japanese content moderation and detailed analysis
- Japanese chatbots with advanced visual understanding
- Japanese document processing and information extraction

## How to Use

### Installation

```bash
pip install -U transformers pillow
```

### Basic Usage

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
model_id = "Alfaxad/LFM2-VL-1.6B-jp"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image and create conversation in Japanese
image = load_image("your_image_url_or_path.jpg")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            # "Please describe this image in detail."
            {"type": "text", "text": "この画像について詳しく説明してください。"},
        ],
    },
]

# Generate response
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```

### Multi-turn Conversation Example

```python
# Multi-turn conversation: include the earlier assistant turn, then ask a follow-up
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            # "What is shown in this image?"
            {"type": "text", "text": "この画像には何が写っていますか?"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            # "A red car is parked on the road."
            {"type": "text", "text": "この画像には赤い車が道路に駐車されています。"},
        ],
    },
    {
        "role": "user",
        "content": [
            # "Can you tell the make of the car?"
            {"type": "text", "text": "車のメーカーはわかりますか?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```

### Recommended Generation Parameters

- **Temperature:** 0.1
- **min_p:** 0.15
- **repetition_penalty:** 1.05
- **min_image_tokens:** 64
- **max_image_tokens:** 256
- **do_image_splitting:** True
- **max_new_tokens:** 128-512 (depending on task complexity)

The sampling values map onto standard `generate` arguments, and the image-token settings onto the processor, as sketched below.

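A sketch of how these settings plug into `transformers` (the sampling values are standard `generate` kwargs; the image-token options mirror the base LFM2-VL card, so treat those kwarg names as assumptions to verify against your installed version):

```python
# Image-token budget: trade quality for speed at inference time
# (kwarg names assumed from the base LFM2-VL card)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    min_image_tokens=64,
    max_image_tokens=256,
    do_image_splitting=True,
)

# Sampling settings from the list above
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=256,
)
```
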
### Chat Template

The model uses a ChatML-like format:

```
<|startoftext|><|im_start|>system
あなたはLiquid AIによる有用なマルチモーダルアシスタントです。<|im_end|>
<|im_start|>user
<image>この画像を詳しく説明してください。<|im_end|>
<|im_start|>assistant
この画像には...<|im_end|>
```

(The Japanese reads: system: "You are a helpful multimodal assistant by Liquid AI."; user: "Please describe this image in detail."; assistant: "This image shows...")

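You normally never write this string by hand; `apply_chat_template` renders it. To check exactly what the processor feeds the model, render the prompt as text instead of token IDs:

```python
# Render the chat-formatted prompt as a string for inspection
prompt = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # should match the <|im_start|> structure shown above
```
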
## Architecture Highlights

- **Hybrid backbone:** LFM2-1.2B language model paired with the SigLIP2 NaFlex shape-optimized vision encoder (400M)
- **Native resolution processing:** Handles images up to 512×512 pixels without upscaling
- **Tiling strategy:** Splits larger images into non-overlapping 512×512 patches, with thumbnail encoding for global context (see the sketch after this list)
- **Efficient token mapping:** 2-layer MLP connector with pixel unshuffle keeps the image-token count low
- **Inference-time flexibility:** User-tunable maximum image tokens and patch count for a speed/quality tradeoff

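Illustrative arithmetic for the tiling step (a sketch based only on the 512×512 tile size stated above; per-tile token counts depend on the tunable image-token budget):

```python
import math

def num_tiles(width: int, height: int, tile: int = 512) -> int:
    """Non-overlapping 512x512 tiles an image is split into."""
    return math.ceil(width / tile) * math.ceil(height / tile)

# A 1024x768 photo -> 2 * 2 = 4 tiles, plus one thumbnail for global context
print(num_tiles(1024, 768) + 1)  # -> 5 encoder passes
```
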
## Training Details

### Training Procedure

- **Base Model:** LiquidAI/LFM2-VL-1.6B
- **Fine-tuning Method:** Supervised Fine-Tuning (SFT) with LoRA adapters
- **Framework:** Hugging Face TRL (Transformer Reinforcement Learning)
- **Training Data:** ~98,000 multi-turn conversations
- **Training Regime:** bfloat16 mixed precision

### Training Configuration

- **Training approach:** LoRA (Low-Rank Adaptation) fine-tuning
- **Dataset size:** ~98,000 samples
- **Data format:** Multi-turn conversational VQA
- **Language focus:** Japanese

A minimal sketch of this setup follows.

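The LoRA rank/alpha below are placeholders (the actual values were not published), and the processor/collator wiring for vision-language SFT varies across TRL versions, so consult the TRL docs for your release:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("llm-jp/ja-vg-vqa-conversation", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")  # placeholder values

trainer = SFTTrainer(
    model=model,                      # base LFM2-VL-1.6B loaded as above
    args=SFTConfig(output_dir="lfm2-vl-1.6b-jp", bf16=True),
    train_dataset=train_ds,
    processing_class=processor,       # chat formatting / image processing
    peft_config=peft_config,
)
trainer.train()
```
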
## Performance Considerations

As a fine-tuned variant of LFM2-VL-1.6B:

- **Enhanced Capabilities:** The 1.6B model offers improved reasoning, more detailed descriptions, and better handling of complex visual scenarios than the 450M variant
- **Optimized for Japanese:** Best performance on Japanese-language tasks
- **Resource Efficient:** Still lightweight enough for edge devices
- **Speed vs. Quality:** A stronger balance between inference speed and output quality
- **Recommended Use:** Works out of the box for many Japanese VLM tasks, though further fine-tuning on a specific use case will maximize performance

## Comparison with 450M Variant

| Aspect | LFM2-VL-450M-jp | LFM2-VL-1.6B-jp |
|--------|-----------------|-----------------|
| **Parameters** | 450M total | 1.6B total |
| **Vision Encoder** | SigLIP2 NaFlex base (86M) | SigLIP2 NaFlex shape-optimized (400M) |
| **Use Case** | Highly constrained devices | More capable while still lightweight |
| **Output Quality** | Good for simple tasks | Better for complex reasoning |
| **Inference Speed** | Faster | Still fast, slightly slower |
| **Memory Usage** | Lower | Higher but manageable |

## Limitations

- **Language Specialization:** Primarily designed for Japanese; performance on other languages may be limited
- **Domain Specificity:** Performance is optimized for the types of conversations present in the training data
- **Safety:** Not intended for safety-critical decisions without additional validation
- **Complex Reasoning:** While improved over the 450M variant, it may still struggle with highly complex multi-step reasoning compared to much larger models
- **Cultural Context:** Trained on Japanese data; cultural nuances should be considered

## Citation

If you use this model, please cite both the original LFM2-VL model and this fine-tuned variant:

```bibtex
@misc{lfm2-vl-1.6b-jp,
  author    = {Alfaxad},
  title     = {LFM2-VL-1.6B-jp: Japanese Fine-tuned Vision-Language Model},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Alfaxad/LFM2-VL-1.6B-jp}
}

@misc{liquid-lfm2-vl,
  author = {Liquid AI},
  title  = {LFM2-VL: Efficient Vision-Language Models},
  year   = {2025},
  url    = {https://huggingface.co/LiquidAI/LFM2-VL-1.6B}
}
```

## Acknowledgments

- **Base Model:** [Liquid AI](https://www.liquid.ai/) for the LFM2-VL architecture
- **Training Data:** [llm-jp](https://huggingface.co/llm-jp) for the ja-vg-vqa-conversation dataset
- **Framework:** Hugging Face for the transformers and TRL libraries

## Contact

For questions or issues regarding this model, please open an issue on the model's Hugging Face page or contact the model developer.

## Additional Resources

- **Original Model:** [LiquidAI/LFM2-VL-1.6B](https://huggingface.co/LiquidAI/LFM2-VL-1.6B)
- **Smaller Variant:** [Alfaxad/LFM2-VL-450M-jp](https://huggingface.co/Alfaxad/LFM2-VL-450M-jp)
- **Training Dataset:** [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)
- **LFM2-VL Blog Post:** [Liquid AI Blog](https://www.liquid.ai/blog/lfm2-vl-efficient-vision-language-models)
- **Original Paper/Documentation:** [LFM2 Blog Post](https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models)