---
library_name: transformers
license: other
license_name: lfm1.0
license_link: LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- liquid
- lfm2.5
- lfm2
- edge
- vision
base_model: LiquidAI/LFM2.5-VL-450M
---
<center>
<div style="text-align: center;">
  <img 
    src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/2b08LKpev0DNEk6DlnWkY.png" 
    alt="Liquid AI"
    style="width: 100%; max-width: 100%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;"
  />
</div>
<div style="display: flex; justify-content: center; gap: 0.5em;">
<a href="https://playground.liquid.ai/chat?model=lfm2.5-vl-450m"><strong>Try LFM</strong></a><a href="https://docs.liquid.ai/lfm/getting-started/welcome"><strong>Docs</strong></a><a href="https://leap.liquid.ai/"><strong>LEAP</strong></a><a href="https://discord.com/invite/liquid-ai"><strong>Discord</strong></a>
</div>
</center>

<br>

# LFM2.5-VL-450M-Extract

**LFM2.5-VL-450M-Extract** extracts user-defined fields from images and returns them as **JSON**. It is Liquid AI's first vision model in the [Liquid Nanos](https://huggingface.co/collections/LiquidAI/liquid-nanos) collection—compact, task-specific models built for production workflows—and extends the Extract family alongside [LFM2-350M-Extract](https://huggingface.co/LiquidAI/LFM2-350M-Extract) for text documents.

## ⚙️ How it works

You specify what to extract as a YAML field list in the system prompt, and the model returns a JSON object with those fields. Structured outputs integrate cleanly with rule-based systems and downstream pipelines. Use it out of the box or fine-tune for domain-specific extraction.

- **System prompt**:

```yaml
wood_color: The overall coloration of the wood surface
wood_texture: The tactile quality of the wood surface
wood_pattern: The pattern types visible on the wood surface
```

- **User prompt**:
<img src="https://huggingface.co/LiquidAI/LFM2.5-VL-450M-Extract/resolve/main/sample_image.png" width="300">

- **Output**:
```json
{
  "wood_color": "light to medium brown",
  "wood_texture": "smooth with visible grain",
  "wood_pattern": "parallel, irregular, wavy"
}
```
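
The YAML field list can also be generated from a plain dictionary instead of being written by hand. The sketch below is illustrative only: `build_system_prompt` is a hypothetical helper, the wrapper sentences are copied from the "How to run" example in this card, and descriptions are assumed to be single-line strings.

```python
def build_system_prompt(fields: dict) -> str:
    """Render a {field_name: description} mapping as a YAML field list,
    wrapped in the instruction text used in the "How to run" example."""
    # One "name: description" line per field (descriptions assumed single-line).
    fields_yaml = "\n".join(f"{name}: {desc.strip()}" for name, desc in fields.items())
    return (
        "Extract the following from the image:\n\n"
        f"{fields_yaml}\n\n"
        "Respond with only a JSON object. Do not include any text outside the JSON."
    )


fields = {
    "wood_color": "The overall coloration of the wood surface",
    "wood_texture": "The tactile quality of the wood surface",
    "wood_pattern": "The pattern types visible on the wood surface",
}
system_prompt = build_system_prompt(fields)
print(system_prompt)
```

Keeping the fields in a dictionary makes it easy to reuse the same names later when checking the keys of the returned JSON object.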

The model also supports enum-style fields: list the allowed values in the field description, and the model returns one of them as its answer.

- **System prompt**:

```yaml
wood_color: The overall coloration of the wood surface, such as blue, red, or light tan
wood_texture: The tactile quality of the wood surface, select from smooth, rough, or grainy
wood_pattern: The pattern types visible on the wood surface, e.g., straight, wavy, or curly
```
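
Since an enum field is expected to come back as one of the listed values, a downstream check can flag the occasional out-of-list answer. This is a minimal sketch, assuming you track the allowed values separately in your own code. The `ALLOWED` mapping and `check_enums` helper are hypothetical, and the sample response is made up for illustration rather than taken from the model.

```python
import json

# Allowed values per enum field, mirroring the choices in the field descriptions.
ALLOWED = {
    "wood_texture": {"smooth", "rough", "grainy"},
    "wood_pattern": {"straight", "wavy", "curly"},
}


def check_enums(response: str, allowed: dict) -> dict:
    """Return {field: offending_value} for enum fields whose value is not allowed."""
    data = json.loads(response)
    violations = {}
    for field, choices in allowed.items():
        # Normalize case and whitespace before comparing; a missing field becomes "".
        value = str(data.get(field, "")).strip().lower()
        if value not in choices:
            violations[field] = value
    return violations


# Illustrative response string (not real model output).
sample = '{"wood_color": "light tan", "wood_texture": "smooth", "wood_pattern": "zigzag"}'
print(check_enums(sample, ALLOWED))
```

An empty result means every enum field held an allowed value. Anything else can be routed to a retry or a manual review queue.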

## 🌟 Use cases

- Detecting safety-critical events in images (e.g. fallen person, fire, leakage) to trigger automated safety systems.
- Collecting statistical information about objects across video frames for analytics pipelines.
- Auto-tagging product images with structured attributes for retail and e-commerce.
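
For the video-analytics case, per-frame JSON outputs can be tallied with the standard library. The sketch below assumes each frame was run through the model separately and that a field named `vehicle_type` was requested. The field name, the `tally` helper, and the sample outputs are all made up for illustration.

```python
import json
from collections import Counter

# Hypothetical per-frame model outputs (one JSON string per frame).
frame_outputs = [
    '{"vehicle_type": "car"}',
    '{"vehicle_type": "truck"}',
    '{"vehicle_type": "car"}',
    "not json",  # a malformed output is skipped rather than crashing the tally
]


def tally(outputs: list, field: str) -> Counter:
    """Count the values of one field across frames, ignoring unparseable outputs."""
    counts = Counter()
    for raw in outputs:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and field in data:
            counts[str(data[field])] += 1
    return counts


print(tally(frame_outputs, "vehicle_type"))
```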

## 📄 Model details

| Property | Detail |
|---|---:|
| **Parameters (LM only)** | 350M |
| **Vision encoder** | SigLIP2 (~100M, [SigLIP-2 paper](https://arxiv.org/abs/2502.14786)) |
| **Backbone layers** | hybrid conv+attention |
| **Image input** | Single image, dynamic resolution |
| **Context** | 128,000 tokens |
| **Vocab size** | 65,536 (text) |
| **Precision** | bfloat16 |
| **License** | LFM Open License v1.0 |

## 📊 Performance

We evaluated LFM2.5-VL-450M-Extract on a 2,000-sample benchmark of
`(image, schema, JSON)` triples, with reference labels generated by an
ensemble of frontier multimodal models. Predictions are scored on the
following three dimensions:

- **JSON Validity** — share of samples producing strict-parseable JSON
- **Schema Consistency F1 Score** — set-level F1 over predicted vs requested field names, macro-averaged across samples
- **VLM Judge Score** — match against the image directly, judged by a separate vision model ([Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B))
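
As a rough illustration of the first two metrics, the sketch below computes JSON validity and a macro-averaged, set-level F1 over field names. It is not the code in `model_eval/`. The function names, the choice to score invalid JSON as an empty prediction, and the sample data are all assumptions.

```python
import json
from typing import Optional


def parse_fields(raw: str) -> Optional[set]:
    """Return the top-level field names, or None if the output is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return set(data) if isinstance(data, dict) else None


def field_f1(predicted: set, requested: set) -> float:
    """Set-level F1 between predicted and requested field names."""
    overlap = len(predicted & requested)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(requested)
    return 2 * precision * recall / (precision + recall)


def score(outputs: list, schemas: list) -> tuple:
    """Return (JSON validity, macro-averaged schema F1), both in [0, 1]."""
    parsed = [parse_fields(raw) for raw in outputs]
    validity = sum(p is not None for p in parsed) / len(outputs)
    # Assumption: an invalid output counts as predicting no fields (F1 = 0).
    f1 = sum(field_f1(p or set(), s) for p, s in zip(parsed, schemas)) / len(outputs)
    return validity, f1


# Toy data: one perfect output, one with a wrong field name, one unparseable.
outputs = ['{"a": 1, "b": 2}', '{"a": 1, "c": 3}', "oops"]
schemas = [{"a", "b"}, {"a", "b"}, {"a", "b"}]
validity, f1 = score(outputs, schemas)
print(validity, f1)
```

In this toy case two of three outputs parse, and the per-sample F1 values are 1.0, 0.5, and 0.0, which average to 0.5.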

<img src="https://huggingface.co/LiquidAI/LFM2.5-VL-450M-Extract/resolve/main/lfm2_vl_450m_metrics.png" width="800">

| Model | Params | JSON Validity | F1 Score | VLM Judge Score |
|---|---:|---:|---:|---:|
| **LFM2.5-VL-450M-Extract** | **0.45B** | **98.9** | **98.8** | **84.5** |
| LFM2.5-VL-450M | 0.45B | 97.7 | 93.5 | 73.4 |
| SmolVLM-500M-Instruct | 0.51B | 33.0 | 26.6 | 12.2 |
| FastVLM-0.5B | 0.76B | 22.5 | 19.3 | 16.3 |
| Qwen3.5-0.8B | 0.87B | 96.4 | 96.3 | 82.3 |
| InternVL3_5-1B | 1.06B | 98.0 | 96.5 | 80.7 |
| MiniCPM-V-4.6 | 1.30B | 61.8 | 60.4 | 57.5 |
| *(ref) InternVL3_5-2B* | 2.35B | 99.6 | 99.2 | 87.7 |
| *(ref) Qwen3.5-2B* | 2.27B | 97.9 | 97.7 | 89.7 |
| *(ref) gemma-4-E2B-it* | 2.3B | 97.4 | 97.1 | 84.4 |

LFM2.5-VL-450M-Extract outperforms every non-reference model in the table (up to 1.3B parameters) on all three metrics and is competitive with the 2B-class reference models, which are roughly 5× its size.

**Reproducing these numbers**: The full evaluation pipeline, which includes extraction, VLM judging, and metric aggregation, is bundled in this repository under `model_eval/`. Setup, configuration, and run instructions are in the folder's [`README`](./model_eval/README.md).

**Scope**: These numbers characterize the model on the input/output form it is designed for: a single input image, a YAML field list as the schema, and a flat JSON object as the output. Performance is not expected to transfer to substantially different tasks, such as multi-image reasoning or free-form VQA.

<!-- > Generic instruction-tuned VLMs (SmolVLM, moondream) cannot perform
> schema-based extraction zero-shot regardless of prompt strategy.
> Under the most permissive prompt setups, they either:
>   - produce free-form captions ignoring the JSON instruction, or
>   - produce valid-shaped JSON but echo the schema descriptions or
>     few-shot example values as field values (zero faithfulness to
>     the image).
>
> LFM2-VL-Extract's task-specific training is what enables strict-JSON
> output with faithful, image-grounded values in a single zero-shot
> call — no few-shot examples, no grammar constraints, no inference
> wrappers. -->



## 🏃 How to run

You can run LFM2.5-VL-450M-Extract with Hugging Face [`transformers`](https://github.com/huggingface/transformers) v5.1 or newer:

```bash
pip install "transformers>=5.1" pillow torch accelerate
```

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

model_id = "LiquidAI/LFM2.5-VL-450M-Extract"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = load_image("https://huggingface.co/LiquidAI/LFM2.5-VL-450M-Extract/resolve/main/sample_image.png")

fields_yaml = """wood_color: The overall coloration of the wood surface
wood_texture: The tactile quality of the wood surface
wood_pattern: The pattern types visible on the wood surface"""

system_prompt = f"""Extract the following from the image:

{fields_yaml}

Respond with only a JSON object. Do not include any text outside the JSON."""

conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": [{"type": "image", "image": image}]},
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(response)
# {
#   "wood_color": "light to medium brown",
#   "wood_texture": "smooth with visible grain",
#   "wood_pattern": "parallel, irregular, wavy"
# }
```

> [!WARNING]
> The model is intended for single-turn conversations. We recommend greedy decoding: `do_sample=False` in `transformers`, as in the example above, or `temperature=0` in runtimes that accept it.

## 📬 Contact

- Got questions or want to connect? [Join our Discord community](https://discord.com/invite/liquid-ai)
- If you are interested in custom solutions with edge deployment, please contact [our sales team](https://www.liquid.ai/contact).

## Citation

```bibtex
@article{liquidai2025lfm2,
 title={LFM2 Technical Report},
 author={Liquid AI},
 journal={arXiv preprint arXiv:2511.23404},
 year={2025}
}
```