---
library_name: transformers
license: other
license_name: lfm1.0
license_link: LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- liquid
- lfm2.5
- lfm2
- edge
- vision
base_model: LiquidAI/LFM2.5-VL-450M
---
- **Output**:
```json
{
"wood_color": "light to medium brown",
"wood_texture": "smooth with visible grain",
"wood_pattern": "parallel, irregular, wavy"
}
```
The model also supports enum-style fields: list the allowed values in the field description, as in the system prompt below, and the model returns one of the listed values.
- **System prompt**:
```yaml
wood_color: The overall coloration of the wood surface, such as blue, red, or light tan
wood_texture: The tactile quality of the wood surface, select from smooth, rough, or grainy
wood_pattern: The pattern types visible on the wood surface, e.g., straight, wavy, or curly
```
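As a minimal sketch of how such a field list could be generated and checked programmatically: the `FIELDS` dictionary, `build_field_list`, and `enum_violations` below are hypothetical helpers, not part of the model or of `transformers`. The sketch assumes that the "select from ..." wording shown above is a suitable way to express the choices, and the sample result is a hand-written stand-in, not real model output.

```python
import json

# Hypothetical schema: field name -> (description, allowed values or None).
FIELDS = {
    "wood_color": ("The overall coloration of the wood surface", None),
    "wood_texture": ("The tactile quality of the wood surface", ["smooth", "rough", "grainy"]),
    "wood_pattern": ("The pattern types visible on the wood surface", None),
}


def _join_choices(choices):
    """Format choices as 'a', 'a or b', or 'a, b, or c'."""
    if len(choices) == 1:
        return choices[0]
    if len(choices) == 2:
        return f"{choices[0]} or {choices[1]}"
    return ", ".join(choices[:-1]) + f", or {choices[-1]}"


def build_field_list(fields):
    """Render one `name: description` line per field, appending enum choices."""
    lines = []
    for name, (description, choices) in fields.items():
        if choices:
            # Assumption: reuse the "select from ..." wording from the example above.
            description = f"{description}, select from {_join_choices(choices)}"
        lines.append(f"{name}: {description}")
    return "\n".join(lines)


def enum_violations(result, fields):
    """Return {field: value} for enum fields whose value is missing or not allowed."""
    return {
        name: result.get(name)
        for name, (_, choices) in fields.items()
        if choices and result.get(name) not in choices
    }


print(build_field_list(FIELDS))

# Hand-written stand-in for a parsed model response (not real model output).
example_result = json.loads(
    '{"wood_color": "light tan", "wood_texture": "smooth", "wood_pattern": "wavy"}'
)
print(enum_violations(example_result, FIELDS))  # {}
```

Checking the returned value against the allowed list is a cheap safeguard for downstream code that branches on enum fields.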
## 🌟 Use cases
- Detecting safety-critical events in images (e.g. fallen person, fire, leakage) to trigger automated safety systems.
- Collecting statistical information about objects across video frames for analytics pipelines.
- Auto-tagging product images with structured attributes for retail and e-commerce.
## 📄 Model details
| Property | Detail |
|---|---:|
| **Parameters (LM only)** | 350M |
| **Vision encoder** | SigLIP2 (~100M, [SigLIP-2 paper](https://arxiv.org/abs/2502.14786)) |
| **Backbone layers** | hybrid conv+attention |
| **Image input** | Single image, dynamic resolution |
| **Context** | 128,000 tokens |
| **Vocab size** | 65,536 (text) |
| **Precision** | bfloat16 |
| **License** | LFM Open License v1.0 |
## 📊 Performance
We evaluated LFM2.5-VL-450M-Extract on a 2,000-sample benchmark of
`(image, schema, JSON)` triples, with reference labels generated by an
ensemble of frontier multimodal models. Predictions are scored on the
following three dimensions:
- **JSON Validity**: share of samples that produce strictly parseable JSON
- **Schema Consistency F1 Score**: set-level F1 between predicted and requested field names, macro-averaged across samples
- **VLM Judge Score**: how well the extracted values match the image itself, as judged by a separate vision model ([Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B))
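The first two metrics can be illustrated with a short sketch. This is not the evaluation code shipped with the model; `parse_strict`, `field_f1`, and `score` are hypothetical names, Python's `json.loads` stands in for the strict parser, and the sketch assumes that an unparseable output contributes an F1 of 0 to the macro average.

```python
import json


def parse_strict(text):
    """Return the parsed JSON object, or None if `text` is not a JSON object."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def field_f1(predicted, requested):
    """Set-level F1 between predicted and requested field names."""
    predicted, requested = set(predicted), set(requested)
    overlap = len(predicted & requested)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(requested)
    return 2 * precision * recall / (precision + recall)


def score(samples):
    """Score a list of (raw_model_output, requested_field_names) pairs."""
    parsed = [(parse_strict(raw), requested) for raw, requested in samples]
    validity = sum(obj is not None for obj, _ in parsed) / len(parsed)
    # Assumption: invalid JSON counts as an empty prediction, i.e. F1 = 0.
    f1 = sum(field_f1(obj or {}, requested) for obj, requested in parsed) / len(parsed)
    return {"json_validity": 100 * validity, "schema_f1": 100 * f1}


# Toy inputs for illustration only (not benchmark data).
samples = [
    ('{"a": 1, "b": 2}', ["a", "b"]),  # all requested fields present
    ('{"a": 1, "c": 2}', ["a", "b"]),  # one correct field, one unexpected field
    ("not json", ["a", "b"]),          # fails to parse
    ('{"a": 1, "b": 3}', ["a", "b"]),  # all requested fields present
]
print(score(samples))  # {'json_validity': 75.0, 'schema_f1': 62.5}
```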
| Model | Params | JSON Validity | F1 Score | VLM Judge Score |
|---|---:|---:|---:|---:|
| **LFM2.5-VL-450M-Extract** | **0.45B** | **98.9** | **98.8** | **84.5** |
| LFM2.5-VL-450M | 0.45B | 97.7 | 93.5 | 73.4 |
| SmolVLM-500M-Instruct | 0.51B | 33.0 | 26.6 | 12.2 |
| FastVLM-0.5B | 0.76B | 22.5 | 19.3 | 16.3 |
| Qwen3.5-0.8B | 0.87B | 96.4 | 96.3 | 82.3 |
| InternVL3_5-1B | 1.06B | 98.0 | 96.5 | 80.7 |
| MiniCPM-V-4.6 | 1.30B | 61.8 | 60.4 | 57.5 |
| *(ref) InternVL3_5-2B* | 2.35B | 99.6 | 99.2 | 87.7 |
| *(ref) Qwen3.5-2B* | 2.27B | 97.9 | 97.7 | 89.7 |
| *(ref) gemma-4-E2B-it* | 2.3B | 97.4 | 97.1 | 84.4 |
LFM2.5-VL-450M-Extract outperforms similarly-sized (sub-1B) open-source VLMs on this benchmark and is competitive with models 4× its size.
**Reproducing these numbers**: The full evaluation pipeline, which includes extraction, VLM judging, and metric aggregation, is bundled in this repository under `model_eval/`. Setup, configuration, and run instructions are in the folder's [`README`](./model_eval/README.md).
**Scope**: These numbers characterize the model on the input/output form it is designed for: a single input image, a YAML field list as the schema, and a flat JSON object as the output. Performance is not expected to transfer to substantially different tasks, such as multi-image reasoning or free-form VQA.
## 🏃 How to run
You can run LFM2.5-VL-450M-Extract with Hugging Face [`transformers`](https://github.com/huggingface/transformers) v5.1 or newer:
```bash
pip install -U "transformers>=5.1" pillow
```
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

model_id = "LiquidAI/LFM2.5-VL-450M-Extract"

# Load the model and its processor
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = load_image("https://huggingface.co/LiquidAI/LFM2.5-VL-450M-Extract/resolve/main/sample_image.png")

# Schema: one `field_name: description` line per field to extract
fields_yaml = """wood_color: The overall coloration of the wood surface
wood_texture: The tactile quality of the wood surface
wood_pattern: The pattern types visible on the wood surface"""

system_prompt = f"""Extract the following from the image:
{fields_yaml}
Respond with only a JSON object. Do not include any text outside the JSON."""

# Single-turn conversation: the schema goes in the system prompt, the image in the user turn
conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": [{"type": "image", "image": image}]},
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

# Greedy decoding
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]

print(response)
# {
#   "wood_color": "light to medium brown",
#   "wood_texture": "smooth with visible grain",
#   "wood_pattern": "parallel, irregular, wavy"
# }
```
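Downstream code usually needs the reply as a dictionary rather than a string. The following is a minimal sketch of one way to parse it and compare its keys with the requested fields; `parse_extraction` is a hypothetical helper, and the tolerance for a surrounding Markdown code fence is a defensive assumption rather than documented model behavior.

```python
import json


def parse_extraction(response, requested_fields):
    """Parse the model's reply and compare its keys to the requested fields.

    Returns (data, missing, extra); data is None when the reply is not a JSON object.
    """
    text = response.strip()
    # Assumption: tolerate a Markdown code fence around the JSON, just in case.
    if text.startswith("```"):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None, list(requested_fields), []
    if not isinstance(data, dict):
        return None, list(requested_fields), []
    missing = [name for name in requested_fields if name not in data]
    extra = [key for key in data if key not in requested_fields]
    return data, missing, extra


# The sample reply from the example above, written out as a string.
sample_response = """{
  "wood_color": "light to medium brown",
  "wood_texture": "smooth with visible grain",
  "wood_pattern": "parallel, irregular, wavy"
}"""
requested = ["wood_color", "wood_texture", "wood_pattern"]

data, missing, extra = parse_extraction(sample_response, requested)
print(data["wood_color"])  # light to medium brown
print(missing, extra)      # [] []
```

Returning the missing and extra field names separately lets the caller decide whether to retry, fill defaults, or reject the sample.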
> [!WARNING]
> The model is intended for single-turn conversations. We recommend greedy decoding (`do_sample=False` in `transformers`, the equivalent of `temperature=0` in other runtimes).
## 📬 Contact
- Got questions or want to connect? [Join our Discord community](https://discord.com/invite/liquid-ai)
- If you are interested in custom solutions with edge deployment, please contact [our sales team](https://www.liquid.ai/contact).
## Citation
```bibtex
@article{liquidai2025lfm2,
title={LFM2 Technical Report},
author={Liquid AI},
journal={arXiv preprint arXiv:2511.23404},
year={2025}
}
```