---
library_name: transformers
license: other
license_name: lfm1.0
license_link: LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- liquid
- lfm2.5
- lfm2
- edge
- vision
base_model: LiquidAI/LFM2.5-VL-450M
---


# LFM2.5-VL-450M-Extract

**LFM2.5-VL-450M-Extract** extracts user-defined fields from images and returns them as **JSON**. It is Liquid AI's first vision model in the [Liquid Nanos](https://huggingface.co/collections/LiquidAI/liquid-nanos) collection of compact, task-specific models built for production workflows, and it extends the Extract family alongside [LFM2-350M-Extract](https://huggingface.co/LiquidAI/LFM2-350M-Extract) for text documents.

## ⚙️ How it works

You specify what to extract as a YAML field list in the system prompt, and the model returns a JSON object with those fields. Structured outputs integrate cleanly with rule-based systems and downstream pipelines. Use it out of the box or fine-tune it for domain-specific extraction.

- **System prompt**:

  ```yaml
  wood_color: The overall coloration of the wood surface
  wood_texture: The tactile quality of the wood surface
  wood_pattern: The pattern types visible on the wood surface
  ```

- **User prompt**:

- **Output**:

  ```json
  {
    "wood_color": "light to medium brown",
    "wood_texture": "smooth with visible grain",
    "wood_pattern": "parallel, irregular, wavy"
  }
  ```

The model also supports enum-style fields: list the allowed choices alongside the field description, as shown here, and the model returns one of the listed values.

- **System prompt**:

  ```yaml
  wood_color: The overall coloration of the wood surface, such as blue, red, or light tan
  wood_texture: The tactile quality of the wood surface, select from smooth, rough, or grainy
  wood_pattern: The pattern types visible on the wood surface, e.g., straight, wavy, or curly
  ```

## 🌟 Use cases

- Detecting safety-critical events in images (e.g. fallen person, fire, leakage) to trigger automated safety systems.
- Collecting statistical information about objects across video frames for analytics pipelines.
- Auto-tagging product images with structured attributes for retail and e-commerce.

## 📄 Model details

| Property | Detail |
|---|---:|
| **Parameters (LM only)** | 350M |
| **Vision encoder** | SigLIP2 (~100M, [SigLIP-2 paper](https://arxiv.org/abs/2502.14786)) |
| **Backbone layers** | hybrid conv+attention |
| **Image input** | Single image, dynamic resolution |
| **Context** | 128,000 tokens |
| **Vocab size** | 65,536 (text) |
| **Precision** | bfloat16 |
| **License** | LFM Open License v1.0 |

## 📊 Performance

We evaluated LFM2.5-VL-450M-Extract on a 2,000-sample benchmark of `(image, schema, JSON)` triples, with reference labels generated by an ensemble of frontier multimodal models. Predictions are scored on three dimensions:

- **JSON Validity**: share of samples whose output parses as strict JSON
- **Schema Consistency F1 Score**: set-level F1 between predicted and requested field names, macro-averaged across samples
- **VLM Judge Score**: how well the predicted values match the image itself, as judged by a separate vision model ([Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B))
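As a rough illustration of the first two metrics, the sketch below computes JSON validity and a set-level F1 over field names for a few made-up responses. It is not the scoring code from `model_eval/`: the function names (`parse_strict`, `schema_f1`, `score`) and the sample data are hypothetical, and the choice to give unparseable responses an F1 of 0 is an assumption.

```python
import json


def parse_strict(text):
    """Return the parsed object if `text` is exactly one JSON object, else None."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    return obj if isinstance(obj, dict) else None


def schema_f1(predicted_fields, requested_fields):
    """Set-level F1 between predicted and requested field names."""
    predicted, requested = set(predicted_fields), set(requested_fields)
    overlap = len(predicted & requested)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(requested)
    return 2 * precision * recall / (precision + recall)


def score(samples):
    """Macro-average both metrics over (response_text, requested_fields) pairs.

    Assumption: a response that fails to parse contributes 0 to the F1 average.
    """
    valid, f1_sum = 0, 0.0
    for text, requested in samples:
        obj = parse_strict(text)
        if obj is not None:
            valid += 1
            f1_sum += schema_f1(obj.keys(), requested)
    n = len(samples)
    return {"json_validity": valid / n, "schema_f1": f1_sum / n}


# Hypothetical responses: complete, missing one field, and wrapped in extra text.
samples = [
    ('{"wood_color": "brown", "wood_texture": "smooth"}', ["wood_color", "wood_texture"]),
    ('{"wood_color": "brown"}', ["wood_color", "wood_texture"]),
    ('Sure! {"wood_color": "brown"}', ["wood_color"]),
]
print(score(samples))
```

In this toy data the third response is rejected because of the text before the JSON, so validity is 2/3, and the per-sample F1 values of 1, 2/3, and 0 average to 5/9.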

| Model | Params | JSON Validity | Schema Consistency F1 | VLM Judge Score |
|---|---:|---:|---:|---:|
| **LFM2.5-VL-450M-Extract** | **0.45B** | **98.9** | **98.8** | **84.5** |
| LFM2.5-VL-450M | 0.45B | 97.7 | 93.5 | 73.4 |
| SmolVLM-500M-Instruct | 0.51B | 33.0 | 26.6 | 12.2 |
| FastVLM-0.5B | 0.76B | 22.5 | 19.3 | 16.3 |
| Qwen3.5-0.8B | 0.87B | 96.4 | 96.3 | 82.3 |
| InternVL3_5-1B | 1.06B | 98.0 | 96.5 | 80.7 |
| MiniCPM-V-4.6 | 1.30B | 61.8 | 60.4 | 57.5 |
| *(ref) InternVL3_5-2B* | 2.35B | 99.6 | 99.2 | 87.7 |
| *(ref) Qwen3.5-2B* | 2.27B | 97.9 | 97.7 | 89.7 |
| *(ref) gemma-4-E2B-it* | 2.3B | 97.4 | 97.1 | 84.4 |

LFM2.5-VL-450M-Extract outperforms similarly sized (sub-1B) open-source VLMs on this benchmark and is competitive with models 4× its size.

**Reproducing these numbers**: The full evaluation pipeline, which includes extraction, VLM judging, and metric aggregation, is bundled in this repository under `model_eval/`. Setup, configuration, and run instructions are in the folder's [`README`](./model_eval/README.md).

**Scope**: These numbers characterize the model on the input/output form it is designed for: a single input image, a YAML field list as the schema, and a flat JSON object as the output. Performance is not expected to transfer to substantially different tasks, e.g. multi-image reasoning or free-form VQA.
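The input/output form described in the scope note (a YAML field list in, a flat JSON object out) can be wrapped in two small helpers. This is a minimal sketch, not part of the model's tooling: `build_system_prompt` and `parse_flat_json` are hypothetical names, the prompt wording is copied from this card's usage example with assumed line breaks, and treating "flat" as "no nested objects" is an assumption.

```python
import json

# Prompt wording taken from this card's usage example; the blank lines
# around the field list are an assumption.
PROMPT_TEMPLATE = (
    "Extract the following from the image:\n\n"
    "{fields_yaml}\n\n"
    "Respond with only a JSON object. Do not include any text outside the JSON."
)


def build_system_prompt(fields):
    """Render {field_name: description} as a YAML field list inside the prompt."""
    fields_yaml = "\n".join(f"{name}: {desc}" for name, desc in fields.items())
    return PROMPT_TEMPLATE.format(fields_yaml=fields_yaml)


def parse_flat_json(response, fields):
    """Parse a response and check it is a flat object with exactly the requested keys."""
    obj = json.loads(response)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in fields if name not in obj]
    extra = [name for name in obj if name not in fields]
    # Assumption: "flat" means no value is itself a JSON object.
    nested = [name for name, value in obj.items() if isinstance(value, dict)]
    if missing or extra or nested:
        raise ValueError(f"missing={missing} extra={extra} nested={nested}")
    return obj


fields = {
    "wood_color": "The overall coloration of the wood surface",
    "wood_texture": "The tactile quality of the wood surface",
}
print(build_system_prompt(fields))
print(parse_flat_json('{"wood_color": "light brown", "wood_texture": "smooth"}', fields))
```

The field list is built by plain string joining, so a description containing `: ` or `#` would need quoting to stay valid YAML. Rejecting missing or unexpected keys up front keeps schema drift from reaching downstream rule-based systems.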
## 🏃 How to run

You can run LFM2.5-VL-450M-Extract with Hugging Face [`transformers`](https://github.com/huggingface/transformers) v5.1 or newer:

```bash
pip install transformers pillow
```

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load the model and processor
model_id = "LiquidAI/LFM2.5-VL-450M-Extract"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load the input image
image = load_image("https://huggingface.co/LiquidAI/LFM2.5-VL-450M-Extract/resolve/main/sample_image.png")

# Describe the fields to extract as a YAML field list
fields_yaml = """wood_color: The overall coloration of the wood surface
wood_texture: The tactile quality of the wood surface
wood_pattern: The pattern types visible on the wood surface"""

system_prompt = f"""Extract the following from the image:

{fields_yaml}

Respond with only a JSON object. Do not include any text outside the JSON."""

# The schema goes in the system turn; the user turn carries only the image
conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": [{"type": "image", "image": image}]},
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

# Greedy decoding; strip the prompt tokens before decoding the answer
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(response)
# {
#   "wood_color": "light to medium brown",
#   "wood_texture": "smooth with visible grain",
#   "wood_pattern": "parallel, irregular, wavy"
# }
```

> [!WARNING]
> The model is intended for single-turn conversations. We recommend using greedy decoding (`do_sample=False` in `transformers`, or `temperature=0` in runtimes that expose a temperature setting).

## 📬 Contact

- Got questions or want to connect? [Join our Discord community](https://discord.com/invite/liquid-ai)
- If you are interested in custom solutions with edge deployment, please contact [our sales team](https://www.liquid.ai/contact).

## Citation

```bibtex
@article{liquidai2025lfm2,
  title={LFM2 Technical Report},
  author={Liquid AI},
  journal={arXiv preprint arXiv:2511.23404},
  year={2025}
}
```