| library_name: transformers | |
| license: other | |
| license_name: lfm1.0 | |
| license_link: LICENSE | |
| language: | |
| - en | |
| pipeline_tag: image-text-to-text | |
| tags: | |
| - liquid | |
| - lfm2.5 | |
| - lfm2 | |
| - edge | |
| - vision | |
| base_model: LiquidAI/LFM2.5-VL-450M | |
| <center> | |
| <div style="text-align: center;"> | |
| <img | |
| src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/2b08LKpev0DNEk6DlnWkY.png" | |
| alt="Liquid AI" | |
| style="width: 100%; max-width: 100%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;" | |
| /> | |
| </div> | |
| <div style="display: flex; justify-content: center; gap: 0.5em;"> | |
| <a href="https://playground.liquid.ai/chat?model=lfm2.5-vl-450m"><strong>Try LFM</strong></a> • <a href="https://docs.liquid.ai/lfm/getting-started/welcome"><strong>Docs</strong></a> • <a href="https://leap.liquid.ai/"><strong>LEAP</strong></a> • <a href="https://discord.com/invite/liquid-ai"><strong>Discord</strong></a> | |
| </div> | |
| </center> | |
| <br> | |
| # LFM2.5-VL-450M-Extract | |
| **LFM2.5-VL-450M-Extract** extracts user-defined fields from images and returns them as **JSON**. It is Liquid AI's first vision model in the [Liquid Nanos](https://huggingface.co/collections/LiquidAI/liquid-nanos) collection of compact, task-specific models built for production workflows, and it extends the Extract family alongside [LFM2-350M-Extract](https://huggingface.co/LiquidAI/LFM2-350M-Extract), which handles text documents. | |
| ## How it works | |
| You specify what to extract as a YAML field list in the system prompt, and the model returns a JSON object with those fields. Structured outputs integrate cleanly with rule-based systems and downstream pipelines. Use it out of the box or fine-tune for domain-specific extraction. | |
| - **System prompt**: | |
| ```yaml | |
| wood_color: The overall coloration of the wood surface | |
| wood_texture: The tactile quality of the wood surface | |
| wood_pattern: The pattern types visible on the wood surface | |
| ``` | |
| - **User prompt**: | |
| <img src="https://huggingface.co/LiquidAI/LFM2.5-VL-450M-Extract/resolve/main/sample_image.png" width="300"> | |
| - **Output**: | |
| ```json | |
| { | |
| "wood_color": "light to medium brown", | |
| "wood_texture": "smooth with visible grain", | |
| "wood_pattern": "parallel, irregular, wavy" | |
| } | |
| ``` | |
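The field list is plain `name: description` text, so it can be generated from a dictionary. The sketch below is one way to do that, assuming single-line descriptions; `build_system_prompt` and `PROMPT_TEMPLATE` are illustrative names, not part of any library, and the instruction wording is copied from the "How to run" example in this card.

```python
from typing import Mapping

# Instruction wrapper copied from the "How to run" example in this card.
PROMPT_TEMPLATE = (
    "Extract the following from the image:\n"
    "{fields_yaml}\n"
    "Respond with only a JSON object. Do not include any text outside the JSON."
)


def build_system_prompt(fields: Mapping[str, str]) -> str:
    """Render {field_name: description} as a YAML-style field list inside the prompt."""
    lines = []
    for name, description in fields.items():
        # Plain string formatting, not a YAML serializer: multi-line values are rejected.
        if "\n" in name or "\n" in description:
            raise ValueError(f"field {name!r} must fit on a single line")
        lines.append(f"{name}: {description.strip()}")
    return PROMPT_TEMPLATE.format(fields_yaml="\n".join(lines))


fields = {
    "wood_color": "The overall coloration of the wood surface",
    "wood_texture": "The tactile quality of the wood surface",
    "wood_pattern": "The pattern types visible on the wood surface",
}
system_prompt = build_system_prompt(fields)
print(system_prompt)
```

Keeping the field definitions in a dictionary also gives downstream code the list of expected keys to check the JSON output against.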
| The model also supports enum-style fields: list the allowed values in a field's description, as in the example below, and the model returns one of them. | |
| - **System prompt**: | |
| ```yaml | |
| wood_color: The overall coloration of the wood surface, such as blue, red, or light tan | |
| wood_texture: The tactile quality of the wood surface, select from smooth, rough, or grainy | |
| wood_pattern: The pattern types visible on the wood surface, e.g., straight, wavy, or curly | |
| ``` | |
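Downstream rule-based systems may still want to confirm that an enum field came back with one of the listed values. The following is a minimal sketch of such a check; `check_choices` is a hypothetical helper, and `example_response` is a hand-written stand-in, not actual model output.

```python
import json


def check_choices(response_text: str, choices: dict) -> dict:
    """Return {field: value} for every field whose value is not among its allowed choices."""
    data = json.loads(response_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    violations = {}
    for field, allowed in choices.items():
        value = data.get(field)  # a missing field is reported with value None
        if value not in allowed:
            violations[field] = value
    return violations


# Allowed values mirror the wood_texture description in the system prompt above.
choices = {"wood_texture": ["smooth", "rough", "grainy"]}

# Hand-written placeholder response, used only to exercise the check.
example_response = '{"wood_color": "light tan", "wood_texture": "smooth", "wood_pattern": "wavy"}'
print(check_choices(example_response, choices))
```

An empty result means every checked field holds an allowed value; anything else can be routed to a fallback or a retry.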
| ## Use cases | |
| - Detecting safety-critical events in images (e.g. fallen person, fire, leakage) to trigger automated safety systems. | |
| - Collecting statistical information about objects across video frames for analytics pipelines. | |
| - Auto-tagging product images with structured attributes for retail and e-commerce. | |
| ## Model details | |
| | Property | Detail | | |
| |---|---:| | |
| | **Parameters (LM only)** | 350M | | |
| | **Vision encoder** | SigLIP2 (~100M, [SigLIP-2 paper](https://arxiv.org/abs/2502.14786)) | | |
| | **Backbone layers** | hybrid conv+attention | | |
| | **Image input** | Single image, dynamic resolution | | |
| | **Context** | 128,000 tokens | | |
| | **Vocab size** | 65,536 (text) | | |
| | **Precision** | bfloat16 | | |
| | **License** | LFM Open License v1.0 | | |
| ## Performance | |
| We evaluated LFM2.5-VL-450M-Extract on a 2,000-sample benchmark of | |
| `(image, schema, JSON)` triples, with reference labels generated by an | |
| ensemble of frontier multimodal models. Predictions are scored on the | |
| following three dimensions: | |
| - **JSON Validity**: share of samples that produce strictly parseable JSON | |
| - **Schema Consistency F1 Score**: set-level F1 between predicted and requested field names, macro-averaged across samples | |
| - **VLM Judge Score**: agreement of the extracted values with the image itself, as judged by a separate vision model ([Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)) | |
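The first two metrics can be approximated in a few lines. This is a rough sketch, not the code in `model_eval/`: it assumes "valid" means the response parses with Python's `json` module into a top-level object, and that an unparseable response scores an F1 of 0. All function names are illustrative.

```python
import json


def is_valid_json_object(text: str) -> bool:
    """True when the text parses as JSON and the top-level value is an object."""
    try:
        return isinstance(json.loads(text), dict)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def field_f1(predicted, requested) -> float:
    """Set-level F1 between predicted and requested field names."""
    predicted, requested = set(predicted), set(requested)
    true_positives = len(predicted & requested)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(requested)
    return 2 * precision * recall / (precision + recall)


def score(samples) -> dict:
    """samples: list of (response_text, requested_field_names) pairs."""
    if not samples:
        raise ValueError("need at least one sample")
    valid_count = 0
    f1_total = 0.0
    for response_text, requested_fields in samples:
        if is_valid_json_object(response_text):
            valid_count += 1
            predicted_fields = json.loads(response_text).keys()
        else:
            predicted_fields = []  # assumption: invalid JSON contributes an F1 of 0
        f1_total += field_f1(predicted_fields, requested_fields)
    n = len(samples)
    # Macro average: every sample carries equal weight, reported as a percentage.
    return {"json_validity": 100.0 * valid_count / n, "schema_f1": 100.0 * f1_total / n}


# Toy inputs: one exact match, one response with a wrong field name, one non-JSON reply.
toy_samples = [
    ('{"a": 1, "b": 2}', ["a", "b"]),
    ('{"a": 1, "c": 2}', ["a", "b"]),
    ("not json", ["a", "b"]),
]
print(score(toy_samples))
```

The VLM Judge Score needs a second vision model and is not sketched here.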
| <img src="https://huggingface.co/LiquidAI/LFM2.5-VL-450M-Extract/resolve/main/lfm2_vl_450m_metrics.png" width="800"> | |
| | Model | Params | JSON Validity | F1 Score | VLM Judge Score | | |
| |---|---:|---:|---:|---:| | |
| | **LFM2.5-VL-450M-Extract** | **0.45B** | **98.9** | **98.8** | **84.5** | | |
| | LFM2.5-VL-450M | 0.45B | 97.7 | 93.5 | 73.4 | | |
| | SmolVLM-500M-Instruct | 0.51B | 33.0 | 26.6 | 12.2 | | |
| | FastVLM-0.5B | 0.76B | 22.5 | 19.3 | 16.3 | | |
| | Qwen3.5-0.8B | 0.87B | 96.4 | 96.3 | 82.3 | | |
| | InternVL3_5-1B | 1.06B | 98.0 | 96.5 | 80.7 | | |
| | MiniCPM-V-4.6 | 1.30B | 61.8 | 60.4 | 57.5 | | |
| | *(ref) InternVL3_5-2B* | 2.35B | 99.6 | 99.2 | 87.7 | | |
| | *(ref) Qwen3.5-2B* | 2.27B | 97.9 | 97.7 | 89.7 | | |
| | *(ref) gemma-4-E2B-it* | 2.3B | 97.4 | 97.1 | 84.4 | | |
| LFM2.5-VL-450M-Extract outperforms similarly sized (sub-1B) open-source VLMs on this benchmark and is competitive with models 4× its size. | |
| **Reproducing these numbers**: The full evaluation pipeline, which includes extraction, VLM judging, and metric aggregation, is bundled in this repository under `model_eval/`. Setup, configuration, and run instructions are in the folder's [`README`](./model_eval/README.md). | |
| **Scope**: These numbers characterize the model on the input/output form it is designed for: a single input image, a YAML field list as the schema, and a flat JSON object as the output. Performance is not expected to transfer to largely different tasks, e.g. multi-image reasoning or free-form VQA. | |
| <!-- > Generic instruction-tuned VLMs (SmolVLM, moondream) cannot perform | |
| > schema-based extraction zero-shot regardless of prompt strategy. | |
| > Under the most permissive prompt setups, they either: | |
| > - produce free-form captions ignoring the JSON instruction, or | |
| > - produce valid-shaped JSON but echo the schema descriptions or | |
| > few-shot example values as field values (zero faithfulness to | |
| > the image). | |
| > | |
| > LFM2-VL-Extract's task-specific training is what enables strict-JSON | |
| > output with faithful, image-grounded values in a single zero-shot | |
| > call, with no few-shot examples, no grammar constraints, and no inference | |
| > wrappers. --> | |
| ## How to run | |
| You can run LFM2.5-VL-450M-Extract with Hugging Face [`transformers`](https://github.com/huggingface/transformers) v5.1 or newer: | |
| ```bash | |
| pip install transformers pillow | |
| ``` | |
| ```python | |
| from transformers import AutoProcessor, AutoModelForImageTextToText | |
| from transformers.image_utils import load_image | |
| model_id = "LiquidAI/LFM2.5-VL-450M-Extract" | |
| model = AutoModelForImageTextToText.from_pretrained( | |
| model_id, | |
| device_map="auto", | |
| dtype="bfloat16", | |
| trust_remote_code=True, | |
| ) | |
| processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True) | |
| image = load_image("https://huggingface.co/LiquidAI/LFM2.5-VL-450M-Extract/resolve/main/sample_image.png") | |
| fields_yaml = """wood_color: The overall coloration of the wood surface | |
| wood_texture: The tactile quality of the wood surface | |
| wood_pattern: The pattern types visible on the wood surface""" | |
| system_prompt = f"""Extract the following from the image: | |
| {fields_yaml} | |
| Respond with only a JSON object. Do not include any text outside the JSON.""" | |
| conversation = [ | |
| {"role": "system", "content": system_prompt}, | |
| {"role": "user", "content": [{"type": "image", "image": image}]}, | |
| ] | |
| inputs = processor.apply_chat_template( | |
| conversation, | |
| add_generation_prompt=True, | |
| return_tensors="pt", | |
| return_dict=True, | |
| tokenize=True, | |
| ).to(model.device) | |
| outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False) | |
| response = processor.batch_decode( | |
| outputs[:, inputs["input_ids"].shape[1]:], | |
| skip_special_tokens=True, | |
| )[0] | |
| print(response) | |
| # { | |
| # "wood_color": "light to medium brown", | |
| # "wood_texture": "smooth with visible grain", | |
| # "wood_pattern": "parallel, irregular, wavy" | |
| # } | |
| ``` | |
| > [!WARNING] | |
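| > The model is intended for single-turn conversations. We recommend greedy decoding: `do_sample=False` in `transformers`, as in the example above, or `temperature=0` on servers that expose a temperature setting. | |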
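If the model is served behind an OpenAI-compatible chat completions endpoint, the same recommendations translate into a single-turn request with `temperature` set to 0. The sketch below only builds the request body. It assumes the server accepts the standard `image_url` content part with a base64 data URL; whether a particular server supports this checkpoint is not confirmed here. `build_request` is a hypothetical helper and the image bytes are a placeholder.

```python
import base64
import json

MODEL_ID = "LiquidAI/LFM2.5-VL-450M-Extract"


def build_request(system_prompt: str, image_bytes: bytes,
                  mime_type: str = "image/png", max_tokens: int = 512) -> dict:
    """Build a single-turn, greedy-decoding chat completions request body."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {
        "model": MODEL_ID,
        "temperature": 0,  # greedy decoding, as recommended above
        "max_tokens": max_tokens,
        "messages": [
            # Single turn: one system prompt with the field list, one user message with one image.
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
            ]},
        ],
    }


# Placeholder bytes stand in for a real image file read with open(path, "rb").read().
payload = build_request(
    "Extract the following from the image:\n"
    "wood_color: The overall coloration of the wood surface\n"
    "Respond with only a JSON object. Do not include any text outside the JSON.",
    b"placeholder image bytes",
)
print(json.dumps(payload, indent=2)[:200])
```

The resulting dictionary can be sent as the JSON body of a POST to the server's `/v1/chat/completions` route.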
| ## Contact | |
| - Got questions or want to connect? [Join our Discord community](https://discord.com/invite/liquid-ai) | |
| - If you are interested in custom solutions with edge deployment, please contact [our sales team](https://www.liquid.ai/contact). | |
| ## Citation | |
| ```bibtex | |
| @article{liquidai2025lfm2, | |
| title={LFM2 Technical Report}, | |
| author={Liquid AI}, | |
| journal={arXiv preprint arXiv:2511.23404}, | |
| year={2025} | |
| } | |
| ``` |