ykhrustalev committed
Commit 286966c · verified · 1 Parent(s): 4393ab8

Update README.md

Files changed (1):
  1. README.md +166 -160
README.md CHANGED
@@ -1,217 +1,223 @@
  ---
  license: other
  license_name: lfm1.0
  license_link: LICENSE
  language:
  - en
  - ja
  - ko
- - fr
  - es
- - de
- - it
- - pt
- - ar
- - zh
  pipeline_tag: text-generation
  tags:
  - liquid
- - edge
  - lfm2.5
- - onnx
- - onnxruntime
- - webgpu
- base_model:
- - LiquidAI/LFM2.5-1.2B-Instruct
  ---

  <div align="center">
- <img
- src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/2b08LKpev0DNEk6DlnWkY.png"
- alt="Liquid AI"
  style="width: 100%; max-width: 100%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;"
  />
  <div style="display: flex; justify-content: center; gap: 0.5em; margin-bottom: 1em;">
- <a href="https://playground.liquid.ai/"><strong>Try LFM</strong></a> •
- <a href="https://docs.liquid.ai/lfm"><strong>Documentation</strong></a> •
  <a href="https://leap.liquid.ai/"><strong>LEAP</strong></a>
  </div>
  </div>

- # LFM2.5-1.2B-Instruct-ONNX

- ONNX export of [LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) for cross-platform inference.

- LFM2.5 is a hybrid architecture combining multiplicative gates and short convolutions, optimized for edge deployment with fast inference on CPU, GPU, and NPU hardware.

- ## Recommended Variants

- | Precision | Size | Platform | Use Case |
- |-----------|------|----------|----------|
- | Q4 | ~1.2GB | WebGPU, Server | Recommended for most uses |
- | FP16 | ~2.4GB | WebGPU, Server | Higher quality |
- | Q8 | ~1.7GB | Server only | Balance of quality and size |

- - **WebGPU**: Use Q4 or FP16 (Q8 not supported)
- - **Server**: All variants supported

- ## Model Files

  ```
- onnx/
- ├── model.onnx              # FP32 model graph
- ├── model.onnx_data*        # FP32 weights
- ├── model_fp16.onnx         # FP16 model graph
- ├── model_fp16.onnx_data*   # FP16 weights
- ├── model_q4.onnx           # Q4 model graph (recommended)
- ├── model_q4.onnx_data      # Q4 weights
- ├── model_q8.onnx           # Q8 model graph
- └── model_q8.onnx_data      # Q8 weights
-
- * Large models (>2GB) split weights across multiple files:
-   model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
-   All data files must be in the same directory as the .onnx file.
  ```

- ## Python

- ### Installation

- ```bash
- pip install onnxruntime transformers numpy huggingface_hub
- # or with GPU support:
- pip install onnxruntime-gpu transformers numpy huggingface_hub
  ```

- ### Inference

  ```python
- import numpy as np
- import onnxruntime as ort
- from huggingface_hub import hf_hub_download
- from transformers import AutoTokenizer
-
- # Download model (Q4 recommended)
- model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
- model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")
-
- # Download all data files (handles multiple splits for large models)
- from huggingface_hub import list_repo_files
- for f in list_repo_files(model_id):
-     if f.startswith("onnx/model_q4.onnx_data"):
-         hf_hub_download(model_id, f)
-
- # Load model and tokenizer
- session = ort.InferenceSession(model_path)
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
-
- # Prepare chat input
- messages = [{"role": "user", "content": "What is the capital of France?"}]
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- input_ids = np.array([tokenizer.encode(prompt, add_special_tokens=False)], dtype=np.int64)
-
- # Initialize KV cache
- ONNX_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
- cache = {}
- for inp in session.get_inputs():
-     if inp.name in {"input_ids", "attention_mask", "position_ids"}:
-         continue
-     shape = [d if isinstance(d, int) else 1 for d in inp.shape]
-     for i, d in enumerate(inp.shape):
-         if isinstance(d, str) and "sequence" in d.lower():
-             shape[i] = 0
-     cache[inp.name] = np.zeros(shape, dtype=ONNX_DTYPE.get(inp.type, np.float32))
-
- # Check if model uses position_ids
- input_names = {inp.name for inp in session.get_inputs()}
- use_position_ids = "position_ids" in input_names
-
- # Generate tokens
- seq_len = input_ids.shape[1]
- generated_tokens = []
-
- for step in range(100):  # max tokens
-     if step == 0:
-         ids = input_ids
-         pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
-     else:
-         ids = np.array([[generated_tokens[-1]]], dtype=np.int64)
-         pos = np.array([[seq_len + len(generated_tokens) - 1]], dtype=np.int64)
-
-     attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
-     feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
-     if use_position_ids:
-         feed["position_ids"] = pos
-
-     outputs = session.run(None, feed)
-     next_token = int(np.argmax(outputs[0][0, -1]))
-     generated_tokens.append(next_token)
-
-     # Update cache
-     for i, out in enumerate(session.get_outputs()[1:], 1):
-         name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
-         if name in cache:
-             cache[name] = outputs[i]
-
-     if next_token == tokenizer.eos_token_id:
-         break
-
- print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
  ```

- ## WebGPU (Browser)

- ### Installation

- ```bash
- npm install @huggingface/transformers
- ```

- ### Enable WebGPU

- WebGPU is required for browser inference. To enable:

- 1. **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable, and restart
- 2. **Verify**: Check `chrome://gpu` for "WebGPU" status
- 3. **Test**: Run `navigator.gpu.requestAdapter()` in DevTools console

- ### Inference

- ```javascript
- import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";
-
- const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";
-
- // Load model and tokenizer
- const tokenizer = await AutoTokenizer.from_pretrained(modelId);
- const model = await AutoModelForCausalLM.from_pretrained(modelId, {
-   device: "webgpu",
-   dtype: "q4", // or "fp16"
- });
-
- // Prepare input
- const messages = [{ role: "user", content: "What is the capital of France?" }];
- const input = tokenizer.apply_chat_template(messages, {
-   add_generation_prompt: true,
-   return_dict: true,
- });
-
- // Generate with streaming
- const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
- const output = await model.generate({
-   ...input,
-   max_new_tokens: 256,
-   do_sample: false,
-   streamer,
- });
-
- console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
- ```

- ### WebGPU Notes

- - Supported: Q4, FP16 (Q8 not supported on WebGPU)

- ## License

- This model is released under the [LFM 1.0 License](LICENSE).

  ---
+ library_name: transformers
  license: other
  license_name: lfm1.0
  license_link: LICENSE
  language:
  - en
+ - ar
+ - zh
+ - fr
+ - de
  - ja
  - ko
  - es
  pipeline_tag: text-generation
  tags:
  - liquid
  - lfm2.5
+ - edge
+ base_model: LiquidAI/LFM2.5-1.2B-Base
  ---

  <div align="center">
+ <img
+ src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/2b08LKpev0DNEk6DlnWkY.png"
+ alt="Liquid AI"
  style="width: 100%; max-width: 100%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;"
  />
  <div style="display: flex; justify-content: center; gap: 0.5em; margin-bottom: 1em;">
+ <a href="https://playground.liquid.ai/"><strong>Try LFM</strong></a> •
+ <a href="https://docs.liquid.ai/lfm"><strong>Documentation</strong></a> •
  <a href="https://leap.liquid.ai/"><strong>LEAP</strong></a>
  </div>
  </div>

+ # LFM2.5-1.2B-Instruct
+
+ LFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.
+
+ - **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket.
+ - **Fast edge inference**: 239 tok/s decode on AMD CPU and 82 tok/s on mobile NPU. Runs in under 1GB of memory, with day-one support for llama.cpp, MLX, and vLLM.
+ - **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/dxnYF2fuLpulismtFSGFi.png)

+ Find more information about LFM2.5 in our [blog post](https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai).

+ ## 🗒️ Model Details

+ | Model | Parameters | Description |
+ |-------|------------|-------------|
+ | [LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base) | 1.2B | Pre-trained base model for fine-tuning |
+ | [**LFM2.5-1.2B-Instruct**](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) | 1.2B | General-purpose instruction-tuned model |
+ | [LFM2.5-1.2B-JP](https://huggingface.co/LiquidAI/LFM2.5-1.2B-JP) | 1.2B | Japanese-optimized chat model |
+ | [LFM2.5-VL-1.6B](https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B) | 1.6B | Vision-language model with fast inference |
+ | [LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) | 1.5B | Audio-language model for speech and text I/O |

+ LFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features:

+ - **Number of parameters**: 1.17B
+ - **Number of layers**: 16 (10 double-gated LIV convolution blocks + 6 GQA blocks)
+ - **Training budget**: 28T tokens
+ - **Context length**: 32,768 tokens
+ - **Vocabulary size**: 65,536
+ - **Languages**: English, Arabic, Chinese, French, German, Japanese, Korean, Spanish
+ - **Generation parameters**:
+   - `temperature: 0.1`
+   - `top_k: 50`
+   - `top_p: 0.1`
+   - `repetition_penalty: 1.05`

+ The model is available in the following formats:
+
+ | Model | Description |
+ |-------|-------------|
+ | [**LFM2.5-1.2B-Instruct**](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) | Original model checkpoint in native format. Best for fine-tuning or inference with Transformers and vLLM. |
+ | [LFM2.5-1.2B-Instruct-GGUF](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF) | Quantized format for llama.cpp and compatible tools. Optimized for CPU inference and local deployment with reduced memory usage. |
+ | [LFM2.5-1.2B-Instruct-ONNX](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-ONNX) | ONNX Runtime format for cross-platform deployment. Enables hardware-accelerated inference across diverse environments (cloud, edge, mobile). |
+ | [LFM2.5-1.2B-Instruct-MLX](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-MLX-8bit) | MLX format for Apple Silicon. Optimized for fast inference on Mac devices using the MLX framework. |
+
+ We recommend using it for agentic tasks, data extraction, and RAG. It is not recommended for knowledge-intensive tasks or programming.
+
+ ### Chat Template
+
+ LFM2.5 uses a ChatML-like format. See the [Chat Template documentation](https://docs.liquid.ai/lfm/key-concepts/chat-template) for details. Example:

  ```
+ <|startoftext|><|im_start|>system
+ You are a helpful assistant trained by Liquid AI.<|im_end|>
+ <|im_start|>user
+ What is C. elegans?<|im_end|>
+ <|im_start|>assistant
  ```

+ You can use [`tokenizer.apply_chat_template()`](https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate) to format your messages automatically.
+
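+ A minimal sketch with the Transformers tokenizer, rendering the prompt above without tokenizing:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")
+
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant trained by Liquid AI."},
+     {"role": "user", "content": "What is C. elegans?"},
+ ]
+
+ # add_generation_prompt=True appends the trailing `<|im_start|>assistant` header
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ print(prompt)
+ ```
+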
+ ### Tool Use

+ LFM2.5 supports function calling as follows:

+ 1. **Function definition**: We recommend providing the list of tools as a JSON object in the system prompt. You can also use the [`tokenizer.apply_chat_template()`](https://huggingface.co/docs/transformers/en/chat_extras#passing-tools) function with tools, as shown in the sketch after the example below.
+ 2. **Function call**: By default, LFM2.5 writes Pythonic function calls (a Python list between the `<|tool_call_start|>` and `<|tool_call_end|>` special tokens) as the assistant answer. You can override this behavior by asking the model to output JSON function calls in the system prompt.
+ 3. **Function execution**: The function call is executed, and the result is returned in a "tool" role message.
+ 4. **Final answer**: LFM2.5 interprets the outcome of the function call to address the original user prompt in plain text.
+
+ See the [Tool Use documentation](https://docs.liquid.ai/lfm/key-concepts/tool-use) for the full guide. Example:
+
+ ```
+ <|startoftext|><|im_start|>system
+ List of tools: [{"name": "get_candidate_status", "description": "Retrieves the current status of a candidate in the recruitment process", "parameters": {"type": "object", "properties": {"candidate_id": {"type": "string", "description": "Unique identifier for the candidate"}}, "required": ["candidate_id"]}}]<|im_end|>
+ <|im_start|>user
+ What is the current status of candidate ID 12345?<|im_end|>
+ <|im_start|>assistant
+ <|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
+ <|im_start|>tool
+ [{"candidate_id": "12345", "status": "Interview Scheduled", "position": "Clinical Research Associate", "date": "2023-11-20"}]<|im_end|>
+ <|im_start|>assistant
+ The candidate with ID 12345 is currently in the "Interview Scheduled" stage for the position of Clinical Research Associate, with an interview date set for 2023-11-20.<|im_end|>
  ```
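
+ A minimal sketch of step 1 with the Transformers API, using the hypothetical `get_candidate_status` tool from the example above (Transformers derives the JSON tool definition from the function's signature and docstring):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")
+
+ def get_candidate_status(candidate_id: str):
+     """Retrieves the current status of a candidate in the recruitment process.
+
+     Args:
+         candidate_id: Unique identifier for the candidate
+     """
+     ...
+
+ messages = [{"role": "user", "content": "What is the current status of candidate ID 12345?"}]
+
+ # The chat template injects the tool definition into the system prompt
+ prompt = tokenizer.apply_chat_template(
+     messages, tools=[get_candidate_status], tokenize=False, add_generation_prompt=True
+ )
+ print(prompt)
+ ```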

+ ## 🏃 Inference
+
+ LFM2.5 is supported by many inference frameworks. See the [Inference documentation](https://docs.liquid.ai/lfm/inference/transformers) for the full list.
+
+ | Name | Description | Docs | Notebook |
+ |------|-------------|------|:--------:|
+ | [Transformers](https://github.com/huggingface/transformers) | Simple inference with direct access to model internals. | <a href="https://docs.liquid.ai/lfm/inference/transformers">Link</a> | <a href="https://colab.research.google.com/drive/1_q3jQ6LtyiuPzFZv7Vw8xSfPU5FwkKZY?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
+ | [vLLM](https://github.com/vllm-project/vllm) | High-throughput production deployments with GPU. | <a href="https://docs.liquid.ai/lfm/inference/vllm">Link</a> | <a href="https://colab.research.google.com/drive/1VfyscuHP8A3we_YpnzuabYJzr5ju0Mit?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
+ | [llama.cpp](https://github.com/ggml-org/llama.cpp) | Cross-platform inference with CPU offloading. | <a href="https://docs.liquid.ai/lfm/inference/llama-cpp">Link</a> | <a href="https://colab.research.google.com/drive/1ohLl3w47OQZA4ELo46i5E4Z6oGWBAyo8?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
+ | [MLX](https://github.com/ml-explore/mlx) | Apple's machine learning framework optimized for Apple Silicon. | <a href="https://docs.liquid.ai/lfm/inference/mlx">Link</a> | — |
+ | [LM Studio](https://lmstudio.ai/) | Desktop application for running LLMs locally. | <a href="https://docs.liquid.ai/lfm/inference/lm-studio">Link</a> | — |
+
+ Here's a quick start example with Transformers:

  ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
+
+ model_id = "LiquidAI/LFM2.5-1.2B-Instruct"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     dtype="bfloat16",
+     # attn_implementation="flash_attention_2",  # <- uncomment on compatible GPU
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+
+ prompt = "What is C. elegans?"
+
+ input_ids = tokenizer.apply_chat_template(
+     [{"role": "user", "content": prompt}],
+     add_generation_prompt=True,
+     return_tensors="pt",
+     tokenize=True,
+ ).to(model.device)
+
+ output = model.generate(
+     input_ids,
+     do_sample=True,
+     temperature=0.1,
+     top_k=50,
+     top_p=0.1,
+     repetition_penalty=1.05,
+     max_new_tokens=512,
+     streamer=streamer,
+ )
  ```
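
+ And a minimal sketch with vLLM's offline API (assuming `vllm` is installed and a supported GPU is available; `llm.chat` applies the chat template automatically):
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")
+ params = SamplingParams(
+     temperature=0.1,
+     top_k=50,
+     top_p=0.1,
+     repetition_penalty=1.05,
+     max_tokens=512,
+ )
+
+ outputs = llm.chat([{"role": "user", "content": "What is C. elegans?"}], params)
+ print(outputs[0].outputs[0].text)
+ ```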

+ ## 🔧 Fine-Tuning
+
+ We recommend fine-tuning LFM2.5 for your specific use case to achieve the best results.
+
+ | Name | Description | Docs | Notebook |
+ |------|-------------|------|----------|
+ | SFT ([Unsloth](https://github.com/unslothai/unsloth)) | Supervised Fine-Tuning with LoRA using Unsloth. | <a href="https://docs.liquid.ai/lfm/fine-tuning/unsloth">Link</a> | <a href="https://colab.research.google.com/drive/1HROdGaPFt1tATniBcos11-doVaH7kOI3?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
+ | SFT ([TRL](https://github.com/huggingface/trl)) | Supervised Fine-Tuning with LoRA using TRL. | <a href="https://docs.liquid.ai/lfm/fine-tuning/trl">Link</a> | <a href="https://colab.research.google.com/drive/1j5Hk_SyBb2soUsuhU0eIEA9GwLNRnElF?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
+ | DPO ([TRL](https://github.com/huggingface/trl)) | Direct Preference Optimization with LoRA using TRL. | <a href="https://docs.liquid.ai/lfm/fine-tuning/trl">Link</a> | <a href="https://colab.research.google.com/drive/1MQdsPxFHeZweGsNx4RH7Ia8lG8PiGE1t?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
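
+ As a starting point, a minimal LoRA SFT sketch with TRL (the dataset is an illustrative placeholder; substitute your own chat-formatted data):
+
+ ```python
+ from datasets import load_dataset
+ from peft import LoraConfig
+ from trl import SFTConfig, SFTTrainer
+
+ trainer = SFTTrainer(
+     model="LiquidAI/LFM2.5-1.2B-Instruct",
+     train_dataset=load_dataset("trl-lib/Capybara", split="train"),  # example dataset
+     args=SFTConfig(output_dir="lfm2.5-1.2b-sft"),
+     peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
+ )
+ trainer.train()
+ ```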

+ ## 📊 Performance
+
+ ### Benchmarks
+
+ We compared LFM2.5-1.2B-Instruct with relevant sub-2B models on a diverse suite of benchmarks.
+
+ | Model | GPQA | MMLU-Pro | IFEval | IFBench | Multi-IF | AIME25 | BFCLv3 |
+ |-------|------|----------|--------|---------|----------|--------|--------|
+ | **LFM2.5-1.2B-Instruct** | 38.89 | 44.35 | 86.23 | 47.33 | 60.98 | 14.00 | 49.12 |
+ | Qwen3-1.7B (instruct) | 34.85 | 42.91 | 73.68 | 21.33 | 56.48 | 9.33 | 46.30 |
+ | Granite 4.0-1B | 24.24 | 33.53 | 79.61 | 21.00 | 43.65 | 3.33 | 52.43 |
+ | Llama 3.2 1B Instruct | 16.57 | 20.80 | 52.37 | 15.93 | 30.16 | 0.33 | 21.44 |
+ | Gemma 3 1B IT | 24.24 | 14.04 | 63.25 | 20.47 | 44.31 | 1.00 | 16.64 |
+
+ GPQA, MMLU-Pro, IFBench, and AIME25 follow [ArtificialAnalysis's methodology](https://artificialanalysis.ai/methodology/intelligence-benchmarking). For IFEval and Multi-IF, we report the average score across strict and loose prompt and instruction accuracies. For BFCLv3, we report the final weighted average score with a custom Liquid handler to support our tool use template.

+ ### Inference Speed

+ LFM2.5-1.2B-Instruct offers extremely fast inference on CPUs with a low memory profile compared to similar-sized models.

+ ![image](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/dbbI-15p9re2ROhAkqnZm.png)

+ In addition, we are partnering with AMD, Qualcomm, and Nexa AI to bring the LFM2.5 family to NPUs. These optimized models are available through our partners, enabling highly efficient on-device inference.

+ | Device | Inference | Framework | Model | Prefill (tok/s) | Decode (tok/s) | Memory |
+ |--------|-----------|-----------|-------|-----------------|----------------|--------|
+ | Qualcomm Snapdragon® X Elite | NPU | NexaML | LFM2.5-1.2B-Instruct | 2591 | 63 | 0.9GB |
+ | Qualcomm Snapdragon® Gen4 (ROG Phone 9 Pro) | NPU | NexaML | LFM2.5-1.2B-Instruct | 4391 | 82 | 0.9GB |
+ | Qualcomm Snapdragon® Gen4 (Samsung Galaxy S25 Ultra) | CPU | llama.cpp (Q4_0) | LFM2.5-1.2B-Instruct | 335 | 70 | 719MB |
+ | Qualcomm Snapdragon® Gen4 (Samsung Galaxy S25 Ultra) | CPU | llama.cpp (Q4_0) | Qwen3-1.7B | 181 | 40 | 1306MB |
+
+ These capabilities unlock new deployment scenarios across various devices, including vehicles, mobile devices, laptops, IoT devices, and embedded systems.

+ ## Contact

+ For enterprise solutions and edge deployment, contact [sales@liquid.ai](mailto:sales@liquid.ai).

+ ## Citation

+ ```bibtex
+ @article{liquidai2025lfm2,
+   title={LFM2 Technical Report},
+   author={Liquid AI},
+   journal={arXiv preprint arXiv:2511.23404},
+   year={2025}
+ }
+ ```