LiquidAI/LFM2.5-1.2B-Instruct

@@ -1,223 +1,217 @@
 ---
-library_name: transformers
 license: other
 license_name: lfm1.0
 license_link: LICENSE
 language:
 - en
-- ar
-- zh
-- fr
-- de
 - ja
 - ko
 - es
 pipeline_tag: text-generation
 tags:
 - liquid
-- lfm2.5
 - edge
-base_model: LiquidAI/LFM2.5-1.2B-Base
 ---
 <div align="center">
-  <img
-    src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/2b08LKpev0DNEk6DlnWkY.png"
-    alt="Liquid AI"
     style="width: 100%; max-width: 100%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;"
   />
   <div style="display: flex; justify-content: center; gap: 0.5em; margin-bottom: 1em;">
-    <a href="https://playground.liquid.ai/"><strong>Try LFM</strong></a> •
-    <a href="https://docs.liquid.ai/lfm"><strong>Documentation</strong></a> •
     <a href="https://leap.liquid.ai/"><strong>LEAP</strong></a>
   </div>
 </div>
-# LFM2.5-1.2B-Instruct
-LFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.
-- **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket.
-- **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM.
-- **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.
-![image](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/dxnYF2fuLpulismtFSGFi.png)
-Find more information about LFM2.5 in our [blog post](https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai).
-## 🗒️ Model Details
-| Model | Parameters | Description |
-|-------|------------|-------------|
-| [LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base) | 1.2B | Pre-trained base model for fine-tuning |
-| [**LFM2.5-1.2B-Instruct**](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) | 1.2B | General-purpose instruction-tuned model |
-| [LFM2.5-1.2B-JP](https://huggingface.co/LiquidAI/LFM2.5-1.2B-JP) | 1.2B | Japanese-optimized chat model |
-| [LFM2.5-VL-1.6B](https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B) | 1.6B | Vision-language model with fast inference |
-| [LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) | 1.5B | Audio-language model for speech and text I/O |
-LFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features:
-- **Number of parameters**: 1.17B
-- **Number of layers**: 16 (10 double-gated LIV convolution blocks + 6 GQA blocks)
-- **Training budget**: 28T tokens
-- **Context length**: 32,768 tokens
-- **Vocabulary size**: 65,536
-- **Languages**: English, Arabic, Chinese, French, German, Japanese, Korean, Spanish
-- **Generation parameters**:
-  - `temperature: 0.1`
-  - `top_k: 50`
-  - `top_p: 0.1`
-  - `repetition_penalty: 1.05`
-| Model | Description |
-|-------|-------------|
-| [**LFM2.5-1.2B-Instruct**](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) | Original model checkpoint in native format. Best for fine-tuning or inference with Transformers and vLLM. |
-| [LFM2.5-1.2B-Instruct-GGUF](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF) | Quantized format for llama.cpp and compatible tools. Optimized for CPU inference and local deployment with reduced memory usage. |
-| [LFM2.5-1.2B-Instruct-ONNX](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-ONNX) | ONNX Runtime format for cross-platform deployment. Enables hardware-accelerated inference across diverse environments (cloud, edge, mobile). |
-| [LFM2.5-1.2B-Instruct-MLX](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-MLX-8bit) | MLX format for Apple Silicon. Optimized for fast inference on Mac devices using the MLX framework. |
-We recommend using it for agentic tasks, data extraction, and RAG. It is not recommended for knowledge-intensive tasks and programming.
-### Chat Template
-LFM2.5 uses a ChatML-like format. See the [Chat Template documentation](https://docs.liquid.ai/lfm/key-concepts/chat-template) for details. Example:
 ```
-<|startoftext|><|im_start|>system
-You are a helpful assistant trained by Liquid AI.<|im_end|>
-<|im_start|>user
-What is C. elegans?<|im_end|>
-<|im_start|>assistant
 ```
-You can use [`tokenizer.apply_chat_template()`](https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate) to format your messages automatically.
-### Tool Use
-LFM2.5 supports function calling as follows:
-1. **Function definition**: We recommend providing the list of tools as a JSON object in the system prompt. You can also use the [`tokenizer.apply_chat_template()`](https://huggingface.co/docs/transformers/en/chat_extras#passing-tools) function with tools.
-2. **Function call**: By default, LFM2.5 writes Pythonic function calls (a Python list between `<|tool_call_start|>` and `<|tool_call_end|>` special tokens), as the assistant answer. You can override this behavior by asking the model to output JSON function calls in the system prompt.
-3. **Function execution**: The function call is executed, and the result is returned as a "tool" role.
-4. **Final answer**: LFM2 interprets the outcome of the function call to address the original user prompt in plain text.
-See the [Tool Use documentation](https://docs.liquid.ai/lfm/key-concepts/tool-use) for the full guide. Example:
-```
-<|startoftext|><|im_start|>system
-List of tools: [{"name": "get_candidate_status", "description": "Retrieves the current status of a candidate in the recruitment process", "parameters": {"type": "object", "properties": {"candidate_id": {"type": "string", "description": "Unique identifier for the candidate"}}, "required": ["candidate_id"]}}]<|im_end|>
-<|im_start|>user
-What is the current status of candidate ID 12345?<|im_end|>
-<|im_start|>assistant
-<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
-<|im_start|>tool
-[{"candidate_id": "12345", "status": "Interview Scheduled", "position": "Clinical Research Associate", "date": "2023-11-20"}]<|im_end|>
-<|im_start|>assistant
-The candidate with ID 12345 is currently in the "Interview Scheduled" stage for the position of Clinical Research Associate, with an interview date set for 2023-11-20.<|im_end|>
 ```
-## 🏃 Inference
-LFM2.5 is supported by many inference frameworks. See the [Inference documentation](https://docs.liquid.ai/lfm/inference/transformers) for the full list.
-| Name | Description | Docs | Notebook |
-|------|-------------|------|:--------:|
-| [Transformers](https://github.com/huggingface/transformers) | Simple inference with direct access to model internals. | <a href="https://docs.liquid.ai/lfm/inference/transformers">Link</a> | <a href="https://colab.research.google.com/drive/1_q3jQ6LtyiuPzFZv7Vw8xSfPU5FwkKZY?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
-| [vLLM](https://github.com/vllm-project/vllm) | High-throughput production deployments with GPU. | <a href="https://docs.liquid.ai/lfm/inference/vllm">Link</a> | <a href="https://colab.research.google.com/drive/1VfyscuHP8A3we_YpnzuabYJzr5ju0Mit?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
-| [llama.cpp](https://github.com/ggml-org/llama.cpp) | Cross-platform inference with CPU offloading. | <a href="https://docs.liquid.ai/lfm/inference/llama-cpp">Link</a> | <a href="https://colab.research.google.com/drive/1ohLl3w47OQZA4ELo46i5E4Z6oGWBAyo8?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
-| [MLX](https://github.com/ml-explore/mlx) | Apple's machine learning framework optimized for Apple Silicon. | <a href="https://docs.liquid.ai/lfm/inference/mlx">Link</a> | — |
-| [LM Studio](https://lmstudio.ai/) | Desktop application for running LLMs locally. | <a href="https://docs.liquid.ai/lfm/inference/lm-studio">Link</a> | — |
-Here's a quick start example with Transformers:
 ```python
-from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
-model_id = "LiquidAI/LFM2.5-1.2B-Instruct"
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    device_map="auto",
-    dtype="bfloat16",
-#   attn_implementation="flash_attention_2" <- uncomment on compatible GPU
-)
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
-prompt = "What is C. elegans?"
-input_ids = tokenizer.apply_chat_template(
-    [{"role": "user", "content": prompt}],
-    add_generation_prompt=True,
-    return_tensors="pt",
-    tokenize=True,
-).to(model.device)
-output = model.generate(
-    input_ids,
-    do_sample=True,
-    temperature=0.1,
-    top_k=50,
-    top_p=0.1,
-    repetition_penalty=1.05,
-    max_new_tokens=512,
-    streamer=streamer,
-)
 ```
-## 🔧 Fine-Tuning
-We recommend fine-tuning LFM2.5 for your specific use case to achieve the best results.
-| Name | Description | Docs | Notebook |
-|------|-------------|------|----------|
-| SFT ([Unsloth](https://github.com/unslothai/unsloth)) | Supervised Fine-Tuning with LoRA using Unsloth. | <a href="https://docs.liquid.ai/lfm/fine-tuning/unsloth">Link</a> | <a href="https://colab.research.google.com/drive/1HROdGaPFt1tATniBcos11-doVaH7kOI3?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
-| SFT ([TRL](https://github.com/huggingface/trl)) | Supervised Fine-Tuning with LoRA using TRL. | <a href="https://docs.liquid.ai/lfm/fine-tuning/trl">Link</a> | <a href="https://colab.research.google.com/drive/1j5Hk_SyBb2soUsuhU0eIEA9GwLNRnElF?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
-| DPO ([TRL](https://github.com/huggingface/trl)) | Direct Preference Optimization with LoRA using TRL. | <a href="https://docs.liquid.ai/lfm/fine-tuning/trl">Link</a> | <a href="https://colab.research.google.com/drive/1MQdsPxFHeZweGsNx4RH7Ia8lG8PiGE1t?usp=sharing"><img src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" width="110" alt="Colab link"></a> |
-## 📊 Performance
-### Benchmarks
-We compared LFM2.5-1.2B-Instruct with relevant sub-2B models on a diverse suite of benchmarks.
-| Model | GPQA | MMLU-Pro | IFEval | IFBench | Multi-IF | AIME25 | BFCLv3 |
-|-------|------|----------|--------|---------|----------|--------|--------|
-| **LFM2.5-1.2B-Instruct** | 38.89 | 44.35 | 86.23 | 47.33 | 60.98 | 14.00 | 49.12 |
-| Qwen3-1.7B (instruct)| 34.85 | 42.91 | 73.68 | 21.33 | 56.48 | 9.33 | 46.30 |
-| Granite 4.0-1B | 24.24 | 33.53 | 79.61 | 21.00 | 43.65 | 3.33 | 52.43 |
-| Llama 3.2 1B Instruct | 16.57 | 20.80 | 52.37 | 15.93 | 30.16 | 0.33 | 21.44 |
-| Gemma 3 1B IT | 24.24 | 14.04 | 63.25 | 20.47 | 44.31 | 1.00 | 16.64 |
-GPQA, MMLU-Pro, IFBench, and AIME25 follow [ArtificialAnalysis's methodology](https://artificialanalysis.ai/methodology/intelligence-benchmarking). For IFEval and Multi-IF, we report the average score across strict and loose prompt and instruction accuracies. For BFCLv3, we report the final weighted average score with a custom Liquid handler to support our tool use template.
-### Inference speed
-LFM2.5-1.2B-Instruct offers extremely fast inference speed on CPUs with a low memory profile compared to similar-sized models.
-![image](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/dbbI-15p9re2ROhAkqnZm.png)
-In addition, we are partnering with AMD, Qualcomm, and Nexa AI to bring the LFM2.5 family to NPUs. These optimized models are available through our partners, enabling highly efficient on-device inference.
-| Device                                               | Inference | Framework        | Model                | Prefill (tok/s) | Decode (tok/s) | Memory (GB) |
-| ---------------------------------------------------- | --------- | ---------------- | -------------------- | --------------- | -------------- | ----------- |
-| Qualcomm Snapdragon® X Elite                         | NPU       | NexaML           | LFM2.5-1.2B-Instruct | 2591            | 63             | 0.9GB       |
-| Qualcomm Snapdragon® Gen4 (ROG Phone9 Pro)           | NPU       | NexaML           | LFM2.5-1.2B-Instruct | 4391            | 82             | 0.9GB       |
-| Qualcomm Snapdragon® Gen4 (Samsung Galaxy S25 Ultra) | CPU       | llama.cpp (Q4_0) | LFM2.5-1.2B-Instruct | 335             | 70             | 719MB       |
-| Qualcomm Snapdragon® Gen4 (Samsung Galaxy S25 Ultra) | CPU       | llama.cpp (Q4_0) | Qwen3-1.7B           | 181             | 40             | 1306MB      |
-These capabilities unlock new deployment scenarios across various devices, including vehicles, mobile devices, laptops, IoT devices, and embedded systems.
-## Contact
-For enterprise solutions and edge deployment, contact [sales@liquid.ai](mailto:sales@liquid.ai).
-## Citation
-```bibtex
-@article{liquidai2025lfm2,
-  title={LFM2 Technical Report},
-  author={Liquid AI},
-  journal={arXiv preprint arXiv:2511.23404},
-  year={2025}
-}
-```

 ---
 license: other
 license_name: lfm1.0
 license_link: LICENSE
 language:
 - en
 - ja
 - ko
+- fr
 - es
+- de
+- it
+- pt
+- ar
+- zh
 pipeline_tag: text-generation
 tags:
 - liquid
 - edge
+- lfm2.5
+- onnx
+- onnxruntime
+- webgpu
+base_model:
+- LiquidAI/LFM2.5-1.2B-Instruct
 ---
 <div align="center">
+  <img
+    src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/2b08LKpev0DNEk6DlnWkY.png"
+    alt="Liquid AI"
     style="width: 100%; max-width: 100%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;"
   />
   <div style="display: flex; justify-content: center; gap: 0.5em; margin-bottom: 1em;">
+    <a href="https://playground.liquid.ai/"><strong>Try LFM</strong></a> •
+    <a href="https://docs.liquid.ai/lfm"><strong>Documentation</strong></a> •
     <a href="https://leap.liquid.ai/"><strong>LEAP</strong></a>
   </div>
 </div>
+# LFM2.5-1.2B-Instruct-ONNX
+ONNX export of [LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) for cross-platform inference.
+LFM2.5 is a hybrid architecture combining multiplicative gates and short convolutions, optimized for edge deployment with fast inference on CPU, GPU, and NPU hardware.
+## Recommended Variants
+| Precision | Size | Platform | Use Case |
+|-----------|------|----------|----------|
+| Q4 | ~1.2GB | WebGPU, Server | Recommended for most uses |
+| FP16 | ~2.4GB | WebGPU, Server | Higher quality |
+| Q8 | ~1.7GB | Server only | Balance of quality and size |
+- **WebGPU**: Use Q4 or FP16 (Q8 not supported)
+- **Server**: All variants supported
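The support matrix above can be encoded as a small lookup so that an unsupported combination fails early. This is a minimal sketch: the `VARIANTS` table and the `pick_variant` helper are hypothetical names, not part of this repository, and the file names assume the `onnx/model_<precision>.onnx` layout listed under Model Files.

```python
# Hypothetical helper: maps a precision label to its graph file and
# rejects platform combinations the card lists as unsupported.
VARIANTS = {
    "q4": {"file": "onnx/model_q4.onnx", "platforms": {"webgpu", "server"}},
    "fp16": {"file": "onnx/model_fp16.onnx", "platforms": {"webgpu", "server"}},
    "q8": {"file": "onnx/model_q8.onnx", "platforms": {"server"}},
}


def pick_variant(precision: str = "q4", platform: str = "server") -> str:
    """Return the graph file for a precision, or raise if unsupported."""
    key = precision.lower()
    if key not in VARIANTS:
        raise ValueError(f"unknown precision: {precision!r}")
    entry = VARIANTS[key]
    if platform.lower() not in entry["platforms"]:
        raise ValueError(f"{precision} is not supported on {platform}")
    return entry["file"]


print(pick_variant())                  # default: Q4 on a server
print(pick_variant("fp16", "webgpu"))  # FP16 in the browser
```

Calling `pick_variant("q8", "webgpu")` raises `ValueError`, mirroring the "Q8 not supported" note.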
+## Model Files
 ```
+onnx/
+├── model.onnx              # FP32 model graph
+├── model.onnx_data*        # FP32 weights
+├── model_fp16.onnx         # FP16 model graph
+├── model_fp16.onnx_data*   # FP16 weights
+├── model_q4.onnx           # Q4 model graph (recommended)
+├── model_q4.onnx_data      # Q4 weights
+├── model_q8.onnx           # Q8 model graph
+└── model_q8.onnx_data      # Q8 weights
+* Large models (>2GB) split weights across multiple files:
+  model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
+  All data files must be in the same directory as the .onnx file.
 ```
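Because a graph file and its weight shards must sit in the same directory, it can help to derive one set of glob patterns per variant. The sketch below is an illustration under assumptions: `patterns_for` is a hypothetical helper, and the example listing only mirrors the tree above (the shard names are illustrative, not a real repository listing).

```python
from fnmatch import fnmatch


def patterns_for(precision: str) -> list[str]:
    """Glob patterns covering one variant's graph file and all its weight shards."""
    suffix = "" if precision == "fp32" else f"_{precision}"
    graph = f"onnx/model{suffix}.onnx"
    return [graph, graph + "_data*"]


# Illustrative listing that mirrors the tree above.
listing = [
    "onnx/model.onnx", "onnx/model.onnx_data", "onnx/model.onnx_data_1",
    "onnx/model_fp16.onnx", "onnx/model_fp16.onnx_data",
    "onnx/model_q4.onnx", "onnx/model_q4.onnx_data",
]

selected = [f for f in listing if any(fnmatch(f, p) for p in patterns_for("q4"))]
print(selected)
```

The same pattern list could be passed as `allow_patterns` to `huggingface_hub.snapshot_download` to fetch a single variant in one call.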
+## Python
+### Installation
+```bash
+pip install onnxruntime transformers numpy huggingface_hub
+# or with GPU support:
+pip install onnxruntime-gpu transformers numpy huggingface_hub
 ```
+### Inference
 ```python
+import numpy as np
+import onnxruntime as ort
+from huggingface_hub import hf_hub_download, list_repo_files
+from transformers import AutoTokenizer
+# Download model (Q4 recommended)
+model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
+model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")
+# Download all data files (handles multiple splits for large models)
+for f in list_repo_files(model_id):
+    if f.startswith("onnx/model_q4.onnx_data"):
+        hf_hub_download(model_id, f)
+# Load model and tokenizer
+session = ort.InferenceSession(model_path)
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+# Prepare chat input
+messages = [{"role": "user", "content": "What is the capital of France?"}]
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+input_ids = np.array([tokenizer.encode(prompt, add_special_tokens=False)], dtype=np.int64)
+# Initialize cache inputs (attention KV and conv state): dynamic dims default to 1, sequence dims start at 0
+ONNX_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
+cache = {}
+for inp in session.get_inputs():
+    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
+        continue
+    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
+    for i, d in enumerate(inp.shape):
+        if isinstance(d, str) and "sequence" in d.lower():
+            shape[i] = 0
+    cache[inp.name] = np.zeros(shape, dtype=ONNX_DTYPE.get(inp.type, np.float32))
+# Check if model uses position_ids
+input_names = {inp.name for inp in session.get_inputs()}
+use_position_ids = "position_ids" in input_names
+# Generate tokens
+seq_len = input_ids.shape[1]
+generated_tokens = []
+for step in range(100):  # max tokens
+    if step == 0:
+        ids = input_ids
+        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
+    else:
+        ids = np.array([[generated_tokens[-1]]], dtype=np.int64)
+        pos = np.array([[seq_len + len(generated_tokens) - 1]], dtype=np.int64)
+    attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
+    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
+    if use_position_ids:
+        feed["position_ids"] = pos
+    outputs = session.run(None, feed)
+    next_token = int(np.argmax(outputs[0][0, -1]))
+    generated_tokens.append(next_token)
+    # Update cache: feed each present_* output back in as the matching past_* input
+    for i, out in enumerate(session.get_outputs()[1:], 1):
+        name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
+        if name in cache:
+            cache[name] = outputs[i]
+    if next_token == tokenizer.eos_token_id:
+        break
+print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
 ```
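The loop above decodes greedily with `np.argmax`. As a hedged alternative, the sketch below samples the next token with temperature, top-k, top-p, and a repetition penalty, defaulting to the generation parameters listed for LFM2.5-1.2B-Instruct (`temperature: 0.1`, `top_k: 50`, `top_p: 0.1`, `repetition_penalty: 1.05`). `sample_next` is a hypothetical helper written for illustration; it is not the Transformers implementation, and it penalizes only the token ids passed in.

```python
import numpy as np


def sample_next(logits, previous_tokens, rng, temperature=0.1, top_k=50,
                top_p=0.1, repetition_penalty=1.05):
    """Sample one token id from a 1-D logits vector (illustrative sketch)."""
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty: make already-seen tokens less likely.
    seen = np.unique(np.asarray(previous_tokens, dtype=np.int64))
    if seen.size:
        vals = logits[seen]
        logits[seen] = np.where(vals > 0, vals / repetition_penalty,
                                vals * repetition_penalty)

    logits = logits / temperature

    # Top-k: keep the k highest-scoring tokens, sorted best first.
    k = min(top_k, logits.size)
    top_idx = np.argsort(logits)[::-1][:k]
    top_logits = logits[top_idx]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    cutoff = min(int(np.searchsorted(np.cumsum(probs), top_p)) + 1, k)
    probs = probs[:cutoff] / probs[:cutoff].sum()
    return int(rng.choice(top_idx[:cutoff], p=probs))


rng = np.random.default_rng(0)
example_logits = np.array([0.0, 5.0, 1.0])
print(sample_next(example_logits, previous_tokens=[], rng=rng))
```

In the generation loop, `int(np.argmax(outputs[0][0, -1]))` could be swapped for `sample_next(outputs[0][0, -1], generated_tokens, rng)`. With these low temperature and top-p defaults the result is usually the same as greedy decoding.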
+## WebGPU (Browser)
+### Installation
+```bash
+npm install @huggingface/transformers
+```
+### Enable WebGPU
+WebGPU is required for browser inference. To enable:
+1. **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable the flag, and restart the browser
+2. **Verify**: Check `chrome://gpu` for "WebGPU" status
+3. **Test**: Run `await navigator.gpu.requestAdapter()` in the DevTools console; a non-null result means a WebGPU adapter is available
+### Inference
+```javascript
+import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";
+const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";
+// Load model and tokenizer
+const tokenizer = await AutoTokenizer.from_pretrained(modelId);
+const model = await AutoModelForCausalLM.from_pretrained(modelId, {
+  device: "webgpu",
+  dtype: "q4",  // or "fp16"
+});
+// Prepare input
+const messages = [{ role: "user", content: "What is the capital of France?" }];
+const input = tokenizer.apply_chat_template(messages, {
+  add_generation_prompt: true,
+  return_dict: true,
+});
+// Generate with streaming
+const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
+const output = await model.generate({
+  ...input,
+  max_new_tokens: 256,
+  do_sample: false,
+  streamer,
+});
+console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
+```
+### WebGPU Notes
+- Supported: Q4, FP16 (Q8 not supported on WebGPU)
+## License
+This model is released under the [LFM 1.0 License](LICENSE).