---
library_name: transformers
license: other
license_name: lfm1.0
license_link: LICENSE
language:
- en
- ar
- zh
- fr
- de
- ja
- ko
- es
pipeline_tag: text-generation
tags:
- liquid
- lfm2.5
- edge
base_model: LiquidAI/LFM2.5-1.2B-Base
---
| Name | Description | Docs | Notebook |
|------|-------------|------|----------|
| [vLLM](https://github.com/vllm-project/vllm) | High-throughput production deployments with GPU. | Link | |
| [llama.cpp](https://github.com/ggml-org/llama.cpp) | Cross-platform inference with CPU offloading. | Link | |
| [MLX](https://github.com/ml-explore/mlx) | Apple's machine learning framework optimized for Apple Silicon. | Link | — |
| [LM Studio](https://lmstudio.ai/) | Desktop application for running LLMs locally. | Link | — |
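For batch generation or serving, vLLM can run the model directly. The following is a minimal offline-inference sketch (assuming a recent vLLM build with LFM2.5 support); the sampling values mirror those in the Transformers quick start below:

```python
# Sketch: offline inference with vLLM. Assumes a recent vLLM build that supports LFM2.5.
from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2.5-1.2B-Thinking")
params = SamplingParams(
    temperature=0.1,
    top_p=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512,
)

# chat() applies the model's chat template before generation.
outputs = llm.chat([{"role": "user", "content": "What is C. elegans?"}], params)
print(outputs[0].outputs[0].text)
```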
Here's a quick start example with Transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "LiquidAI/LFM2.5-1.2B-Thinking"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    # attn_implementation="flash_attention_2" <- uncomment on compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Apply the chat template to the prompt
prompt = "What is C. elegans?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

# Generate the answer, streaming tokens as they are produced
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05,
    max_new_tokens=512,
    streamer=streamer,
)
```
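The streamer prints tokens as they arrive. If you also need the completion as a string (for example, to post-process the reasoning trace), you can decode the newly generated tokens after the prompt:

```python
# Decode only the tokens generated after the prompt to get the completion text.
completion = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(completion)
```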
## 🔧 Fine-Tuning
We recommend fine-tuning LFM2.5 for your specific use case to achieve the best results.
| Name | Description | Docs | Notebook |
|------|-------------|------|----------|
| SFT ([Unsloth](https://github.com/unslothai/unsloth)) | Supervised Fine-Tuning with LoRA using Unsloth. | Link | |
| SFT ([TRL](https://github.com/huggingface/trl)) | Supervised Fine-Tuning with LoRA using TRL. | Link | |
| DPO ([TRL](https://github.com/huggingface/trl)) | Direct Preference Optimization with LoRA using TRL. | Link | |
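As a starting point, here is a minimal LoRA SFT sketch with TRL. The dataset, hyperparameters, and LoRA target modules are placeholders to adapt to your own data and to the model's architecture; the notebooks listed above cover the full recipes:

```python
# Sketch: LoRA SFT with TRL. The dataset, hyperparameters, and target modules below
# are placeholders, not a recommended recipe for LFM2.5.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Example chat dataset; replace with your own.
dataset = load_dataset("trl-lib/Capybara", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # assumption: adapt all linear projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="LiquidAI/LFM2.5-1.2B-Thinking",
    train_dataset=dataset,
    args=SFTConfig(output_dir="lfm2.5-1.2b-sft-lora", per_device_train_batch_size=2),
    peft_config=peft_config,
)
trainer.train()
```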
## 📊 Performance
### Benchmarks
We compared LFM2.5-1.2B-Thinking with relevant sub-2B models on a diverse suite of benchmarks.
| Model | GPQA | MMLU-Pro | IFEval | IFBench | Multi-IF | AIME25 | BFCLv3 |
|-------|------|----------|--------|---------|----------|--------|--------|
| LFM2.5-1.2B-Thinking | - | - | - | - | - | - | - |
| LFM2.5-1.2B-Instruct | 38.89 | 44.35 | 86.23 | 47.33 | 60.98 | 14.00 | 49.12 |
| Qwen3-1.7B (Thinking) | - | - | - | - | - | - | - |
| Qwen3-1.7B (Instruct) | 34.85 | 42.91 | 73.68 | 21.33 | 56.48 | 9.33 | 46.30 |
| Granite 4.0-1B | 24.24 | 33.53 | 79.61 | 21.00 | 43.65 | 3.33 | 52.43 |
| Llama 3.2 1B Instruct | 16.57 | 20.80 | 52.37 | 15.93 | 30.16 | 0.33 | 21.44 |
| Gemma 3 1B IT | 24.24 | 14.04 | 63.25 | 20.47 | 44.31 | 1.00 | 16.64 |
GPQA, MMLU-Pro, IFBench, and AIME25 follow [ArtificialAnalysis's methodology](https://artificialanalysis.ai/methodology/intelligence-benchmarking). For IFEval and Multi-IF, we report the average of the strict and loose prompt-level and instruction-level accuracies. For BFCLv3, we report the final weighted average score, using a custom Liquid handler to support our tool-use template.
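As a small illustration of that aggregation (with placeholder numbers, not the scores from the table), the reported IFEval and Multi-IF value is simply the mean of the four accuracies:

```python
# Placeholder values, purely to illustrate the averaging described above.
scores = {
    "prompt_strict": 0.84,
    "prompt_loose": 0.86,
    "instruction_strict": 0.88,
    "instruction_loose": 0.90,
}
reported = sum(scores.values()) / len(scores)
print(f"Reported score: {100 * reported:.2f}")
```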
### Inference speed
LFM2.5-1.2B-Thinking delivers very fast CPU inference with a low memory footprint compared to similarly sized models.

In addition, we are partnering with AMD, Qualcomm, Nexa AI, and FastFlowLM to bring the LFM2.5 family to NPUs. These optimized models are available through our partners, enabling highly efficient on-device inference.
#### **Prefill Performance**
We report prefill throughput evaluated over a range of prompt lengths.
| Platform / Device | Inference | Framework | Model | 1K Prefill (tok/s) | 4K Prefill (tok/s) | 16K Prefill (tok/s) | Memory |
|----------------------------------------------------|-----------|------------------|-------------------|-------------------:|-------------------:|--------------------:| -------:|
| AMD Ryzen™ AI 395+ | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 1,487 | 2,226 | 1,670 | 1.6 GB (full context) |
| AMD Ryzen™ AI 9 HX 370 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 1,487 | 2,226 | 1,670 | 1.6 GB (full context) |
| AMD Ryzen™ AI 7 HX 350 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 1,431 | 2,032 | 1,519 | 1.6 GB (full context) |
| AMD Ryzen™ AI 5 HX 340 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 1,431 | 2,032 | 1,519 | 1.6 GB (full context) |
| AMD Ryzen™ AI 9 HX 370 | CPU | llama.cpp (Q4_0) | LFM2.5-1.2B-Thinking | 2,975 | N/A | N/A | 856 MB |
| Qualcomm Snapdragon® X Elite | NPU | NexaML | LFM2.5-1.2B-Thinking | 2,591 | N/A | N/A | 0.9 GB |
| Qualcomm Snapdragon® Gen4 (ROG Phone 9 Pro) | NPU | NexaML | LFM2.5-1.2B-Thinking | 4,391 | N/A | N/A | 0.9 GB |
| Qualcomm Dragonwing IQ9 (IQ-9075, IoT) | NPU | NexaML | LFM2.5-1.2B-Thinking | 2,143 | N/A | N/A | 0.9 GB |
| Qualcomm Snapdragon® Gen4 (Galaxy S25 Ultra) | CPU | llama.cpp (Q4_0) | LFM2.5-1.2B-Thinking | 335 | N/A | N/A | 719 MB |
#### **Decode Performance**
The reported results correspond to decoding 100 tokens at different context lengths.
| Platform / Device | Inference | Framework | Model | Decode @1K (tok/s) | Decode @4K (tok/s) | Decode @16K (tok/s) | Memory |
|----------------------------------------------------|-----------|------------------|-------------------|-------------------:|-------------------:|--------------------:| -------:|
| AMD Ryzen™ AI 395+ | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 60 | 54 | 49 | 1.6 GB (full context) |
| AMD Ryzen™ AI 9 HX 370 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 57 | 54 | 49 | 1.6 GB (full context) |
| AMD Ryzen™ AI 7 HX 350 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 63 | 59 | 52 | 1.6 GB (full context) |
| AMD Ryzen™ AI 5 HX 340 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 63 | 59 | 52 | 1.6 GB (full context) |
| AMD Ryzen™ AI 9 HX 370 | CPU | llama.cpp (Q4_0) | LFM2.5-1.2B-Thinking | 116 | N/A | N/A | 856 MB |
| Qualcomm Snapdragon® X Elite | NPU | NexaML | LFM2.5-1.2B-Thinking | 63 | N/A | N/A | 0.9 GB |
| Qualcomm Snapdragon® Gen4 (ROG Phone 9 Pro) | NPU | NexaML | LFM2.5-1.2B-Thinking | 82 | N/A | N/A | 0.9 GB |
| Qualcomm Dragonwing IQ9 (IQ-9075, IoT) | NPU | NexaML | LFM2.5-1.2B-Thinking | 53 | N/A | N/A | 0.9 GB |
| Qualcomm Snapdragon® Gen4 (Galaxy S25 Ultra) | CPU | llama.cpp (Q4_0) | LFM2.5-1.2B-Thinking | 70 | N/A | N/A | 719 MB |
**LFM2.5-1.2B-Thinking excels at long-context inference.**
On AMD NPUs with FastFlowLM, decoding throughput sustains ~46 tok/s even at the full 32K context, indicating robust long-context scalability.
See detailed benchmark results (up to full context length) [here](https://fastflowlm.com/docs/benchmarks/lfm2_results/).
These capabilities unlock new deployment scenarios across various devices, including vehicles, mobile devices, laptops, IoT devices, and embedded systems.
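To get a rough decode-throughput number on your own hardware (this is an approximation, not the harness used for the tables above), you can time a fixed-length generation with Transformers:

```python
# Rough decode-throughput measurement: time a fixed number of generated tokens.
# The elapsed time also includes the short prompt's prefill, so treat the result
# as an approximation rather than a benchmark-grade figure.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-1.2B-Thinking"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the plot of Hamlet."}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

new_tokens = 100
start = time.perf_counter()
model.generate(input_ids, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
elapsed = time.perf_counter() - start
print(f"Decode throughput: {new_tokens / elapsed:.1f} tok/s")
```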
## Contact
For enterprise solutions and edge deployment, contact [sales@liquid.ai](mailto:sales@liquid.ai).
## Citation
```bibtex
@article{liquidai2025lfm2,
  title={LFM2 Technical Report},
  author={Liquid AI},
  journal={arXiv preprint arXiv:2511.23404},
  year={2025}
}
```