metadata
license: other
license_name: lfm1.0
license_link: LICENSE
language:
- en
- ja
- ko
- fr
- es
- de
- it
- pt
- ar
- zh
pipeline_tag: text-generation
tags:
- liquid
- edge
- lfm2.5
- onnx
- onnxruntime
- webgpu
base_model:
- LiquidAI/LFM2.5-1.2B-Base
LFM2.5-1.2B-Base-ONNX
ONNX export of LFM2.5-1.2B-Base for cross-platform inference.
LFM2.5 is a hybrid architecture combining multiplicative gates and short convolutions, optimized for edge deployment with fast inference on CPU, GPU, and NPU hardware. This is the base (pretrained) model for text completion tasks.
Recommended Variants
| Precision | Size | Platform | Use Case |
|---|---|---|---|
| Q4 | ~1.2GB | WebGPU, Server | Recommended for most uses |
| FP16 | ~2.4GB | WebGPU, Server | Higher quality |
| Q8 | ~1.7GB | Server only | Balance of quality and size |
- WebGPU: Use Q4 or FP16 (Q8 not supported)
- Server: All variants supported
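The compatibility rules above can be captured in a small lookup. This is only an illustrative sketch: `SUPPORTED` and `pick_precision` are hypothetical names, not part of any library, and the fallback to Q4 simply follows the recommendation in the table.

```python
# Hypothetical helper: encodes the platform/precision table from this README.
SUPPORTED = {
    "webgpu": ["q4", "fp16"],        # Q8 is not supported on WebGPU
    "server": ["q4", "fp16", "q8"],  # all variants supported
}


def pick_precision(platform: str, preferred: str = "q4") -> str:
    """Return `preferred` if the platform supports it, else the first supported variant."""
    options = SUPPORTED[platform.lower()]
    wanted = preferred.lower()
    return wanted if wanted in options else options[0]


print(pick_precision("webgpu", "q8"))  # q4 (falls back, Q8 is unavailable)
print(pick_precision("server", "q8"))  # q8
```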
Model Files
onnx/
├── model.onnx              # FP32 model graph
├── model.onnx_data*        # FP32 weights
├── model_fp16.onnx         # FP16 model graph
├── model_fp16.onnx_data*   # FP16 weights
├── model_q4.onnx           # Q4 model graph (recommended)
├── model_q4.onnx_data      # Q4 weights
├── model_q8.onnx           # Q8 model graph
└── model_q8.onnx_data      # Q8 weights
* Large models (>2GB) split weights across multiple files:
model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
All data files must be in the same directory as the .onnx file.
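Because every weight shard has to sit next to its graph file, it can help to work out the full file list for one precision before downloading anything. The following is a minimal sketch, assuming the naming scheme shown in the tree above; `variant_files` is a hypothetical helper and the example file list is made up for illustration.

```python
def variant_files(repo_files, precision="q4"):
    """Return the graph file plus every external-data shard for one precision.

    Assumes the naming scheme from the tree above: FP32 has no suffix,
    other precisions use `_<precision>` before `.onnx`.
    """
    suffix = "" if precision == "fp32" else f"_{precision}"
    graph = f"onnx/model{suffix}.onnx"
    data_prefix = graph + "_data"  # matches _data, _data_1, _data_2, ...
    shards = sorted(f for f in repo_files if f.startswith(data_prefix))
    return [graph] + shards


# Illustrative file list, not the actual repository contents.
example = [
    "onnx/model.onnx", "onnx/model.onnx_data", "onnx/model.onnx_data_1",
    "onnx/model_q4.onnx", "onnx/model_q4.onnx_data",
    "onnx/model_fp16.onnx", "onnx/model_fp16.onnx_data",
]
print(variant_files(example, "fp32"))
# ['onnx/model.onnx', 'onnx/model.onnx_data', 'onnx/model.onnx_data_1']
```

Each returned path could then be passed to `hf_hub_download`, which keeps the repository's relative paths inside its cache, so the shards should end up in the same directory as the graph.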
Python
Installation
pip install onnxruntime transformers numpy huggingface_hub
# or with GPU support:
pip install onnxruntime-gpu transformers numpy huggingface_hub
Inference
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
# Download model (Q4 recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Base-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")
# Download all data files (handles multiple splits for large models)
from huggingface_hub import list_repo_files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)
# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Prepare text completion input
prompt = "The quick brown fox"
input_ids = np.array([tokenizer.encode(prompt, add_special_tokens=True)], dtype=np.int64)
# Initialize KV cache
ONNX_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
cache = {}
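for inp in session.get_inputs():
    # Everything except the token inputs is a cache tensor
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    # Symbolic dims (e.g. batch) default to 1; fixed dims are kept as-is
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    # Sequence dims start at 0 because there are no past tokens yet
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    cache[inp.name] = np.zeros(shape, dtype=ONNX_DTYPE.get(inp.type, np.float32))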
# Check if model uses position_ids
input_names = {inp.name for inp in session.get_inputs()}
use_position_ids = "position_ids" in input_names
# Generate tokens
seq_len = input_ids.shape[1]
generated_tokens = []
for step in range(50):  # max tokens
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated_tokens[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated_tokens) - 1]], dtype=np.int64)
    attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if use_position_ids:
        feed["position_ids"] = pos
    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated_tokens.append(next_token)
    # Update cache
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]
    if next_token == tokenizer.eos_token_id:
        break
print(prompt + tokenizer.decode(generated_tokens, skip_special_tokens=True))
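The cache handling in the script rests on two conventions: symbolic sequence dimensions start at length 0, and each `present*` output is fed back as the matching `past*` input on the next step. The sketch below isolates both as pure functions so they can be checked without loading the model. `initial_cache_shape` and `present_to_past` are hypothetical names, and the dimension and tensor names in the example are assumptions about the exported graph, not verified values.

```python
def initial_cache_shape(dims):
    """Resolve an ONNX input shape to a concrete empty-cache shape.

    Fixed integer dims are kept, symbolic sequence dims become 0 (no past
    tokens yet), and any other symbolic dim (e.g. batch) becomes 1.
    """
    shape = []
    for d in dims:
        if isinstance(d, int):
            shape.append(d)
        elif "sequence" in str(d).lower():
            shape.append(0)
        else:
            shape.append(1)
    return shape


def present_to_past(output_name):
    """Map a model output name to the input name that receives it next step."""
    return (output_name
            .replace("present_conv", "past_conv")
            .replace("present.", "past_key_values."))


# Illustrative dims and names only; the real graph may differ.
print(initial_cache_shape(["batch_size", 8, "past_sequence_length", 64]))  # [1, 8, 0, 64]
print(present_to_past("present.2.key"))   # past_key_values.2.key
print(present_to_past("present_conv.0"))  # past_conv.0
```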
WebGPU (Browser)
Installation
npm install @huggingface/transformers
Enable WebGPU
WebGPU is required for browser inference. To enable:
- Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
- Verify: Check chrome://gpu for "WebGPU" status
- Test: Run navigator.gpu.requestAdapter() in DevTools console
Inference
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";
const modelId = "LiquidAI/LFM2.5-1.2B-Base-ONNX";
// Load model and tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4", // or "fp16"
});
// Prepare input (text completion); calling the tokenizer returns input_ids and attention_mask tensors
const prompt = "The quick brown fox";
const inputs = await tokenizer(prompt);
// Generate with streaming
const streamer = new TextStreamer(tokenizer, { skip_prompt: false });
const output = await model.generate({
  ...inputs,
  max_new_tokens: 50,
  do_sample: false,
  streamer,
});
console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
WebGPU Notes
- Supported: Q4, FP16 (Q8 not supported on WebGPU)
License
This model is released under the LFM 1.0 License.