---
license: other
license_name: lfm1.0
license_link: LICENSE
language:
- en
pipeline_tag: text-generation
tags:
- liquid
- edge
- lfm2
- transcript
- meeting
- summarization
- onnx
- onnxruntime
- webgpu
base_model:
- LiquidAI/LFM2-2.6B-Transcript
---

<div align="center">
  <img
    src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/2b08LKpev0DNEk6DlnWkY.png"
    alt="Liquid AI"
    style="width: 100%; max-width: 100%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;"
  />
  <div style="display: flex; justify-content: center; gap: 0.5em; margin-bottom: 1em;">
    <a href="https://playground.liquid.ai/"><strong>Try LFM</strong></a> β€’
    <a href="https://docs.liquid.ai/lfm"><strong>Documentation</strong></a> β€’
    <a href="https://leap.liquid.ai/"><strong>LEAP</strong></a>
  </div>
</div>

# LFM2-2.6B-Transcript-ONNX

ONNX export of [LFM2-2.6B-Transcript](https://huggingface.co/LiquidAI/LFM2-2.6B-Transcript) for cross-platform inference.

LFM2-2.6B-Transcript is optimized for processing and summarizing meeting transcripts, extracting key points, action items, and decisions from conversational text.

## Recommended Variants

| Precision | Size | Platform | Use Case |
|-----------|------|----------|----------|
| Q4 | ~2.0GB | WebGPU, Server | Recommended for most uses |
| FP16 | ~4.8GB | WebGPU, Server | Higher quality |
| Q8 | ~3.0GB | Server only | Balance of quality and size |

- **WebGPU**: Use Q4 or FP16 (Q8 not supported)
- **Server**: All variants supported
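
In code, the choice comes down to which graph file you load. A small lookup (file names from the Model Files listing below, sizes from the table above) keeps the trade-off explicit:

```python
# Map each precision variant to its ONNX graph file.
VARIANT_FILES = {
    "q4":   "onnx/model_q4.onnx",    # ~2.0 GB, WebGPU + server (recommended)
    "fp16": "onnx/model_fp16.onnx",  # ~4.8 GB, WebGPU + server
    "q8":   "onnx/model_q8.onnx",    # ~3.0 GB, server only
}
```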

## Model Files

```
onnx/
β”œβ”€β”€ model.onnx              # FP32 model graph
β”œβ”€β”€ model.onnx_data*        # FP32 weights
β”œβ”€β”€ model_fp16.onnx         # FP16 model graph
β”œβ”€β”€ model_fp16.onnx_data*   # FP16 weights
β”œβ”€β”€ model_q4.onnx           # Q4 model graph (recommended)
β”œβ”€β”€ model_q4.onnx_data      # Q4 weights
β”œβ”€β”€ model_q8.onnx           # Q8 model graph
└── model_q8.onnx_data      # Q8 weights

* Large models (>2GB) split weights across multiple files:
  model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
  All data files must be in the same directory as the .onnx file.
```
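
If you prefer to mirror a whole variant locally in one call, `huggingface_hub`'s `snapshot_download` with a glob pattern picks up the graph file together with any split weight shards; a minimal sketch:

```python
from huggingface_hub import snapshot_download

# "onnx/model_q4.onnx*" matches the graph file and every weight shard
# (model_q4.onnx_data, model_q4.onnx_data_1, ...), so they all land in
# the same directory, as ONNX Runtime requires.
local_dir = snapshot_download(
    "LiquidAI/LFM2-2.6B-Transcript-ONNX",
    allow_patterns=["onnx/model_q4.onnx*"],
)
model_path = f"{local_dir}/onnx/model_q4.onnx"
```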

## Python

### Installation

```bash
pip install onnxruntime transformers numpy huggingface_hub
# or with GPU support:
pip install onnxruntime-gpu transformers numpy huggingface_hub
```
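
To verify the install (and GPU availability, if you chose `onnxruntime-gpu`), list the execution providers ONNX Runtime can see:

```python
import onnxruntime as ort

# With onnxruntime-gpu and a working CUDA setup, this list should
# include "CUDAExecutionProvider" ahead of "CPUExecutionProvider".
print(ort.get_available_providers())
```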

### Inference

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download model (Q4 recommended)
model_id = "LiquidAI/LFM2-2.6B-Transcript-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download all data files (handles multiple splits for large models)
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare chat input
messages = [{"role": "user", "content": "Summarize this meeting transcript: ..."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = np.array([tokenizer.encode(prompt, add_special_tokens=False)], dtype=np.int64)

# Initialize the cache state. LFM2 is a hybrid architecture, so the session
# expects both attention KV caches (dynamic sequence dim, started empty)
# and fixed-size short-convolution states (started as zeros).
ONNX_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    cache[inp.name] = np.zeros(shape, dtype=ONNX_DTYPE.get(inp.type, np.float32))

# Check if model uses position_ids
input_names = {inp.name for inp in session.get_inputs()}
use_position_ids = "position_ids" in input_names

# Generate tokens
seq_len = input_ids.shape[1]
generated_tokens = []

for step in range(100):  # max new tokens
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated_tokens[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated_tokens) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if use_position_ids:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated_tokens.append(next_token)

    # Feed each returned state back in as the matching cache input: output
    # names use "present"/"present_conv", inputs "past_key_values."/"past_conv".
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
```
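
The sketch above runs on CPU and decodes greedily. With `onnxruntime-gpu` installed, you can request the CUDA provider when creating the session (ONNX Runtime falls back to CPU if it is unavailable):

```python
session = ort.InferenceSession(
    model_path,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```

If you want sampled rather than greedy decoding, a minimal drop-in for the `next_token` line (the temperature of 0.7 is an arbitrary example, not a tuned recommendation):

```python
# Replace next_token = int(np.argmax(outputs[0][0, -1])) above.
logits = outputs[0][0, -1].astype(np.float64) / 0.7  # temperature scaling
probs = np.exp(logits - logits.max())                # stable softmax
probs /= probs.sum()
next_token = int(np.random.choice(len(probs), p=probs))
```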

## WebGPU (Browser)

### Installation

```bash
npm install @huggingface/transformers
```

### Enable WebGPU

WebGPU is required for browser inference. Recent Chrome and Edge releases enable it by default; if `navigator.gpu` is undefined, turn it on manually:

1. **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable, and restart
2. **Verify**: Check `chrome://gpu` for "WebGPU" status
3. **Test**: Run `navigator.gpu.requestAdapter()` in DevTools console

### Inference

```javascript
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2-2.6B-Transcript-ONNX";

// Load model and tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4",  // or "fp16"
});

// Prepare input
const messages = [{ role: "user", content: "Summarize this meeting transcript: ..." }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

// Generate with streaming
const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
```

### WebGPU Notes

- Supported: Q4, FP16 (Q8 not supported on WebGPU)

## License

This model is released under the [LFM 1.0 License](LICENSE).