---
license: other
license_name: lfm1.0
license_link: LICENSE
language:
- en
pipeline_tag: text-generation
tags:
- liquid
- edge
- lfm2
- transcript
- meeting
- summarization
- onnx
- onnxruntime
- webgpu
base_model:
- LiquidAI/LFM2-2.6B-Transcript
---
# LFM2-2.6B-Transcript-ONNX

ONNX export of [LFM2-2.6B-Transcript](https://huggingface.co/LiquidAI/LFM2-2.6B-Transcript) for cross-platform inference. LFM2-2.6B-Transcript is optimized for processing and summarizing meeting transcripts, extracting key points, action items, and decisions from conversational text.

## Recommended Variants

| Precision | Size   | Platform       | Use Case                    |
|-----------|--------|----------------|-----------------------------|
| Q4        | ~2.0GB | WebGPU, Server | Recommended for most uses   |
| FP16      | ~4.8GB | WebGPU, Server | Higher quality              |
| Q8        | ~3.0GB | Server only    | Balance of quality and size |

- **WebGPU**: Use Q4 or FP16 (Q8 not supported)
- **Server**: All variants supported

## Model Files

```
onnx/
├── model.onnx             # FP32 model graph
├── model.onnx_data*       # FP32 weights
├── model_fp16.onnx        # FP16 model graph
├── model_fp16.onnx_data*  # FP16 weights
├── model_q4.onnx          # Q4 model graph (recommended)
├── model_q4.onnx_data     # Q4 weights
├── model_q8.onnx          # Q8 model graph
└── model_q8.onnx_data     # Q8 weights
```

\* Large models (>2GB) split their weights across multiple files: `model.onnx_data`, `model.onnx_data_1`, `model.onnx_data_2`, etc. All data files must be in the same directory as the `.onnx` file.

## Python

### Installation

```bash
pip install onnxruntime transformers numpy huggingface_hub
# or with GPU support:
pip install onnxruntime-gpu transformers numpy huggingface_hub
```

### Inference

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download model graph (Q4 recommended)
model_id = "LiquidAI/LFM2-2.6B-Transcript-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download all data files (handles multiple splits for large models)
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare chat input
messages = [{"role": "user", "content": "Summarize this meeting transcript: ..."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = np.array([tokenizer.encode(prompt, add_special_tokens=False)], dtype=np.int64)

# Initialize KV cache: every cache input starts empty, with its dynamic
# sequence dimension set to 0
ONNX_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    cache[inp.name] = np.zeros(shape, dtype=ONNX_DTYPE.get(inp.type, np.float32))

# Check whether the model graph takes position_ids
input_names = {inp.name for inp in session.get_inputs()}
use_position_ids = "position_ids" in input_names

# Generate tokens
seq_len = input_ids.shape[1]
generated_tokens = []
for step in range(100):  # max new tokens
    if step == 0:
        # Prefill: feed the whole prompt at once
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        # Decode: feed only the last generated token; the cache holds the rest
        ids = np.array([[generated_tokens[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated_tokens) - 1]], dtype=np.int64)
    attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if use_position_ids:
        feed["position_ids"] = pos
    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
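    # next_token is chosen greedily (argmax over the last position's logits).
    # To sample instead, you could apply temperature/top-k to outputs[0][0, -1]
    # before choosing; greedy keeps this example deterministic.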
    generated_tokens.append(next_token)

    # Update cache: map each "present_*" output back to its "past_*" input
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
```

## WebGPU (Browser)

### Installation

```bash
npm install @huggingface/transformers
```

### Enable WebGPU

WebGPU is required for browser inference. To enable it:

1. **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable the flag, and restart the browser
2. **Verify**: Check `chrome://gpu` for "WebGPU" status
3. **Test**: Run `navigator.gpu.requestAdapter()` in the DevTools console

### Inference

```javascript
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2-2.6B-Transcript-ONNX";

// Load model and tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4", // or "fp16"
});

// Prepare input
const messages = [{ role: "user", content: "Summarize this meeting transcript: ..." }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

// Generate with streaming
const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
```

### WebGPU Notes

- Supported: Q4, FP16 (Q8 is not supported on WebGPU)
- A runtime feature check is sketched at the end of this card

## License

This model is released under the [LFM 1.0 License](LICENSE).
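## Appendix: WebGPU Feature Check

The DevTools test above can also be run from page code before loading the model. A minimal sketch, assuming an ES-module browser context: `navigator.gpu` is only defined in browsers that expose WebGPU, and `requestAdapter()` can resolve to `null` even where the API exists (e.g., a blocklisted GPU).

```javascript
// Minimal WebGPU feature check (sketch, not part of this model's API).
async function hasWebGPU() {
  if (!("gpu" in navigator)) return false;          // API not exposed at all
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;                          // null: no usable adapter
}

if (await hasWebGPU()) {
  // Safe to load the model with device: "webgpu"
} else {
  console.warn("WebGPU unavailable; see the Enable WebGPU steps above.");
}
```

Top-level `await` requires an ES module; wrap the check in an async function otherwise.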