Upload README.md with huggingface_hub

dbe7f5c verified 4 days ago

6.51 kB

	---
	license: other
	license_name: lfm1.0
	license_link: LICENSE
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- liquid
	- edge
	- lfm2
	- transcript
	- meeting
	- summarization
	- onnx
	- onnxruntime
	- webgpu
	base_model:
	- LiquidAI/LFM2-2.6B-Transcript
	---

	<div align="center">
	<img
	src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/2b08LKpev0DNEk6DlnWkY.png"
	alt="Liquid AI"
	style="width: 100%; max-width: 100%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;"
	/>
	<div style="display: flex; justify-content: center; gap: 0.5em; margin-bottom: 1em;">
	<a href="https://playground.liquid.ai/"><strong>Try LFM</strong></a> •
	<a href="https://docs.liquid.ai/lfm"><strong>Documentation</strong></a> •
	<a href="https://leap.liquid.ai/"><strong>LEAP</strong></a>
	</div>
	</div>

	# LFM2-2.6B-Transcript-ONNX

	ONNX export of [LFM2-2.6B-Transcript](https://huggingface.co/LiquidAI/LFM2-2.6B-Transcript) for cross-platform inference.

	LFM2-2.6B-Transcript is optimized for processing and summarizing meeting transcripts, extracting key points, action items, and decisions from conversational text.

	## Recommended Variants

	\| Precision \| Size \| Platform \| Use Case \|
	\|-----------\|------\|----------\|----------\|
	\| Q4 \| ~2.0GB \| WebGPU, Server \| Recommended for most uses \|
	\| FP16 \| ~4.8GB \| WebGPU, Server \| Higher quality \|
	\| Q8 \| ~3.0GB \| Server only \| Balance of quality and size \|

	- WebGPU: Use Q4 or FP16 (Q8 not supported)
	- Server: All variants supported

	## Model Files

	```
	onnx/
	├── model.onnx # FP32 model graph
	├── model.onnx_data* # FP32 weights
	├── model_fp16.onnx # FP16 model graph
	├── model_fp16.onnx_data* # FP16 weights
	├── model_q4.onnx # Q4 model graph (recommended)
	├── model_q4.onnx_data # Q4 weights
	├── model_q8.onnx # Q8 model graph
	└── model_q8.onnx_data # Q8 weights

	* Large models (>2GB) split weights across multiple files:
	model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
	All data files must be in the same directory as the .onnx file.
	```

	## Python

	### Installation

	```bash
	pip install onnxruntime transformers numpy huggingface_hub
	# or with GPU support:
	pip install onnxruntime-gpu transformers numpy huggingface_hub
	```

	### Inference

	```python
	import numpy as np
	import onnxruntime as ort
	from huggingface_hub import hf_hub_download
	from transformers import AutoTokenizer

	# Download model (Q4 recommended)
	model_id = "LiquidAI/LFM2-2.6B-Transcript-ONNX"
	model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

	# Download all data files (handles multiple splits for large models)
	from huggingface_hub import list_repo_files
	for f in list_repo_files(model_id):
	if f.startswith("onnx/model_q4.onnx_data"):
	hf_hub_download(model_id, f)

	# Load model and tokenizer
	session = ort.InferenceSession(model_path)
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

	# Prepare chat input
	messages = [{"role": "user", "content": "Summarize this meeting transcript: ..."}]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	input_ids = np.array([tokenizer.encode(prompt, add_special_tokens=False)], dtype=np.int64)

	# Initialize KV cache
	ONNX_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
	cache = {}
	for inp in session.get_inputs():
	if inp.name in {"input_ids", "attention_mask", "position_ids"}:
	continue
	shape = [d if isinstance(d, int) else 1 for d in inp.shape]
	for i, d in enumerate(inp.shape):
	if isinstance(d, str) and "sequence" in d.lower():
	shape[i] = 0
	cache[inp.name] = np.zeros(shape, dtype=ONNX_DTYPE.get(inp.type, np.float32))

	# Check if model uses position_ids
	input_names = {inp.name for inp in session.get_inputs()}
	use_position_ids = "position_ids" in input_names

	# Generate tokens
	seq_len = input_ids.shape[1]
	generated_tokens = []

	for step in range(100): # max tokens
	if step == 0:
	ids = input_ids
	pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
	else:
	ids = np.array([[generated_tokens[-1]]], dtype=np.int64)
	pos = np.array([[seq_len + len(generated_tokens) - 1]], dtype=np.int64)

	attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
	feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
	if use_position_ids:
	feed["position_ids"] = pos

	outputs = session.run(None, feed)
	next_token = int(np.argmax(outputs[0][0, -1]))
	generated_tokens.append(next_token)

	# Update cache
	for i, out in enumerate(session.get_outputs()[1:], 1):
	name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
	if name in cache:
	cache[name] = outputs[i]

	if next_token == tokenizer.eos_token_id:
	break

	print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
	```

	## WebGPU (Browser)

	### Installation

	```bash
	npm install @huggingface/transformers
	```

	### Enable WebGPU

	WebGPU is required for browser inference. To enable:

	1. Chrome/Edge: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable, and restart
	2. Verify: Check `chrome://gpu` for "WebGPU" status
	3. Test: Run `navigator.gpu.requestAdapter()` in DevTools console

	### Inference

	```javascript
	import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

	const modelId = "LiquidAI/LFM2-2.6B-Transcript-ONNX";

	// Load model and tokenizer
	const tokenizer = await AutoTokenizer.from_pretrained(modelId);
	const model = await AutoModelForCausalLM.from_pretrained(modelId, {
	device: "webgpu",
	dtype: "q4", // or "fp16"
	});

	// Prepare input
	const messages = [{ role: "user", content: "Summarize this meeting transcript: ..." }];
	const input = tokenizer.apply_chat_template(messages, {
	add_generation_prompt: true,
	return_dict: true,
	});

	// Generate with streaming
	const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
	const output = await model.generate({
	...input,
	max_new_tokens: 256,
	do_sample: false,
	streamer,
	});

	console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
	```

	### WebGPU Notes

	- Supported: Q4, FP16 (Q8 not supported on WebGPU)

	## License

	This model is released under the [LFM 1.0 License](LICENSE).