ykhrustalev committed
Commit 8e119fe · verified · 1 Parent(s): 04f7420

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +24 -67
README.md CHANGED
@@ -13,7 +13,6 @@ tags:
  - mixture-of-experts
  - onnx
  - onnxruntime
- - webgpu
  base_model:
  - LiquidAI/LFM2-8B-A1B
  ---
@@ -39,23 +38,30 @@ LFM2-MoE is a Mixture of Experts model with 8B total parameters and ~1B active p
 
  ## Recommended Variants
 
- | Precision | Size | Platform | Use Case |
- |-----------|------|----------|----------|
- | Q4F16 | ~15GB | WebGPU, Server | Recommended (Q4 MoE + FP16 dense) |
- | FP16 | ~16GB | WebGPU, Server | Higher quality |
- | Q4 | ~30GB | Server only | Full Q4 (larger due to expert weights) |
+ | Precision | Size | Use Case |
+ |-----------|------|----------|
+ | Q4F16 | ~5GB | Recommended (Q4 MoE + FP16 dense) |
+ | FP16 | ~16GB | Higher quality |
+ | Q4 | ~5GB | Smallest size |
 
- - **WebGPU**: Use Q4F16 or FP16 (requires high-memory GPU, Q4 not supported)
- - **Server**: All variants supported
+ Note: This model is too large for WebGPU browser inference.
 
  ## Model Files
 
  ```
  onnx/
- ├── model.onnx # FP32
- ├── model_fp16.onnx # FP16
- ├── model_q4.onnx # Q4
- └── model_q4f16.onnx # Q4 MoE experts + FP16 dense (recommended)
+ ├── model.onnx # FP32 model graph
+ ├── model.onnx_data* # FP32 weights
+ ├── model_fp16.onnx # FP16 model graph
+ ├── model_fp16.onnx_data* # FP16 weights
+ ├── model_q4.onnx # Q4 model graph
+ ├── model_q4.onnx_data* # Q4 weights
+ ├── model_q4f16.onnx # Q4 MoE experts + FP16 dense (recommended)
+ └── model_q4f16.onnx_data* # Q4F16 weights
+
+ * Large models (>2GB) split weights across multiple files:
+   model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
+   All data files must be in the same directory as the .onnx file.
  ```
 
  ## Python
@@ -79,7 +85,12 @@ from transformers import AutoTokenizer
  # Download model (Q4F16 recommended)
  model_id = "LiquidAI/LFM2-MoE-8B-A1B-ONNX"
  model_path = hf_hub_download(model_id, "onnx/model_q4f16.onnx")
- data_path = hf_hub_download(model_id, "onnx/model_q4f16.onnx_data")
+
+ # Download all data files (handles multiple splits for large models)
+ from huggingface_hub import list_repo_files
+ for f in list_repo_files(model_id):
+     if f.startswith("onnx/model_q4f16.onnx_data"):
+         hf_hub_download(model_id, f)
 
  # Load model and tokenizer
  session = ort.InferenceSession(model_path)
@@ -139,60 +150,6 @@ for step in range(100): # max tokens
  print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
  ```
 
- ## WebGPU (Browser)
-
- ### Installation
-
- ```bash
- npm install @huggingface/transformers
- ```
-
- ### Enable WebGPU
-
- WebGPU is required for browser inference. To enable:
-
- 1. **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable, and restart
- 2. **Verify**: Check `chrome://gpu` for "WebGPU" status
- 3. **Test**: Run `navigator.gpu.requestAdapter()` in DevTools console
-
- ### Inference
-
- ```javascript
- import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";
-
- const modelId = "LiquidAI/LFM2-MoE-8B-A1B-ONNX";
-
- // Load model and tokenizer (requires ~15GB+ VRAM)
- const tokenizer = await AutoTokenizer.from_pretrained(modelId);
- const model = await AutoModelForCausalLM.from_pretrained(modelId, {
-   device: "webgpu",
-   dtype: "q4f16", // or "fp16"
- });
-
- // Prepare input
- const messages = [{ role: "user", content: "Explain mixture of experts in one sentence." }];
- const input = tokenizer.apply_chat_template(messages, {
-   add_generation_prompt: true,
-   return_dict: true,
- });
-
- // Generate with streaming
- const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
- const output = await model.generate({
-   ...input,
-   max_new_tokens: 256,
-   do_sample: false,
-   streamer,
- });
-
- console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
- ```
-
- ### WebGPU Notes
-
- - Supported: Q4F16, FP16 (Q4 full not supported on WebGPU)
- - Requires high-memory GPU (~15GB+ VRAM)
-
  ## Model Architecture
 
  - **Total Parameters**: 8B
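
For reference, the per-file download loop added in this commit can also be written with `snapshot_download` and an `allow_patterns` glob from `huggingface_hub`, which fetches the Q4F16 graph together with every split weight shard into one local directory (where onnxruntime expects the external data to sit). This is an illustrative sketch based on the onnx/ layout shown in the diff, not code from the model card:

```python
from huggingface_hub import snapshot_download
import onnxruntime as ort

model_id = "LiquidAI/LFM2-MoE-8B-A1B-ONNX"

# Fetch onnx/model_q4f16.onnx plus every matching weight shard
# (onnx/model_q4f16.onnx_data, onnx/model_q4f16.onnx_data_1, ...).
local_dir = snapshot_download(model_id, allow_patterns=["onnx/model_q4f16.onnx*"])

# The .onnx graph and its .onnx_data* shards now sit in the same directory,
# so onnxruntime resolves the external weights automatically.
session = ort.InferenceSession(f"{local_dir}/onnx/model_q4f16.onnx")
print([inp.name for inp in session.get_inputs()])
```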
 
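The note added to the Model Files section says every `.onnx_data*` shard must sit next to its `.onnx` file. If it is unclear which shards a given graph expects, the graph itself lists their file names in its external-data metadata. A minimal sketch using the `onnx` package (assuming the files were downloaded into an `onnx/` directory as above; not part of the model card):

```python
import onnx
from onnx.external_data_helper import uses_external_data

# Load only the graph structure, skipping the external weights themselves.
model = onnx.load("onnx/model_q4f16.onnx", load_external_data=False)

# Collect the file names the initializers point at,
# e.g. {"model_q4f16.onnx_data", "model_q4f16.onnx_data_1"}.
locations = set()
for tensor in model.graph.initializer:
    if uses_external_data(tensor):
        for entry in tensor.external_data:
            if entry.key == "location":
                locations.add(entry.value)

print(sorted(locations))
```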