ykhrustalev committed
Commit 8e119fe · verified · 1 Parent(s): 04f7420

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +24 -67
README.md CHANGED
@@ -13,7 +13,6 @@ tags:
  - mixture-of-experts
  - onnx
  - onnxruntime
- - webgpu
  base_model:
  - LiquidAI/LFM2-8B-A1B
  ---
@@ -39,23 +38,30 @@ LFM2-MoE is a Mixture of Experts model with 8B total parameters and ~1B active p
 
  ## Recommended Variants
 
- | Precision | Size | Platform | Use Case |
- |-----------|------|----------|----------|
- | Q4F16 | ~15GB | WebGPU, Server | Recommended (Q4 MoE + FP16 dense) |
- | FP16 | ~16GB | WebGPU, Server | Higher quality |
- | Q4 | ~30GB | Server only | Full Q4 (larger due to expert weights) |
+ | Precision | Size | Use Case |
+ |-----------|------|----------|
+ | Q4F16 | ~5GB | Recommended (Q4 MoE + FP16 dense) |
+ | FP16 | ~16GB | Higher quality |
+ | Q4 | ~5GB | Smallest size |
 
- - **WebGPU**: Use Q4F16 or FP16 (requires high-memory GPU, Q4 not supported)
- - **Server**: All variants supported
+ Note: This model is too large for WebGPU browser inference.
 
  ## Model Files
 
  ```
  onnx/
- ├── model.onnx # FP32
- ├── model_fp16.onnx # FP16
- ├── model_q4.onnx # Q4
- └── model_q4f16.onnx # Q4 MoE experts + FP16 dense (recommended)
+ ├── model.onnx # FP32 model graph
+ ├── model.onnx_data* # FP32 weights
+ ├── model_fp16.onnx # FP16 model graph
+ ├── model_fp16.onnx_data* # FP16 weights
+ ├── model_q4.onnx # Q4 model graph
+ ├── model_q4.onnx_data* # Q4 weights
+ ├── model_q4f16.onnx # Q4 MoE experts + FP16 dense (recommended)
+ └── model_q4f16.onnx_data* # Q4F16 weights
+
+ * Large models (>2GB) split weights across multiple files:
+   model.onnx_data, model.onnx_data_1, model.onnx_data_2, etc.
+   All data files must be in the same directory as the .onnx file.
  ```
 
  ## Python
@@ -79,7 +85,12 @@ from transformers import AutoTokenizer
  # Download model (Q4F16 recommended)
  model_id = "LiquidAI/LFM2-MoE-8B-A1B-ONNX"
  model_path = hf_hub_download(model_id, "onnx/model_q4f16.onnx")
- data_path = hf_hub_download(model_id, "onnx/model_q4f16.onnx_data")
+
+ # Download all data files (handles multiple splits for large models)
+ from huggingface_hub import list_repo_files
+ for f in list_repo_files(model_id):
+     if f.startswith("onnx/model_q4f16.onnx_data"):
+         hf_hub_download(model_id, f)
 
  # Load model and tokenizer
  session = ort.InferenceSession(model_path)
@@ -139,60 +150,6 @@ for step in range(100): # max tokens
  print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
  ```
 
- ## WebGPU (Browser)
-
- ### Installation
-
- ```bash
- npm install @huggingface/transformers
- ```
-
- ### Enable WebGPU
-
- WebGPU is required for browser inference. To enable:
-
- 1. **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable, and restart
- 2. **Verify**: Check `chrome://gpu` for "WebGPU" status
- 3. **Test**: Run `navigator.gpu.requestAdapter()` in DevTools console
-
- ### Inference
-
- ```javascript
- import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";
-
- const modelId = "LiquidAI/LFM2-MoE-8B-A1B-ONNX";
-
- // Load model and tokenizer (requires ~15GB+ VRAM)
- const tokenizer = await AutoTokenizer.from_pretrained(modelId);
- const model = await AutoModelForCausalLM.from_pretrained(modelId, {
-   device: "webgpu",
-   dtype: "q4f16", // or "fp16"
- });
-
- // Prepare input
- const messages = [{ role: "user", content: "Explain mixture of experts in one sentence." }];
- const input = tokenizer.apply_chat_template(messages, {
-   add_generation_prompt: true,
-   return_dict: true,
- });
-
- // Generate with streaming
- const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
- const output = await model.generate({
-   ...input,
-   max_new_tokens: 256,
-   do_sample: false,
-   streamer,
- });
-
- console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
- ```
-
- ### WebGPU Notes
-
- - Supported: Q4F16, FP16 (Q4 full not supported on WebGPU)
- - Requires high-memory GPU (~15GB+ VRAM)
-
  ## Model Architecture
 
  - **Total Parameters**: 8B
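
For reference, the per-file download loop added in this commit can also be written with `snapshot_download` and an `allow_patterns` glob from `huggingface_hub`, which fetches the Q4F16 graph together with every split weight shard into one local directory (where onnxruntime expects the external data to sit). This is an illustrative sketch based on the onnx/ layout shown in the diff, not code from the model card:

```python
from huggingface_hub import snapshot_download
import onnxruntime as ort

model_id = "LiquidAI/LFM2-MoE-8B-A1B-ONNX"

# Fetch onnx/model_q4f16.onnx plus every matching weight shard
# (onnx/model_q4f16.onnx_data, onnx/model_q4f16.onnx_data_1, ...).
local_dir = snapshot_download(model_id, allow_patterns=["onnx/model_q4f16.onnx*"])

# The .onnx graph and its .onnx_data* shards now sit in the same directory,
# so onnxruntime resolves the external weights automatically.
session = ort.InferenceSession(f"{local_dir}/onnx/model_q4f16.onnx")
print([inp.name for inp in session.get_inputs()])
```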
 
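The note added to the Model Files section says every `.onnx_data*` shard must sit next to its `.onnx` file. If it is unclear which shards a given graph expects, the graph itself lists their file names in its external-data metadata. A minimal sketch using the `onnx` package (assuming the files were downloaded into an `onnx/` directory as above; not part of the model card):

```python
import onnx
from onnx.external_data_helper import uses_external_data

# Load only the graph structure, skipping the external weights themselves.
model = onnx.load("onnx/model_q4f16.onnx", load_external_data=False)

# Collect the file names the initializers point at,
# e.g. {"model_q4f16.onnx_data", "model_q4f16.onnx_data_1"}.
locations = set()
for tensor in model.graph.initializer:
    if uses_external_data(tensor):
        for entry in tensor.external_data:
            if entry.key == "location":
                locations.add(entry.value)

print(sorted(locations))
```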