---
license: other
license_name: lfm1.0
license_link: LICENSE
language:
- en
pipeline_tag: text-to-audio
tags:
- liquid
- edge
- lfm2.5-audio
- lfm2.5
- onnx
- onnxruntime
- webgpu
- tts
- asr
- speech
base_model:
- LiquidAI/LFM2.5-Audio-1.5B
---

<div align="center">
  <img
    src="https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/2b08LKpev0DNEk6DlnWkY.png"
    alt="Liquid AI"
    style="width: 100%; max-width: 100%; height: auto; display: inline-block; margin-bottom: 0.5em; margin-top: 0.5em;"
  />
  <div style="display: flex; justify-content: center; gap: 0.5em; margin-bottom: 1em;">
    <a href="https://playground.liquid.ai/"><strong>Try LFM</strong></a> •
    <a href="https://docs.liquid.ai/lfm"><strong>Documentation</strong></a> •
    <a href="https://leap.liquid.ai/"><strong>LEAP</strong></a>
  </div>
</div>

# LFM2.5-Audio-1.5B-ONNX

ONNX export of [LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) for cross-platform inference.

LFM2.5-Audio is a multimodal model supporting three modes:
- **ASR** (Automatic Speech Recognition): Audio → Text
- **TTS** (Text-to-Speech): Text → Audio
- **Interleaved**: Mixed text and audio input/output

## Recommended Variants

| Decoder | Vocoder | Size   | Platform       | Use Case                   |
|---------|---------|--------|----------------|----------------------------|
| Q4      | Q4      | ~1.5GB | WebGPU, Server | Recommended for most uses  |
| FP16    | FP16    | ~3.2GB | Server         | Higher quality             |

- **WebGPU**: use the Q4 decoder + Q4 vocoder (Q8 is not supported)
- **Server**: Q4 for efficiency, FP16 for quality

## Model Files

```
onnx/
├── decoder.onnx                       # LFM2 backbone (FP32)
├── decoder.onnx_data*
├── decoder_fp16.onnx                  # LFM2 backbone (FP16)
├── decoder_fp16.onnx_data*
├── decoder_q4.onnx                    # LFM2 backbone (Q4, recommended)
├── decoder_q4.onnx_data
├── audio_encoder.onnx                 # Conformer encoder for ASR (FP32)
├── audio_encoder.onnx_data
├── audio_encoder_fp16.onnx            # Conformer encoder (FP16)
├── audio_encoder_fp16.onnx_data
├── audio_encoder_q4.onnx              # Conformer encoder (Q4)
├── audio_encoder_q4.onnx_data
├── audio_embedding.onnx               # Audio code embeddings (FP32)
├── audio_embedding_fp16.onnx          # Audio code embeddings (FP16)
├── audio_embedding_q4.onnx            # Audio code embeddings (Q4)
├── audio_detokenizer.onnx             # Neural vocoder STFT (FP32)
├── audio_detokenizer.onnx_data
├── audio_detokenizer_fp16.onnx        # Neural vocoder (FP16)
├── audio_detokenizer_fp16.onnx_data
├── audio_detokenizer_q4.onnx          # Neural vocoder (Q4)
├── audio_detokenizer_q4.onnx_data
├── vocoder_depthformer.onnx           # Audio codebook prediction (FP32)
├── vocoder_depthformer.onnx_data
├── vocoder_depthformer_fp16.onnx      # Audio codebook prediction (FP16)
├── vocoder_depthformer_fp16.onnx_data
├── vocoder_depthformer_q4.onnx        # Audio codebook prediction (Q4)
├── vocoder_depthformer_q4.onnx_data
├── embed_tokens.bin                   # Text embeddings (binary)
├── embed_tokens.json                  # Text embeddings metadata
├── audio_embedding.bin                # Audio embeddings (binary, for direct lookup)
├── audio_embedding.json               # Audio embeddings metadata
└── mel_config.json                    # Mel spectrogram configuration
```

\* Models larger than 2GB split their weights across multiple external-data files (`decoder.onnx_data`, `decoder.onnx_data_1`, `decoder.onnx_data_2`, etc.). All data files must sit in the same directory as the `.onnx` file; when loading in the browser, list each part in the session's `externalData` option (see `loadSession` in the WebGPU section below).

## Python

Use the [onnx-export](https://github.com/Liquid4All/onnx-export) repository for inference.

### Installation

```bash
git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync
```

### ASR (Speech Recognition)

Transcribe audio to text:

```bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
  --mode asr \
  --audio input.wav \
  --precision q4
```

### TTS (Text-to-Speech)

Generate audio from text:

```bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
  --mode tts \
  --prompt "Hello, this is a test of text to speech synthesis." \
  --output output.wav \
  --precision q4
```

Options:
- `--system "Perform TTS. Use the UK female voice."` - custom system prompt
- `--audio-temperature 0.8` - audio sampling temperature
- `--audio-top-k 64` - top-k sampling for audio

### Interleaved (Mixed Audio/Text)

Generate an interleaved text-and-audio response from audio input:

```bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
  --mode interleaved \
  --audio input.wav \
  --output output.wav \
  --precision q4
```

Or from a text prompt:

```bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
  --mode interleaved \
  --prompt "Respond with audio" \
  --output output.wav \
  --precision q4
```

### CLI Options

```bash
uv run lfm2-audio-infer --help
```

| Option | Description |
|--------|-------------|
| `--mode` | `asr`, `tts`, or `interleaved` |
| `--precision` | `fp16`, `q4`, or `q8` (default: `fp32`) |
| `--audio` | Input audio file (WAV) |
| `--output` | Output audio file (WAV) |
| `--prompt` | Text prompt |
| `--system` | System prompt |
| `--max-tokens` | Maximum number of tokens to generate |
| `--temperature` | Text sampling temperature |
| `--audio-temperature` | Audio sampling temperature |
| `--audio-top-k` | Top-k sampling for audio |
| `--seed` | Random seed for reproducibility |
179
+ ## WebGPU (Browser)
180
+
181
+ ### Installation
182
+
183
+ ```bash
184
+ npm install onnxruntime-web @huggingface/transformers
185
+ ```
186
+
187
+ ### Enable WebGPU
188
+
189
+ WebGPU is required for browser inference. To enable:
190
+
191
+ 1. **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable, and restart
192
+ 2. **Verify**: Check `chrome://gpu` for "WebGPU" status
193
+ 3. **Test**: Run `navigator.gpu.requestAdapter()` in DevTools console
194
+
195
+ ### Inference
196
+
197
+ ```javascript
198
+ import * as ort from "onnxruntime-web/webgpu";
199
+ import { AutoTokenizer } from "@huggingface/transformers";
200
+
201
+ // Check WebGPU availability
202
+ if (!navigator.gpu) {
203
+ throw new Error("WebGPU not available. Enable at chrome://flags/#enable-unsafe-webgpu");
204
+ }
205
+
206
+ ort.env.wasm.numThreads = 1;
207
+
208
+ const modelId = "LiquidAI/LFM2.5-Audio-1.5B-ONNX";
209
+ const modelBase = `https://huggingface.co/${modelId}/resolve/main`;
210
+
211
+ // Load tokenizer
212
+ const tokenizer = await AutoTokenizer.from_pretrained(modelId);
213
+
214
+ // Load ONNX sessions
215
+ async function loadSession(name, dataFiles = 1) {
216
+ const onnxPath = `${modelBase}/onnx/${name}.onnx`;
217
+ const externalData = [];
218
+ for (let i = 0; i < dataFiles; i++) {
219
+ const suffix = i === 0 ? "" : `_${i}`;
220
+ const fileName = `${name}.onnx_data${suffix}`;
221
+ externalData.push({ path: fileName, data: `${modelBase}/onnx/${fileName}` });
222
+ }
223
+ return ort.InferenceSession.create(onnxPath, {
224
+ executionProviders: ["webgpu"],
225
+ externalData,
226
+ });
227
+ }
228
+
229
+ // Load models (Q4 recommended for WebGPU)
230
+ const decoder = await loadSession("decoder_q4");
231
+ const audioEmbedding = await loadSession("audio_embedding_q4");
232
+ const detokenizer = await loadSession("audio_detokenizer_q4");
233
+ const depthformer = await loadSession("vocoder_depthformer_q4");
234
+
235
+ // Load text embeddings binary
236
+ const embedResponse = await fetch(`${modelBase}/onnx/embed_tokens.bin`);
237
+ const embedBuffer = await embedResponse.arrayBuffer();
238
+ const embedMetaResponse = await fetch(`${modelBase}/onnx/embed_tokens.json`);
239
+ const embedMeta = await embedMetaResponse.json();
240
+ const embedWeight = new Float32Array(embedBuffer);
241
+
242
+ function getTextEmbeddings(ids) {
243
+ const hiddenSize = embedMeta.hidden_size;
244
+ const embeds = new Float32Array(ids.length * hiddenSize);
245
+ for (let i = 0; i < ids.length; i++) {
246
+ const offset = ids[i] * hiddenSize;
247
+ embeds.set(embedWeight.subarray(offset, offset + hiddenSize), i * hiddenSize);
248
+ }
249
+ return new ort.Tensor("float32", embeds, [1, ids.length, hiddenSize]);
250
+ }
251
+
252
+ // Model config
253
+ const hiddenSize = 2048;
254
+ const numCodebooks = 8;
255
+ const codebookVocab = 2049;
256
+
257
+ // TTS example
258
+ const text = "Hello, this is a test.";
259
+ const prompt = `<|startoftext|><|im_start|>system
260
+ Perform TTS. Use the UK female voice.<|im_end|>
261
+ <|im_start|>user
262
+ ${text}<|im_end|>
263
+ <|im_start|>assistant
264
+ `;
265
+
266
+ const inputIds = tokenizer.encode(prompt);
267
+ let embeds = getTextEmbeddings(inputIds);
268
+
269
+ // Initialize KV cache
270
+ const cache = {};
271
+ for (const name of decoder.inputNames) {
272
+ if (name.startsWith("past_conv")) {
273
+ cache[name] = new ort.Tensor("float32", new Float32Array(hiddenSize * 3), [1, hiddenSize, 3]);
274
+ } else if (name.startsWith("past_key_values")) {
275
+ cache[name] = new ort.Tensor("float32", new Float32Array(0), [1, 8, 0, 64]);
276
+ }
277
+ }
278
+
279
+ // Generation loop
280
+ const audioCodes = [];
281
+ let inAudioMode = false;
282
+ let curLen = inputIds.length;
283
+
284
+ for (let step = 0; step < 1024; step++) {
285
+ const attentionMask = new ort.Tensor("int64", new BigInt64Array(curLen).fill(1n), [1, curLen]);
286
+ const outputs = await decoder.run({ inputs_embeds: embeds, attention_mask: attentionMask, ...cache });
287
+
288
+ // Update cache
289
+ for (const [name, tensor] of Object.entries(outputs)) {
290
+ if (name.startsWith("present_conv")) {
291
+ cache[name.replace("present_conv", "past_conv")] = tensor;
292
+ } else if (name.startsWith("present.")) {
293
+ cache[name.replace("present.", "past_key_values.")] = tensor;
294
+ }
295
+ }
296
+
297
+ if (inAudioMode) {
298
+ // Use depthformer to generate audio codes
299
+ const hiddenStates = outputs.hidden_states;
300
+ const lastHidden = /* extract last position */;
301
+
302
+ // Autoregressive codebook generation (8 steps per frame)
303
+ const frameCodes = await generateAudioFrame(depthformer, lastHidden);
304
+
305
+ if (frameCodes[0] === 2048) {
306
+ // End of audio
307
+ break;
308
+ }
309
+
310
+ audioCodes.push(frameCodes);
311
+
312
+ // Get audio embeddings for feedback
313
+ const audioTokens = frameCodes.map((code, cb) => cb * codebookVocab + code);
314
+ const audioEmbedsResult = await audioEmbedding.run({
315
+ audio_codes: new ort.Tensor("int64", new BigInt64Array(audioTokens.map(BigInt)), [1, 8])
316
+ });
317
+ // Sum embeddings across codebooks
318
+ embeds = sumEmbeddings(audioEmbedsResult.audio_embeds);
319
+ } else {
320
+ // Text generation
321
+ const logits = outputs.logits;
322
+ const nextToken = argmax(logits);
323
+
324
+ if (nextToken === 128) {
325
+ // <|audio_start|> - switch to audio mode
326
+ inAudioMode = true;
327
+ }
328
+
329
+ embeds = getTextEmbeddings([nextToken]);
330
+ }
331
+
332
+ curLen++;
333
+ }
334
+
335
+ // Decode audio codes to waveform using detokenizer + ISTFT
336
+ const waveform = await decodeAudio(detokenizer, audioCodes);
337
+ ```
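
The loop above leans on helpers that are not part of the export. Below is a minimal sketch of three of them; the depthformer input/output names (`hidden_state`, `codebook_idx`, `prev_code`, `logits`) are illustrative assumptions, so inspect `depthformer.inputNames` and `outputNames` for the real signature. `decodeAudio` (running the detokenizer over the collected codes and reconstructing the waveform with an ISTFT) is omitted for brevity; the Python implementation in the [onnx-export](https://github.com/Liquid4All/onnx-export) repository shows the reference flow.

```javascript
// Hypothetical helper sketches for the example above.

// Greedy argmax over the last position of a logits tensor
// (works for both [1, seqLen, vocab] and [1, vocab] shapes).
function argmax(logits) {
  const vocab = logits.dims[logits.dims.length - 1];
  const offset = logits.data.length - vocab; // last position
  let best = 0;
  for (let i = 1; i < vocab; i++) {
    if (logits.data[offset + i] > logits.data[offset + best]) best = i;
  }
  return best;
}

// Sum per-codebook embeddings [1, numCodebooks, hidden] into a single
// [1, 1, hidden] input embedding for the next decoder step.
function sumEmbeddings(audioEmbeds) {
  const [, numCodebooks, hidden] = audioEmbeds.dims;
  const out = new Float32Array(hidden);
  for (let cb = 0; cb < numCodebooks; cb++) {
    for (let j = 0; j < hidden; j++) out[j] += audioEmbeds.data[cb * hidden + j];
  }
  return new ort.Tensor("float32", out, [1, 1, hidden]);
}

// One audio frame = 8 autoregressive depthformer calls, one per codebook.
// NOTE: the input/output names here are assumptions, not the documented API.
async function generateAudioFrame(depthformer, lastHidden) {
  const frameCodes = [];
  let prevCode = 0;
  for (let cb = 0; cb < 8; cb++) {
    const outputs = await depthformer.run({
      hidden_state: lastHidden,
      codebook_idx: new ort.Tensor("int64", BigInt64Array.from([BigInt(cb)]), [1]),
      prev_code: new ort.Tensor("int64", BigInt64Array.from([BigInt(prevCode)]), [1]),
    });
    const code = argmax(outputs.logits); // greedy, to keep the sketch short
    frameCodes.push(code);
    prevCode = code;
  }
  return frameCodes;
}
```

Greedy decoding keeps the sketch short; the CLI's `--audio-temperature` and `--audio-top-k` options correspond to replacing the `argmax` over audio logits with top-k sampling.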

### WebGPU Notes

- Recommended: Q4 models for all components
- Audio generation is autoregressive: 8 depthformer calls per audio frame
- Each audio frame = 80ms of audio (24kHz, 320 hop length, 6x upsampling)
- The end-of-audio token is 2048 in any codebook
- Large models (>2GB) split weights across multiple files

## Audio Processing Details

### Input (ASR)
- Sample rate: 16kHz
- Mel spectrogram: 128 bins, 512 FFT, 160 hop, 400 window
- Pre-emphasis: 0.97 (see the sketch below)
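
As a quick illustration of these parameters, here is a sketch of the pre-emphasis filter and the resulting mel frame count, assuming the conventional filter form y[n] = x[n] - 0.97·x[n-1] and no center padding (both are assumptions; `mel_config.json` holds the authoritative settings):

```javascript
// ASR front-end arithmetic at 16kHz (conventions assumed, see mel_config.json)
function preEmphasize(samples, coeff = 0.97) {
  const out = new Float32Array(samples.length);
  out[0] = samples[0];
  for (let n = 1; n < samples.length; n++) {
    out[n] = samples[n] - coeff * samples[n - 1]; // y[n] = x[n] - 0.97 x[n-1]
  }
  return out;
}

// Frames from a 400-sample window (25ms) sliding by a 160-sample hop (10ms)
const numMelFrames = (numSamples) => Math.floor((numSamples - 400) / 160) + 1;
```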

### Output (TTS)
- Sample rate: 24kHz
- 8 codebooks with 2049 tokens each (0-2047 audio, 2048 end-of-audio)
- STFT reconstruction: 1280 FFT, 320 hop
- Detokenizer provides 6x temporal upsampling (frame timing worked out below)
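
These parameters fix the output frame timing: a 320-sample hop with 6x upsampling yields 1920 samples per generated frame, which at 24kHz is the 80ms per frame quoted in the WebGPU notes:

```javascript
// Output timing implied by the TTS parameters above
const sampleRate = 24000;
const hopLength = 320;
const upsampling = 6;

const samplesPerFrame = hopLength * upsampling;           // 1920
const msPerFrame = (1000 * samplesPerFrame) / sampleRate; // 80ms per frame
const durationSec = (numFrames) => (numFrames * samplesPerFrame) / sampleRate;

console.log(msPerFrame);       // 80
console.log(durationSec(100)); // 8 -> 100 frames ≈ 8s of audio
```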

## License

This model is released under the [LFM 1.0 License](LICENSE).