base_model:
- LiquidAI/LFM2-1.2B
---
# mlx-community/LFM2.5-Audio-1.5B-4bit

This model was converted to MLX format from [`LiquidAI/LFM2.5-Audio-1.5B`](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) using mlx-audio version **0.3.0**.

Refer to the [original model card](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) for more details on the model.

## Use with mlx-audio

```bash
pip install -U mlx-audio
```
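
You can also run the model directly from the command line via the `mlx_audio.sts.generate` entry point (a minimal invocation; available flags may vary across mlx-audio versions):

```bash
python -m mlx_audio.sts.generate --model mlx-community/LFM2.5-Audio-1.5B-4bit --audio "audio.wav"
```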

## Features

- **Text-to-Speech (TTS)**: Generate natural speech from text
- **Speech-to-Text (ASR)**: Transcribe audio to text
- **Speech-to-Speech (STS)**: Voice conversations with audio input and output
- **Interleaved Generation**: Mixed text and audio responses in a single turn
- **Streaming**: Real-time token-by-token generation for low-latency applications

## Quick Start

### Text-to-Speech (TTS)

```python
import mlx.core as mx
import soundfile as sf
from mlx_audio.sts.models.lfm_audio import (
    LFM2AudioModel,
    LFM2AudioProcessor,
    ChatState,
    LFMModality,
)

# Load model and processor
model = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-4bit")
processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-4bit")

# Create chat state
chat = ChatState(processor)
chat.new_turn("system")
chat.add_text("Respond with audio.")
chat.end_turn()
chat.new_turn("user")
chat.add_text("Say: Hello, welcome to MLX Audio!")
chat.end_turn()
chat.new_turn("assistant")

# Generate with interleaved text and audio
text_out, audio_out = [], []
for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=2048):
    mx.eval(token)
    if modality == LFMModality.TEXT:
        text_out.append(token)
        print(processor.decode_text(token[None]), end="", flush=True)
    else:
        audio_out.append(token)

# Decode audio - each token is a (8,) frame covering all codebooks;
# the trailing frame is dropped before stacking
if audio_out:
    audio_codes = mx.stack(audio_out[:-1], axis=1)[None, :]  # (1, 8, T)
    waveform = processor.decode_with_detokenizer(audio_codes)
    # Or use the Mimi codec: waveform = processor.decode_audio(audio_codes[0])

    # Save audio (24 kHz sample rate)
    sf.write("output.wav", waveform[0].tolist(), 24000)
```

### Speech-to-Text (ASR)

```python
import mlx.core as mx
import numpy as np
import soundfile as sf
from mlx_audio.sts.models.lfm_audio import (
    LFM2AudioModel,
    LFM2AudioProcessor,
    ChatState,
    LFMModality,
)

# Load model and processor
model = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-4bit")
processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-4bit")

# Load audio (must be 24 kHz for audio input)
audio, sr = sf.read("input.wav")
audio = mx.array(audio.astype(np.float32))

# Create chat state with audio input
chat = ChatState(processor)
chat.new_turn("user")
chat.add_audio(audio, sample_rate=sr)
chat.add_text("Transcribe the audio.")
chat.end_turn()
chat.new_turn("assistant")

# Generate text response
text_out = []
for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=512):
    mx.eval(token)
    if modality == LFMModality.TEXT:
        text_out.append(token)
        print(processor.decode_text(token[None]), end="", flush=True)
```
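
To collect the transcript as a single string instead of streaming it, the gathered tokens can be decoded in one call (a sketch; it assumes `decode_text` accepts a 1-D token array, which is consistent with the `token[None]` calls above):

```python
# Assumes decode_text handles a 1-D array of token ids, not just a single token.
if text_out:
    transcript = processor.decode_text(mx.stack(text_out))
    print("\nTranscript:", transcript)
```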

### Speech-to-Speech (STS)

```python
import mlx.core as mx
import numpy as np
import soundfile as sf
from mlx_audio.sts.models.lfm_audio import (
    LFM2AudioModel,
    LFM2AudioProcessor,
    ChatState,
    LFMModality,
)

# Load model and processor
model = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-4bit")
processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-4bit")

# Load input audio (24 kHz)
audio, sr = sf.read("input.wav")
audio = mx.array(audio.astype(np.float32))

# Create chat state with audio input
chat = ChatState(processor)
chat.new_turn("system")
chat.add_text("Respond with interleaved text and audio.")
chat.end_turn()
chat.new_turn("user")
chat.add_audio(audio, sample_rate=sr)
chat.end_turn()
chat.new_turn("assistant")

# Generate response with both text and audio
text_out, audio_out = [], []
for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=2048):
    mx.eval(token)
    if modality == LFMModality.TEXT:
        text_out.append(token)
        print(processor.decode_text(token[None]), end="", flush=True)
    else:
        audio_out.append(token)

# Decode audio response
if audio_out:
    audio_codes = mx.stack(audio_out[:-1], axis=1)[None, :]  # (1, 8, T)
    waveform = processor.decode_with_detokenizer(audio_codes)
    sf.write("response.wav", waveform[0].tolist(), 24000)
```

## Interleaved Text and Audio Generation

LFM2.5-Audio uses `generate_interleaved` for mixed text and audio output. The model can respond with text, audio, or both interleaved in a single turn.

Each audio token returned by `generate_interleaved` is a complete frame of shape `(8,)` containing all 8 codebook values:

```python
from mlx_audio.sts.models.lfm_audio import LFMModality

text_out, audio_out = [], []
for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=2048):
    mx.eval(token)
    if modality == LFMModality.TEXT:
        text_out.append(token)
        # Stream text output
        print(processor.decode_text(token[None]), end="", flush=True)
    else:  # LFMModality.AUDIO_OUT
        audio_out.append(token)  # token shape: (8,)

# Stack audio frames: list of (8,) -> (8, T), then add a batch axis
if audio_out:
    audio_codes = mx.stack(audio_out[:-1], axis=1)[None, :]  # (1, 8, T)
    waveform = processor.decode_with_detokenizer(audio_codes)
```

## Audio Decoding Options

LFM2.5-Audio supports two methods for decoding audio codes to waveforms.

### 1. Detokenizer (Recommended for TTS)

The neural detokenizer reconstructs the waveform via ISTFT from predicted spectrograms:

```python
# Decode (8, T) codes with the detokenizer; note the added batch axis
audio = processor.decode_with_detokenizer(codes[None])  # (1, T_audio)
```

### 2. Mimi Codec

The Mimi neural codec provides an alternative decoding path:

```python
# Decode (8, T) codes with the Mimi codec
audio = processor.decode_audio(codes)  # (1, 1, T_audio)
```
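
The two paths return different shapes, as the comments above note: `(1, T_audio)` from the detokenizer and `(1, 1, T_audio)` from Mimi. A small sketch (the `save_waveform` helper is hypothetical, written only against those documented shapes) that normalizes either output before writing a file:

```python
import numpy as np
import soundfile as sf

def save_waveform(audio, path, sample_rate=24000):
    """Flatten a (1, T) or (1, 1, T) decoder output and write it as WAV."""
    samples = np.array(audio).reshape(-1)  # drop batch/channel axes
    sf.write(path, samples, sample_rate)

save_waveform(processor.decode_with_detokenizer(codes[None]), "detok.wav")
save_waveform(processor.decode_audio(codes), "mimi.wav")
```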

## Generation Configuration

```python
from mlx_audio.sts.models.lfm_audio import GenerationConfig

config = GenerationConfig(
    max_new_tokens=2048,    # Maximum tokens to generate
    temperature=0.9,        # Text sampling temperature
    top_k=50,               # Text top-k sampling
    top_p=1.0,              # Text nucleus sampling
    audio_temperature=0.7,  # Audio sampling temperature
    audio_top_k=30,         # Audio top-k sampling
)
```
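
The card does not show how a `GenerationConfig` is passed to the model. Since its field names mirror the keyword arguments of `generate_interleaved` (see the API reference below), one plausible pattern, assuming `GenerationConfig` is a plain dataclass and the generate call accepts all of its fields, is:

```python
from dataclasses import asdict

# Assumption: GenerationConfig is a dataclass whose fields all match
# generate_interleaved keyword arguments; if not, pass them individually.
for token, modality in model.generate_interleaved(**dict(chat), **asdict(config)):
    mx.eval(token)
    # ... handle (token, modality) as in the Quick Start examples
```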

## Streaming Generation

For real-time audio playback during generation:

```python
from mlx_audio.sts.models.lfm_audio import LFMModality

FRAMES_PER_CHUNK = 10  # Decode every 10 audio frames

audio_buffer = []
for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=2048):
    mx.eval(token)
    if modality == LFMModality.AUDIO_OUT:
        audio_buffer.append(token)

        # Decode when we have enough frames
        if len(audio_buffer) >= FRAMES_PER_CHUNK:
            codes = mx.stack(audio_buffer, axis=1)[None, :]  # (1, 8, T)
            chunk = processor.decode_with_detokenizer(codes)
            # Play chunk with your audio library...
            audio_buffer = []

    elif modality == LFMModality.TEXT:
        # Stream text output
        print(processor.decode_text(token[None]), end="", flush=True)

# Flush any frames left in the buffer after generation ends
if audio_buffer:
    codes = mx.stack(audio_buffer, axis=1)[None, :]
    chunk = processor.decode_with_detokenizer(codes)
```
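
As a concrete stand-in for "your audio library", each decoded chunk can be played with the third-party `sounddevice` package (an assumption of this sketch; it is not a dependency of mlx-audio):

```python
import numpy as np
import sounddevice as sd

def play_chunk(chunk, sample_rate=24000):
    """Play one decoded (1, T) chunk; blocks until playback finishes."""
    sd.play(np.array(chunk).reshape(-1), samplerate=sample_rate, blocking=True)
```

Blocking playback per chunk is the simplest option but can leave small gaps between chunks; for gapless streaming, write chunks into a single `sounddevice.OutputStream` instead.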

## Model Architecture

LFM2.5-Audio consists of:

- **Audio Encoder**: Conformer-based encoder for processing input audio
- **LFM Backbone**: Liquid Foundation Model (LFM2-1.2B) for multimodal reasoning; the combined audio model totals roughly 1.5B parameters
- **Audio Decoder**: Depthformer for generating audio codes
- **Detokenizer**: ISTFT-based neural vocoder for waveform reconstruction

## API Reference

### LFM2AudioModel

```python
class LFM2AudioModel:
    @classmethod
    def from_pretrained(cls, model_name: str) -> "LFM2AudioModel":
        """Load pretrained model from HuggingFace Hub."""

    def generate_interleaved(
        self,
        text_tokens: mx.array,
        audio_features: mx.array,
        modalities: mx.array,
        max_new_tokens: int = 512,
        temperature: float = 0.9,
        audio_temperature: float = 0.7,
        audio_top_k: int = 30,
    ) -> Generator[Tuple[mx.array, LFMModality], None, None]:
        """Generate interleaved text and audio tokens.

        Yields:
            (token, modality) tuples where:
            - For TEXT: token is scalar, modality is LFMModality.TEXT
            - For AUDIO_OUT: token is a (8,) array, modality is LFMModality.AUDIO_OUT
        """
```

### LFM2AudioProcessor

```python
class LFM2AudioProcessor:
    @classmethod
    def from_pretrained(cls, model_name: str) -> "LFM2AudioProcessor":
        """Load pretrained processor from HuggingFace Hub."""

    def preprocess_audio(self, audio: mx.array, sample_rate: int) -> mx.array:
        """Convert audio to mel spectrogram features."""

    def tokenize_audio(self, audio: mx.array, sample_rate: int) -> mx.array:
        """Tokenize audio using the Mimi codec."""

    def decode_audio(self, codes: mx.array) -> mx.array:
        """Decode audio codes using the Mimi codec."""

    def decode_with_detokenizer(self, codes: mx.array) -> mx.array:
        """Decode audio codes using the neural detokenizer."""

    def tokenize_text(self, text: str) -> mx.array:
        """Tokenize text."""

    def decode_text(self, tokens: mx.array) -> str:
        """Decode text tokens."""
```
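
A quick way to sanity-check the codec path is a tokenize/decode round trip using only the methods above (`input.wav` is a placeholder file):

```python
import mlx.core as mx
import numpy as np
import soundfile as sf

audio, sr = sf.read("input.wav")
audio = mx.array(audio.astype(np.float32))

codes = processor.tokenize_audio(audio, sample_rate=sr)  # Mimi codes
recon = processor.decode_audio(codes)                    # back to a waveform
sf.write("roundtrip.wav", np.array(recon).reshape(-1), 24000)
```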

### ChatState

```python
class ChatState:
    def __init__(self, processor: LFM2AudioProcessor):
        """Initialize chat state."""

    def new_turn(self, role: str):
        """Start a new turn (user/assistant/system)."""

    def end_turn(self):
        """End the current turn."""

    def add_text(self, text: str):
        """Add text to current turn."""

    def add_audio(self, audio: mx.array, sample_rate: int):
        """Add audio to current turn."""
```

## License

This implementation follows the license terms of the original LFM2.5-Audio model.
See [LiquidAI/LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) for details.