prince-canuma committed
Commit f14a5a4 · verified · 1 parent: 29bec2a

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +37 -334
  2. model.safetensors +2 -2
  3. model.safetensors.index.json +65 -65
README.md CHANGED
@@ -1,348 +1,51 @@
  ---
- license_name: lfm1.0
- license_link: LICENSE
  language:
  - en
- base_model:
- - LiquidAI/LFM2.5-Audio-1.5B
- pipeline_tag: audio-to-audio
- library_name: mlx-audio
  tags:
- - audio-to-audio
- - speech
- - speech generation
- - voice isolation
- - sts
- - mlx
  - liquid
- - lfm2.5
- - edge
  - audio
  ---
- # mlx-community/LFM2.5-Audio-1.5B-bf16
- This model was converted to MLX format from [`LiquidAI/LFM2.5-Audio-1.5B`](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) using mlx-audio version **0.3.0**.
- Refer to the [original model card](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) for more details on the model.
-
- ## Use with mlx
- ```bash
- pip install -U mlx-audio
- ```
-
- ## Features
-
- - **Text-to-Speech (TTS)**: Generate natural speech from text
- - **Speech-to-Text (ASR)**: Transcribe audio to text
- - **Speech-to-Speech (STS)**: Voice conversations with audio input and output
- - **Interleaved Generation**: Mixed text and audio responses in a single turn
- - **Streaming**: Real-time token-by-token generation for low-latency applications
-
- ## Quick Start
-
- ### Text-to-Speech (TTS)
-
- ```python
- import mlx.core as mx
- from mlx_audio.sts.models.lfm_audio import (
-     LFM2AudioModel,
-     LFM2AudioProcessor,
-     ChatState,
-     LFMModality,
- )
-
- # Load model and processor
- model = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-bf16")
- processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-bf16")
-
- # Create chat state
- chat = ChatState(processor)
- chat.new_turn("system")
- chat.add_text("Respond with audio.")
- chat.end_turn()
- chat.new_turn("user")
- chat.add_text("Say: Hello, welcome to MLX Audio!")
- chat.end_turn()
- chat.new_turn("assistant")
-
- # Generate with interleaved text and audio
- text_out, audio_out = [], []
- for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=2048):
-     mx.eval(token)
-     if modality == LFMModality.TEXT:
-         text_out.append(token)
-         print(processor.decode_text(token[None]), end="", flush=True)
-     else:
-         audio_out.append(token)
-
- # Decode audio - each token is (8,) for all codebooks
- if audio_out:
-     audio_codes = mx.stack(audio_out[:-1], axis=1)[None, :]  # (1, 8, T)
-     waveform = processor.decode_with_detokenizer(audio_codes)
-     # Or use Mimi codec: waveform = processor.decode_audio(audio_codes[0])
-
- # Save audio (24kHz sample rate)
- import soundfile as sf
- sf.write("output.wav", waveform[0].tolist(), 24000)
- ```
-
- ### Speech-to-Text (ASR)
-
- ```python
- import mlx.core as mx
- import numpy as np
- import soundfile as sf
- from mlx_audio.sts.models.lfm_audio import (
-     LFM2AudioModel,
-     LFM2AudioProcessor,
-     ChatState,
-     LFMModality,
- )
-
- # Load model and processor
- model = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-bf16")
- processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-bf16")
-
- # Load audio (must be 24kHz for audio input)
- audio, sr = sf.read("input.wav")
- audio = mx.array(audio.astype(np.float32))
-
- # Create chat state with audio input
- chat = ChatState(processor)
- chat.new_turn("user")
- chat.add_audio(audio, sample_rate=sr)
- chat.add_text("Transcribe the audio.")
- chat.end_turn()
- chat.new_turn("assistant")
-
- # Generate text response
- text_out = []
- for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=512):
-     mx.eval(token)
-     if modality == LFMModality.TEXT:
-         text_out.append(token)
-         print(processor.decode_text(token[None]), end="", flush=True)
- ```
-
- ### Speech-to-Speech (STS)
-
- ```python
- import mlx.core as mx
- import numpy as np
- import soundfile as sf
- from mlx_audio.sts.models.lfm_audio import (
-     LFM2AudioModel,
-     LFM2AudioProcessor,
-     ChatState,
-     LFMModality,
- )
-
- # Load model and processor
- model = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-bf16")
- processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-bf16")
-
- # Load input audio (24kHz)
- audio, sr = sf.read("input.wav")
- audio = mx.array(audio.astype(np.float32))
-
- # Create chat state with audio input
- chat = ChatState(processor)
- chat.new_turn("system")
- chat.add_text("Respond with interleaved text and audio.")
- chat.end_turn()
- chat.new_turn("user")
- chat.add_audio(audio, sample_rate=sr)
- chat.end_turn()
- chat.new_turn("assistant")
-
- # Generate response with both text and audio
- text_out, audio_out = [], []
- for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=2048):
-     mx.eval(token)
-     if modality == LFMModality.TEXT:
-         text_out.append(token)
-         print(processor.decode_text(token[None]), end="", flush=True)
-     else:
-         audio_out.append(token)
-
- # Decode audio response
- if audio_out:
-     audio_codes = mx.stack(audio_out[:-1], axis=1)[None, :]  # (1, 8, T)
-     waveform = processor.decode_with_detokenizer(audio_codes)
-     sf.write("response.wav", waveform[0].tolist(), 24000)
- ```
-
- ## Interleaved Text and Audio Generation
-
- LFM2.5-Audio uses `generate_interleaved` for mixed text and audio output. The model can respond with text, audio, or both interleaved in a single response.
-
- Each audio token returned by `generate_interleaved` is a complete frame of shape `(8,)` containing all 8 codebook values:
-
- ```python
- from mlx_audio.sts.models.lfm_audio import LFMModality
-
- text_out, audio_out = [], []
- for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=2048):
-     mx.eval(token)
-     if modality == LFMModality.TEXT:
-         text_out.append(token)
-         # Stream text output
-         print(processor.decode_text(token[None]), end="", flush=True)
-     else:  # LFMModality.AUDIO_OUT
-         audio_out.append(token)  # token shape: (8,)
-
- # Stack audio frames: list of (8,) -> (8, T)
- if audio_out:
-     audio_codes = mx.stack(audio_out[:-1], axis=1)[None, :]  # (1, 8, T)
-     waveform = processor.decode_with_detokenizer(audio_codes)
- ```
-
- ## Audio Decoding Options
-
- LFM2.5-Audio supports two methods for decoding audio codes to waveforms:
-
- ### 1. Detokenizer (Recommended for TTS)
-
- The neural detokenizer reconstructs audio using ISTFT from predicted spectrograms:
-
- ```python
- # Decode using detokenizer
- audio = processor.decode_with_detokenizer(codes[None])  # (1, T_audio)
- ```
-
- ### 2. Mimi Codec
-
- The Mimi neural codec provides an alternative decoding path:
-
- ```python
- # Decode using Mimi codec
- audio = processor.decode_audio(codes, codec="mimi")  # (1, 1, T_audio)
- ```
-
- ## Generation Configuration
-
- ```python
- from mlx_audio.sts.models.lfm_audio import GenerationConfig
-
- config = GenerationConfig(
-     max_new_tokens=2048,    # Maximum tokens to generate
-     temperature=0.9,        # Text sampling temperature
-     top_k=50,               # Text top-k sampling
-     top_p=1.0,              # Text nucleus sampling
-     audio_temperature=0.7,  # Audio sampling temperature
-     audio_top_k=30,         # Audio top-k sampling
- )
- ```
-
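- These fields mirror the keyword arguments accepted by `generate_interleaved` (see the API reference below). A minimal usage sketch, assuming the config fields are forwarded manually as keyword arguments (how `GenerationConfig` is consumed internally is not documented here):
-
- ```python
- # Sketch: forward only the documented generate_interleaved kwargs
- for token, modality in model.generate_interleaved(
-     **dict(chat),
-     max_new_tokens=config.max_new_tokens,
-     temperature=config.temperature,
-     audio_temperature=config.audio_temperature,
-     audio_top_k=config.audio_top_k,
- ):
-     mx.eval(token)
-     # ...handle (token, modality) as in the examples above
- ```
-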
- ## Streaming Generation
-
- For real-time audio playback during generation:
-
- ```python
- from mlx_audio.sts.models.lfm_audio import LFMModality
-
- FRAMES_PER_CHUNK = 10  # Decode every 10 audio frames
-
- audio_buffer = []
- for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=2048):
-     mx.eval(token)
-     if modality == LFMModality.AUDIO_OUT:
-         audio_buffer.append(token)
-
-         # Decode when we have enough frames
-         if len(audio_buffer) >= FRAMES_PER_CHUNK:
-             codes = mx.stack(audio_buffer, axis=1)[None, :]  # (1, 8, T)
-             chunk = processor.decode_with_detokenizer(codes)
-             # Play chunk with your audio library...
-             audio_buffer = []
-
-     elif modality == LFMModality.TEXT:
-         # Stream text output
-         print(processor.decode_text(token[None]), end="", flush=True)
- ```
-
- ## Model Architecture
-
- LFM2.5-Audio consists of:
-
- - **Audio Encoder**: Conformer-based encoder for processing input audio
- - **LFM Backbone**: 1.5B-parameter Liquid Foundation Model for multimodal reasoning
- - **Audio Decoder**: Depthformer for generating audio codes
- - **Detokenizer**: ISTFT-based neural vocoder for waveform reconstruction
-
- ## API Reference
-
- ### LFM2AudioModel
-
- ```python
- class LFM2AudioModel:
-     @classmethod
-     def from_pretrained(cls, model_name: str) -> "LFM2AudioModel":
-         """Load pretrained model from HuggingFace Hub."""
-
-     def generate_interleaved(
-         self,
-         text_tokens: mx.array,
-         audio_features: mx.array,
-         modalities: mx.array,
-         max_new_tokens: int = 512,
-         temperature: float = 0.9,
-         audio_temperature: float = 0.7,
-         audio_top_k: int = 30,
-     ) -> Generator[Tuple[mx.array, LFMModality], None, None]:
-         """Generate interleaved text and audio tokens.
-
-         Yields:
-             (token, modality) tuples where:
-             - For TEXT: token is scalar, modality is LFMModality.TEXT
-             - For AUDIO_OUT: token is (8,) array, modality is LFMModality.AUDIO_OUT
-         """
- ```
-
- ### LFM2AudioProcessor
-
- ```python
- class LFM2AudioProcessor:
-     @classmethod
-     def from_pretrained(cls, model_name: str) -> "LFM2AudioProcessor":
-         """Load pretrained processor from HuggingFace Hub."""
-
-     def preprocess_audio(self, audio: mx.array, sample_rate: int) -> mx.array:
-         """Convert audio to mel spectrogram features."""
-
-     def tokenize_audio(self, audio: mx.array, sample_rate: int) -> mx.array:
-         """Tokenize audio using Mimi codec."""
-
-     def decode_audio(self, codes: mx.array, codec="detokenizer") -> mx.array:
-         """Decode audio codes using Detokenizer or Mimi codec."""
-
-     def tokenize_text(self, text: str) -> mx.array:
-         """Tokenize text."""
-
-     def decode_text(self, tokens: mx.array) -> str:
-         """Decode text tokens."""
- ```
-
- ### ChatState
-
- ```python
- class ChatState:
-     def __init__(self, processor: LFM2AudioProcessor):
-         """Initialize chat state."""
-
-     def new_turn(self, role: str):
-         """Start a new turn (user/assistant/system)."""
-
-     def end_turn(self):
-         """End the current turn."""
-
-     def add_text(self, text: str):
-         """Add text to current turn."""
-
-     def add_audio(self, audio: mx.array, sample_rate: int):
-         """Add audio to current turn."""
- ```
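-
- Turns can be chained to build multi-turn prompts. A minimal sketch, assuming prior assistant replies are replayed explicitly as text (whether generated tokens are appended to the state automatically is not documented here):
-
- ```python
- chat = ChatState(processor)
- chat.new_turn("system")
- chat.add_text("Respond with audio.")
- chat.end_turn()
-
- # Replay an earlier exchange as history
- chat.new_turn("user")
- chat.add_text("What's the weather like?")
- chat.end_turn()
- chat.new_turn("assistant")
- chat.add_text("It's sunny and warm today.")
- chat.end_turn()
-
- # Follow-up question; generation continues from the full history
- chat.new_turn("user")
- chat.add_text("Should I bring a jacket?")
- chat.end_turn()
- chat.new_turn("assistant")
- ```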
-
- ## License
-
- This implementation follows the license terms of the original LFM2.5-Audio model.
- See [LiquidAI/LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) for details.
-
  ---
  language:
  - en
  tags:
  - liquid
+ - lfm2
+ - lfm2-audio
+ - speech-to-speech
+ - liquid-audio
+ - mlx
+ - speech
  - audio
+ - speech enhancement
+ - audio separation
+ - sts
+ - mlx-audio
+ license: other
+ license_name: lfm1.0
+ license_link: LICENSE
+ library_name: mlx-audio
+ pipeline_tag: audio-to-audio
+ base_model:
+ - LiquidAI/LFM2-1.2B
  ---
+ # mlx-community/LFM2.5-Audio-1.5B-bf16
+
+ This model was converted to MLX format from [`LiquidAI/LFM2.5-Audio-1.5B`](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) using mlx-audio version **0.2.10**.
+
+ Refer to the [original model card](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) for more details on the model.
+
+ ## Use with mlx-audio
+
+ ```bash
+ pip install -U mlx-audio
+ ```
+
+ ### CLI Example:
+ ```bash
+ python -m mlx_audio.sts.generate --model mlx-community/LFM2.5-Audio-1.5B-bf16 --audio "audio.wav"
+ ```
+
+ ### Python Example:
+ ```python
+ from mlx_audio.sts.utils import load_model
+
+ model = load_model("mlx-community/LFM2.5-Audio-1.5B-bf16")
+ # Usage depends on the specific STS model type
+ # See model documentation for details
+ ```
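+
+ For finer-grained control, the class-based API documented in the previous revision of this card may still apply (a minimal TTS sketch; verify the interface against your installed mlx-audio version):
+
+ ```python
+ import mlx.core as mx
+ from mlx_audio.sts.models.lfm_audio import (
+     LFM2AudioModel,
+     LFM2AudioProcessor,
+     ChatState,
+     LFMModality,
+ )
+
+ model = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-bf16")
+ processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-bf16")
+
+ chat = ChatState(processor)
+ chat.new_turn("user")
+ chat.add_text("Say: Hello from MLX Audio!")
+ chat.end_turn()
+ chat.new_turn("assistant")
+
+ audio_frames = []
+ for token, modality in model.generate_interleaved(**dict(chat), max_new_tokens=2048):
+     mx.eval(token)
+     if modality == LFMModality.AUDIO_OUT:
+         audio_frames.append(token)  # each audio token is an (8,) frame
+
+ if audio_frames:
+     codes = mx.stack(audio_frames[:-1], axis=1)[None, :]  # (1, 8, T)
+     waveform = processor.decode_with_detokenizer(codes)
+ ```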
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9a7c8eb3d154134af4386b242c01b5027d9f464328211433fe03b942cb3d5323
- size 2940723623
+ oid sha256:261c5cd8db83e10bd63ad2c0b9bdfad488ac77bfa1c72b9482e4c06c59b3ab43
+ size 2940723367
model.safetensors.index.json CHANGED
@@ -10,6 +10,9 @@
  "audio_adapter.layers.1.weight": "model.safetensors",
  "audio_adapter.layers.3.bias": "model.safetensors",
  "audio_adapter.layers.3.weight": "model.safetensors",
+ "audio_embedding.embedding.weight": "model.safetensors",
+ "audio_embedding.embedding_norm.weight": "model.safetensors",
+ "audio_embedding.to_logits.weight": "model.safetensors",
  "audio_encoder.layers.0.attn.k_proj.bias": "model.safetensors",
  "audio_encoder.layers.0.attn.k_proj.weight": "model.safetensors",
  "audio_encoder.layers.0.attn.out_proj.bias": "model.safetensors",
@@ -673,113 +676,110 @@
  "audio_encoder.layers.9.ff2_norm.weight": "model.safetensors",
  "audio_encoder.layers.9.final_norm.bias": "model.safetensors",
  "audio_encoder.layers.9.final_norm.weight": "model.safetensors",
- "audio_encoder.subsampling.conv.0.bias": "model.safetensors",
- "audio_encoder.subsampling.conv.0.weight": "model.safetensors",
- "audio_encoder.subsampling.conv.2.bias": "model.safetensors",
- "audio_encoder.subsampling.conv.2.weight": "model.safetensors",
- "audio_encoder.subsampling.conv.3.bias": "model.safetensors",
- "audio_encoder.subsampling.conv.3.weight": "model.safetensors",
- "audio_encoder.subsampling.conv.5.bias": "model.safetensors",
- "audio_encoder.subsampling.conv.5.weight": "model.safetensors",
- "audio_encoder.subsampling.conv.6.bias": "model.safetensors",
- "audio_encoder.subsampling.conv.6.weight": "model.safetensors",
- "audio_encoder.subsampling.out_proj.bias": "model.safetensors",
- "audio_encoder.subsampling.out_proj.weight": "model.safetensors",
+ "audio_encoder.pre_encode.conv.0.bias": "model.safetensors",
+ "audio_encoder.pre_encode.conv.0.weight": "model.safetensors",
+ "audio_encoder.pre_encode.conv.2.bias": "model.safetensors",
+ "audio_encoder.pre_encode.conv.2.weight": "model.safetensors",
+ "audio_encoder.pre_encode.conv.3.bias": "model.safetensors",
+ "audio_encoder.pre_encode.conv.3.weight": "model.safetensors",
+ "audio_encoder.pre_encode.conv.5.bias": "model.safetensors",
+ "audio_encoder.pre_encode.conv.5.weight": "model.safetensors",
+ "audio_encoder.pre_encode.conv.6.bias": "model.safetensors",
+ "audio_encoder.pre_encode.conv.6.weight": "model.safetensors",
+ "audio_encoder.pre_encode.out.bias": "model.safetensors",
+ "audio_encoder.pre_encode.out.weight": "model.safetensors",
- "audio_head.depthformer.blocks.0.attn.k_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.0.attn.k_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.0.attn.k_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.0.attn.o_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.0.attn.q_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.0.attn.q_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.0.attn.q_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.0.attn.v_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.0.attn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.0.attn_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.0.ffn.w1.weight": "model.safetensors",
  "audio_head.depthformer.blocks.0.ffn.w2.weight": "model.safetensors",
  "audio_head.depthformer.blocks.0.ffn.w3.weight": "model.safetensors",
- "audio_head.depthformer.blocks.0.ffn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.0.ffn_norm.weight": "model.safetensors",
- "audio_head.depthformer.blocks.1.attn.k_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.1.attn.k_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.1.attn.k_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.1.attn.o_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.1.attn.q_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.1.attn.q_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.1.attn.q_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.1.attn.v_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.1.attn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.1.attn_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.1.ffn.w1.weight": "model.safetensors",
  "audio_head.depthformer.blocks.1.ffn.w2.weight": "model.safetensors",
  "audio_head.depthformer.blocks.1.ffn.w3.weight": "model.safetensors",
- "audio_head.depthformer.blocks.1.ffn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.1.ffn_norm.weight": "model.safetensors",
- "audio_head.depthformer.blocks.2.attn.k_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.2.attn.k_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.2.attn.k_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.2.attn.o_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.2.attn.q_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.2.attn.q_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.2.attn.q_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.2.attn.v_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.2.attn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.2.attn_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.2.ffn.w1.weight": "model.safetensors",
  "audio_head.depthformer.blocks.2.ffn.w2.weight": "model.safetensors",
  "audio_head.depthformer.blocks.2.ffn.w3.weight": "model.safetensors",
- "audio_head.depthformer.blocks.2.ffn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.2.ffn_norm.weight": "model.safetensors",
- "audio_head.depthformer.blocks.3.attn.k_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.3.attn.k_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.3.attn.k_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.3.attn.o_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.3.attn.q_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.3.attn.q_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.3.attn.q_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.3.attn.v_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.3.attn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.3.attn_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.3.ffn.w1.weight": "model.safetensors",
  "audio_head.depthformer.blocks.3.ffn.w2.weight": "model.safetensors",
  "audio_head.depthformer.blocks.3.ffn.w3.weight": "model.safetensors",
- "audio_head.depthformer.blocks.3.ffn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.3.ffn_norm.weight": "model.safetensors",
- "audio_head.depthformer.blocks.4.attn.k_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.4.attn.k_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.4.attn.k_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.4.attn.o_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.4.attn.q_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.4.attn.q_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.4.attn.q_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.4.attn.v_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.4.attn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.4.attn_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.4.ffn.w1.weight": "model.safetensors",
  "audio_head.depthformer.blocks.4.ffn.w2.weight": "model.safetensors",
  "audio_head.depthformer.blocks.4.ffn.w3.weight": "model.safetensors",
- "audio_head.depthformer.blocks.4.ffn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.4.ffn_norm.weight": "model.safetensors",
- "audio_head.depthformer.blocks.5.attn.k_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.5.attn.k_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.5.attn.k_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.5.attn.o_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.5.attn.q_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.5.attn.q_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.5.attn.q_proj.weight": "model.safetensors",
  "audio_head.depthformer.blocks.5.attn.v_proj.weight": "model.safetensors",
- "audio_head.depthformer.blocks.5.attn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.5.attn_norm.weight": "model.safetensors",
  "audio_head.depthformer.blocks.5.ffn.w1.weight": "model.safetensors",
  "audio_head.depthformer.blocks.5.ffn.w2.weight": "model.safetensors",
  "audio_head.depthformer.blocks.5.ffn.w3.weight": "model.safetensors",
- "audio_head.depthformer.blocks.5.ffn_norm.scale": "model.safetensors",
+ "audio_head.depthformer.blocks.5.ffn_norm.weight": "model.safetensors",
- "audio_head.in_proj.bias": "model.safetensors",
- "audio_head.in_proj.weight": "model.safetensors",
- "audio_in_emb.embedding.weight": "model.safetensors",
- "audio_in_emb.embedding_norm.scale": "model.safetensors",
- "audio_in_emb.to_logits.weight": "model.safetensors",
- "depth_embeddings.embeddings.0.embedding.weight": "model.safetensors",
- "depth_embeddings.embeddings.0.embedding_norm.scale": "model.safetensors",
- "depth_embeddings.embeddings.0.to_logits.weight": "model.safetensors",
- "depth_embeddings.embeddings.1.embedding.weight": "model.safetensors",
- "depth_embeddings.embeddings.1.embedding_norm.scale": "model.safetensors",
- "depth_embeddings.embeddings.1.to_logits.weight": "model.safetensors",
- "depth_embeddings.embeddings.2.embedding.weight": "model.safetensors",
- "depth_embeddings.embeddings.2.embedding_norm.scale": "model.safetensors",
- "depth_embeddings.embeddings.2.to_logits.weight": "model.safetensors",
- "depth_embeddings.embeddings.3.embedding.weight": "model.safetensors",
- "depth_embeddings.embeddings.3.embedding_norm.scale": "model.safetensors",
- "depth_embeddings.embeddings.3.to_logits.weight": "model.safetensors",
- "depth_embeddings.embeddings.4.embedding.weight": "model.safetensors",
- "depth_embeddings.embeddings.4.embedding_norm.scale": "model.safetensors",
- "depth_embeddings.embeddings.4.to_logits.weight": "model.safetensors",
- "depth_embeddings.embeddings.5.embedding.weight": "model.safetensors",
- "depth_embeddings.embeddings.5.embedding_norm.scale": "model.safetensors",
- "depth_embeddings.embeddings.5.to_logits.weight": "model.safetensors",
- "depth_embeddings.embeddings.6.embedding.weight": "model.safetensors",
- "depth_embeddings.embeddings.6.embedding_norm.scale": "model.safetensors",
- "depth_embeddings.embeddings.6.to_logits.weight": "model.safetensors",
- "depth_embeddings.embeddings.7.embedding.weight": "model.safetensors",
- "depth_embeddings.embeddings.7.embedding_norm.scale": "model.safetensors",
- "depth_embeddings.embeddings.7.to_logits.weight": "model.safetensors",
+ "depth_embeddings.0.embedding.weight": "model.safetensors",
+ "depth_embeddings.0.embedding_norm.weight": "model.safetensors",
+ "depth_embeddings.0.to_logits.weight": "model.safetensors",
+ "depth_embeddings.1.embedding.weight": "model.safetensors",
+ "depth_embeddings.1.embedding_norm.weight": "model.safetensors",
+ "depth_embeddings.1.to_logits.weight": "model.safetensors",
+ "depth_embeddings.2.embedding.weight": "model.safetensors",
+ "depth_embeddings.2.embedding_norm.weight": "model.safetensors",
+ "depth_embeddings.2.to_logits.weight": "model.safetensors",
+ "depth_embeddings.3.embedding.weight": "model.safetensors",
+ "depth_embeddings.3.embedding_norm.weight": "model.safetensors",
+ "depth_embeddings.3.to_logits.weight": "model.safetensors",
+ "depth_embeddings.4.embedding.weight": "model.safetensors",
+ "depth_embeddings.4.embedding_norm.weight": "model.safetensors",
+ "depth_embeddings.4.to_logits.weight": "model.safetensors",
+ "depth_embeddings.5.embedding.weight": "model.safetensors",
+ "depth_embeddings.5.embedding_norm.weight": "model.safetensors",
+ "depth_embeddings.5.to_logits.weight": "model.safetensors",
+ "depth_embeddings.6.embedding.weight": "model.safetensors",
+ "depth_embeddings.6.embedding_norm.weight": "model.safetensors",
+ "depth_embeddings.6.to_logits.weight": "model.safetensors",
+ "depth_embeddings.7.embedding.weight": "model.safetensors",
+ "depth_embeddings.7.embedding_norm.weight": "model.safetensors",
+ "depth_embeddings.7.to_logits.weight": "model.safetensors",
+ "depth_linear.bias": "model.safetensors",
+ "depth_linear.weight": "model.safetensors",
  "lfm.embed_tokens.weight": "model.safetensors",
  "lfm.embedding_norm.weight": "model.safetensors",
  "lfm.layers.0.conv.conv.weight": "model.safetensors",